Time Series Modeling Approach

This document is the second part of the “Time Series Data Modeling” chapter. YMatrix believes that the design of data models directly impacts the value of data consumption and usage. Therefore, in addition to technical introductions, we aim to provide you with a clear understanding of the concepts, applications, and development of time series data models (Time-series Data Model) throughout this chapter.

The first article, “What is a Time-series Data Model?”, answers a series of progressively deeper questions to provide a clear understanding of the concept of time-series data modeling itself.
The second article, “Time-series Modeling Approach” (this document), attempts to provide theoretical guidance for YMatrix's relational model design approach.
The third and fourth articles are examples of data modeling in the context of connected vehicles and smart homes. Guided by “Time Series Modeling Ideas,” they provide best practices for modeling different time series scenarios in YMatrix.

1 First: Understand the time series data model

Before building your own time series data model, please read the first document in this chapter: What is a time series data model? to understand YMatrix's view of time series data models.

2 Second: Understand the requirements of time series scenarios

YMatrix believes that understanding the characteristics of existing time series queries is the key to understanding the core requirements of existing time series scenarios. A practical time series scenario usually has comprehensive query requirements, which may be a combination of several different query requirements in the table below. The application scenarios in the table below are mainly based on the example of the Internet of Vehicles.

Query Features	Description	Application Scenarios
Single Device
Single Device Point Query	Query the value of a single indicator or multiple indicators for a single device at a specific point in time. The most common query is the latest value or non-empty latest value of certain indicators for a specific device.	Real-time speed of a vehicle in motion, etc.
Single Device Detailed Query	Queries detailed data for a specific device over a specific time interval, which may involve a single metric or multiple metrics	Typically used for further detailed analysis, such as fault cause analysis in after-sales service
Single Device Aggregate Values	Queries aggregate values for specific metrics of a specific device over a specific time interval, such as maximum value, minimum value, first value, last value, average value, etc.	Average speed of a vehicle over a specific time period, etc.
Multiple devices
Multiple device point query	Query the metric values of multiple devices at a specific time point, which may involve a single metric or multiple metrics	Combined signal query, status monitoring (e.g., triggering an alarm mechanism when vehicle fuel/battery level is low); query metric a and metric b from 10 seconds prior to determine if they match, to assess whether the vehicle is in a specific state (e.g., vehicle aging)
Multi-device detailed query	Query detailed data for multiple devices over a specific time interval, which may involve a single metric or multiple metrics	Data phenomenon discovery. Such as root cause analysis, new product tracking, etc. Generally targeted at R&D personnel
Multi-device aggregated values	Query aggregated values of certain metrics for multiple devices within a specific time interval, such as maximum value, minimum value, first value, last value, average value, etc., and group by device	Check network status across regions, calculate average network speed to confirm if multiple vehicles arriving at a location all experience poor network connectivity
Other queries
Multi-dimensional query	Query the latest values, details, or aggregated values of devices that meet certain conditions based on tags or attribute information. If multiple devices are involved, the results are grouped by device	Monitor the current status of vehicles
Multi-table Joins	Multi-table join queries can clearly reveal the specific meaning of time-series data. The results of multi-table joins can serve as a set of qualifiers to define the meaning of a specific time-series data point. In YMatrix, this functionality is implemented using the `JOIN` statement	For example, analyzing the association between different vehicle business tables
Advanced analysis	Subqueries, window functions, Cubes, CTEs, cumulative sums, incremental calculations, growth rate calculations, change points, signal jump difference queries, quantile functions, row-column transformations, cumulative distribution functions, `CASE... WHEN...` expressions, sliding windows, linear interpolation, peak detection, recursive queries, etc.	Specific pattern recognition, trend prediction, root cause analysis, threshold correction, anomaly detection, post-sales fault analysis, monitoring dashboard display, ad-hoc analysis, battery alerts, peak value alerts, anomaly data cleaning, null value filling, TOP N peaks of specified metrics for specified devices within a specified time range, etc.
Federated queries	Access MySQL and other databases, access Hive	Some data is stored in MySQL and other databases, supporting associated queries with the database; offline data is stored in Hive and other systems, supporting associated queries with the data within them
Machine Learning	Linear Regression Prediction	Based on data from January 1st to 10th, calculate a linear regression model to predict the metric for the 11th
Spatio-Temporal Specific Queries	First function, last function, window queries, etc., which may also involve queries combined with spatial data	Start oil level, end oil level, etc.

3 Finally: Designing a Time Series Data Model

Now that you have deduced the time series scenario query requirements and data sources (data collection sources) based on the data visualization results you want to obtain, you can enter the formal model design stage. Congratulations!
Based on the above information, you should be able to estimate the number of devices, data scale, number of metrics, and metric data types for this scenario. Indeed, these factors will determine the path of your model design.

3.1 Core Judgment Factors

Core Judgment Factors	Description
Number of Devices	The term “device” may have different meanings in different scenarios. For example, in the context of vehicle networking, “one device” may refer to a vehicle or a sensor.
Metric Type	The data type of the metric. For example, int, float, text, etc.
Metric Count	The number or scale of metrics

YMatrix identifies three core evaluation factors for time-series data modeling: device count, metric type, and metric count.

Device count, metric type, and metric count each have distinct impacts on the initial model design phase:

First, all three factors—device count, metric type, and metric count—relate to the model's data point capacity. Additionally, data collection frequency and retention duration also influence the assessment of the model's capacity. Considering that the number of devices, metric types, and metric counts will continue to grow as the business expands, and that different businesses have varying requirements for data retention duration and collection frequency, it is essential to have a thorough prediction of the model's dynamic capacity for time-series data over a specific period during the initial design phase.
Secondly, the degree of certainty in determining metric types and metric counts during the initial design phase directly impacts our selection of wide table/wide table variant/narrow table/narrow table variant models. When both the certainty of metric data types and the number of metrics are high, the preferred choice is the wide table and wide table variant models, as these are most conducive to maintaining superior query performance. When the number of metrics is unpredictable but the certainty of metric types is high, the narrow table variant model can be used. When both certainty levels are low, the narrow table model can be used to maximize the simplicity of writes.

3.2 Selecting a Time Series Model Type

In the first article of the “Data Modeling” chapter, What is a Time Series Data Model?, we have already provided a detailed introduction to the relationship between non-relational models and relational models, as well as a brief overview of the differences between wide tables, wide table variants, narrow tables, and narrow table variants. In this document, we will focus on how to select specific model types.

3.2.1 Data Metrics and Their Types in Specific Scenarios — Wide Tables

Using a wide table model means you will end up with a time-series table containing many metric name columns and metric value columns. Depending on the number of metrics you have, a wide table can become very wide, as its name suggests. With proper upfront design, wide table queries can be very fast. Seeing 200 or more columns in a single table is common. Here is a simple example of a wide table:

Timestamp	Device ID	Temperature	Wind speed	Humidity
2021-11-16 13:57:13	1	29.2	3.2	55
2021-11-16 13:57:13	1	21.5	8.5	35
2021-11-16 13:57:13	1	22.3	7.2	40
2021-11-16 13:57:13	2	24.1	6.8	21
2021-11-16 13:57:13	2	19.8	2.9	38
2021-11-16 13:57:13	2	20.7	4.5	56
......

Wide table is the advantage model of YMatrix. One row carries more data information, while the timestamp and device ID columns are only one (because there is only one indicator table), which reduces redundancy. For creating indexes, the index overhead of narrow tables is obvious, while because of the small number of rows of wide tables, the cost of indexes is relatively low, which is conducive to query performance. For the above example, if you want to know the temperature data for each device, you can split the tables by creating a Lookup table that maps the relationship between the device and the indicator to achieve single-device/partial multi-device query.
So, if there is only one wide table, has the utilization of storage reached its peak? We believe that this should be judged based on the case where the null value is written in the specific business scenario. The value column is empty or not, mainly depends on the way and frequency of actual indicator data collection. In fact, in scenarios where there is a large amount of null data written (we also call it ultra-wide sparse table), null data also occupies a considerable amount of storage space.
It is not easy to add a new "indicator list + index value column" to the designed wide table model or add a new index value type. Therefore, the early design cost of wide tables is relatively narrow, and it needs to be considered thoroughly. But once a good design is available, it will largely gain good query performance in the later stage.
It is mainly suitable for scenarios where data indicators and types of them are relatively determined in the early design stage, such as certain Internet of Vehicles-related scenarios. #### 3.2.2 The frequency grading of data query is obvious - wide table variation In some scenarios, the query frequency for different indicator data varies greatly. In this case, YMatrix implements a structured + semi-structured wide table variation workflow to write data queried by high frequency and data queried by low frequency to table in different ways, protecting the query performance of high frequency data while balancing the writing convenience of low frequency data. The so-called "structured" means the traditional storage method of wide tables in relational databases, while "semi-structured" means using semi-structured storage method (JSON/MXKV) to store extended metric names and indicator value data in the last column of the existing wide table (that is, it will exceed the maximum number of columns limit), so as to achieve flexible scalability of ultra-wide tables. A simple example is as follows:

Time Stamp	Temperature	Wind	Humidity	Tags
2021-11-16 13:57:13	29.2	3.2	55	Meteorological sensor, model A, Beijing, Huairou District
2021-11-16 13:57:13	21.5	8.5	35	Meteorological sensor, model A, Shanghai, Chongming District
2021-11-16 13:57:13	22.3	7.2	40	Meteorological sensor, model A, Beijing, Haidian District
2021-11-16 13:57:13	24.1	6.8	21	Meteorological sensor, model B, Beijing, Huairou District
2021-11-16 13:57:13	19.8	2.9	38	Meteorological sensor, model B, Shanghai, Chongming District
2021-11-16 13:57:13	20.7	4.5	56	Meteorological sensor, model B, Beijing, Haidian District
......

3.2.3 Scenarios with uncertain data indicators and their types - Narrow table

Narrow table layout is exactly the opposite of wide tables. A common feature of narrow tables is that there are few columns of indicator value, usually only one. Here is a simple narrow table model example:

Timestamp	Device ID	Indicator Name	Indicator Value
2021-11-16 13:57:13	1	Temperature	29.2
2021-11-16 13:57:13	1	Wind	3.2
2021-11-16 13:57:13	2	Temperature	21.5
2021-11-16 13:57:13	2	Wind	8.5
......

As you can see, in the narrow table model, except for the fixed "Timestamp + Tag (Device ID)" combination, there is only one metric name column and one metric value column. The metric name column stores all metric names and the metric value column stores all metric values of the same data type. In other words, if business upgrades require adding indicators or equipment, you only need to add one line of data. If the types of indicator values to be stored increase, you need to add columns of the corresponding indicator value types or create corresponding tables to store other types. Therefore, it is necessary to modify the table structure or create additional tables. Of course, if you often use JSONB as the data type of the value column, you can also choose to store valid JSON values in this column, but Please note that this will add additional JSON parsing at the same time. The narrow table model has an advantage in scenarios where the number, category or indicators of equipment are very uncertain in the early design stage. Data storage, writing, and model expansion are all very simple, but this needs to be based on the premise that the index value type is relatively single. Because every additional index value type means adding a new table with redundant columns except the index value.

3.2.4 Scenarios in which the indicators are uncertain but their type are determined - Narrow table variations

In some practical scenarios, the data type of indicator value will continue to increase with business demand. For this case, the traditional narrow table model can only be expanded by increasing the number of tables. Obviously, this method is very limited. Therefore, we can design a variety of a narrow table: its size is between a narrow table and a wide table, and the basic design principle is to create a column for each necessary data type. That is, if we have two different data types: integers and floats (assuming float8). Then we end up with two different columns of value, as follows:

Timestamp	Device ID	Indicator Name	Indicator Value_int	Indicator Value_float
2021-11-16 13:57:13	1	Temperature		29.2
2021-11-16 13:57:13	1	Wind		3.2
2021-11-16 13:57:13	1	Humidity	55
2021-11-16 13:57:13	2	Temperature		21.5
2021-11-16 13:57:13	2	Wind		8.5
2021-11-16 13:57:13	2	Humidity	35
......

Using this design, it is often necessary to create another lookup table.

Indicator Name	Data Type Column Name
Temperature	Index value_float
Wind	Index value_float
Humidity	Index value_int

When writing or querying specific indicator values, you can connect two tables through the JOIN statement to locate the data location. If you need more metric value types, add more columns. Compared to know all possible indicators in advance, it is relatively easier to understand all possible indicator value types in advance, and you also need to know the mapping relationship between the indicator value types and indicators.

Notice!
For examples of model construction in specific scenarios, please refer to Data Modeling Example in the Internet of Vehicles Scenario and Data Modeling Example in the Smart Home Scenario.

简体中文