Time-series Data Model

This document is the first article in the “Time-series Data Modeling” chapter. YMatrix believes that the design of a data model directly determines how much value can be extracted from the data. Therefore, beyond technical introductions, this chapter aims to give you a clear understanding of the concepts, applications, and development of time-series data models.

  • The first article, “What is a Time-series Data Model?” (this document), answers a series of progressively deeper questions to build a clear understanding of the concept of time-series data models.
  • The second article, “Time-series Modeling Approach,” offers theoretical guidance on YMatrix's relational model design principles.
  • The third and fourth articles are data modeling examples for the connected vehicle and smart home scenarios. Guided by the “Time-series Modeling Approach,” they provide best practices for modeling different time-series scenarios in YMatrix.

1 How do time-series scenarios arise?

The information age is changing at an astonishing rate, and this is inseparable from how humans capture, analyze, and use data. Map navigation software, street surveillance cameras, and heating companies all serve our daily lives through precise data statistics. However, as technology advances and the pace of life accelerates, our expectations keep growing. We no longer just want to know where to go and how to get there; we want to know in real time which roads are clear so we can avoid congestion. Governments monitor public areas to protect public safety, while we also want personal health-monitoring devices that track our physical condition in real time, and we expect them to be lightweight, portable, and preferably stylish. Heating companies are no longer satisfied with monthly changes in average daily temperature; they want to join the smart-city trend and use hourly changes in temperature, wind speed, and precipitation to refine the models of the properties they serve and optimize energy efficiency...
Clearly, data statistics over long time periods alone can no longer satisfy humanity's increasingly “greedy” demands. Detailed, feature-rich data has become one of the most valuable commodities of our time. In analyzing these demands, we realized that at the core of all these scenarios lies one type of data: it is strongly related to time and comes from a wide variety of devices; it accumulates over time and is rich in usable value; and it is very large, easily reaching TB or even PB scale, placing extremely high demands on the storage performance of the underlying database.
Based on these essential characteristics, early scholars called it time-series data.

2 What is time-series data?

Without a doubt, time-series data is dynamic and ever-changing. It is like a movie playing out in real time inside a business system, with no end in sight. It carries rich and powerful utility: it not only helps businesses reduce costs, improve efficiency, and enhance quality, but also points ambitious explorers toward promising directions. In today's world, those who possess it and use it well can be said to have seized the initiative of the times.
Specifically, YMatrix believes that time-series data primarily consists of the following components:

  • Tags: Static attributes that remain constant and are unaffected by the passage of time. Examples include a refrigerator's brand, device serial number, place of origin, place of purchase, and manufacturing date.
  • Metrics: Dynamic attributes that change over time, such as a refrigerator's temperature, humidity, and power consumption. Metrics are sometimes also referred to as measurement points, i.e., points that can be measured.
  • Timestamps: The specific points in time at which data is recorded, such as 2023-02-10 20:00:00.
  • Points: The value of a metric at a specific point in time, such as a Haier refrigerator's temperature reading of 6.2 at 20:00.

In summary, YMatrix defines time-series data as a series of data points that are strongly correlated with time. In applications, it usually appears as a sequence of data points collected at different points in time.
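
To make these components concrete, here is a minimal sketch of how they might map onto a relational table row. The table and column names (fridge_metrics, device_sn, and so on) and the inserted values are hypothetical illustrations, not part of YMatrix's documented schema.

```sql
-- Hypothetical example: one refrigerator reading per row.
-- "brand" and "device_sn" are tags, "ts" is the timestamp,
-- and "temperature" / "humidity" / "power_kwh" are metrics;
-- each metric value in a row is an individual data point.
CREATE TABLE fridge_metrics (
    ts          timestamp NOT NULL,  -- timestamp
    brand       text,                -- tag: refrigerator brand
    device_sn   text,                -- tag: device serial number
    temperature float8,              -- metric
    humidity    float8,              -- metric
    power_kwh   float8               -- metric
);

-- The point "the temperature of a Haier refrigerator at 20:00 is 6.2"
-- becomes one value in one row (the other values are made up):
INSERT INTO fridge_metrics
VALUES ('2023-02-10 20:00:00', 'Haier', 'SN-001', 6.2, 45.0, 0.03);
```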
Time series data can track changes over different time intervals, such as milliseconds, days, or even years, providing powerful insights. We believe that regardless of the scenario or use case, all time series data sets have several things in common:

  • Collected data is always recorded as a new row.

  • Data typically arrives in the database in chronological order.

  • Time is the primary axis (time intervals can be regular or irregular).

  • Timeliness. The newer the time-series data, the greater its value, with value density gradually decreasing over time.

  • Downsampling. Downsampling uses a GROUP BY statement to group raw data into broader time intervals and calculate the key characteristics of each group (see the SQL sketch after this list). Downsampling not only reduces storage overhead but also preserves key data characteristics, making it easier to analyze historical trends and predict future ones.

  • Requires integration with relational data to be valuable. Without structured relational data providing context, a time-series value is just a number. We need enough structured information to describe it. For example, for a data point with a value of 36.5, we need to know whether it represents temperature, humidity, or pressure; what its unit is (degrees Celsius or pascals); what its source is (a boiler, a conveyor belt, or a bearing); and whether the equipment sits in Zone 1 or Zone 2 of the factory. Data with richer associated information generally has higher value.
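
As an illustration of the downsampling point above, here is a minimal SQL sketch. It assumes the hypothetical fridge_metrics table from the previous section; date_trunc() is standard PostgreSQL, on which YMatrix is based.

```sql
-- Downsample raw readings into hourly buckets, keeping the key
-- characteristics (average, maximum, minimum) of each group.
SELECT
    device_sn,
    date_trunc('hour', ts) AS bucket,
    avg(temperature)       AS avg_temp,
    max(temperature)       AS max_temp,
    min(temperature)       AS min_temp
FROM fridge_metrics
GROUP BY device_sn, date_trunc('hour', ts)
ORDER BY device_sn, bucket;
```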

So how does the time-series scenario differ from traditional OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) scenarios? The table below illustrates:

| Business Scenario | Data Manipulation Language (DML) | Write Method | Query Requirements | Concurrency |
|--|--|--|--|--|
| Time series | INSERT (append-only) | High-frequency streaming writes | Time-based point queries, detail queries, and aggregations; correlation and complex analysis | High |
| OLTP | INSERT / UPDATE / DELETE | High-frequency writes | Point queries | High |
| OLAP | INSERT, few UPDATEs / DELETEs | Low-frequency batch writes (ETL) | Correlation and aggregation | Low |

3 What is a time-series data model?

In terms of time-series database development, the data model is the pattern for organizing data and the interface exposed to users. Users must understand the data model on which the database is built in order to know how to use it. In other words, the way a database chooses to model and store data determines what you can do with it.
From the user's perspective, the data model is closer to a set of best practices for table metadata: a modeling step that must be designed in advance according to business needs.
Time-series databases have become the fastest-growing category of database in the world.
When databases dedicated to time-series scenarios and time-series data services first appeared, most were designed and built on non-relational data models in pursuit of rapid scaling of data volume. Today, however, as users' demands on time-series data deepen, the query performance of non-relational databases is no longer sufficient for business-level use in many scenarios. The lack of a unified, standard interface has driven up learning costs for developers, who must keep modifying programs at the business layer to meet demand. Meanwhile, the rapid development of distributed databases has greatly improved the write performance of relational databases, previously their relative weakness. As a result, the industry has gradually returned to relational data models, and YMatrix is part of this trend.

We agree that, in the long term, storing all data in a single system will significantly reduce application development time and costs, and accelerate the speed at which you make critical decisions.

3.1 Non-relational model or relational model?

There are two completely different views on the development and design of time-series data models:

  • Schemaless non-relational data models that do not require table design, represented by non-relational time-series databases such as InfluxDB and OpenTSDB.
  • Relational data models that require prior table schema design, represented by relational time-series databases such as TimescaleDB and YMatrix.

In fact, both of these models have played a very important role in the development of time-series databases. Initially, because time-series data accumulates extremely quickly, early developers believed that traditional relational models would struggle with datasets of this scale, while non-relational data models, with their simple data ingestion, scaled better. Time-series databases therefore first developed along the non-relational path.
Non-relational database products like InfluxDB decided to take on the challenge of building a database from scratch, and achieved initial success thanks to their advantages in terms of runtime speed and scalability. However, as business requirements continued to evolve, data consumers gradually realized that while the low barrier to entry of non-relational databases (typically requiring minimal upfront data modeling and enabling quick deployment) was initially appealing, these systems became increasingly difficult to manage as they grew larger and more complex. Without a universal query language, both database developers and operations personnel must address this “technical debt,” and may even need to learn more advanced programming languages to meet more complex query and operations requirements.
Rising costs and increasingly narrow interfaces eventually made non-relational time-series databases hard for users to sustain, prompting them to consider returning to relational databases (the database itself shoulders a heavier “burden,” while people are gradually relieved of the operational one), and this shift has become a wave.

The "regression" we are talking about here is not a fallback of a database technology, but a design of new variants that conform to the characteristics of timing scenarios based on traditional relationship libraries. For example, a wide table variation combining "structured" and "semi-structured" are shown below. ### 3.2 Narrow table or wide table? Since timing applications are designed to store a large amount of time-related information, data modeling of underlying data storage is essential. The related data model describes information in the externalized form of a related data table. Since YMatrix is ​​​​​built on a relational data model, you can design the DDL of partitioned tables (or Schema of a generally called table) in a variety of ways. Generally speaking, there are two main storage modes: Narrow table (Narrow) and Wide table (Wide).

  • Narrow table: A narrow table has few metric-value columns, and usually each row holds only one data point; a timestamp, a tag, and a metric name together locate a unique data point. Its biggest advantage is that it is easy to get started with and easy to extend. Its drawback is high redundancy, which leads to large space overhead and sacrifices some query performance.
  • Wide table: The wide table is the opposite of the narrow table. A wide table has a large number of metric columns, so a single row contains multiple data points, and these data points share the same “timestamp + device tags.” Some of the metric values may be empty, depending mainly on how and how often the data is collected. Although it requires more upfront design, its query performance is excellent and it can unlock the full potential of the data.
  • Narrow-table variant: In some real-world scenarios, the data types of metric values keep growing with business demand. A traditional narrow table can only cope by adding more tables, which is clearly limited. We can therefore design a narrow-table variant that sits between the narrow and wide models. Its basic principle is to create one value column for each required data type, so that a single row holds multiple data points of different data types. It is easy to get started with and well suited to scenarios where the expected metric types, and the mapping between metrics and types, are relatively stable.
  • Wide-table variant: In YMatrix, we also use an approach that combines the advantages of the wide and narrow models as far as possible. On top of a traditional wide table, the last column is defined as a “JSON / MXKV column,” and metrics whose types are not fixed, that are queried at low frequency, or that exceed the capacity of a traditional wide table (the 1600-column limit inherited from upstream PostgreSQL) are stored in this extension column, yielding a flexible wide-table variant within a relational database.
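
The following DDL sketches illustrate the four models just described. All table and column names are hypothetical, and YMatrix-specific details (distribution and partitioning clauses, the MXKV type) are omitted; plain PostgreSQL jsonb stands in here for the JSON / MXKV extension column.

```sql
-- Narrow table: one data point per row, located by
-- timestamp + tag + metric name.
CREATE TABLE sensor_narrow (
    ts          timestamp NOT NULL,
    device_sn   text      NOT NULL,  -- tag
    metric_name text      NOT NULL,  -- e.g. 'temperature'
    metric_val  float8               -- the data point value
);

-- Narrow-table variant: one value column per required data type,
-- so a row can carry data points of different types.
CREATE TABLE sensor_narrow_typed (
    ts          timestamp NOT NULL,
    device_sn   text      NOT NULL,
    metric_name text      NOT NULL,
    val_float   float8,
    val_int     bigint,
    val_text    text
);

-- Wide table: one column per metric, so one row holds many data
-- points sharing the same "timestamp + device tags".
CREATE TABLE sensor_wide (
    ts          timestamp NOT NULL,
    device_sn   text      NOT NULL,  -- tag
    temperature float8,
    humidity    float8,
    pressure    float8
    -- ... further metric columns as the business requires
);

-- Wide-table variant: a traditional wide table plus one extension
-- column for low-frequency, unplanned, or overflow metrics
-- (jsonb used here as a stand-in for the JSON / MXKV column).
CREATE TABLE sensor_wide_ext (
    ts          timestamp NOT NULL,
    device_sn   text      NOT NULL,
    temperature float8,
    humidity    float8,
    extra       jsonb                -- extension column
);
```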

| | Narrow Table | Wide Table | Narrow-Table Variant | Wide-Table Variant |
|--|--|--|--|--|
| Upfront design cost | Low | Higher | Lower | Higher |
| Difficulty of extension | Easy | More complex | Easier | Easier |
| Space overhead | Large | Small | Large | Small |
| Query performance | Low | High | Low | High |

Note! For a detailed introduction to wide tables, narrow tables, narrow-table variants, and wide-table variants, please refer to [Time-series Modeling Approach](/doc/5.0/datamodel/guidebook); for specific scenarios, please refer to [Data Modeling Example for the Connected Vehicle Scenario](/doc/5.0/datamodel/V2X_best_practice) and [Data Modeling Example for the Smart Home Scenario](/doc/5.0/datamodel/SmartHome_best_practice).