Relational model storage solution

As mentioned earlier, MatrixDB uses a relational model storage solution. The following will introduce in detail how to design a timing table under a relational model.

1. Time sequence data composition

The timing data is generated by the equipment cycle, so it contains at least the following information:

  1. Equipment identification
  2. Collect indicators
  3. Time stamp

Among them, timestamps are a relatively unified and standardized data type, such as:

2021-11-16 10:32:29

In comparison, device identifiers and metric data are more complex.

1.1 Device Identifiers

Device identifiers are used to identify and locate a specific device, including the following information:

  1. Device category: What type of device it is, such as a monitor, sensor, or measuring instrument.
  2. Device model: Within a specific category, there are many different models, such as Sensor 1.0, Sensor 2.0, etc.
  3. Device Number: Within the same category and model, there are many devices, and the device number is a unique identifier for each device, such as a phone's serial number.
  4. Device Tags: To facilitate statistical classification of devices, additional tags can be added to devices, such as city tags, region tags, purchase channel tags, etc.

1.2 Collection Metrics

Collection metrics are data collected by devices, and there are significant differences in metric data between different types of devices. For example, monitors collect images and videos, while sensors collect temperature, humidity, wind speed, and other data. Broadly speaking, time-series metric data comes not only from hardware devices but also from time-series data generated by software, such as server logs. Therefore, there are significant differences in the type, form, and generation cycle of metric data.

2. Table Model Design

For the relational model, a fixed schema is required, so modeling must be done first.

2.1 Which tables should be created?

Which tables should be created? Let's start with device identifiers. Devices are categorized into attributes such as category, model, serial number, city, and region. Should a separate table be created for each category, each model, each device, or should all devices be stored in a single table?

It is recommended to create one table for data of the same category. Since the number of devices is significant and will continue to grow, it is impractical to create a separate table for each device, as this would result in high operational costs and metadata management costs. It is also impractical to store all devices in a single table, as different categories of devices collect vastly different metrics, and a single table with too much data would be difficult to analyze and maintain.

Therefore, after categorization, the tables may include the following:

  1. Sensor table
  2. Measurement instrument table
  3. ......

2.2 Device ID Redundancy Removal

After determining which tables to use, the next step is to consider how to design each table. Since the device category has already been determined during table creation, the remaining time-series information is as follows:

Weather Sensor Data Collection Table

Device ID Data Collection Metrics Timestamp
Model 1, ID 0001, Beijing Municipality, Huairou District Temperature: 29.2°C, Wind Speed: 3.2 km/h, Humidity: 55% 2021-11-16 13:57:13
Model 2, ID 0010, Shanghai City, Chongming District Temperature: 21.5°C, Wind Speed: 8.5 km/h, Humidity: 35% 2021-11-16 13:57:13

Device identifiers may occupy a relatively large amount of storage space because they contain a variety of attribute information. In addition to the unique device number, other attributes such as model, city, and region contain a large amount of duplicate information, so it is unnecessary to store each piece of collected information in its entirety. Here, we can use the principles of relational database design to create an additional device identifier table and generate an ID. Then, the collection table and the device identifier table are linked by ID, as follows:

Device Identification Table

ID Model Number City Region
1 Model 1 0001 Beijing Huairou District
2 Model 2 0010 Shanghai Chongming District
......

Weather Sensor Data Collection Table

Device ID Data Points Collected Timestamp
1 Temperature: 29.2°C, Wind Speed: 3.2 km/h, Humidity: 55% 2021-11-16 13:57:13
2 Temperature: 21.5°C, Wind Speed: 8.5 km/h, Humidity: 35% 2021-11-16 13:57:13
......

After this division, the storage space for device identifier information in the collection table is significantly reduced, requiring only one integer. The data volume of the device identifier table is equal to the total number of devices, which is relatively fixed compared to the number of collection metrics.

2.3 Metric Data Storage Scheme

Now that the device identification model has been resolved, let's look at how to model the collected metrics. Collected metrics can be divided into narrow tables and wide tables according to their storage methods:

Narrow Tables

The narrow table scheme stores only one metric data per row, such as:

Device ID Metric Name Metric Value Timestamp
1 Temperature 29.2°C 2021-11-16 13:57:13
1 Wind Speed 3.2 km/h 2021-11-16 13:57:13
1 Humidity 55% 2021-11-16 13:57:13
2 Temperature 21.5°C 2021-11-16 13:57:13
2 Wind speed 8.5 km/h 2021-11-16 13:57:13
2 Humidity 35% 2021-11-16 13:57:13
......

The advantages of this storage scheme are:

  1. Flexibility: new metric values can be directly expanded
  2. Since there is no fixed format, there is no risk of a column being NULL due to missing values

The disadvantages are:

  1. Wastes storage space: Each metric occupies a separate row, and device IDs, timestamps, and metric names all occupy redundant storage space
  2. Unfriendly for analysis: Since different metric values are not completely independent, storing them separately makes it difficult to analyze them together

Wide Table

In contrast to narrow tables, wide tables store all metrics collected by the device in a single row of data, such as:

Device ID Temperature Wind Speed Humidity Timestamp
1 29.2°C 3.2 km/h 55% 2021-11-16 13:57:13
2 21.5℃ 8.5km/h 35% 2021-11-16 13:57:13
......

The advantages of wide tables are:

  1. Data collected from the same device at the same time is stored together, making it convenient for unified retrieval and statistical analysis
  2. Device ID and timestamp information are not redundant

The disadvantages of wide tables are:

  1. Inflexibility; adding new metrics requires modifying the table structure.
  2. Columns without metric values contain NULL values, resulting in additional storage overhead.

In summary, the fundamental difference between wide tables and narrow tables is whether each row stores a single collected metric or all collected metrics. In production environments, narrow table solutions are not recommended. How can the issues caused by wide tables be addressed? Let’s analyze the problems wide tables may encounter and their solutions.

Challenges of Wide Tables

  1. Fixed Schema: Adding new columns to a wide table requires an ALTER TABLE ADD COLUMN operation, which can be slow and block read/write operations when the table is very large, impacting business operations.
  2. Differences and Missing Metrics: Although devices of the same category are stored in the same table, the sets of metrics for different device models may vary. For example, early-purchased devices may not collect humidity data, resulting in NULL values for humidity metrics for such devices. Additionally, data loss may occur during data collection and transmission, such as when a metric is not collected, leading to NULL values. Another scenario is when metrics are collected at different frequencies, which can also result in NULL values. NULL data does not occupy zero storage space. Based on the internal storage structure of Heap tables, once a row contains a NULL value, it will incur additional storage space equivalent to the number of columns multiplied by 8 bytes.
  3. Column limit: YMatrix's HEAP tables inherit from PostgreSQL, where each table is limited to a maximum of 1,600 columns. In scenarios where the number of collected metrics reaches several thousand, this limitation cannot be directly accommodated.

To address the above challenges, YMatrix has developed the mxkv high-performance KV data type, which allows arbitrary expansion of the number of columns and, similar to the JSON type, offers significantly higher performance than JSON.

After using mxkv, the table can be designed as follows:

Device ID Metric Timestamp
1 {“temperature”:29.2,‘wind’:3.2,“humidity”:55} 2021-11-16 13:57:13
2 {“temperature”:21.5,‘wind’:8.5,“humidity”:35} 2021-11-16 13:57:13
......

However, mxkv is not a panacea. While it offers decent read/write performance in HEAP tables, it has the following drawbacks:

  1. It is not conducive to column compression. Cold data ultimately needs to be placed in the MARS2 table for compression, and MARS2 is a mixed row-column storage, while mxkv is still row-based internally.
  2. Currently, mxkv does not support merge operations, so it is not possible to merge new data into the table in a single operation during updates. Instead, the old data must be retrieved and manually merged together.

3. Summary

  1. YMatrix is a relational database that requires the use of a relational model to establish the data model, and it is recommended to use wide tables.
  2. Device information must be stored in a separate device table and associated with the metric table via ID to reduce redundancy.
  3. The mxkv type addresses issues such as fixed relational table schemas, column count limitations, and NULL values, but it is currently not conducive to compressed storage.