As mentioned earlier, MatrixDB uses a relational model storage solution. The following will introduce in detail how to design a timing table under a relational model.
The timing data is generated by the equipment cycle, so it contains at least the following information:
Among them, timestamps are a relatively unified and standardized data type, such as:
2021-11-16 10:32:29
In comparison, device identifiers and metric data are more complex.
Device identifiers are used to identify and locate a specific device, including the following information:
Collection metrics are data collected by devices, and there are significant differences in metric data between different types of devices. For example, monitors collect images and videos, while sensors collect temperature, humidity, wind speed, and other data. Broadly speaking, time-series metric data comes not only from hardware devices but also from time-series data generated by software, such as server logs. Therefore, there are significant differences in the type, form, and generation cycle of metric data.
For the relational model, a fixed schema is required, so modeling must be done first.
Which tables should be created? Let's start with device identifiers. Devices are categorized into attributes such as category, model, serial number, city, and region. Should a separate table be created for each category, each model, each device, or should all devices be stored in a single table?
It is recommended to create one table for data of the same category. Since the number of devices is significant and will continue to grow, it is impractical to create a separate table for each device, as this would result in high operational costs and metadata management costs. It is also impractical to store all devices in a single table, as different categories of devices collect vastly different metrics, and a single table with too much data would be difficult to analyze and maintain.
Therefore, after categorization, the tables may include the following:
After determining which tables to use, the next step is to consider how to design each table. Since the device category has already been determined during table creation, the remaining time-series information is as follows:
Weather Sensor Data Collection Table
Device ID | Data Collection Metrics | Timestamp |
---|---|---|
Model 1, ID 0001, Beijing Municipality, Huairou District | Temperature: 29.2°C, Wind Speed: 3.2 km/h, Humidity: 55% | 2021-11-16 13:57:13 |
Model 2, ID 0010, Shanghai City, Chongming District | Temperature: 21.5°C, Wind Speed: 8.5 km/h, Humidity: 35% | 2021-11-16 13:57:13 |
Device identifiers may occupy a relatively large amount of storage space because they contain a variety of attribute information. In addition to the unique device number, other attributes such as model, city, and region contain a large amount of duplicate information, so it is unnecessary to store each piece of collected information in its entirety. Here, we can use the principles of relational database design to create an additional device identifier table and generate an ID. Then, the collection table and the device identifier table are linked by ID, as follows:
Device Identification Table
ID | Model | Number | City | Region |
---|---|---|---|---|
1 | Model 1 | 0001 | Beijing | Huairou District |
2 | Model 2 | 0010 | Shanghai | Chongming District |
...... |
Weather Sensor Data Collection Table
Device ID | Data Points Collected | Timestamp |
---|---|---|
1 | Temperature: 29.2°C, Wind Speed: 3.2 km/h, Humidity: 55% | 2021-11-16 13:57:13 |
2 | Temperature: 21.5°C, Wind Speed: 8.5 km/h, Humidity: 35% | 2021-11-16 13:57:13 |
...... |
After this division, the storage space for device identifier information in the collection table is significantly reduced, requiring only one integer. The data volume of the device identifier table is equal to the total number of devices, which is relatively fixed compared to the number of collection metrics.
Now that the device identification model has been resolved, let's look at how to model the collected metrics. Collected metrics can be divided into narrow tables and wide tables according to their storage methods:
The narrow table scheme stores only one metric data per row, such as:
Device ID | Metric Name | Metric Value | Timestamp |
---|---|---|---|
1 | Temperature | 29.2°C | 2021-11-16 13:57:13 |
1 | Wind Speed | 3.2 km/h | 2021-11-16 13:57:13 |
1 | Humidity | 55% | 2021-11-16 13:57:13 |
2 | Temperature | 21.5°C | 2021-11-16 13:57:13 |
2 | Wind speed | 8.5 km/h | 2021-11-16 13:57:13 |
2 | Humidity | 35% | 2021-11-16 13:57:13 |
...... |
The advantages of this storage scheme are:
The disadvantages are:
In contrast to narrow tables, wide tables store all metrics collected by the device in a single row of data, such as:
Device ID | Temperature | Wind Speed | Humidity | Timestamp |
---|---|---|---|---|
1 | 29.2°C | 3.2 km/h | 55% | 2021-11-16 13:57:13 |
2 | 21.5℃ | 8.5km/h | 35% | 2021-11-16 13:57:13 |
...... |
The advantages of wide tables are:
The disadvantages of wide tables are:
In summary, the fundamental difference between wide tables and narrow tables is whether each row stores a single collected metric or all collected metrics. In production environments, narrow table solutions are not recommended. How can the issues caused by wide tables be addressed? Let’s analyze the problems wide tables may encounter and their solutions.
To address the above challenges, YMatrix has developed the mxkv high-performance KV data type, which allows arbitrary expansion of the number of columns and, similar to the JSON type, offers significantly higher performance than JSON.
After using mxkv, the table can be designed as follows:
Device ID | Metric | Timestamp |
---|---|---|
1 | {“temperature”:29.2,‘wind’:3.2,“humidity”:55} | 2021-11-16 13:57:13 |
2 | {“temperature”:21.5,‘wind’:8.5,“humidity”:35} | 2021-11-16 13:57:13 |
...... |
However, mxkv is not a panacea. While it offers decent read/write performance in HEAP tables, it has the following drawbacks: