This document describes the data ingestion characteristics of time-series workloads and the data ingestion architecture of YMatrix under such scenarios.
Storage is one of the core functions of a database. After completing data modeling and establishing database connections, data must be written into tables.
Data ingestion in time-series scenarios has the following main characteristics:
A typical feature of time-series data ingestion is its massive scale. This manifests in three practical aspects:
As a result, the rapidly growing number of entities combined with high-frequency data collection generates enormous volumes of data. This poses a significant challenge to the database's throughput capability.
YMatrix has developed MatrixGate, a high-speed data ingestion tool. By enabling parallel data ingestion across data nodes (Segments), MatrixGate achieves ingestion rates up to hundreds of millions of data points per second.
Note!
For details on the MatrixGate (mxgate) implementation, see YMatrix - How YMatrix Achieves 50 Million Data Points/Second on a Single Node. For benchmark reports, see YMatrix - Time-Series Database Insert Performance Benchmark: YMatrix is 78x Faster than InfluxDB. For working principles, see Data Ingestion Tools.
In real-world applications, data ingestion challenges go beyond large volume and diverse sources. They also include complex edge cases such as:
In certain scenarios, a device does not transmit all collected metrics at once but sends them in multiple batches. These partial transmissions must be merged into a single record rather than stored as separate entries.
To handle this, YMatrix supports the UPSERT feature. For detailed use cases and instructions, see Batched Data Merge Scenario (UPSERT).
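The merge behavior described above can be illustrated with a minimal Python sketch. It is not YMatrix's UPSERT implementation; it only models the intended result with an in-memory dictionary. The key layout (device ID plus timestamp), the metric names, and the `upsert` helper are hypothetical, chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical record key: (device_id, timestamp).
Key = Tuple[str, str]
# A row maps metric names to values; None means "not reported in this batch".
Row = Dict[str, Optional[float]]


def upsert(table: Dict[Key, Row], key: Key, partial: Row) -> None:
    """Merge a partial batch into the single row stored under `key`.

    Metrics that carry a value overwrite or fill the stored row;
    metrics that are None in this batch leave the stored value untouched.
    """
    row = table.setdefault(key, {})
    for metric, value in partial.items():
        if value is not None:
            row[metric] = value


table: Dict[Key, Row] = {}
key = ("device-1", "2024-01-01 00:00:00")

# The device sends its metrics for the same timestamp in two batches.
upsert(table, key, {"temperature": 21.5, "pressure": None})
upsert(table, key, {"pressure": 101.3})

merged = table[key]
```

After both calls, the table holds one row for the device and timestamp, with both metrics filled in, rather than two separate partial entries.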
Delayed reporting occurs when a device failure or disruption in the data collection pipeline prevents timely transmission. Once the system recovers, the data is transmitted retrospectively. For instance, a vehicle traveling through a signal-dead zone for several days resumes reporting upon reconnection. Such delays can last for days or even weeks.
Out-of-order reporting happens when, after a failure, the system first transmits the most recent data and then gradually fills in missing historical data. As a result, incoming data may have timestamps earlier than those already stored.
Since these two scenarios typically do not require special merge processing by the database, they are not discussed further here.
1.2.3 Different Frequencies
Different frequencies refer to the scenario where device metrics are collected at varying intervals. For example, some metrics are collected every 1 second, while others every 2 seconds.
Reporting at different frequencies can leave a large number of NULL values in the stored data for metrics with lower collection frequencies. NULL values still consume storage space: for HEAP tables, the overhead is (column count / 8) bytes, that is, one bit per column; for MARS2 tables, the overhead is (RowGroup row count / 8) bytes, that is, one bit per row in the RowGroup. Therefore, evaluate candidate solutions carefully against the proportion of NULL values expected.
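The arithmetic behind these overheads can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a measurement of YMatrix internals: the function names are hypothetical, the rounding up to whole bytes is an assumption, and the NULL fraction assumes one row is written per tick of the fastest metric and that the slower interval is a multiple of it.

```python
import math


def heap_null_overhead_bytes(column_count: int) -> int:
    """Overhead from the formula above: column count / 8 bytes.

    Rounding up to a whole byte is an assumption of this sketch.
    """
    return math.ceil(column_count / 8)


def mars2_null_overhead_bytes(rowgroup_row_count: int) -> int:
    """Overhead from the formula above: RowGroup row count / 8 bytes.

    Rounding up to a whole byte is an assumption of this sketch.
    """
    return math.ceil(rowgroup_row_count / 8)


def null_fraction(row_interval_s: float, metric_interval_s: float) -> float:
    """Fraction of rows in which a slower metric is NULL.

    Assumes rows are written every `row_interval_s` seconds and the metric
    is only collected every `metric_interval_s` seconds.
    """
    return 1.0 - row_interval_s / metric_interval_s


# Hypothetical table shapes, chosen only for illustration.
heap_bytes = heap_null_overhead_bytes(100)      # 100 columns
mars2_bytes = mars2_null_overhead_bytes(8192)   # 8192 rows in a RowGroup

# The example from the text: rows every 1 second, one metric every 2 seconds.
slow_metric_nulls = null_fraction(1, 2)
```

With the example intervals from the text, the slower metric is NULL in half of the rows, which is the kind of figure to weigh when deciding how to model mixed-frequency metrics.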
YMatrix supports ingesting data from multiple sources and in various formats. The following diagram illustrates common data sources and storage methods.