Data Ingestion Characteristics in Time-Series Scenarios

This document describes the data ingestion characteristics of time-series workloads and the data ingestion architecture of YMatrix under such scenarios.

1. Data Ingestion Characteristics in Time-Series Scenarios

Storage is one of the core functions of a database. After completing data modeling and establishing database connections, data must be written into tables.

Data ingestion in time-series scenarios has the following main characteristics:

  • Large data volume, requiring high throughput performance
  • Complex ingestion patterns, such as batch merging, out-of-order writes, and heterogeneous frequency reporting

1.1 Large Data Volume

A typical feature of time-series data ingestion is its massive scale. This manifests in three practical aspects:

  • Large number of entities (devices/customers): Total devices can range from hundreds of thousands to millions, and continue to grow.
  • High number and variety of metrics: For example, in connected vehicle scenarios, each vehicle may generate thousands of metrics.
  • High collection frequency: Metrics are collected at second-level intervals, with some requiring collection as frequently as every 10 milliseconds.

As a result, the rapidly growing number of entities combined with high-frequency data collection generates enormous volumes of data. This poses a significant challenge to the database's throughput capability.

YMatrix has developed MatrixGate, a high-speed data ingestion tool. By enabling parallel data ingestion across data nodes (Segments), MatrixGate achieves ingestion rates up to hundreds of millions of data points per second.

Note!
For details on mxgate implementation, see YMatrix - How YMatrix Achieves 50 Million Data Points/Second on a Single Node; for benchmark reports, refer to YMatrix - Time-Series Database Insert Performance Benchmark: YMatrix is 78x Faster than InfluxDB; for working principles, see Data Ingestion Tools.

1.2 Complex Ingestion Scenarios

In real-world applications, data ingestion challenges go beyond large volume and diverse sources. They also include complex edge cases such as:

  • Batched reporting with automatic merging
  • Out-of-order or delayed reporting
  • Heterogeneous frequency reporting

1.2.1 Batched Reporting

In certain scenarios, a device does not transmit all collected metrics at once but sends them in multiple batches. These partial transmissions must be merged into a single record rather than stored as separate entries.

To handle this, YMatrix supports the UPSERT feature. For detailed use cases and instructions, see Batched Data Merge Scenario (UPSERT).

1.2.2 Out-of-Order and Delayed Reporting

Delayed reporting occurs when a device failure or disruption in the data collection pipeline prevents timely transmission. Once the system recovers, the data is transmitted retrospectively. For instance, a vehicle traveling through a signal-dead zone for several days resumes reporting upon reconnection. Such delays can last for days or even weeks.

Out-of-order reporting happens when, after a failure, the system first transmits the most recent data and then gradually fills in missing historical data. As a result, incoming data may have timestamps earlier than those already stored.

Since these two scenarios typically do not require special merge processing by the database, they are not discussed further here.

--- SPLIT ---

1.2.3 Different Frequencies
Different frequencies refer to the scenario where device metrics are collected at varying intervals. For example, some metrics are collected every 1 second, while others every 2 seconds.

Reporting at different frequencies can result in a large number of NULL values in the stored data for metrics with lower collection frequencies. NULL values still consume storage space: for HEAP tables, the overhead is column count / 8 bytes; for MARS2 tables, the overhead is RowGroup row count / 8 bytes. Therefore, solutions should be carefully evaluated based on the extent of NULL values.

2 Data Ingestion Overview in YMatrix

YMatrix supports ingesting data from multiple sources and in various formats. The following diagram illustrates common data sources and storage methods.

Note!
Click an icon directly to navigate to its corresponding documentation.

MatrixGate YMatrix COPY FDW EMQ PXF S3 Hive HBase HDFS Oracle SQL Server MySQL PostgreSQL MongoDB Kafka document Greenplum MatrixDB RESTful API stdin Apache NiFi JDBC/ODBC/libpq Java Python Golang C/C++