YMatrix

A Common Time-Series Scenario

YMatrix is suitable for IoT time-series scenarios involving devices of various scales. This tutorial uses a concrete connected-vehicle time-series scenario to demonstrate how to load, process, and query time-series data in YMatrix.

1 Introduction to the Time-Series Dataset — Urban Taxi Trip Data

A city with over 8 million residents operates more than 200,000 taxis. The municipal authority collects and publishes trip records from these taxis, including pickup and dropoff times, locations, passenger count, fare, and payment method. What insights can be derived from this data? The answer includes taxi utilization rates and even overall traffic conditions. With such insights, you can improve urban management and enhance the living experience for residents and visitors. This tutorial provides a one-month dataset archive. Click here to begin your urban traffic management journey (password: 1x4u).

2 Metadata and Table Schema

Among the collected data is a field indicating the payment method. Possible values include cash, credit card, free ride, disputed, unknown, and invalid—referred to as static attributes. Create a table named payment_types to store this information for later join operations during queries. Since the "payment method" attribute set is small and subject to updates, use the default HEAP storage engine. In general, unless otherwise specified, tables are created using the HEAP engine by default.

=# CREATE TABLE IF NOT EXISTS payment_types (
    payment_type int,
    description text
)
USING HEAP;

The IF NOT EXISTS clause prevents errors when attempting to create a table that already exists.

=# INSERT INTO payment_types VALUES
(1, 'Credit Card'),
(2, 'Cash'),
(3, 'No Charge'),
(4, 'Disputed'),
(5, 'Unknown'),
(6, 'Invalid Journey');

Another field indicates the rate type, including standard rate, airport #1, airport #2, special zone, negotiated fare, and group ride. Similarly, create a static lookup table rate_codes to store this information, also using the default HEAP storage engine:

=# CREATE TABLE IF NOT EXISTS rate_codes (
    rate_code   int,
    description text
)
USING HEAP;

=# INSERT INTO rate_codes VALUES
(1, 'Standard Rate'),
(2, 'Airport 1'),
(3, 'Airport 2'),
(4, 'Special Zone'),
(5, 'Negotiated Price'),
(6, 'Group');

Next, create a time-series table to store the actual trip data. Below are explanations of key fields:

pickup_datetime / dropoff_datetime: Pickup and dropoff timestamps
pickup_longitude / pickup_latitude: Longitude and latitude at pickup
dropoff_longitude / dropoff_latitude: Longitude and latitude at dropoff
passenger_count: Number of passengers
trip_distance: Trip distance in miles
total_amount: Fare amount
trip_duration: A computed column generated during data loading, representing trip duration in minutes

In this taxi trip time-series scenario, the data in the trip table represents time-varying measurements from devices. Given the diversity and variability of data sources in time-series applications, high performance in data ingestion and storage is required, while updates and deletions are rare. Therefore, MARS3 is the optimal choice, offering significant optimizations for time-series data ingestion, storage, and querying.

MARS3 tables depend on the matrixts extension. Before creating such tables, ensure the extension is installed in the target database. Skip this step if already installed:

=# CREATE EXTENSION matrixts;

Use the USING MARS3 clause to specify the storage engine and WITH to set parameters.

compresstype: Compression algorithm. Supported values are zstd, zlib, lz4; default is lz4.
compresslevel: Compression level. Lower values yield faster compression; higher values yield better compression ratios. Moderate values balance speed and ratio. Valid ranges vary by algorithm:
- zstd: 1–19
- zlib: 1–9
- lz4: 1–20 (default: 1)

=# CREATE TABLE IF NOT EXISTS trip (
  vendor_id text,
  pickup_datetime timestamp without time zone,
  dropoff_datetime timestamp without time zone,
  passenger_count int,
  trip_distance numeric,
  pickup_longitude numeric,
  pickup_latitude numeric,
  rate_code_id int,
  store_and_fwd_flag text,
  dropoff_longitude numeric,
  dropoff_latitude numeric,
  payment_type int,
  fare_amount numeric,
  extra numeric,
  mta_tax numeric,
  tip_amount numeric,
  tolls_amount numeric,
  improvement_surcharge numeric,
  total_amount numeric,
  trip_duration numeric GENERATED ALWAYS AS (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
) 
USING MARS3
WITH (compresstype='lz4', compresslevel=1)
DISTRIBUTED BY (vendor_id)
ORDER BY (vendor_id, pickup_datetime)
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
   END (date '2016-02-01') EXCLUSIVE
   EVERY (INTERVAL '1 day') );

The DISTRIBUTED BY clause distributes data across segments via hash distribution on the trip column, ensuring rows with the same vendor_id value reside on the same segment.
The ORDER BY clause sorts all data within each segment by the (vendor_id, pickup_datetime) sort key, enabling ordered storage.
The PARTITION BY clause defines range partitioning. It creates 31 daily partitions from January 1, 2016 (inclusive) to February 1, 2016 (exclusive). Automatic daily partitioning enables fast pruning for time-range queries and simplifies future expiration-based data management.

Note!
According to SQL syntax rules, DISTRIBUTED BY must be declared before PARTITION BY. However, during execution, DISTRIBUTED BY is applied first to distribute data to corresponding segments, followed by PARTITION BY to insert data into appropriate sub-partition tables.

3 Data Loading

Locate the downloaded yellow_tripdata_2016-01.csv file and use the mxgate command to load the data. Specify the actual file path after tail, and use the --db-master-host parameter to provide the master host name or IP address. Example:

$ tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database postgres --db-master-host mdw --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256  --delimiter ','  --exclude-columns trip_duration

Key mxgate parameters:

--db-database postgres     // Target database name    
--db-master-host mdw       // MXMaster hostname or IP
--db-master-port 5432      // Database port
--db-user mxadmin          // Database user
--time-format raw          // Raw format, no conversion
--target trip              // Target table name
--parallel 256             // Parallel degree
--delimiter ','            // Field delimiter

For more information about mxgate, see MatrixGate.

4 Data Analysis

YMatrix provides the time_bucket function, supporting aggregation over arbitrary time intervals. Before use, install the matrixts extension to initialize time-series components (skip if already created):

=# CREATE EXTENSION matrixts;

You can now run the following SQL to count the total number of trips per day:

=# SELECT time_bucket('24 hours', pickup_datetime) AS day, count(*) 
FROM trip 
GROUP BY day 
ORDER BY day;

To analyze hourly passenger counts on January 2, 2016:

=# SELECT time_bucket('1 hour', pickup_datetime) AS hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' AND pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;

Using max and min, you can quickly determine that the longest trip in the dataset is 485.9 miles. To further analyze the number of trips in different distance ranges (≤10, 10–50, 50–100, 100–200, >200 miles), use a single SQL statement:

=# SELECT distance_range, count(*) AS num_of_trips  
FROM
(
SELECT 
CASE
    WHEN trip_distance <= 10 THEN 10
    WHEN trip_distance >  10 AND trip_distance <= 50 THEN 50
    WHEN trip_distance >  50 AND trip_distance <= 100 THEN 100 
    WHEN trip_distance > 100 AND trip_distance <= 200 THEN 200
    WHEN trip_distance > 200 THEN 500 
    END AS distance_range  
FROM trip
) AS temp
GROUP BY distance_range;

Upon execution, you should see output like this:

 distance_range | num_of_trips
----------------+--------------
             10 |     10308767
             50 |       586200
            100 |          379
            200 |           58
            500 |            9

← Previous

Dataquery Test

Onboard: Streaming

English Русский 简体中文

A Common Time-Series Scenario

1 Introduction to the Time-Series Dataset — Urban Taxi Trip Data

2 Metadata and Table Schema

3 Data Loading

4 Data Analysis

English

Русский

简体中文