A Common Time-Series Scenario

YMatrix is suitable for IoT time-series scenarios involving devices of various scales. This tutorial uses a concrete connected-vehicle time-series scenario to demonstrate how to load, process, and query time-series data in YMatrix.

1 Introduction to the Time-Series Dataset — Taxi Trip Data from a City

A city with over 8 million residents operates more than 200,000 taxis. The municipal transportation authority collects and publishes trip records for each taxi ride, including pickup and dropoff times, locations, number of passengers, fare, and payment method. What insights can be derived from this data? The answer includes taxi utilization rates and even overall traffic conditions. With such insights, you can improve urban management and enhance the quality of life for residents and visitors. This tutorial provides a one-month dataset archive. Click here to begin your journey into urban mobility analytics (password: 1x4u).

2 Metadata and Table Schema

Among the collected data is a field indicating the payment method. Possible values include cash, credit card, free trip, disputed, unknown, and invalid—referred to as static attributes. Create a table named payment_types to store this information for later joins during queries. Since the "payment method" data is small in size and may require updates, use the default HEAP storage engine. In general, unless otherwise specified, tables will automatically use the HEAP engine.

=# CREATE TABLE IF NOT EXISTS payment_types (
    payment_type int,
    description text
)
USING HEAP;

Using the IF NOT EXISTS clause prevents errors when attempting to create a table that already exists.

=# INSERT INTO payment_types VALUES
(1, 'Credit Card'),
(2, 'Cash'),
(3, 'No Charge'),
(4, 'Disputed'),
(5, 'Unknown'),
(6, 'Invalid Journey');

Another field indicates the rate type, including standard rate, airport #1, airport #2, special zone, negotiated fare, and group travel. Similarly, create a static lookup table rate_codes using the default HEAP storage engine:

=# CREATE TABLE IF NOT EXISTS rate_codes (
    rate_code   int,
    description text
)
USING HEAP;

=# INSERT INTO rate_codes VALUES
(1, 'Standard Rate'),
(2, 'Airport 1'),
(3, 'Airport 2'),
(4, 'Special Zone'),
(5, 'Negotiated Price'),
(6, 'Group');

Now, create a time-series table to store the actual trip data. Below are explanations of key fields:

  • pickup_datetime / dropoff_datetime: Pickup and dropoff timestamps
  • pickup_longitude / pickup_latitude: Longitude and latitude at pickup
  • dropoff_longitude / dropoff_latitude: Longitude and latitude at dropoff
  • passenger_count: Number of passengers
  • trip_distance: Trip distance in miles
  • total_amount: Fare amount
  • trip_duration: A computed column generated during data loading, representing trip duration in minutes

In this taxi trip time-series scenario, the data in table trip represents time-varying measurements from devices. Given the diversity and variability of data sources in time-series applications, high performance in data ingestion and storage is critical, while requirements for updates and deletions are minimal. Therefore, MARS3 is the optimal choice, offering significant optimizations for time-series data ingestion, storage, and querying.

MARS3 tables depend on the matrixts extension. Before creating such tables, ensure the extension is installed in the target database. If already installed, skip this step.

=# CREATE EXTENSION matrixts;

When creating the table, use the USING MARS3 clause to specify the storage engine and the WITH clause to set parameters.

  • compresstype: Compression algorithm. Supported values are zstd, zlib, lz4; default is lz4.
  • compresslevel: Compression level. Lower values yield faster compression; higher values achieve better compression ratios. Moderate values balance speed and ratio. Valid ranges vary by algorithm:
    • zstd: 1–19
    • zlib: 1–9
    • lz4: 1–20 (default: 1)
=# CREATE TABLE IF NOT EXISTS trip (
  vendor_id text,
  pickup_datetime timestamp without time zone,
  dropoff_datetime timestamp without time zone,
  passenger_count int,
  trip_distance numeric,
  pickup_longitude numeric,
  pickup_latitude numeric,
  rate_code_id int,
  store_and_fwd_flag text,
  dropoff_longitude numeric,
  dropoff_latitude numeric,
  payment_type int,
  fare_amount numeric,
  extra numeric,
  mta_tax numeric,
  tip_amount numeric,
  tolls_amount numeric,
  improvement_surcharge numeric,
  total_amount numeric,
  trip_duration numeric GENERATED ALWAYS AS (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
) 
USING MARS3
WITH (compresstype='lz4', compresslevel=1)
DISTRIBUTED BY (vendor_id)
ORDER BY (vendor_id, pickup_datetime)
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
   END (date '2016-02-01') EXCLUSIVE
   EVERY (INTERVAL '1 day') );
  • The DISTRIBUTED BY clause distributes data across segments via hash distribution on the trip column, ensuring rows with the same vendor_id value reside on the same segment.
  • The ORDER BY clause sorts all data within each segment by the (vendor_id, pickup_datetime) sort key, enabling ordered storage.
  • The PARTITION BY clause defines range partitioning on trip, creating 31 daily partitions from January 1, 2016 (inclusive) to February 1, 2016 (exclusive). Daily partitioning enables fast pruning for time-range queries and simplifies future expiration-based data management.

Note!
According to SQL syntax rules, DISTRIBUTED BY must be declared before PARTITION BY. However, during execution, DISTRIBUTED BY is applied first to distribute data to corresponding segments, followed by PARTITION BY to insert data into appropriate sub-partition tables.

3 Data Loading

Locate the downloaded yellow_tripdata_2016-01.csv file and use the mxgate command to load it. Specify the actual file path after tail, and use the --db-master-host parameter to provide the master host name or IP address. Example:

$ tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database postgres --db-master-host mdw --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256  --delimiter ','  --exclude-columns trip_duration 

Key mxgate parameters explained:

--db-database postgres     // Target database name    
--db-master-host mdw       // MXMaster hostname or IP
--db-master-port 5432      // Database port
--db-user mxadmin          // Database user
--time-format raw          // Raw format, no conversion
--target trip              // Target table name
--parallel 256             // Degree of parallelism
--delimiter ','            // Field delimiter

For more information about mxgate, see MatrixGate.

4 Data Analysis

YMatrix provides the time_bucket function, supporting aggregation over arbitrary time intervals. Before use, install the matrixts extension to initialize time-series components (if not already created):

=# CREATE EXTENSION matrixts;

You can now run the following SQL to count the total number of trips per day:

=# SELECT time_bucket('24 hours', pickup_datetime) AS day, count(*) 
FROM trip 
GROUP BY day 
ORDER BY day;

To analyze hourly passenger counts on January 2, 2016:

=# SELECT time_bucket('1 hour', pickup_datetime) AS hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' AND pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;

Using max and min, you can quickly determine that the longest trip in the dataset is 485.9 miles. To further analyze the number of trips falling into different distance ranges (≤10, 10–50, 50–100, 100–200, >200 miles), use a single SQL statement:

=# SELECT distance_range, count(*) AS num_of_trips  
FROM
(
SELECT 
CASE
    WHEN trip_distance <= 10 THEN 10
    WHEN trip_distance >  10 AND trip_distance <= 50 THEN 50
    WHEN trip_distance >  50 AND trip_distance <= 100 THEN 100 
    WHEN trip_distance > 100 AND trip_distance <= 200 THEN 200
    WHEN trip_distance > 200 THEN 500 
    END AS distance_range  
FROM trip
) AS temp
GROUP BY distance_range;

The output should resemble:

 distance_range | num_of_trips
----------------+--------------
             10 |     10308767
             50 |       586200
            100 |          379
            200 |           58
            500 |            9