Quick onboard
Deployment
Data Modeling
Connecting
Migration
Query
Operations and Maintenance
Common Maintenance
Partition
Backup and Restore
Expansion
Mirroring
Resource Management
Security
Monitoring
Performance Tuning
Troubleshooting
Reference Guide
Tool guide
Data type
Storage Engine
Executor
Stream
DR (Disaster Recovery)
Configuration
Index
Extension
SQL Reference
YMatrix is suitable for IoT time-series scenarios involving devices of various scales. This tutorial uses a concrete connected-vehicle time-series scenario to demonstrate how to load, process, and query time-series data in YMatrix.
A city with over 8 million residents operates more than 200,000 taxis. The municipal authority collects and publishes trip records from these taxis, including pickup and dropoff times, locations, passenger count, fare, and payment method. What insights can be derived from this data? The answer includes taxi utilization rates and even overall traffic conditions. With such insights, you can improve urban management and enhance the living experience for residents and visitors. This tutorial provides a one-month dataset archive. Click here to begin your urban traffic management journey (password: 1x4u).
Among the collected data is a field indicating the payment method. Possible values include cash, credit card, free ride, disputed, unknown, and invalid—referred to as static attributes. Create a table named payment_types to store this information for later join operations during queries. Since the "payment method" attribute set is small and subject to updates, use the default HEAP storage engine. In general, unless otherwise specified, tables are created using the HEAP engine by default.
=# CREATE TABLE IF NOT EXISTS payment_types (
payment_type int,
description text
)
USING HEAP;
The IF NOT EXISTS clause prevents errors when attempting to create a table that already exists.
=# INSERT INTO payment_types VALUES
(1, 'Credit Card'),
(2, 'Cash'),
(3, 'No Charge'),
(4, 'Disputed'),
(5, 'Unknown'),
(6, 'Invalid Journey');
Another field indicates the rate type, including standard rate, airport #1, airport #2, special zone, negotiated fare, and group ride. Similarly, create a static lookup table rate_codes to store this information, also using the default HEAP storage engine:
=# CREATE TABLE IF NOT EXISTS rate_codes (
rate_code int,
description text
)
USING HEAP;
=# INSERT INTO rate_codes VALUES
(1, 'Standard Rate'),
(2, 'Airport 1'),
(3, 'Airport 2'),
(4, 'Special Zone'),
(5, 'Negotiated Price'),
(6, 'Group');
Next, create a time-series table to store the actual trip data. Below are explanations of key fields:
pickup_datetime / dropoff_datetime: Pickup and dropoff timestamps pickup_longitude / pickup_latitude: Longitude and latitude at pickup dropoff_longitude / dropoff_latitude: Longitude and latitude at dropoff passenger_count: Number of passengers trip_distance: Trip distance in miles total_amount: Fare amount trip_duration: A computed column generated during data loading, representing trip duration in minutes In this taxi trip time-series scenario, the data in the trip table represents time-varying measurements from devices. Given the diversity and variability of data sources in time-series applications, high performance in data ingestion and storage is required, while updates and deletions are rare. Therefore, MARS3 is the optimal choice, offering significant optimizations for time-series data ingestion, storage, and querying.
MARS3 tables depend on the matrixts extension. Before creating such tables, ensure the extension is installed in the target database. Skip this step if already installed:
=# CREATE EXTENSION matrixts;
Use the USING MARS3 clause to specify the storage engine and WITH to set parameters.
compresstype: Compression algorithm. Supported values are zstd, zlib, lz4; default is lz4. compresslevel: Compression level. Lower values yield faster compression; higher values yield better compression ratios. Moderate values balance speed and ratio. Valid ranges vary by algorithm:=# CREATE TABLE IF NOT EXISTS trip (
vendor_id text,
pickup_datetime timestamp without time zone,
dropoff_datetime timestamp without time zone,
passenger_count int,
trip_distance numeric,
pickup_longitude numeric,
pickup_latitude numeric,
rate_code_id int,
store_and_fwd_flag text,
dropoff_longitude numeric,
dropoff_latitude numeric,
payment_type int,
fare_amount numeric,
extra numeric,
mta_tax numeric,
tip_amount numeric,
tolls_amount numeric,
improvement_surcharge numeric,
total_amount numeric,
trip_duration numeric GENERATED ALWAYS AS (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
)
USING MARS3
WITH (compresstype='lz4', compresslevel=1)
DISTRIBUTED BY (vendor_id)
ORDER BY (vendor_id, pickup_datetime)
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
END (date '2016-02-01') EXCLUSIVE
EVERY (INTERVAL '1 day') );
DISTRIBUTED BY clause distributes data across segments via hash distribution on the trip column, ensuring rows with the same vendor_id value reside on the same segment.ORDER BY clause sorts all data within each segment by the (vendor_id, pickup_datetime) sort key, enabling ordered storage.PARTITION BY clause defines range partitioning. It creates 31 daily partitions from January 1, 2016 (inclusive) to February 1, 2016 (exclusive). Automatic daily partitioning enables fast pruning for time-range queries and simplifies future expiration-based data management.Note!
According to SQL syntax rules,DISTRIBUTED BYmust be declared beforePARTITION BY. However, during execution,DISTRIBUTED BYis applied first to distribute data to corresponding segments, followed byPARTITION BYto insert data into appropriate sub-partition tables.
Locate the downloaded yellow_tripdata_2016-01.csv file and use the mxgate command to load the data. Specify the actual file path after tail, and use the --db-master-host parameter to provide the master host name or IP address. Example:
$ tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database postgres --db-master-host mdw --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256 --delimiter ',' --exclude-columns trip_duration
Key mxgate parameters:
--db-database postgres // Target database name
--db-master-host mdw // MXMaster hostname or IP
--db-master-port 5432 // Database port
--db-user mxadmin // Database user
--time-format raw // Raw format, no conversion
--target trip // Target table name
--parallel 256 // Parallel degree
--delimiter ',' // Field delimiter
For more information about mxgate, see MatrixGate.
YMatrix provides the time_bucket function, supporting aggregation over arbitrary time intervals. Before use, install the matrixts extension to initialize time-series components (skip if already created):
=# CREATE EXTENSION matrixts;
You can now run the following SQL to count the total number of trips per day:
=# SELECT time_bucket('24 hours', pickup_datetime) AS day, count(*)
FROM trip
GROUP BY day
ORDER BY day;
To analyze hourly passenger counts on January 2, 2016:
=# SELECT time_bucket('1 hour', pickup_datetime) AS hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' AND pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;
Using max and min, you can quickly determine that the longest trip in the dataset is 485.9 miles. To further analyze the number of trips in different distance ranges (≤10, 10–50, 50–100, 100–200, >200 miles), use a single SQL statement:
=# SELECT distance_range, count(*) AS num_of_trips
FROM
(
SELECT
CASE
WHEN trip_distance <= 10 THEN 10
WHEN trip_distance > 10 AND trip_distance <= 50 THEN 50
WHEN trip_distance > 50 AND trip_distance <= 100 THEN 100
WHEN trip_distance > 100 AND trip_distance <= 200 THEN 200
WHEN trip_distance > 200 THEN 500
END AS distance_range
FROM trip
) AS temp
GROUP BY distance_range;
Upon execution, you should see output like this:
distance_range | num_of_trips
----------------+--------------
10 | 10308767
50 | 586200
100 | 379
200 | 58
500 | 9