MatrixDB is suitable for time series applications in IoT scenarios across devices of various scales. This tutorial uses a specific dataset to demonstrate how to load, process, and query time series data in MatrixDB.
A city with over 8 million residents and 200,000 taxis collects and publicly releases trip records from all taxis. The data includes pickup and dropoff times, locations, passenger count, fare, and payment method. These records enable analysis of taxi utilization and overall traffic conditions, supporting improved urban management and better experiences for residents and visitors. A one-month sample dataset is available for download here (password: 1x4u).
The dataset includes a field indicating the payment method, which may be cash, credit card, free ride, disputed, unknown, or invalid. Create a payment_types table to store this metadata for later joins during queries:
create table if not exists payment_types (
payment_type int,
description text
);
insert into payment_types values
(1, 'Credit Card'),
(2, 'Cash'),
(3, 'No Charge'),
(4, 'Disputed'),
(5, 'Unknown'),
(6, 'Invalid Journey');
Another field indicates the rate type: standard rate, airport 1, airport 2, special zone, negotiated price, or group rate. Create a rate_codes table to record these types:
create table if not exists rate_codes (
rate_code int,
description text
);
insert into rate_codes values
(1, 'Standard Rate'),
(2, 'Airport 1'),
(3, 'Airport 2'),
(4, 'Special Zone'),
(5, 'Negotiated Price'),
(6, 'Group Rate');
Next, create a table to store the actual trip data. Key fields are explained as follows:
- pickup_datetime / dropoff_datetime: pickup and dropoff timestamps
- pickup_longitude / pickup_latitude: pickup location coordinates
- dropoff_longitude / dropoff_latitude: dropoff location coordinates
- passenger_count: number of passengers
- trip_distance: trip distance in miles
- total_amount: total fare
- trip_duration: a generated column calculated during data loading, representing the trip duration in minutes

Create the trip table with daily partitioning:
create table if not exists trip (
vendor_id text,
pickup_datetime timestamp without time zone,
dropoff_datetime timestamp without time zone,
passenger_count int,
trip_distance numeric,
pickup_longitude numeric,
pickup_latitude numeric,
rate_code_id int,
store_and_fwd_flag text,
dropoff_longitude numeric,
dropoff_latitude numeric,
payment_type int,
fare_amount numeric,
extra numeric,
mta_tax numeric,
tip_amount numeric,
tolls_amount numeric,
improvement_surcharge numeric,
total_amount numeric,
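-- trip_duration is a generated column: the database derives it from the pickup/dropoff timestamps, in minutes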
trip_duration numeric generated always as (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
)
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
END (date '2016-02-01') EXCLUSIVE
EVERY (INTERVAL '1 day') );
The trip table is automatically partitioned by day. This enables fast partition pruning for time-based queries and simplifies future expiration-based data management.
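For example, a query filtered on pickup_datetime only needs the partitions covering the requested range; a minimal sketch (assuming the table definition above) uses EXPLAIN to check that a single-day query scans only that day's partition:

```
-- Only the partition covering 2016-01-05 should need to be scanned
EXPLAIN
SELECT count(*)
FROM trip
WHERE pickup_datetime >= '2016-01-05 00:00:00'
  AND pickup_datetime <  '2016-01-06 00:00:00';
```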
Extract the downloaded archive and locate yellow_tripdata_2016-01.csv. Load it with the mxgate command, replacing the file path in the tail command (tail -n +2 skips the CSV header line) and specifying the correct master host via --db-master-host. Example:
tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database mxdb --db-master-host sdw3 --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256 --delimiter ',' --exclude-columns trip_duration
Key mxgate parameters:

- --db-database mxdb: database name
- --db-master-host sdw3: master host name or IP
- --db-master-port 5432: database port
- --db-user mxadmin: database user
- --time-format raw: raw time format, no conversion
- --target trip: target table name
- --parallel 256: degree of parallelism
- --delimiter ',': field delimiter
- --exclude-columns trip_duration: skip the trip_duration column during loading, since it is a generated column computed by the database
For more details on mxgate usage and architecture, refer to MatrixGate.
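Once the load completes, a quick sanity check (a simple sketch using standard aggregates) shows how many rows were loaded and which time range they cover:

```
-- Row count and covered pickup time range after loading
SELECT count(*) AS loaded_rows,
       min(pickup_datetime) AS first_pickup,
       max(pickup_datetime) AS last_pickup
FROM trip;
```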
Like many sensor datasets, this dataset contains invalid or erroneous entries due to various factors. Leverage MatrixDB’s powerful SQL capabilities to detect and remove such data quickly.
One common error is a dropoff time earlier than or equal to the pickup time. Use the following queries to inspect and then delete such records:
select count(*)
from trip
where dropoff_datetime <= pickup_datetime;
delete from trip
where dropoff_datetime <= pickup_datetime;
Another anomaly is an average speed exceeding 300 mph, which is clearly unrealistic, especially for longer trips. Since trip_duration is stored in minutes, the 300 mph threshold translates to trip_distance > trip_duration * (300/60). Detect and remove such outliers:
select trip_distance, trip_duration
from trip
where trip_distance > trip_duration * (300/60) and trip_distance > 100;
delete from trip
where trip_distance > trip_duration * (300/60) and trip_distance > 100;
MatrixDB provides the time_bucket function for grouping time-series data into arbitrary intervals. Before using it, install the MatrixTS extension to enable time series functionality:
```
CREATE EXTENSION matrixts;
```
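To confirm the installation succeeded, you can list installed extensions from the pg_extension catalog (a minimal check; matrixts should appear in the result):

```
-- matrixts should be listed once CREATE EXTENSION has run
SELECT extname FROM pg_extension;
```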
After installation, use the following SQL to count the number of trips per day:
```
select time_bucket('24 hours', pickup_datetime) as day, count(*)
from trip
group by day
order by day;
```
To analyze hourly passenger counts on January 2, 2016:
```
SELECT time_bucket('1 hour', pickup_datetime) as hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' and pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;
```
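Plain aggregates answer one-off questions as well; for example, a minimal sketch that returns the longest and shortest recorded trip distances:

```
-- Longest and shortest trip distances in the dataset
select max(trip_distance) as longest_trip,
       min(trip_distance) as shortest_trip
from trip;
```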
Using max() and min(), you can quickly determine that the longest trip in the dataset is 485.9 miles. To count trips in different distance ranges (≤10, 10–50, 50–100, 100–200, and over 200 miles), a single SQL statement suffices; each range is labeled by its upper bound, with 500 standing in for trips over 200 miles:
```
select distance_range, count(*) as num_of_trips
from
(
select
case
when trip_distance <= 10 then 10
when trip_distance > 10 AND trip_distance <= 50 then 50
when trip_distance > 50 AND trip_distance <= 100 then 100
when trip_distance > 100 AND trip_distance <= 200 then 200
when trip_distance > 200 then 500
end as distance_range
from trip
) as temp
group by distance_range;
```
Output:
```
distance_range | num_of_trips
----------------+--------------
10 | 10308767
50 | 586200
100 | 379
200 | 58
500 | 9
```
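Finally, the payment_types and rate_codes tables created earlier can be joined against trip, as anticipated when they were defined. A sketch (assuming the schemas above) that breaks trips down by payment method:

```
-- Trip count and average total fare per payment method
select p.description as payment_method,
       count(*) as num_of_trips,
       round(avg(t.total_amount), 2) as avg_total_amount
from trip t
join payment_types p on p.payment_type = t.payment_type
group by p.description
order by num_of_trips desc;
```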