Time Series Basic Tutorial

MatrixDB is suitable for time series applications in IoT scenarios across devices of various scales. This tutorial uses a specific dataset to demonstrate how to load, process, and query time series data in MatrixDB.

1 Dataset Overview — Urban Taxi Trip Data

A city with over 8 million residents and 200,000 taxis collects and publicly releases trip records from all taxis. The data includes pickup and dropoff times, locations, passenger count, fare, and payment method. These records enable analysis of taxi utilization and overall traffic conditions, supporting improved urban management and better experiences for residents and visitors. A one-month sample dataset is available for download here (password: 1x4u).

2 Metadata and Table Schema

The dataset includes a field indicating the payment method, which may be cash, credit card, free ride, disputed, unknown, or invalid. Create a payment_types table to store this metadata for later joins during queries:

create table if not exists payment_types (
    payment_type int,
    description text
);

insert into payment_types values
(1, 'Credit Card'),
(2, 'Cash'),
(3, 'No Charge'),
(4, 'Disputed'),
(5, 'Unknown'),
(6, 'Invalid Journey');

Another field indicates the rate type, including standard rate, airport 1, airport 2, special area, negotiated fare, and group ride. Create a rate_codes table to record these types:

create table if not exists rate_codes (
    rate_code   int,
    description text
);

insert into rate_codes values
((1, 'Standard Rate'),
(2, 'Airport 1'),
(3, 'Airport 2'),
(4, 'Special Zone'),
(5, 'Negotiated Price'),
(6, 'Group Rate');

Next, create a table to store the actual trip data. Key fields are explained as follows:

  • pickup_datetime / dropoff_datetime: pickup and dropoff timestamps
  • pickup_longitude / pickup_latitude: pickup location coordinates
  • dropoff_longitude / dropoff_latitude: dropoff location coordinates
  • passenger_count: number of passengers
  • trip_distance: trip distance in miles
  • total_amount: total fare
  • trip_duration: a generated column calculated during data loading, representing trip duration in minutes

Create the trip table with daily partitioning:

create table if not exists trip (
  vendor_id text,
  pickup_datetime timestamp without time zone,
  dropoff_datetime timestamp without time zone,
  passenger_count int,
  trip_distance numeric,
  pickup_longitude numeric,
  pickup_latitude numeric,
  rate_code_id int,
  store_and_fwd_flag text,
  dropoff_longitude numeric,
  dropoff_latitude numeric,
  payment_type int,
  fare_amount numeric,
  extra numeric,
  mta_tax numeric,
  tip_amount numeric,
  tolls_amount numeric,
  improvement_surcharge numeric,
  total_amount numeric,
  trip_duration numeric generated always as (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
) 
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
   END (date '2016-02-01') EXCLUSIVE
   EVERY (INTERVAL '1 day') );

The trip table is automatically partitioned by day. This enables fast partition pruning for time-based queries and simplifies future expiration-based data management.

3 Data Loading

Locate the extracted file yellow_tripdata_2016-01.csv from the downloaded archive. Use the mxgate command to load the data. Replace the file path in the tail command and specify the correct master host via --db-master-host. Example:

tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database mxdb --db-master-host sdw3 --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256 --delimiter ',' --exclude-columns trip_duration

Key mxgate parameters:

--db-database mxdb     Database name  
--db-master-host sdw3  Master host name or IP  
--db-master-port 5432  Database port  
--db-user mxadmin      Database user  
--time-format raw      Raw format, no conversion  
--target trip          Target table name  
--parallel 256         Degree of parallelism  
--delimiter ','        Field delimiter  

For more details on mxgate usage and architecture, refer to MatrixGate.

4 Data Cleaning

Like many sensor datasets, this dataset contains invalid or erroneous entries due to various factors. Leverage MatrixDB’s powerful SQL capabilities to detect and remove such data quickly.

One common error is dropoff time being earlier than or equal to pickup time. Use the following query to inspect and delete such records:

select count(*) 
from trip 
where dropoff_datetime <= pickup_datetime;

delete from trip 
where dropoff_datetime <= pickup_datetime;

Another anomaly is average speed exceeding 300 mph—clearly unrealistic, especially for longer trips. Detect and remove such outliers:

select trip_distance, trip_duration 
from trip 
where trip_distance > trip_duration * (300/60) and trip_distance > 100;

delete from trip 
where trip_distance > trip_duration * (300/60) and trip_distance > 100;

5 Data Analysis

MatrixDB provides the time_bucket function for grouping time-series data into arbitrary intervals. Before using it, install the MatrixTS extension to enable time series functionality: ``

CREATE EXTENSION matrixts;

``

After installation, use the following SQL to count the number of trips per day: ``

select time_bucket('24 hours', pickup_datetime) as day, count(*) 
from trip 
group by day 
order by day;

``

To analyze hourly passenger counts on January 2, 2016: ``

SELECT time_bucket('1 hour', pickup_datetime) as hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' and pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;

``

Using max() and min(), you can quickly determine that the longest trip in the dataset is 485.9 miles. To count trips in different distance ranges (≤10, 10–50, 50–100, 100–200, >200 miles), use a single SQL statement: ``

select distance_range, count(*) as num_of_trips  
from
(
select 
case
    when trip_distance <= 10 then 10
    when trip_distance > 10 AND trip_distance <= 50 then 50
    when trip_distance > 50 AND trip_distance <= 100 then 100 
    when trip_distance > 100 AND trip_distance <= 200 then 200
    when trip_distance > 200 then 500 
    end as distance_range  
from trip
) as temp
group by distance_range;

``

Output: ``

 distance_range | num_of_trips
----------------+--------------
             10 |     10308767
             50 |       586200
            100 |          379
            200 |           58
            500 |            9

``