Basic Timing Tutorial

MatrixDB is suitable for IoT timing application scenarios for devices of all sizes. This tutorial takes a specific data set as an example to illustrate how to load, process and query time series data in MatrixDB.

1 Dataset Introduction -- Dataset of taxi itinerary for a city

A city has a population of more than 8 million and 200,000 taxis. The municipal management department has collected and disclosed the itinerary of each taxi, including the time of getting on and off, the place of getting on and off, the number of people riding, the fees and payment methods. With the help of this information, the utilization rate of taxis and the overall traffic conditions can be analyzed, urban management can be improved, and the life experience of residents and tourists can be improved. This tutorial provides a month's data compression package, which can be downloaded at [here] (https://pan.baidu.com/s/1JJv6ADN5vHOlem7smTrEDw) (extract code 1x4u).

2 Metadata and table schema

Among the collected data, there is a way to pay, and the possible ways to pay are cash, credit cards, free payment, disputed, unknown and invalid. This information is saved by creating a payment_types table so that it can be associated with these metadata during subsequent queries.

create table if not exists payment_types (
    payment_type int,
    description text
);

insert into payment_types values
(1, ‘Credit card’),
(2, ‘Cash’),
(3, ‘No charge’),
(4, ‘Disputed’),
(5, ‘Unknown’),
(6, ‘Invalid trip’);

There is also a type of rate indicating the type, including standard rate, Airport No. 1, Airport No. 2, special areas, negotiated prices, multiple people taking buses, etc. You can create a rate_codes table to record these types:

create table if not exists rate_codes (
    rate_code   int,
    description text
);

insert into rate_codes values
(1, ‘Standard Rate’),
(2, ‘Airport 1’),
(3, ‘Airport 2’),
(4, ‘Special Area’),
(5, ‘Negotiated Price’),
(6, ‘Group’);

Next, you can create a database table to save specific itinerary data. There are some fields that briefly explain the meanings. pickup_datetime/dropoff_datetime represents the time point of boarding and time point of boarding respectively. pickup_longitude /pickup_latitude represents the latitude and longitude of boarding locations, dropoff_longitude/dropoff_latitude represents the latitude and longitude of boarding locations, passenger_count represents the number of passengers, trip_distance is the distance of the journey (in miles), total_amount represents the ride cost, and the last field trip_duration is a field generated when data is loaded, recording the ride duration (in minutes).

create table if not exists trip (
  vendor_id text,
  pickup_datetime timestamp without time zone,
  dropoff_datetime timestamp without time zone,
  passenger_count int,
  trip_distance numeric,
  pickup_longitude numeric,
  pickup_latitude numeric,
  rate_code_id int,
  store_and_fwd_flag text,
  dropoff_longitude numeric,
  dropoff_latitude numeric,
  payment_type int,
  fare_amount numeric,
  extra numeric,
  mta_tax numeric,
  tip_amount numeric,
  tolls_amount numeric,
  improvement_surcharge numeric,
  total_amount numeric,
  trip_duration numeric generated always as (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
) 
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
   END (date '2016-02-01') EXCLUSIVE
   EVERY (INTERVAL '1 day') );

The trip table will be automatically partitioned by day, which facilitates quick cropping by time period queries, and is also conducive to the rapid processing of expired data in the future.

3 Data loading

Find the yellow_tripdata_2016-01.csv file path after the data file package downloaded from the above link is found and use the mxgate command to load the data. Please specify the actual file path after tail, and specify the actual node name or IP through the --db-master-host parameter, for example:

 tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database mxdb --db-master-host sdw3 --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256  --delimiter ','  --exclude-columns trip_duration 

The main parameters of mxgate are as follows:

--db-database mxdb Specify database name
--db-master-host sdw3 master node host name or IP
--db-master-port 5432 database port
--db-user mxdb database username
--time-format raw original format, no conversion
--target trip table name to import
--parallel 256 parallel number
--delimiter ',' delimiter

For more detailed usage methods, please refer to MatrixGate

4 Data processing

Similar to many sensor data, this dataset contains some obviously wrong or invalid data for various complex reasons. With the rich SQL capabilities provided by MatrixDB, these invalid error data can be quickly detected and cleared. An error situation is that the department time is earlier than or equal to the boarding time. You can use SELECT to view this situation and delete it with the DELETE statement:

select count(*) 
from trip 
where dropoff_datetime <= pickup_datetime;

delete from trip 
where dropoff_datetime <= pickup_datetime;

Another situation is that the average speed of the car is greater than 300 miles per hour, especially during a journey or a long time. This situation is obviously unreasonable. You can use the following statement to view and clear:

select trip_distance, trip_duration 
from trip 
where trip_distance > trip_duration* (300/60) and trip_distance > 100;

delete from trip 
where trip_distance > trip_duration* (300/60) and trip_distance > 100;

5 Data Analysis

MatrixDB provides time_bucket function, which supports segmented calculations according to any time interval. Before use, you need to install MatrixTS Extension on the database to initialize the timing components:

 CREATE EXTENSION matrixts;

After completion, you can use the following SQL statement to count how many itineeraries are per day:

select time_bucket('24 hours', pickup_datetime) as day, count(*) 
from trip 
group by day 
order by day;

If you want to know how many people ride in each hour of the day on January 2, 2016, you can use the following SQL:

SELECT time_bucket('1 hour', pickup_datetime) as hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' and pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;

It can be quickly learned from the max and min functions that the farthest distance in the current dataset is 485.9 miles. If you want to further understand the total number of strokes in different stroke distances above 10, 10-50, 50-100, 100-200, or 200, you only need one SQL statement to complete:

select distance_range, count(*) as num_of_trips  
from
(
select 
case
    when trip_distance <= 10 then 10
    when trip_distance > 10 AND trip_distance <= 50 then 50
    when trip_distance > 50 AND trip_distance <= 100 then 100 
    when trip_distance > 100 AND trip_distance <= 200 then 200
    when trip_distance > 200 then 500 
    end as distance_range  
from trip
) as temp
group by distance_range;

After execution, you can see the output like this:

 distance_range | num_of_trips
----------------------------------------------------------------------------------------------------------------------------------
             10 |     10308767
             50 |       586200
            100 |          379
            200 |           58
            500 |            9