MatrixDB is suitable for IoT timing application scenarios for devices of all sizes. This tutorial takes a specific data set as an example to illustrate how to load, process and query time series data in MatrixDB.
A city has a population of more than 8 million and 200,000 taxis. The municipal management department has collected and disclosed the itinerary of each taxi, including the time of getting on and off, the place of getting on and off, the number of people riding, the fees and payment methods. With the help of this information, the utilization rate of taxis and the overall traffic conditions can be analyzed, urban management can be improved, and the life experience of residents and tourists can be improved. This tutorial provides a month's data compression package, which can be downloaded at [here] (https://pan.baidu.com/s/1JJv6ADN5vHOlem7smTrEDw) (extract code 1x4u).
Among the collected data, there is a way to pay, and the possible ways to pay are cash, credit cards, free payment, disputed, unknown and invalid. This information is saved by creating a payment_types table so that it can be associated with these metadata during subsequent queries.
create table if not exists payment_types (
payment_type int,
description text
);
insert into payment_types values
(1, ‘Credit card’),
(2, ‘Cash’),
(3, ‘No charge’),
(4, ‘Disputed’),
(5, ‘Unknown’),
(6, ‘Invalid trip’);
There is also a type of rate indicating the type, including standard rate, Airport No. 1, Airport No. 2, special areas, negotiated prices, multiple people taking buses, etc. You can create a rate_codes table to record these types:
create table if not exists rate_codes (
rate_code int,
description text
);
insert into rate_codes values
(1, ‘Standard Rate’),
(2, ‘Airport 1’),
(3, ‘Airport 2’),
(4, ‘Special Area’),
(5, ‘Negotiated Price’),
(6, ‘Group’);
Next, you can create a database table to save specific itinerary data. There are some fields that briefly explain the meanings. pickup_datetime/dropoff_datetime represents the time point of boarding and time point of boarding respectively. pickup_longitude /pickup_latitude represents the latitude and longitude of boarding locations, dropoff_longitude/dropoff_latitude represents the latitude and longitude of boarding locations, passenger_count represents the number of passengers, trip_distance is the distance of the journey (in miles), total_amount represents the ride cost, and the last field trip_duration is a field generated when data is loaded, recording the ride duration (in minutes).
create table if not exists trip (
vendor_id text,
pickup_datetime timestamp without time zone,
dropoff_datetime timestamp without time zone,
passenger_count int,
trip_distance numeric,
pickup_longitude numeric,
pickup_latitude numeric,
rate_code_id int,
store_and_fwd_flag text,
dropoff_longitude numeric,
dropoff_latitude numeric,
payment_type int,
fare_amount numeric,
extra numeric,
mta_tax numeric,
tip_amount numeric,
tolls_amount numeric,
improvement_surcharge numeric,
total_amount numeric,
trip_duration numeric generated always as (EXTRACT(EPOCH FROM (dropoff_datetime - pickup_datetime)::INTERVAL)/60) STORED
)
PARTITION BY RANGE (pickup_datetime)
( START (date '2016-01-01') INCLUSIVE
END (date '2016-02-01') EXCLUSIVE
EVERY (INTERVAL '1 day') );
The trip table will be automatically partitioned by day, which facilitates quick cropping by time period queries, and is also conducive to the rapid processing of expired data in the future.
Find the yellow_tripdata_2016-01.csv file path after the data file package downloaded from the above link is found and use the mxgate command to load the data. Please specify the actual file path after tail, and specify the actual node name or IP through the --db-master-host parameter, for example:
tail -n +2 /home/mxadmin/workspace/nyc-taxi-data/yellow_tripdata_2016-01.csv | mxgate --source stdin --db-database mxdb --db-master-host sdw3 --db-master-port 5432 --db-user mxadmin --time-format raw --target trip --parallel 256 --delimiter ',' --exclude-columns trip_duration
The main parameters of mxgate are as follows:
--db-database mxdb Specify database name
--db-master-host sdw3 master node host name or IP
--db-master-port 5432 database port
--db-user mxdb database username
--time-format raw original format, no conversion
--target trip table name to import
--parallel 256 parallel number
--delimiter ',' delimiter
For more detailed usage methods, please refer to MatrixGate
Similar to many sensor data, this dataset contains some obviously wrong or invalid data for various complex reasons. With the rich SQL capabilities provided by MatrixDB, these invalid error data can be quickly detected and cleared. An error situation is that the department time is earlier than or equal to the boarding time. You can use SELECT to view this situation and delete it with the DELETE statement:
select count(*)
from trip
where dropoff_datetime <= pickup_datetime;
delete from trip
where dropoff_datetime <= pickup_datetime;
Another situation is that the average speed of the car is greater than 300 miles per hour, especially during a journey or a long time. This situation is obviously unreasonable. You can use the following statement to view and clear:
select trip_distance, trip_duration
from trip
where trip_distance > trip_duration* (300/60) and trip_distance > 100;
delete from trip
where trip_distance > trip_duration* (300/60) and trip_distance > 100;
MatrixDB provides time_bucket function, which supports segmented calculations according to any time interval. Before use, you need to install MatrixTS Extension on the database to initialize the timing components:
CREATE EXTENSION matrixts;
After completion, you can use the following SQL statement to count how many itineeraries are per day:
select time_bucket('24 hours', pickup_datetime) as day, count(*)
from trip
group by day
order by day;
If you want to know how many people ride in each hour of the day on January 2, 2016, you can use the following SQL:
SELECT time_bucket('1 hour', pickup_datetime) as hour, sum(passenger_count)
FROM trip
WHERE pickup_datetime >= '2016-01-02 00:00:00' and pickup_datetime < '2016-01-03 00:00:00'
GROUP BY hour
ORDER BY hour;
It can be quickly learned from the max and min functions that the farthest distance in the current dataset is 485.9 miles. If you want to further understand the total number of strokes in different stroke distances above 10, 10-50, 50-100, 100-200, or 200, you only need one SQL statement to complete:
select distance_range, count(*) as num_of_trips
from
(
select
case
when trip_distance <= 10 then 10
when trip_distance > 10 AND trip_distance <= 50 then 50
when trip_distance > 50 AND trip_distance <= 100 then 100
when trip_distance > 100 AND trip_distance <= 200 then 200
when trip_distance > 200 then 500
end as distance_range
from trip
) as temp
group by distance_range;
After execution, you can see the output like this:
distance_range | num_of_trips
----------------------------------------------------------------------------------------------------------------------------------
10 | 10308767
50 | 586200
100 | 379
200 | 58
500 | 9