In addition to basic join, aggregation, and grouping operations, MatrixDB provides many advanced analysis features, such as window functions and CTEs. This section introduces commonly used advanced analytical methods.
The following examples show how to use advanced analysis to compute disk usage statistics.
For simplicity, the disk metrics we collect include only read and write speed, i.e. the read and write columns.
The schema of the example table is as follows:
CREATE TABLE disk (
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);
A window function performs an aggregation over a set of tuples related to the current tuple. Unlike aggregate functions, which act on the groups defined by a GROUP BY clause, window functions act on the set of tuples related to the current tuple as defined by an OVER clause.
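As a rough illustration of the difference using the disk table above, a GROUP BY aggregate collapses each group into a single row, while a window function keeps every row and attaches the aggregate result to it:
-- Aggregate with GROUP BY: one output row per tag_id
SELECT tag_id, AVG(read) AS avg_read
FROM disk
GROUP BY tag_id;

-- Window function with OVER: every row of disk is kept,
-- each carrying the average read of its own tag_id partition
SELECT time, tag_id, read,
       AVG(read) OVER (PARTITION BY tag_id) AS avg_read
FROM disk;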
The following are some common use cases for window functions:
percentile_cont is used to calculate percentiles
ymatrix=# SELECT tag_id,
percentile_cont(0.2) WITHIN GROUP (ORDER BY read) as read,
percentile_cont(0.3) WITHIN GROUP (ORDER BY write) as write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
 tag_id | read  | write
--------+-------+-------
      1 | 19.87 | 29.86
      2 | 19.95 | 29.88
      3 | 20.06 | 29.93
(3 rows)
The parameter represents the percentile; passing 0.5 is equivalent to calculating the median.
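For example, a minimal sketch of computing the median read speed per device (output omitted):
SELECT tag_id,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY read) AS median_read
FROM disk
GROUP BY tag_id
ORDER BY tag_id;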
Nesting SUM inside a window function computes a cumulative sum: the inner SUM aggregates each GROUP BY time group, and the outer SUM OVER (ORDER BY time) accumulates those values row by row.
The following SQL calculates the cumulative read and write sums for the disk with tag_id = 1 between 2021-04-10 21:00:00 and 2021-04-10 21:00:10:
ymatrix=# SELECT time,
SUM(SUM(read)) OVER (ORDER BY time) AS read,
SUM(SUM(write)) OVER (ORDER BY time) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:00:10'::timestamp
AND tag_id = 1
GROUP BY time
ORDER BY time;
          time          | read | write
------------------------+------+-------
2021-04-10 21:00:00+08 | 81.07 | 73.3
2021-04-10 21:00:01+08 | 110.63 | 121.77
2021-04-10 21:00:02+08 | 202.12 | 201.36
2021-04-10 21:00:03+08 | 263.74 | 257.88
2021-04-10 21:00:04+08 | 361.6 | 299.3
2021-04-10 21:00:05+08 | 394.49 | 327.33000000000004
2021-04-10 21:00:06+08 | 438.3 | 334.98
2021-04-10 21:00:07+08 | 523.35 | 431.39
2021-04-10 21:00:08+08 | 583.15 | 461.84
2021-04-10 21:00:09+08 | 609.01 | 533.03
2021-04-10 21:00:10+08 | 669.52 | 535.9
(11 rows)
A moving average calculates the average of the current record and the previous n records.
The following SQL calculates, for the disk with tag_id = 1, the 10-second moving average of read and write between 21:00:00 and 21:01:00 on 2021-04-10 (for the first 9 rows, the average is taken over however many preceding rows are available):
ymatrix=# SELECT time,
round(AVG(read) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS read,
round(AVG(write) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time DESC;
          time          | read | write
------------------------+------+-------
2021-04-10 21:01:00+08 | 57 | 57
2021-04-10 21:00:59+08 | 49 | 60
2021-04-10 21:00:58+08 | 52 | 56
2021-04-10 21:00:57+08 | 51 | 57
2021-04-10 21:00:56+08 | 53 | 65
2021-04-10 21:00:55+08 | 48 | 64
2021-04-10 21:00:54+08 | 49 | 64
2021-04-10 21:00:53+08 | 47 | 54
2021-04-10 21:00:52+08 | 44 | 54
2021-04-10 21:00:51+08 | 41 | 56
......
Increments are usually used to calculate the growth or decline of a monotonic sequence, and can also simply be used to calculate the change from the previous record.
The following statement calculates, for the disk with tag_id = 1, the change in read between 21:00:00 and 21:01:00 on 2021-04-10. A positive number means an increase over the previous second; a negative number means a decrease:
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) AS read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
          time          | read
------------------------+------
2021-04-10 21:00:00+08 |
2021-04-10 21:00:01+08 | -52
2021-04-10 21:00:02+08 | 62
2021-04-10 21:00:03+08 | -30
2021-04-10 21:00:04+08 | 36
2021-04-10 21:00:05+08 | -65
2021-04-10 21:00:06+08 | 11
2021-04-10 21:00:07+08 | 41
2021-04-10 21:00:08+08 | -25
2021-04-10 21:00:09+08 | -34
......
Dividing the increment by the time interval gives the growth rate (because the sample data is collected once per second, the results are identical to the increments):
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
          time          | read_rate | time lag
------------------------+-----------+----------
2021-04-10 21:00:00+08 | |
2021-04-10 21:00:01+08 | -52 | 1
2021-04-10 21:00:02+08 | 62 | 1
2021-04-10 21:00:03+08 | -30 | 1
2021-04-10 21:00:04+08 | 36 | 1
2021-04-10 21:00:05+08 | -65 | 1
2021-04-10 21:00:06+08 | 11 | 1
2021-04-10 21:00:07+08 | 41 | 1
2021-04-10 21:00:08+08 | -25 | 1
2021-04-10 21:00:09+08 | -34 | 1
......
A change-point query lists records whose value has changed from the previous record. This type of query is useful for finding changes in relatively stable data sets:
ymatrix=# SELECT time, read FROM (
SELECT time,
read,
read - lag(read) OVER (ORDER BY TIME) AS diff
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1 ) ht
WHERE diff IS NULL OR diff != 0
ORDER BY time;
          time          | read
------------------------+------
2021-04-10 21:00:00+08 | 81.07
2021-04-10 21:00:01+08 | 29.56
2021-04-10 21:00:02+08 | 91.49
2021-04-10 21:00:03+08 | 61.62
2021-04-10 21:00:04+08 | 97.86
2021-04-10 21:00:05+08 | 32.89
2021-04-10 21:00:06+08 | 43.81
2021-04-10 21:00:07+08 | 85.05
2021-04-10 21:00:08+08 | 59.8
2021-04-10 21:00:09+08 | 25.86
(10 rows)
Common table expressions (CTEs) allow a SQL statement to define temporary, named result sets, which helps simplify the structure of large queries.
The following CTE calculates the maximum and minimum of the per-device average read speed:
WITH avg_read (tag_id, avg_read) AS (
SELECT tag_id, AVG(read) AS read FROM disk GROUP BY tag_id
) SELECT MAX(avg_read), MIN(avg_read) FROM avg_read;
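For comparison, the same calculation written without a CTE pushes the intermediate result into a FROM subquery; the CTE form above simply gives that intermediate result a name, which becomes increasingly helpful as queries grow:
SELECT MAX(avg_read), MIN(avg_read)
FROM (
    -- per-device average read speed, inlined instead of named in a WITH clause
    SELECT tag_id, AVG(read) AS avg_read
    FROM disk
    GROUP BY tag_id
) AS avg_read;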
The time-series extension matrixts provided by MatrixDB also offers functions that are frequently used in time-series scenarios. First, create the extension:
stats=# CREATE EXTENSION matrixts;
CREATE EXTENSION
time_bucket groups timestamps into fixed-length buckets, which makes it easy to calculate averages over a given time period.
The following SQL calculates, for the disk with tag_id = 1, the average read and write speed in 5-minute buckets between 21:00 and 22:00 on 2021-04-10:
ymatrix=# SELECT time_bucket('5 minutes', time) AS five_min,
AVG(read) as read,
AVG(write) as write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
        five_min        | read | write
------------------------+------+-------
2021-04-10 21:00:00+08 | 48.614599999999996 | 49.48656666666666
2021-04-10 21:05:00+08 | 50.73533333333335 | 49.992566666666654
2021-04-10 21:10:00+08 | 51.6102333333333 | 49.99359999999999
2021-04-10 21:15:00+08 | 49.29116666666669 | 53.89146666666666
2021-04-10 21:20:00+08 | 49.67863333333332 | 50.47406666666665
2021-04-10 21:25:00+08 | 51.09013333333332 | 47.766733333333335
2021-04-10 21:30:00+08 | 49.55949999999999 | 50.440766666666654
2021-04-10 21:35:00+08 | 48.86253333333333 | 50.57290000000001
2021-04-10 21:40:00+08 | 51.061299999999974 | 47.028766666666684
2021-04-10 21:45:00+08 | 52.10353333333333 | 49.861466666666665
2021-04-10 21:50:00+08 | 51.780566666666694 | 51.4159
2021-04-10 21:55:00+08 | 51.83549999999998 | 49.124366666666674
2021-04-10 22:00:00+08 | 93.96 | 91.07
(13 rows)
If data is missing within the time range, you can use the time_bucket_gapfill function to fill in the missing buckets and avoid misleading query results. There are two filling strategies: locf carries the last observed value forward, and interpolate fills the gap by linear interpolation between neighboring values.
For example:
SELECT time_bucket_gapfill('5 minutes', time) AS five_min,
locf(AVG(read)) as locf_read,
interpolate(AVG(read)) as interpolate_read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
first returns the value at the earliest time:
ymatrix=# SELECT tag_id,
first(read, time) AS read,
first(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
 tag_id | read  | write
--------+-------+-------
      1 | 11.51 | 86.61
      2 | 50.07 | 25.9
      3 | 83.72 | 10.5
(3 rows)
last returns the value at the latest time:
ymatrix=# SELECT tag_id,
last(read, time) AS read,
last(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
 tag_id | read  | write
--------+-------+-------
      1 | 5.32  | 4.96
      2 | 5.73  | 34.73
      3 | 49.03 | 86.02
(3 rows)