In addition to basic join, aggregation, and grouping operations, YMatrix provides many advanced analytical features, such as window functions, common table expressions (CTEs), ordered-set aggregate functions, and commonly used time-series functions. This section introduces the advanced query methods commonly used in YMatrix.
The following disk-usage statistics case demonstrates how to perform advanced queries in YMatrix. The sample metric table disk records the read and write speeds of a disk and uses the MARS2 storage engine.
MARS2 tables depend on the matrixts extension. Before creating a table with this storage engine, you first need to create the extension in the database. The extension is created at the database level and does not need to be created repeatedly.
CREATE EXTENSION matrixts;
Create the metric table disk:
CREATE TABLE disk (
time timestamp with time zone,
tag_id int,
read float,
write float
)
USING mars2
DISTRIBUTED BY (tag_id);
After successfully creating a MARS2 table, you must also create an index of type mars2_btree before the table can be read and written normally.
CREATE INDEX idx_mars2 ON disk
USING mars2_btree(tag_id);
For more information about MARS2, see Storage Engine.
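To run the queries that follow, the disk table must first contain data. Below is a minimal sketch of inserting a few rows by hand (the result sets in this section come from a larger generated data set, so these illustrative rows will not reproduce them exactly):

-- Illustrative sample rows only; not the full data set behind the results shown later
INSERT INTO disk (time, tag_id, read, write) VALUES
    ('2021-04-10 21:00:00+08', 1, 81.07, 73.30),
    ('2021-04-10 21:00:01+08', 1, 29.56, 48.47),
    ('2021-04-10 21:00:02+08', 1, 91.49, 79.59);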
Window functions are mainly used for relatively complex report and statistical analysis scenarios. The "window" defines the set of rows related to the current row over which the function performs its calculation. What distinguishes window functions from other SQL functions is the OVER clause: if a function has an OVER clause, it is a window function. The following table summarizes the differences between window functions and ordinary aggregate functions.
| | Normal Aggregate Function | Window Function |
|---|---|---|
| Output | One record | Multiple records |
| Functions | max(), min(), count(), sum(), etc. | avg(), sum(), rank(), etc. |
| Usage | Usually used in combination with the GROUP BY clause | Used in combination with the OVER clause, which is written directly after the window function's name and arguments. Within the OVER clause you can use three sub-clauses: PARTITION BY, ORDER BY, and ROWS BETWEEN. Using only PARTITION BY forms a static window whose size and position do not change; adding ORDER BY and/or ROWS BETWEEN forms a sliding window whose size and position change from row to row, as illustrated in the sketch after this table. |
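As a sketch of the difference against the example disk table: the first query below uses only PARTITION BY, so every row of a device sees the same static per-device average; the second adds ORDER BY and ROWS BETWEEN, so the average is recomputed over a window that slides with each row.

-- Static window: one fixed average per device
SELECT time, tag_id,
       avg(read) OVER (PARTITION BY tag_id) AS device_avg_read
FROM disk;

-- Sliding window: average over the previous 2 rows plus the current row
SELECT time, tag_id,
       avg(read) OVER (PARTITION BY tag_id ORDER BY time
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_read
FROM disk;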
You can practice commonly used window functions with the following cases.
A cumulative sum can be calculated by nesting sum(): the inner sum() is an ordinary aggregate computed per GROUP BY group, and the outer sum(...) OVER (ORDER BY time) accumulates those per-group sums.
The following SQL calculates the cumulative read and write sums for the disk with tag_id 1 between 2021-04-10 21:00:00 and 2021-04-10 21:00:10:
ymatrix=# SELECT time,
sum(sum(read)) OVER (ORDER BY time) AS read,
sum(sum(write)) OVER (ORDER BY time) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:00:10'::timestamp
AND tag_id = 1
GROUP BY time
ORDER BY time;
time | read | write
------------------------+--------+--------
2021-04-10 21:00:00+08 | 81.07 | 73.3
2021-04-10 21:00:01+08 | 110.63 | 121.77
2021-04-10 21:00:02+08 | 202.12 | 201.36
2021-04-10 21:00:03+08 | 263.74 | 257.88
2021-04-10 21:00:04+08 | 361.6 | 299.3
2021-04-10 21:00:05+08 | 394.49 | 327.33000000000004
2021-04-10 21:00:06+08 | 438.3 | 334.98
2021-04-10 21:00:07+08 | 523.35 | 431.39
2021-04-10 21:00:08+08 | 583.15 | 461.84
2021-04-10 21:00:09+08 | 609.01 | 533.03
2021-04-10 21:00:10+08 | 669.52 | 535.9
(11 rows)
A moving average calculates the average of the current record and the previous n records.
The following SQL calculates a 10-second moving average of read and write speeds from 2021-04-10 21:00:00 to 21:01:00 (for the first 9 records, the average is taken over however many preceding rows are available):
ymatrix=# SELECT time,
round(AVG(read) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS read,
round(AVG(write) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time DESC;
time | read | write
------------------------+------+-------
2021-04-10 21:01:00+08 | 57 | 57
2021-04-10 21:00:59+08 | 49 | 60
2021-04-10 21:00:58+08 | 52 | 56
2021-04-10 21:00:57+08 | 51 | 57
2021-04-10 21:00:56+08 | 53 | 65
2021-04-10 21:00:55+08 | 48 | 64
2021-04-10 21:00:54+08 | 49 | 64
2021-04-10 21:00:53+08 | 47 | 54
2021-04-10 21:00:52+08 | 44 | 54
2021-04-10 21:00:51+08 | 41 | 56
......
Increments are typically used to calculate the increase or decrease of a monotonic sequence, or simply the change relative to the previous record.
The following statement calculates the change in read speed for the disk with tag_id 1 from 2021-04-10 21:00:00 to 21:01:00; a positive value means an increase over the previous second, a negative value a decrease:
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) AS read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read
------------------------+------
2021-04-10 21:00:00+08 |
2021-04-10 21:00:01+08 | -52
2021-04-10 21:00:02+08 | 62
2021-04-10 21:00:03+08 | -30
2021-04-10 21:00:04+08 | 36
2021-04-10 21:00:05+08 | -65
2021-04-10 21:00:06+08 | 11
2021-04-10 21:00:07+08 | 41
2021-04-10 21:00:08+08 | -25
2021-04-10 21:00:09+08 | -34
......
Dividing the increment by the time interval gives the rate of change (because the sample data is collected once per second, the results are the same as the increments):
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read_rate | time lag
------------------------+-----------+----------
2021-04-10 21:00:00+08 | |
2021-04-10 21:00:01+08 | -52 | 1
2021-04-10 21:00:02+08 | 62 | 1
2021-04-10 21:00:03+08 | -30 | 1
2021-04-10 21:00:04+08 | 36 | 1
2021-04-10 21:00:05+08 | -65 | 1
2021-04-10 21:00:06+08 | 11 | 1
2021-04-10 21:00:07+08 | 41 | 1
2021-04-10 21:00:08+08 | -25 | 1
2021-04-10 21:00:09+08 | -34 | 1
......
A change-point query lists records whose value differs from the previous record. This type of query is suitable for finding changes in relatively stable data sets:
ymatrix=# SELECT time, read FROM (
SELECT time,
read,
read - lag(read) OVER (ORDER BY TIME) AS diff
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1 ) ht
WHERE diff IS NULL OR diff != 0
ORDER BY time;
time | read
------------------------+-------
2021-04-10 21:00:00+08 | 81.07
2021-04-10 21:00:01+08 | 29.56
2021-04-10 21:00:02+08 | 91.49
2021-04-10 21:00:03+08 | 61.62
2021-04-10 21:00:04+08 | 97.86
2021-04-10 21:00:05+08 | 32.89
2021-04-10 21:00:06+08 | 43.81
2021-04-10 21:00:07+08 | 85.05
2021-04-10 21:00:08+08 | 59.8
2021-04-10 21:00:09+08 | 25.86
(10 rows)
You can use the following ordered-set aggregate function to calculate a continuous percentile.
percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
An example and its result are as follows.
ymatrix=# SELECT tag_id,
percentile_cont(0.2) WITHIN GROUP
(ORDER BY read) AS read,
percentile_cont(0.3) WITHIN GROUP (ORDER BY write) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 19.87 | 29.86
2 | 19.95 | 29.88
3 | 20.06 | 29.93
(3 rows)
The parameter represents the percentile; passing 0.5 is equivalent to calculating the median.
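For example, the following query (a minimal sketch against the example disk table) computes the per-device median read speed by passing 0.5:

-- Median read speed per device
SELECT tag_id,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY read) AS median_read
FROM disk
GROUP BY tag_id
ORDER BY tag_id;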
A common table expression (CTE) can be used to define a temporary view within a query, thereby simplifying large query statements. It is implemented with the WITH keyword. Before using it, you can compare CTEs with the CREATE VIEW statement in the following table.
| | CTE | CREATE VIEW |
|---|---|---|
| Expression | Not an independent SQL statement, but an expression that forms part of another statement | An independent SQL statement |
| Application scope | The temporary view it creates can be used only by the query to which it belongs | The view it creates can be used by all query statements |
The following CTE calculates the maximum and minimum of the per-device average read speeds:
WITH avg_read (tag_id, avg_read) AS (
SELECT tag_id, AVG(read) AS read FROM disk GROUP BY tag_id
) SELECT MAX(avg_read), MIN(avg_read) FROM avg_read;
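For comparison, the same result could be obtained with a view; the difference is that the view is an independent, persistent object that other queries can also use. A sketch is shown below (the view name avg_read_view is illustrative):

-- A persistent view, reusable by other queries
CREATE VIEW avg_read_view AS
    SELECT tag_id, AVG(read) AS avg_read FROM disk GROUP BY tag_id;

SELECT MAX(avg_read), MIN(avg_read) FROM avg_read_view;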
The matrixts extension provided by YMatrix also includes functions that are frequently used in time-series scenarios. Create the extension first (if you already created it when building the MARS2 table above, there is no need to create it again):
ymatrix=# CREATE EXTENSION matrixts;
time_bucket assigns each timestamp to a fixed-size time bucket, making it easy to calculate aggregates (such as an average) for each time period.
| Parameter name | Description | Data type | Default |
|---|---|---|---|
| period | Time window size | int16; int32; int64; interval | None |
| timestamp | Column to be converted | int16; int32; int64; date; timestamp; timestamptz | None |
The following SQL calculates the average read and write speeds for every 5 minutes between 2021-04-10 21:00:00 and 22:00:00:
ymatrix=# SELECT time_bucket('5 minutes', time) AS five_min,
AVG(read) as read,
AVG(write) as write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
five_min | read | write
------------------------+--------------------+--------------------
2021-04-10 21:00:00+08 | 48.614599999999996 | 49.48656666666666
2021-04-10 21:05:00+08 | 50.73533333333335 | 49.992566666666654
2021-04-10 21:10:00+08 | 51.6102333333333 | 49.99359999999999
2021-04-10 21:15:00+08 | 49.29116666666669 | 53.89146666666666
2021-04-10 21:20:00+08 | 49.67863333333332 | 50.47406666666665
2021-04-10 21:25:00+08 | 51.09013333333332 | 47.766733333333335
2021-04-10 21:30:00+08 | 49.55949999999999 | 50.440766666666654
2021-04-10 21:35:00+08 | 48.86253333333333 | 50.57290000000001
2021-04-10 21:40:00+08 | 51.061299999999974 | 47.028766666666684
2021-04-10 21:45:00+08 | 52.10353333333333 | 49.861466666666665
2021-04-10 21:50:00+08 | 51.780566666666694 | 51.4159
2021-04-10 21:55:00+08 | 51.83549999999998 | 49.124366666666674
2021-04-10 22:00:00+08 | 93.96 | 91.07
(13 rows)
When data is missing within a time period and needs to be cleaned, you can use the time_bucket_gapfill function to fill in the missing values so that the data is evenly distributed in time at the specified interval, which makes analysis easier. There are two filling strategies: locf (last observation carried forward) and interpolate (linear interpolation).
Assume that the data in the example disk table is as follows (this assumption applies only to this subsection):
ymatrix=# SELECT * FROM disk ORDER BY tag_id;
time | tag_id | read | write
------------------------+--------+------+-------
2021-04-10 21:00:00+08 | 1 | 3.4 | 4.6
2021-04-10 21:50:00+08 | 1 | 4 | 2.7
2021-04-10 21:40:00+08 | 1 | 8.4 | 12
2021-04-10 21:20:00+08 | 1 | 2.9 | 6
2021-04-10 21:30:00+08 | 1 | 9 | 10.2
2021-04-10 21:10:00+08 | 1 | 5.2 | 6.6
2021-04-10 22:00:00+08 | 1 | 10 | 7
(7 rows)
Use the time_bucket_gapfill function to fill the data at 5-minute intervals between 21:00:00 and 22:00:00:
SELECT time_bucket_gapfill('5 minutes', time) AS five_min,
locf(AVG(read)) as locf_read,
interpolate(AVG(read)) as interpolate_read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
five_min | locf_read | interpolate_read
------------------------+-----------+------------------
2021-04-10 21:00:00+08 | 3.4 | 3.4
2021-04-10 21:05:00+08 | 3.4 | 4.3
2021-04-10 21:10:00+08 | 5.2 | 5.2
2021-04-10 21:15:00+08 | 5.2 | 4.05
2021-04-10 21:20:00+08 | 2.9 | 2.9
2021-04-10 21:25:00+08 | 2.9 | 5.95
2021-04-10 21:30:00+08 | 9 | 9
2021-04-10 21:35:00+08 | 9 | 8.7
2021-04-10 21:40:00+08 | 8.4 | 8.4
2021-04-10 21:45:00+08 | 8.4 | 6.2
2021-04-10 21:50:00+08 | 4 | 4
2021-04-10 21:55:00+08 | 4 | 7
2021-04-10 22:00:00+08 | 10 | 10
(13 rows)
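Taking the 21:05:00 bucket as an example: it contains no raw data, so locf carries forward the last observed value, 3.4, while interpolate takes the value halfway between the 21:00:00 bucket (3.4) and the 21:10:00 bucket (5.2), that is, (3.4 + 5.2) / 2 = 4.3.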
first returns the value at the earliest time:
ymatrix=# SELECT tag_id,
first(read, time) AS read,
first(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 11.51 | 86.61
2 | 50.07 | 25.9
3 | 83.72 | 10.5
(3 rows)
last returns the value at the latest time:
ymatrix=# SELECT tag_id,
last(read, time) AS read,
last(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 5.32 | 4.96
2 | 5.73 | 34.73
3 | 49.03 | 86.02
(3 rows)
last_not_null_value is equivalent to last with an additional NOT NULL filter; it returns the most recent non-null value:
ymatrix=# SELECT last_not_null_value(read, time)
FROM disk WHERE tag_id = 1;
last_not_null_value
---------------------
3.1
(1 row)
Compared with last_not_null_value, last_not_null returns not only the value but also its time. The return type is a string in the format '["value", "time"]':
ymatrix=# SELECT last_not_null(read, time)
FROM disk WHERE tag_id = 1;
last_not_null
------------------------------------------
["3.1","2021-11-05 17:32:51.754457+08"]
(1 row)