Advanced Query

In addition to basic connection, aggregation, and grouping operations, MatrixDB also provides many advanced analysis functions, such as window functions, CTEs, etc. This section introduces commonly used advanced analytical methods.

The following is an example of how to use advanced analysis using statistics disk usage.

To be simplified, the amount of disk information we want to count only includes read and write speed, that is: read and write.

The index table pattern is as follows:

CREATE TABLE disk(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
Distributed by (tag_id);

1. Window Function

The window function performs an aggregation operation on a set of tuples related to the current tuple. Unlike aggregate functions, aggregate functions act on groups defined by GROUP BY clauses, while window functions act on sets of tuples related to the current tuple defined by an OVER clause.

The following are some commonly used cases of using window functions:

1.1 Calculate the percentile

percentage_cont is used to calculate percentiles

ymatrix=# SELECT tag_id,
    percentile_cont(0.2) WITHIN GROUP (ORDER BY read) as read,
    percentile_cont(0.3) WITHIN GROUP (ORDER BY write) as write
    FROM disk
    GROUP BY tag_id
    ORDER BY tag_id;

 tag_id | read  | write
-------+------+-------
      1 | 19.87 | 29.86
      2 | 19.95 | 29.88
      3 | 20.06 | 29.93
(3 rows)

Parameters represent percentages, if 0.5 is passed to be equivalent to calculating the median.

1.2 Accumulation and

By using SUM methods nested, the cumulative and

The following sql calculates the disk read and write accumulation sum of tag_id=1 between 2021-04-10 21:00:00 to 2021-04-10 21:00:10::

ymatrix=# SELECT time,
    SUM(SUM(read)) OVER (ORDER BY time) AS read,
    SUM(SUM(write)) OVER (ORDER BY time) AS write
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:00:10'::timestamp
    AND tag_id = 1 
    GROUP BY time
    ORDER BY time;

          time          |  read  |       write
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 2021-04-10 21:00:00+08 |  81.07 |               73.3
 2021-04-10 21:00:01+08 | 110.63 |             121.77
 2021-04-10 21:00:02+08 | 202.12 |             201.36
 2021-04-10 21:00:03+08 | 263.74 |             257.88
 2021-04-10 21:00:04+08 |  361.6 |              299.3
 2021-04-10 21:00:05+08 | 394.49 | 327.33000000000004
 2021-04-10 21:00:06+08 |  438.3 |             334.98
 2021-04-10 21:00:07+08 | 523.35 |             431.39
 2021-04-10 21:00:08+08 | 583.15 |             461.84
 2021-04-10 21:00:09+08 | 609.01 |             533.03
 2021-04-10 21:00:10+08 | 669.52 |              535.9
(11 rows)

1.3 Moving Average

Moving average is used to calculate the average of the record and the previous n bars.

The following sql calculates the disk with tag_id of 1, and the average read and write every 10 seconds from 21:00 to 21:00 on 2021-4-10 (for the first 9 data, the average value of the number of rows that meet the condition is calculated):

ymatrix=# SELECT time,
    round(AVG(read) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS read,
    round(AVG(write) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS write
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
    AND tag_id = 1
    ORDER BY time DESC;

          time          | read | write
------------------------+-------------------------------------
 2021-04-10 21:01:00+08 |   57 |    57
 2021-04-10 21:00:59+08 |   49 |    60
 2021-04-10 21:00:58+08 |   52 |    56
 2021-04-10 21:00:57+08 |   51 |    57
 2021-04-10 21:00:56+08 |   53 |    65
 2021-04-10 21:00:55+08 |   48 |    64
 2021-04-10 21:00:54+08 |   49 |    64
 2021-04-10 21:00:53+08 |   47 |    54
 2021-04-10 21:00:52+08 |   44 |    54
 2021-04-10 21:00:51+08 |   41 |    56
......

1.4 Increment

Increments are usually used to calculate the increase or decrease of a monotonic sequence, and can also be simply used to calculate the change from the previous data.

The following statement calculates the disk with tag_id of 1. The change value read by the disk from 21:00 to 21:00 on 2021-4-10. The positive number increases compared to the previous second, and the negative number decreases compared to the previous second:

ymatrix=# SELECT
    time,
    (
     CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
     ELSE round(read - lag(read) OVER (ORDER BY time))
     END
    ) AS read
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
    AND tag_id = 1
    ORDER BY time;

          time          | read
---------------------------------------------------
 2021-04-10 21:00:00+08 |
 2021-04-10 21:00:01+08 |  -52
 2021-04-10 21:00:02+08 |   62
 2021-04-10 21:00:03+08 |  -30
 2021-04-10 21:00:04+08 |   36
 2021-04-10 21:00:05+08 |  -65
 2021-04-10 21:00:06+08 |   11
 2021-04-10 21:00:07+08 |   41
 2021-04-10 21:00:08+08 |  -25
 2021-04-10 21:00:09+08 |  -34
......

1.5 Growth

Based on the increment, divide it by the time interval to get the growth rate (because the sample data is sampled once every 1 second, the results seen are the same as the increment):

ymatrix=# SELECT
    time,
    (
     CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
     ELSE round(read - lag(read) OVER (ORDER BY time))
     END
    ) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
    extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
    AND tag_id = 1
    ORDER BY time;

          time          | read_rate | time lag
----------------------------------------------------------------------------------------------------------------------------------
 2021-04-10 21:00:00+08 |           |
 2021-04-10 21:00:01+08 |       -52 |        1
 2021-04-10 21:00:02+08 |        62 |        1
 2021-04-10 21:00:03+08 |       -30 |        1
 2021-04-10 21:00:04+08 |        36 |        1
 2021-04-10 21:00:05+08 |       -65 |        1
 2021-04-10 21:00:06+08 |        11 |        1
 2021-04-10 21:00:07+08 |        41 |        1
 2021-04-10 21:00:08+08 |       -25 |        1
 2021-04-10 21:00:09+08 |       -34 |        1
......

1.6 Change points

Change points list records that have changed compared to the previous one. This type of query is suitable for finding changes in relatively stable data sets:

ymatrix=# SELECT time, read FROM (
        SELECT time,
            read,
            read - lag(read) OVER (ORDER BY TIME) AS diff
        FROM disk
        WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
        AND tag_id = 1 ) ht
    WHERE diff IS NULL OR diff != 0
    ORDER BY time;

          time          | read
-----------------------------------------------
 2021-04-10 21:00:00+08 | 81.07
 2021-04-10 21:00:01+08 | 29.56
 2021-04-10 21:00:02+08 | 91.49
 2021-04-10 21:00:03+08 | 61.62
 2021-04-10 21:00:04+08 | 97.86
 2021-04-10 21:00:05+08 | 32.89
 2021-04-10 21:00:06+08 | 43.81
 2021-04-10 21:00:07+08 | 85.05
 2021-04-10 21:00:08+08 |  59.8
 2021-04-10 21:00:09+08 | 25.86
(10 rows)

2. CTE

Common table expressions allow SQL statements to define temporary views to help simplify the structure of large query statements.

The following CTE calculates the maximum and minimum values ​​of the average read speed in all devices:

WITH avg_read (tag_id, avg_read) AS (
    SELECT tag_id, AVG(read) AS read FROM disk GROUP BY tag_id
) SELECT MAX(avg_read), MIN(avg_read) FROM avg_read;

3. Timing function

The timing component matrixts provided by MatrixDB also provides timing functions that are often used in timing scenarios. First, create this component:

stats=# CREATE EXTENSION matrixts;
CREATE EXTENSION

3.1 time_bucket

time_bucket can calculate the average value for a given time period

The following sql calculates the disk with tag_id of 1, and the average read and write speed every 5 minutes between 21 o'clock and 22 o'clock in 2021-4-10:

ymatrix=# SELECT time_bucket('5 minutes', time) AS five_min,
    AVG(read) as read,
    AVG(write) as write
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
    AND tag_id = 1
    GROUP BY five_min
    ORDER BY five_min;

        five_min        |        read         |        write
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 2021-04-10 21:00:00+08 | 48.614599999999996 |  49.48656666666666
 2021-04-10 21:05:00+08 |  50.73533333333335 | 49.992566666666654
 2021-04-10 21:10:00+08 |   51.6102333333333 |  49.99359999999999
 2021-04-10 21:15:00+08 |  49.29116666666669 |  53.89146666666666
 2021-04-10 21:20:00+08 |  49.67863333333332 |  50.47406666666665
 2021-04-10 21:25:00+08 |  51.09013333333332 | 47.766733333333335
 2021-04-10 21:30:00+08 |  49.55949999999999 | 50.440766666666654
 2021-04-10 21:35:00+08 |  48.86253333333333 |  50.57290000000001
 2021-04-10 21:40:00+08 | 51.061299999999974 | 47.028766666666684
 2021-04-10 21:45:00+08 |  52.10353333333333 | 49.861466666666665
 2021-04-10 21:50:00+08 | 51.780566666666694 |            51.4159
 2021-04-10 21:55:00+08 |  51.83549999999998 | 49.124366666666674
 2021-04-10 22:00:00+08 |              93.96 |              91.07
(13 rows)

3.2 time_bucket_gapfill

If there is data missing during the time period, in order to avoid error query results, you can use the time_bucket_gapfill function to fill in the missing data. There are two filling strategies:

  1. locf: Fill with the values ​​that appeared before in the aggregate group
  2. interpolate: Linear interpolation fills the missing values

For example:

SELECT time_bucket_gapfill('5 minutes', time) AS five_min,
    locf(AVG(read)) as locf_read,
    interpolate(AVG(read)) as interpolate_read
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
    AND tag_id = 1
    GROUP BY five_min
    ORDER BY five_min;

3.3 first/last

first returns the earliest value of time:

ymatrix=# SELECT tag_id,
    first(read, time) AS read,
    first(write, time) AS write
    FROM disk
    GROUP BY tag_id
    ORDER BY tag_id;

 tag_id | read  | write
-------+------+-------
      1 | 11.51 | 86.61
      2 | 50.07 |  25.9
      3 | 83.72 |  10.5
(3 rows)

last returns the latest value:

ymatrix=# SELECT tag_id,
    last(read, time) AS read,
    last(write, time) AS write
    FROM disk
    GROUP BY tag_id
    ORDER BY tag_id;

 tag_id | read  | write
-------+------+-------
      1 |  5.32 |  4.96
      2 |  5.73 | 34.73
      3 | 49.03 | 86.02
(3 rows)

3.4 last_not_null_value

last_not_null_value is equivalent to adding not null filtering on the last basis, returning the last non-null value:

ymatrix=# SELECT last_not_null_value(read, time)
    FROM disk WHERE tag_id = 1;
 last_not_null_value
----------------------------------------------------------------------------------------------------------------------------------
                 3.1
(1 row)

3.5 last_not_null

Compared with last_not_null and last_not_null_value, not only does the return value, but the time will also return. The returned type is a string, with the format '["value", "time"]':

ymatrix=# SELECT last_not_null(read, time)
    FROM disk WHERE tag_id = 1;
              last_not_null
----------------------------------------------------------------------------------------------------------------------------------
 ["3.1","2021-11-05 17:32:51.754457+08"]
(1 row)