In addition to basic join, aggregation, and grouping operations, YMatrix provides many advanced analytical features, such as window functions, common table expressions (CTEs), ordered-set aggregate functions, and commonly used time-series functions. This section introduces the advanced query methods commonly used in YMatrix.
The following disk-usage statistics case demonstrates how to perform advanced queries in YMatrix. The sample metric table disk records the read and write speeds of a disk and uses the MARS2 storage engine.
MARS2 tables depend on the matrixts extension. Before creating a table with this storage engine, you first need to create the extension in the database. The extension is created at the database level and does not need to be created repeatedly.
CREATE EXTENSION matrixts;
Create the metric table disk:
CREATE TABLE disk (
time timestamp with time zone,
tag_id int,
read float,
write float
)
USING mars2
DISTRIBUTED BY (tag_id);
After successfully creating a MARS2 table, you must also create an index of type mars2_btree before the table can be read and written normally.
CREATE INDEX idx_mars2 ON disk
USING mars2_btree(tag_id);
For more information about MARS2, see Storage Engine.
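To run the queries that follow, the disk table must first contain data. Below is a minimal sketch of inserting a few rows by hand (the result sets in this section come from a larger generated data set, so these illustrative rows will not reproduce them exactly):

-- Illustrative sample rows only; not the full data set behind the results shown later
INSERT INTO disk (time, tag_id, read, write) VALUES
    ('2021-04-10 21:00:00+08', 1, 81.07, 73.30),
    ('2021-04-10 21:00:01+08', 1, 29.56, 48.47),
    ('2021-04-10 21:00:02+08', 1, 91.49, 79.59);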
Window functions are mainly used for relatively complex report and statistical analysis scenarios. The "window" defines the set of rows related to the current row over which the function performs its calculation. What distinguishes window functions from other SQL functions is the OVER clause: if a function has an OVER clause, it is a window function. The following table summarizes the differences between window functions and ordinary aggregate functions.
| | Normal Aggregate Function | Window Function |
|---|---|---|
| Output | One record | Multiple records |
| Functions | max(), min(), count(), sum(), etc. | avg(), sum(), rank(), etc. |
| Usage | Usually used in combination with the GROUP BY clause | Used in combination with the OVER clause, which is written directly after the window function's name and arguments. Within the OVER clause you can use three sub-clauses: PARTITION BY, ORDER BY, and ROWS BETWEEN. Using only PARTITION BY forms a static window whose size and position do not change; adding ORDER BY and/or ROWS BETWEEN forms a sliding window whose size and position change from row to row, as illustrated in the sketch after this table. |
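As a sketch of the difference against the example disk table: the first query below uses only PARTITION BY, so every row of a device sees the same static per-device average; the second adds ORDER BY and ROWS BETWEEN, so the average is recomputed over a window that slides with each row.

-- Static window: one fixed average per device
SELECT time, tag_id,
       avg(read) OVER (PARTITION BY tag_id) AS device_avg_read
FROM disk;

-- Sliding window: average over the previous 2 rows plus the current row
SELECT time, tag_id,
       avg(read) OVER (PARTITION BY tag_id ORDER BY time
                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg_read
FROM disk;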
You can practice commonly used window functions with the following cases.
A cumulative sum can be calculated by nesting sum(): the inner sum() is an ordinary aggregate computed per GROUP BY group, and the outer sum(...) OVER (ORDER BY time) accumulates those per-group sums.
The following SQL calculates the cumulative read and write sums for the disk with tag_id 1 between 2021-04-10 21:00:00 and 2021-04-10 21:00:10:
ymatrix=# SELECT time,
sum(sum(read)) OVER (ORDER BY time) AS read,
sum(sum(write)) OVER (ORDER BY time) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:00:10'::timestamp
AND tag_id = 1
GROUP BY time
ORDER BY time;
time | read | write
------------------------+--------+--------
2021-04-10 21:00:00+08 | 81.07 | 73.3
2021-04-10 21:00:01+08 | 110.63 | 121.77
2021-04-10 21:00:02+08 | 202.12 | 201.36
2021-04-10 21:00:03+08 | 263.74 | 257.88
2021-04-10 21:00:04+08 | 361.6 | 299.3
2021-04-10 21:00:05+08 | 394.49 | 327.33000000000004
2021-04-10 21:00:06+08 | 438.3 | 334.98
2021-04-10 21:00:07+08 | 523.35 | 431.39
2021-04-10 21:00:08+08 | 583.15 | 461.84
2021-04-10 21:00:09+08 | 609.01 | 533.03
2021-04-10 21:00:10+08 | 669.52 | 535.9
(11 rows)
A moving average calculates the average of the current record and the previous n records.
The following SQL calculates a 10-second moving average of read and write speeds from 2021-04-10 21:00:00 to 21:01:00 (for the first 9 records, the average is taken over however many preceding rows are available):
ymatrix=# SELECT time,
round(AVG(read) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS read,
round(AVG(write) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time DESC;
time | read | write
------------------------+------+-------
2021-04-10 21:01:00+08 | 57 | 57
2021-04-10 21:00:59+08 | 49 | 60
2021-04-10 21:00:58+08 | 52 | 56
2021-04-10 21:00:57+08 | 51 | 57
2021-04-10 21:00:56+08 | 53 | 65
2021-04-10 21:00:55+08 | 48 | 64
2021-04-10 21:00:54+08 | 49 | 64
2021-04-10 21:00:53+08 | 47 | 54
2021-04-10 21:00:52+08 | 44 | 54
2021-04-10 21:00:51+08 | 41 | 56
......
Increments are typically used to calculate the increase or decrease of a monotonic sequence, or simply the change relative to the previous record.
The following statement calculates the change in read speed for the disk with tag_id 1 from 2021-04-10 21:00:00 to 21:01:00; a positive value means an increase over the previous second, a negative value a decrease:
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) AS read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read
------------------------+------
2021-04-10 21:00:00+08 |
2021-04-10 21:00:01+08 | -52
2021-04-10 21:00:02+08 | 62
2021-04-10 21:00:03+08 | -30
2021-04-10 21:00:04+08 | 36
2021-04-10 21:00:05+08 | -65
2021-04-10 21:00:06+08 | 11
2021-04-10 21:00:07+08 | 41
2021-04-10 21:00:08+08 | -25
2021-04-10 21:00:09+08 | -34
......
Dividing the increment by the time interval gives the rate of change (because the sample data is collected once per second, the results are the same as the increments):
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read_rate | time lag
------------------------+-----------+----------
2021-04-10 21:00:00+08 | |
2021-04-10 21:00:01+08 | -52 | 1
2021-04-10 21:00:02+08 | 62 | 1
2021-04-10 21:00:03+08 | -30 | 1
2021-04-10 21:00:04+08 | 36 | 1
2021-04-10 21:00:05+08 | -65 | 1
2021-04-10 21:00:06+08 | 11 | 1
2021-04-10 21:00:07+08 | 41 | 1
2021-04-10 21:00:08+08 | -25 | 1
2021-04-10 21:00:09+08 | -34 | 1
......
A change-point query lists records whose value differs from the previous record. This type of query is suitable for finding changes in relatively stable data sets:
ymatrix=# SELECT time, read FROM (
SELECT time,
read,
read - lag(read) OVER (ORDER BY TIME) AS diff
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1 ) ht
WHERE diff IS NULL OR diff != 0
ORDER BY time;
time | read
------------------------+-------
2021-04-10 21:00:00+08 | 81.07
2021-04-10 21:00:01+08 | 29.56
2021-04-10 21:00:02+08 | 91.49
2021-04-10 21:00:03+08 | 61.62
2021-04-10 21:00:04+08 | 97.86
2021-04-10 21:00:05+08 | 32.89
2021-04-10 21:00:06+08 | 43.81
2021-04-10 21:00:07+08 | 85.05
2021-04-10 21:00:08+08 | 59.8
2021-04-10 21:00:09+08 | 25.86
(10 rows)
You can use the following ordered-set aggregate function to calculate a continuous percentile.
percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
An example and its result are as follows.
ymatrix=# SELECT tag_id,
percentile_cont(0.2) WITHIN GROUP
(ORDER BY read) AS read,
percentile_cont(0.3) WITHIN GROUP (ORDER BY write) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 19.87 | 29.86
2 | 19.95 | 29.88
3 | 20.06 | 29.93
(3 rows)
The parameter represents the percentile; passing 0.5 is equivalent to calculating the median.
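For example, the following query (a minimal sketch against the example disk table) computes the per-device median read speed by passing 0.5:

-- Median read speed per device
SELECT tag_id,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY read) AS median_read
FROM disk
GROUP BY tag_id
ORDER BY tag_id;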
A common table expression (CTE) can be used to define a temporary view within a query, thereby simplifying large query statements. It is implemented with the WITH keyword. Before using it, you can compare CTEs with the CREATE VIEW statement in the following table.
| | CTE | CREATE VIEW |
|---|---|---|
| Expression | Not an independent SQL statement, but an expression that forms part of another statement | An independent SQL statement |
| Application scope | The temporary view it creates can be used only by the query to which it belongs | The view it creates can be used by all query statements |
The following CTE calculates the maximum and minimum of the per-device average read speeds:
WITH avg_read (tag_id, avg_read) AS (
SELECT tag_id, AVG(read) AS read FROM disk GROUP BY tag_id
) SELECT MAX(avg_read), MIN(avg_read) FROM avg_read;
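For comparison, the same result could be obtained with a view; the difference is that the view is an independent, persistent object that other queries can also use. A sketch is shown below (the view name avg_read_view is illustrative):

-- A persistent view, reusable by other queries
CREATE VIEW avg_read_view AS
    SELECT tag_id, AVG(read) AS avg_read FROM disk GROUP BY tag_id;

SELECT MAX(avg_read), MIN(avg_read) FROM avg_read_view;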
The matrixts extension provided by YMatrix also includes functions that are frequently used in time-series scenarios. Create the extension first (if you already created it when building the MARS2 table above, there is no need to create it again):
ymatrix=# CREATE EXTENSION matrixts;
time_bucket assigns each timestamp to a fixed-size time bucket, making it easy to calculate aggregates (such as an average) for each time period.
| Parameter name | Description | Data type | Default |
|---|---|---|---|
| period | Time window size | int16; int32; int64; interval | None |
| timestamp | Column to be converted | int16; int32; int64; date; timestamp; timestamptz | None |
The following SQL calculates the average read and write speeds for every 5 minutes between 2021-04-10 21:00:00 and 22:00:00:
ymatrix=# SELECT time_bucket('5 minutes', time) AS five_min,
AVG(read) as read,
AVG(write) as write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
five_min | read | write
------------------------+--------------------+--------------------
2021-04-10 21:00:00+08 | 48.614599999999996 | 49.48656666666666
2021-04-10 21:05:00+08 | 50.73533333333335 | 49.992566666666654
2021-04-10 21:10:00+08 | 51.6102333333333 | 49.99359999999999
2021-04-10 21:15:00+08 | 49.29116666666669 | 53.89146666666666
2021-04-10 21:20:00+08 | 49.67863333333332 | 50.47406666666665
2021-04-10 21:25:00+08 | 51.09013333333332 | 47.766733333333335
2021-04-10 21:30:00+08 | 49.55949999999999 | 50.440766666666654
2021-04-10 21:35:00+08 | 48.86253333333333 | 50.57290000000001
2021-04-10 21:40:00+08 | 51.061299999999974 | 47.028766666666684
2021-04-10 21:45:00+08 | 52.10353333333333 | 49.861466666666665
2021-04-10 21:50:00+08 | 51.780566666666694 | 51.4159
2021-04-10 21:55:00+08 | 51.83549999999998 | 49.124366666666674
2021-04-10 22:00:00+08 | 93.96 | 91.07
(13 rows)
When data is missing within a time period and needs to be cleaned, you can use the time_bucket_gapfill function to fill in the missing values so that the data is evenly distributed in time at the specified interval, which makes analysis easier. There are two filling strategies: locf (last observation carried forward) and interpolate (linear interpolation).
Assume that the data in the example disk table is as follows (this assumption applies only to this subsection):
ymatrix=# SELECT * FROM disk ORDER BY tag_id;
time | tag_id | read | write
------------------------+--------+------+-------
2021-04-10 21:00:00+08 | 1 | 3.4 | 4.6
2021-04-10 21:50:00+08 | 1 | 4 | 2.7
2021-04-10 21:40:00+08 | 1 | 8.4 | 12
2021-04-10 21:20:00+08 | 1 | 2.9 | 6
2021-04-10 21:30:00+08 | 1 | 9 | 10.2
2021-04-10 21:10:00+08 | 1 | 5.2 | 6.6
2021-04-10 22:00:00+08 | 1 | 10 | 7
(7 rows)
Use the time_bucket_gapfill function to fill the data at 5-minute intervals between 21:00:00 and 22:00:00:
SELECT time_bucket_gapfill('5 minutes', time) AS five_min,
locf(AVG(read)) as locf_read,
interpolate(AVG(read)) as interpolate_read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
five_min | locf_read | interpolate_read
------------------------+-----------+------------------
2021-04-10 21:00:00+08 | 3.4 | 3.4
2021-04-10 21:05:00+08 | 3.4 | 4.3
2021-04-10 21:10:00+08 | 5.2 | 5.2
2021-04-10 21:15:00+08 | 5.2 | 4.05
2021-04-10 21:20:00+08 | 2.9 | 2.9
2021-04-10 21:25:00+08 | 2.9 | 5.95
2021-04-10 21:30:00+08 | 9 | 9
2021-04-10 21:35:00+08 | 9 | 8.7
2021-04-10 21:40:00+08 | 8.4 | 8.4
2021-04-10 21:45:00+08 | 8.4 | 6.2
2021-04-10 21:50:00+08 | 4 | 4
2021-04-10 21:55:00+08 | 4 | 7
2021-04-10 22:00:00+08 | 10 | 10
(13 rows)
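Taking the 21:05:00 bucket as an example: it contains no raw data, so locf carries forward the last observed value, 3.4, while interpolate takes the value halfway between the 21:00:00 bucket (3.4) and the 21:10:00 bucket (5.2), that is, (3.4 + 5.2) / 2 = 4.3.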
first returns the value at the earliest time:
ymatrix=# SELECT tag_id,
first(read, time) AS read,
first(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 11.51 | 86.61
2 | 50.07 | 25.9
3 | 83.72 | 10.5
(3 rows)
last returns the value at the latest time:
ymatrix=# SELECT tag_id,
last(read, time) AS read,
last(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
-------+------+-------
1 | 5.32 | 4.96
2 | 5.73 | 34.73
3 | 49.03 | 86.02
(3 rows)
last_not_null_value is equivalent to last with an additional NOT NULL filter; it returns the most recent non-null value:
ymatrix=# SELECT last_not_null_value(read, time)
FROM disk WHERE tag_id = 1;
last_not_null_value
---------------------
3.1
(1 row)
Compared with last_not_null_value, last_not_null returns not only the value but also its time. The return type is a string in the format '["value", "time"]':
ymatrix=# SELECT last_not_null(read, time)
FROM disk WHERE tag_id = 1;
last_not_null
------------------------------------------
["3.1","2021-11-05 17:32:51.754457+08"]
(1 row)