Advanced Query

This document describes the functions and expressions that YMatrix provides for querying.

In addition to basic query statement, YMatrix also provides many advanced analytical functions and SQL expressions.


1 Window Function

Window functions can handle relatively complex report statistical analysis, and are mostly used in OLAP scenarios, timing scenarios, etc.

The "Window" defines a data set that performs aggregation query between rows related to the current row. The difference between window functions and other functions is the existence of the OVER clause. If a function has an OVER clause, then it is a window function. The OVER clause is used to define the scope of the window function.

You can understand the difference between aggregation window functions and dedicated window functions through the following table:

Aggregated window function Special window function
Output One line Multiple lines
Functions max(), min(), count(), sum(), avg(), sum(), etc. In addition to the commonly used aggregation functions on the left, there are also dedicated aggregation functions: row_number(), rank(), dense_rank(), first_value(), last_value(), etc.
Usage is usually used in combination with the GROUP BY clause, or can be used in combination with the OVER clause as a window function is used in combination with the OVER clause. Generally speaking, the OVER clause is written directly after the name and parameters of the window function. The OVER clause can usually use three certificates: PARTITION BY, ORDER BY, and ROWS BETWEEN. Only using the PARTITION BY clause or using the PARTITION BY and ORDER BY clause will form a static window, and the window size and position will not change; in addition to the PARTITION BY statement, the ROWS BETWEEN clause or contains both ROWS BETWEEN and ORDER BY clause, which will form a sliding window, that is, sliding a fixed-size window through each input line to get the final result.

1.1 Syntax

Window functions appear in the table expression of SELECT, just like the aggregation function:

=# SELECT 
    <function_name(parameters)> OVER (
        PARTITION BY …
        ORDER BY …
    )
   FROM …;

Where, the OVER clause is used to define the window scope:

  • The PARTITION BY list specifications that rows are divided into partitions, each partition can be treated as a window; if ignored, all rows are divided into one partition.
  • The ORDER BY clause describes the row sorting rules within a partition.

1.2 Special window function

The dedicated window functions supported by YMatrix are as follows:

Functions Description
row_number() Assign a unique sequential integer to each row within the partition of the result set, the first row starts at 1
rank() Assign a ranking value to each row in the partition of the result set. The same value will get the same ranking, and the parallel ranking will delay the subsequent rankings. Therefore, in the presence of the same value, the ranking is not a continuous integer
density_rank() Assign a ranking value to each row in the partition of the result set. The same value will get the same ranking, and even if the ranking is tied, it will not be posted. Therefore, in the presence of the same value, the ranking is also continuous integers
first_value() Get the first value in the ordered partition of the result set
last_value() Get the end value in the ordered partition of the result set
ntile() Divide the rows of the ordered partition into n groups (buckets)
cume_dist() Calculate the proportion of the value of the current row in all rows in the current partition that is greater than or equal to the current row data (including the current row itself)
percent_rank() Calculate the proportion of the value greater than the data of the current row in all rows in the current partition except the current row
lag(,) Used to locate rows that are offset forward by a certain number of rows
lead(,) Used to locate the row that is offset backward by a certain number of rows

1.2.1 Distribution statistics related functions

grammar:

=# SELECT …,
    ntile(<n>) OVER (ORDER BY …),
    cume_dist() OVER (ORDER BY …),
    percent_rank() OVER (ORDER BY …)  
FROM …;

Notes!
The denominator and numberrator of cume_dist() and percent_rank() are different: 1. The former denominator is the total number of rows of the current partition, and the latter is the total number of rows of the current partition -1. 2. The former number is the number of rows with a value greater than/equal to the number of rows with a current row data +1. (including the current row itself), and the latter number is the number of rows with a value greater than the current row data (excluding itself).

1.2.2 Offset calculation related functions

The lag() and lead() functions are used to locate rows whose current row is offset forward/backward by a certain number of rows. They are often used in scenarios where the current data and historical data are compared and analyzed.

grammar:

=# SELECT 
          lag(<forward shift column names>,<number of rows offset forward>) OVER (ORDER BY …) AS …,
          lead(<backward shift column name>,<number of rows offset backward>) OVER (ORDER BY …) AS …
    FROM …;

1.3 Aggregate window function

YMatrix provides but is not limited to the following aggregation functions:

Functions Description
count(/*) Number of valid values ​​​​​(only count(*) contains null values)
sum() cumulative sum
avg() Average value of a certain column
min() Minimum value of a column
max() Maximum value of a certain column

The combination of aggregation functions and the OVER clause are different from the GROUP BY semantics:

  • GROUP BY Now group the input lines, then gather each group and output a result for each group;
  • The OVER clause makes each input line correspond to a window, and the aggregation function is applied to each window, so each input line produces an aggregation result.

1.4 Window Frame/Sliding Window

Window frames can be used for dedicated window functions and aggregated window functions to define the window scope or implement sliding window calculations.

After it is applied to the ORDER BY clause, the complete syntax is as follows:

=# <aggregate function名(…)> OVER (… ORDER BY … ROWS|RANGE {
          UNBOUNDED PRECEDING
        | n PRECEDING
        | CURRENT ROW
        | BETWEEN <window frame constraints> AND <window frame constraints> })

The value range of window frame constraints is shown in the table below, where n can be a number or an expression that can calculate numbers:

Window frame constraints Description
UNBOUNDED PRECEDING First line of partition
n PRECEDING n lines before the current line
CURRENT ROW Current Line
n FOLLOWING n lines after the current line
UNBOUNDED FOLLOWING The last line of the partition

Notes!
For examples of window function application, see Typical Query in OLAP Scenario.


2 Ordered-Set Aggregate Functions

Ordered aggregation functions are a type of database aggregation functions that are used to aggregate an Ordered Set in database queries. Ordered aggregation functions are usually used to process data sets with sorting requirements and return the results after aggregating these sets.

| Functions | Description | Syntax | |--|--|--|--|--| | mode() | Return the most frequent input value (if there are multiple values ​​​​​with the same frequency, select the first one) | mode() WITHIN GROUP (ORDER BY…) | | percentage_cont() | Calculate continue percentage/multi-continuous percentage:
࠾ Continuous percentage: Returns a value corresponding to the specified score position in the sort. If there is no actual input value that meets the score position condition, linear interpolation is performed between adjacent input terms and the calculated value is returned.
࠾ Multiple-continuous percentage: Returns a result array matching the parameter shape of the score array, where each non-null element is replaced with the corresponding percentage value | Continuous percentage: percentile_cont(<score>) WITHIN GROUP (ORDER BY …)
Multiple-continuous percentage: percentile_cont(<score>) WITHIN GROUP (ORDER BY …) | | percentage_disc() | Calculate discrete percentage/multiple discrete percentage:
࠾ Discrete percentage: Returns the first actual input value in the sorting position equals or exceeds the specified score. Discrete percentage does not involve interpolation calculation, so the returned result must be the actual value in the original dataset
࠾ Multiple discrete percentage: Returns an array of results matching the parameter shape of the fraction array, where each non-null element is replaced with the input value of the corresponding percentage | Discrete percentage: percentile_disc(<score>) WITHIN GROUP (ORDER BY …)
Multiple discrete percentage: percentile_disc(<score array>) WITHIN GROUP (ORDER BY …) |

To ensure the correctness and consistency of the ordered aggregation function in use, please pay attention to the following instructions:

  • If there are null values ​​​​​(NULLs) in the input dataset, these null values ​​​​​ will be ignored when performing aggregation calculations and will not affect the aggregation result. However, an empty fraction value will produce an empty result.
  • When using this type of function, you need to provide a valid score value between 0 and 1 as a parameter. If the provided score value is not within this range, the function will throw an error.
  • If the fraction value you provide is null (NULL), the function will return an empty result. This means that the function cannot calculate the aggregated result because of the lack of valid fractional information.

Notes!
For an example of the orderly aggregation function application, see Continuous Percentage.


3 Timing function

Notes!
This part of the functions must be created to create the matrixts extension before they can be used.| Functions | Description | |--|--| | time_bucket('

3.1 time_bucket() function

This function can be used in conjunction with the aggregation function to calculate the aggregation value for a given time period.

| Function parameter name | Supported data types | Supported units | |--|--|--|--|--| | Time window size | int16; int32; int64; interval | us / ms / s / minute(min, minutes) / hour / day, e.g. 1 day | | Timestamp column | int16; int32; int64; date; timestamp; timestamptz | |

Syntax example:

=# SELECT time_bucket('<time window size>', <timestamp column>) AS …,
    <aggregate function> AS …
    FROM …
    WHERE …
    GROUP BY …
    ORDER BY …;

3.2 time_bucket_gapfill() function

This function is similar to the time_bucket() function, but provides additional functionality for missing value padding. When there is data missing during the time period and data cleaning is required, you can use this function to fill in the missing data, so that the data can be distributed smoothly in the timing at the specified time interval, so as to facilitate analysis.

| Function parameter name | Supported data types | Supported units | |--|--|--|--|--| | Time window size | int16; int32; int64; interval | us / ms / s / minute(min, minutes) / hour / day, e.g. 1 day | | Timestamp column | int16; int32; int64; date; timestamp; timestamptz | | | The column to be filled | The column's original data type | The column's original unit |

There are two filling strategies: locf (Last Observation Carried Forward) and interpolate:

  • locf: Fill with the non-missible actual value that appeared in the previous one in the aggregation group. This preserves the trend of the time series and uses the most recent observations to estimate the missing values, returning the actual values ​​​​​ that existed before the fill.
  • interpolate: Linear interpolation fills the missing values. Interpolation is to calculate the missing value by inferring the trend between known data points, and returns the calculated new value.

Syntax example:

  1. No missing value filling
    =# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
     <aggregate function> AS …
     FROM …
     WHERE …
     GROUP BY …
     ORDER BY …;
  2. Use the locf policy to fill in missing values
    =# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
     locf(<aggregate function>) AS …
     FROM …
     WHERE …
     GROUP BY …
     ORDER BY …;
  3. Use the interpolate strategy to fill in missing values
    =# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
     interpolate(<aggregate function>) AS …
     FROM …
     WHERE …
     GROUP BY …
     ORDER BY …;

    Notes!
    For examples of timing function application, see Spatial-Time-Specific Query.


4 Common Table Expression

A common table expression, namely, CTE, can be used to define and build a temporary view, thereby simplifying large query statements. It is implemented through the WITH keyword. Before using it, you can understand the difference between CTE and CREATE VIEW statements through the following table.

CTE CREATE VIEW
Expression is not an independent SQL statement, but a part of the statement, that is, an expression an independent SQL statement
Application scope The created temporary view is only used for the query statement to which it belongs The created view can be used for all query statements

grammar:

=# WITH <Table expression name> (<Column name>, …) AS (
    SELECT … FROM …
) SELECT … FROM …;

Notes!
For examples of timing function application, please see [Indicator Comparative Analysis] (/doc/latest/dataquery/olap_oltp_scene#duibi).


5 Nested Query

5.1 Where the subquery appears

We provide the following syntax examples:

  • Appears in the SELECT list, for example:

    =# SELECT (SELECT …) AS … FROM …;
  • Appears in the FROM list, for example:

    =# SELECT …
     FROM … AS …, (SELECT … FROM …) AS …
     WHERE …;

    Note! Subqueries that appear in the FROM list must specify an alias.

  • Appears in the WHERE expression, for example:

    =# SELECT … 
     FROM …
     WHERE … IN (SELECT … FROM … WHERE …);

    5.2 Subquery that returns a single value

When the subquery returns a single numeric value, the subquery expression can appear in the SELECT/FROM list or in the WHERE conditional expression. It can be used in conjunction with general comparison operators to perform comparison operations on subquery results, including =, <, <=, >, !=.

Suppose we have tables t1 (including column c1), t2 (including column c2):

=# SELECT (SELECT max(c1) FROM t1), t2.*
   FROM t2
   WHERE c2 > (SELECT avg(c2) FROM t2);

As you can see, aggregation subqueries that return single values ​​​​​appear in both the SELECT list and the WHERE expression.

When executing this query, first execute SELECT avg(c2) FROM t2 to calculate the average value of the c2 column in the t2 table. The external query then filters out rows that meet the criteria based on this average and other conditions c2 > (SELECT avg(c2) FROM t2). For each row that meets the criteria, the query returns all columns of the t2 table and the result of SELECT max(c1) FROM t1.

5.3 Returns the subquery of the collection

If a subquery can return multiple tuples, it can appear as a temporary table in the FROM list and can participate in the connection JOIN. Consistent with returning single value, it can also be used with general comparison operators.

Suppose we have tables ttt1 (including columns c1, c4), ttt2 (including columns c2, c3):

=# SELECT t1.c1, t2.c2
   FROM ttt1 AS t1
   INNER JOIN (SELECT c3, c2 FROM ttt2) AS t2
   ON t1.c4 = t2.c3;

When executing this query, the subquery SELECT c3, c2 FROM ttt2 is first executed, which selects columns c3 and c2 from the table ttt2 and returns the result table t2. The external query then associates the table ttt1 and the result table t2 with the internal join and returns the rows that meet the join criteria, where the columns t1.c1 and t2.c2 are selected.

Subqueries that return a collection can also be applied to the WHERE conditional expression via the following SQL syntax:

  • ALL: satisfied all tuples in a subquery collection.
  • ANY: satisfied at least one tuple in the subquery.
  • IN = ANY, the difference is that IN only takes out the result set of the subquery and cannot be compared, while ANY can add a comparison operator before it.
  • EXISTS: The subquery returns at least one tuple.

Suppose we have table t1:

=# CREATE TABLE t1 (
   c1 int,
   c2 text
   );
=# INSERT INTO t1 VALUES (1,'content');
   INSERT INTO t1 VALUES (2,'content');
   INSERT INTO t1 VALUES (3,'content');
   INSERT INTO t1 VALUES (4,'text');
   INSERT INTO t1 VALUES (5,'text');
  1. ALL syntax example
    =# SELECT * FROM t1
    WHERE c1 >= ALL (
     SELECT c1 FROM t1
     WHERE c2 = 'content'
     ORDER by c1
    );

    Since the subquery result is the dataset (1,2,3), the above SQL is actually equivalent to:

    =# SELECT * FROM t1
    WHERE c1 >= 1
    AND c1 >= 2
    AND c1 >= 3
    ORDER by c1;
    c1 |   c2
    ----+---------
    3 | content
    4 | text
    5 | text
    (3 rows)
  2. ANY syntax example
    =# SELECT * FROM t1
    WHERE c1 >= ANY (
     SELECT c1 FROM t1
     WHERE c2 = 'content'
    );

    Since the subquery result is the dataset (1,2,3), the above SQL is actually equivalent to:

    =# SELECT * FROM t1
    WHERE c1 >= 1
    OR c1 >= 2
    OR c1 >= 3;
  3. IN syntax example
    =# SELECT * FROM t1
    WHERE c1 IN (
     SELECT c1 FROM t1
     WHERE c2 = 'content'
    );

    Since the subquery result is the dataset (1,2,3), the above SQL is actually equivalent to:

    =# SELECT * FROM t1
    WHERE c1 = 1
    OR c1 = 2
    OR c1 = 3;
  4. EXISTS syntax example
    =# SELECT * FROM t1
    WHERE EXISTS (
     SELECT * FROM t1
     WHERE c1 < 60
     AND c2 = 'content'
    );

    Since the subquery result is not empty in the data set, the above SQL is actually equivalent to:

    =# SELECT * FROM t1
    WHERE true;

    If the dataset is empty, it is equivalent to:

    =# SELECT * FROM t1
    WHERE false;

    5.4 Related subqueries

The query conditions in the subquery depend on a certain value in the outer query. For each tuple of the outer query, it must be substituted into the subquery and repeatedly evaluated, so that the subquery is processed more than once.

Specific execution process:

  1. Take out a tuple from the outer query and pass the values ​​​​of the tuple-related columns to the inner query;
  2. Execute inner-level query to get the results of the subquery operation;
  3. The outer query substitutes the result set returned by the subquery into the WHERE conditional expression to obtain the result;
  4. Take out the next tuple from the outer query and repeat steps 1-3 until all the outer tuples are processed.

Suppose we have tables t1 (including column c1), t2 (including column c2):

=# SELECT * FROM t1
   WHERE NOT EXISTS (
    SELECT * FROM t2
    WHERE t2.c2 = t1.c1
   );

For each tuple of the t1 table in the outer query, the subquery will be entered to perform the inner query, and find out whether there are any results that meet the criteria in the t2 table, and the NOT EXISTS expression determines whether to output this t1 tuple to the final result.

Notes!
For examples of nested query applications, please see Adaptive Analysis.


6 Data Cubes

In the field of business intelligence (BI), it is necessary to see through a data set from different dimensions, and through data visualization, the knowledge contained in the data is presented to decision makers to realize their business value. A data cube is a SQL feature generated according to this need, which returns a multidimensional array covering multiple dimensions, thus demonstrating the intrinsic connection of the data.

![](https://img.ymatrix.cn/ymatrix_home/Data cube (screenshot)_1696736236.png)

The Data Cube provides a multidimensional data model based on facts and dimensions:

  • "Dimension" is the perspective of people observing the objective world and is a high-level type division. "Dimension" generally contains hierarchical relationships, which are sometimes quite complex. By defining multiple important attributes of an entity as dimensions, users can compare data on different dimensions. For example, the electronic products, first quarter of 2023, Beijing and other dimension labels in the picture above.
  • "Facts" are actual detailed data and objective activity records.

6.1 GROUPING SETS clause

That is, group sets. Since the grouping aggregation calculation occurs after the FROM and WHERE expressions are calculated, it will calculate the aggregate separately for each specified group, and then return the result together.

Suppose we have table t1 (including columns c1, c2, c3):

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   GROUP BY GROUPING SETS (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+----
 a  |    | 6
 b  |    | 15
 c  |    | 24
    | aa | 12
    | bb | 15
    | cc | 18
(6 rows)

It can be seen that the sum-aggregation calculation results of the c1 dimension and c2 dimension are displayed respectively in the return:

  • c1 Dimension: the sum of c3 values ​​​​of a, the sum of b, and the sum of c;
  • c2 Dimension: the sum of c3 values ​​​​of aa, the sum of bb, and the sum of cc;

In fact, executing GROUPING SETS once is equivalent to executing UNION ALL multiple times, the former is simpler and more standard.

6.2 ROLLUP clause

In addition to GROUPING SETS, YMatrix also provides other simplified methods to specify grouping sets of common types.

The aggregation results of groups according to the ROLLUP command will be rolled up layer by layer in the order of grouping keys.

In fact, ROLLUP ( c1, c2, … ) is equivalent to:

=# GROUPING SETS (
    ( c1, c2, … ),
    …
    ( c1 ),
    ( )
   )

The last empty grouping key is equivalent to performing aggregation calculations for all return values.

Suppose we have table t1 (including columns c1, c2, c3):

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   WHERE c1 = 'a' OR c1 = 'b'
   GROUP BY ROLLUP (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+----
 a  |  aa  | 1
 a  |  bb  | 2
 a  |  cc  | 3
 a  |      | 6
 b  |  aa  | 4
 b  |  bb  | 5
 b  |  cc  | 6
 b  |      | 15
    |      | 21
(9 rows)

You can see:

  • Results lines 1-3, 4-6 are the accumulated sum of the results of the grouping key (c1,c2) (in this example, the c3 value itself);
  • Results Lines 4 and 8 are the accumulated sums of the results of the grouping key (c1) (i.e. the total value of lines 1-3, 5-7 respectively);
  • The last line, line 9, is the accumulation of all return values.

The ROLLUP clause is often used to analyze historical data, such as total salary calculated by department, district and company, and the results can be visualized as multi-ring pie charts, etc.

6.3 CUBE

According to the CUBE command, the aggregation results are grouped according to the CUBE command, and the union and deduplication are respectively used to upload the volume and use each grouping key.

CUBE ( c1, c2 ) is equivalent to:

=# GROUPING SETS (
    ( c1, c2 ),
    ( c1     ),
    (     c2 ),
    (        )
   )

The last empty grouping key is equivalent to performing aggregation calculations for all return values.

Suppose we have table t1 (including columns c1, c2, c3):

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   WHERE c1 = 'a' OR c1 = 'b'
   GROUP BY CUBE (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+-----
 a  | aa |   1
 a  | bb |   2
 a  | cc |   3
 a  |    |   6
 b  | aa |   4
 b  | bb |   5
 b  | cc |   6
 b  |    |  15
    | aa |   5
    | bb |   7
    | cc |   9
    |    |  21
(12 rows)

As you can see, except for the same part as ROLLUP (lines 1-8 and last line):

  • Line 9-11 is the accumulated sum of the results of the grouping key (c2) (e.g. aa = 1+4, bb = 2+5).

The calculation results of the CUBE clause can be rendered with rich visual and interactive effects through BI tools, such as [Sankey Chart](https://baike.baidu.com/item/Sankey Chart/10773057?fr=ge_ala).

Notes!
For examples of data cube application, please see [Multidimensional Data Query] (/doc/latest/dataquery/olap_oltp_scene_scene#1.10%20 multidimensional data analysis).