This document describes the functions and expressions that YMatrix provides for querying.
In addition to basic query statement, YMatrix also provides many advanced analytical functions and SQL expressions.
Window functions can handle relatively complex report statistical analysis, and are mostly used in OLAP scenarios, timing scenarios, etc.
The "Window" defines a data set that performs aggregation query between rows related to the current row. The difference between window functions and other functions is the existence of the OVER
clause. If a function has an OVER clause, then it is a window function. The OVER clause is used to define the scope of the window function.
You can understand the difference between aggregation window functions and dedicated window functions through the following table:
Aggregated window function | Special window function | |
---|---|---|
Output | One line | Multiple lines |
Functions | max() , min() , count() , sum() , avg() , sum() , etc. |
In addition to the commonly used aggregation functions on the left, there are also dedicated aggregation functions: row_number() , rank() , dense_rank() , first_value() , last_value() , etc. |
Usage | is usually used in combination with the GROUP BY clause, or can be used in combination with the OVER clause as a window function |
is used in combination with the OVER clause. Generally speaking, the OVER clause is written directly after the name and parameters of the window function. The OVER clause can usually use three certificates: PARTITION BY , ORDER BY , and ROWS BETWEEN . Only using the PARTITION BY clause or using the PARTITION BY and ORDER BY clause will form a static window, and the window size and position will not change; in addition to the PARTITION BY statement, the ROWS BETWEEN clause or contains both ROWS BETWEEN and ORDER BY clause, which will form a sliding window, that is, sliding a fixed-size window through each input line to get the final result. |
Window functions appear in the table expression of SELECT
, just like the aggregation function:
=# SELECT
<function_name(parameters)> OVER (
PARTITION BY …
ORDER BY …
)
FROM …;
Where, the OVER
clause is used to define the window scope:
PARTITION BY
list specifications that rows are divided into partitions, each partition can be treated as a window; if ignored, all rows are divided into one partition.ORDER BY
clause describes the row sorting rules within a partition.The dedicated window functions supported by YMatrix are as follows:
Functions | Description |
---|---|
row_number() | Assign a unique sequential integer to each row within the partition of the result set, the first row starts at 1 |
rank() | Assign a ranking value to each row in the partition of the result set. The same value will get the same ranking, and the parallel ranking will delay the subsequent rankings. Therefore, in the presence of the same value, the ranking is not a continuous integer |
density_rank() | Assign a ranking value to each row in the partition of the result set. The same value will get the same ranking, and even if the ranking is tied, it will not be posted. Therefore, in the presence of the same value, the ranking is also continuous integers |
first_value( |
Get the first value in the ordered partition of the result set |
last_value( |
Get the end value in the ordered partition of the result set |
ntile( |
Divide the rows of the ordered partition into n groups (buckets) |
cume_dist() | Calculate the proportion of the value of the current row in all rows in the current partition that is greater than or equal to the current row data (including the current row itself) |
percent_rank() | Calculate the proportion of the value greater than the data of the current row in all rows in the current partition except the current row |
lag( |
Used to locate rows that are offset forward by a certain number of rows |
lead( |
Used to locate the row that is offset backward by a certain number of rows |
grammar:
=# SELECT …,
ntile(<n>) OVER (ORDER BY …),
cume_dist() OVER (ORDER BY …),
percent_rank() OVER (ORDER BY …)
FROM …;
Notes!
The denominator and numberrator of cume_dist() and percent_rank() are different: 1. The former denominator is the total number of rows of the current partition, and the latter is the total number of rows of the current partition -1. 2. The former number is the number of rows with a value greater than/equal to the number of rows with a current row data +1. (including the current row itself), and the latter number is the number of rows with a value greater than the current row data (excluding itself).
The lag()
and lead()
functions are used to locate rows whose current row is offset forward/backward by a certain number of rows. They are often used in scenarios where the current data and historical data are compared and analyzed.
grammar:
=# SELECT
lag(<forward shift column names>,<number of rows offset forward>) OVER (ORDER BY …) AS …,
lead(<backward shift column name>,<number of rows offset backward>) OVER (ORDER BY …) AS …
FROM …;
YMatrix provides but is not limited to the following aggregation functions:
Functions | Description |
---|---|
count( |
Number of valid values (only count(*) contains null values) |
sum( |
cumulative sum |
avg( |
Average value of a certain column |
min( |
Minimum value of a column |
max( |
Maximum value of a certain column |
The combination of aggregation functions and the OVER
clause are different from the GROUP BY
semantics:
GROUP BY
Now group the input lines, then gather each group and output a result for each group;OVER
clause makes each input line correspond to a window, and the aggregation function is applied to each window, so each input line produces an aggregation result.Window frames can be used for dedicated window functions and aggregated window functions to define the window scope or implement sliding window calculations.
After it is applied to the ORDER BY
clause, the complete syntax is as follows:
=# <aggregate function名(…)> OVER (… ORDER BY … ROWS|RANGE {
UNBOUNDED PRECEDING
| n PRECEDING
| CURRENT ROW
| BETWEEN <window frame constraints> AND <window frame constraints> })
The value range of window frame constraints is shown in the table below, where n
can be a number or an expression that can calculate numbers:
Window frame constraints | Description |
---|---|
UNBOUNDED PRECEDING | First line of partition |
n PRECEDING | n lines before the current line |
CURRENT ROW | Current Line |
n FOLLOWING | n lines after the current line |
UNBOUNDED FOLLOWING | The last line of the partition |
Notes!
For examples of window function application, see Typical Query in OLAP Scenario.
Ordered aggregation functions are a type of database aggregation functions that are used to aggregate an Ordered Set in database queries. Ordered aggregation functions are usually used to process data sets with sorting requirements and return the results after aggregating these sets.
| Functions | Description | Syntax |
|--|--|--|--|--|
| mode() | Return the most frequent input value (if there are multiple values with the same frequency, select the first one) | mode() WITHIN GROUP (ORDER BY…)
|
| percentage_cont(
࠾ Continuous percentage: Returns a value corresponding to the specified score position in the sort. If there is no actual input value that meets the score position condition, linear interpolation is performed between adjacent input terms and the calculated value is returned.
࠾ Multiple-continuous percentage: Returns a result array matching the parameter shape of the score array, where each non-null element is replaced with the corresponding percentage value | Continuous percentage: percentile_cont(<score>) WITHIN GROUP (ORDER BY …)
Multiple-continuous percentage: percentile_cont(<score>) WITHIN GROUP (ORDER BY …)
|
| percentage_disc(
࠾ Discrete percentage: Returns the first actual input value in the sorting position equals or exceeds the specified score. Discrete percentage does not involve interpolation calculation, so the returned result must be the actual value in the original dataset
࠾ Multiple discrete percentage: Returns an array of results matching the parameter shape of the fraction array, where each non-null element is replaced with the input value of the corresponding percentage | Discrete percentage: percentile_disc(<score>) WITHIN GROUP (ORDER BY …)
Multiple discrete percentage: percentile_disc(<score array>) WITHIN GROUP (ORDER BY …)
|
To ensure the correctness and consistency of the ordered aggregation function in use, please pay attention to the following instructions:
0
and 1
as a parameter. If the provided score value is not within this range, the function will throw an error.Notes!
For an example of the orderly aggregation function application, see Continuous Percentage.
Notes!
This part of the functions must be created to create thematrixts
extension before they can be used.| Functions | Description | |--|--| | time_bucket('
This function can be used in conjunction with the aggregation function to calculate the aggregation value for a given time period.
| Function parameter name | Supported data types | Supported units |
|--|--|--|--|--|
| Time window size | int16
; int32
; int64
; interval | us
/ ms
/ s
/ minute
(min
, minutes
) / hour
/ day
, e.g. 1 day
|
| Timestamp column | int16
; int32
; int64
; date
; timestamp
; timestamptz
| |
Syntax example:
=# SELECT time_bucket('<time window size>', <timestamp column>) AS …,
<aggregate function> AS …
FROM …
WHERE …
GROUP BY …
ORDER BY …;
This function is similar to the time_bucket()
function, but provides additional functionality for missing value padding. When there is data missing during the time period and data cleaning is required, you can use this function to fill in the missing data, so that the data can be distributed smoothly in the timing at the specified time interval, so as to facilitate analysis.
| Function parameter name | Supported data types | Supported units |
|--|--|--|--|--|
| Time window size | int16
; int32
; int64
; interval | us
/ ms
/ s
/ minute
(min
, minutes
) / hour
/ day
, e.g. 1 day
|
| Timestamp column | int16
; int32
; int64
; date
; timestamp
; timestamptz
| |
| The column to be filled | The column's original data type | The column's original unit |
There are two filling strategies: locf
(Last Observation Carried Forward) and interpolate
:
locf
: Fill with the non-missible actual value that appeared in the previous one in the aggregation group. This preserves the trend of the time series and uses the most recent observations to estimate the missing values, returning the actual values that existed before the fill.interpolate
: Linear interpolation fills the missing values. Interpolation is to calculate the missing value by inferring the trend between known data points, and returns the calculated new value.Syntax example:
=# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
<aggregate function> AS …
FROM …
WHERE …
GROUP BY …
ORDER BY …;
locf
policy to fill in missing values=# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
locf(<aggregate function>) AS …
FROM …
WHERE …
GROUP BY …
ORDER BY …;
interpolate
strategy to fill in missing values=# SELECT time_bucket_gapfill('<time window size>', <timestamp column>) AS …,
interpolate(<aggregate function>) AS …
FROM …
WHERE …
GROUP BY …
ORDER BY …;
Notes!
For examples of timing function application, see Spatial-Time-Specific Query.
A common table expression, namely, CTE, can be used to define and build a temporary view, thereby simplifying large query statements. It is implemented through the WITH
keyword. Before using it, you can understand the difference between CTE and CREATE VIEW
statements through the following table.
CTE | CREATE VIEW | |
---|---|---|
Expression | is not an independent SQL statement, but a part of the statement, that is, an expression | an independent SQL statement |
Application scope | The created temporary view is only used for the query statement to which it belongs | The created view can be used for all query statements |
grammar:
=# WITH <Table expression name> (<Column name>, …) AS (
SELECT … FROM …
) SELECT … FROM …;
Notes!
For examples of timing function application, please see [Indicator Comparative Analysis] (/doc/latest/dataquery/olap_oltp_scene#duibi).
We provide the following syntax examples:
Appears in the SELECT
list, for example:
=# SELECT (SELECT …) AS … FROM …;
Appears in the FROM
list, for example:
=# SELECT …
FROM … AS …, (SELECT … FROM …) AS …
WHERE …;
Note! Subqueries that appear in the
FROM
list must specify an alias.
Appears in the WHERE
expression, for example:
=# SELECT …
FROM …
WHERE … IN (SELECT … FROM … WHERE …);
When the subquery returns a single numeric value, the subquery expression can appear in the SELECT/FROM
list or in the WHERE
conditional expression. It can be used in conjunction with general comparison operators to perform comparison operations on subquery results, including =
, <
, <=
, >
, !=
.
Suppose we have tables t1
(including column c1
), t2
(including column c2
):
=# SELECT (SELECT max(c1) FROM t1), t2.*
FROM t2
WHERE c2 > (SELECT avg(c2) FROM t2);
As you can see, aggregation subqueries that return single values appear in both the SELECT
list and the WHERE
expression.
When executing this query, first execute SELECT avg(c2) FROM t2
to calculate the average value of the c2
column in the t2
table. The external query then filters out rows that meet the criteria based on this average and other conditions c2 > (SELECT avg(c2) FROM t2)
. For each row that meets the criteria, the query returns all columns of the t2
table and the result of SELECT max(c1) FROM t1
.
If a subquery can return multiple tuples, it can appear as a temporary table in the FROM
list and can participate in the connection JOIN
. Consistent with returning single value, it can also be used with general comparison operators.
Suppose we have tables ttt1
(including columns c1
, c4
), ttt2
(including columns c2
, c3
):
=# SELECT t1.c1, t2.c2
FROM ttt1 AS t1
INNER JOIN (SELECT c3, c2 FROM ttt2) AS t2
ON t1.c4 = t2.c3;
When executing this query, the subquery SELECT c3, c2 FROM ttt2
is first executed, which selects columns c3
and c2
from the table ttt2
and returns the result table t2
. The external query then associates the table ttt1
and the result table t2
with the internal join and returns the rows that meet the join criteria, where the columns t1.c1
and t2.c2
are selected.
Subqueries that return a collection can also be applied to the WHERE
conditional expression via the following SQL syntax:
Suppose we have table t1
:
=# CREATE TABLE t1 (
c1 int,
c2 text
);
=# INSERT INTO t1 VALUES (1,'content');
INSERT INTO t1 VALUES (2,'content');
INSERT INTO t1 VALUES (3,'content');
INSERT INTO t1 VALUES (4,'text');
INSERT INTO t1 VALUES (5,'text');
ALL
syntax example=# SELECT * FROM t1
WHERE c1 >= ALL (
SELECT c1 FROM t1
WHERE c2 = 'content'
ORDER by c1
);
Since the subquery result is the dataset (1,2,3)
, the above SQL is actually equivalent to:
=# SELECT * FROM t1
WHERE c1 >= 1
AND c1 >= 2
AND c1 >= 3
ORDER by c1;
c1 | c2
----+---------
3 | content
4 | text
5 | text
(3 rows)
ANY
syntax example=# SELECT * FROM t1
WHERE c1 >= ANY (
SELECT c1 FROM t1
WHERE c2 = 'content'
);
Since the subquery result is the dataset (1,2,3)
, the above SQL is actually equivalent to:
=# SELECT * FROM t1
WHERE c1 >= 1
OR c1 >= 2
OR c1 >= 3;
IN
syntax example=# SELECT * FROM t1
WHERE c1 IN (
SELECT c1 FROM t1
WHERE c2 = 'content'
);
Since the subquery result is the dataset (1,2,3)
, the above SQL is actually equivalent to:
=# SELECT * FROM t1
WHERE c1 = 1
OR c1 = 2
OR c1 = 3;
EXISTS
syntax example=# SELECT * FROM t1
WHERE EXISTS (
SELECT * FROM t1
WHERE c1 < 60
AND c2 = 'content'
);
Since the subquery result is not empty in the data set, the above SQL is actually equivalent to:
=# SELECT * FROM t1
WHERE true;
If the dataset is empty, it is equivalent to:
=# SELECT * FROM t1
WHERE false;
The query conditions in the subquery depend on a certain value in the outer query. For each tuple of the outer query, it must be substituted into the subquery and repeatedly evaluated, so that the subquery is processed more than once.
Specific execution process:
WHERE
conditional expression to obtain the result;Suppose we have tables t1
(including column c1
), t2
(including column c2
):
=# SELECT * FROM t1
WHERE NOT EXISTS (
SELECT * FROM t2
WHERE t2.c2 = t1.c1
);
For each tuple of the t1
table in the outer query, the subquery will be entered to perform the inner query, and find out whether there are any results that meet the criteria in the t2
table, and the NOT EXISTS
expression determines whether to output this t1
tuple to the final result.
Notes!
For examples of nested query applications, please see Adaptive Analysis.
In the field of business intelligence (BI), it is necessary to see through a data set from different dimensions, and through data visualization, the knowledge contained in the data is presented to decision makers to realize their business value. A data cube is a SQL feature generated according to this need, which returns a multidimensional array covering multiple dimensions, thus demonstrating the intrinsic connection of the data.
_1696736236.png)
The Data Cube provides a multidimensional data model based on facts and dimensions:
That is, group sets. Since the grouping aggregation calculation occurs after the FROM
and WHERE
expressions are calculated, it will calculate the aggregate separately for each specified group, and then return the result together.
Suppose we have table t1
(including columns c1
, c2
, c3
):
=# SELECT c1, c2, sum(c3) AS sum
FROM t1
GROUP BY GROUPING SETS (c1, c2)
ORDER BY (c1,c2);
c1 | c2 | sum
----+----+----
a | | 6
b | | 15
c | | 24
| aa | 12
| bb | 15
| cc | 18
(6 rows)
It can be seen that the sum-aggregation calculation results of the c1
dimension and c2
dimension are displayed respectively in the return:
c1
Dimension: the sum of c3
values of a
, the sum of b
, and the sum of c
;c2
Dimension: the sum of c3
values of aa
, the sum of bb
, and the sum of cc
;In fact, executing GROUPING SETS
once is equivalent to executing UNION ALL
multiple times, the former is simpler and more standard.
In addition to GROUPING SETS
, YMatrix also provides other simplified methods to specify grouping sets of common types.
The aggregation results of groups according to the ROLLUP
command will be rolled up layer by layer in the order of grouping keys.
In fact, ROLLUP ( c1, c2, … )
is equivalent to:
=# GROUPING SETS (
( c1, c2, … ),
…
( c1 ),
( )
)
The last empty grouping key is equivalent to performing aggregation calculations for all return values.
Suppose we have table t1
(including columns c1
, c2
, c3
):
=# SELECT c1, c2, sum(c3) AS sum
FROM t1
WHERE c1 = 'a' OR c1 = 'b'
GROUP BY ROLLUP (c1, c2)
ORDER BY (c1,c2);
c1 | c2 | sum
----+----+----
a | aa | 1
a | bb | 2
a | cc | 3
a | | 6
b | aa | 4
b | bb | 5
b | cc | 6
b | | 15
| | 21
(9 rows)
You can see:
(c1,c2)
(in this example, the c3
value itself);(c1)
(i.e. the total value of lines 1-3, 5-7 respectively);The ROLLUP
clause is often used to analyze historical data, such as total salary calculated by department, district and company, and the results can be visualized as multi-ring pie charts, etc.
According to the CUBE
command, the aggregation results are grouped according to the CUBE
command, and the union and deduplication are respectively used to upload the volume and use each grouping key.
CUBE ( c1, c2 )
is equivalent to:
=# GROUPING SETS (
( c1, c2 ),
( c1 ),
( c2 ),
( )
)
The last empty grouping key is equivalent to performing aggregation calculations for all return values.
Suppose we have table t1
(including columns c1
, c2
, c3
):
=# SELECT c1, c2, sum(c3) AS sum
FROM t1
WHERE c1 = 'a' OR c1 = 'b'
GROUP BY CUBE (c1, c2)
ORDER BY (c1,c2);
c1 | c2 | sum
----+----+-----
a | aa | 1
a | bb | 2
a | cc | 3
a | | 6
b | aa | 4
b | bb | 5
b | cc | 6
b | | 15
| aa | 5
| bb | 7
| cc | 9
| | 21
(12 rows)
As you can see, except for the same part as ROLLUP
(lines 1-8 and last line):
(c2)
(e.g. aa = 1+4, bb = 2+5).The calculation results of the CUBE clause can be rendered with rich visual and interactive effects through BI tools, such as [Sankey Chart](https://baike.baidu.com/item/Sankey Chart/10773057?fr=ge_ala).
Notes!
For examples of data cube application, please see [Multidimensional Data Query] (/doc/latest/dataquery/olap_oltp_scene_scene#1.10%20 multidimensional data analysis).