Advanced Queries

This document describes the functions and expressions provided by YMatrix for querying data.

In addition to basic query statements, YMatrix provides many advanced analytical functions and SQL expressions.


1 Window Functions

Window functions support complex reporting and analytical operations, commonly used in OLAP and time-series scenarios.

A "window" defines a set of rows related to the current row. The key difference between window functions and other functions is the presence of the OVER clause. If a function includes an OVER clause, it becomes a window function. This clause specifies the window over which the function operates.

The following table highlights the differences between aggregate window functions and specialized window functions:

Aggregate Window Functions Specialized Window Functions
Output One row per input row Multiple rows
Functions max(), min(), count(), sum(), avg(), sum(), etc. In addition to standard aggregates, specialized functions such as row_number(), rank(), dense_rank(), first_value(), last_value(), etc.
Usage Typically used with GROUP BY; can also act as window functions when combined with OVER Used with OVER. Usually, the OVER clause follows immediately after the function name and arguments. Within OVER, you may use PARTITION BY, ORDER BY, or ROWS BETWEEN. Using only PARTITION BY, or combining PARTITION BY and ORDER BY, creates a static window (fixed size and position). When PARTITION BY is used along with ROWS BETWEEN, or both ROWS BETWEEN and ORDER BY, a sliding window is formed—applying a fixed-size window across each input row sequentially.

1.1 Syntax

Like aggregate functions, window functions appear in the SELECT clause of a table expression:

=# SELECT 
    <function_name(arguments)> OVER (
        PARTITION BY …
        ORDER BY …
    )
   FROM …;

The OVER clause defines the window scope:

  • The PARTITION BY list divides rows into partitions; each partition acts as a separate window. If omitted, all rows belong to one partition.
  • The ORDER BY clause determines the ordering of rows within a partition.

1.2 Specialized Window Functions

YMatrix supports the following specialized window functions:

Function Description
row_number() Assigns a unique sequential integer to each row within a result set partition, starting from 1
rank() Assigns a rank value to each row in a partition. Tied values receive the same rank, and subsequent ranks are incremented accordingly. Ranks may not be consecutive when ties exist
dense_rank() Assigns a rank value to each row in a partition. Tied values receive the same rank, but the next rank continues without gaps. Thus, ranks remain consecutive even with ties
first_value() Returns the first value in the ordered partition
last_value() Returns the last value in the ordered partition
ntile() Divides the ordered partition into n buckets (groups) of equal size
cume_dist() Computes the proportion of rows in the current partition with values greater than or equal to the current row (including itself)
percent_rank() Computes the relative rank of a row: the proportion of rows in the partition (excluding the current row) with values higher than the current row
lag(,) Accesses the value of a column from a specified number of rows before the current row
lead(,) Accesses the value of a column from a specified number of rows after the current row

1.2.1 Distribution Statistics Functions

Syntax:

=# SELECT …,
    ntile(<n>) OVER (ORDER BY …),
    cume_dist() OVER (ORDER BY …),
    percent_rank() OVER (ORDER BY …)  
FROM …;

Note!
The cume_dist() and percent_rank() functions differ in both numerator and denominator:

  1. The denominator of cume_dist() is the total number of rows in the current partition, while the denominator of percent_rank() is the total number of rows in the current partition minus one.
  2. The numerator of cume_dist() is the number of rows with values greater than or equal to the current row's value plus one (includes the current row itself), whereas the numerator of percent_rank() is the number of rows with values greater than the current row's value (excludes the current row itself).

1.2.2 Offset Calculation Functions

lag() and lead() are used to access data from previous or future rows. They are often applied in comparisons between current and historical data.

Syntax:

=# SELECT 
          lag(<offset_column>,<offset_rows>) OVER (ORDER BY …) AS …,
          lead(<offset_column>,<offset_rows>) OVER (ORDER BY …) AS …
    FROM …;

1.3 Aggregate Window Functions

YMatrix supports, but is not limited to, the following aggregate functions:

Function Description
count(/*) Counts non-null values (count(*) includes nulls)
sum() Sum of values
avg() Average of values
min() Minimum value
max() Maximum value

Using aggregate functions with OVER differs from using them with GROUP BY:

  • GROUP BY groups input rows first, then applies aggregation per group, returning one output row per group.
  • With the OVER clause, every input row corresponds to a window. The aggregate function computes results over each window, producing one result per input row.

1.4 Window Frame / Sliding Window

A window frame restricts the range of rows included in a window calculation. It applies to both specialized and aggregate window functions.

It follows the ORDER BY clause. Full syntax:

=# <aggregate_function(…)> OVER (… ORDER BY … ROWS|RANGE {
          UNBOUNDED PRECEDING
        | n PRECEDING
        | CURRENT ROW
        | BETWEEN <frame_boundary> AND <frame_boundary> })

Valid window frame boundaries are listed below. Here, n can be a number or any expression evaluating to a number:

Frame Boundary Description
UNBOUNDED PRECEDING First row of the partition
n PRECEDING n rows before the current row
CURRENT ROW Current row
n FOLLOWING n rows after the current row
UNBOUNDED FOLLOWING Last row of the partition

Note!
For usage examples of window functions, see Typical Queries in OLAP Scenarios.


2 Ordered-Set Aggregate Functions

Ordered-set aggregate functions perform aggregations on sorted sets of data. These functions are useful when order-sensitive computations are required.

Function Description Syntax
mode() Returns the most frequent input value (first one if multiple values have the same frequency) mode() WITHIN GROUP (ORDER BY …)
percentile_cont() Computes continuous percentile(s):
࠾ Single percentile: returns a value corresponding to the given fractional position in the sorted dataset. If no exact match exists, linear interpolation is used
࠾ Multiple percentiles: returns an array shaped like the input fraction array, with each element replaced by its corresponding percentile value
Single: percentile_cont(<分数>) WITHIN GROUP (ORDER BY …)
Multiple: percentile_cont(<分数数组>) WITHIN GROUP (ORDER BY …)
percentile_disc() Computes discrete percentile(s):
࠾ Single percentile: returns the first actual input value whose position in the sorted order is equal to or exceeds the specified fraction. No interpolation is performed
࠾ Multiple percentiles: returns an array where each non-null element is replaced by the corresponding input value at that percentile
Single: percentile_disc(<分数>) WITHIN GROUP (ORDER BY …)
Multiple: percentile_disc(<分数数组>) WITHIN GROUP (ORDER BY …)

To ensure correct and consistent use of ordered-set aggregate functions, note the following:

  • NULL values in the input data are ignored during computation. However, if the fraction parameter is NULL, the result will be NULL.
  • The fraction must be a valid value between 0 and 1. Providing a value outside this range causes an error.
  • A NULL fraction results in a NULL return, indicating insufficient information to compute the result.

Note!
For usage examples, see Continuous Percentile.


3 Time-Series Functions

Note!
These functions require the matrixts extension to be created before use.

Function Description
time_bucket('', ) Groups time-series data into fixed intervals for aggregation
time_bucket_gapfill('', ) Fills missing data points to create smooth time-series distributions. Supports two strategies: locf (Last Observation Carried Forward) and interpolate
first(, ) Returns the first observed value of the metric. Note: second parameter need not be timestamp; labeled as such for time-series context
last(, ) Returns the last observed value of the metric. Same note applies
last_not_null_value(, ) Returns the last non-NULL value of the specified column
last_not_null(, ) Returns the last non-NULL value and its timestamp

3.1 time_bucket() Function

This function works with aggregate functions to compute aggregated values over fixed time intervals.

Parameter Supported Data Types Supported Units
Interval int16; int32; int64; interval us / ms / s / minute(min, minutes) / hour / day, e.g., 1 day
Timestamp Column int16; int32; int64; date; timestamp; timestamptz

Example syntax:

=# SELECT time_bucket('<interval>', <timestamp_column>) AS …,
    <aggregate_function> AS …
    FROM …
    WHERE …
    GROUP BY …
    ORDER BY …;

3.2 time_bucket_gapfill() Function

Similar to time_bucket(), this function fills gaps in time-series data. Use it when data is missing and needs smoothing for analysis.

Parameter Supported Data Types Supported Units
Interval int16; int32; int64; interval us / ms / s / minute(min, minutes) / hour / day, e.g., 1 day
Timestamp Column int16; int32; int64; date; timestamp; timestamptz
Value Column Original column type Original unit

Two fill strategies are supported: locf and interpolate:

  • locf: Uses the last observed non-missing value (LOCF). Preserves trend and returns existing actual values.
  • interpolate: Performs linear interpolation between known points. Returns computed estimates.

Syntax examples:

  1. Without gap filling
=# SELECT time_bucket_gapfill('<interval>', <timestamp_column>) AS …,
    <aggregate_function> AS …
    FROM …
    WHERE …
    GROUP BY …
    ORDER BY …;
  1. Using locf strategy
=# SELECT time_bucket_gapfill('<interval>', <timestamp_column>) AS …,
    locf(<aggregate_function>) AS …
    FROM …
    WHERE …
    GROUP BY …
    ORDER BY …;
  1. Using interpolate strategy
=# SELECT time_bucket_gapfill('<interval>', <timestamp_column>) AS …,
    interpolate(<aggregate_function>) AS …
    FROM …
    WHERE …
    GROUP BY …
    ORDER BY …;

Note!
For usage examples, see Spatiotemporal Queries.


4 Common Table Expressions (CTE)

A Common Table Expression (CTE) allows defining a temporary view to simplify complex queries. It is implemented using the WITH keyword. The following table compares CTEs with CREATE VIEW statements:

CTE CREATE VIEW
Definition Not a standalone SQL statement; part of a larger expression Standalone SQL statement
Scope Temporary view accessible only within the enclosing query View available globally for all queries

Syntax:

=# WITH <cte_name> (<column_list>, …) AS (
    SELECT … FROM …
) SELECT … FROM …;

Note!
For usage examples, see Metric Comparison Analysis.


5 Nested Queries

5.1 Subquery Placement

Subqueries can appear in various parts of a query:

  • In the SELECT list, for example:

    =# SELECT (SELECT …) AS … FROM …;
  • In the FROM list, for example:

    =# SELECT …
     FROM … AS …, (SELECT … FROM …) AS …
     WHERE …;

Note!
Subqueries in the FROM list must have an alias.

  • In WHERE expressions, for example:
    =# SELECT … 
     FROM …
     WHERE … IN (SELECT … FROM … WHERE …);

5.2 Scalar Subqueries

When a subquery returns a single value, it can appear in the SELECT/FROM list or in WHERE conditions. It can be used with comparison operators such as =, <, <=, >, >=, !=.

Assume tables t1 (with column c1) and t2 (with column c2):

=# SELECT (SELECT max(c1) FROM t1), t2.*
   FROM t2
   WHERE c2 > (SELECT avg(c2) FROM t2);

Here, scalar aggregate subqueries appear in both the SELECT list and the WHERE condition.

Execution proceeds as follows: First, the inner query SELECT avg(c2) FROM t2 computes the average of column t2 in table c2. Then, the outer query filters rows based on this average and other conditions c2 > (SELECT avg(c2) FROM t2). For each matching row, it returns all columns from t2 and the result of SELECT max(c1) FROM t1.

5.3 Set-Returning Subqueries

If a subquery returns multiple rows, it can act as a temporary table in the FROM list and participate in joins (JOIN). Like scalar subqueries, they can also be used with comparison operators.

Assume tables ttt1 (columns c1, c4) and ttt2 (columns c2, c3):

=# SELECT t1.c1, t2.c2
   FROM ttt1 AS t1
   INNER JOIN (SELECT c3, c2 FROM ttt2) AS t2
   ON t1.c4 = t2.c3;

Execution steps:

  1. Execute subquery SELECT c3, c2 FROM ttt2, selecting columns c3 and c2 from table ttt2, resulting in temporary table t2.
  2. Join table ttt1 with t2 using an inner join.
  3. Return rows satisfying the join condition, selecting t1.c1 and t2.c2.

Set-returning subqueries can also be used in WHERE conditions via:

  • ALL: Must satisfy all tuples in the subquery result.
  • ANY: Satisfies at least one tuple.
  • IN = ANY, except IN only checks membership, while ANY allows comparison operators.
  • EXISTS: True if subquery returns at least one row.

Assume table t1:

=# CREATE TABLE t1 (
   c1 int,
   c2 text
   );
=# INSERT INTO t1 VALUES (1,'content');
   INSERT INTO t1 VALUES (2,'content');
   INSERT INTO t1 VALUES (3,'content');
   INSERT INTO t1 VALUES (4,'text');
   INSERT INTO t1 VALUES (5,'text');
  1. ALL Example
=# SELECT * FROM t1
   WHERE c1 >= ALL (
    SELECT c1 FROM t1
    WHERE c2 = 'content'
    ORDER by c1
   );

Since the subquery yields (1,2,3), the above is equivalent to:

=# SELECT * FROM t1
   WHERE c1 >= 1
   AND c1 >= 2
   AND c1 >= 3
   ORDER by c1;
 c1 |   c2
----+---------
  3 | content
  4 | text
  5 | text
(3 rows)
  1. ANY Example
=# SELECT * FROM t1
   WHERE c1 >= ANY (
    SELECT c1 FROM t1
    WHERE c2 = 'content'
   );

Equivalent to:

=# SELECT * FROM t1
   WHERE c1 >= 1
   OR c1 >= 2
   OR c1 >= 3;
  1. IN Example
=# SELECT * FROM t1
   WHERE c1 IN (
    SELECT c1 FROM t1
    WHERE c2 = 'content'
   );

Equivalent to:

=# SELECT * FROM t1
   WHERE c1 = 1
   OR c1 = 2
   OR c1 = 3;
  1. EXISTS Example
=# SELECT * FROM t1
   WHERE EXISTS (
    SELECT * FROM t1
    WHERE c1 < 60
    AND c2 = 'content'
   );

Since the subquery returns non-empty results:

=# SELECT * FROM t1
   WHERE true;

If empty:

=# SELECT * FROM t1
   WHERE false;

5.4 Correlated Subqueries

A correlated subquery depends on values from the outer query. For each row in the outer query, the subquery is re-evaluated.

Execution process:

  1. Retrieve a row from the outer query and pass relevant column values to the inner query.
  2. Execute the inner query and obtain its result.
  3. Evaluate the WHERE condition using the subquery result.
  4. Repeat for the next outer row until all are processed.

Assume tables t1 (column c1) and t2 (column c2):

=# SELECT * FROM t1
   WHERE NOT EXISTS (
    SELECT * FROM t2
    WHERE t2.c2 = t1.c1
   );

For each row in outer table t1, the subquery checks whether a matching row exists in t2. The NOT EXISTS expression determines whether to include the t1 row in the final result.

Note!
For usage examples, see Ad Hoc Analysis.


6 Data Cube

In Business Intelligence (BI), users analyze datasets across multiple dimensions and visualize insights for decision-making. The data cube feature enables multidimensional analysis by generating a multi-dimensional array of grouped data.

Data cubes use a model based on facts and dimensions:

  • Dimensions represent perspectives for observing data, often hierarchical (e.g., Product, Quarter, Region).
  • Facts are detailed transactional records (e.g., sales amounts).

6.1 GROUPING SETS Clause

GROUPING SETS performs multiple groupings in a single query. Aggregation occurs after FROM and WHERE evaluation, computing results for each specified grouping and returning them together.

Assume table t1 (columns c1, c2, c3):

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   GROUP BY GROUPING SETS (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+----
 a  |    | 6
 b  |    | 15
 c  |    | 24
    | aa | 12
    | bb | 15
    | cc | 18
(6 rows)

Results show sums grouped by:

  • c1: Total of c3 for a, b, c
  • c2: Total of c3 for aa, bb, cc

Using GROUPING SETS is equivalent to running multiple UNION ALL queries, but more concise and standardized.

--- SPLIT ---

6.2 ROLLUP Clause

In addition to GROUPING SETS, YMatrix provides other simplified methods to specify commonly used types of grouping sets.

Aggregation results grouped by the ROLLUP clause are rolled up layer by layer in the order of the grouping keys.

In fact, ROLLUP ( c1, c2, … ) is equivalent to:

=# GROUPING SETS (
    ( c1, c2, … ),
    …
    ( c1 ),
    ( )
   )

where the last empty grouping key represents aggregation over all returned rows.

Assume we have a table t1 with columns c1, c2, and c3:

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   WHERE c1 = 'a' OR c1 = 'b'
   GROUP BY ROLLUP (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+----
 a  |  aa  | 1
 a  |  bb  | 2
 a  |  cc  | 3
 a  |      | 6
 b  |  aa  | 4
 b  |  bb  | 5
 b  |  cc  | 6
 b  |      | 15
    |      | 21
(9 rows)

From the result:

  • Rows 1–3 and 4–6 show the sum grouped by (c1,c2) (in this example, these are the individual values of c3);
  • Rows 4 and 8 show the sum grouped by (c1) (i.e., the totals for rows 1–3 and 5–7 respectively);
  • Row 9 (last row) shows the total sum across all returned rows.

The ROLLUP clause is typically used for historical data analysis, such as calculating total salaries by department, region, and company-wide levels. The results can be visualized using multi-level pie charts or similar representations.

6.3 CUBE Clause

Aggregation results grouped by the CUBE clause compute roll-ups for each combination of grouping keys, then take the union and remove duplicates.

CUBE ( c1, c2 ) is equivalent to:

=# GROUPING SETS (
    ( c1, c2 ),
    ( c1     ),
    (     c2 ),
    (        )
   )

where the last empty grouping set represents aggregation over all returned rows.

Assume we have a table t1 with columns c1, c2, and c3:

=# SELECT c1, c2, sum(c3) AS sum 
   FROM t1
   WHERE c1 = 'a' OR c1 = 'b'
   GROUP BY CUBE (c1, c2)
   ORDER BY (c1,c2);
 c1 | c2 | sum
----+----+-----
 a  | aa |   1
 a  | bb |   2
 a  | cc |   3
 a  |    |   6
 b  | aa |   4
 b  | bb |   5
 b  | cc |   6
 b  |    |  15
    | aa |   5
    | bb |   7
    | cc |   9
    |    |  21
(12 rows)

As shown, in addition to the same results as ROLLUP (rows 1–8 and the last row):

  • Rows 9–11 show sums grouped by (c2) (e.g., aa = 1+4, bb = 2+5).

Results from the CUBE clause can be rendered into rich visualizations and interactive dashboards using BI tools, such as Sankey diagrams.

Note!
For an example of data cube usage, see Multidimensional Data Query.