This document gives examples of typical query statements in time-series scenarios. Since most real-world scenarios combine several query patterns, the examples are not strictly categorized.
Note!
Understanding the query requirements of a scenario is one of the keys to building an accurate data model.
Query the values of one or more indicators of a single device at a given point in time. The most common case is the latest value (or latest non-null value) of certain indicators of a device, such as the real-time speed of a moving car:
=# SELECT timestamp, car_id, value FROM car
WHERE car_id = '2'
ORDER BY timestamp DESC
LIMIT 1;
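For the latest non-null value mentioned above, a filter on the indicator column can be added (a sketch against the same `car` table, assuming `value` may contain NULLs):

```sql
=# SELECT timestamp, car_id, value FROM car
WHERE car_id = '2'
AND value IS NOT NULL
ORDER BY timestamp DESC
LIMIT 1;
```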
Query the detailed data of a single device in a given time interval, involving one or more indicators. This is usually done for further detailed analysis, such as analyzing the cause of a failure in after-sales service:
=# SELECT * FROM analysis
WHERE device_id = '2'
AND timestamp >= '2023-10-08 00:00:00'
AND timestamp <= '2023-10-08 09:00:00';
Query the aggregated values of certain indicators of a single device over a given time interval, such as the maximum, minimum, cumulative sum, and average:
=# SELECT count(1), sum(c1), avg(c1), max(c2), min(c2) FROM metrics
WHERE device_id = '1'
AND timestamp >= '2023-10-08 00:00:00'
AND timestamp <= '2023-10-08 09:00:00';
Query the indicator values of multiple devices at a given point in time, involving one or more indicators, such as a combined signal query.
=# SELECT timestamp, device_id, value FROM metrics
WHERE (device_id = '2' OR device_id = '3')
AND timestamp = '2023-10-08 00:00:00'
ORDER BY device_id, timestamp DESC;
hint
The query returns the matching rows sorted first in ascending order by the `device_id` column. If there are multiple records with the same `device_id`, they are then sorted in descending order by the `timestamp` column. If you want both columns in descending order, change the SQL to `ORDER BY device_id DESC, timestamp DESC`.
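Applying the hint, the fully descending variant of the query above would look like this:

```sql
=# SELECT timestamp, device_id, value FROM metrics
WHERE (device_id = '2' OR device_id = '3')
AND timestamp = '2023-10-08 00:00:00'
ORDER BY device_id DESC, timestamp DESC;
```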
Query the detailed data of multiple devices in a given time interval, involving one or more indicators. This is mostly used for discovering data phenomena, such as root cause analysis and new product tracking, generally by R&D personnel:
=# SELECT d.device_id, d.timestamp, d.value, s.device_name, l.location_name
FROM device_data AS d
JOIN device_status AS s ON d.device_id = s.device_id
JOIN device_location AS l ON d.device_id = l.device_id
WHERE d.device_id IN ('device1', 'device2', 'device3')
AND d.timestamp >= '2023-11-01 00:00:00'
AND d.timestamp <= '2023-11-10 23:59:59'
ORDER BY d.device_id, d.timestamp;
This is an example of root cause analysis involving several tables:

- `device_data`: a table containing device data, with columns such as `device_id`, `timestamp`, and `value`.
- `device_status`: a table containing device status, with columns such as `device_id` and `device_name`.
- `device_location`: a table containing device location information, with columns such as `device_id` and `location_name`.

Key Points:

- The `JOIN` clauses connect the related tables on the device ID.
- The `WHERE` clause filters the data to the specified device IDs and time range.

Query the aggregated values of certain indicators of multiple devices over a given time interval, such as the maximum, minimum, cumulative sum, and average, grouped as needed, for example by region for network status checks in each area:
=# SELECT l.location_name, COUNT(d.device_id) AS total_devices, AVG(d.value) AS average_value
FROM device_data AS d
JOIN device_location AS l ON d.device_id = l.device_id
WHERE l.location_name IN ('location1', 'location2', 'location3')
AND d.timestamp >= '2023-11-01 00:00:00'
AND d.timestamp <= '2023-11-10 23:59:59'
GROUP BY l.location_name;
This is an example of network status checks in each region, assuming the following two tables:

- `device_data`: a table containing device data, with columns such as `device_id`, `timestamp`, and `value`.
- `device_location`: a table containing device location information, with columns such as `device_id` and `location_name`.

Key Points:

- The `JOIN` clause connects the related tables on the device ID.
- The `WHERE` clause filters the data to the specified regions and time range.
- The `GROUP BY` clause groups the results by region.
- The `COUNT` and `AVG` aggregate functions calculate the number of devices and the average value, respectively.

Queries involving the `first()` and `last()` functions, sometimes combined with spatial data, are used for values such as the starting oil volume and the ending oil volume.
Using the YMatrix time-series functions first requires creating the extension:
=# CREATE EXTENSION matrixts;
The extension is created at the database level: create it once per database, with no need to recreate it.
Returns the value of the first record in the specified column:
=# SELECT id, first(c1, ts) AS c
FROM t
GROUP BY id
ORDER BY id;
Returns the value of the last record in the specified column:
=# SELECT device_id, last(value, timestamp) AS last_value
FROM t
GROUP BY device_id
ORDER BY device_id;
Downsample with the `time_bucket()` function, for example counting rows in 5-second buckets:
=# SELECT time_bucket('5 s', timestamp) AS five_second, count(*) FROM t
GROUP BY five_second;
The above SQL runs a `count(*)` over the `t` table in every `5s` bucket.
Key Points:

- The `GROUP BY` clause groups the raw data into wider time intervals and computes the key characteristics of each group, which is called downsampling. Downsampling not only reduces storage overhead but also retains the key characteristics of the data, making it easier to analyze historical trends, predict future trends, and so on.

Query the latest non-null value of each indicator for every device:

=# SELECT
device_id,
max(timestamp) as max_timestamp,
last_not_null_value(value1, timestamp) as lastvalue1,
last_not_null_value(value2, timestamp) as lastvalue2,
last_not_null_value(value3, timestamp) as lastvalue3
FROM t
GROUP BY device_id;
or
=# SELECT
device_id,
max(timestamp) as max_timestamp,
last_not_null(value1, timestamp) as lastvalue1,
last_not_null(value2, timestamp) as lastvalue2,
last_not_null(value3, timestamp) as lastvalue3
FROM t
GROUP BY device_id;
Key Points:

- Use the `last_not_null()` or `last_not_null_value()` function.
- `last_not_null()` returns not only the value but also its time; the return type is a string in the format `[<value>, <time>]`. `last_not_null_value()` returns only the value itself.

Calculate differences to determine how indicators change, for example to check whether stock levels are normal:
=# SELECT
device_id,
timestamp,
value,
lag(value) OVER(PARTITION BY device_id ORDER BY timestamp),
lead(value) OVER(PARTITION BY device_id ORDER BY timestamp),
lead(value) OVER(PARTITION BY device_id ORDER BY timestamp) - value AS lead_value
FROM t;
Key Points:

- `lag()` gets the value of the row before the current row.
- `lead()` gets the value of the row after the current row.
- With `device_id` in the `PARTITION BY` clause, the window functions are applied to the rows within each device ID partition.
- `lead(value) - value` computes the difference between the next record's value and the current record's value, named `lead_value`.

Fill in missing values with gap-filling and interpolation functions:

=# SELECT time_bucket_gapfill('45 minutes', timestamp) AS five_min,
locf(AVG(value)) AS locf_value,
interpolate(AVG(value)) AS interpolate_value
FROM t
WHERE device_id = '1'
AND timestamp >= '2021-12-01 00:00:00'
AND timestamp < '2021-12-02 06:00:00'
GROUP BY five_min
ORDER BY five_min;
Key Points:

- The query uses the `time_bucket_gapfill()` function together with the forward-fill (LOCF) and linear interpolation functions to fill the missing values in the `t` table.
- `time_bucket_gapfill('45 minutes', timestamp)`: buckets the `timestamp` column at the specified interval (45 minutes) and fills in the missing time buckets, ensuring that the buckets in the result are continuous with no gaps.
- `locf(AVG(value))`: fills the `value` in each time bucket using the LOCF (Last Observation Carried Forward) method, which carries the last non-missing value before the current bucket forward to fill the gap.
- `interpolate(AVG(value))`: fills the `value` in each time bucket using linear interpolation, which estimates the missing value with a linear function between the non-missing values before and after the current bucket.

Anomaly detection can be performed through abnormal changes in the cumulative sum, cumulative average, and median:
-- Cumulative sum
=# SELECT device_id, timestamp, value,
SUM(value) OVER(PARTITION BY device_id, timestamp::date ORDER BY timestamp) AS accumulation_value
FROM t;
-- Cumulative Average
=# SELECT device_id, timestamp, value,
AVG(value) OVER(PARTITION BY device_id, timestamp::date ORDER BY timestamp) AS accumulation_value
FROM t;
-- Median
=# SELECT device_id,
percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS accumulation_value
FROM t
GROUP BY device_id
ORDER BY device_id;
Increments can be calculated for increases or decreases in a monotonic sequence, or simply to compute the change from the previous data point, usually for absolute-value data detection:
Example 1

This example calculates the change in disk reads for the disk with `tag_id` of `1` from `2021-04-10 21:00:00` to `21:01:00`:
=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) AS read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
Key Points:

- Use `lag()` to get the value of the previous row.

Example 2
This example calculates daily sales increments using the `sales` table:
=# SELECT
date,
sales_amount,
sales_amount - LAG(sales_amount) OVER (ORDER BY date) AS sales_increment
FROM
sales
ORDER BY
date;
Key Points:

- `LAG(sales_amount) OVER (ORDER BY date)` retrieves the previous day's sales, and the difference with the current day's `sales_amount` gives the daily increment `sales_increment`.

Example 1

Based on the increment, dividing it by the time interval gives the growth rate:
=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
Key Points:

- `Increment / Time Interval = Growth Rate`.

This query calculates the time before and after each jump, the device number, and the indicator values before and after the jump, for example when examining stock index fluctuations or retrospectively reviewing stock trading activity:
=# SELECT stock_id, timestamp, value, lag_value, lead_value, lag_diff_value, lead_diff_value FROM (
SELECT stock_id, timestamp, value,
lag(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lag_value,
lead(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lead_value,
value - lag(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lag_diff_value,
value - lead(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lead_diff_value
FROM t
WHERE 1=1
AND stock_id in ('1')
AND timestamp >= '2021-12-01 00:00:00'
AND timestamp < '2021-12-02 23:59:59'::timestamp
AND value IS NOT NULL
) ht
WHERE abs(lag_diff_value) > 3 OR abs(lead_diff_value) > 3
ORDER BY stock_id, timestamp;
Key Points:

- Use `lag()` and `lead()` to get the values of the previous and next rows of the current row, ordered by the `timestamp` column within each `stock_id` partition.
- The `lag_value` column is the `value` of the previous row.
- The `lead_value` column is the `value` of the next row.
- The `lag_diff_value` column is the difference between the current row's `value` and the previous row's `value`.
- The `lead_diff_value` column is the difference between the current row's `value` and the next row's `value`.

Note!
Jump detection finds, for a specified device and indicator within a specified time range, the points where the change between adjacent time values exceeds a certain threshold.
Use quantile functions to calculate real-time data for monitoring dashboards, for example, use the 0.9 quantile to reflect a stock's trend:
=# SELECT stock_id,
percentile_cont(0.9) WITHIN GROUP (ORDER BY value) AS value_9
FROM t
GROUP BY stock_id
ORDER BY stock_id;
Using the cumulative distribution function, you can calculate the cumulative distribution of stock prices or indicators over a given time period. This helps assess the relative position of a price or indicator, as well as determine the price range and statistical characteristics of a stock, for example to monitor the distribution of statistical data on real-time dashboards.
=# SELECT stock_id, timestamp, price,
cume_dist() OVER(PARTITION BY stock_id, timestamp::date ORDER BY timestamp) AS cume_dist
FROM t;
Key Points:

- The `cume_dist()` function calculates the cumulative distribution value: cumulative distribution value = (number of partition rows preceding or tied with the current row) / (total number of partition rows).
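As an illustration of that formula, a minimal standalone query over inline sample data (not one of the tables above) behaves as follows:

```sql
-- Four rows; for value 2 there are three rows with value <= 2,
-- so its cumulative distribution is 3/4 = 0.75
=# SELECT v, cume_dist() OVER (ORDER BY v) AS cd
FROM (VALUES (1), (2), (2), (4)) AS s(v);
```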
In time-series scenarios, row-column transformation is used to change the data model, mostly for ad-hoc analysis.
=# SELECT name,
max(CASE WHEN attribute='age' THEN value ELSE 0 END) AS age,
max(CASE WHEN attribute='height' THEN value ELSE 0 END) AS height,
max(CASE WHEN attribute='weight' THEN value ELSE 0 END) AS weight,
max(CASE WHEN attribute='shoe_size' THEN value ELSE 0 END) AS shoe_size
FROM t
GROUP BY name
ORDER BY age DESC;
Table t
original structure example:
+------+----------+-------+
| name | attribute | value |
+------+----------+-------+
| John | age | 30 |
| John | height | 175 |
| John | weight | 70 |
| John | shoe_size| 9.5 |
| Mary | age | 25 |
| Mary | height | 160 |
| Mary | weight | 55 |
| Mary | shoe_size| 8 |
+------+----------+-------+
Example of row-to-column results:
+------+-----+-------+--------+-----------+
| name | age | height| weight | shoe_size |
+------+-----+-------+--------+-----------+
| John | 30 | 175 | 70 | 9.5 |
| Mary | 25 | 160 | 55 | 8 |
+------+-----+-------+--------+-----------+
Key Points:

- `max(CASE WHEN attribute = '...' THEN value ELSE 0 END)`, grouped by `name`, pivots each attribute row into a column of its own (row-to-column conversion).

The reverse, column-to-row conversion, expands each key column of a wide table into key-value rows:

=# SELECT currenttimestamp,
deviceid,
devicetemplatecode,
statisticstype,
(b.rec).key AS key,
(b.rec).value AS value
FROM
(SELECT currenttimestamp,
deviceid,
devicetemplatecode,
statisticstype,
jsonb_each_text(row_to_json(t.*)::jsonb-'currenttimestamp'-'deviceid'-'devicetemplatecode'-'statisticstype') AS rec
FROM t
) b
WHERE (b.rec).value IS NOT NULL;
Table t
original structure example:
+---------------------+----------+-------------------+----------------+--------+--------+--------+
| currenttimestamp | deviceid | devicetemplatecode| statisticstype | key1 | key2 | key3 |
+---------------------+----------+-------------------+----------------+--------+--------+--------+
| 2023-11-13 08:30:45 | 1234567 | template1 | type1 | value1 | value2 | value3 |
+---------------------+----------+-------------------+----------------+--------+--------+--------+
Example of column-to-row conversion results:
+---------------------+----------+-------------------+----------------+------+-------+
| currenttimestamp | deviceid | devicetemplatecode| statisticstype | key | value |
+---------------------+----------+-------------------+----------------+------+-------+
| 2023-11-13 08:30:45 | 1234567  | template1         | type1          | key1 | value1|
| 2023-11-13 08:30:45 | 1234567  | template1         | type1          | key2 | value2|
| 2023-11-13 08:30:45 | 1234567  | template1         | type1          | key3 | value3|
| ... | ... | ... | ... | ... | ... |
+---------------------+----------+-------------------+----------------+------+-------+
Key Points:

- Each row is first converted into a JSON object with the `row_to_json()` function (removing the fixed columns), and the JSON object is then expanded into key-value pairs with the `jsonb_each_text()` function.

Show the status level of a machine/account/person:
=# SELECT first_name,
last_name,
salary,
CASE
WHEN salary >= 80000 THEN 'senior'
WHEN salary >= 60000 THEN 'intermediate'
ELSE 'junior'
END AS employee_level
FROM employees;
Key Points:

The `CASE WHEN` clause works as follows:

- If `salary` is greater than or equal to `80000`, the `employee_level` column is `senior`.
- If `salary` is greater than or equal to `60000` and less than `80000`, the `employee_level` column is `intermediate`.
- If `salary` is less than `60000`, the `employee_level` column is `junior`.

Query the TOP 10 peaks of a specified indicator for a stock within a specified time range, together with the times at which the peaks occur:
=# SELECT stock_id, timestamp, value FROM (
SELECT stock_id, timestamp, value,
row_number() OVER (PARTITION BY stock_id ORDER BY value DESC) AS rn
FROM t_test
WHERE stock_id ='1'
AND timestamp >= '2021-12-01 00:00:00'
AND timestamp < '2021-12-03 00:00:00'
) AS derived_table
WHERE rn <= 10;
Key Points:

- `row_number()` ranks the rows of table `t_test` by `value` in descending order within each `stock_id` partition, and the outer query keeps the ten rows with the highest `value`.

The following is used for traceability queries, traversing metrics of the `JSONB` data type, etc.:
=# WITH RECURSIVE json_recursive AS (
SELECT id,
'person' AS key,
data->'person' AS value
FROM my_table
UNION ALL
SELECT j.id,
k AS key,
v AS value
FROM json_recursive j,
jsonb_each(j.value) AS kv(k, v)
WHERE jsonb_typeof(j.value) = 'object'
)
SELECT id, key, value
FROM json_recursive;
Key Points:

- The query uses a recursive CTE (`WITH RECURSIVE`) with two subqueries (`SELECT` statements).
- The first subquery selects the `id` column from the table `my_table`, the fixed key name `person`, and the corresponding value as the initial result set.
- The second subquery joins the `json_recursive` table with the result of the `jsonb_each()` function. In each iteration, the current JSON value (`j.value`) is broken down into key-value pairs by `jsonb_each()`, and the resulting key (`k`) and value (`v`) become the iteration result. The condition `jsonb_typeof(j.value) = 'object'` in the `WHERE` clause selects only records whose JSON value is an object, which guarantees that the recursion terminates.
- The final query selects the `id`, `key`, and `value` columns from the `json_recursive` table to obtain the complete traceability information.

Machine learning techniques can be used for prediction and classification.
For example, based on data from the 1st to the 10th, a linear regression model is fitted to predict the indicator value on the 11th:
=# SELECT '2021-12-11 00:00:01' AS timestamp,
extract(epoch FROM '2021-12-11 00:00:01'::timestamp) * slope + intercept AS c1_value
FROM (
SELECT regr_slope(value, extract(epoch from timestamp)) AS slope,
regr_intercept(value, extract(epoch from timestamp)) AS intercept
FROM t_test
WHERE vin = '1'
AND timestamp >= '2021-12-01 00:00:00'
AND timestamp < '2021-12-11 00:00:00'
) AS linear_regression;
Key Points:

- The query uses `regr_slope()` and `regr_intercept()` to perform linear regression on the table `t_test`. They calculate the slope and intercept of a linear regression model, enabling simple machine learning tasks in SQL.
- The inner query filters the rows with `vin` equal to `1` and `timestamp` greater than or equal to `2021-12-01 00:00:00` and less than `2021-12-11 00:00:00` according to the conditions, then calculates the slope and intercept between the `value` column and the `timestamp` column.
- In the outer query, the `timestamp` is fixed to `2021-12-11 00:00:01`, and the corresponding `value` is computed from the slope and intercept and named `c1_value`.
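As a quick sanity check on these two functions, regressing over a hypothetical inline data set that lies on a perfect line recovers its slope and intercept:

```sql
-- y = 2x exactly, so the fitted slope is 2 and the intercept is 0
=# SELECT regr_slope(y, x) AS slope, regr_intercept(y, x) AS intercept
FROM (VALUES (1, 2.0), (2, 4.0), (3, 6.0)) AS s(x, y);
```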