Typical queries in IoT time-series scenarios

This document gives examples of typical query statements in time-series scenarios. Because most real-world applications blend several scenarios, the classification below is not strict.

Note!
Understanding the query requirements of a scenario is one of the keys to building an accurate data model.


1 Basic queries on a single device

1.1 Single-device point query

Query the value of one or more indicators of a single device at a specific point in time. The most common case is the latest value (or latest non-null value) of certain indicators of a device, such as the real-time speed of a moving car:

=# SELECT timestamp, car_id, value FROM car
    WHERE car_id = '2' 
    ORDER BY timestamp DESC
    LIMIT 1;

1.2 Single-device detail query

Query the detailed data of a device within a time interval; this may involve one indicator or several. It is usually used for further detailed analysis, such as determining the cause of a failure during after-sales service:

=# SELECT * FROM analysis
    WHERE device_id = '2' 
    AND timestamp >= '2023-10-08 00:00:00' 
    AND timestamp <= '2023-10-08 09:00:00';

1.3 Single-device aggregate query

Query aggregate values of certain indicators of a device within a time interval, such as the maximum, minimum, sum, and average:

=# SELECT count(1), sum(c1), avg(c1), max(c2), min(c2) FROM metrics
    WHERE device_id = '1'
    AND timestamp >= '2023-10-08 00:00:00' 
    AND timestamp <= '2023-10-08 09:00:00';


2 Basic query of multiple devices

2.1 Multi-device point query

Query the indicator values of multiple devices at a specific point in time; this may involve one indicator or several, for example a combined signal query.

=# SELECT timestamp, device_id, value FROM metrics
    WHERE device_id IN ('2', '3')
    AND timestamp = '2023-10-08 00:00:00'
    ORDER BY device_id, timestamp DESC;

Hint: The query returns the matching rows sorted first by the device_id column in ascending order; rows with the same device_id are then sorted by the timestamp column in descending order. If you want both columns in descending order, change the SQL to ORDER BY device_id DESC, timestamp DESC.

2.2 Multi-device detailed query

Query the detailed data of multiple devices within a time interval; this may involve one indicator or several. It is mostly used to discover patterns in the data, such as root cause analysis and new product tracking, generally by R&D personnel:

=# SELECT d.device_id, d.timestamp, d.value, s.device_name, l.location_name
FROM device_data AS d
JOIN device_status AS s ON d.device_id = s.device_id
JOIN device_location AS l ON d.device_id = l.device_id
WHERE d.device_id IN ('device1', 'device2', 'device3')
  AND d.timestamp >= '2023-11-01 00:00:00'
  AND d.timestamp <= '2023-11-10 23:59:59'
ORDER BY d.device_id, d.timestamp;

This example performs root cause analysis and involves the following tables:

  • device_data: a table containing device data, with columns such as device_id, timestamp, value, etc.
  • device_status: a table containing device status, with columns such as device_id, device_name, etc.
  • device_location: a table containing device location information, with columns such as device_id, location_name, etc.

Key Points:

  • This query joins the three tables above to obtain device data, device status, and device location information.
  • The JOIN clauses connect the related tables, associating them by device ID.
  • The WHERE clause filters the data to the specified device IDs and time range.

2.3 Multi-device aggregate query

Query aggregate values of certain indicators of multiple devices within a time interval, such as the maximum, minimum, sum, and average, grouped by device or area, for example network status checks per area:

=# SELECT l.location_name, COUNT(d.device_id) AS total_devices, AVG(d.value) AS average_value
FROM device_data AS d
JOIN device_location AS l ON d.device_id = l.device_id
WHERE l.location_name IN ('location1', 'location2', 'location3')
  AND d.timestamp >= '2023-11-01 00:00:00'
  AND d.timestamp <= '2023-11-10 23:59:59'
GROUP BY l.location_name;

This example checks network status per region, assuming the following two tables:

  • device_data: A table containing device data, with columns such as device_id, timestamp, value, etc.
  • device_location: A table containing device location information, with columns such as device_id, location_name, etc.

Key Points:

  • This query joins the two tables above to obtain device data and device location information.
  • The JOIN clause connects the related tables, associating them by device ID.
  • The WHERE clause filters the data to the specified regions and time range.
  • The GROUP BY clause groups the results by region.
  • The COUNT and AVG aggregate functions calculate the number of devices and the average value, respectively.

3 Advanced Query


3.1 Spatio-temporal query

Queries using functions such as first() and last(), sometimes combined with spatial data, for example the starting and ending fuel level of a vehicle.

Using the YMatrix time-series functions requires creating an extension first:

=# CREATE EXTENSION matrixts;

The extension is created at the database level: create it once per database; there is no need to create it repeatedly.


3.1.1 Query the earliest value

Returns the value of the first record in the specified column:

=# SELECT id, first(c1, ts) AS c
    FROM t
    GROUP BY id
    ORDER BY id;


3.1.2 Query the latest value

Returns the value of the last record in the specified column:

=# SELECT device_id, last(value, timestamp) AS last_value
    FROM t
    GROUP BY device_id
    ORDER BY device_id;


3.1.3 Calculate the average value for a given time period

=# SELECT time_bucket('5 s', timestamp) AS five_second, count(*) FROM t
    GROUP BY five_second;

The SQL above groups table t into 5-second buckets and runs count(*) on each bucket; replacing count(*) with avg(value) would compute the per-bucket average.

Key Points:

  • Use the GROUP BY clause to group the raw data into wider time intervals and compute key feature statistics for each group; this is called downsampling. Downsampling not only reduces storage overhead but also retains the key characteristics of the data, making it easier to analyze historical trends, predict future trends, and so on.
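
The bucketing that time_bucket() performs can be sketched in Python (a minimal illustration with made-up timestamps, not the YMatrix implementation):

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def time_bucket(ts: datetime, width_seconds: int) -> datetime:
    """Truncate a timestamp down to the start of its fixed-width bucket,
    mirroring time_bucket('5 s', timestamp)."""
    elapsed = (ts - EPOCH).total_seconds()
    return EPOCH + timedelta(seconds=elapsed - elapsed % width_seconds)

# Made-up sample rows: one timestamp per row of table t.
rows = [datetime(2023, 10, 8, 0, 0, s) for s in (0, 1, 3, 6, 7, 12)]

# Equivalent of GROUP BY five_second with count(*).
counts = Counter(time_bucket(ts, 5) for ts in rows)
for bucket, n in sorted(counts.items()):
    print(bucket, n)  # 3 rows in [0s,5s), 2 in [5s,10s), 1 in [10s,15s)
```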


3.1.4 Non-empty latest value

=# SELECT 
    device_id,
    max(timestamp) as max_timestamp,
    last_not_null_value(value1, timestamp) as lastvalue1,
    last_not_null_value(value2, timestamp) as lastvalue2,
    last_not_null_value(value3, timestamp) as lastvalue3
   FROM t
     GROUP BY device_id;

or

=# SELECT 
    device_id,
    max(timestamp) as max_timestamp,
    last_not_null(value1, timestamp) as lastvalue1,
    last_not_null(value2, timestamp) as lastvalue2,
    last_not_null(value3, timestamp) as lastvalue3
   FROM t
     GROUP BY device_id;

Key Points:

  • The last_not_null() and last_not_null_value() functions both skip NULL values.
  • Compared with last_not_null_value(), last_not_null() returns not only the value but also its timestamp; the return type is a string in the format [<value>, <time>].


3.1.5 Difference calculation

Calculate differences to track changes in indicators, for example to check whether stock levels are normal:

=# SELECT
    device_id,
    timestamp, 
    value,
    lag(value) OVER(PARTITION BY device_id ORDER BY timestamp),
    lead(value) OVER(PARTITION BY device_id ORDER BY timestamp),
    lead(value) OVER(PARTITION BY device_id ORDER BY timestamp) - value AS lead_value 
FROM t;

Key Points:

  • The window function lag() is used to get the value of the previous row of the current row.
  • The window function lead() is used to get the value of the next row of the current row.
  • By specifying device_id in the PARTITION BY clause, we apply the window function to rows within each device ID partition.
  • Use lead(value) - value to calculate the difference between the value of the next record and the current record and name it lead_value.
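
The lag()/lead() arithmetic above can be sketched in plain Python (made-up values, already in the order the window's ORDER BY timestamp would produce):

```python
# Made-up readings for one device, already ordered by timestamp
# (i.e. one PARTITION of the window).
values = [10, 13, 13, 18]

lag = [None] + values[:-1]   # lag(value): previous row's value
lead = values[1:] + [None]   # lead(value): next row's value

# lead(value) - value, as in the SQL's lead_value column.
lead_diff = [None if nxt is None else nxt - cur
             for cur, nxt in zip(values, lead)]
print(lead_diff)  # [3, 0, 5, None]
```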


3.2 Abnormal data cleaning

=# SELECT time_bucket_gapfill('45 minutes', timestamp) AS bucket_45min,
          locf(AVG(value)) AS locf_value,
          interpolate(AVG(value)) AS interpolate_value
   FROM t
       WHERE device_id = '1'
       AND timestamp >= '2021-12-01 00:00:00'
       AND timestamp < '2021-12-02 06:00:00'
   GROUP BY bucket_45min
   ORDER BY bucket_45min;

Key Points:

  • The query above uses the time_bucket_gapfill() function together with the LOCF and linear interpolation functions to fill missing values in table t.
    • time_bucket_gapfill('45 minutes', timestamp): buckets the timestamp column at the specified interval (45 minutes) and fills in missing time buckets, ensuring the buckets in the result are continuous with no gaps.
    • locf(AVG(value)): fills the value in each empty time bucket using LOCF (Last Observation Carried Forward), i.e. the last non-missing value before the current bucket.
    • interpolate(AVG(value)): fills the value in each empty time bucket by linear interpolation between the non-missing values before and after it.
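
The two fill strategies can be sketched in Python (a minimal illustration with made-up bucket averages; None stands for an empty time bucket):

```python
def locf(series):
    """Last Observation Carried Forward: fill each gap with the
    most recent non-missing value."""
    out, last = [], None
    for v in series:
        if v is not None:
            last = v
        out.append(last)
    return out

def interpolate(series):
    """Fill each gap linearly between its nearest non-missing neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        step = (out[b] - out[a]) / (b - a)
        for i in range(a + 1, b):
            out[i] = out[a] + step * (i - a)
    return out

# Made-up per-bucket averages with gaps.
buckets = [10.0, None, None, 16.0, None, 20.0]
print(locf(buckets))         # [10.0, 10.0, 10.0, 16.0, 16.0, 20.0]
print(interpolate(buckets))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```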


3.3 Anomaly data monitoring

3.3.1 Calculation of cumulative sum, cumulative mean and median

Anomaly data detection can be performed through abnormal changes in the cumulative average value and median:

-- Cumulative sum
=# SELECT device_id, timestamp, value,
          SUM(value) OVER(PARTITION BY device_id, timestamp::date ORDER BY timestamp) AS accumulation_value
   FROM t;

-- Cumulative Average
=# SELECT device_id, timestamp, value,
       AVG(value) OVER(PARTITION BY device_id, timestamp::date ORDER BY timestamp) AS accumulation_value
   FROM t;

-- Median
=# SELECT device_id,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY value) AS accumulation_value
   FROM t
   GROUP BY device_id
   ORDER BY device_id;
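
The three statistics can be sketched in Python (made-up values standing in for one device's readings on one day):

```python
from statistics import median

# Made-up readings for one device, ordered by timestamp.
values = [4, 7, 2, 9]

# Running sum and running average, as the SUM/AVG window functions
# accumulate them row by row.
cum_sum, cum_avg, total = [], [], 0
for i, v in enumerate(values, start=1):
    total += v
    cum_sum.append(total)
    cum_avg.append(total / i)

print(cum_sum)         # [4, 11, 13, 22]
print(cum_avg)         # [4.0, 5.5, 4.33..., 5.5]
print(median(values))  # 5.5, as percentile_cont(0.5) would return
```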

3.3.2 Incremental calculation

Increments can be computed for the increase or decrease of a monotonic sequence, or simply to measure the change from the previous data point; they are commonly used to detect anomalies in absolute values:

Example 1

This example calculates the change in disk reads for the disk with tag_id = 1 between 2021-04-10 21:00:00 and 21:01:00:

=# SELECT
      time,
      (
       CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
       ELSE round(read - lag(read) OVER (ORDER BY time))
       END
      ) AS read
   FROM disk
      WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
      AND tag_id = 1
      ORDER BY time;

Key Points:

  • Use the window function lag() to get the value of the previous row.
  • Positive values indicate an increase over the previous second; negative values indicate a decrease.

Example 2

This example calculates sales increments per day using the sales table:

=# SELECT
      date,
      sales_amount,
      sales_amount - LAG(sales_amount) OVER (ORDER BY date) AS sales_increment
   FROM
      sales
   ORDER BY
      date;

Key Points:

  • Calculate the daily sales increment by subtracting the sales of the previous day from the current day.

3.3.3 Growth rate calculation

Example 1

Based on the increment, divide it by the time interval to get the growth rate:

=# SELECT
    time,
    (
     CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
     ELSE round(read - lag(read) OVER (ORDER BY time))
     END
    ) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
    extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
    FROM disk
    WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
    AND tag_id = 1
    ORDER BY time;

Key Points:

  • Growth rate = increment / time interval.


3.4 Indicator jump difference value query

This query returns the timestamps before and after each jump, the device (stock) ID, and the indicator values before and after the jump, for example to query stock index movements or retrospectively examine trading status:

=# SELECT stock_id, timestamp, value, lag_value, lead_value, lag_diff_value, lead_diff_value FROM (
    SELECT stock_id, timestamp, value,
        lag(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lag_value,  
        lead(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lead_value,  
        value - lag(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lag_diff_value,
        value - lead(value) OVER (PARTITION BY stock_id ORDER BY timestamp) AS lead_diff_value
    FROM t
       WHERE 1=1
       AND stock_id in ('1')
       AND timestamp >= '2021-12-01 00:00:00'
       AND timestamp < '2021-12-02 23:59:59'::timestamp
       AND value IS NOT NULL
   ) ht 
   WHERE abs(lag_diff_value) > 3 OR abs(lead_diff_value) > 3
   ORDER BY stock_id, timestamp;

Key Points:

  • The subquery uses the window functions lag() and lead() to get the values of the previous and next rows, ordered by the timestamp column within each stock_id partition.
    • The lag_value column is the previous row's value.
    • The lead_value column is the next row's value.
    • The lag_diff_value column is the difference between the current row's value and the previous row's value.
    • The lead_diff_value column is the difference between the current row's value and the next row's value.
  • The outer query selects the rows from the subquery whose difference exceeds 3 in absolute value.

Note!
Jump detection finds jumps in a specified indicator for a specified device within a specified time range. A jump means the change between adjacent time points exceeds a certain threshold.


3.5 Monitoring large screen

3.5.1 Quantile function

Use quantile functions to compute real-time data for monitoring screens, for example using the 0.9 quantile to reflect a stock's trend:

=# SELECT stock_id,
          percentile_cont(0.9) WITHIN GROUP (ORDER BY value) AS value_9
    FROM t
    GROUP BY stock_id
    ORDER BY stock_id;

3.5.2 Cumulative Distribution Function (CDF)

Using the cumulative distribution function, you can calculate the cumulative distribution of stock prices or indicators over a given time period. This helps assess the relative position of a price or indicator, and determine the price range and statistical characteristics of a stock, for example when displaying statistical distributions on a real-time monitoring screen.

=# SELECT stock_id, timestamp, price, 
          cume_dist() OVER(PARTITION BY stock_id, timestamp::date ORDER BY timestamp) AS cume_dist
   FROM t;

Key Points:

  • The cume_dist() function computes the cumulative distribution value: cumulative distribution value = (number of rows in the partition ordered before the current row or tied with it) / (total number of rows in the partition).
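
The formula can be sketched in Python (made-up prices within one partition, here ordered by the price itself rather than by timestamp):

```python
def cume_dist(ordered_values):
    """(number of rows ordered before the current row or tied with it)
    divided by (total rows), as SQL's cume_dist() computes."""
    n = len(ordered_values)
    return [sum(1 for w in ordered_values if w <= v) / n
            for v in ordered_values]

# Made-up prices in one stock's partition.
prices = [100, 101, 101, 105]
print(cume_dist(prices))  # [0.25, 0.75, 0.75, 1.0] -- tied rows share a value
```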


3.6 Advanced analysis

In time-series scenarios, row-column transposition is used to transform the data model, mostly for ad-hoc analysis.

3.6.1 Row to column

=# SELECT name,
          max(CASE WHEN attribute='age' THEN value ELSE 0 END) AS age,
          max(CASE WHEN attribute='height' THEN value ELSE 0 END) AS height,
          max(CASE WHEN attribute='weight' THEN value ELSE 0 END) AS weight,
          max(CASE WHEN attribute='shoe_size' THEN value ELSE 0 END) AS shoe_size
   FROM t
   GROUP BY name
   ORDER BY age DESC;

Table t original structure example:

+------+-----------+-------+
| name | attribute | value |
+------+-----------+-------+
| John | age       | 30    |
| John | height    | 175   |
| John | weight    | 70    |
| John | shoe_size | 9.5   |
| Mary | age       | 25    |
| Mary | height    | 160   |
| Mary | weight    | 55    |
| Mary | shoe_size | 8     |
+------+-----------+-------+

Example of row-to-column results:

+------+-----+-------+--------+-----------+
| name | age | height| weight | shoe_size |
+------+-----+-------+--------+-----------+
| John | 30  | 175   | 70     | 9.5       |
| Mary | 25  | 160   | 55     | 8         |
+------+-----+-------+--------+-----------+

Key Points:

  • This row-to-column pattern is useful when you need to turn row data into columns for comparison or presentation. Using conditional expressions to map each attribute to its own column, and taking the maximum (or another aggregate) per attribute, makes the data easy to view and display.

3.6.2 Column to row

=# SELECT currenttimestamp, 
          deviceid, 
          devicetemplatecode,
          statisticstype,
          (b.rec).key AS key, 
          (b.rec).value AS value 
   FROM
     (SELECT currenttimestamp, 
             deviceid, 
             devicetemplatecode,
             statisticstype,
             jsonb_each_text(row_to_json(t.*)::jsonb-'currenttimestamp'-'deviceid'-'devicetemplatecode'-'statisticstype') AS rec  
     FROM t
   ) b
   WHERE (b.rec).value IS NOT NULL;

Table t original structure example:

+---------------------+----------+--------------------+----------------+--------+--------+--------+
| currenttimestamp    | deviceid | devicetemplatecode | statisticstype | key1   | key2   | key3   |
+---------------------+----------+--------------------+----------------+--------+--------+--------+
| 2023-11-13 08:30:45 | 123456   | template1          | type1          | value1 | value2 | value3 |
+---------------------+----------+--------------------+----------------+--------+--------+--------+

Example of the column-to-row result:

+---------------------+----------+--------------------+----------------+------+--------+
| currenttimestamp    | deviceid | devicetemplatecode | statisticstype | key  | value  |
+---------------------+----------+--------------------+----------------+------+--------+
| 2023-11-13 08:30:45 | 123456   | template1          | type1          | key1 | value1 |
| 2023-11-13 08:30:45 | 123456   | template1          | type1          | key2 | value2 |
| 2023-11-13 08:30:45 | 123456   | template1          | type1          | key3 | value3 |
| ...                 | ...      | ...                | ...            | ...  | ...    |
+---------------------+----------+--------------------+----------------+------+--------+

Key Points:

  • The column-to-row operation converts each row into a JSON object with the row_to_json() function, then expands the JSON object into key-value pairs with the jsonb_each_text() function.
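
The same unpivot can be sketched in Python, treating a row as a dict the way row_to_json() produces one (made-up data; key3 is NULL here to show the non-NULL filter):

```python
# Made-up wide row, mirroring table t's structure.
row = {
    "currenttimestamp": "2023-11-13 08:30:45",
    "deviceid": "123456",
    "devicetemplatecode": "template1",
    "statisticstype": "type1",
    "key1": "value1",
    "key2": "value2",
    "key3": None,
}

# The columns removed with the "- 'col'" operators in the SQL.
fixed = {"currenttimestamp", "deviceid", "devicetemplatecode", "statisticstype"}

# Equivalent of jsonb_each_text() on the remaining keys,
# keeping only non-NULL values.
long_rows = [
    {**{k: row[k] for k in fixed}, "key": k, "value": v}
    for k, v in row.items()
    if k not in fixed and v is not None
]
for r in long_rows:
    print(r["key"], r["value"])  # key1 value1 / key2 value2
```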


3.7 Indicator real-time tags

Show the status of a machine, account, or person:

=# SELECT first_name,
          last_name,
          salary,
    CASE
     WHEN salary >= 80000 THEN 'senior'
     WHEN salary >= 60000 THEN 'intermediate'
    ELSE 'junior'
    END AS employee_level
   FROM employees;

Key Points:

  • Use the CASE WHEN expression to derive a composite label:
    • If salary is greater than or equal to 80000, employee_level is senior.
    • If salary is greater than or equal to 60000 and less than 80000, employee_level is intermediate.
    • Otherwise (salary less than 60000), employee_level is junior.


3.8 Peak Detection

Find the top 10 peaks of a specified indicator for a stock within a specified time range, along with the time each peak occurred:

=# SELECT  stock_id, timestamp, value FROM (
    SELECT stock_id, timestamp, value,
           row_number() OVER (PARTITION BY stock_id ORDER BY value DESC) AS rn
    FROM t_test
      WHERE stock_id ='1'
      AND timestamp >= '2021-12-01 00:00:00'
      AND timestamp < '2021-12-03 00:00:00'
   ) AS derived_table
   WHERE rn <= 10;

Key Points:

  • The query uses a subquery and the window function row_number() to sort table t_test and return the ten rows with the highest value per stock.


3.9 Traceability Query

Used for traceability queries, such as recursively traversing metrics stored in the JSONB data type:

=# WITH RECURSIVE json_recursive AS (
    SELECT id,
           'person' AS key,
           data->'person' AS value
    FROM my_table
    UNION ALL
    SELECT j.id,
           k AS key,
           v AS value
    FROM json_recursive j,
         jsonb_each(j.value) AS kv(k, v)
    WHERE jsonb_typeof(j.value) = 'object'
   )
   SELECT id, key, value
   FROM json_recursive;

Key Points:

  • The query uses a recursive CTE (WITH RECURSIVE) with two subqueries (SELECT statements).
    • The anchor part is the first subquery, which selects the id column from table my_table, the fixed key name person, and the corresponding value as the initial result set.
    • The recursive part is the second subquery, which iterates by joining json_recursive with the result of the jsonb_each() function. In each iteration, the current JSON value (j.value) is broken into key-value pairs by jsonb_each(), and each key (k) and value (v) becomes an iteration result. The condition jsonb_typeof(j.value) = 'object' in the WHERE clause restricts recursion to JSON values of type object, ensuring the recursion terminates.
  • The final query selects the id, key, and value columns from json_recursive to obtain the complete traceability information.
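
The recursion can be sketched in Python over an ordinary nested dict (a made-up document standing in for the data column):

```python
# Made-up nested JSON document.
doc = {"person": {"name": "John",
                  "address": {"city": "Oslo", "zip": "0150"}}}

def walk(key, value, out):
    """Emit (key, value) pairs, recursing into objects the way the
    recursive CTE recurses via jsonb_each()."""
    out.append((key, value))
    if isinstance(value, dict):  # jsonb_typeof(...) = 'object'
        for k, v in value.items():
            walk(k, v, out)
    return out

pairs = walk("person", doc["person"], [])
for k, v in pairs:
    print(k)  # person, name, address, city, zip
```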


3.10 Machine Learning

Machine learning can be used for prediction and classification.

For example, based on data from December 1 to 10, a linear regression model is fitted to predict the indicator on December 11:

=# SELECT '2021-12-11 00:00:01' AS timestamp,
    extract(epoch FROM '2021-12-11 00:00:01'::timestamp) * slope + intercept AS c1_value
   FROM (
    SELECT regr_slope(value, extract(epoch from timestamp)) AS slope,
         regr_intercept(value, extract(epoch from timestamp)) AS intercept
    FROM t_test
    WHERE vin = '1'
        AND timestamp >= '2021-12-01 00:00:00'
        AND timestamp < '2021-12-11 00:00:00'
   ) AS liner_regression;

Key Points:

  • This query uses a subquery and the linear regression functions regr_slope() and regr_intercept() to run a linear regression over table t_test. They compute the slope and intercept of a linear regression model, enabling simple machine-learning tasks in SQL.
    • The subquery selects the records with vin = '1' and timestamp greater than or equal to 2021-12-01 00:00:00 and less than 2021-12-11 00:00:00, and computes the slope and intercept of value regressed on timestamp (as epoch seconds).
    • The outer query then fixes the timestamp at 2021-12-11 00:00:01 and uses the slope and intercept to compute the corresponding predicted value, named c1_value.
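
The math behind regr_slope() and regr_intercept() is ordinary least squares, sketched here in Python with made-up (epoch-seconds, value) samples:

```python
# Made-up (epoch_seconds, value) samples, standing in for
# (extract(epoch from timestamp), value) rows of t_test.
xs = [0.0, 10.0, 20.0, 30.0]
ys = [1.0, 2.0, 3.0, 4.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope = covariance(x, y) / variance(x).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

predict_at = 40.0                      # a future timestamp, in epoch seconds
print(predict_at * slope + intercept)  # extrapolated value
```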