Typical query in OLAP and OLTP scenarios

This document gives examples of typical query statements in OLAP and OLTP scenarios. Since most scenarios are fusion in actual application, this article is not strictly classified.

Notes!
Understanding the query requirements of a scenario is one of the keys to building an accurate [data model] (/doc/latest/datamodel/guidebook).


1 OLAP

OLAP is Online Analysis Processing, which is online analysis and processing scenarios. It is also often called a data warehouse scenario and is mostly used for decision support. It can perform complex data analytics for multi-angle observation by data scientists, business analysis and knowledge workers, supporting business intelligence (BI), data mining and other decision-making support applications.


1.1 Ranking Calculation

Example:

=# SELECT *, 
          row_number() OVER (ORDER BY c1 DESC),
          rank() OVER (ORDER BY c1 DESC),
          dense_rank() OVER (ORDER BY c1 DESC)
   FROM t1;

Key Points:

  • Use window functions row_number(), rank(), dense_rank() to calculate different types of rankings:
    • The row_number() function assigns a unique successful ranking value to each row in the result set, which is arranged in descending order of the c1 column.
    • The rank() function assigns a ranking value to each different c1 value in the result set, sorting it in descending order of the c1 column. If there is the same c1 value, the corresponding ranking will be skipped.
    • The dense_rank() function assigns a ranking value to each different c1 value in the result set, sorting it in descending order of the c1 column. If there is the same c1 value, the corresponding ranking will be retained and will not be skipped.


1.2 First and last values

Example:

=# SELECT *,
          first_value(c1) OVER (ORDER BY c1 DESC),
          last_value(c1) OVER (ORDER BY c1 DESC)
   FROM t1;


1.3 Window frame/sliding window

Example:

=# SELECT *,
          SUM(c1) OVER (ORDER BY c1 DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_c1 -- 添加框架声明,使得窗口涵盖从第一行到最后一行的所有行
   FROM t1;

Use window frames to control the range of aggregate calculations, such as calculating the average of each month's income and the previous two months:

=# SELECT month,
          revenue, 
          avg(revenue) OVER (ORDER BY month ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
   FROM t;

Key Points:

  • Use window frame declarations to define the boundaries of the window.


1.4 Distribution Statistics

Inquiry of population records in each city where income is between 40% and 70% of the city’s (inclusive).

=# SELECT id,name,salary,city
   FROM (
    SELECT *,
           ntile(10) OVER (
            PARTITION BY city
            ORDER BY salary DESC
           ) AS bucket
    FROM city_salary
   ) AS t
   WHERE t.bucket >= 4 AND t.bucket <= 7;

Key Points:

  • This query uses the window function ntile() to bucket the records in the city_salary table and filters out the records from the 4th to the 7th bucket from 10 buckets.


1.5 Offset calculation

Calculate the proportion of changes in monthly income relative to the previous month:

=# WITH detail(m, r, l) AS (
    SELECT month, 
           revenue,
           lag(revenue, 1)  OVER (ORDER BY month) AS last_month_revenue
    FROM sales
   )
   SELECT m, ((r-l)::float8/l * 100)::decimal(18,2) || '%'
   FROM detail;

Key Points:

  • Use common table expressions (CTE) to calculate sales revenue for each month and the percentage increase compared to the previous month.
  • (r-l)::float8/l * 100)::decimal(18,2) || '%': This expression is used for calculation of the growth percentage. It subtracts sales revenue from the current month from the previous month and divides the result from the previous month's sales revenue. Finally, use type conversion to convert the result to percentage format and retain the accuracy of the two decimal places.


1.6 Conducting aggregation query on windows

Check the proportion of monthly revenue to the current quarterly revenue:

=# SELECT *,
          revenue::real / quarter_revenue * 100 AS percent_of_quarter
   FROM (
   SELECT quarter,
          month,
          revenue,
          sum(revenue) OVER (PARTITION BY quarter) AS quarter_revenue
   FROM sales2
   ) AS t;


1.7 Change points

Change points list records that have changed compared to the previous one. This type of query is suitable for finding changes in relatively stable data sets:

=# SELECT time, read FROM (
        SELECT time,
            read,
            read - lag(read) OVER (ORDER BY TIME) AS diff
        FROM disk
        WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
        AND tag_id = 1 ) ht
    WHERE diff IS NULL OR diff != 0
    ORDER BY time;

Key Points:

  • Use the window function lag() to calculate the difference between the read value of each time point and the read value of the previous time point.
  • Filtering condition WHERE diff IS NULL OR diff != 0 means that only time-point records with a difference value of NULL or non-zero are filtered out.


1.8 Continuous percentage

Calculate the value of the disk array sorted by tag_id, where the read speed is at 20% position and the write speed is at 30% position:

=# SELECT tag_id,
    percentile_cont(0.2) WITHIN GROUP (ORDER BY read) AS read,
    percentile_cont(0.3) WITHIN GROUP (ORDER BY write) AS write
    FROM disk
    GROUP BY tag_id
    ORDER BY tag_id;

If the function parameter value is 0.5, it is equivalent to calculating the median.


1.9 Index comparison analysis

For example, using multi-table association aggregation to compare the indicators of a retail store during holidays:

=# WITH holiday_rst AS (
    SELECT year,holiday,date,row_number() OVER(PARTITION BY year,holiday ORDER BY date)
    AS the_day_of_the_holiday
    FROM dim_holiday_info
    )
   SELECT t.year,t.holiday,t.date,t.the_day_of_the_holiday,b.date AS corresponding_date_of_previous_year
   FROM holiday_rst t
   LEFT JOIN holiday rst b
   ON t.year = b.year + 1
   AND t.holiday = b.holiday
   AND t.the_day_of_the_holiday = b.the_day_of_the_holiday;

Key Points:

  • This query uses a common table expression to calculate the dates for each holiday and number the dates for each holiday in each year.
  • The query results will contain the holiday information for the current year and the corresponding date of the same holiday in the previous year.


1.10 Multidimensional Data Analysis

In the field of business intelligence (BI), it is necessary to see through a data set from different dimensions, and through data visualization, the knowledge contained in the data is presented to decision makers to realize their business value.

Data visualization is an important part of multi-dimensional data analysis. Through specific BI analysis software, the array of data cubes output by the database is depicted into graphs, images and metering, so that the inherent statistical features of big data can be identified and perceived by decision makers, and can also be interacted and predicted.

1.10.1 Multi-dimensional aggregation calculation

Aggregation calculations are performed from multiple dimensions and the results are displayed together:

=# SELECT city,
          season, 
          category, 
          sum(value) 
   FROM bi_sales
   GROUP BY GROUPING SETS (city, season, category)
   ORDER by (city);
   city   | season | category  | sum
----------+-----------------------------------------------------------------------------------------------------------------------
 Beijing  |        |           | 1098
 Shanghai |        |           | 1089
 Guangzhou|        |           | 1080
          | Q3     |           | 1989
          |        | Groceries | 1089
          | Q1     |           |  189
          |        | Electrics |  999
          |        | Books     | 1179
          | Q2     |           | 1089
(9 rows)

It can be seen that the sum-aggregation results are shown separately:

  • Calculation of city dimensions: the sum of the value values ​​​​of each city in the city column;
  • Calculation of quarterly dimensions: the sum of the relative value values ​​​​​in each quarter in the season column;
  • Calculation of category dimensions: The sum of the value values ​​​​of each product category in the category column.

2 OLTP

OLTP is Online Transaction Processing, which is an online transaction processing scenario and is mostly used for daily transaction processing. It can quickly and accurately perform data processing for ATMs, online banking, cash registers, e-commerce and brick-and-mortar store purchases, as well as hotel and air ticket booking, and perform many database transactions (change, insert, delete or query operations) in real time.


2.1 Instant Query

Ad Hoc is the user who flexible selects query conditions according to their own needs. The system can quickly generate corresponding statistical reports based on user's choices. The biggest difference between ad-hoc query and ordinary application query is that ordinary application query is customized, while ad-hoc query is customized by users.

=# SELECT * FROM sales WHERE customer_id=100;