Vectorization best practices

This section discusses how to use vectorized execution engine efficiently.

1 Scene

First of all, for different scenarios, the performance improvement of vectorized execution engines is different compared to traditional row-oriented execution engines. You can have a basic understanding and understanding through the overview section.
For example, operators that can operate in full sequential order, such as general expression calculations, selection operators, and aggregation operators, can maximize performance improvements. For operators with randomness, such as sorting operators and Hash-operators, there is limited room for performance improvement.

2 Data Types

2.1 Use original types whenever possible

You need to use [primitive type] as much as possible (https://zh.wikipedia.org/wiki/%E5%8E%9F%E5%A7%8B%E5%9E%8B%E5%88%A5), such as int32, int64, etc., to improve the performance and efficiency of vectorization.
Generally, enumeration type fields are often used for filtering operations, such as country fields or Chinese province fields. They can be represented by strings, such as "China", "USA", or "Beijing", "Xinjiang", etc., but if you preprocess it and use smallint to represent it, there will be a huge performance improvement for filtering operations.

2.2 Use the type with the smallest length that covers the range of the data range as much as possible

If the age field in the data is sufficient to represent it with smallint, then there is no need to use int32 or even int64 to represent it. For traditional execution engines, no matter what the numeric type is, the tuple in the final result is often represented by Datum, which occupies 64bits. Datum can represent a single value of any SQL data type, but itself does not contain any information about the data type. Therefore, using different types has no effect on processing speed. However, for vectorized execution engines, data is continuously and closely arranged. The shorter the value type, the more CPU cache-friendly, and the SIMD registers can store more data. Therefore, filtering or aggregating a certain column, using int64 will be four times slower than int16.

3 Performance analysis and optimization

3.1 Use only vectorized execution engine

Notes!
Due to time and resource constraints, we have not implemented data types, expressions and queries supported by all database standards. However, in order to ensure availability, when we encounter unsupported situations, we will partially or completely use the traditional execution engine for operations, which is reflected in the performance improvement that is not obvious or not improved.

You can manually turn off the matrix.enable_mxv_fallback_expression parameter to determine whether the query uses the traditional execution engine, which is enabled by default.

Parameters Description
matrix.enable_mxv_fallback_expression For expressions that cannot be processed by the current vectorized execution engine, whether to fall back to the traditional execution engine execution is enabled by default
set matrix.enable_mxv_fallback_expression to off;

When matrix.enable_mxv_fallback_expression is closed and the following error is encountered, it means that the query contains data types, expressions, or queries that the current version of the vectorization engine cannot process.

not implemented yet

3.2 Hash Agg vs. Group Agg

Hash Agg and Group Agg are two specific implementations of aggregation operators, with their own applicable scenarios. Hash Agg uses a hash table to implement aggregation, and Group Agg aggregates the grouped data. Normally, the optimizer can select the appropriate operator for the query based on the statistical information. However, if the statistics are inaccurate or the query is not optimized enough, a non-optimal path may be selected. Here are two GUCs to help analyze and optimize queries.

Parameters Description
matrix.enable_mxv_aggregate Used to control whether the Group Agg function is enabled, and it is enabled by default
matrix.enable_mxv_hash_aggregate Used to control whether the Hash Agg function is enabled, and it is enabled by default

There are four possibilities in combination.

  • All are turned off, using the aggregation function of the traditional execution engine;
  • All are enabled, and the optimizer selects the best path;
  • Enable matrix.enable_mxv_aggregate, then force the implementation of Group Aggregation;
  • Enable matrix.enable_mxv_hash_aggregate, then force the implementation of Hash Aggregation.

3.3 Vectorization combined with MARS2 storage engine

3.3.1 Selection of MARS2 sort key

In MARS2, data is stored in an orderly manner. When creating a table, you need to create an index to sort the order. The fields involved in this sort order are called sort keys. The sort key can only be specified once, cannot be modified or deleted. In order to maximize the performance improvement brought by the order, it is best to choose fields that are frequently used and have good filtering effects as the sorting key.
MARS2 can specify a set of sort keys through the following SQL.

CREATE INDEX ON t USING mars2_btree (ci, cj, ck...);

The selection of sort keys is crucial to the performance of the query:

  • Ordered data can accelerate data scanning and block filtering of storage engines.
  • It is also helpful for sorting and aggregation. For example, storage is ordered according to c1 and c2.

Only sorted queries

SELECT * FROM t ORDER BY c1, c2;
SELECT * FROM t ORDER BY c1, c3;

The first query no longer requires sorting, and the vectorization execution engine will optimize the sorting operator. For the second query, c1 is already ordered, so the vectorization execution engine will have special optimizations for this situation, only sorting different c3s of the same c1.

Only gathered queries

SELECT sum(c1), sum(c2) FROM t GROUP BY c1, c2;
SELECT sum(c1), sum(c3) FROM t GROUP BY c1, c3;

If you use Group Agg, it can greatly improve performance similarly to sorting, and you only need to scan the data in sequence for aggregation operations. For Hash Agg, whether the data is ordered has little effect on performance. As mentioned earlier, the vectorized execution engine of sequential operators can maximize its performance, so using Group Agg is often less time-consuming for the above two queries, especially the first one.
In summary, it is necessary to analyze and combine business and select the sort key based on the characteristics of the query (columns that need to be sorted, aggregated, and filtered).

3.3.2 minmax index

According to the sorting key, the data stored in MARS2 is ordered, and a continuous and ordered piece of data is called Run. Run's meta information stores the minimum maximum value, which is used to filter when querying.
Because it is filtered through meta information, the data itself does not need to be loaded, and its I/O overhead is lower than sequential scanning (visiting the data itself first, and then filtering) and can be filtered faster. When using MARS2 to create tables, the minimum maximum value will not be recorded by default and needs to be declared explicitly.
The minmax index not only enables the storage engine to perform block filtering, but also accelerates the calculation of the vectorized execution engine. Here are examples of its function: First, a minmax index is established on all columns of the MARS2 table (it can be established as needed in actual work).

CREATE TABLE fast_filter (
    tag int encoding (minmax),
    ts timestamp encoding (minmax),
    name text encoding (minmax),
    region int encoding (minmax)
    ) USING MARS2;
CREATE INDEX ON fast_filter USING mars2_btree(tag);

We want to query data from a certain area this year

SELECT * FROM fast_filter WHERE ts > '2022-01-01' AND region = 2;

From the perspective of the storage engine, if the maximum value on the ts of a block (several rows containing a column of data) is less than 2022-01-01, the storage engine no longer reads this block, saving I/O resources, and reducing the engine's computing effort. The vectorized execution engine will accelerate calculations using minmax information. For example, if the maximum and minimum values ​​of a certain region are both 2, then the execution engine will only apply the predicate ts > '2022-01-01' to filter out the data that meets the conditions, without performing any Region-related operations to improve performance.
The minmax index and sort keys are set independently, the former can be used for all columns, while the latter tends to select only a few columns. The impact on operators such as filtering and sorting is different, but there are also crossovers, so you can comprehensively consider using these two to give full play to their respective strengths.

3.3.3 Disable index scanning

MARS2 provides index scanning, but the current version of the vectorized execution engine only connects to the minmax index and sequential scanning, and does not connect to the index scanning. Therefore, for some queries, although the switch of the vectorized execution engine is turned on, the vectorized execution engine is not used. The words Custom Scan (sortheapscan) will appear in the plan. At this point, the vectorized execution engine can be used by disabling index scanning.

set enable_indexscan to off;

For more details, see Storage Engine