A brief introduction to the principles of the storage engines


1 MARS2 Overview

The MARS2 storage engine is designed primarily for high-speed data loading and querying; it stores data in sorted order to speed up searches.

1.1 Internal Principles

Like MARS3, MARS2 tables are stored using the LSM Tree structure.

The internal schematic diagram of MARS2 is as follows:

![](https://img.ymatrix.cn/ymatrix_home/MARS2_single Segment single table (screenshot 2)_1688719834.png)

We will walk through the diagram above concept by concept.

1.1.1 Sort key

  • In MARS2, data is stored in sorted order. When creating a table, you specify the sort order by creating an index; the fields involved in this sort order are called the sort key.
  • The sort key can only be specified once; it cannot be modified or deleted afterwards.
  • To get the most benefit from the ordering, choose fields that are frequently queried and filter well as the sort key. For example, in a device monitoring table, the event timestamp and device ID are good sort keys.
  • If a sort key column is of a text type and byte-order sorting is acceptable, declaring COLLATE "C" on that column can speed up sorting.
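
Following the table-creation style used later in this section, here is a minimal sketch of a table whose text sort-key column uses the C collation. The table and column names are illustrative, not from this document:

=# CREATE TABLE logs (
    ts              timestamp,
    dev_id          text COLLATE "C",  -- byte-order collation speeds up sorting
    message         text
) USING MARS2;

-- ts and dev_id together become the sort key
=# CREATE INDEX ON logs
USING mars2_btree (ts, dev_id);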

1.1.2 Run and meta information

  • Thanks to the sort key, the data stored in MARS2 is ordered; a contiguous, ordered chunk of data is called a Run.
  • To locate a Run on disk, its storage location is recorded; this record is called the Run's meta information.
  • The meta information also records the minimum and maximum values of the Run, so that Runs can be filtered during queries.
  • When data is inserted, it is first sorted in memory, so the size of a Run is bounded by the sort memory size.

1.1.3 Merge

  • If the data ranges of Runs overlap, queries suffer read amplification and run less efficiently. Therefore, when the number of Runs on disk exceeds a threshold, MARS2 loads multiple on-disk Runs into memory, sorts them, and writes them out as a single Run. This process is called a merge.
  • During a merge, the data remains readable and writable:
    • Reads only see the merge's input files, never its partial output
    • The merge process does not read newly written data
    • Reads, writes, and merges do not block one another

1.1.4 Level

  • To keep the merge input files similar in size (avoiding merges of oversized files with tiny ones), Runs are organized into Levels. There are 3 levels: L0, L1, and L2.
  • When merged data exceeds a certain size, it is promoted to a higher level. The process is as follows:
      1. Newly inserted data lands in L0. When the number of Runs reaches a threshold (configurable; see the configuration parameter mars2_automerge_threshold), an L0 merge is triggered, merging all L0 Runs into one Run.
      2. If the resulting Run exceeds 25MB (configurable; see the configuration parameter level0_upgrade_size), it is promoted to L1.
      3. If, after the promotion, the total size of the L1 Runs exceeds 1000MB (configurable; see level1_upgrade_size), a merge is triggered, merging all L1 Runs into one Run.
      4. If the resulting Run exceeds 1000MB (configurable; see level1_upgrade_size), it is promoted to L2.
      5. Runs that reach L2 are no longer merged.

1.1.5 Column storage

  • MARS2 stores data in columnar format. When reading from disk, only the columns actually used need to be accessed, which reduces I/O.
  • All values in a column share the same data type, which makes them easier to compress and saves disk space.
  • Columnar storage also lends itself to vectorized execution, which can greatly accelerate queries. For details, see [Vectorized Execution Engine](/doc/5.0/reference/mxvector/overview).

1.1.6 MINMAX Filtering

  • As mentioned earlier, the meta information of a Run stores its minimum and maximum values for filtering during queries.
  • Because filtering happens via the meta information, the data itself does not need to be loaded; the I/O overhead is lower than a sequential scan (which reads the data first, then filters), so filtering is faster.
  • When creating a MARS2 table, the minimum and maximum values are not recorded by default; this must be declared explicitly.

First create the matrixts extension:

=# CREATE EXTENSION matrixts;

Then explicitly declare that minimum and maximum values should be recorded:

=# CREATE TABLE metrics (
    ts              timestamp   ENCODING(minmax),
    dev_id          bigint      ENCODING(minmax),
    power           float,
    speed           float,
    message         text
) USING MARS2;

Then create the mars2_btree index:

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

  • MINMAX can be defined on both sort-key and non-sort-key columns; the effect is more pronounced on sort-key columns. In the declaration, minmax may be written in all uppercase or all lowercase.
  • MINMAX can be defined on at most 32 columns.

1.1.7 MARS2 Btree Index

  • In addition to MINMAX, MARS2 has a built-in index (the one created along with the table).
  • Currently only one such index can be defined, i.e., there is only one global sort order.
  • Unlike an ordinary Btree index, the MARS2 Btree index:
    • is sparse, and therefore smaller
    • generates no random I/O during index scans, since the data itself is ordered

1.1.8 Compression

Same as MARS3.

  • By default, all data columns are compressed with lz4.
  • An [encoding chain compression algorithm](/doc/5.0/reference/storage/compression) can also be specified manually, either for the whole table or for a single column.
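
As an illustration only: assuming Greenplum-style ENCODING options as described in the linked compression document (the compresstype and compresslevel option names here are assumptions, not confirmed by this document), a per-column specification might look like:

=# CREATE TABLE metrics_c (
    ts              timestamp   ENCODING (compresstype=lz4),
    power           float       ENCODING (compresstype=zstd, compresslevel=3),  -- assumed option names
    message         text
) USING MARS2;

=# CREATE INDEX ON metrics_c
USING mars2_btree (ts);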

1.1.9 Data writing

  • Data written via INSERT first lands in memory, then goes through the sorting process described above to produce a Run.
  • If the amount of inserted data exceeds the sort memory, multiple Runs are produced.
  • The sort memory size is configurable; see Section 1.2.2 Configuration Items for details.

1.1.10 Update and Delete

  • MARS2 does not currently support UPDATE or DELETE.
    • Updates can be emulated with Unique Mode.
  • Deletions can only be performed by dropping or truncating partitions of a partitioned table.
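
As a sketch: on a hypothetical time-partitioned MARS2 table named metrics, expired data could be removed one whole partition at a time. The child-partition name and the DROP PARTITION syntax below follow Greenplum conventions and are assumptions, not confirmed by this document:

-- Truncate one month's child partition (name is illustrative)
=# TRUNCATE TABLE metrics_1_prt_p202301;

-- Or drop the partition entirely (Greenplum-style syntax, assumed)
=# ALTER TABLE metrics DROP PARTITION FOR ('2023-01-01 00:00:00'::timestamp);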

1.2 MARS2 usage

1.2.1 Creating a MARS2 table

On the premise that the matrixts extension has been created, the simplest way to create a table is to add the USING clause to the CREATE TABLE statement and create an index. For extended examples, see [Table Design Best Practices](/doc/5.0/reference/storage/table_design).

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS2;

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

1.2.2 Configuration Items

Notes!
A table-level configuration item below is one that can only be set via the WITH clause when creating a table; it applies to a single table and cannot be modified once set. A global configuration item is one that can be set at the session or system level; system-level changes require running mxstop -u to take effect. For more information, see [Data Table Configuration Parameters](/doc/latest/reference/configuration_parameters/database_table_parameters#mars2).

The following parameters are used to control merges (Merge), and the way they affect merges is shown in the Level section above.

| Merge control parameter | Unit | Default | Range | Description |
| --- | --- | --- | --- | --- |
| mars2_automerge_threshold | Run | 32 | 10 - 2048 | Controls, for all MARS2 tables, how many L0 Runs trigger a merge; a global configuration item. To set this for an individual table, use the table option level0_merge_threshold. |
| level0_merge_threshold | Run | 32 | 1 - 2048 | Controls how many L0 Runs trigger a merge for a single table; a table-level configuration item. |
| level0_upgrade_size | MB | 25 | 1 - 10000 | Controls the single-table L0 -> L1 promotion size. When an L0 merge's result Run exceeds this size, it is promoted to L1; a table-level configuration item. |
| level1_upgrade_size | MB | 1000 | 1 - 10000 | Controls the single-table L1 -> L2 promotion size. When an L1 merge's result Run exceeds this size, it is promoted to L2; a table-level configuration item. |
  • If each insert carries little data (one Run per insert), merges are triggered frequently, which is inefficient. In that case, increase mars2_automerge_threshold / level0_merge_threshold to feed more Runs into each merge and reduce the merge frequency.
  • When an L0/L1 merge output is not large enough to be promoted, the next merge round merges it again together with new Runs, causing write amplification. To avoid this, lower level0_upgrade_size / level1_upgrade_size so the output is promoted to the next level sooner.
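
For example, using the session-level SET syntax shown later in this section, the global merge threshold could be raised for a session performing many small inserts (the value 64 is illustrative and lies within the documented 10 - 2048 range):

=# SET mars2_automerge_threshold TO 64;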

The following parameter controls compression and tunes the compression effect. If set too low, the compression effect is limited; if set too high, more memory is consumed.

| Compression control parameter | Unit | Default | Range | Description |
| --- | --- | --- | --- | --- |
| compress_threshold | tuple | 1200 | 1 - 100000 | Compression threshold: the upper limit on the number of tuples compressed as one unit in a single table; a table-level configuration item. |

The following parameters control the sort memory used when inserting data. When inserting into multiple partitions, each partition is first allocated the memory configured by mars2_sort_mem_core; if that is insufficient, the allocation grows, but the total never exceeds mars2_sort_mem.

| Sort memory parameter | Unit | Default | Range | Description |
| --- | --- | --- | --- | --- |
| mars2_sort_mem | KB | 2097152 (2GB) | 128 - 2147483647 (~2048GB) | Controls the sort memory size of each individual insert. If the insert target is a partitioned table, the partitions share this amount; a global configuration item. |
| mars2_sort_mem_core | KB | 16384 (16MB) | 128 - 2147483647 (~2048GB) | Controls the minimum sort memory allocated to each single partition; a global configuration item. |

Example of table-level configuration items:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) 
USING MARS2
WITH (compress_threshold=1200,level0_merge_threshold=32);

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

Example of modifying a global configuration item at the session level:

=# SET mars2_sort_mem TO 2097152;

Example of modifying a global configuration item at the system level:

=# \q
$ gpconfig -c mars2_sort_mem -v 2097152
$ mxstop -u

2 HEAP Overview

HEAP is YMatrix's default storage engine, also known as heap storage. Inherited from PostgreSQL, it supports only row storage (neither column storage nor compression) and supports partitioned tables. It is implemented on top of the MVCC mechanism and suits scenarios with heavy update and delete workloads.

2.1 Using the MVCC mechanism

The MVCC (Multiversion Concurrency Control) mechanism, usually called multi-version management, centers on how data updates and deletions are handled.
Under multi-version management, updates and deletions do not necessarily modify the original data in place. Instead, a new version is created: the original row is marked invalid and the new data is written as a new version, so each row carries version information and historical versions are retained.
Under MVCC, when a HEAP table processes UPDATE and DELETE operations it does not actually remove the data; it merely hides old rows via their version information (controlling data visibility). Consequently, if a HEAP table receives large numbers of updates or deletes, its physical footprint keeps growing, and old data must be cleaned up regularly in a planned manner.

2.2 HEAP usage

You can use the following SQL statement to create a HEAP table in YMatrix.

=# CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);
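
To illustrate Section 2.1 on this table: updates and deletes leave dead row versions behind, and the standard PostgreSQL VACUUM command reclaims that space. The statements below are illustrative:

-- Each updated or deleted row leaves a dead version on disk
=# UPDATE disk_heap SET read = read + 1 WHERE tag_id = 42;
=# DELETE FROM disk_heap WHERE time < now() - interval '30 days';

-- Dead versions still occupy space until VACUUM reclaims it
=# VACUUM disk_heap;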

3 AO Overview

Tables using the AOCO and AORO storage engines are collectively called AO (Append-Optimized) tables, also known as append-optimized tables. They support insert, update, and delete operations as well as compression.
AORO supports row storage; AOCO supports column storage.
AO tables differ greatly from HEAP tables in both the logical and physical structure of the table. As mentioned in the previous section, HEAP tables use the MVCC mechanism to control the visibility of data after updates and deletes, whereas AO tables implement this with an auxiliary bitmap table whose content indicates which rows in the AO table are visible. AO tables with many updates and deletes also need regular, planned cleanup of old data; however, on AO tables the cleanup tool VACUUM must reset the bitmap and compact physical files, so it is usually slower than on HEAP tables.
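
For illustration, AO tables are typically created with storage options in the WITH clause. The option names below (appendonly, orientation, compresstype) follow Greenplum conventions and are assumptions here, as this document does not specify YMatrix's exact syntax:

-- Row-oriented AO table (AORO)
=# CREATE TABLE events_aoro (
    ts              timestamp,
    dev_id          bigint
) WITH (appendonly=true, orientation=row);

-- Column-oriented AO table (AOCO) with compression
=# CREATE TABLE events_aoco (
    ts              timestamp,
    dev_id          bigint
) WITH (appendonly=true, orientation=column, compresstype=zstd);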

Notes!
For details of the MARS2, HEAP, and AO storage engines, please refer to [Table Design Best Practices](/doc/5.0/reference/storage/table_design).