A Brief Overview of Storage Engine Principles


1 MARS3 Overview

The MARS3 storage engine uses a row-column hybrid storage method to achieve high-speed data writing and extreme compression.

MARS3 supports data updates and deletions through the UPDATE and DELETE clauses (except in Unique Mode).

MARS3 supports adding and deleting columns, as well as COPY and pg_dump operations.

1.1 Internal Mechanism

For each MARS3 single table, the internal storage uses an LSM Tree structure. An LSM Tree (Log-Structured Merge Tree) is a hierarchical, ordered, disk-oriented data structure. Its core principle is to fully leverage disk performance for batch sequential write operations, achieving significantly higher performance than random writes.

The sections below analyze the internal architecture of MARS3 layer by layer, in conceptual terms.

1.1.1 Sort Key

  • In MARS3, data is stored in an ordered manner. When creating a table, you must specify the sorting order by designating a sort column (multiple columns are allowed). The fields involved in this sorting order are referred to as sort keys.
  • Sort keys can only be specified once and cannot be modified, added, or deleted.
  • To maximize the performance benefits of ordering, it is best to choose fields that are frequently used and have good filtering effects as sort keys. For example, in a device monitoring table, event timestamps and device IDs can be used as sort keys.
  • If the sort key is a text type and can be sorted by byte order, using COLLATE C on this column can accelerate sorting.
  • The SQL keyword for specifying the sort key is: ORDER BY.
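For example, a minimal sketch of a device-monitoring table whose sort keys are a device tag and the event timestamp, with COLLATE "C" on the text column to enable byte-order sorting (the table and column names are illustrative):

=# CREATE TABLE device_events (
    ts      timestamp,
    dev_tag text COLLATE "C",   -- byte-order collation accelerates sorting on this key
    power   float
) USING MARS3
ORDER BY (dev_tag, ts);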

1.1.2 Row Storage and Column Storage

  • MARS3 supports a hybrid storage method in which data is written in row format first and converted to column format later; it also supports writing directly in column format. You can choose the loading mode with the prefer_load_mode parameter (see the configuration items below, and the sketch after this list).
  • If the hybrid row-column storage model is used, written data is first stored in row format and then converted to column format once the data volume reaches a threshold.
  • Compared to writing data directly in column format, this has the following advantages:
    • Faster write speed for high-frequency, small-batch data
    • No need for large amounts of memory for data caching
    • Ensures uniform number of tuples in each data block
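If the direct column-storage path is preferred, the loading mode can be chosen at table creation. A minimal sketch, assuming prefer_load_mode is passed through mars3options as described in "1.2.2 Configuration Items" below:

=# CREATE TABLE metrics_bulk (
    ts     timestamp,
    dev_id bigint,
    power  float
) USING MARS3
WITH (mars3options='prefer_load_mode=bulk')  -- bulk: write column-store Runs directly at L1
ORDER BY (dev_id, ts);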

1.1.3 Run, Incremental Files, and Metadata

  • Based on the sort key, the data stored in MARS3 is ordered. A continuous segment of ordered data is referred to as a Run.
  • Runs are divided into two types. To enable high-speed writing, inserted data is stored in row-based Runs. Subsequently, to facilitate reading and compression, row-based Runs are converted into column-based Runs.
  • Each Run has its own set of incremental files. In addition to the Data file that holds the main data, there are Toast files that store large data blocks, Delta files that store deletion information, Index files that store index information, and Link files that store merge information (see the next section, “Merge and Reclaim”). The incremental files for row-stored and column-stored Runs differ slightly.
  • Additionally, to track the locations of these files, we record supplementary information known as metadata. Metadata includes the file's storage location, file size, number of data rows, compression status, and other details.
  • The size of a Run (configured using the rowstore_size parameter) can be flexibly adjusted to optimize performance across different scenarios.

1.1.4 Merge and Reclaim

  • If the data ranges of Runs overlap, reads must touch multiple Runs (read amplification), which reduces query efficiency. Therefore, when the number of Runs on disk exceeds a certain threshold, MARS3 merge-sorts multiple on-disk Runs and outputs a single Run. This process is referred to as merging.
  • During the merging process, data remains readable and writable:
    • When reading data, only the merge's input files are read
    • When writing data, the merge process does not read newly written data
    • Reading, writing, and merging do not block each other
  • After a merge completes, the Runs that participated in it determine, based on transaction IDs, when they are no longer needed, and are then marked as recyclable.

1.1.5 Level

  • To ensure that the merged input files are of similar size (to avoid merging extremely large files with small ones), Runs are organized into Levels, with a maximum of 10 levels: L0, L1, L2...L9.
  • When the number of Runs in a level reaches a certain threshold, or the total size of the Runs in that level exceeds a threshold, a merge is triggered; the merged Run is then promoted to the next level. In addition, to accelerate promotion, multiple merge tasks can run simultaneously within the same level.

1.1.6 MARS3 Brin Index

  • MARS3 supports creating Brin indexes; they can also be dropped and re-created.
  • Each Run creates its own independent Brin index file when it is generated.
  • Example: CREATE INDEX brin_idx ON t1 USING mars3_brin(time,tag_id);

1.1.7 Compression

  • By default, all data columns in MARS3 are compressed using lz4.
  • Encoding chain compression algorithms can be specified manually, either for the entire table or for individual columns.
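A hypothetical sketch of a per-column setting, borrowing the column-level ENCODING syntax shown in the MARS2 section below; the actual option names for encoding chains are documented in the encoding chain compression reference:

=# CREATE TABLE metrics_z (
    ts      timestamp,
    message text ENCODING (compresstype=zstd)  -- hypothetical option names; check the compression reference
) USING MARS3
ORDER BY (ts);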

1.1.8 MVCC Mechanism Support

  • The MVCC (Multiversion Concurrency Control) mechanism is commonly known as multi-version management. Its core function is to handle updates and deletions of data.
  • Under multi-version management, updates and deletions are not necessarily performed on the original data. Instead, the original data is marked as invalid and the new data is written as a new version, so multiple versions of the data coexist. Each piece of data carries a version identifier, and all historical versions are retained.
  • In MARS3, update and delete operations do not modify the original data in place. Instead, they use Delta files and version information to mask out old data, thereby controlling data visibility.
  • Note: Continuously updating or deleting data in the same Run causes the physical space occupied by that Run's Delta files to keep growing; once all data in the Run has been deleted, the space stops growing. MARS3's merge operation automatically clears Dead data, and you can also schedule regular VACUUM runs to clean it up.
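For example, Dead data can also be reclaimed manually on any MARS3 table (using the metrics table from "1.2.1 Creating a MARS3 Table"):

=# VACUUM metrics;  -- merging also reclaims Dead data automatically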

1.1.9 Data Writing

  • Data is written to memory via INSERT, then flushed to a Run in L0.
  • The size of Run in L0 is configurable. See the configuration options section below for details.

1.1.10 Update and Delete

  • MARS3 uses DELETE for deletion. Deletions are recorded in the Delta file of the corresponding Run, and the data is physically removed during Run merging.
  • MARS3 uses UPDATE for updates. An update first deletes the original record and then inserts a new one.
  • MARS3's Unique Mode does not currently support deletion. Updates do not require an explicit UPDATE clause; an INSERT completes the operation automatically. To update the record associated with a given Unique Key (i.e., the key value corresponding to the sort key specified at table creation), simply insert a new record with the same Unique Key. For example, in CREATE TABLE mars3_t(c1 int NOT NULL, c2 int) USING MARS3 WITH (uniquemode=true) ORDER BY (c1, c2); the Unique Key is (c1, c2). See the sketch after the note below.

Notes!
If Unique Mode is enabled, the first field in the ORDER BY clause must be defined with a NOT NULL constraint.
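A minimal sketch based on the example above (note that the Unique Key here covers every column, so re-inserting the same key replaces the earlier row rather than duplicating it):

=# CREATE TABLE mars3_t(c1 int NOT NULL, c2 int)
   USING MARS3
   WITH (uniquemode=true)
   ORDER BY (c1, c2);
=# INSERT INTO mars3_t VALUES (1, 10);
=# INSERT INTO mars3_t VALUES (1, 10);  -- same Unique Key (1, 10): no duplicate row is created
=# SELECT * FROM mars3_t;               -- expected result: a single row (1, 10)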

1.2 MARS3 Usage

1.2.1 Creating a MARS3 Table

Assuming that the matrixts extension has already been created, the simplest way to create a table is to add the USING clause to the CREATE TABLE statement and append the ORDER BY clause. For more examples, see Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS3 
  ORDER BY (dev_id,ts);

Notes!
MARS3 tables support Brin indexes but do not require them; a MARS3 table must use the ORDER BY clause to define its sort key.


1.2.2 Configuration Items

Notes!
The configuration items in this section are table-level. They can only be set when creating a table, via the WITH (mars3options='a=1,b=2,...') clause (except compress_threshold, which is also supported by MARS2 and is set directly in the WITH clause). Each setting applies to a single table and cannot be modified once configured. For more information, see Data Table Configuration Parameters.

The following parameter adjusts the size of L0 Runs, which also indirectly controls the size of Runs at L1 and above.

| Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| rowstore_size | MB | 64 | 8 ~ 1024 | Controls when the L0 Run switches: once the written data exceeds this size, writing switches to the next Run |

The following parameter sets the compression threshold, which can be tuned to adjust the compression effect and improve read efficiency. If set too low, the compression effect is negligible; if set too high, compression consumes more memory.

| Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| compress_threshold | Tuple | 1200 | 1 ~ 100000 | Compression threshold. Controls how many tuples in each column of a single table are compressed at one time, i.e., the upper limit on the number of tuples compressed as one unit |

The following parameters are used to specify the loading mode of data in MARS3.

| Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| prefer_load_mode | | normal | normal / bulk | Data loading mode. In normal mode, newly written data first goes into a row-store Run in L0; after rowstore_size accumulates, it is converted into a column-store Run at L1. Compared with bulk mode, the row-to-column conversion is asynchronous, so this mode suits latency-sensitive, high-frequency, small-batch write scenarios with sufficient I/O capacity. In bulk mode, data is written directly into a column-store Run at L1, saving one round of I/O; the conversion becomes synchronous, so this mode suits latency-insensitive, low-frequency, large-batch write scenarios and systems with limited I/O capacity |

The following parameter specifies the amplification factor of Level sizes.

| Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| level_size_amplifier | | 8 | 1 ~ 1000 | Level size amplification factor. The threshold that triggers a merge at a given level is calculated as rowstore_size * (level_size_amplifier ^ level). The larger the value, the slower reads become and the faster writes become. Choose a value based on the workload (write/read ratio, compression ratio, etc.). Note: keep the number of Runs per Level from growing too large; otherwise query performance suffers and new data may even fail to be inserted |
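For example, with the defaults rowstore_size = 64 MB and level_size_amplifier = 8, the merge-trigger threshold works out to 64 MB * 8^0 = 64 MB at L0, 64 MB * 8^1 = 512 MB at L1, 64 MB * 8^2 = 4096 MB at L2, and so on.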

Configuration example:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS3
WITH (compress_threshold=1200,mars3options='rowstore_size=64')
ORDER BY (dev_id,ts);

1.2.3 Tool Functions

  • matrixts_internal.mars3_level_stats: checks the status of each Level of a MARS3 table; from it you can judge the table's health, e.g., whether Runs have merged as expected and whether their number meets expectations;
  • matrixts_internal.mars3_files: checks the file status of a MARS3 table; it can be used to verify that the table's incremental files (Data files, Delta files, Index files, etc.) meet expectations;
  • matrixts_internal.mars3_info_brin: views the status of a Brin index on a MARS3 table.
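A hypothetical invocation (the exact function signatures may differ between versions; consult the reference for your release):

=# SELECT * FROM matrixts_internal.mars3_level_stats('metrics'::regclass);  -- hypothetical signature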


2 MARS2 Overview

The MARS2 storage engine is designed mainly for high-speed data loading and querying; it uses ordered storage to reduce search costs.

2.1 Internal Principles

Like MARS3, the MARS2 table uses an LSM Tree structure for its storage. The sections below analyze MARS2's internals layer by layer, in conceptual terms.

2.1.1 Sort Key

  • In MARS2, data is stored in an ordered manner. When creating a table, you specify the sort order by creating an index (see "2.1.7 MARS2 Btree Index"). The fields involved in this sort order are called sort keys.
  • The sort key can only be specified once and cannot be modified or deleted.
  • To maximize the performance benefit of ordering, it is best to choose fields that are frequently used and have good filtering effects as sort keys. For example, in a device monitoring table, the event timestamp and device ID can be used as sort keys.
  • If the sort key is of text type and byte-order sorting is acceptable, using COLLATE C on the column can speed up sorting.

2.1.2 Run and Meta Information

  • Based on the sort key, the data stored in MARS2 is ordered; a continuous segment of ordered data is called a Run.
  • So that each Run can be located, its storage location is recorded; this record is called the Run's meta information.
  • The meta information also records the Run's minimum and maximum values, so that Runs can be filtered during queries.
  • When inserting data, the data is first sorted in memory, so the size of a Run is bounded by the sort memory size.

2.1.3 Merge

  • If the data ranges of Runs overlap, reads suffer amplification and query efficiency drops. Therefore, when the number of Runs on disk exceeds a certain value, MARS2 loads multiple on-disk Runs into memory, sorts them, and outputs a single Run. This process is called a merge.
  • During the merge process, the data is still readable and writable:
    • When reading data, only the merged input files will be read
    • When writing data, the merge process will not read newly written data
    • Read, write, and merge will not block each other

2.1.4 Level

  • To make the merge input files similar in size (avoiding merging oversized files with small ones), Runs are organized into Levels, with 3 levels: L0, L1, L2.
  • When merged data exceeds a certain size, it is promoted to a higher level. The process is as follows:
      1. Newly inserted data lands in L0. When the number of Runs reaches a certain count (configurable; see the configuration parameter mars2_automerge_threshold), an L0 merge is triggered and all L0 Runs are merged into one Run.
      2. If the resulting Run exceeds 25MB (configurable; see the configuration parameter level0_upgrade_size), it is promoted to L1.
      3. If, after the promotion, the total size of the L1 Runs exceeds 1000MB (configurable; see level1_upgrade_size), a merge is triggered and all L1 Runs are merged into one Run.
      4. If the resulting Run exceeds 1000MB (configurable; see level1_upgrade_size), it is promoted to L2.
      5. Runs that reach L2 are no longer merged.

2.1.5 Column Storage

  • MARS2 stores data in columns. When accessing disk, only the columns actually used need to be read, which reduces I/O.

  • Values in the same column share the same data type, making them easier to compress and saving disk space.
  • In addition, columnar storage is well suited to vectorized execution, which can greatly accelerate query execution. For details, see [Vectorized Execution Engine](/doc/5.1/reference/mxvector/overview).

2.1.6 MINMAX Filtering

  • As mentioned earlier, a Run's meta information stores its minimum and maximum values, which are used for filtering during queries.
  • Because filtering happens through the meta information, the data itself does not need to be loaded; the I/O overhead is lower than a sequential scan (which reads the data first and then filters), so filtering is faster.
  • When creating a MARS2 table, minimum and maximum values are not recorded by default; you must declare this explicitly, as shown below.

First create the matrixts extension:

=# CREATE EXTENSION matrixts;

Explicitly declare that minimum and maximum values be recorded:

=# CREATE TABLE metrics (
    ts              timestamp   ENCODING(minmax),
    dev_id          bigint      ENCODING(minmax),
    power           float,
    speed           float,
    message         text
) USING MARS2;

Create the mars2_btree index:

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

  • Both sort keys and non-sort keys can define MINMAX; the effect is more significant on sort keys. In the ENCODING expression, minmax may be written in all uppercase or all lowercase.
  • In addition, MINMAX can be defined for at most 32 columns.

2.1.7 MARS2 Btree Index

  • In addition to MINMAX, MARS2 also has a built-in index (that is, the mars2_btree index created for the table).
  • Currently only one such index can be defined; that is, there is only one global sort order.
  • Unlike a normal Btree index, the MARS2 Btree index:
    • is sparse, so it is smaller
    • generates no random I/O during index scans, since the data itself is ordered

2.1.8 Compression

Same as MARS3.

  • By default, all data columns are compressed using lz4.
  • Manually specifying an [encoding chain compression algorithm](/doc/5.1/reference/storage/compression) is supported, for the entire table or a single column.

2.1.9 Data Writing

  • Data is written into memory through INSERT, then sorted and flushed to disk as a Run.
  • If a single insert exceeds the sort memory, multiple Runs are generated.
  • The size of the sort memory is configurable; see "2.2.2 Configuration Items" for details.

2.1.10 Update and Delete

  • MARS2 does not currently support updates or deletions.
  • Update operations can be approximated using Unique Mode.
  • Deletion can only be achieved by dropping or truncating partitions of a partitioned table (see the sketch below).
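A hypothetical sketch of partition-based deletion, assuming classic Greenplum-style partition DDL (the table and partition names are illustrative):

=# ALTER TABLE metrics_part TRUNCATE PARTITION old_data;  -- empty one partition
=# ALTER TABLE metrics_part DROP PARTITION old_data;      -- or remove the partition entirely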

2.2 MARS2 Usage

2.2.1 Creating a MARS2 Table

Assuming the matrixts extension has been created, the simplest way to create a table is to add the USING clause to the CREATE TABLE statement and then create an index. For more examples, see [Table Design Best Practices](/doc/5.1/reference/storage/table_design).

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS2;

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

2.2.2 Configuration Items

Notes!
Table-level configuration items below can only be configured with the WITH clause when creating a table; they apply to a single table and cannot be modified once configured. Global configuration items can be configured at the session or system level; system-level modifications require running mxstop -u to take effect. For more information, see Data Table Configuration Parameters.

The following parameters control merging (Merge); how they affect merges is described in the Level section above.

| Merge Control Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| mars2_automerge_threshold | Run | 32 | 10 ~ 2048 | Controls, for all MARS2 tables, how many L0 Runs trigger a merge; a global configuration item. To set it for an individual table, use the table option level0_merge_threshold |
| level0_merge_threshold | Run | 32 | 1 ~ 2048 | Controls how many Runs trigger a merge for a single table; a table-level configuration item |
| level0_upgrade_size | MB | 25 | 1 ~ 10000 | Controls the single-table L0 -> L1 promotion size. When an L0 merge's result Run exceeds this size, it is promoted to L1; a table-level configuration item |
| level1_upgrade_size | MB | 1000 | 1 ~ 10000 | Controls the single-table L1 -> L2 promotion size. When an L1 merge's result Run exceeds this size, it is promoted to L2; a table-level configuration item |
  • If each insert carries little data (each insert produces one Run), merges are triggered frequently, which is inefficient. In that case, increase mars2_automerge_threshold / level0_merge_threshold to raise the number of Runs per merge and reduce the merge frequency.
  • When an L0/L1 merge output is not large enough to be promoted to the next level, the next round of merging will merge it again with new Runs, causing write amplification. To avoid this, lower level0_upgrade_size / level1_upgrade_size so the output is promoted sooner.

The following parameter controls compression and adjusts the compression effect. If set too low, the compression effect is negligible; if set too high, it consumes more memory.

| Compression Control Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| compress_threshold | Tuple | 1200 | 1 ~ 100000 | Compression threshold. Controls how many tuples of a single table are compressed at one time, i.e., the upper limit on the number of tuples compressed as one unit; a table-level configuration item |

The following parameters control memory. They set the size of the sort memory used when inserting data. When inserting into multiple partition tables, each partition table is allocated the memory configured by mars2_sort_mem_core; if that is insufficient, the allocation grows, but the total never exceeds mars2_sort_mem.

| Sort Memory Parameter | Unit | Default | Value Range | Description |
| -- | -- | -- | -- | -- |
| mars2_sort_mem | KB | 2097152 (2GB) | 128 ~ 2147483647 (~2048GB) | Controls the sort memory size of each individual insert. If the insert target is a partitioned table, the partitions share this amount; a global configuration item |
| mars2_sort_mem_core | KB | 16384 (16MB) | 128 ~ 2147483647 (~2048GB) | Controls the minimum sort memory allocated to each single partition table; a global configuration item |

Example of table-level configuration items:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) 
USING MARS2
WITH (compress_threshold=1200,level0_merge_threshold=32);

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

Example of modifying a global configuration item at the session level:

=# SET mars2_sort_mem TO 2097152;

Example of modifying a global configuration item at the system level:

$ gpconfig -c mars2_sort_mem -v 2097152
$ mxstop -u


3 HEAP Overview

HEAP is YMatrix's default storage engine, also known as heap storage. Inherited from PostgreSQL, it supports only row storage, with no column storage or compression. It is implemented on the MVCC mechanism and is suitable for scenarios with heavy update and deletion requirements.

3.1 Using the MVCC mechanism

Under the MVCC mechanism, a HEAP table does not physically delete data when handling updates and deletions; it relies on version information to hide the old data (i.e., it controls data visibility). Consequently, if a HEAP table undergoes large volumes of updates or deletions, the physical space it occupies keeps growing, and you need to run VACUUM regularly, on a planned schedule, to clean up old data (see the sketch in "3.2 HEAP Usage").

3.2 HEAP Usage

You can use the following SQL statement to create a HEAP table in YMatrix.

=# CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);
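A minimal maintenance sketch using the table above (illustrative predicates; VACUUM reclaims the space left behind by updated and deleted rows):

=# UPDATE disk_heap SET read = 0 WHERE tag_id = 1;
=# DELETE FROM disk_heap WHERE time < now() - interval '30 days';
=# VACUUM disk_heap;  -- clean up the dead tuples produced by the UPDATE and DELETE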


4 AO Overview

Tables using the AOCO and AORO storage engines are collectively called AO (Append-Optimized) tables, also known as append-optimized tables. They support insert, update, and delete operations as well as compression. AORO provides row storage, and AOCO provides column storage.

AO tables differ greatly from HEAP tables in both logical and physical structure. As mentioned in the previous section, HEAP tables use the MVCC mechanism to control data visibility after updates and deletions, whereas AO tables use an auxiliary bitmap table that records which rows of the AO table are visible.

For AO tables with many update and delete operations, old data also needs to be cleaned up on a schedule. However, on AO tables VACUUM must reset the bitmap and compact the physical files, so it is usually slower than on HEAP tables.
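A minimal sketch of creating a column-oriented AO (AOCO) table, assuming Greenplum-style WITH options (option names may vary between releases; check your version's DDL reference):

=# CREATE TABLE events_ao (
    ts     timestamp,
    tag_id int,
    value  float
)
WITH (appendonly=true, orientation=column)  -- assumed Greenplum-style AO options
DISTRIBUTED BY (tag_id);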

Notes!
For storage engine details, usage, and best practices, please see [Table Design Best Practices](/doc/5.1/reference/storage/table_design).