Storage Engine Principles Overview


1 MARS3 Overview

The MARS3 storage engine builds on MARS2 with a number of optimizations. It adopts a hybrid row-column storage format, enabling both high-speed data ingestion and high compression ratios.

MARS3 supports data updates and deletes via the UPDATE and DELETE statements (except in Unique Mode; see Update and Delete below).

MARS3 supports adding and dropping columns, as well as COPY and pg_dump operations.

1.1 Internal Principles

Each MARS3 table internally uses an LSM Tree structure for storage. The LSM Tree (Log-Structured Merge Tree) is a layered, ordered, disk-oriented data structure. Its core idea is to leverage disk performance through batched sequential writes, which significantly outperform random writes.

Below is a diagram illustrating the internal structure of MARS3:

We will analyze the above diagram layer by layer.

1.1.1 Sort Key

  • In MARS3, data is stored in sorted order. When creating a table, you must specify one or more sort columns to define the sort order. These columns are known as the sort key.
  • The sort key can be defined only once, at table creation; columns cannot be added to it, removed from it, or changed afterward.
  • To maximize performance benefits from data ordering, choose frequently queried columns with good filtering selectivity as the sort key. For example, in a device monitoring table, use event timestamp and device ID as the sort key.
  • If a sort-key column is of text type and byte-order sorting is acceptable, declaring it with COLLATE "C" can accelerate sorting (see the sketch after this list).
  • SQL keyword for specifying the sort key: ORDER BY.
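
A minimal sketch of a sort key with COLLATE "C" on a text column (the table and column names here are illustrative):

=# CREATE TABLE events (
    ts       timestamp,
    dev_name text COLLATE "C",  -- byte-order collation: cheaper sort-key comparisons
    payload  text
) USING MARS3
  ORDER BY (dev_name, ts);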

1.1.2 Rowstore and Columnstore

  • MARS3 supports a row-then-column storage mode as well as direct columnar storage. The loading mode can be set via the prefer_load_mode parameter (see the sketch after this list, and Configuration Parameters below for details).
  • In hybrid mode, incoming data is first stored in row format. After accumulating sufficient volume, it is converted into columnar format.
  • Compared to direct columnar loading, this approach offers several advantages:
    • Faster write speed for high-frequency, small-batch inserts
    • Lower memory usage for buffering
    • Uniform tuple count per data block
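
As an illustration of choosing the loading mode, a sketch using the WITH(mars3options=...) syntax described in Configuration Parameters below (table and column names are illustrative):

=# CREATE TABLE metrics_bulk (
    ts     timestamp,
    dev_id bigint,
    power  float
) USING MARS3
  WITH (mars3options='prefer_load_mode=bulk')  -- direct columnar loading for large batches
  ORDER BY (dev_id, ts);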

1.1.3 Run, Delta Files, and Metadata

  • Based on the sort key, data in MARS3 is stored in sorted order. A contiguous segment of sorted data is called a Run.
  • Runs come in two types: for fast ingestion, inserted data is stored as row-format Runs; later, for efficient reads and compression, row Runs are converted into columnar Runs.
  • Each Run has its own set of auxiliary files. Besides the main data (Data file), there are Toast files for large values, Delta files for delete records, Index files for indexing, and Link files for merge tracking (the auxiliary file types differ slightly between row and column Runs).
  • Additionally, metadata is maintained to track file locations. This metadata includes file paths, sizes, number of tuples, and compression status.
  • Run size (configured via the rowstore_size parameter) can be tuned to optimize performance for different workloads.

1.1.4 Merge and Garbage Collection

  • Overlapping data ranges among Runs cause read amplification and degrade query performance. When the number of Runs on disk exceeds a threshold, MARS3 merges multiple Runs into one via external sorting. This process is called merge.

  • During merging, data remains readable and writable:

    • Reads access only input files of the merge
    • Writes do not interfere with ongoing merges
    • Read, write, and merge operations do not block each other
  • After a merge completes, the original Runs are marked as obsolete based on transaction ID and enter a reclaimable state.

1.1.5 Level

  • To ensure input files for merging are of similar size (avoiding merging large and small files), Runs are organized into Levels, up to 10 levels: L0, L1, L2, ..., L9.
  • A merge is triggered when the number of Runs in a level reaches a threshold, or when the total size of Runs in a level exceeds a limit. The merged Run is promoted to the next higher level.
  • To accelerate Run promotion, multiple concurrent merge tasks are allowed within the same level.

1.1.6 MARS3 BRIN Index

  • MARS3 supports creating and dropping BRIN indexes.
  • Each Run generates its own independent BRIN index file upon creation.
  • Example: CREATE INDEX brin_idx ON t1 USING mars3_brin(time, tag_id);

1.1.7 Compression

  • By default, all columns in MARS3 are compressed using lz4.
  • Manual specification of compression encoding chains is supported, either at the table level or per column, as sketched below.
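
For example, the table-level options used in the Configuration example below (compresstype, compresslevel) can replace the lz4 default at creation time; a minimal sketch, assuming zstd is available in your build:

=# CREATE TABLE metrics_z (
    ts     timestamp,
    dev_id bigint,
    power  float
) USING MARS3
  WITH (compresstype=zstd, compresslevel=3)  -- applies zstd to all columns instead of lz4
  ORDER BY (dev_id, ts);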

1.1.8 MVCC Support

  • MVCC (Multiversion Concurrency Control), also known as multi-version management, primarily governs how data updates and deletes are handled.
  • Under multi-version management, updates and deletes do not modify data in place. Instead, a new version of the data is written and the old version is marked obsolete; each version carries its own visibility metadata, and historical versions are preserved.
  • MARS3 does not perform in-place updates or deletes. Instead, it uses Delta files and version visibility to mask old data.
  • Note: Continuously updating or deleting data within the same Run increases the physical size of its Delta file. Growth stops once all data in the Run has been deleted. Merge operations automatically remove dead data, and you may also run VACUUM periodically to clean it up, as shown below.
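
For example, for a non-Unique-Mode MARS3 table t1 with a timestamp column time (as in the BRIN example above), dead data left behind by deletes can be reclaimed manually:

=# DELETE FROM t1 WHERE time < '2023-07-01';  -- recorded in the Runs' Delta files; old rows are masked
=# VACUUM t1;                                 -- reclaims dead data without waiting for a merge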

1.1.9 Data Ingestion

  • Data is written into memory via INSERT, then flushed into L0 Runs.
  • The size of L0 Runs is configurable. See Configuration Parameters below.

1.1.10 Update and Delete

  • MARS3 uses DELETE for deletion. Deletions are recorded in the Delta file of the corresponding Run and physically removed during Run merging.
  • MARS3 uses UPDATE for updates. An update first deletes the old record, then inserts a new one.
  • In MARS3 Unique Mode, deletion is not supported, and updates do not require an explicit UPDATE statement: inserting a new record with the same Unique Key (i.e., the same values in the sort key defined at table creation) replaces the existing record. For example: CREATE TABLE mars3_t(c1 int NOT NULL, c2 int) USING MARS3 WITH (uniquemode=true) ORDER BY (c1, c2);, where the Unique Key is (c1, c2). A sketch of this behavior follows the note below.

Note!
When Unique Mode is enabled, the first field in the ORDER BY clause must include the NOT NULL constraint during definition.
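
A sketch of the INSERT-as-update behavior in Unique Mode, assuming a table whose Unique Key is a single column and whose remaining column takes the newly inserted value (names are illustrative):

=# CREATE TABLE mars3_u (c1 int NOT NULL, v int)
   USING MARS3
   WITH (uniquemode=true)
   ORDER BY (c1);
=# INSERT INTO mars3_u VALUES (1, 100);
=# INSERT INTO mars3_u VALUES (1, 200);  -- same Unique Key (c1 = 1): replaces the earlier record
=# SELECT * FROM mars3_u;                -- expected to return a single row: (1, 200)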

1.2 Using MARS3

1.2.1 Creating a MARS3 Table

After creating the matrixts extension, the simplest way to create a MARS3 table is to add the USING and ORDER BY clauses to CREATE TABLE. For extended examples, see Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS3 
  ORDER BY (dev_id,ts);

Note!
Creating a BRIN index on MARS3 tables is supported but not required.
Starting from version 6.3.0, the requirement to use the ORDER BY clause when creating MARS3 tables has been removed.


1.2.2 Configuration Parameters

Note!
These parameters are table-level settings. They can only be specified at table creation using the WITH(mars3options='a=1,b=2,...') clause (except for compress_threshold, which is shared with MARS2 and can be set directly via WITH). Once set, they cannot be changed. These settings apply to a single table. For more information, see Database Table Parameters.

The following parameter controls the size of Runs in L0, indirectly affecting Run sizes in higher levels.

Parameter Unit Default Range Description
rowstore_size MB 64 8 – 1024 Controls when an L0 Run rolls over: a new Run is created once the current one exceeds this size.

The following parameter sets the compression threshold, balancing compression efficiency and read performance. Too low a value reduces compression; too high increases memory usage.

Parameter Unit Default Range Description
compress_threshold Tuples 1200 1 – 100000 Compression threshold. Maximum number of tuples per column to compress in one batch.

The following parameter sets the data loading mode in MARS3.

Parameter Default Range Description
prefer_load_mode normal normal / bulk Data loading mode. In normal mode, new data is first written to row-format Runs in L0 and moved to columnar Runs in L1 once rowstore_size is reached; this costs one extra I/O compared to bulk, but column conversion is asynchronous. Suitable for high-frequency, small-batch writes where I/O capacity is sufficient and latency is critical. In bulk mode, data is written directly to columnar Runs in L1, saving one I/O compared to normal; column conversion is synchronous. Suitable for low-frequency, large-batch inserts where I/O capacity is limited and latency is less critical.

The following parameter controls the level size amplification factor.

Parameter Default Range Description
level_size_amplifier 8 1 – 1000 Level size amplification factor (dimensionless). Threshold for triggering a merge in a given level: rowstore_size * (level_size_amplifier ^ level). A higher value slows reads but speeds up writes. Choose based on workload (write-heavy vs. read-heavy, compression ratio, etc.). Note: avoid excessive Run counts per level, as this may degrade query performance or block new inserts.
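
For example, with the defaults rowstore_size = 64 MB and level_size_amplifier = 8, the merge thresholds work out to 64 MB for L0, 64 × 8 = 512 MB for L1, and 64 × 8² = 4096 MB for L2.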

Configuration example:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float
) USING MARS3
WITH (compress_threshold=1200, mars3options='rowstore_size=64', compresstype=zstd, compresslevel=1)
DISTRIBUTED BY (dev_id)
ORDER BY (dev_id,ts)
PARTITION BY RANGE (ts)
( START ('2023-07-01 00:00:00') INCLUSIVE
  END ('2023-08-01 00:00:00') EXCLUSIVE
  EVERY (INTERVAL '1 day')
,DEFAULT PARTITION OTHERS);

1.2.3 Utility Functions

  • matrixts_internal.mars3_level_stats: View the status of each Level in a MARS3 table. Useful for assessing table health, such as whether Runs are merging as expected and if Run counts are within normal range.
  • matrixts_internal.mars3_files: View file status of a MARS3 table, including auxiliary and delta files (Data, Delta, Index files, etc.), to verify expected file states.
  • matrixts_internal.mars3_info_brin: View the status of a specific BRIN index on a MARS3 table.
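
An illustrative sketch of calling these functions (the exact argument lists may differ between versions; a regclass argument is assumed here):

=# SELECT * FROM matrixts_internal.mars3_level_stats('metrics'::regclass);
=# SELECT * FROM matrixts_internal.mars3_files('metrics'::regclass);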


2 MARS2 Overview

The MARS2 storage engine is designed for high-speed data loading and querying, using ordered storage to eliminate search overhead.

2.1 Internal Principles

Like MARS3, MARS2 uses an LSM Tree structure for storage.

Below is a diagram of MARS2's internal structure:

We will analyze this diagram concept by concept.

2.1.1 Sort Key

  • In MARS2, data is stored in sorted order. At table creation, the sort order is defined by creating an index. The columns involved in this sort order are called the sort key.
  • The sort key can only be defined once and cannot be modified or removed.
  • To maximize performance from data ordering, choose frequently used, high-selectivity columns as the sort key. For example, in a device monitoring table, use event timestamp and device ID.
  • If the sort key is text and byte-order sorting is acceptable, using COLLATE C on this column can speed up sorting.

2.1.2 Run and Metadata

  • Based on the sort key, data in MARS2 is stored in sorted order. A contiguous segment of sorted data is called a Run.
  • To locate Runs, metadata records their storage paths.
  • The metadata also stores the minimum and maximum values of the Run, enabling efficient query filtering.
  • During insertion, data is sorted in memory, so Run size is limited by available sort memory.

2.1.3 Merge

  • Overlapping data ranges among Runs cause read amplification and reduce query efficiency. When the number of Runs on disk exceeds a threshold, MARS2 loads multiple Runs into memory, sorts them, and outputs a single Run. This is called merge.
  • During merging, data remains readable and writable:
    • Reads access only the input files of the merge
    • Writes do not affect the merge process
    • Read, write, and merge operations do not block each other

2.1.4 Level

  • To ensure merge inputs are of similar size (avoiding large-small file merges), Runs are organized into three Levels: L0, L1, and L2.
  • When a merged Run exceeds a size threshold, it is promoted to a higher level:
    1. New inserts go to L0. When the number of Runs reaches a threshold (configurable via mars2_automerge_threshold), L0 merge is triggered, combining all L0 Runs into one.
    2. If the resulting Run exceeds 25MB (configurable via level0_upgrade_size), it is promoted to L1.
    3. If total L1 Run size exceeds 1000MB (configurable via level1_upgrade_size), L1 merge is triggered, combining all L1 Runs into one.
    4. If the resulting Run exceeds 1000MB (configurable via level1_upgrade_size), it is promoted to L2.
    5. Runs in L2 are not merged further.

2.1.5 Columnar Storage

  • MARS2 uses columnar storage, reading only required columns from disk, reducing I/O.
  • Uniform data types within a column enable better compression, saving disk space.
  • Columnar storage also supports vectorized execution, significantly accelerating query performance. See Vectorized Execution Engine.

2.1.6 MINMAX Filtering

  • As mentioned, Run metadata stores min/max values for query filtering.
  • Since filtering uses metadata without loading actual data, I/O cost is lower than sequential scans (which load data first, then filter), enabling faster filtering.
  • MINMAX is not recorded by default when creating a MARS2 table; it must be explicitly declared.

First, create the matrixts extension:

=# CREATE EXTENSION matrixts;

Then explicitly declare min/max recording:

=# CREATE TABLE metrics (
    ts              timestamp   ENCODING(minmax),
    dev_id          bigint      ENCODING(minmax),
    power           float,
    speed           float,
    message         text
) USING MARS2;

Create the mars2_btree index:

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

  • MINMAX can be defined on both sort-key and non-sort-key columns; it is most effective on sort keys. MINMAX can be written in all uppercase or all lowercase.
  • Up to 32 columns can have MINMAX defined.

2.1.7 MARS2 Btree Index

  • Besides MINMAX, MARS2 has a built-in mars2_btree index, created as part of table setup (see Creating a MARS2 Table below).
  • Only one such index can be defined per table, meaning only one global sort order is allowed.
  • Unlike standard Btree indexes, the MARS2 Btree index:
    • Is sparse, thus smaller in size
    • Avoids random I/O during index scans because the underlying data is already sorted

2.1.8 Compression

Same as MARS3.

  • By default, all columns are compressed using lz4.
  • Manual specification of compression encoding chains is supported, either at the table or column level.

2.1.9 Data Ingestion

  • Data is written into memory via INSERT, sorted, and then flushed as a Run.
  • If the insert size exceeds sort memory, multiple Runs are generated.
  • Sort memory size is configurable. See "2.2.2 Configuration Parameters" below.

2.1.10 Update and Delete

  • MARS2 does not currently support update or delete operations.
  • Updates can be emulated using Unique Mode.
  • Deletion can only be achieved by dropping or truncating partitions, as sketched below.
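
A hedged sketch of partition-level deletion, using classic Greenplum partition syntax on a range-partitioned table like the one in the MARS3 Configuration example (verify the exact syntax against your version):

=# ALTER TABLE metrics TRUNCATE PARTITION FOR ('2023-07-01');  -- empty a single partition
=# ALTER TABLE metrics DROP PARTITION FOR ('2023-07-01');      -- remove the partition entirely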

2.2 Using MARS2

2.2.1 Creating a MARS2 Table

After creating the matrixts extension, the simplest way to create a MARS2 table is to use the USING clause in CREATE TABLE and create an index. For extended examples, see Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS2;

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

2.2.2 Configuration Parameters

Note!
Table-level parameters can only be set at table creation using the WITH clause, apply to a single table, and cannot be changed afterward. Global parameters can be set at session or system level. System-level changes require mxstop -u to take effect. For more information, see Database Table Parameters.

The following parameters control merging behavior, as described in the Level section above.

Merge Control Parameter Unit Default Range Description
mars2_automerge_threshold run 32 10 – 2048 Controls merge trigger for all MARS2 tables: number of L0 Runs to trigger merge. Global setting. Per-table override: level0_merge_threshold
level0_merge_threshold run 32 1 – 2048 Controls per-table L0 merge threshold. Table-level setting
level0_upgrade_size MB 25 1 – 10000 Controls per-table L0 → L1 upgrade size. If merged Run exceeds this, promote to L1. Table-level setting
level1_upgrade_size MB 1000 1 – 10000 Controls per-table L1 → L2 upgrade size. If merged Run exceeds this, promote to L2. Table-level setting
  • If small inserts (each creating one Run) occur frequently, merges trigger too often and are inefficient. Increase mars2_automerge_threshold / level0_merge_threshold to reduce merge frequency.
  • If merged Runs from L0/L1 are too small to promote, the next merge will reprocess them, causing write amplification. To avoid this, reduce level0_upgrade_size / level1_upgrade_size.

The following parameter controls compression behavior. Too low a value reduces compression; too high increases memory usage.

Compression Control Parameter Unit Default Range Description
compress_threshold tuples 1200 1 – 100000 Compression threshold: maximum number of tuples to compress in one batch (per compression unit). Table-level setting.

The following parameters control sort memory size during inserts. When inserting into multiple partitioned tables, each gets mars2_sort_mem_core; total usage does not exceed mars2_sort_mem.

Sort Memory Parameter Unit Default Range Description
mars2_sort_mem KB 2097152KB (2GB) 128KB – 2147483647KB (~2048GB) Sort memory per insert. For partitioned tables, shared among partitions. Global setting.
mars2_sort_mem_core KB 16384KB (16MB) 128KB – 2147483647KB (~2048GB) Minimum sort memory per partitioned table. Global setting.
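
For example, with the defaults, an INSERT that spans 50 partitions gives each partition at least mars2_sort_mem_core = 16 MB of sort memory, while total usage across all partitions stays within the mars2_sort_mem budget of 2 GB.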

Table-level configuration example:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) 
USING MARS2
WITH (compress_threshold=1200,level0_merge_threshold=32);

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

Session-level global parameter example:

=# SET mars2_sort_mem TO 2097152;

System-level global parameter example:

=# \q
$ gpconfig -c mars2_sort_mem -v 2097152
$ mxstop -u


3 HEAP Overview

HEAP is YMatrix's default storage engine, inherited from PostgreSQL. It supports only row storage, not column storage or compression. Based on MVCC, it is suitable for workloads with frequent updates and deletes.

3.1 MVCC Mechanism

Under MVCC, HEAP tables do not physically delete data during updates or deletes. Instead, old data is masked using version visibility metadata. As a result, frequent updates and deletes cause physical storage size to grow. You should periodically run VACUUM to reclaim space.

3.2 Using HEAP

Use the following SQL to create a HEAP table in YMatrix:

=# CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);
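
After heavy update or delete activity, reclaim the space held by dead versions, for example:

=# VACUUM disk_heap;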


4 AO Overview

Tables using the AORO or AOCO storage engines are collectively called AO (Append-Optimized) tables. They support insert, update, and delete operations, as well as compression.
AORO stores data in row format; AOCO stores it in column format.

AO tables differ significantly from HEAP tables in both logical and physical structure. As described earlier, HEAP tables use MVCC to manage data visibility after updates and deletes. In contrast, AO tables use an auxiliary bitmap table to track visible rows.

For AO tables with frequent updates and deletes, regular cleanup is also required. However, VACUUM on an AO table must reset the visibility bitmap and compact the physical files, which is typically slower than cleaning up a HEAP table.
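
As a hedged sketch, AO tables are typically created with Greenplum-style storage options (option names and supported compression types may vary by version):

=# CREATE TABLE events_aoco (
    ts     timestamp,
    tag_id int,
    value  float
) WITH (appendoptimized=true, orientation=column, compresstype=zstd)  -- AOCO: columnar AO table
DISTRIBUTED BY (tag_id);

=# CREATE TABLE events_aoro (
    ts     timestamp,
    tag_id int,
    value  float
) WITH (appendoptimized=true, orientation=row)  -- AORO: row-oriented AO table
DISTRIBUTED BY (tag_id);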

Note!
For detailed information on storage engines, usage, and best practices, see Table Design Best Practices.