Storage Engine Overview


1 MARS3 Overview

The MARS3 storage engine builds upon MARS2 with multiple optimizations. It adopts a hybrid row-column storage format, enabling both high-speed data ingestion and aggressive compression.

MARS3 supports updating and deleting data via the UPDATE and DELETE statements (except in Unique Mode).

MARS3 supports adding and dropping columns, as well as COPY and pg_dump operations.

1.1 Internal Architecture

Each MARS3 table internally uses an LSM Tree structure for storage. An LSM Tree (Log-Structured Merge Tree) is a layered, ordered, disk-oriented data structure. Its core idea is to exploit disk performance through batched sequential writes, which significantly outperform random writes.

The internal architecture of MARS3 is illustrated in the diagram below. We will analyze it layer by layer.

1.1.1 Sort Key

  • In MARS3, data is stored in sorted order. When creating a table, you must specify one or more sort columns to define the sort order. These columns are referred to as the sort key.
  • The sort key can be defined only once, at table creation; it cannot be modified, extended, or removed afterward.
  • To maximize performance benefits from data ordering, choose frequently queried columns with good filtering selectivity as the sort key. For example, in a device monitoring table, use event timestamp and device ID as the sort key.
  • If a sort key column is of text type and byte-order sorting is acceptable, declaring it with COLLATE "C" can accelerate sorting.
  • The sort key is specified with the ORDER BY clause of CREATE TABLE, as sketched below.
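
For illustration, a minimal sort-key definition might look like the following (the table and column names are hypothetical); COLLATE "C" is applied to the text column in the sort key:

=# CREATE TABLE device_events (
    ts       timestamp,
    dev_code text COLLATE "C",  -- byte-order collation accelerates sorting on a text sort key
    power    float
) USING MARS3
  ORDER BY (dev_code, ts);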

1.1.2 Rowstore and Columnstore

  • MARS3 supports loading data in a hybrid row-then-column format or directly in columnstore format. The loading mode can be set via the prefer_load_mode parameter; see Configuration Parameters below.
  • In hybrid row-column mode, incoming data is first written in rowstore format. After sufficient accumulation, it is converted into columnstore format.
  • Compared to direct columnstore loading, this approach offers several advantages:
    • Faster write performance for high-frequency, small-batch inserts
    • Lower memory usage for buffering
    • Ensures uniform tuple count per data block

1.1.3 Run, Delta Files, and Metadata

  • Based on the sort key, data in MARS3 is stored in sorted order. A contiguous segment of sorted data is called a Run.
  • Runs come in two types: for fast ingestion, inserted data is initially stored as a rowstore Run; later, for efficient reads and compression, it is converted into a columnstore Run.
  • Each Run has its own set of auxiliary files. Besides the main data (Data) file, there are Toast files for large values, Delta files for delete records, Index files for indexing, and Link files for merge tracking (the exact set of auxiliary files differs slightly between rowstore and columnstore Runs).
  • Additionally, metadata is maintained to track file locations. This includes file paths, sizes, number of tuples, and compression status.
  • The size of a Run (configured via the rowstore_size parameter) can be tuned to optimize performance for different workloads.

1.1.4 Merge and Garbage Collection

  • Overlapping data ranges across Runs cause read amplification and degrade query performance. When the number of Runs on disk exceeds a threshold, MARS3 sorts and combines multiple Runs into a single Run; this process is called a merge.
  • During merging, data remains readable and writable:
    • Reads access only the input files of the merge
    • Writes do not interfere with the merge process
    • Read, write, and merge operations do not block each other
  • After a merge completes, the original Runs are marked obsolete based on transaction IDs and queued for garbage collection.

1.1.5 Level

  • To ensure input files for merging are of similar size (avoiding merging large and small files), Runs are organized into Levels, up to 10 levels: L0, L1, L2, ..., L9.
  • When the number of Runs in a level reaches a threshold, or their total size exceeds a limit, they are merged into a single Run and promoted to the next higher level.
  • Multiple concurrent merge tasks are allowed within the same level to accelerate Run promotion.

1.1.6 MARS3 BRIN Index

  • MARS3 supports creating and dropping BRIN indexes.
  • Each Run generates its own independent BRIN index file upon creation.
  • Example: CREATE INDEX brin_idx ON t1 USING mars3_brin (time, tag_id);

1.1.7 Compression

  • By default, all columns in MARS3 are compressed using lz4.
  • Custom compression encoding chains can be specified at the table level or per column, as sketched below.
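
A minimal sketch: table-level compression via the WITH clause (as in the configuration example later in this section), and a per-column override via an ENCODING clause. The per-column form here is an assumption modeled on the ENCODING(minmax) example in the MARS2 section:

=# CREATE TABLE metrics_zstd (
    ts  timestamp,
    val float ENCODING (compresstype=lz4)   -- hypothetical per-column encoding override
) USING MARS3
WITH (compresstype=zstd, compresslevel=1)   -- table-level compression
ORDER BY (ts);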

1.1.8 MVCC Support

  • MVCC (Multiversion Concurrency Control), also known as multi-version management, governs how data updates and deletions are handled.
  • Under multi-version management, updates and deletes do not modify data in place. Instead, a new version is created: the old data is marked obsolete and the new data is written into the new version. Data thus exists in multiple versions, each with its own version metadata, and historical versions are preserved.
  • MARS3 does not perform in-place updates or deletes. Instead, it uses Delta files and version visibility to mask old data.
  • Note: Continuously updating or deleting data within the same Run increases the physical size of its Delta file; this growth stops once all data in the Run has been deleted. MARS3 automatically removes dead data during merges, and you can also run VACUUM periodically to clean up dead data manually, as shown below.
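
For example, to clean up dead data manually on a MARS3 table named metrics:

=# VACUUM metrics;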

1.1.9 Data Ingestion

  • Data is first written into memory via INSERT, then flushed into a Run in L0.
  • The size of Runs in L0 is configurable. See Configuration Parameters below.

1.1.10 Update and Delete

  • MARS3 uses the DELETE statement for deletion. Deletions are recorded in the Delta file of the corresponding Run, and the deleted rows are physically removed during merging.
  • MARS3 uses the UPDATE statement for updates. An update first deletes the old row and then inserts a new one.
  • In Unique Mode, deletion is not supported, and updates do not require an explicit UPDATE statement: a plain INSERT performs the operation. To update the row with a given Unique Key (i.e., the value of the sort key defined at table creation), insert a new row with the same Unique Key. For example, in CREATE TABLE mars3_t(c1 int NOT NULL, c2 int) USING MARS3 WITH (uniquemode=true) ORDER BY (c1, c2);, the Unique Key is (c1, c2). A usage sketch follows the note below.

Note!
When Unique Mode is enabled, the first field in the ORDER BY clause must have the NOT NULL constraint.
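
A usage sketch of Unique Mode with a hypothetical table whose Unique Key is a single column:

=# CREATE TABLE sensor_last (
    dev_id int NOT NULL,  -- first ORDER BY column must be NOT NULL in Unique Mode
    val    float
) USING MARS3
WITH (uniquemode=true)
ORDER BY (dev_id);

=# INSERT INTO sensor_last VALUES (42, 1.5);  -- inserts a new row
=# INSERT INTO sensor_last VALUES (42, 2.0);  -- same Unique Key (dev_id = 42): acts as an update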

1.2 Using MARS3

1.2.1 Creating a MARS3 Table

After creating the matrixts extension, the simplest way to create a MARS3 table is to use the USING and ORDER BY clauses of CREATE TABLE. For extended examples, see Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS3
  ORDER BY (dev_id, ts);

Note!
MARS3 tables support BRIN indexes, but an index is not required. The ORDER BY clause, however, must be used when creating a MARS3 table, as it defines the sort key.


1.2.2 Configuration Parameters

Note!
These are table-level parameters. They can only be set at table creation, using the WITH (mars3options='a=1,b=2,...') clause (except for compress_threshold, which is also supported by MARS2 and can be set directly in WITH). Once set, they cannot be changed, and each applies to a single table. For more information, see Database Table Parameters.

The following parameter controls the size of Runs in L0, indirectly affecting Run sizes in higher levels.

Parameter      Unit  Default  Range     Description
rowstore_size  MB    64       8 ~ 1024  Controls when a new L0 Run is started: a new Run is created once the current one exceeds this size.

The following parameter sets the compression threshold, balancing compression efficiency and read performance. Too low a value reduces compression efficiency; too high a value increases memory usage.

Parameter           Unit    Default  Range       Description
compress_threshold  tuples  1200     1 ~ 100000  Compression threshold: the maximum number of tuples per column to compress in one batch.

The following parameter defines the data loading mode in MARS3.

Parameter         Unit  Default  Range          Description
prefer_load_mode  -     normal   normal / bulk  Data loading mode; the two modes are described below.

  • normal (normal mode): new data is first written to rowstore Runs in L0, then moved to columnstore Runs in L1 after reaching rowstore_size. This costs one extra I/O pass compared to bulk, but column conversion is asynchronous. Suitable for high-frequency, small-batch writes where I/O capacity is sufficient and latency is critical.
  • bulk (bulk-load mode): data is written directly to columnstore Runs in L1. This reduces I/O compared to normal, but column conversion is synchronous. Suitable for low-frequency, large-batch writes where I/O capacity is limited and latency is less critical.
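
For example, a table intended for low-frequency, large-batch loads might opt into bulk mode (set through mars3options, per the note above; table and column names are illustrative):

=# CREATE TABLE metrics_bulk (
    ts  timestamp,
    val float
) USING MARS3
WITH (mars3options='prefer_load_mode=bulk')
ORDER BY (ts);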

The following parameter sets the level size amplifier.

Parameter             Unit  Default  Range     Description
level_size_amplifier  -     8        1 ~ 1000  Level size amplifier. The threshold for triggering a merge at a given level is rowstore_size * (level_size_amplifier ^ level). A higher value slows reads but speeds up writes; choose it based on the workload (write-heavy vs. read-heavy, compression ratio, etc.). Ensure the number of Runs per level does not grow too large, as this can degrade query performance or block new inserts.
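
For example, with the defaults rowstore_size = 64 MB and level_size_amplifier = 8, the merge-trigger sizes are 64 MB at L0, 64 × 8 = 512 MB at L1, and 64 × 8² = 4096 MB at L2, growing by a factor of 8 per level.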

Configuration example:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float
) USING MARS3
WITH (compress_threshold=1200, mars3options='rowstore_size=64', compresstype=zstd, compresslevel=1)
DISTRIBUTED BY (dev_id)
ORDER BY (dev_id,ts)
PARTITION BY RANGE (ts)
( START ('2023-07-01 00:00:00') INCLUSIVE
  END ('2023-08-01 00:00:00') EXCLUSIVE
  EVERY (INTERVAL '1 day')
,DEFAULT PARTITION OTHERS);

1.2.3 Utility Functions

  • matrixts_internal.mars3_level_stats: View the status of each Level in a MARS3 table. Useful for assessing table health, such as whether Runs are merging as expected and whether Run counts are within the normal range. (A usage sketch follows this list.)
  • matrixts_internal.mars3_files: View file status of a MARS3 table, including auxiliary and delta files (Data, Delta, Index files, etc.), to verify expected file states.
  • matrixts_internal.mars3_info_brin: View the status of a specific BRIN index on a MARS3 table.
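
A hedged usage sketch of the functions above; their exact signatures are not documented in this section, so the argument form below (a table identifier) is an assumption:

=# SELECT * FROM matrixts_internal.mars3_level_stats('metrics'::regclass);  -- assumed signature
=# SELECT * FROM matrixts_internal.mars3_files('metrics'::regclass);        -- assumed signature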


2 MARS2 Overview

The MARS2 storage engine is designed for high-speed data loading and querying; its ordered storage lets queries locate data directly rather than scanning for it.

2.1 Internal Architecture

Like MARS3, MARS2 uses an LSM Tree structure for storage.

The internal architecture of MARS2 is shown in the diagram below. We will analyze it layer by layer.

2.1.1 Sort Key

  • In MARS2, data is stored in sorted order. At table creation, the sort order is defined by creating an index. The columns involved in this sort order are called the sort key.
  • The sort key can only be defined once and cannot be modified or removed.
  • To maximize performance from data ordering, choose frequently used, high-selectivity columns as the sort key. For example, in a device monitoring table, use event timestamp and device ID.
  • If the sort key is text and byte-order sorting is acceptable, using COLLATE C on this column can speed up sorting.

2.1.2 Run and Metadata

  • Based on the sort key, data in MARS2 is stored in sorted order. A contiguous segment of sorted data is called a Run.
  • Metadata is recorded to track the location of each Run.
  • The metadata also stores the minimum and maximum values of the Run, enabling efficient filtering during queries.
  • During insertion, data is sorted in memory, so Run size is limited by available sort memory.

2.1.3 Merge

  • Overlapping data ranges across Runs cause read amplification and reduce query efficiency. When the number of Runs on disk exceeds a threshold, MARS2 loads multiple Runs into memory, sorts them, and writes out a single Run; this process is called a merge.
  • During merging, data remains readable and writable:
    • Reads access only the input files of the merge
    • Writes do not interfere with the merge
    • Read, write, and merge operations do not block each other

2.1.4 Level

  • To ensure merge inputs are of similar size (avoiding merging large and small files), Runs are organized into 3 levels: L0, L1, L2.
  • When a merged Run exceeds a size threshold, it is promoted to a higher level (a worked example follows this list):
    1. New inserts go to L0. When the number of Runs reaches a threshold (configurable via mars2_automerge_threshold), an L0 merge is triggered, combining all L0 Runs into one.
    2. If the resulting Run exceeds 25MB (configurable via level0_upgrade_size), it is promoted to L1.
    3. If total L1 Run size exceeds 1000MB (configurable via level1_upgrade_size), L1 merge is triggered, combining all L1 Runs into one.
    4. If the resulting Run exceeds 1000MB (configurable via level1_upgrade_size), it is promoted to L2.
    5. Runs in L2 are not merged further.
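
As a worked example with the defaults: if each insert creates a 1 MB Run, the 32nd insert triggers an L0 merge that produces one 32 MB Run; since 32 MB exceeds the 25 MB level0_upgrade_size, that Run is promoted to L1.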

2.1.5 Columnstore

  • MARS2 uses columnstore format. During disk access, only required columns are read, reducing I/O.
  • Data in the same column has uniform type, enabling better compression and saving disk space.
  • Columnstore is also optimized for vectorized execution, significantly accelerating query performance. See Vectorized Execution Engine.

2.1.6 MINMAX Filtering

  • As mentioned, Run metadata stores min/max values for query filtering.
  • Since filtering uses metadata without loading actual data, I/O overhead is lower than sequential scans (which load data first, then filter), enabling faster filtering.
  • MINMAX is not enabled by default in MARS2. It must be explicitly declared.

First, create the matrixts extension:

=# CREATE EXTENSION matrixts;

Then explicitly enable MINMAX:

=# CREATE TABLE metrics (
    ts              timestamp   ENCODING(minmax),
    dev_id          bigint      ENCODING(minmax),
    power           float,
    speed           float,
    message         text
) USING MARS2;

Create the mars2_btree index:

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

  • MINMAX can be defined on both sort-key and non-sort-key columns; it is most effective on sort keys. MINMAX can be written in all uppercase or all lowercase.
  • Up to 32 columns can have MINMAX enabled.

2.1.7 MARS2 Btree Index

  • Besides MINMAX, MARS2 has a built-in index, defined once via CREATE INDEX when the table is created.
  • Only one index can be defined per table, meaning only one global sort order is allowed.
  • Unlike standard Btree indexes, the MARS2 Btree index:
    • Is sparse, so it is smaller in size
    • Avoids random I/O during index scans because the underlying data is already sorted

2.1.8 Compression

Same as MARS3.

2.1.9 Data Ingestion

  • Data is written into memory via INSERT, then processed into a Run.
  • If the inserted data exceeds sort memory, multiple Runs are created.
  • Sort memory size is configurable. See "2.2.2 Configuration Parameters".

2.1.10 Update and Delete

  • MARS2 does not currently support update or delete operations.
  • Updates can be emulated using Unique Mode.
  • Deletion can only be done by dropping tables or by truncating partitions of a partitioned table.

2.2 Using MARS2

2.2.1 Creating a MARS2 Table

After creating the matrixts extension, the simplest way to create a MARS2 table is to use the USING clause in CREATE TABLE and create an index. For extended examples, see Table Design Best Practices.

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) USING MARS2;

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

2.2.2 Configuration Parameters

Note!
Table-level parameters can only be set at table creation using the WITH clause, apply to a single table, and cannot be changed afterward. Global parameters can be set at session or system level. System-level changes require mxstop -u to take effect. For more information, see Database Table Parameters.

The following parameters control merging behavior (see Level section above).

Parameter                  Unit  Default  Range      Description
mars2_automerge_threshold  Runs  32       10 ~ 2048  Number of L0 Runs that triggers a merge, for all MARS2 tables. Global parameter; per-table override: level0_merge_threshold.
level0_merge_threshold     Runs  32       1 ~ 2048   Per-table L0 merge threshold. Table-level parameter.
level0_upgrade_size        MB    25       1 ~ 10000  Per-table L0 → L1 upgrade size: a merged Run larger than this is promoted to L1. Table-level parameter.
level1_upgrade_size        MB    1000     1 ~ 10000  Per-table L1 → L2 upgrade size: a merged Run larger than this is promoted to L2. Table-level parameter.
  • If small inserts (each creating a Run) trigger frequent, inefficient merges, increase mars2_automerge_threshold / level0_merge_threshold to reduce the merge frequency.
  • If merged Runs in L0/L1 never reach the upgrade size, they are re-merged later, causing write amplification. To avoid this, lower level0_upgrade_size / level1_upgrade_size so that Runs are promoted earlier.

The following parameter controls compression behavior.

Parameter           Unit    Default  Range       Description
compress_threshold  tuples  1200     1 ~ 100000  Compression threshold: maximum number of tuples compressed in one batch (one compression unit). Table-level parameter.

The following parameters control sort memory. When an insert spans multiple partitions of a partitioned table, each partition receives at least mars2_sort_mem_core, and total usage does not exceed mars2_sort_mem.

Parameter            Unit  Default         Range                        Description
mars2_sort_mem       KB    2097152 (2 GB)  128 ~ 2147483647 (~2048 GB)  Sort memory per insert; for partitioned tables, shared among partitions. Global parameter.
mars2_sort_mem_core  KB    16384 (16 MB)   128 ~ 2147483647 (~2048 GB)  Minimum sort memory per partition. Global parameter.

Table-level configuration example:

=# CREATE TABLE metrics (
    ts              timestamp,
    dev_id          bigint,
    power           float,
    speed           float,
    message         text
) 
USING MARS2
WITH (compress_threshold=1200, level0_merge_threshold=32);

=# CREATE INDEX ON metrics
USING mars2_btree (ts, dev_id);

Session-level global parameter example:

=# SET mars2_sort_mem TO 2097152;

System-level global parameter example:

$ gpconfig -c mars2_sort_mem -v 2097152
$ mxstop -u


3 HEAP Overview

HEAP is the default storage engine in YMatrix, inherited from PostgreSQL. It supports only rowstore format and does not support columnstore or compression. It uses MVCC and is suitable for workloads with frequent updates and deletes.

3.1 MVCC Mechanism

Under MVCC, HEAP tables do not physically delete data during updates or deletes. Instead, old data is masked using version visibility. As a result, frequent updates and deletes increase physical storage usage. You should periodically run VACUUM to reclaim space.

3.2 Using HEAP

Use the following SQL to create a HEAP table in YMatrix:

=# CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);
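
Following the MVCC note above, space held by dead row versions can then be reclaimed periodically:

=# VACUUM disk_heap;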


4 AO Overview

Tables using the AOCO or AORO storage engines are collectively called AO (Append-Optimized) tables. They support insert, update, and delete operations, as well as compression.
AORO stores data in rowstore format; AOCO stores it in columnstore format.

AO tables differ significantly from HEAP tables in both logical and physical structure. While HEAP uses MVCC to manage data visibility after updates and deletes, AO tables use an auxiliary bitmap table to track visible rows.

For AO tables with frequent updates and deletes, regular cleanup is required: VACUUM must reset the bitmap and rewrite the physical files, which is typically slower than vacuuming a HEAP table.
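
A minimal sketch of creating an AOCO table; the appendoptimized and orientation options follow Greenplum-style syntax and are an assumption, not taken from this document:

=# CREATE TABLE events_ao (
    ts  timestamp,
    val float
)
WITH (appendoptimized=true, orientation=column, compresstype=zstd)
DISTRIBUTED BY (ts);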

Note!
For detailed information on storage engines, usage, and best practices, see Table Design Best Practices.