Using Compression

This document describes the compression algorithms available in YMatrix and how to use them.

General-Purpose Compression Algorithms

Concept

General-purpose compression refers to compression algorithms that operate without knowledge of the internal structure of the data. These algorithms directly compress data blocks by encoding binary patterns to reduce redundancy. The compressed data cannot be randomly accessed and must be decompressed as a whole block.

YMatrix supports three general-purpose compression algorithms for data blocks: zlib, lz4, and zstd.
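
The whole-block behavior described above can be illustrated outside the database. The following is a minimal sketch using Python's standard `zlib` module (zlib is one of the three supported algorithms); the row format and the block contents are hypothetical and do not reflect YMatrix's actual storage layout.

```python
import zlib

# Hypothetical "data block": 1,000 similar rows serialized as bytes.
block = b"".join(b"sensor-%04d,21.5\n" % i for i in range(1000))

# The compressor sees only bytes; it knows nothing about rows or columns.
compressed = zlib.compress(block, 6)

# There is no random access into the compressed bytes: to read a single
# row, the whole block has to be decompressed first.
restored = zlib.decompress(compressed)
row_42 = restored.splitlines()[42]

print(len(block), "bytes raw ->", len(compressed), "bytes compressed")
print(row_42)
```

Reading one row costs a full-block decompression, which is why the block size (see compress_threshold below) affects both compression ratio and read cost.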

Usage

The lz4, zstd, and zlib compression algorithms are specified at table creation time in the WITH clause of CREATE TABLE. The fragment below shows only the WITH clause:

=# WITH (compresstype=zstd, compresslevel=3, compress_threshold=1200)

Note!
For more information about the WITH clause, see CREATE TABLE.

Parameter descriptions:

| Parameter | Default | Min | Max | Description |
|---|---|---|---|---|
| compress_threshold | 1200 | 1 | 8000 | Compression threshold. Sets the maximum number of tuples in a compression unit, that is, how many tuples are compressed per block. |
| compresstype | none | | | Compression algorithm. Supported values: zstd, zlib, lz4. |
| compresslevel | 0 | 1 | | Compression level. Lower values compress faster but with a lower ratio; higher values compress more slowly but with a better ratio. Valid ranges vary by algorithm: zstd 1–19, zlib 1–9, lz4 1–20. |

Note!
When compresslevel > 0 and compresstype is not specified, the default compresstype is zlib.
When compresstype is specified but compresslevel is not, the default compresslevel is 1.

Note!
Generally, higher zstd compression levels yield better ratios but slower performance. However, this is not always true.
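
The level trade-off is easy to measure on your own data before choosing a compresslevel. The sketch below uses Python's standard `zlib` module as a stand-in, because zstd and lz4 bindings are third-party packages; the payload is made up, and the sizes it prints are specific to that payload rather than representative of YMatrix tables.

```python
import zlib

# Hypothetical payload: repetitive, log-like rows.
payload = b"".join(
    b"2024-01-01T00:00:%02d,cpu0,%d\n" % (i % 60, i % 7) for i in range(5000)
)

sizes = {}
for level in (1, 6, 9):
    out = zlib.compress(payload, level)
    # Every level must round-trip to the same bytes.
    assert zlib.decompress(out) == payload
    sizes[level] = len(out)

for level, size in sizes.items():
    print(f"level={level}: {size} bytes ({len(payload) / size:.1f}x)")
```

Running the same loop over a sample of real data shows whether a higher level actually buys a better ratio for that workload.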

Encoding Chain

In addition to general-purpose compression, we recommend trying YMatrix's proprietary custom compression mechanism, the Encoding Chain (mxcustom).

Concept

Unlike general-purpose compression, the encoding chain leverages knowledge of the internal format and semantics of data. In relational databases, data is organized into tables where each column has a fixed data type, ensuring logical similarity among values in the same column. In many cases, adjacent rows also exhibit data similarity. By compressing and storing data column-wise, significantly better compression can be achieved.

The encoding chain provides the following capabilities:

  • Multiple encoding/compression algorithms: A series of encoding and compression methods are developed for different data types and patterns. Each algorithm has specific use cases, compression ratios, and performance characteristics. Fine-grained selection enables optimal compression.
  • Combined compression: Column-level encoding algorithms can be combined with general-purpose algorithms (e.g., zstd, lz4) for enhanced compression.
  • Column-level customization: Different compression strategies can be applied per column, enabling fine-tuned control. Multiple algorithms can be chained for multi-level compression on a single column.
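
The idea of chaining a column-aware encoding with a general-purpose compressor can be sketched as follows. This is an illustration only: the function names are hypothetical, plain first-order delta encoding and `zlib` stand in for the algorithms YMatrix actually ships, and the int64 serialization is an assumption.

```python
import struct
import zlib


def delta_encode(values):
    """First-order differencing: keep the first value, then the differences.

    Assumes a non-empty list of integers.
    """
    out = [values[0]]
    for prev, cur in zip(values, values[1:]):
        out.append(cur - prev)
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out


def pack_int64(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)


def unpack_int64(data):
    return list(struct.unpack(f"<{len(data) // 8}q", data))


# Hypothetical column: a steadily rising counter.
column = [1_000_000 + 3 * i for i in range(2000)]

# Stage 1 is column-aware (delta); stage 2 is general-purpose (zlib).
chained = zlib.compress(pack_int64(delta_encode(column)))
plain = zlib.compress(pack_int64(column))

# Decoding applies the stages in reverse order.
restored = delta_decode(unpack_int64(zlib.decompress(chained)))

print("zlib only:", len(plain), "bytes; delta + zlib:", len(chained), "bytes")
```

After the delta stage almost every value is the same small number, which leaves the general-purpose stage with far more redundancy to remove.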

Advantages

The encoding chain offers significant advantages for time-series data, which exhibits strong characteristics such as regular time intervals, column independence, and gradual value changes over time. General-purpose algorithms like lz4 and zstd operate on byte streams and fail to exploit these patterns, resulting in suboptimal compression.
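
To make the "regular time intervals" point concrete, here is a small sketch of second-order differencing (delta-of-delta), the technique behind the deltadelta algorithm listed in the appendix. The function names and the timestamps are hypothetical, and the optional scaling factor of deltadelta is not modeled.

```python
def delta_of_delta(values):
    """Second-order differencing.

    Returns the first value, the first delta, and the list of
    differences between consecutive deltas. Needs at least two values.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return values[0], deltas[0], dods


def restore(first, first_delta, dods):
    """Rebuild the original sequence from delta_of_delta output."""
    values = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        values.append(values[-1] + delta)
    return values


# Hypothetical timestamps sampled every 10 seconds, with one late sample.
timestamps = [1_700_000_000 + 10 * i for i in range(8)]
timestamps[5] += 2  # jitter on a single sample

first, first_delta, dods = delta_of_delta(timestamps)
print(dods)  # [0, 0, 0, 2, -4, 2]
```

A perfectly regular series yields only zeros, and small jitter yields small numbers near zero. A byte-stream compressor looking at the raw 8-byte timestamps does not see this structure.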

The encoding chain fully leverages time-series characteristics for deep compression, delivering three key benefits:

  • Reduced storage costs: Smaller data size significantly lowers storage costs, enabling more data to be stored within the same hardware footprint.
  • Lower disk I/O overhead: Reduced data size decreases disk I/O, improving query performance, especially for I/O-intensive queries on cold data stored on HDDs.
  • Query acceleration: Custom algorithms are simpler and highly optimized, enabling faster decompression and improved query speed.

Limitations

  • The encoding chain is highly dependent on data characteristics and is not a universal solution. Careful algorithm selection is required.
  • It is only supported on MARS2 and MARS3 tables.

Note!
The easiest way to use the encoding chain is to enable Adaptive Encoding (AutoEncode) mode, which automatically detects data patterns and selects appropriate encoding methods at runtime. See below for details.

Usage

The main usage patterns of the encoding chain are listed below:

No. Usage
1 Column-level compression
2 Table-level compression (supports algorithm modification)
3 Table-level and column-level compression combined
4 Adaptive Encoding (AutoEncode)

Before using any of these methods, create the required extension:

=# CREATE EXTENSION matrixts;

Column-Level Compression

Custom compression can be specified for each column in table t1. The ENCODING clause defines the encoding chain (single or multiple algorithms, separated by commas). Example:

=# CREATE TABLE t1(
   f1 int8 ENCODING(encodechain='deltadelta(7), zstd', compresstype='mxcustom'),
   f2 int8 ENCODING(encodechain='lz4', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

Alternatively, use the following syntax:

=# CREATE TABLE t1_1(
   f1 int8, COLUMN f1 ENCODING (encodechain='lz4, zstd', compresstype='mxcustom'),
   f2 int8, COLUMN f2 ENCODING(encodechain='lz4', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

Using DEFAULT COLUMN ENCODING applies a default compression method to all columns, equivalent to table-level compression:

=# CREATE TABLE t1_2(
   f1 int8,
   f2 int8,
   DEFAULT COLUMN ENCODING (encodechain='auto', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

Table-Level Compression

You can apply table-level compression using either the encoding chain or general-purpose algorithms; the examples below use tables t2_1 and t2_2. The key difference is that only the encoding chain lets you change the compression algorithm via SQL after the table has been created.

Example: Apply zstd compression at the table level using the encoding chain:

=# CREATE TABLE t2_1 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='zstd'
   )
   ORDER BY (f1);

Example: Apply a zstd + lz4 compression chain at the table level:

=# CREATE TABLE t2_2 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='zstd, lz4'
   )
   ORDER BY (f1);   

Modify the table-level compression to adaptive encoding:

=# ALTER TABLE t2_1 SET (encodechain='auto');

Combining Table-Level and Column-Level Compression

In Example 1, table t3_1 is assigned adaptive encoding (auto) and column f1 is assigned lz4. Since column-level settings take precedence, column f1 uses lz4, while other columns (e.g., f2) use adaptive encoding.

=# CREATE TABLE t3_1 (
   f1 int8 ENCODING(compresstype='lz4'),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='auto'
   )
   ORDER BY (f1);
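
The precedence rule in the example above can be modeled in a few lines. This is a conceptual sketch, not YMatrix code: the function name is hypothetical, and it assumes that a column-level ENCODING clause replaces the table-level options as a whole rather than being merged key by key, which the text does not specify.

```python
def effective_encoding(table_opts, column_opts):
    """Resolve the settings each column ends up with.

    Assumption: a column with its own options uses only those options;
    a column without any inherits a copy of the table-level options.
    """
    return {
        col: (dict(opts) if opts else dict(table_opts))
        for col, opts in column_opts.items()
    }


table_opts = {"compresstype": "mxcustom", "encodechain": "auto"}
column_opts = {
    "f1": {"compresstype": "lz4"},  # column-level setting wins
    "f2": {},                       # nothing specified: inherit from table
}

resolved = effective_encoding(table_opts, column_opts)
for col, opts in resolved.items():
    print(col, opts)
```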

In Example 2, both table t3_2 and column f1 have compression settings. Column f1 uses the specified chain lz4, deltazigzag, while f2 inherits the table-level auto setting.

=# CREATE TABLE t3_2 (
   f1 int8 ENCODING(compresstype='mxcustom', encodechain='lz4, deltazigzag'),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='auto'
   )
   ORDER BY (f1);   
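
For readers unfamiliar with deltazigzag, the sketch below walks through the three steps the appendix describes for it: first-order differencing, zigzag mapping of signed values to unsigned ones, and variable-length integer encoding. It is a generic illustration with hypothetical function names. The LEB128-style varint and the choice of 0 as the initial previous value are assumptions, and the optional scaling factor and YMatrix's on-disk format are not modeled.

```python
def zigzag_encode(n):
    """Map signed to unsigned so small magnitudes stay small:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return (n << 1) if n >= 0 else ((-n << 1) - 1)


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)


def varint(u):
    """LEB128-style encoding: 7 payload bits per byte, high bit = 'more'."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def delta_zigzag(values):
    """Delta against the previous value (starting from 0), zigzag, varint."""
    prev = 0
    out = bytearray()
    for v in values:
        out += varint(zigzag_encode(v - prev))
        prev = v
    return bytes(out)


# Hypothetical small-range integer column, in no particular order.
values = [100, 98, 101, 101, 99]
encoded = delta_zigzag(values)
print(len(encoded), "bytes instead of", 8 * len(values))  # 6 bytes instead of 40
```

Because the differences are small and zigzag keeps negative differences small too, most values fit in a single byte.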

Adaptive Encoding

YMatrix's encoding chain supports Adaptive Encoding, where the system automatically selects an optimal encoding method based on runtime data characteristics.

Enable adaptive encoding at the table level for table t4:

=# CREATE TABLE t4 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom
   )
   ORDER BY (f1);

Alternatively, explicitly specify encodechain=auto. Either method is acceptable.

=# CREATE TABLE t4 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   encodechain=auto
   )
   ORDER BY (f1);

Combine column-level adaptive encoding with table-level lz4 compression on table t5. Column f1 uses its column-level adaptive setting, while f2 inherits the table-level lz4 compression.

=# CREATE TABLE t5 (
   f1 int8 ENCODING (
              compresstype=mxcustom,
              encodechain=auto
              ),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   encodechain=lz4
   )
   ORDER BY (f1);

In adaptive mode, you can set the automode parameter at the table level to prioritize either compression ratio or speed. The example below enables ratio-first mode for table t6. automode=1 prioritizes compression ratio; automode=2 prioritizes speed.

-- automode=1: prioritize compression ratio (lower storage cost)
-- automode=2: prioritize speed
=# CREATE TABLE t6 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   automode=1
   )
   ORDER BY (f1);

Note!
Adaptive encoding cannot be combined with other compression algorithms.

Appendix: Compression Algorithms

| Algorithm | Parameters | Description |
|---|---|---|
| lz4 & zstd | compresslevel | Integrates lz4 and zstd into the encoding chain using the system compression libraries. lz4 excels in speed, especially decompression; zstd offers a better balance of ratio and speed. At default levels, lz4 decompresses faster than zstd, while zstd achieves higher compression ratios. Higher zstd levels generally yield better ratios but slower performance, though exceptions exist. |
| deltadelta | Scaling factor (optional). For example, deltadelta(7) scales differences by 7 bits before storage. Default: no scaling. | Applies second-order differencing, ideal for sorted timestamps without gaps. A perfectly regular sequence becomes all zeros, enabling high compression. Works only on integers and is effective when second differences are small. |
| deltazigzag | Scaling factor (optional) | Performs first-order differencing, then uses zigzag encoding to map negative values to positive ones, followed by variable-length integer encoding. Suitable for small-range integer columns; does not require the data to be ordered. |
| Gorilla | None | Designed for floating-point compression. XORs consecutive values and omits the leading and trailing zeros of the result. Currently supports only double (8-byte) values. |
| Gorilla2 | None | An improved version of Gorilla that captures broader data patterns. Offers significantly better compression than Gorilla in most time-series scenarios. Matches zstd in compression ratio and compression time, but outperforms zstd in decompression speed. Supports float4 and float8. |
| Floatint | Scaling factor (required) | Useful when Gorilla performs poorly on slowly changing floats (e.g., GPS coordinates). Converts floats to scaled integers before compression. Note: this introduces precision loss. The error depends on the scaling factor; e.g., factor 4 implies a maximum error of 0.0001. |
| simple8b | None | Ideal for small-range integers. Packs multiple small integers into 8 bytes. For example, values < 8 can be stored using 3 bits each, achieving good compression. lz4 may perform poorly on such irregular data. |
| fds | None | Designed for cases where floating-point columns store integer values (common in time-series data). Detects integer patterns, converts the values to binary integer format, then compresses. On the TSBS cpu-only dataset (13 columns, 10 of which are float8 columns holding random integers), fds is reported to achieve 2x better compression than zstd (30% of zstd's size). |
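
Two of the floating-point ideas in the table can be illustrated in a few lines: the XOR step that Gorilla-style encoders build on, and the float-to-scaled-integer conversion used by Floatint. This is a sketch under stated assumptions, not YMatrix's implementation. The function names are hypothetical, the Floatint scaling factor is interpreted as a number of decimal digits (consistent with "factor 4 implies a maximum error of 0.0001"), and rounding to the nearest integer is an assumption.

```python
import struct


def float_bits(x):
    """IEEE 754 bit pattern of a float8 value as a 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]


def xor_stats(prev, cur):
    """XOR two doubles; return (xor, leading_zeros, trailing_zeros)."""
    x = float_bits(prev) ^ float_bits(cur)
    if x == 0:
        return 0, 64, 64  # identical values: nothing to store
    leading = 64 - x.bit_length()
    trailing = (x & -x).bit_length() - 1
    return x, leading, trailing


def floatint_encode(values, digits):
    """Scale by 10**digits and round to integers (lossy)."""
    scale = 10 ** digits
    return [round(v * scale) for v in values]


def floatint_decode(ints, digits):
    scale = 10 ** digits
    return [i / scale for i in ints]


# Neighbouring readings differ in a single bit after XOR.
print(xor_stats(21.5, 21.75))  # (70368744177664, 17, 46)

# Hypothetical slowly changing coordinates.
gps = [116.397128, 116.397131, 116.397135]
ints = floatint_encode(gps, 4)
decoded = floatint_decode(ints, 4)
print(ints)  # [1163971, 1163971, 1163971]
```

In the XOR case only the few meaningful bits between the leading and trailing zeros need to be stored. In the Floatint case the three readings collapse to the same integer, which an integer encoding can then compress very well at the cost of the stated precision loss.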