Using Compression

This document introduces the compression algorithms available in YMatrix and how to use them.

General Compression Algorithms

Concept

General compression refers to compression algorithms that compress data blocks directly, with no knowledge of the internal structure of the data. Such algorithms typically encode based on the characteristics of the binary data to reduce redundancy in stored data. The compressed data cannot be accessed randomly: compression and decompression must operate on the entire data block as a unit.

For data blocks, YMatrix supports three general compression algorithms: zlib, lz4, and zstd.
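The whole-block behavior can be illustrated with Python's standard zlib module (an illustration of the general principle only, not of YMatrix internals):

```python
import struct
import zlib

# Pack 1,000 (timestamp, value) rows into one raw data block.
rows = [(1700000000 + i, i % 7) for i in range(1000)]
block = b"".join(struct.pack(">qq", ts, v) for ts, v in rows)

# General compression treats the block as an opaque byte stream.
compressed = zlib.compress(block, 6)
print(f"raw {len(block)} bytes -> compressed {len(compressed)} bytes")

# No random access: reading even a single row requires decompressing
# the entire block first.
restored = zlib.decompress(compressed)
ts500, v500 = struct.unpack_from(">qq", restored, 500 * 16)
```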

Usage

The three general compression algorithms, lz4, zstd, and zlib, are specified in the WITH clause when creating a table, for example:

=# WITH (compresstype=zstd, compresslevel=3, compress_threshold=1200)

Notice! For more information on the usage of the WITH clause, see CREATE TABLE.

The parameters are as follows:

| Parameter name | Default | Min | Max | Description |
| --- | --- | --- | --- | --- |
| compress_threshold | 1200 | 1 | 8000 | Compression threshold: the upper limit on the number of tuples compressed together as a single unit. |
| compresstype | lz4 | - | - | Compression algorithm. Supported values: zstd, zlib, lz4. |
| compresslevel | 1 | 1 | depends on the algorithm | Compression level. The smaller the value, the faster but less effective the compression; the larger the value, the slower but more effective the compression. Valid ranges differ by algorithm: zstd 1-19, zlib 1-9, lz4 1-20. |

Notice! Generally speaking, the higher the compression level, the higher the compression ratio and the lower the speed, but this is not absolute.
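This level trade-off can be illustrated with Python's zlib, one of the three supported algorithms (an illustration of the general behavior, not specific to YMatrix):

```python
import zlib

# A compressible payload of repeated sensor-style records.
payload = b"".join(b"sensor-%04d,%d;" % (i % 50, i % 7) for i in range(5000))

fast = zlib.compress(payload, 1)   # low level: fastest, larger output
small = zlib.compress(payload, 9)  # high level: slowest, smallest output
print(len(payload), len(fast), len(small))
```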

Encoding Chain

In addition to the general compression algorithms, we encourage you to try YMatrix's self-developed custom compression mechanism, the **encoding chain (mxcustom)**.

Concept

Unlike general compression, the encoding chain is built on compression algorithms that are aware of the format and semantics of the data inside a data block. In a relational database, data is organized as tables, and each column has a fixed data type, which guarantees a certain logical similarity within a column. Moreover, in some scenarios the data in adjacent rows of a business table may also be similar, so storing and compressing data column by column can yield a better compression ratio.

The encoding chain has the following capabilities:

  • Multiple encoding/compression algorithms. A series of encoding and compression algorithms has been developed for different data types and data characteristics. Each algorithm has its own applicable scenarios, compression ratio, and speed; by specifying algorithms precisely, a good compression effect can be achieved.
  • Combined compression. The type-aware encoding/compression algorithms can be further combined with general algorithms for an even better compression effect.
  • Column-level compression. The encoding chain can assign different compression combinations to individual columns, enabling fine-grained tuning. Multiple algorithms can be specified for one column to implement multi-stage compression.

Advantages

The encoding chain has obvious advantages for compressing time series data. Time series data has strong characteristics, such as regular time intervals, independence between columns, and gradual change over time. General compression algorithms such as lz4 and zstd are byte-stream oriented; because these characteristics are neither perceived nor exploited, brute-force compression falls far short of the ideal result.

The encoding chain can make full use of the characteristics of time series data to compress table data deeply. The benefits of deep compression are as follows:

  • Significant storage savings. The smaller data size greatly reduces storage costs, making it possible to hold more data in the same space and at the same machine scale.
  • Reduced disk I/O overhead. The smaller data size also reduces disk I/O, which significantly improves the speed of queries involving heavy disk I/O, especially on hard disk drives (HDDs) and for cold data (data queried at low frequency).
  • Deep optimization and faster queries. Targeted compression algorithms are simpler and leave room for deep optimization, so they decompress faster, which in turn offers a chance to further accelerate queries.

Limitations

  • By its nature, the encoding chain is strongly tied to data characteristics; it is not a general-purpose tool and has a certain barrier to use. Choose the corresponding algorithm carefully.
  • The encoding chain applies only to MARS2 and MARS3 tables.

Notice! The easiest way to use the encoding chain is to configure adaptive encoding, which judges data characteristics at runtime and automatically selects a reasonable encoding method. See below for details.

Usage

The main usages of the encoding chain are shown in the following table:

| No. | Usage |
| --- | --- |
| 1 | Column-level compression |
| 2 | Table-level compression (the algorithm can be modified afterwards) |
| 3 | Table-level and column-level compression specified together |
| 4 | Adaptive encoding (AutoEncode) |

The specific usages are described below. Whichever usage you choose, you must first create the extension:

=# CREATE EXTENSION matrixts;

Column-level compression

Customize compression for each column of t1. encodechain specifies the encoding combination (either a single algorithm or multiple algorithms); multiple algorithms are separated by commas, as in the following example:

=# CREATE TABLE t1(
   f1 int8 ENCODING(encodechain='deltadelta(7), zstd', compresstype='mxcustom'),
   f2 int8 ENCODING(encodechain='lz4', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

Column-level compression can also be specified with the following SQL:

=# CREATE TABLE t1_1(
   f1 int8, COLUMN f1 ENCODING (encodechain='lz4, zstd', compresstype='mxcustom'),
   f2 int8, COLUMN f2 ENCODING (encodechain='lz4', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

DEFAULT COLUMN ENCODING specifies a compression algorithm for all columns by default, which is equivalent to table-level compression.

=# CREATE TABLE t1_2(
   f1 int8,
   f2 int8,
   DEFAULT COLUMN ENCODING (encodechain='auto', compresstype='mxcustom')
   )
   USING MARS3
   ORDER BY (f1);

Table-level compression

Suppose you want to apply table-level compression to table t2 using the zstd algorithm. There are two options: with or without the encoding chain. The main difference is that table-level compression via the encoding chain allows the compression algorithm to be changed again with SQL statements after the table is created.

Use the encoding chain to apply table-level zstd compression to table t2_1:

=# CREATE TABLE t2_1 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='zstd'
   )
   ORDER BY (f1);

Use the encoding chain to compress table t2_2 with a combination of the zstd and lz4 algorithms:

=# CREATE TABLE t2_2 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='zstd, lz4'
   )
   ORDER BY (f1);   

Modify the table-level compression algorithm to adaptive encoding:

=# ALTER TABLE t2_1 SET (encodechain='auto');

Specify both table-level and column-level compression

In Example 1, the auto and lz4 compression algorithms are specified for table t3_1 and its column f1 respectively. Because **column-level compression takes precedence over table-level compression**, column f1 is compressed with lz4, while the remaining columns of t3_1 (column f2) use adaptive encoding.

=# CREATE TABLE t3_1 (
   f1 int8 ENCODING(compresstype='lz4'),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='auto'
   )
   ORDER BY (f1);

In Example 2, the auto algorithm is specified for table t3_2, and the lz4, deltazigzag combination is specified for its column f1.

=# CREATE TABLE t3_2 (
   f1 int8 ENCODING(compresstype='mxcustom', encodechain='lz4, deltazigzag'),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype='mxcustom',
   encodechain='auto'
   )
   ORDER BY (f1);   

Adaptive encoding

YMatrix's encoding chain supports adaptive encoding: at runtime, the system judges data characteristics and automatically selects a reasonable set of encoding methods.

Use table-level adaptive encoding for table t4:

=# CREATE TABLE t4 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom
   )
   ORDER BY (f1);

You can also explicitly specify the encoding chain as auto; the two usages are equivalent, so choose either one.

=# CREATE TABLE t4 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   encodechain=auto
   )
   ORDER BY (f1);

Specify table-level lz4 compression and column-level adaptive encoding for table t5. In this case, column f2 uses lz4, because no compression algorithm is specified for it individually.

=# CREATE TABLE t5 (
   f1 int8 ENCODING (
              compresstype=mxcustom,
              encodechain=auto
              ),
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   encodechain=lz4
   )
   ORDER BY (f1);

Under adaptive encoding, an automode option is supported at the table level, with two choices: compression-ratio priority and speed priority. automode=1 means compression ratio is preferred; automode=2 means speed is preferred. The following example enables compression-ratio priority mode for table t6.

-- automode=1, auto for cost
-- automode=2, auto for speed
=# CREATE TABLE t6 (
   f1 int8,
   f2 int8
   ) 
   USING MARS3
   WITH(
   compresstype=mxcustom,
   automode=1
   )
   ORDER BY (f1);

Notice! Adaptive encoding cannot be specified simultaneously with other compression algorithms.

Appendix: Compression Algorithms

| Algorithm | Supported parameters | Description |
| --- | --- | --- |
| lz4 & zstd | compresslevel | The encoding chain incorporates lz4 and zstd into its encoding combinations and calls the compression library provided by the system for compression and decompression. lz4 suits scenarios that prioritize speed, especially decompression speed; zstd is more balanced. At the default compression level, lz4 decompresses significantly faster than zstd, while zstd compresses significantly better than lz4. Generally, the higher zstd's compression level, the higher the compression ratio and the lower the speed, but this is not absolute. |
| deltadelta | Scaling factor (optional). The scaling factor is the number of bits to scale by; for example, deltadelta(7) scales the difference by 7 bits before storing it. No scaling by default. | The principle of deltadelta is to take the second-order (quadratic) difference of adjacent values, which is especially suitable for sorted timestamps. A strictly ordered timestamp sequence without missing values becomes an all-zero sequence, which compresses very well. If some timestamps are missing, the differences may still be large values. deltadelta applies only to integers and suits cases where the quadratic difference is a small integer. |
| deltazigzag | Scaling factor (optional) | The principle of deltazigzag is to take a first-order difference, use zigzag encoding to map possibly negative numbers to non-negative ones, and then compress them into a smaller size with variable-length integer encoding. Suitable for integer sequences with small intervals; no sorting requirement. |
| Gorilla | - | Gorilla encoding compresses floating-point numbers. The principle is to XOR each value with the preceding value, compressing away the leading and trailing zeros of the result. Currently only the double type is supported, i.e. 8 bytes as one data unit. |
| Gorilla2 | - | An optimized version of Gorilla. Gorilla2 captures more common data characteristics than Gorilla, and in most time series scenarios its compression ratio is significantly higher. Gorilla2 is on par with zstd in compression ratio and compression time, and has a clear advantage over zstd in decompression speed. It currently supports the float4 and float8 types. |
| Floatint | Scaling factor (required) | In some cases Gorilla's compression of floating-point numbers is not effective. For example, in Internet-of-Vehicles time series scenarios, the latitude and longitude of a vehicle are slowly changing floating-point values, and Gorilla's compression gain is almost zero, while the combination of floatint and deltadelta can achieve a ratio of more than ten times. The reason is that floating-point numbers have a special internal representation: even when adjacent values change little, XOR does not necessarily produce many zero bits, whereas the integer sequence obtained after suitable scaling preserves the similarity well and is easier to compress. **Note that floatint's scaling causes a certain loss of precision, and the introduced error is related to the scaling factor: with a scaling factor of 4, the maximum error is 0.0001.** |
| simple8b | - | simple8b suits integers with a small range. The principle is to store multiple small integers in 8 bytes of space; for example, if the data consists entirely of integers < 8, each number can be stored in 3 bits, achieving a good compression effect. In such cases, lz4 may compress poorly because the data appears irregular. |
| fds | - | In time series scenarios, floating-point columns are often used to store integer information; fds is an encoding scheme for this case. By recognizing that the data is actually integral, it first converts the data to binary integers and then compresses it further. Evaluated on the TSBS cpu-only data set (13 columns in total, of which 10 numeric columns are random integers stored as float8), fds's compression ratio is 2 times that of zstd (the compressed size is 30% of zstd's). |
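The deltadelta idea above can be sketched in a few lines of Python (an illustration of the principle only; it does not reproduce YMatrix's on-disk format):

```python
def deltadelta_encode(values, scale_bits=0):
    # Keep the first value and the first delta as headers; every later
    # entry is the difference between consecutive deltas (the quadratic
    # difference), optionally shifted right by the scaling factor.
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    dd = [b - a for a, b in zip(deltas, deltas[1:])]
    return [values[0], deltas[0]] + [d >> scale_bits for d in dd]

def deltadelta_decode(encoded, scale_bits=0):
    if len(encoded) < 2:
        return list(encoded)
    out = [encoded[0], encoded[0] + encoded[1]]
    delta = encoded[1]
    for d in encoded[2:]:
        delta += d << scale_bits
        out.append(out[-1] + delta)
    return out

# A strictly regular timestamp sequence collapses to an all-zero tail,
# which a back-end compressor shrinks almost to nothing.
ts = [1700000000 + i for i in range(10)]
enc = deltadelta_encode(ts)
print(enc)  # [1700000000, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The right shift models the scaling factor (as in deltadelta(7)); it round-trips exactly only when the quadratic differences are multiples of 2^scale_bits.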
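The deltazigzag pipeline (first-order difference, zigzag mapping of negatives, then variable-length integer encoding) can also be sketched in Python; again this is an illustration of the principle, not YMatrix's actual storage format:

```python
def zigzag(n: int) -> int:
    # Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... (64-bit signed).
    return (n << 1) ^ (n >> 63)

def varint(u: int) -> bytes:
    # Base-128 variable-length encoding: 7 payload bits per byte,
    # high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def deltazigzag_encode(values):
    prev = 0
    out = bytearray()
    for v in values:
        out += varint(zigzag(v - prev))
        prev = v
    return bytes(out)

# Unsorted readings with small deltas: most values cost a single byte
# instead of the 8 bytes of a raw int64.
readings = [500, 498, 503, 501, 499, 502]
enc = deltazigzag_encode(readings)
print(len(enc))  # 7 bytes instead of 48
```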
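The XOR step at the core of Gorilla can be observed directly on the bit patterns of two adjacent doubles (illustrative Python; the real encoder goes on to store only the meaningful bits between the leading and trailing zeros, which is omitted here):

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret an 8-byte IEEE 754 double as a 64-bit integer.
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# Two adjacent, similar values share the sign, exponent, and leading
# mantissa bits, so their XOR has long runs of leading/trailing zeros.
a, b = 21.5, 21.75
x = float_bits(a) ^ float_bits(b)
leading_zeros = 64 - x.bit_length()
trailing_zeros = (x & -x).bit_length() - 1 if x else 64
print(leading_zeros, trailing_zeros)  # 17 46
```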