This document introduces the compression algorithms available in YMatrix and how to use them.
General compression refers to compression algorithms that compress data blocks directly, without any understanding of the data's internal structure. Such algorithms typically encode binary data based on its byte-level characteristics to reduce redundancy in stored data. The compressed data cannot be accessed randomly: compression and decompression must operate on the entire data block as a unit.
YMatrix supports three general compression algorithms for data blocks: zlib, lz4, and zstd. They are specified in the WITH clause when creating a table, for example:
=# WITH (compresstype=zstd, compresslevel=3, compress_threshold=1200)
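The WITH clause above is only a fragment; a minimal sketch of a complete statement using it is shown below. The table name, columns, and the MARS3 storage clause are illustrative assumptions borrowed from the later examples in this document.
=# CREATE TABLE sensor_raw (
ts timestamp,
value float8
)
USING MARS3
WITH (compresstype=zstd, compresslevel=3, compress_threshold=1200)
ORDER BY (ts);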
Notice! For more information on the usage of the WITH clause, see CREATE TABLE.
The parameters are as follows:
Parameter name | Default value | Minimum value | Maximum value | Description |
---|---|---|---|---|
compress_threshold | 1200 | 1 | 8000 | Compression threshold. Controls how many tuples are compressed together; it is the upper limit on the number of tuples compressed in the same unit. |
compresstype | lz4 | - | - | Compression algorithm. Supported values: zstd, zlib, lz4. |
compresslevel | 1 | 1 | Varies by algorithm | Compression level. The smaller the value, the faster the compression but the worse the compression ratio; the larger the value, the slower the compression but the better the compression ratio. Valid ranges differ by algorithm: zstd: 1-19; zlib: 1-9; lz4: 1-20. |
Notes! Generally speaking, a higher compression level gives a higher compression ratio and a lower speed, but this is not absolute.
In addition to the general compression algorithms, we encourage you to try YMatrix's self-developed custom compression algorithm: the encoding chain (mxcustom).
Unlike general compression, the encoding chain is built on compression algorithms that are aware of the format and semantics of the data inside a data block. In a relational database, data is organized as tables, and each column has a fixed data type, which guarantees a certain logical similarity within a column. In some scenarios, data in adjacent rows of a business table may also be similar, so compressing data column by column and storing it together yields a better compression ratio.
The encoding chain has the following strengths:
The encoding chain has a clear advantage for compressing time-series data. Time-series data has strong characteristics, such as regular time intervals, independence between columns, and gradual change over time. General compression algorithms such as lz4 and zstd operate on byte streams; because they neither perceive nor exploit these characteristics, the result of brute-force compression falls far short of what is achievable.
The encoding chain makes full use of the characteristics of time-series data to compress table data deeply.
Notice! The easiest way to use the encoding chain is to configure the adaptive encoding mode directly, which judges data characteristics at runtime and automatically selects a reasonable encoding method. See below for details.
The main usages of the encoding chain are shown in the following table:
Serial number | Usage |
---|---|
1 | Column-level compression |
2 | Table-level compression (the algorithm can be modified later) |
3 | Both table-level and column-level compression |
4 | Adaptive encoding (AutoEncode) |
The specific usage methods are described below. Whichever usage you choose, you must first create the extension:
=# CREATE EXTENSION matrixts;
The following example customizes the compression for each column of table t1. encodechain specifies the encoding combination, which may be a single algorithm or several; multiple algorithms are separated by commas.
=# CREATE TABLE t1(
f1 int8 ENCODING(encodechain='deltadelta(7), zstd', compresstype='mxcustom'),
f2 int8 ENCODING(encodechain='lz4', compresstype='mxcustom')
)
USING MARS3
ORDER BY (f1);
The following SQL can also be used to specify column-level compression.
=# CREATE TABLE t1_1(
f1 int8, COLUMN f1 ENCODING (encodechain='lz4, zstd', compresstype='mxcustom'),
f2 int8, COLUMN f2 ENCODING (encodechain='lz4', compresstype='mxcustom')
)
USING MARS3
ORDER BY (f1);
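As a side note, and an assumption not covered by this document: if YMatrix exposes Greenplum's pg_attribute_encoding catalog, the per-column encoding options stored for t1_1 can be inspected like this:
=# SELECT attnum, attoptions
FROM pg_attribute_encoding
WHERE attrelid = 't1_1'::regclass;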
DEFAULT COLUMN ENCODING specifies a compression algorithm for all columns by default, which is equivalent to table-level compression.
=# CREATE TABLE t1_2(
f1 int8,
f2 int8,
DEFAULT COLUMN ENCODING (encodechain='auto', compresstype='mxcustom')
)
USING MARS3
ORDER BY (f1);
Suppose you want to use the zstd compression algorithm for table-level compression of table t2. There are two options: with or without the encoding chain. The main difference is that table-level compression with the encoding chain allows the compression algorithm to be modified again via SQL after the table is created.
Use the encoding chain to apply table-level compression to table t2_1 based on the zstd algorithm. For example:
=# CREATE TABLE t2_1 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype='mxcustom',
encodechain='zstd'
)
ORDER BY (f1);
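For comparison, here is a minimal sketch of the second option mentioned above, i.e. table-level zstd compression without the encoding chain (general compression, as in the first section); per the note above, this form does not allow the algorithm to be modified afterwards. The table name t2_plain is hypothetical.
=# CREATE TABLE t2_plain (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype='zstd'
)
ORDER BY (f1);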
Use the encoding chain to apply table-level compression to table t2_2 with a combination of the zstd and lz4 algorithms. For example:
=# CREATE TABLE t2_2 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype='mxcustom',
encodechain='zstd, lz4'
)
ORDER BY (f1);
Modify the table-level compression algorithm to adaptive encoding:
=# ALTER TABLE t2_1 SET (encodechain='auto');
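Because the encoding chain allows the algorithm to be modified after table creation, you can likewise switch a table to another explicit chain; a sketch, with lz4 chosen purely for illustration:
=# ALTER TABLE t2_2 SET (encodechain='lz4');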
In Example 1, the auto and lz4 compression algorithms are specified for table t3_1 and its column f1 respectively. Since column-level compression takes precedence over table-level compression, column f1 is ultimately compressed with lz4, while the remaining columns of table t3_1 (here, column f2) use adaptive encoding.
=# CREATE TABLE t3_1 (
f1 int8 ENCODING(compresstype='lz4'),
f2 int8
)
USING MARS3
WITH(
compresstype='mxcustom',
encodechain='auto'
)
ORDER BY (f1);
In Example 2, the auto compression algorithm is specified for table t3_2 and the lz4, deltazigzag combination for its column f1; again, the column-level setting takes precedence for f1.
=# CREATE TABLE t3_2 (
f1 int8 ENCODING(compresstype='mxcustom', encodechain='lz4, deltazigzag'),
f2 int8
)
USING MARS3
WITH(
compresstype='mxcustom',
encodechain='auto'
)
ORDER BY (f1);
YMatrix's encoding chain supports adaptive encoding: the system judges data characteristics at runtime and automatically selects a reasonable set of encoding methods.
Use table-level adaptive encoding for table t4:
=# CREATE TABLE t4 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype=mxcustom
)
ORDER BY (f1);
You can also explicitly specify the encoding chain as auto; the two forms are equivalent, so either one may be used.
=# CREATE TABLE t4 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype=mxcustom,
encodechain=auto
)
ORDER BY (f1);
Specify table-level lz4 compression and column-level adaptive encoding for table t5. In this case, column f1 uses adaptive encoding, while column f2, for which no column-level algorithm is specified, falls back to the table-level lz4.
=# CREATE TABLE t5 (
f1 int8 ENCODING (
compresstype=mxcustom,
encodechain=auto
),
f2 int8
)
USING MARS3
WITH(
compresstype=mxcustom,
encodechain=lz4
)
ORDER BY (f1);
Under adaptive encoding, an automode option is supported at the table level, with two choices: compression-ratio priority and speed priority. automode=1 means compression ratio is preferred; automode=2 means speed is preferred. In the following example, compression-ratio priority mode is enabled for table t6.
-- automode=1, auto for cost
-- automode=2, auto for speed
=# CREATE TABLE t6 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype=mxcustom,
automode=1
)
ORDER BY (f1);
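A parallel sketch with speed-priority mode instead; the table name t7 is hypothetical.
=# CREATE TABLE t7 (
f1 int8,
f2 int8
)
USING MARS3
WITH(
compresstype=mxcustom,
automode=2
)
ORDER BY (f1);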
Notice! Adaptive encoding cannot be specified simultaneously with other compression algorithms.
Algorithm | Supported Parameters | Description |
---|---|---|
lz4 & zstd | compresslevel | The encoding chain incorporates lz4 and zstd into encoding combinations and calls the compression libraries provided by the system for compression and decompression. lz4 suits scenarios that emphasize speed, especially decompression speed; zstd is more balanced. At the default compression level, lz4 decompresses significantly faster than zstd, while zstd achieves a noticeably higher compression ratio than lz4. Generally, the higher the zstd compression level, the higher the compression ratio and the lower the speed, but this is not absolute. |
deltadelta | Scaling factor (optional). The scaling factor is the number of bits to scale by; for example, deltadelta(7) means the difference is scaled by 7 bits before being stored. No scaling by default. | The principle of deltadelta is to take the second-order difference (delta of delta) of adjacent values, which is especially suitable for sorted timestamps. A strictly ordered timestamp sequence with no missing values reduces to an all-zero sequence, which compresses very well; if some timestamps are missing, the differences may still be large. deltadelta applies only to integers and is suitable when the second-order difference is a small integer. |
deltazigzag | Scaling factor (optional) | The principle of deltazigzag is to take a first-order difference, use zigzag encoding to map possible negative numbers to positive numbers, and then compress them into a smaller size with variable-length integer encoding. Suitable for integer sequences with small intervals; no sorting requirement. |
Gorilla | | Gorilla encoding is used to compress floating-point numbers. The principle is to XOR each value with the preceding value, which compresses away the leading and trailing zero bits. Currently only the double type is supported, i.e. 8 bytes per data unit. |
Gorilla2 | | An optimized version of Gorilla that captures more common data characteristics. In most time-series scenarios, the compression ratio of Gorilla2 is significantly higher than that of Gorilla. Gorilla2 is on par with zstd in compression ratio and compression time, and has a clear advantage over zstd in decompression speed. It currently supports the float4 and float8 types. |
Floatint | Scaling factor (required) | In some cases Gorilla does not compress floating-point numbers well. For example, in Internet of Vehicles time-series scenarios, a vehicle's latitude and longitude are slowly changing floating-point values, and the compression ratio with Gorilla is almost zero, while the combination of floatint and deltadelta can improve it more than tenfold. The reason is that floating-point numbers have a special internal representation: even when adjacent values change little, XOR does not necessarily produce many zero bits in the fractional part, whereas the integer sequence obtained after scaling preserves the similarity well and is easier to compress. Note that floatint's scaling incurs some loss of precision, and the introduced error depends on the scaling factor; with a scaling factor of 4, the maximum error is 0.0001. |
simple8b | | simple8b is suitable for integers with a small range. The principle is to pack multiple small integers into 8 bytes of space; for example, if the data consists entirely of integers < 8, each value can be stored in 3 bits, giving a good compression ratio. In such cases lz4 may compress poorly because the data appears irregular. |
fds | | In time-series scenarios, floating-point columns are often used to store integer information; fds is an encoding scheme for this case. It identifies that the data is actually integral, converts it to binary integers first, and then compresses further. Evaluated on the TSBS cpu-only data set (13 columns in total, of which 10 numeric columns are random integers stored as float8), the compression ratio of fds is 2 times higher than that of zstd (the compressed size is 30% of zstd's). |
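To illustrate the floatint + deltadelta combination described above, here is a sketch of a column definition for a slowly changing longitude value. The lowercase name floatint and the floatint(4) scaling-factor syntax are assumptions by analogy with deltadelta(7); the table and column names are hypothetical.
=# CREATE TABLE vehicle_track (
ts timestamp,
longitude float8 ENCODING(encodechain='floatint(4), deltadelta', compresstype='mxcustom')
)
USING MARS3
ORDER BY (ts);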