You can use the general-purpose compression algorithms lz4, zstd, and zlib in YMatrix. Their parameters, listed below, are set in the WITH clause, for example WITH (compresstype=zstd, compresslevel=3).
Parameter name | Default value | Min value | Max value | Description |
---|---|---|---|---|
compress_threshold | 1200 | 1 | 8000 | Compression threshold. Controls how many tuples of a table are compressed together: it is the upper limit on the number of tuples compressed in the same unit. |
compresstype | lz4 | | | Compression algorithm. Supported: 1. zstd 2. zlib 3. lz4 |
compresslevel | 1 | 1 | | Compression level. The smaller the value, the faster the compression but the lower the compression ratio; the larger the value, the slower the compression but the higher the compression ratio. Valid ranges differ by algorithm: zstd: 1-19; zlib: 1-9; lz4: 1-20 |
Notes!
Generally speaking, the higher the compression level, the higher the compression ratio and the lower the speed. This is a tendency rather than a guarantee.
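The trade-off between compression level and compressed size can be observed outside the database. The following is a minimal sketch that uses Python's standard zlib module, not YMatrix itself. The payload is synthetic and the chosen levels are only an illustration, so the numbers it prints say nothing about real table data.

```python
import zlib

# Synthetic, highly regular payload: 10,000 timestamps one second apart,
# rendered as text lines. Purely illustrative data.
payload = "\n".join(str(1_700_000_000 + i) for i in range(10_000)).encode()


def compressed_sizes(data, levels=(1, 6, 9)):
    """Return {level: compressed size in bytes} for zlib at each level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


sizes = compressed_sizes(payload)
for level, size in sizes.items():
    # Ratio = original size / compressed size; larger means better compression.
    print(f"zlib level {level}: {size} bytes ({len(payload) / size:.1f}x)")
```

Running the same comparison on a sample of your own data is a quick way to judge whether a higher level is worth the extra CPU time.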
In addition to the general-purpose compression algorithms, we recommend trying YMatrix's self-developed custom compression algorithm, the encoding chain (mxcustom). Time series data has strong characteristics, such as regular time intervals, independence between columns, and gradual change over time. General-purpose algorithms such as lz4 and zstd operate on byte streams. Because they neither perceive nor exploit these characteristics, brute-force compression falls far short of the ideal result.
The encoding chain makes full use of the characteristics of time series data to compress table data deeply. It supports the following algorithms:
Algorithm | Supported Parameters | Description |
---|---|---|
lz4 & zstd | compresslevel | The encoding chain includes lz4 and zstd in its encoding combinations and calls the compression library provided by the system to compress and decompress. lz4 suits scenarios that prioritize speed, especially decompression speed. zstd is more balanced. At the default compression level, lz4 decompresses significantly faster than zstd, while zstd achieves a noticeably higher compression ratio than lz4. Generally speaking, the higher the zstd compression level, the higher the compression ratio and the lower the speed, but this is not absolute |
deltadelta | Scaling factor (optional). The scaling factor is the number of bits to scale by. For example, deltadelta(7) means that the difference is scaled by 7 bits before being stored. By default no scaling is applied | deltadelta takes the second-order difference between adjacent values, which makes it especially suitable for sorted timestamps. A strictly regular timestamp sequence with no missing values reduces to a sequence of zeros, which compresses very well. If some timestamps are missing, the differences may still be large. deltadelta applies only to integers and suits data whose second-order difference is a small integer |
deltazigzag | Scaling factor (optional) | deltazigzag takes the first-order difference, uses zigzag to convert possibly negative numbers into positive numbers, and then applies variable-length integer encoding to shrink them. Suitable for integer sequences with small intervals; no sorting is required |
Gorilla | | Gorilla encoding compresses floating-point numbers. It XORs each value with the preceding value and compresses the leading and trailing zero bits of the result. Currently only the double type is supported, that is, 8 bytes are used as one data unit |
Floatint | Scaling factor (required) | In some cases Gorilla does not compress floating-point numbers effectively. For example, in an Internet of Vehicles time series scenario, a car's latitude and longitude are slowly changing floating-point values, and Gorilla achieves almost no compression on them. Combining floatint with deltadelta can improve compression more than tenfold. The reason is that floating-point numbers have a special internal representation: even when adjacent values differ only slightly, XOR does not necessarily produce many zero bits, whereas the integer sequence obtained after scaling preserves the similarity and is easier to compress. Note that floatint scaling loses some precision, and the error introduced depends on the scaling factor. If the scaling factor is 4, the maximum error is 0.0001 |
simple8b | | simple8b suits integers with a small range. It stores multiple small integers in 8 bytes of space. For example, if the data consists entirely of integers < 8, each number can be stored in 3 bits, which compresses well. In this case, lz4 may compress poorly because the data is irregular |
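The transformations described in the table can be sketched in a few lines of Python. This is an illustration of the general techniques only, not YMatrix's implementation. The function names are hypothetical, the bit-level scaling of deltadelta and the variable-length integer step of deltazigzag are omitted, and floatint is assumed to scale by a power of ten (inferred from the statement that a scaling factor of 4 gives a maximum error of 0.0001).

```python
def delta_delta(values):
    """Second-order difference: keep the first value and the first delta,
    then store how much each subsequent delta changes."""
    if len(values) < 2:
        return list(values)
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [values[0], deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]


def zigzag(n):
    """Map a signed 64-bit integer to an unsigned one: 0, -1, 1, -2 -> 0, 1, 2, 3."""
    return (n << 1) ^ (n >> 63)


def delta_zigzag(values):
    """First-order difference followed by zigzag, so every stored number
    after the first is non-negative (and small when the deltas are small)."""
    if not values:
        return []
    return [values[0]] + [zigzag(b - a) for a, b in zip(values, values[1:])]


def floatint(values, digits):
    """Scale floats by 10**digits and round to integers (lossy).
    Assumption: the scaling factor counts decimal digits."""
    factor = 10 ** digits
    return [round(v * factor) for v in values]


# Regular timestamps collapse to zeros after the first two entries.
print(delta_delta([1000, 1010, 1020, 1030, 1040]))  # [1000, 10, 0, 0, 0]
# A missing sample (1020) leaves larger residuals.
print(delta_delta([1000, 1010, 1030, 1040]))        # [1000, 10, 10, -10]
# Unsorted small integers: negative deltas become small positives.
print(delta_zigzag([5, 3, 4, 4, 1]))                # [5, 3, 2, 0, 5]
# Slowly changing longitude: scale to integers, then difference.
longitudes = [116.3975, 116.3976, 116.3977, 116.3978]
print(delta_delta(floatint(longitudes, 4)))         # [1163975, 1, 0, 0]
```

The last line shows why the floatint and deltadelta combination works well on slowly changing coordinates: after scaling, the sequence is as regular as a timestamp column.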
The main ways of using the encoding chain are shown in the following table:
Serial number | Usage |
---|---|
1 | Column level compression |
2 | Table-level compression (the algorithm can be changed after table creation) |
3 | Specify both table level and column level compression |
4 | Adaptive encoding (AutoEncode) |
Each usage is described below. Whichever one you choose, you need to create the extension first.
CREATE EXTENSION matrixts;
Specify a custom compression setting for each column of t1. encodechain specifies the encoding combination; its parameters are placed in parentheses and separated by commas, as shown below.
=# CREATE EXTENSION matrixts;
=# CREATE TABLE t1(
f1 int8 ENCODING(encodechain='deltadelta(7)', compresstype='mxcustom'),
f2 int8 ENCODING(encodechain='lz4', compresstype='mxcustom')
)
USING MARS2;
Create a mars2_btree index.
=# CREATE INDEX t1_index ON t1
USING mars2_btree(f1);
The following SQL can also be used to specify column level compression.
=# CREATE TABLE t1_1(
f1 int8,COLUMN f1 ENCODING (encodechain='lz4', compresstype='mxcustom'),
f2 int8,COLUMN f2 ENCODING(encodechain='lz4', compresstype='mxcustom')
)
USING MARS2;
=# CREATE INDEX t1_1_index ON t1_1
USING mars2_btree(f1);
DEFAULT COLUMN ENCODING specifies a compression algorithm for all columns by default, which is equivalent to table-level compression.
=# CREATE TABLE t1_2(
f1 int8,
f2 int8,
DEFAULT COLUMN ENCODING (encodechain='auto', compresstype='mxcustom')
)
USING MARS2;
=# CREATE INDEX t1_2_index ON t1_2
USING mars2_btree(f1);
Suppose you want to use the zstd compression algorithm for table-level compression of table t2. There are two options: with the encoding chain and without it. The main difference is that with the encoding chain, the compression algorithm can be changed with an SQL statement after the table is created.
To apply zstd-based table-level compression to table t2 through the encoding chain:
=# CREATE TABLE t2 (
f1 int8
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=zstd
);
=# CREATE INDEX t2_index ON t2
USING mars2_btree(f1);
Modify the table-level compression algorithm to adaptive encoding:
=# ALTER TABLE t2 SET (encodechain=auto);
Notes!
encodechain applies only to MARS2 tables. You can still use the original compression method in MARS2 tables as before, for example by specifying table-level compression in the WITH clause.
In the following example, the auto and lz4 compression algorithms are specified for table t3 and its column f1 respectively. Because the column-level compression setting takes precedence over the table-level one (except when the column specifies ENCODING(compresstype=none) or ENCODING(minmax), see below for details), column f1 ends up lz4-compressed and the remaining column of t3 (f2) is adaptively encoded.
=# CREATE TABLE t3 (
f1 int8 ENCODING(compresstype=lz4)
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=auto
);
YMatrix's encoding chain supports adaptive encoding: at run time the system examines the characteristics of the data and automatically selects a reasonable set of encoding methods.
Enable adaptive encoding at the table level for table t4 (encodechain=auto must be explicitly specified).
=# CREATE TABLE t4 (
f1 int8
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=auto
);
Specify lz4 at the table level and adaptive encoding at the column level for table t4. In this case, column f2 uses the table-level algorithm because no column-level algorithm is specified for it.
=# CREATE TABLE t4 (
f1 int8 ENCODING(encodechain=auto,compresstype=mxcustom)
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=lz4
);
Specify lz4 at the table level and the encoding-chain none algorithm at the column level for table t4. In this case, column f1 is not compressed; column f2 has no column-level algorithm specified, so it uses the table-level algorithm.
=# CREATE TABLE t4 (
f1 int8 ENCODING(encodechain=none,compresstype=mxcustom)
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=lz4
);
Specify the encoding-chain lz4 algorithm at the table level and, at the same time, the non-encoding-chain none algorithm at the column level for table t4. In this case, the non-encoding-chain settings of columns f1 and f2 are overridden by the table-level encoding-chain algorithm.
=# CREATE TABLE t4 (
f1 int8 ENCODING(compresstype=none)
, f2 int8 ENCODING(minmax)
)
USING MARS2
WITH(
compresstype=mxcustom
, encodechain=lz4
);
With adaptive encoding, the automode option is supported at the table level, with two choices: compression ratio priority and speed priority. In the example, compression ratio priority is enabled for table t4. automode=1 means compression ratio is preferred, and automode=2 means speed is preferred.
-- automode=1: prioritize compression ratio (storage cost)
-- automode=2: prioritize speed
CREATE TABLE t4 (
f1 int8
, f2 int8
)
USING MARS2
WITH(
compresstype=mxcustom
, automode=1
);