Use compression

YMatrix supports the general-purpose compression algorithms lz4, zstd, and zlib. Their parameters are set in the WITH clause, for example WITH (compresstype=zstd, compresslevel=3).

1 General Compression Algorithm

Parameter name | Default value | Min value | Max value | Description
compress_threshold | 1200 | 1 | 8000 | Compression threshold. Controls how many tuples of a table are compressed together; it is the upper limit on the number of tuples compressed in the same unit.
compresstype | lz4 | | | Compression algorithm. Supported values: zstd, zlib, lz4.
compresslevel | 1 | 1 | | Compression level. Lower values compress faster but less effectively; higher values compress more slowly but more effectively. The valid range depends on the algorithm: zstd: 1-19; zlib: 1-9; lz4: 1-20.

Notes!
In general, a higher compression level gives a higher compression ratio at a lower speed, but this does not hold in every case.
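
The trade-off between level, size, and speed can be observed with any general-purpose compressor. The following is a minimal sketch that uses Python's standard zlib module as a stand-in; it does not call YMatrix, and the synthetic payload and the helper name compress_stats are assumptions made only for illustration.

```python
import time
import zlib

# Synthetic, repetitive CSV-like payload (an assumption for illustration only).
rows = [f"{1700000000 + i},device_{i % 10},{20 + (i % 7)}\n" for i in range(20000)]
data = "".join(rows).encode("ascii")


def compress_stats(payload, level):
    """Return (compressed size in bytes, elapsed seconds) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(compressed) == payload
    return len(compressed), elapsed


# zlib accepts levels 1-9, the same range the table lists for zlib.
for level in (1, 6, 9):
    size, seconds = compress_stats(data, level)
    ratio = len(data) / size
    print(f"level={level} size={size} ratio={ratio:.1f} time={seconds * 1000:.1f} ms")
```

The sizes and timings depend on the data and the machine, so it is worth measuring with a sample of real table data before choosing a compresslevel.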

2 Encoding Chain

In addition to the general-purpose compression algorithms, we recommend trying YMatrix's self-developed custom compression algorithm, the encoding chain (mxcustom). Time-series data has strong characteristics, such as regular time intervals, independence between columns, and gradual change over time. General-purpose algorithms such as lz4 and zstd operate on byte streams. Because they neither perceive nor exploit these characteristics, the result of such brute-force compression falls far short of the ideal.
The encoding chain makes full use of the characteristics of time-series data to compress table data deeply. The benefits of deep compression are as follows:

  • Significant savings in storage costs. A smaller data size lets the same disk space and the same cluster scale hold more data and retain more data assets.
  • Reduced disk I/O overhead. Because the data is smaller, less disk I/O is needed. Queries that involve heavy disk I/O speed up significantly, especially on hard disk drives (HDDs) and for cold data (data that is queried infrequently).
  • Deep optimization and faster queries. Targeted compression algorithms are simpler and leave room for deep optimization, so they can decompress at a higher speed, which can speed up queries further.

2.1 Introduction

Algorithm | Supported parameters | Description
lz4 & zstd | compresslevel | The encoding chain incorporates lz4 and zstd into the encoding combination and calls the system-provided compression libraries to compress and decompress. lz4 suits scenarios that prioritize speed, especially decompression speed; zstd is more balanced. At the default compression level, lz4 decompresses significantly faster than zstd, while zstd achieves a clearly higher compression ratio than lz4. In general, a higher zstd compression level gives a higher compression ratio at a lower speed, but this does not hold in every case.
deltadelta | Scaling factor (optional): the number of bits by which values are scaled. For example, deltadelta(7) means the difference is scaled by 7 bits before it is stored. No scaling by default. | deltadelta takes the second-order difference (the difference of differences) of adjacent values, which suits sorted timestamps especially well. A strictly regular timestamp sequence with no missing values becomes a sequence of all zeros, which compresses very well. If some timestamps are missing, the values after differencing may still be large. deltadelta applies only to integers and suits cases where the second-order difference is a small integer.
deltazigzag | Scaling factor (optional) | deltazigzag takes the first-order difference, uses zigzag encoding to map possibly negative numbers to non-negative numbers, and then applies variable-length integer encoding to store them in fewer bytes. It suits integer sequences whose adjacent values are close together; the data does not need to be sorted.
Gorilla | | Gorilla encoding compresses floating-point numbers. It XORs each value with the preceding value and compresses the leading and trailing zeros of the result. Currently only the double type is supported, that is, the data unit is 8 bytes.
floatint | Scaling factor (required) | In some cases Gorilla does not compress floating-point numbers effectively. For example, in an Internet of Vehicles time-series scenario, a vehicle's latitude and longitude are slowly changing floating-point values, and Gorilla achieves almost no compression on them, whereas combining floatint with deltadelta can reach a compression ratio of more than ten times. The reason is that floating-point numbers have a special internal representation: even when adjacent values differ only slightly, XORing them does not necessarily produce many zero bits, but the integer sequence obtained after scaling preserves the similarity well and is easier to compress. Note that floatint scaling loses some precision, and the error introduced depends on the scaling factor. If the scaling factor is 4, the maximum error is 0.0001.
simple8b | | simple8b suits integers with a small value range. It packs multiple small integers into 8 bytes. For example, if a run of data consists entirely of integers less than 8, each number can be stored in 3 bits, which gives a good compression effect. In this case lz4 may compress poorly because the data is irregular.
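
The transforms described in the table can be illustrated in a few lines of Python. This is a sketch of the general techniques only, not YMatrix's implementation: all function names are hypothetical, the floatint scaling factor is assumed to be a number of decimal digits (which is consistent with the stated error of 0.0001 for a factor of 4), and the bit packing shows only the idea behind simple8b rather than its exact storage layout.

```python
import struct


def delta_of_delta(values):
    """Second-order difference: differences of adjacent first-order differences."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    return [d2 - d1 for d1, d2 in zip(deltas, deltas[1:])]


def zigzag(n):
    """Map a signed 64-bit integer to a non-negative one: 0,-1,1,-2,2 -> 0,1,2,3,4."""
    return (n << 1) ^ (n >> 63)


def delta_zigzag(values):
    """First-order difference followed by the zigzag mapping."""
    return [zigzag(b - a) for a, b in zip(values, values[1:])]


def float_to_int(x, digits):
    """Assumed floatint behavior: scale by 10**digits and round (lossy)."""
    return round(x * 10 ** digits)


def xor_bits(a, b):
    """XOR of the IEEE 754 bit patterns of two doubles (the core step of Gorilla)."""
    (ia,) = struct.unpack(">Q", struct.pack(">d", a))
    (ib,) = struct.unpack(">Q", struct.pack(">d", b))
    return ia ^ ib


def pack_3bit(values):
    """Pack integers smaller than 8 into one word, 3 bits per value."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < 8:
            raise ValueError("value does not fit in 3 bits")
        word |= v << (3 * i)
    return word


def unpack_3bit(word, count):
    """Inverse of pack_3bit for a known number of values."""
    return [(word >> (3 * i)) & 0b111 for i in range(count)]


# Regular timestamps collapse to zeros; a gap leaves larger values behind.
print(delta_of_delta([1000, 1010, 1020, 1030, 1040]))  # [0, 0, 0]
print(delta_of_delta([1000, 1010, 1030, 1040]))        # [10, -10]
# Unsorted but close integers become small non-negative numbers.
print(delta_zigzag([5, 3, 4, 4]))                       # [3, 2, 0]
# Slowly changing longitudes become integers with small deltas.
print([float_to_int(v, 4) for v in (116.3971, 116.3972, 116.3974)])
# Identical adjacent doubles XOR to zero.
print(xor_bits(21.5, 21.5))                             # 0
```

Twenty 3-bit values occupy 60 bits, so they fit in a single 8-byte word, which is where the gain for small integers comes from.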

2.2 Use

The main ways to use the encoding chain are listed in the following table:

No. | Usage
1 | Column-level compression
2 | Table-level compression (the algorithm can be changed later)
3 | Table-level and column-level compression specified together
4 | Adaptive encoding (AutoEncode)

Each usage is described below. Whichever one you choose, you must create the matrixts extension first.

CREATE EXTENSION matrixts;

2.2.1 Column-level compression

Specify compression separately for each column of t1. encodechain specifies the encoding combination; parameters are placed in parentheses and separated by ",", as shown below.

=# CREATE EXTENSION matrixts;
=# CREATE TABLE t1(
  f1 int8 ENCODING(encodechain='deltadelta(7)', compresstype='mxcustom'),
  f2 int8 ENCODING(encodechain='lz4', compresstype='mxcustom')
)
USING MARS2;

Create a mars2_btree index.

=# CREATE INDEX t1_index ON t1 
USING mars2_btree(f1);

The following SQL can also be used to specify column level compression.

=# CREATE TABLE t1_1(
  f1 int8,COLUMN f1 ENCODING (encodechain='lz4', compresstype='mxcustom'),
  f2 int8,COLUMN f2 ENCODING(encodechain='lz4', compresstype='mxcustom')
)
USING MARS2;
=# CREATE INDEX t1_1_index ON t1_1 
USING mars2_btree(f1);

DEFAULT COLUMN ENCODING specifies a default compression algorithm for all columns, which is equivalent to table-level compression.

=# CREATE TABLE t1_2(
  f1 int8,
  f2 int8,
  DEFAULT COLUMN ENCODING (encodechain='auto', compresstype='mxcustom')
)
USING MARS2;
=# CREATE INDEX t1_2_index ON t1_2 
USING mars2_btree(f1);

2.2.2 Table-level compression

Suppose you want to compress table t2 at the table level with the zstd algorithm. There are two options: with the encoding chain or without it. The main difference is that with the encoding chain, the compression algorithm can be changed with an SQL statement after the table has been created.

To compress table t2 at the table level with zstd through the encoding chain:

=# CREATE TABLE t2 (
      f1 int8
    , f2 int8
) 
USING MARS2
WITH(
      compresstype=mxcustom
    , encodechain=zstd
);
=# CREATE INDEX t2_index ON t2 
USING mars2_btree(f1);

Modify the table-level compression algorithm to adaptive encoding:

=# ALTER TABLE t2 SET (encodechain=auto);

Notes!
encodechain applies only to MARS2 tables. MARS2 tables can still use the original compression method as before, for example by specifying table-level compression in the WITH clause.

2.2.3 Specify both table-level and column-level compression

In this example, auto encoding is specified for table t3 and the lz4 compression algorithm is specified for its column f1. Because a column-level specification takes precedence over the table-level one (except when the column specifies ENCODING(compresstype=none) or ENCODING(minmax); see below for details), column f1 is compressed with lz4 and the remaining column, f2, is adaptively encoded.

=# CREATE TABLE t3 (
      f1 int8 ENCODING(compresstype=lz4)
    , f2 int8
) 
USING MARS2
WITH(
      compresstype=mxcustom
    , encodechain=auto
);

2.2.4 Adaptive encoding

YMatrix's encoding chain supports adaptive encoding: at run time the system examines the characteristics of the data and automatically selects a suitable combination of encodings.

  1. Use table-level adaptive encoding for table t4 (encodechain=auto must be specified explicitly).
    =# CREATE TABLE t4 (
       f1 int8
     , f2 int8
    ) 
    USING MARS2
    WITH(
       compresstype=mxcustom
     , encodechain=auto
    );
  2. Specify lz4 at the table level and adaptive encoding at the column level for table t4. Column f2 has no algorithm of its own, so it uses the table-level algorithm.
    =# CREATE TABLE t4 (
       f1 int8 ENCODING(encodechain=auto,compresstype=mxcustom)
     , f2 int8 
    ) 
    USING MARS2
    WITH(
       compresstype=mxcustom
     , encodechain=lz4
    );
  3. Specify lz4 at the table level and the none encoding chain at the column level for table t4. Column f1 is not compressed; column f2 has no algorithm of its own, so it uses the table-level algorithm.
    =# CREATE TABLE t4 (
       f1 int8 ENCODING(encodechain=none,compresstype=mxcustom)
     , f2 int8 
    ) 
    USING MARS2
    WITH(
       compresstype=mxcustom
     , encodechain=lz4
    );
  4. Specify the lz4 encoding chain at the table level and non-encoding-chain settings (compresstype=none and minmax) at the column level for table t4. In this case the non-encoding-chain settings of columns f1 and f2 are overridden by the table-level encoding chain.
    =# CREATE TABLE t4 (
       f1 int8 ENCODING(compresstype=none)
     , f2 int8 ENCODING(minmax)
    ) 
    USING MARS2
    WITH(
       compresstype=mxcustom
     , encodechain=lz4
    );

    With adaptive encoding, the automode option is supported at the table level and offers a choice between compression-ratio priority and speed priority: automode=1 prioritizes the compression ratio and automode=2 prioritizes speed. The example enables compression-ratio priority for table t4.

    -- automode=1: prioritize compression ratio (storage cost)
    -- automode=2: prioritize speed
    CREATE TABLE t4 (
       f1 int8
     , f2 int8
    ) 
    USING MARS2
    WITH(
       compresstype=mxcustom
     , automode=1
    );
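
The precedence between table-level and column-level settings described in sections 2.2.3 and 2.2.4 can be summarized as a small decision function. This is a hypothetical model written only from the rules stated in this document; the function name effective_encoding and the dictionary representation of options are assumptions, not a YMatrix API.

```python
def effective_encoding(table_opts, column_opts):
    """Return the options that apply to one column (hypothetical model)."""
    if not column_opts:
        # No column-level ENCODING clause: the table-level setting applies.
        return dict(table_opts)
    table_uses_chain = table_opts.get("compresstype") == "mxcustom"
    plain_none = column_opts == {"compresstype": "none"}
    minmax_only = column_opts == {"minmax": True}
    if table_uses_chain and (plain_none or minmax_only):
        # Exception: non-encoding-chain none/minmax settings are
        # overridden by the table-level encoding chain.
        return dict(table_opts)
    # Otherwise the column-level specification takes precedence.
    return dict(column_opts)


table_lz4 = {"compresstype": "mxcustom", "encodechain": "lz4"}

# Case 2: column-level auto wins over table-level lz4.
print(effective_encoding(table_lz4, {"compresstype": "mxcustom", "encodechain": "auto"}))
# Case 3: the none encoding chain leaves the column uncompressed.
print(effective_encoding(table_lz4, {"compresstype": "mxcustom", "encodechain": "none"}))
# Case 4: plain compresstype=none and minmax fall back to the table-level chain.
print(effective_encoding(table_lz4, {"compresstype": "none"}))
print(effective_encoding(table_lz4, {"minmax": True}))
```

Treat the output as a reading aid for the examples above; the behavior of an actual table should be confirmed on a running system.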