Storage Engine

This document introduces the storage engine and storage diagnostic tools supported by MatrixDB, mainly including the following contents:

The storage engine is the storage base of the database system. The database uses the storage engine to create, query, update and delete data. Depending on your needs, different storage engines will provide different storage mechanisms and design in different dimensions such as physical layout, index type, and lock granularity.
Currently, the MatrixDB database system supports the use of the following storage engines: HEAP based on PostgreSQL, MARS2 developed by MatrixDB, and column storage engine AOCO and row storage engine AORO.
HEAP tables are storage engines provided by traditional PostgreSQL databases, also known as heap tables. This type of table supports a large number of concurrent read and write, transaction, index and other features.
The MARS2 table reduces the number of I/O addressing times with its physically ordered merging method, thereby improving the query performance of data in the table. At the same time, MARS2 supports compression, column storage, automatic archiving, pre-aggregation and other functions, which performs superiorly in timing scenarios.
MARS2 does not currently support updates and deletion.


1 Create tables under different storage engines

Depending on the different characteristics of the two storage engines, you can flexibly create different tables under different scenario requirements. We give the following example.

1.1 Create a HEAP table

The HEAP table is the default storage engine of MatrixDB, so if you do not specify the storage engine when creating the table, the created HEAP tables are all HEAP tables.

=# CREATE TABLE disk(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
)
DISTRIBUTED BY (tag_id);

1.2 Create a MARS2 table

MARS2 tables depend on matrixts timing extension. Before building a table, you first need to create an extension in the database using the storage engine.

Notes!
matrixts is extended to the database level, and it can be created once in a database without repeated creation.

=# CREATE EXTENSION matrixts;

When creating tables, use USING MARS2 to specify the storage engine:

=# CREATE TABLE vehicle_basic_data_mars2(
daq_time timestamp  encoding (minmax),
vin varchar(32)  COLLATE "C" encoding (minmax),
lng float encoding (minmax),
lat float encoding (minmax),
speed float ,
license_template varchar(16) ,
flag integer
)
USING MARS2
WITH (compresstype=zstd, compresslevel=3)
DISTRIBUTED BY (vin)
PARTITION BY RANGE (daq_time)
( START ('2022-07-01 00:00:00') INCLUSIVE
END ('2022-08-01 00:00:00') EXCLUSIVE
EVERY (INTERVAL '1 day')
,DEFAULT PARTITION others );

We will explain the specific parameters and statements in the vehicle_basic_data_mars2 table:

  1. encoding (minmax) This option is used to improve computing power. If you want to perform aggregated queries such as min, max, avg, sum, count based on this field, or use the WHERE conditional filtering option. The addition of this option needs to be determined based on the business scenario. Do not add it to all fields without thinking, otherwise this operation will bring a lot of unreasonable data writing to the CPU and disk I/O consumption.
  2. COLLATE "C" This option is only added in the Device Encoding field, and no other fields need to be added. With this operation, you can improve the sorting and querying speed of text types.
  3. USING MARS2 It is a fixed writing method using the MARS2 table and cannot change the content.
  4. WITH (compresstype=zstd, compresslevel=3) Specify the compression algorithm as zstd, as well as rle_type, zlib, lz4 compression algorithms. zstd is recommended in this table, and the compression level is 3 level. For detailed compression performance parameters, please refer to the table below.
  5. DISTRIBUTED BY (vin): Use this statement to select the distribution key. We recommend using the Device Encoding Field as the distribution key, so that the data of the same device can be queried and related calculations can be performed. This avoids performance losses caused by data redistribution between nodes.
  6. PARTITION BY RANGE (daq_time) Specifies the partition key for the partition table. We recommend using the Time of data acquisition by the device as the partition key. Usually a large number of queries are the time to filter data collection. For example, when you want to query the data within a day and perform related calculations, you need to add the filtering condition `WHERE daq_time >= CURRENT_DATE - INTERVAL '1 day', so that the database will quickly determine which partition subtable the data is on, so as to quickly locate the data and query the table.
  7. ( START ('2022-07-01 00:00:00') INCLUSIVE END ('2022-08-01 00:00:00') EXCLUSIVE EVERY (INTERVAL '1 days') ,DEFAULT PARTITION others); This SQL statement indicates that you will use the START...END statement and the INCLUSIVE and EXCLUSIVE keywords to create a subpartition table that starts from 2022-07-01 00:00:00 to 2022-08-01 00:00:00.
  8. EVERY (INTERVAL '1 day') The time span of the subpartition table is 1 day. In addition to using day, you can also use hour, month, year, etc., and determine it based on the size of the data volume. For example, in a day, your server receives up to 1 million to 10 million pieces of data, and the 1 day time interval is the best choice. If the data volume of a day is only tens of thousands or hundreds of thousands, then choosing 1 month is very suitable; if the data volume per day is only tens of thousands, then the 1 year sub-partition table can be generated once.
  9. DEFAULT PARTITION others Indicates the default partition. If the corresponding subtable storage of data cannot be found in other partitions, it will be stored in this partition.

Notes!
We hope you will design ideas based on our suggestions, but we don’t want you to blindly apply them. The specific timing scenarios are ever-changing, and specific analysis of specific situations is still very necessary.

The MARS2 table compression performance related parameters mentioned above:

Parameter name Default value Min value Maximum value Description
compress_threshold 1200 1 8000 Compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
compressiontype lz4 -- -- Compression algorithm, supported:
1. zstd
2. zlib
3. lz4
compresslevel 1 1 -- Compression level. Usually, the smaller the value, the smaller the compression rate, but the faster the compression; the larger the value, the higher the compression rate, but the slower the compression. Different algorithms have different valid values ​​ranges:
zstd: 1-19
zlib: 1-9
lz4: 1-20

Notes!
Generally speaking, the higher the compression level, the higher the compression rate and the lower the speed. But that's not absolute.

After successfully creating the MARS2 table, you must create an additional index of type mars2_btree to perform normal data reading and writing. The purpose of using index sorting is to make data of the same dimension or similar characteristics physically close as possible to reduce the number of I/O addressing and improve query efficiency. Therefore, the selection of sorting keys needs to comply with the main business query characteristics.
For example, the requirement is a single-device point query, then the sort key is the device number (vin) in the timing scenario. If the requirement is a detailed query, aggregation query or multi-device query within a certain time period, then the sort key is the device number and the timestamp (vin,daq_time).

=# CREATE INDEX idx_mars2 ON vehicle_basic_data_mars2 
USING mars2_btree(vin, daq_time);

Notes!
When data from the same time point in the device is reported in batches, MARS2 can merge data from the same device (the value of daq_time in this table) at the same time (the value of vin in this table). The merge feature requires that uniquemode=true be manually specified when creating an index, because the default value of this option is false. For example, when you specify uniquemode=true, the device A01 passes back 3 data at 2022-01-01 00:00:00, and finally the original two data will be overwritten according to the last data, only one data will be retained; but if you default uniquemode=false, then the 3 data sent back by the device 2022-01-01 00:00:00 will be retained in the end without any processing.


2 Storage diagnostic tool datainspect

datainspect is a storage diagnostic tool built into MARS2, which provides underlying data insights to accurately optimize the storage and query performance of your data.
Storage optimization requires paying attention to the actual content of the data in the physical file. Similar to PostgreSQL's Pageinspect, you can use datainspect to easily extract data segments in MARS2 physical storage for further analysis. In addition, datainspect integrates relevant index and meta information in MARS2 to provide basic information such as NULL distribution and minmax, thereby aiding in the optimization of the I/O scanning process.

Notes!
This tool is only available for MARS2 tables.

2.1 Installation

datainspect is a system function built in MARS2 and can be used after MARS2 is installed correctly. MARS2 tables depend on matrixts timing extension. Before building a table, you first need to create an extension in the database using the storage engine.

Notes!
matrixts is extended to the database level, and it can be created once in a database without repeated creation.

=# CREATE EXTENSION matrixts;

2.2 Introduction to function functions

First, create a test table named tb1 in the example. We will use this test table as a use case to introduce three related functions.

=# CREATE TABLE tb1(
f1 int8 encoding(minmax), 
f2 int8 encoding(minmax), 
f3 float8 encoding(minmax), 
f4 text
) USING MARS2;

Create a MARS2 index.

=# CREATE INDEX ON tb1 USING mars2_btree(f1);

Construct 24,000 test data.

=# INSERT INTO tb1 SELECT 
generate_series(1, 24000), 
mod((random()*1000000*(generate_series(1, 1200)))::int8, (random()::int8/100 + 100)), 
(random() * generate_series(1, 24000))::float8, 
(random() * generate_series(1, 24000))::text;

2.2.1 desc_ranges

The desc_ranges function integrates the underlying meta information and index of MARS2 to provide such as minmax index information and null value information, and also supports accurate monitoring of the underlying storage space occupancy of data.

  • grammar

    SELECT <* / column1,column2,...> FROM matrixts_internal.desc_ranges(<tablename TEXT>)
  • Parameters

  • tablename: table name, the partition table name when partitioning the table. (Required)

  • return

Fields Description
segno Segment number, starting with 0
attno Attribute Number, column number, starting with 0
forkno Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork
offno RANGE Offset in physical file, unit bytes. The default starts from 0. In MARS2, data is stored in batches and a storage RANGE is formed in units of set compress_threshold (compression threshold parameter), forming its own position identification with offsets in the physical file.
nbytes RANGE The actual space occupied in physical files, unit bytes
nrows compress_threshold set when creating tables, default is 1200. compress_threshold is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
nrowsnotnull RANGE Non-null number of bars
mmin If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL
mmax If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL.
  • Example
  1. Given a table, view the usage of each column storage space in the entire cluster.
    =# SELECT attno, sum(nbytes)/1024 as "Size in KB" 
     FROM matrixts_internal.desc_ranges('tb1') GROUP BY attno ORDER BY attno;
    attno |      Size in KB      
    -------+--------------------------------------------------------------------------------------------------------------------------
      0 |  94.9062500000000000
      1 |   7.8203125000000000
      2 | 187.3437500000000000
      3 | 386.3515625000000000
    (4 rows)
  2. Given a table, look at the null value and data distribution rules on a certain segment.
    =# SELECT * FROM matrixts_internal.desc_ranges('tb1') WHERE segno = 1;
    segno | attno | forkname | forkno | offno  | nbytes | nrows | nrowsnotnull |        mmin        |        mmax        
    -------+---------+-------+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 |     0 | data1    |    304 |      0 |   4848 |  1200 |         1200 | 15                 | 7240
      1 |     1 | data1    |    304 |  16376 |    856 |  1200 |          199 | 0                  | 99
      1 |     2 | data1    |    304 |  17712 |   9072 |  1200 |         1200 | 1.4602231817218758 | 704.8010557110921
      1 |     3 | data1    |    304 |  50024 |  20272 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |   4848 |   4856 |  1200 |         1200 | 7243               | 14103
      1 |     1 | data1    |    304 |  17232 |    160 |  1200 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  26784 |   9760 |  1200 |         1200 | 705.0931003474365  | 1372.9018354549075
      1 |     3 | data1    |    304 |  70296 |  19680 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |   9704 |   4856 |  1200 |         1200 | 14125              | 21417
      1 |     1 | data1    |    304 |  17392 |    160 |  1200 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  36544 |   9760 |  1200 |         1200 | 1375.043496121433  | 2084.906658862494
      1 |     3 | data1    |    304 |  89976 |  19792 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |  14560 |   1816 |   445 |          445 | 21429              | 23997
      1 |     1 | data1    |    304 |  17552 |    160 |   445 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  46304 |   3720 |   445 |          445 | 2086.0748374078717 | 2336.065046118657
      1 |     3 | data1    |    304 | 109768 |   7576 |   445 |          445 | NULL               | NULL
    (16 rows)

    2.2.2 show_range

    The show_range function selects a piece of physical data in MARS2 and displays this piece of data into readable data. The currently supported data types are: int2, int4, int8, float4, float8, timestamp, date, text.

  • grammar
    SELECT <* / column1,column2,...> FROM matrixts_internal.show_range(
     tablename text, 
     attno int4, 
     forkno int4, 
     offno int4, 
     nbytes int4
    )
  • Parameters
    • tablename: table name, if partition table is the partition table name. (Required)
    • attno: column number, columns when defining tables are numbered from 0 in order. (Required)
    • forkno: physical file shard number. (Required)
    • offno: The offset of data in a physical file, unit bytes. (Required)
    • nbytes: The actual space occupied by data in physical files, unit bytes. (Required)

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

  • return
Fields Description
rowno row number. This number is determined by the relative offset in RANGE, rather than the absolute offset of the entire table
val Actual content

Notes!
There may be errors in the val displayed on floating point data.

  • Example
  1. Given a physically stored RANGE, view the contents of this RANGE.
    =# SELECT * FROM matrixts_internal.show_range('tb1', 1, 304, 16176, 808) LIMIT 20;
    rowno | val 
    -------+--------
      1 | 4
      2 | 36
      3 | 81
      4 | 58
      5 | 17
      6 | 75
      7 | 11
      8 | 84
      9 | 60
     10 | 78
     11 | 69
     12 | 0
     13 | 87
     14 | 40
     15 | 72
     16 | 58
     17 | 17
     18 | 48
     19 | 70
     20 | 6
    (20 rows)

    2.2.3 dump_range

    The dump_range function decompresses the physical data selected in MARS2 into a binary file for secondary analysis.

  • grammar
    =# SELECT <* / column1,column2,...> FROM matrixts_internal.dump_ranges(
    tablename text, 
    attno int4, 
    outfile text, 
    forkno int4, 
    offno int4,
    limits int4
    );
  • Parameters
    • tablename: table name, if the table is a partitioned table, it is a partition subtable name. (Required)
    • attno: Column number. (Required)
    • outfile: Exported file name. After the export is completed, different segments will produce independent suffixes identified with different numbers. For example, the data file is named tb1-f2.bin, and after export, it appears as tb1-f2.bin-seg1 on Segment1. (Required)
    • forkno: physical file shard number. (Optional)
    • offno: The offset of data in a physical file, unit bytes. (Optional)
    • limits: The default is 100, and the number of RANGEs is determined by the default position specified by forkno and offno as the starting position. (Optional)

Note! Here limits limits only the number of exported RANGEs on a single segment.

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

  • return

    • nbytes: The amount of data exported on each Segment node, unit bytes.
  • Example

  1. Export a piece of physical data into a binary file
    =# SELECT * FROM matrixts_internal.dump_ranges('tb1', 1, '/data/demo/tb1-f2.bin', 304, 16176, 1) LIMIT 20;
    nbytes 
    --------
       0
       0
    1480
    1632
    1704
    1592
    (6 rows)

    In the example, after executing dump_ranges, it was found that 6 results were returned. This is because the export is performed independently by the Segment. Each of them represents the export results on a downstream Segment, which will appear in the corresponding directory. Among them, nbytes of 2 results is 0 because there is no data on the Segment that meets the filtering conditions.

After execution, each Segment host will generate a binary file with the original suffix and the independent suffix identifier .-seg<no>. ![](Binary file released by https://img.ymatrix.cn/ymatrix_home/datainspectddump_1676627032.png)