Storage diagnostic tool datainspect

datainspect is a storage diagnostic tool built into MARS2, which provides underlying data insights to accurately optimize the storage and query performance of your data.

Storage optimization requires paying attention to the actual content of the data in the physical file. Similar to PostgreSQL's Pageinspect, you can use datainspect to easily extract data segments in MARS2 physical storage for further analysis.
In addition, datainspect integrates relevant index and meta information in MARS2 to provide basic information such as NULL distribution and minmax, thereby aiding in the optimization of the I/O scanning process.

Notes!
This tool is only available for MARS2 tables.

1.1 Installation

datainspect is a system function built in MARS2 and can be used after MARS2 is installed correctly. MARS2 tables depend on matrixts timing extension. Before building a table, you first need to create an extension in the database using the storage engine.

Notes!
matrixts is extended to the database level, and it can be created once in a database without repeated creation.

=# CREATE EXTENSION matrixts;

1.2 Introduction to function functions

First, create a test table named tb1 in the example. We will use this test table as a use case to introduce four related functions.

=# CREATE TABLE tb1(
f1 int8 encoding(minmax), 
f2 int8 encoding(minmax), 
f3 float8 encoding(minmax), 
f4 text
) USING MARS2
WITH (compresstype=lz4);

Create a MARS2 index.

=# CREATE INDEX ON tb1 USING mars2_btree(f1);

Construct 24,000 test data.

=# INSERT INTO tb1 SELECT 
generate_series(1, 24000), 
mod((random()*1000000*(generate_series(1, 1200)))::int8, (random()::int8/100 + 100)), 
(random() * generate_series(1, 24000))::float8, 
(random() * generate_series(1, 24000))::text;

1.2.1 desc_ranges

The desc_ranges function integrates the underlying meta information and index of MARS2 to provide such as minmax index information and null value information, and also supports accurate monitoring of the underlying storage space occupancy of data.

  • grammar
    SELECT <* / column1,column2,...> FROM matrixts_internal.desc_ranges(<'tablename'>)
  • Parameters
Field Description Required
tablename table name, partition table name when partitioning table Yes
  • return
Fields Description
segno Segment number, starting with 0
attno Attribute Number, column number, starting with 0
forkno Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork
offno RANGE Offset in physical file, unit bytes. The default starts from 0. In MARS2, data is stored in batches and a storage RANGE is formed in units of set compress_threshold (compression threshold parameter), forming its own position identification with offsets in the physical file.
nbytes RANGE The actual space occupied in physical files, unit bytes
nrows compress_threshold set when creating tables, default is 1200. compress_threshold is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
nrowsnotnull RANGE Non-null number of bars
mmin If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL
mmax If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL.
  • Example
  1. Given a table, view the usage of each column storage space in the entire cluster.
    =# SELECT attno, sum(nbytes)/1024 as "Size in KB" 
     FROM matrixts_internal.desc_ranges('tb1') GROUP BY attno ORDER BY attno;
    attno |      Size in KB      
    -------+--------------------------------------------------------------------------------------------------------------------------
      0 |  94.9062500000000000
      1 |   7.8203125000000000
      2 | 187.3437500000000000
      3 | 386.3515625000000000
    (4 rows)
  2. Given a table, look at the null value and data distribution rules on a certain segment.
    =# SELECT * FROM matrixts_internal.desc_ranges('tb1') WHERE segno = 1;
    segno | attno | forkname | forkno | offno  | nbytes | nrows | nrowsnotnull |        mmin        |        mmax        
    -------+---------+-------+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 |     0 | data1    |    304 |      0 |   4848 |  1200 |         1200 | 15                 | 7240
      1 |     1 | data1    |    304 |  16376 |    856 |  1200 |          199 | 0                  | 99
      1 |     2 | data1    |    304 |  17712 |   9072 |  1200 |         1200 | 1.4602231817218758 | 704.8010557110921
      1 |     3 | data1    |    304 |  50024 |  20272 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |   4848 |   4856 |  1200 |         1200 | 7243               | 14103
      1 |     1 | data1    |    304 |  17232 |    160 |  1200 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  26784 |   9760 |  1200 |         1200 | 705.0931003474365  | 1372.9018354549075
      1 |     3 | data1    |    304 |  70296 |  19680 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |   9704 |   4856 |  1200 |         1200 | 14125              | 21417
      1 |     1 | data1    |    304 |  17392 |    160 |  1200 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  36544 |   9760 |  1200 |         1200 | 1375.043496121433  | 2084.906658862494
      1 |     3 | data1    |    304 |  89976 |  19792 |  1200 |         1200 | NULL               | NULL
      1 |     0 | data1    |    304 |  14560 |   1816 |   445 |          445 | 21429              | 23997
      1 |     1 | data1    |    304 |  17552 |    160 |   445 |            0 | NULL               | NULL
      1 |     2 | data1    |    304 |  46304 |   3720 |   445 |          445 | 2086.0748374078717 | 2336.065046118657
      1 |     3 | data1    |    304 | 109768 |   7576 |   445 |          445 | NULL               | NULL
    (16 rows)

    1.2.2 show_range

    The show_range function selects a piece of physical data in MARS2 and displays this piece of data into readable data. The currently supported data types are: int2, int4, int8, float4, float8, timestamp, date, text.

  • grammar
    SELECT <* / column1,column2,...> FROM matrixts_internal.show_range(
     tablename text, 
     attno int4, 
     forkno int4, 
     offno int4, 
     nbytes int4
    )
  • Parameters
Field Description Required
tablename table name, partition table name when partitioning table Yes
attno column number, when defining the table, the columns numbered from 0 in order Yes
forkno Physical file shard number Yes
offno Offset of data in physical file, unit bytes Yes
nbytes The actual space occupied by data in physical files, unit bytes Yes

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

  • return
Fields Description
rowno row number. This number is determined by the relative offset in RANGE, rather than the absolute offset of the entire table
val Actual content

Notes!
There may be errors in the val displayed on floating point data.

  • Example
  1. Given a physically stored RANGE, view the contents of this RANGE.
    =# SELECT * FROM matrixts_internal.show_range('tb1', 1, 304, 16176, 808) LIMIT 20;
    rowno | val 
    -------+--------
      1 | 4
      2 | 36
      3 | 81
      4 | 58
      5 | 17
      6 | 75
      7 | 11
      8 | 84
      9 | 60
     10 | 78
     11 | 69
     12 | 0
     13 | 87
     14 | 40
     15 | 72
     16 | 58
     17 | 17
     18 | 48
     19 | 70
     20 | 6
    (20 rows)

    1.2.3 dump_range

    The dump_range function decompresses the physical data selected in MARS2 into a binary file for secondary analysis.

  • grammar
    =# SELECT <* / column1,column2,...> FROM matrixts_internal.dump_ranges(
    tablename text, 
    attno int4, 
    outfile text, 
    forkno int4, 
    offno int4,
    limits int4
    );
  • Parameters
Field Description Required
tablename table name, partition table name when partitioning table Yes
attno Column number Yes
outfile Exported file name. After the export is completed, different segments will produce independent suffixes identified with different numbers. For example, the data file is named tb1-f2.bin, and after export, it appears as tb1-f2.bin-seg1 on Segment1 Yes
forkno Physical file shard number No
offno Offset of data in physical file, unit bytes No
limits Default is 100, and the number of RANGEs limited by the specified positions of forkno and offno is the starting position No

Note! Here limits limits only the number of exported RANGEs on a single segment.

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

  • return

    • nbytes: The amount of data exported on each Segment node, unit bytes.
  • Example

  1. Export a piece of physical data into a binary file
    =# SELECT * FROM matrixts_internal.dump_ranges('tb1', 1, '/data/demo/tb1-f2.bin', 304, 16176, 1) LIMIT 20;
    nbytes 
    --------
       0
       0
    1480
    1632
    1704
    1592
    (6 rows)

    In the example, after executing dump_ranges, it was found that 6 results were returned. This is because the export is performed independently by the Segment. Each of them represents the export results on a downstream Segment, which will appear in the corresponding directory. Among them, nbytes of 2 results is 0 because there is no data on the Segment that meets the filtering conditions.

After execution, each Segment host will generate a binary file with the original suffix and the independent suffix identifier .-seg<no>. ![](Binary file released by https://img.ymatrix.cn/ymatrix_home/datainspectddump_1676627032.png)

1.2.4 desc_compress

desc_compress compares two user-specified compression algorithms for a continuous segment of physical data selected in the MARS2 column store.

  • grammar
    =# SELECT <* / column1,column2,...> FROM matrixts_internal.desc_compress(
    tablename text, 
    attno int4, 
    <method1> text, 
    <method2> int4,
    forkno int4, 
    offno int4,
    limits int4
    );
  • Parameters
Field Description Required
tablename table name, partition table name when partitioning table Yes
attno Column number Yes
Compression algorithm name 1 such as zstd, lz4, deltadelta, etc. Yes
Compression algorithm name 2 such as zstd, lz4, deltadelta, etc. Yes
forkno Physical file shard number No
offno Offset of data in physical file, unit bytes No
limits Default is 100, and the number of RANGEs limited by the specified positions of forkno and offno is the starting position No
  • return
Fields Description
segno Segment number, starting with 0
attno Attribute Number, column number, starting with 0
forkno Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork
offno RANGE Offset in physical file, unit bytes. The default starts from 0. In MARS2, data is stored in batches and a storage RANGE is formed in units of set compress_threshold (compression threshold parameter), forming its own position identification with offsets in the physical file.
nbytes RANGE The actual space occupied in physical files, unit bytes
nrows compress_threshold set when creating tables, default is 1200. compress_threshold is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
nrowsnotnull RANGE Non-null number of bars
mmin If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL
mmax If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL.
decompresstime0 Decompress the current column's original compression algorithm used, unit clock cycle (CPU Cycle)
compressedsize1 Use <method1> to compress the data size, unit bytes
compressedtime1 Use <method1> to compress the time of this data, in CPU Cycle
selectiontime1 If <method1> is the system-matched Auto, then the meaning of this parameter is the time taken to analyze data characteristics (CPU Cycle)
decompressedtime1 Use <method1> to decompress this data in CPU Cycle
iscompressible1 Use <method1> to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression
dataloss1 Is there any accuracy loss after using <method1>
compressedsize2 Use <method2> to compress the data in bytes
compressedtime2 Use <method2> to compress this data in unit CPU Cycle
selectiontime2 If <method2> is adaptive encoding, then this parameter means the time taken to analyze data characteristics (CPU Cycle)
decompressedtime2 Use <method2> to decompress this data in unit CPU Cycle
iscompressible2 Use <method2> to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression
dataloss2 Whether there will be accuracy loss after using <method2>
  • Example
  1. It is known that the sample table in this article uses the lz4 compression algorithm. Now we use desc_compress to analyze what compression algorithm to use for f1 column data to obtain a higher compression rate.
    =# SELECT compressedsize1,compressedsize2,dataloss1,dataloss2,compressedtime1,compressedtime2,iscompressible1,iscompressible2 
            FROM matrixts_internal.desc_compress(
                'tb1', 0, 'lz4', 'deltadelta') limit 10;
    compressedsize1 | compressedsize2 | dataloss1 | dataloss2 | compressedtime1 | compressedtime2 | iscompressible1 | iscompressibl
    e2
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    ---
             4835 |             975 | NO        | NO        |          275976 |          171902 | YES             | YES
             4848 |            1041 | NO        | NO        |          241742 |          111632 | YES             | YES
             4843 |             995 | NO        | NO        |          161580 |          123520 | YES             | YES
             4843 |             976 | NO        | NO        |          180704 |          118966 | YES             | YES
             4846 |            1009 | NO        | NO        |          157268 |          123994 | YES             | YES
             4844 |            1025 | NO        | NO        |           93118 |           70050 | YES             | YES
             4846 |            1018 | NO        | NO        |           91896 |           64120 | YES             | YES
             4843 |             997 | NO        | NO        |           89732 |           64062 | YES             | YES
             4845 |            1013 | NO        | NO        |           95010 |           71106 | YES             | YES
             4561 |             975 | NO        | NO        |           84664 |           82220 | YES             | YES
    (10 rows)

    What is displayed in this table is actually the comparison data for each RANGE in the selected data segment. It can be seen that the compression rate of the deltadelta algorithm in column f1 is much higher than lz4, and the compression time is relatively short. Therefore, compared with lz4, it is obviously a better choice to use the deltadelta algorithm in column f1.