Storage diagnostic tool datainspect

datainspect is a storage diagnostic tool built into MARS2, which provides underlying data insights to accurately optimize the storage and query performance of your data.

Storage optimization requires paying attention to the actual content of the data in the physical file. Similar to PostgreSQL's Pageinspect, you can use datainspect to easily extract data segments in MARS2 physical storage for further analysis.
In addition, datainspect integrates relevant index and meta information in MARS2 to provide basic information such as NULL distribution and minmax, thereby aiding in the optimization of the I/O scanning process.

Notes!
This tool is only available for MARS2 tables.

1.1 Installation

datainspect is a system function built in MARS2 and can be used after MARS2 is installed correctly. MARS2 tables depend on matrixts timing extension. Before building a table, you first need to create an extension in the database using the storage engine.

Notes!
matrixts is extended to the database level, and it can be created once in a database without repeated creation.
=# CREATE EXTENSION matrixts;
1.2 Introduction to function functions

First, create a test table named tb1 in the example. We will use this test table as a use case to introduce four related functions.
=# CREATE TABLE tb1(
f1 int8 encoding(minmax), 
f2 int8 encoding(minmax), 
f3 float8 encoding(minmax), 
f4 text
) USING MARS2
WITH (compresstype=lz4);
Create a MARS2 index.
=# CREATE INDEX ON tb1 USING mars2_btree(f1);
Construct 24,000 test data.
=# INSERT INTO tb1 SELECT 
generate_series(1, 24000), 
mod((random()*1000000*(generate_series(1, 1200)))::int8, (random()::int8/100 + 100)), 
(random() * generate_series(1, 24000))::float8, 
(random() * generate_series(1, 24000))::text;
1.2.1 desc_ranges

The desc_ranges function integrates the underlying meta information and index of MARS2 to provide such as minmax index information and null value information, and also supports accurate monitoring of the underlying storage space occupancy of data.

grammar

SELECT <* / column1,column2,...> FROM matrixts_internal.desc_ranges(<'tablename'>)

Parameters

Field	Description	Required
tablename	table name, partition table name when partitioning table	Yes

return

Fields	Description
segno	Segment number, starting with `0`
attno	Attribute Number, column number, starting with `0`
forkno	Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork
offno	RANGE Offset in physical file, unit bytes. The default starts from `0`. In MARS2, data is stored in batches and a storage RANGE is formed in units of set `compress_threshold` (compression threshold parameter), forming its own position identification with offsets in the physical file.
nbytes	RANGE The actual space occupied in physical files, unit bytes
nrows	`compress_threshold` set when creating tables, default is `1200`. `compress_threshold` is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
nrowsnotnull	RANGE Non-null number of bars
mmin	If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL
mmax	If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL.

Example

Given a table, view the usage of each column storage space in the entire cluster.

=# SELECT attno, sum(nbytes)/1024 as "Size in KB" 
 FROM matrixts_internal.desc_ranges('tb1') GROUP BY attno ORDER BY attno;
attno |      Size in KB      
-------+--------------------------------------------------------------------------------------------------------------------------
  0 |  94.9062500000000000
  1 |   7.8203125000000000
  2 | 187.3437500000000000
  3 | 386.3515625000000000
(4 rows)

Given a table, look at the null value and data distribution rules on a certain segment.

=# SELECT * FROM matrixts_internal.desc_ranges('tb1') WHERE segno = 1;
segno | attno | forkname | forkno | offno  | nbytes | nrows | nrowsnotnull |        mmin        |        mmax        
-------+---------+-------+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  1 |     0 | data1    |    304 |      0 |   4848 |  1200 |         1200 | 15                 | 7240
  1 |     1 | data1    |    304 |  16376 |    856 |  1200 |          199 | 0                  | 99
  1 |     2 | data1    |    304 |  17712 |   9072 |  1200 |         1200 | 1.4602231817218758 | 704.8010557110921
  1 |     3 | data1    |    304 |  50024 |  20272 |  1200 |         1200 | NULL               | NULL
  1 |     0 | data1    |    304 |   4848 |   4856 |  1200 |         1200 | 7243               | 14103
  1 |     1 | data1    |    304 |  17232 |    160 |  1200 |            0 | NULL               | NULL
  1 |     2 | data1    |    304 |  26784 |   9760 |  1200 |         1200 | 705.0931003474365  | 1372.9018354549075
  1 |     3 | data1    |    304 |  70296 |  19680 |  1200 |         1200 | NULL               | NULL
  1 |     0 | data1    |    304 |   9704 |   4856 |  1200 |         1200 | 14125              | 21417
  1 |     1 | data1    |    304 |  17392 |    160 |  1200 |            0 | NULL               | NULL
  1 |     2 | data1    |    304 |  36544 |   9760 |  1200 |         1200 | 1375.043496121433  | 2084.906658862494
  1 |     3 | data1    |    304 |  89976 |  19792 |  1200 |         1200 | NULL               | NULL
  1 |     0 | data1    |    304 |  14560 |   1816 |   445 |          445 | 21429              | 23997
  1 |     1 | data1    |    304 |  17552 |    160 |   445 |            0 | NULL               | NULL
  1 |     2 | data1    |    304 |  46304 |   3720 |   445 |          445 | 2086.0748374078717 | 2336.065046118657
  1 |     3 | data1    |    304 | 109768 |   7576 |   445 |          445 | NULL               | NULL
(16 rows)

1.2.2 show_range

The show_range function selects a piece of physical data in MARS2 and displays this piece of data into readable data. The currently supported data types are: int2, int4, int8, float4, float8, timestamp, date, text.

grammar

SELECT <* / column1,column2,...> FROM matrixts_internal.show_range(
 tablename text, 
 attno int4, 
 forkno int4, 
 offno int4, 
 nbytes int4
)

Parameters

Field	Description	Required
tablename	table name, partition table name when partitioning table	Yes
attno	column number, when defining the table, the columns numbered from `0` in order	Yes
forkno	Physical file shard number	Yes
offno	Offset of data in physical file, unit bytes	Yes
nbytes	The actual space occupied by data in physical files, unit bytes	Yes

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

return

Fields	Description
rowno	row number. This number is determined by the relative offset in RANGE, rather than the absolute offset of the entire table
val	Actual content

Notes!
There may be errors in the val displayed on floating point data.

Example

Given a physically stored RANGE, view the contents of this RANGE.

=# SELECT * FROM matrixts_internal.show_range('tb1', 1, 304, 16176, 808) LIMIT 20;
rowno | val 
-------+--------
  1 | 4
  2 | 36
  3 | 81
  4 | 58
  5 | 17
  6 | 75
  7 | 11
  8 | 84
  9 | 60
 10 | 78
 11 | 69
 12 | 0
 13 | 87
 14 | 40
 15 | 72
 16 | 58
 17 | 17
 18 | 48
 19 | 70
 20 | 6
(20 rows)

1.2.3 dump_range

The dump_range function decompresses the physical data selected in MARS2 into a binary file for secondary analysis.

grammar

=# SELECT <* / column1,column2,...> FROM matrixts_internal.dump_ranges(
tablename text, 
attno int4, 
outfile text, 
forkno int4, 
offno int4,
limits int4
);

Parameters

Field	Description	Required
tablename	table name, partition table name when partitioning table	Yes
attno	Column number	Yes
outfile	Exported file name. After the export is completed, different segments will produce independent suffixes identified with different numbers. For example, the data file is named `tb1-f2.bin`, and after export, it appears as `tb1-f2.bin-seg1` on Segment1	Yes
forkno	Physical file shard number	No
offno	Offset of data in physical file, unit bytes	No
limits	Default is 100, and the number of RANGEs limited by the specified positions of `forkno` and `offno` is the starting position	No

Note! Here limits limits only the number of exported RANGEs on a single segment.

Notes!
For details of the above parameters, please refer to the return introduction of the desc_ranges function above.

return
- nbytes: The amount of data exported on each Segment node, unit bytes.
Example

Export a piece of physical data into a binary file
```
=# SELECT * FROM matrixts_internal.dump_ranges('tb1', 1, '/data/demo/tb1-f2.bin', 304, 16176, 1) LIMIT 20;
nbytes 
--------
   0
   0
1480
1632
1704
1592
(6 rows)
```
In the example, after executing dump_ranges, it was found that 6 results were returned. This is because the export is performed independently by the Segment. Each of them represents the export results on a downstream Segment, which will appear in the corresponding directory. Among them, nbytes of 2 results is 0 because there is no data on the Segment that meets the filtering conditions.

After execution, each Segment host will generate a binary file with the original suffix and the independent suffix identifier .-seg<no>.

1.2.4 desc_compress

desc_compress compares two user-specified compression algorithms for a continuous segment of physical data selected in the MARS2 column store.

grammar

=# SELECT <* / column1,column2,...> FROM matrixts_internal.desc_compress(
tablename text, 
attno int4, 
<method1> text, 
<method2> int4,
forkno int4, 
offno int4,
limits int4
);

Parameters

Field	Description	Required
tablename	table name, partition table name when partitioning table	Yes
attno	Column number	Yes
Compression algorithm name 1	such as zstd, lz4, deltadelta, etc.	Yes
Compression algorithm name 2	such as zstd, lz4, deltadelta, etc.	Yes
forkno	Physical file shard number	No
offno	Offset of data in physical file, unit bytes	No
limits	Default is 100, and the number of RANGEs limited by the specified positions of `forkno` and `offno` is the starting position	No

return

Fields	Description
segno	Segment number, starting with `0`
attno	Attribute Number, column number, starting with `0`
forkno	Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork
offno	RANGE Offset in physical file, unit bytes. The default starts from `0`. In MARS2, data is stored in batches and a storage RANGE is formed in units of set `compress_threshold` (compression threshold parameter), forming its own position identification with offsets in the physical file.
nbytes	RANGE The actual space occupied in physical files, unit bytes
nrows	`compress_threshold` set when creating tables, default is `1200`. `compress_threshold` is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit.
nrowsnotnull	RANGE Non-null number of bars
mmin	If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL
mmax	If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL.
decompresstime0	Decompress the current column's original compression algorithm used, unit clock cycle (CPU Cycle)
compressedsize1	Use `<method1>` to compress the data size, unit bytes
compressedtime1	Use `<method1>` to compress the time of this data, in CPU Cycle
selectiontime1	If `<method1>` is the system-matched Auto, then the meaning of this parameter is the time taken to analyze data characteristics (CPU Cycle)
decompressedtime1	Use `<method1>` to decompress this data in CPU Cycle
iscompressible1	Use `<method1>` to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression
dataloss1	Is there any accuracy loss after using `<method1>`
compressedsize2	Use `<method2>` to compress the data in bytes
compressedtime2	Use `<method2>` to compress this data in unit CPU Cycle
selectiontime2	If `<method2>` is adaptive encoding, then the meaning of this parameter is the time taken to analyze data characteristics (CPU Cycle)
decompressedtime2	Use `<method2>` to decompress this data in unit CPU Cycle
iscompressible2	Use `<method2>` to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression
dataloss2	Whether there will be accuracy loss after using `<method2>`

Example

It is known that the sample table in this article uses the lz4 compression algorithm. Now we use desc_compress to analyze what compression algorithm to use for f1 column data to obtain a higher compression rate.

=# SELECT compressedsize1,compressedsize2,dataloss1,dataloss2,compressedtime1,compressedtime2,iscompressible1,iscompressible2 
        FROM matrixts_internal.desc_compress(
            'tb1', 0, 'lz4', 'deltadelta') limit 10;
compressedsize1 | compressedsize2 | dataloss1 | dataloss2 | compressedtime1 | compressedtime2 | iscompressible1 | iscompressibl
e2
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---
         4835 |             975 | NO        | NO        |          275976 |          171902 | YES             | YES
         4848 |            1041 | NO        | NO        |          241742 |          111632 | YES             | YES
         4843 |             995 | NO        | NO        |          161580 |          123520 | YES             | YES
         4843 |             976 | NO        | NO        |          180704 |          118966 | YES             | YES
         4846 |            1009 | NO        | NO        |          157268 |          123994 | YES             | YES
         4844 |            1025 | NO        | NO        |           93118 |           70050 | YES             | YES
         4846 |            1018 | NO        | NO        |           91896 |           64120 | YES             | YES
         4843 |             997 | NO        | NO        |           89732 |           64062 | YES             | YES
         4845 |            1013 | NO        | NO        |           95010 |           71106 | YES             | YES
         4561 |             975 | NO        | NO        |           84664 |           82220 | YES             | YES
(10 rows)

What is displayed in this table is actually the comparison data for each RANGE in the selected data segment. It can be seen that the compression rate of the deltadelta algorithm in column f1 is much higher than lz4, and the compression time is relatively short. Therefore, compared with lz4, it is obviously a better choice to use the deltadelta algorithm in column f1.

简体中文

Storage diagnostic tool datainspect

1.1 Installation

1.2 Introduction to function functions

1.2.1 desc_ranges

1.2.2 show_range

1.2.3 dump_range

1.2.4 desc_compress