datainspect is a storage diagnostic tool built into MARS2, which provides underlying data insights to accurately optimize the storage and query performance of your data.
Storage optimization requires paying attention to the actual content of the data in the physical file. Similar to PostgreSQL's Pageinspect, you can use datainspect to easily extract data segments in MARS2 physical storage for further analysis.
In addition, datainspect integrates relevant index and meta information in MARS2 to provide basic information such as NULL distribution and minmax, thereby aiding in the optimization of the I/O scanning process.
Notes!
This tool is only available for MARS2 tables.
datainspect is a system function built in MARS2 and can be used after MARS2 is installed correctly. MARS2 tables depend on matrixts
timing extension. Before building a table, you first need to create an extension in the database using the storage engine.
Notes!
matrixts
is extended to the database level, and it can be created once in a database without repeated creation.=# CREATE EXTENSION matrixts;
1.2 Introduction to function functions
First, create a test table named
tb1
in the example. We will use this test table as a use case to introduce four related functions.=# CREATE TABLE tb1( f1 int8 encoding(minmax), f2 int8 encoding(minmax), f3 float8 encoding(minmax), f4 text ) USING MARS2 WITH (compresstype=lz4);
Create a MARS2 index.
=# CREATE INDEX ON tb1 USING mars2_btree(f1);
Construct 24,000 test data.
=# INSERT INTO tb1 SELECT generate_series(1, 24000), mod((random()*1000000*(generate_series(1, 1200)))::int8, (random()::int8/100 + 100)), (random() * generate_series(1, 24000))::float8, (random() * generate_series(1, 24000))::text;
1.2.1 desc_ranges
The desc_ranges function integrates the underlying meta information and index of MARS2 to provide such as minmax index information and null value information, and also supports accurate monitoring of the underlying storage space occupancy of data.
SELECT <* / column1,column2,...> FROM matrixts_internal.desc_ranges(<'tablename'>)
Field | Description | Required |
---|---|---|
tablename | table name, partition table name when partitioning table | Yes |
Fields | Description |
---|---|
segno | Segment number, starting with 0 |
attno | Attribute Number, column number, starting with 0 |
forkno | Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork |
offno | RANGE Offset in physical file, unit bytes. The default starts from 0 . In MARS2, data is stored in batches and a storage RANGE is formed in units of set compress_threshold (compression threshold parameter), forming its own position identification with offsets in the physical file. |
nbytes | RANGE The actual space occupied in physical files, unit bytes |
nrows | compress_threshold set when creating tables, default is 1200 . compress_threshold is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit. |
nrowsnotnull | RANGE Non-null number of bars |
mmin | If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL |
mmax | If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL. |
=# SELECT attno, sum(nbytes)/1024 as "Size in KB"
FROM matrixts_internal.desc_ranges('tb1') GROUP BY attno ORDER BY attno;
attno | Size in KB
-------+--------------------------------------------------------------------------------------------------------------------------
0 | 94.9062500000000000
1 | 7.8203125000000000
2 | 187.3437500000000000
3 | 386.3515625000000000
(4 rows)
=# SELECT * FROM matrixts_internal.desc_ranges('tb1') WHERE segno = 1;
segno | attno | forkname | forkno | offno | nbytes | nrows | nrowsnotnull | mmin | mmax
-------+---------+-------+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 0 | data1 | 304 | 0 | 4848 | 1200 | 1200 | 15 | 7240
1 | 1 | data1 | 304 | 16376 | 856 | 1200 | 199 | 0 | 99
1 | 2 | data1 | 304 | 17712 | 9072 | 1200 | 1200 | 1.4602231817218758 | 704.8010557110921
1 | 3 | data1 | 304 | 50024 | 20272 | 1200 | 1200 | NULL | NULL
1 | 0 | data1 | 304 | 4848 | 4856 | 1200 | 1200 | 7243 | 14103
1 | 1 | data1 | 304 | 17232 | 160 | 1200 | 0 | NULL | NULL
1 | 2 | data1 | 304 | 26784 | 9760 | 1200 | 1200 | 705.0931003474365 | 1372.9018354549075
1 | 3 | data1 | 304 | 70296 | 19680 | 1200 | 1200 | NULL | NULL
1 | 0 | data1 | 304 | 9704 | 4856 | 1200 | 1200 | 14125 | 21417
1 | 1 | data1 | 304 | 17392 | 160 | 1200 | 0 | NULL | NULL
1 | 2 | data1 | 304 | 36544 | 9760 | 1200 | 1200 | 1375.043496121433 | 2084.906658862494
1 | 3 | data1 | 304 | 89976 | 19792 | 1200 | 1200 | NULL | NULL
1 | 0 | data1 | 304 | 14560 | 1816 | 445 | 445 | 21429 | 23997
1 | 1 | data1 | 304 | 17552 | 160 | 445 | 0 | NULL | NULL
1 | 2 | data1 | 304 | 46304 | 3720 | 445 | 445 | 2086.0748374078717 | 2336.065046118657
1 | 3 | data1 | 304 | 109768 | 7576 | 445 | 445 | NULL | NULL
(16 rows)
The show_range
function selects a piece of physical data in MARS2 and displays this piece of data into readable data. The currently supported data types are: int2, int4, int8, float4, float8, timestamp, date, text.
SELECT <* / column1,column2,...> FROM matrixts_internal.show_range(
tablename text,
attno int4,
forkno int4,
offno int4,
nbytes int4
)
Field | Description | Required |
---|---|---|
tablename | table name, partition table name when partitioning table | Yes |
attno | column number, when defining the table, the columns numbered from 0 in order |
Yes |
forkno | Physical file shard number | Yes |
offno | Offset of data in physical file, unit bytes | Yes |
nbytes | The actual space occupied by data in physical files, unit bytes | Yes |
Notes!
For details of the above parameters, please refer to the return introduction of thedesc_ranges
function above.
Fields | Description |
---|---|
rowno | row number. This number is determined by the relative offset in RANGE, rather than the absolute offset of the entire table |
val | Actual content |
Notes!
There may be errors in theval
displayed on floating point data.
=# SELECT * FROM matrixts_internal.show_range('tb1', 1, 304, 16176, 808) LIMIT 20;
rowno | val
-------+--------
1 | 4
2 | 36
3 | 81
4 | 58
5 | 17
6 | 75
7 | 11
8 | 84
9 | 60
10 | 78
11 | 69
12 | 0
13 | 87
14 | 40
15 | 72
16 | 58
17 | 17
18 | 48
19 | 70
20 | 6
(20 rows)
The dump_range
function decompresses the physical data selected in MARS2 into a binary file for secondary analysis.
=# SELECT <* / column1,column2,...> FROM matrixts_internal.dump_ranges(
tablename text,
attno int4,
outfile text,
forkno int4,
offno int4,
limits int4
);
Field | Description | Required |
---|---|---|
tablename | table name, partition table name when partitioning table | Yes |
attno | Column number | Yes |
outfile | Exported file name. After the export is completed, different segments will produce independent suffixes identified with different numbers. For example, the data file is named tb1-f2.bin , and after export, it appears as tb1-f2.bin-seg1 on Segment1 |
Yes |
forkno | Physical file shard number | No |
offno | Offset of data in physical file, unit bytes | No |
limits | Default is 100, and the number of RANGEs limited by the specified positions of forkno and offno is the starting position |
No |
Note! Here
limits
limits only the number of exported RANGEs on a single segment.
Notes!
For details of the above parameters, please refer to the return introduction of thedesc_ranges
function above.
return
Example
=# SELECT * FROM matrixts_internal.dump_ranges('tb1', 1, '/data/demo/tb1-f2.bin', 304, 16176, 1) LIMIT 20;
nbytes
--------
0
0
1480
1632
1704
1592
(6 rows)
In the example, after executing dump_ranges
, it was found that 6 results were returned. This is because the export is performed independently by the Segment. Each of them represents the export results on a downstream Segment, which will appear in the corresponding directory. Among them, nbytes
of 2 results is 0
because there is no data on the Segment that meets the filtering conditions.
After execution, each Segment host will generate a binary file with the original suffix and the independent suffix identifier .-seg<no>
.

desc_compress compares two user-specified compression algorithms for a continuous segment of physical data selected in the MARS2 column store.
=# SELECT <* / column1,column2,...> FROM matrixts_internal.desc_compress(
tablename text,
attno int4,
<method1> text,
<method2> int4,
forkno int4,
offno int4,
limits int4
);
Field | Description | Required |
---|---|---|
tablename | table name, partition table name when partitioning table | Yes |
attno | Column number | Yes |
Compression algorithm name 1 | such as zstd, lz4, deltadelta, etc. | Yes |
Compression algorithm name 2 | such as zstd, lz4, deltadelta, etc. | Yes |
forkno | Physical file shard number | No |
offno | Offset of data in physical file, unit bytes | No |
limits | Default is 100, and the number of RANGEs limited by the specified positions of forkno and offno is the starting position |
No |
Fields | Description |
---|---|
segno | Segment number, starting with 0 |
attno | Attribute Number, column number, starting with 0 |
forkno | Number of physical file shards. It is the underlying number of the database, which can be considered to correspond to a specific file. By default, it starts from the first fork |
offno | RANGE Offset in physical file, unit bytes. The default starts from 0 . In MARS2, data is stored in batches and a storage RANGE is formed in units of set compress_threshold (compression threshold parameter), forming its own position identification with offsets in the physical file. |
nbytes | RANGE The actual space occupied in physical files, unit bytes |
nrows | compress_threshold set when creating tables, default is 1200 . compress_threshold is the compression threshold. Used to control how many tuples (tuples) are compressed in a single table, and is the upper limit of the number of Tuples compressed in the same unit. |
nrowsnotnull | RANGE Non-null number of bars |
mmin | If this column supports minmax index, it displays the minimum value in the RANGE, and if it does not exist, it is NULL |
mmax | If this column supports minmax index, it is the maximum value in the RANGE. If it does not exist, it is NULL. |
decompresstime0 | Decompress the current column's original compression algorithm used, unit clock cycle (CPU Cycle) |
compressedsize1 | Use <method1> to compress the data size, unit bytes |
compressedtime1 | Use <method1> to compress the time of this data, in CPU Cycle |
selectiontime1 | If <method1> is the system-matched Auto, then the meaning of this parameter is the time taken to analyze data characteristics (CPU Cycle) |
decompressedtime1 | Use <method1> to decompress this data in CPU Cycle |
iscompressible1 | Use <method1> to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression |
dataloss1 | Is there any accuracy loss after using <method1> |
compressedsize2 | Use <method2> to compress the data in bytes |
compressedtime2 | Use <method2> to compress this data in unit CPU Cycle |
selectiontime2 | If <method2> is adaptive encoding, then the meaning of this parameter is the time taken to analyze data characteristics (CPU Cycle) |
decompressedtime2 | Use <method2> to decompress this data in unit CPU Cycle |
iscompressible2 | Use <method2> to compress. When the compression algorithm compresses the result close to the original data, or even larger, it is recommended to give up compression |
dataloss2 | Whether there will be accuracy loss after using <method2> |
desc_compress
to analyze what compression algorithm to use for f1 column data to obtain a higher compression rate.=# SELECT compressedsize1,compressedsize2,dataloss1,dataloss2,compressedtime1,compressedtime2,iscompressible1,iscompressible2
FROM matrixts_internal.desc_compress(
'tb1', 0, 'lz4', 'deltadelta') limit 10;
compressedsize1 | compressedsize2 | dataloss1 | dataloss2 | compressedtime1 | compressedtime2 | iscompressible1 | iscompressibl
e2
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---
4835 | 975 | NO | NO | 275976 | 171902 | YES | YES
4848 | 1041 | NO | NO | 241742 | 111632 | YES | YES
4843 | 995 | NO | NO | 161580 | 123520 | YES | YES
4843 | 976 | NO | NO | 180704 | 118966 | YES | YES
4846 | 1009 | NO | NO | 157268 | 123994 | YES | YES
4844 | 1025 | NO | NO | 93118 | 70050 | YES | YES
4846 | 1018 | NO | NO | 91896 | 64120 | YES | YES
4843 | 997 | NO | NO | 89732 | 64062 | YES | YES
4845 | 1013 | NO | NO | 95010 | 71106 | YES | YES
4561 | 975 | NO | NO | 84664 | 82220 | YES | YES
(10 rows)
What is displayed in this table is actually the comparison data for each RANGE in the selected data segment. It can be seen that the compression rate of the deltadelta
algorithm in column f1 is much higher than lz4
, and the compression time is relatively short. Therefore, compared with lz4
, it is obviously a better choice to use the deltadelta
algorithm in column f1.