MatrixBench Basic Features

This document describes some of the basic features of MatrixBench individually and in detail, including the following:

Multiple data types and features
Combined query statement

Notes!
Currently, mxbench is open source, and your review and contribution are welcome. Please click here read README.

1 Multiple data types and features

1.1 Support

Notes!
Due to the particularity of the timing scenario, mxbench stipulates that the first column of the data table must be a timestamp column, the second column must be a device identification column, and the indicator column that can be written to user data starting from the third column. Currently, multi-data types and feature functions only support metric columns, that is, columns starting from the third column.

Currently mxbench is able to support the acquisition of many different types of indicators in the signal table: in addition to the four types of int4, int8, float4, and float8, text and varchar.
int4, int8, float4, and float8 are also possible to specify data characteristics such as value range.
The varchar type also provides special types plate_template and vin_template to better adapt to the needs of the Internet of Vehicles scenarios. The plate_template type is used to generate the varchar column of the license plate number type characteristics, such as: Guangdong BDM7709. vin_template is used to generate the varchar column for frame number type characteristics, such as: 1G1JC124627237595.

1.2 Limitations

Column 1 must be a timestamp column, the column name is unlimited, and the type must be timestamp or timestamptz.
Column 2 must be a device number/device list, the column name is unlimited, the type must be one of text, varchar, int8.

1.3 Usage

There are two configuration methods for implementing a large number of indicators, multiple data types, multiple data value ranges, etc. in mxbench.

1.3.1 Implemented through DDL files

In the DDL file, after creating the table, express the purpose of the column by adding a comment: COMMENT ON COLUMN.

CREATE EXTENSION IF NOT EXISTS matrixts;
ALTER EXTENSION matrixts UPDATE;
CREATE SCHEMA IF NOT EXISTS "public";
CREATE TABLE "public"."table1" (
    tsss timestamp
,   vinnn bigint
,   lltt varchar(32)
,   c_arb float8
,   exttt json
)
USING mars2 WITH ( compress_threshold='1000', chunk_size='32' )
DISTRIBUTED BY (vinnn)
PARTITION BY RANGE(tsss) (
        START ('2022-04-18 09:00:00')
        END ('2022-04-19 09:00:06')
        EVERY ('10800 second'),
        DEFAULT PARTITION default_prt
);

CREATE INDEX IF NOT EXISTS "idx_table1" ON "public"."table1"
USING mars2_btree(
        vinnn
      , tsss);

COMMENT ON COLUMN table1.lltt is '{"name": "plate_template"}';
COMMENT ON COLUMN table1.c_arb is '{"min": 3.5, "max": 5}';
COMMENT ON COLUMN table1.exttt is '{"is-ext": true, "columns-descriptions": [{"type": "float8", "count": 1, "comment": {"min": 3, "max": 4}},{"type": "float4", "count": 3, "comment": {"min": 2, "max": 3}}]}';

The example configuration is as above. COMMENT is all JSON format strings, description:

There are content restrictions for the first two columns, please see the "1.2 Restrictions" section above.
Except for the first 2 columns, everything else is considered an indicator column.
The lltt column is of type varchar. In the example, we use COMMENT to mark: name=license_template, and data of license plate number type will be generated for lltt.
The c_arb column is of type float8. According to COMMENT, random data in the range 3.5 to 5 will be generated.
The table1.exttt column is of type JSON, and its COMMENT is: is-ext=true, meaning it is marked as an extended metric column, which may contain multiple simple metrics. There are two ways to view simple metrics:
- If columns-descriptions is not specified, the total-metrics-count and metrics-type parameters in the Global category are read. The configuration of total-metrics-count must be greater than or equal to Simple metric columns + 3, that is, the extended column contains at least 3 indicators, otherwise an error message will be generated.
- If columns-descriptions is specified and is a legal json array string, the total-metrics-count and metrics-type parameters in the Global category are invalid. You need to understand the number of indicators through the columns-descriptions parameter. Examples are as follows:
```
[
{"type": "float8", "count": 1, "comment": {"min":  3, "max": 4}},  
{"type": "float4", "count": 3, "comment": {"min":  2, "max": 3}}, 
]
```
  Therefore, there are 1 metric of type float8 and 3 metrics of type float4 in this extension column. At the same time, there are also limits on the value range.

1.3.2 Implemented through configuration files

In this section, you can reuse the columns-description syntax of the above extension columns to make metrics-descriptions under the Global module accept a string-type parameter. Examples are as follows:

[global]
  table-name = "table1"
  metrics-descriptions =  """[
    {"type": "float8", "count": 1000, "comment": {"min": 3, "max": 4}},
    {"type": "float4", "count": 3, "comment": {"min": 2, "max": 3}}
    ]"""

The description of this parameter means that in addition to the fixed column (first column and second column) that are not included in the indicator calculation, the table table1 also contains 1000 metrics of type float8, 3 metrics of type float4, and 1003 total indicators, with a total of 1005 columns, which exceeds the upper limit of the number of columns in the data table (1000 columns).
Therefore, according to YMatrix logic, there will be 6 metrics arranged in the extended column of type JSON. In this way, there are only 997 simple metrics and 2 fixed columns, and 1 extended column (column containing multiple metrics), which is exactly 1,000 columns in total, and it is back to the support scope of the data table.

Notes!
You can also use the --total-metrics-count and ---metrics-type parameters in the Global category without specifying the --metrics-descriptions parameter as alternatives.

2 Combination query statements

Directly inputting customized query statements cannot set certain values in the SELECT and WHERE statements based on the characteristics of the data generated this time, so the effect of query tests is limited. This problem can be solved by using combined query statements.

2.1 Configuration Overview

    {
    // The name of the query statement
    "name": "QUERY_NAME",
    // Statements after SELECT
    "projections": {"use-raw-expression": true, "expression": "*"},
    // Expression after SELECT FROM
    "from": {"use-relation-identifier": true, "relation-identifier": "sub-relation-identifier"},
    // Expressions in device identification column-related WHERE statements
    "device-predicate": {"count": 2, "is-random": ture},
    // Timestamp column-related expressions in WHERE statements
    "ts-predicate": {"start": "2022-05-05 01:04:10", "end": "2022-05-05 01:04:10"},
    // Expressions in user metrics related WHERE statements
    "metrics-predicate": {"use-raw-expression": true, "expression": "m1>=37.5"},
    // GROUP BY statement related expressions
    "group-by": {"use-raw-expression": true, "expression": "device_column_name,ts"},
    // ORDER BY statement related expressions
     "order-by": {"user-raw-expression": true, "": "s desc"},
     // LIMIT statement related expressions
     "limit": 3
    },

2.2 Detailed explanation

"name" The name of this query statement. Will be printed in the summary report of the query statement.
"projections" Default is not allowed. Refers to the expression after SELECT, which can be a query for all fields "*", or a projection query for some fields.
Accepts a configuration of JSON type.
Use raw-expression:
```
{"use-raw-expression": true, "expression": "*"}
```

"from" FROM expression after.
The default is to use the table-name configured in the Global category directly, which is equivalent to:

{
"use-relation-identifier": true, 
"relation-identifier": "table-name-in-global-config",
}

Use the name of the data table directly:

{
"use-relation-identifier": true, 
"relation-identifier": "device_signal_mars2",
}

Use another data table to illustrate:

{ 
"relation-statement":
 {
 "projections": {"use-raw-expression": true, "expression": "device_column_name,max(m1) as mp, min(m2), count(m1), avg(m3)"},
 "device-predicate": {"count": 17, "is-random": true},
 "ts-predicate": {"start": "2022-05-05 00:00:00", "end": "2022-05-06 00:00:00"},
 "group-by": {"use-raw-expression": true, "expression": "device_column_name"}
 }
}

"device-predicate" An expression in the WHERE statement about the device number, accepting a configuration of type JSON.
The default is that there are no expressions in the WHERE statement related to the device number, that is, the full device.
Randomly select n (n is a positive integer, within the range of devices): For example, n = 2, and the following configuration is generated: WHERE <device_column_name> IN (<random_device_id1>, <random_device_id2>).
```
{
"count": 2, 
"is-random": false
}
```
If n = 1, it will be interpreted as an equal value query, and the following configuration is generated: WHERE <device_column_name>=<random_device_id>.
```
{
"count": 1, 
"is-random": false
}
```
The string after using raw-expression: "expression" appears directly after WHERE. and other predictates are "AND".
```
{
"use-raw-expression": true, 
"expression": "device_column_name IN (1234, 4321)"
}
```
"ts-predicate" An expression in a WHERE statement about timestamps that accepts a configuration of type JSON.
The default is that there is no expression in the WHERE statement about the device number, that is, the full device. Randomly select 1 time point (within the timestamp range where the data was generated).
```
{
"is-random": true, 
"duration": 3600  
}
```
Set start and end time: Assuming the timestamp field is named ts, an expression of WHERE ts >= '2022-07-15 18:07:00' AND ts <= '2022-07-15 18:31:17' will be generated.
```
{
"start": "2022-07-15 18:07:00", 
"end": "2022-07-15 18:31:17",
}
```
Equivalent query: WHERE ts = '2022-07-15 18:07:00'.
```
{
"start": "2022-07-15 18:07:00", 
"end": "2022-07-15 18:07:00",
}
```
Open interval query: WHERE ts >= '2022-07-15 18:07:00' AND ts < '2022-07-15 18:31:17'.
```
{
"start": "2022-07-15 18:07:00", 
"end": "2022-07-15 18:07:00",
"end-exclusive": true
}
```
Set alias for timestamp fields: WHERE tttt >= '2022-07-15 18:07:00' AND tttt < '2022-07-15 18:31:17'.
```
{
"has-alias": true,
"alias": "tttt",
"start": "2022-07-15 18:07:00", 
"end": "2022-07-15 18:07:00",
"end-exclusive": true
}
```
The string after using raw-expression: "expression" appears directly after WHERE. and its predictates are "AND".
```
{
"use-raw-expression": true, 
"expression": "ts='2022-07-16 10:31:17'"
}
```
"metrics-predicate" Expressions in WHERE statements regarding metric filtering, accepting a configuration of type JSON.
The default is no metric filtering. The string after using raw-expression: "expression" appears directly after WHERE. and other predictates are "AND".
```
{
"use-raw-expression": true, 
"expression": "m1>=37.5"
}
```
"group-by" GROUP BY statement, accepts a configuration of JSON type.
The default is to not add the GROUP BY statement.
Use raw-expression:
```
{
"use-raw-expression": true, 
"expression": "device_column_name,ts"
}
```
"order-by" ORDER BY statement, accepts a configuration of type JSON.
The default is to not add the ORDER BY statement.
Use raw-expression:
```
{
"user-raw-expression": true, 
"expression": "s desc"
}
```
"limit" Accepts a positive integer.

Notes!
For complete command line parameter information for MatrixBench, please refer to MatrixBench Command Line Parameters; for detailed explanation of MatrixBench progress information and statistical reports, please refer to MatrixBench Understanding Progress Information and Statistical Report.

简体中文