MxBench

MxBench is a stress measurement tool for MatrixDB data loading and query. It can quickly generate random data based on the number of devices, time ranges, and number of indicators specified by the user, automatically create data tables, and perform data loading and querying in serial or concurrently. Users can flexibly configure indicator types, acquisition frequency, null rate, randomness, etc., and can also specify the number of threads for query, customized query statements, etc. MxBench can be run through the command line or through a configuration file. This tool is located in bin/mxbench in the MatrixDB installation directory.

1 Preparation

1.1 MatrixDB Cluster

A properly running MatrixDB cluster is required.

1.2 Environment variables

Since MxBench needs to call createdb, gpconfig, and gpstop, the user needs to configure relevant environment variables so that these commands can be executed correctly.

Specifically, you need to execute source <MatrixDB installation directory>/greenplum_path.sh, and the following environment variables need to be correctly set:

PGHOST
PGPORT
PGUSER
PGPASSWORD
PGDATABASE
MASTER_DATA_DIRECTORY

In addition, users can also try to execute commands such as createdb mxbench, gpconfig -s log_rotation_size, gpstop -rai to ensure that they can run correctly.

2 Usage

2.1 Quick Trial

If you want to quickly try MxBench on your personal development machine, you can run MxBench using configuration files or command lines.

2.1.1 Configuration File

You can use the following configuration file, named mxbench.conf and run: mxbench --config mxbench.conf. Note that the setting of the benchmark-parallel parameter needs to adapt to machine performance, and it is recommended that the number of CPU cores is less than or equal.

[database]
  db-database = "testdb1"
  db-master-port = 5432

[global]
  # Turn on progress viewing function, the default is true
  watch = true

  # The generated DDL, GUCs settings suggestions, and the storage directory of Query and other files
  workspace = "/tmp/mxbench"

  table-name = "table1"

  # Number of equipment
  tag-num = 25000
  # Metric data type, supporting four types: int4, int8, float4, float8
  metrics-type = "float8"
  # Number of indicators. If the number of indicators is greater than 998, the first 997 are used as simple columns.
  # The other ones are stored as json in a column named ext
  total-metrics-count = 300

  # The start timestamp of the generated data, ts-end must be later than ts-start, otherwise an error will be reported.
  ts-start = "2022-04-19 09:00:00"
  ts-end = "2022-04-19 09:03:00"

  # Collect indicators every few seconds
  ts-step-in-second = "2"

[generator]
  generator = "telematics"

  [generator.telematics]
    # The indicators of each device at each time point are uploaded in several pieces of data, and finally they are upsert into a tuple in the DB.
    generator-batch-size = 1
    # The generation ratio of delayed uploaded data (1~100), push the timestamp forward for 1 hour
    generator-disorder-ratio = 0
    # The null rate of generated data (1~100)
    generator-empty-value-ratio = 90
    # The randomness of the generated data is several OFF/S/M/L. The default is "OFF". You can select M in general scenarios.
    generator-randomness = "OFF"

[writer]
  writer = "stdin"

[benchmark]
  benchmark = "telematics"

  [benchmark.telematics]
    # Array, query concurrency
    benchmark-parallel = [8]
    # 3 queries provided: latest value of bicycle, latest value of 10 vehicles, and detailed bicycles
    benchmark-run-query-names = ["SINGLE_TAG_DETAIL_QUERY" ]
    # The number of times or time of running each query in each round, to make the time take effect, you need to set the number of times to 0, as follows:
    benchmark-run-times = 0
    benchmark-runtime-in-second = "30"

2.1.2 Command Line

You can also run MxBench using the command line. Running the following command is equivalent to running MxBench in the above configuration file.

mxbench run \
  --db-database "testdb1" \
  --db-master-host "localhost" \
  --db-master-port 5432 \
  --db-user "mxadmin" \
  --workspace "/tmp/mxbench" \
  --table-name "table1" \
  --ts-start "2022-04-19 09:00:00" \
  --ts-end "2022-04-19 09:02:00" \
  --ts-step-in-second 2 \
  --generator "telematics" \
  --writer "stdin" \
  --benchmark "telematics" \
  --benchmark-run-query-names "SINGLE_TAG_DETAIL_QUERY" \
  --benchmark-parallel 8 \
  --benchmark-run-times 0 \
  --benchmark-runtime-in-second 30

2.2 Configuration details

As can be seen from the sample configuration file, the configuration file is divided into the following sections:

Global configuration

database: database-related configuration;
global: Includes some table structure, data volume information, as well as management configurations such as dumping data files or actually performing tests, and whether progress observation is enabled;

Plug-in configuration

generator: data generator:

telematics (default) - generates data for the Internet of Vehicles scenario;
file - read data from a file;
nil - No data is generated.

writer: Responsible for writing generated data to MatrixDB through MxGate:

http (default) - Start MxGate in HTTP and load data into MatrixDB through it;
stdin - Start MxGate in the form of stdin and load data into MatrixDB through it;
nil - MxGate is not started, does not load.

benchmark: Pluginable, generate and execute query statements:

telematics (default) - Generate and execute common query in vehicle network scenarios, supporting customized query;
nil - No query statement generated or executed.

The following are detailed explanations of each module:

2.2.1 database

[database]
  # If the database name does not exist, MxBench will be automatically created; if it already exists, there will be no errors.
  db-database = "postgres"

  ## hostname of the host where the MatrixDB master instance is located
  db-master-host = "localhost"

  ## Port number of the MatrixDB master instance
  db-master-port = 5432

  # MatrixDB User Password
  db-password = ""

  # MatrixDB Username
  db-user = "mxadmin"

2.2.2 global

[global]

  # MxBench generates DDL files, existing system GUCs recovery scripts, best practice recommended GUCs change scripts,
  # csv data file, query file directory.
  # If it does not exist, MxBench will be created automatically; if it already exists and is not a directory, an error will be reported. You may need to pay attention to permission issues.
  # Every time you run MxBench, a directory named Unix timestamp is created under it, and the files generated by that run are in that directory.
  # Default is "/tmp/mxbench"
  workspace = "/tmp/mxbench"

  # Whether to dump the csv data file, optionally true or false.
  # The default is false, and DDL, data loading, and query will be performed.
  # In the directory with Unix timestamps under workspace, DDL, GUCs-related scripts and query files will be generated.
  # 1. mxbench_ddl.sql: DDL file;
  # 2. mxbench_gucs_setup.sh: Best practice recommended GUCs setup script, you may need to restart MatrixDB before all of them take effect.
  # 3. mxbench_gucs_backup.sh: The existing system GUCs backup. If the GUCs of existing systems are consistent with the GUCs of best practices, they will not be generated.
  # 4. mxbench_<benchmark-plugin>_query.sql: The query statement generated by the corresponding benchmark plugin.
  # If true is selected, DDL, data loading and query will not be actually performed; in addition to the above files, it will also generate:
  # 5. mxbench_<generator-plugin>_data.csv: The csv file of data generated by the corresponding generator plugin.
  dump = false

  # Whether to enable process observation, you can choose true or false.
  #(Default) If true is selected, the execution progress information of the writer and benchmark modules will be printed every 5 seconds.
  # Select false to close.
  watch = true

  # Whether to perform data loading and query at the same time, optionally true or false.
  # If true is selected: perform a mixed load of data loading and querying
  # After query is finished, it will run again until the data loading is over.
  # (Default) If false is selected: execute data loading first, and then execute query after execution.
  simultaneous-loading-and-query = false

  # If you need to customize DDL, fill in the path of the DDL file in this parameter.
  # (Default) If you do not fill in, DDL will be generated based on other related configurations.
  ddl-file-path = ""

  # Type of indicator. The default is "float8", which is a double-precision floating point number.
  # Only "int4", "int8", "float4", "float8", and 4 types are supported.
  metrics-type = "float8"

  # The schema name, defaults to "public".
  schema-name = "public"

  # Data table name. The default is "", which must be set manually.
  # If the table with the same name exists in the configured database or schema, an error will be reported and the MxBench program will be terminated.
  table-name = "test_table"

  # Number of equipment. Default is 25000.
  tag-num = 25000

  # Total number of indicators. Default is 300.
  # If the total number of indicators is greater than 998, the exceeded indicators will be stored in a column named "ext" as json type.
  # For example, if the total number of indicators is set to 1000, 998 will be stored in a column named c0~c997 in a simple column.
  # The other 2 are placed in the json type column of ext in json form.
  total-metrics-count = 300

  # Generate data timestamp start time. Because there is delayed reporting of data, the data generated may be earlier than this time.
  ts-start = "2022-04-25 09:00:00"

  # Generate data timestamp termination time. ts-end must be later than ts-start, otherwise an error will be reported
  ts-end = "2022-04-25 09:01:00"

  # The indicators are collected every few seconds. Default is "1".
  ts-step-in-second = "1"

2.2.3 generator

2.2.3.1 telematics (default)

To generate the data of the Internet of Vehicles scenario, select generator="telematics".

[generator]

  generator = "telematics"

  [generator.telematics]

    # Each device has several metrics uploaded to MatrixDB in each time stamp.
    # For example, if set to 5, the device named "tag1" is under the timestamp "2022-04-25 09:00:03",
    # The data of each indicator will be transmitted to MatrixDB in 5 pieces and will eventually be merged into 1 tuple.
    # The default is 1. That is, the indicator information is not split.
    generator-batch-size = 1

    # The proportion of delayed reporting of data. Values 1 to 100.
    # The default value is 0, which means there is no delayed reporting of data.
    generator-disorder-ratio = 0

    # The null rate of each row of data. The value is 1~100. The default is 90%, that is, 90% of the indicators will be empty.
    generator-empty-value-ratio = 90

    # The randomness of the index data is divided into four levels: OFF/S/M/L. The default is OFF.
    generator-randomness = "OFF"

2.2.3.2 file

Read data from the csv file and load it, select generator="file".

[generator]

  generator = "file"

  [generator.file]

    # The absolute path to data csv data. It can receive an array, that is, upload multiple csv files.
    generator-file-paths = []

    # In the same as [generator.telematics], setting these two parameters is not required.
    # But proper settings will help us better generate DDL statements. If you select a custom DDL statement, it will not work.
    generator-batch-size = 1
    generator-empty-value-ratio = 90

2.2.3.3 nil

To not generate or load any data, select generator="nil".

[generator]

  generator = "nil"

2.2.4 writer

2.2.4.1 http(default)

Start MxGate in http form and load data.

[writer]

  writer = "http"

  [writer.http]

    # Whether to use gzip compression to send http messages, it does not use default
    writer-use-gzip = false

    # The concurrency of sending data to MxGate.
    writer-parallel = 8

2.2.4.2 stdin

Start MxGate in stdin mode and load data.

[writer]

  writer = "stdin"

2.2.4.3 nil

Do not start MxGate and do not write data.

[writer]

  writer = "nil"

2.2.5 benchmark

2.2.5.1 telematics (default)

[benchmark]

  benchmark = "telematics"

  [benchmark.telematics]

    # The query name provided by telematics to be executed sequentially,
    # Now available:
    # 1. "SINGLE_TAG_LATEST_QUERY"
    # 2. "MULTI_TAG_LATEST_QUERY"
    # 3. "SINGLE_TAG_DETAIL_QUERY"
    # There are 3 legal query names in total, namely:
    # "SINGLE_TAG_LATEST_QUERY": Get the values of each indicator of the bicycle's most recent timestamp;
    # "MULTI_TAG_LATEST_QUERY": Randomly select 10 cars to obtain the various indicator values of their most recent timestamps;
    # "SINGLE_TAG_DETAIL_QUERY": Get the values of each indicator of the bicycle over a period of time.
    # Note: For ultra-wide tables, there are many indicators, and DBMS may not support getting all indicator values at once.
    # Therefore, the following values are taken from the number of indicators obtained by the parameter debugging and the time period of "SINGLE_TAG_DETAIL_QUERY".
    # For example, enter [ "SINGLE_TAG_LATEST_QUERY", "MULTI_TAG_LATEST_QUERY", "SINGLE_TAG_DETAIL_QUERY" ] to execute the above three query sequentially.
    # On this basis, deleting the query name can prevent the corresponding query from being executed. Entering another name will be ignored.
    # By default, no preset query is executed.
    benchmark-run-query-names = [ "SINGLE_TAG_DETAIL_QUERY" ]

    # Custom query, separated using ","
    # For example, ["SELECT COUNT(*) from t1", "SELECT MAX(ts) from t1"]
    # By default, no custom query is performed.
    benchmark-custom-queries = []

    # To run the concurrency of a query, you can enter multiple inputs, and execute each query in sequence according to the concurrency, separated by ","
    # For example, by entering [1, 8], you can run each query with concurrency of 1 first, and then run each query with concurrency of 8.
    # Default is empty.
    benchmark-parallel = [8]

    # The number of times each query runs under each concurrency is determined by the delay and TPS statistics based on the results of so many runs, and the default is 0.
    benchmark-run-times = 0

    # The time (seconds) of each query runs under each concurrency, and the delay and TPS statistics are performed based on the results of query execution during this period.
    # Only effective if benchmark-run-times is 0. The default is 60, which means that each query runs for 60 seconds at each concurrency.
    benchmark-runtime-in-second = "60"

2.2.5.2 nil

If no query is required, set benchmark to nil.

[benchmark]

  benchmark = "nil"

2.3 Sample Configuration File

2.3.1 Ultra-wide sparse table generates data and runs hybrid loads

[database]
  db-database = "testdb2"
  db-master-port = 5432

[global]
  # Turn on progress viewing function, the default is true
  watch = true

  # The directory for the generated DDL, GUCs best practice recommendations, Query and other files storage
  workspace = "/home/mxadmin/mxbench/workspace"

  # Whether data loading and querying are performed simultaneously
  simultaneous-loading-and-query = true

  table-name = "table2"

  # Number of equipment
  tag-num = 20000
  # Metric data type, supporting four types: int4, int8, float4, float8
  metrics-type = "float8"
  # Number of indicators. If the number of indicators is greater than 998, the first 997 are used as simple columns.
  # The other ones are stored as json in a column named ext
  total-metrics-count = 5000

  # The start timestamp of the data generated, ts-end must be later than ts-start, otherwise an error will be reported.
  ts-start = "2022-04-19 00:00:00"
  ts-end = "2022-04-19 00:01:00"

[generator]
  generator = "telematics"

  [generator.telematics]
    # The indicators of each device at each time point are uploaded in several pieces of data, and finally they are upsert into a tuple in the DB.
    generator-batch-size = 1
    # The generation ratio of delayed uploaded data (1~100), push the timestamp forward for 1 hour
    generator-disorder-ratio = 0
    # The null rate of generated data (1~100)
    generator-empty-value-ratio = 90
    # The randomness of the generated data is several levels, with OFF/S/M/L, and the default "OFF" is turned off.
    generator-randomness = "OFF"

[writer]
  writer = "stdin"

[benchmark]
  benchmark = "telematics"

  [benchmark.telematics]
    # Array, query concurrency
    benchmark-parallel = [64]
    # 3 queries provided: latest value of bicycle, latest value of 10 vehicles, and detailed bicycles
    benchmark-run-query-names = [ "SINGLE_TAG_LATEST_QUERY", "MULTI_TAG_LATEST_QUERY", "SINGLE_TAG_DETAIL_QUERY" ]
    # The number of times or time of running each query in each round, to make the time take effect, you need to set the number of times to 0, as follows:
    benchmark-run-times = 0
    benchmark-runtime-in-second = "60"

2.3.2 Read DDL from the outside and read csv file externally to load, without running query

[database]
  db-database = "testdb3"
  db-master-port = 5432

[generator]
  # Read data from csv file
  generator = "file"

  [generator.file]
    generator-file-paths = ["/home/mxadmin/mxbench/data.csv"]


[global]

  table-name = "table3"

  watch = true
  workspace = "/home/mxadmin/mxbench/workspace"
  ddl-file-path = "/home/mxadmin/mxbench/ddl.sql"

[writer]
  writer = "stdin"

[benchmark]
  benchmark = "nil"

2.4 Sample command line

2.4.1 Ultra-wide sparse table generates data and runs hybrid loads

Running MxBench using sample configuration file 1 is equivalent to running MxBench using the following command line:

mxbench run \
  --db-database "testdb2" \
  --db-master-port 5432 \
  --db-user "mxadmin" \
  --workspace "/home/mxadmin/mxbench/workspace" \
  --simultaneous-loading-and-query \
  --table-name "table2" \
  --tag-num 25000 \
  --metrics-type "float8" \
  --total-metrics-count 5000 \
  --ts-start "2022-04-19 00:00:00" \
  --ts-end "2022-04-19 00:01:00" \
  --generator "telematics" \
  --generator-batch-size 1 \
  --generator-disorder-ratio 0 \
  --generator-empty-value-ratio 90 \
  --generator-randomness "OFF" \
  --writer "stdin" \
  --benchmark "telematics" \
  --benchmark-run-query-names "SINGLE_TAG_LATEST_QUERY" \
  --benchmark-run-query-names "MULTI_TAG_LATEST_QUERY" \
  --benchmark-run-query-names "SINGLE_TAG_DETAIL_QUERY" \
  --benchmark-parallel 64 \
  --benchmark-run-times 0 \
  --benchmark-runtime-in-second 60

2.4.2 Read DDL from the outside and read csv file externally to load, without running query

Running MxBench using sample configuration file 2 is equivalent to running MxBench using the following command line:

mxbench run \
  --db-database "testdb3" \
  --db-master-port 5432 \
  --workspace "/home/mxadmin/mxbench/workspace" \
  --ddl-file-path "/home/mxadmin/mxbench/ddl.sql" \
  --table-name "table3" \
  --generator "file" \
  --generator-file-paths "/home/mxadmin/mxbench/data.csv" \
  --writer "stdin" \
  --benchmark "nil"

2.5 FAQ

Only load, no query Set benchmark to nil;
Only query, no loading Set generator to nil;
Loading and querying run at the same time Simultaneous-loading-and-query is true in global settings.
Want to generate and dump the csv data file In the global setting, dump is true, and the generated file is in the directory under the directory set by the workspace.
Want to view the generated ddl and query The directory is in the directory set by the workspace.
Want to run custom DDL Fill in the absolute path of the ddl file in ddl-file-path in global settings.
Want to run a custom query Fill in the custom query statement in the benchmark-custom-queries of telematics benchmark and enclose it with "". Random parameters are not supported.
Do not want to use the system recommended GUCs, keep the existing GUCs running mxbench: When MxBench detects that there is inconsistency between existing systems and recommended GUCs, it will prompt in the standard output and ask whether it is necessary to reset the GUCs and start the database. Enter "N" to retain the original GUCs. MxBench will confirm again whether to continue running MxBench. Select "Y" and continue running.
What are the requirements for the legality of parameters? In the global configuration:

ts-end must be later than ts-start;
table-name and schema-name are not empty;
tag-num must be greater than 0;
ts-step-in-second is not 0.

3 Understand progress information and statistical reports

3.1 Progress information

Example:

● Stdin Writer Report
  ● period start: 2022-04-29 10:08:11, end: 2022-04-29 10:08:16, period: 5.000129734s

  ● count written in total: 637025 rows/ 1500000 rows 42%, 637025 rows in this period

  ● size written in total: 160878718 bytes/ 360000003 bytes 44%, 160878718 bytes in this period

  ● size written to MxGate in total: 350075440 bytes, 350075440 bytes in this period

● Telematics Benchmark Report
  ● stats for query SINGLE_TAG_LATEST_QUERY with parallel 8: progress: 100%

  ● stats for query MULTI_TAG_LATEST_QUERY with parallel 8: progress: 43%

Note: It consists of two parts in total, namely the progress report of writer and benchamrk.

3.1.1 writer

period start, period end, period: the start and end time and time period of the statistics window;
count written: The number of data rows written and the estimated number of data rows, and the percentage of both. xx in this period: the number of rows written in this statistics window;
size written: The number of data bytes and the estimated data bytes that have been written, and the percentage of both. xx in this period: the number of bytes written in this statistics window;
size written to MxGate: The number of bytes written to MxGate after converting data into text. xx in this period: The number of bytes written to MxGate in this statistics window.

3.1.2 benchmark

The execution progress of a query under a certain concurrency parallel parameter. When query is simultaneously in progress with data loading, query will continue until the data loading is over, so it may run multiple rounds in a loop. This progress report only shows the latest round progress report

3.2 Statistical Report

3.2.1 writer

┌───────────────────────────────────────────────────────┐
│            Summary Report for Stdin Writer            │
├─────────────────────────────────┬─────────────────────┤
│ start time:                     │ 2022-04-27 13:29:01 │
├─────────────────────────────────┼─────────────────────┤
│ stop time:                      │ 2022-04-27 13:29:58 │
├─────────────────────────────────┼─────────────────────┤
│ size written to MxGate (bytes): │ 848333400           │
├─────────────────────────────────┼─────────────────────┤
│ lines inserted:                 │ 1500000             │
├─────────────────────────────────┼─────────────────────┤
│ compress ratio:                 │ 1.561276 : 1        │
└─────────────────────────────────┴─────────────────────┘

start time: data load start time;
end time: data load termination time;
size written to MxGate (bytes): The number of bytes to write data to MxGate;
lines inserted: The number of inserts data. Since upsert may exist, this number may be higher than the actual number of data in the database.
compress ratio: compression ratio, that is, the ratio of the size of the data written to MxGate to the size of the table in the actual database.

3.2.2 benchmark

Each query will generate a report under each parallel parameter and will be printed out in real time.

┌─────────────────┬───────────────┐
│ Overall Latency │ 29.948370582s │
│ Average Latency │   13.723304ms │
│ P75 Latency     │    14.35972ms │
│ P50 Latency     │   13.654556ms │
│ P25 Latency     │   12.925912ms │
│ TPS             │           582 │
└─────────────────┴───────────────┘

Pxx represents the delay of the xx percentile. For example, P75 is 14.35972ms, which means that 25% of the number of times the query is executed are higher than it and 75% lower than it. P50 is the median.
TPS: The number of query executed per second.

Summary report:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                Summary Report for Telematics Benchmark                                                │
├─────────────────────┬─────────────────────────────────────┬─────────────────────────────────────┬─────────────────────────────────────┤
│ Parallel\Query Name │ SINGLE_TAG_LATEST_QUERY             │ MULTI_TAG_LATEST_QUERY              │ SINGLE_TAG_DETAIL_QUERY             │
├─────────────────────┼─────────────────────────────────────┼─────────────────────────────────────┼─────────────────────────────────────┤
│ 8                   │ ┌─────────────────┬───────────────┐ │ ┌─────────────────┬───────────────┐ │ ┌─────────────────┬───────────────┐ │
│                     │ │ Overall Latency │ 30.008625133s │ │ │ Overall Latency │ 36.407016561s │ │ │ Overall Latency │ 29.948370582s │ │
│                     │ │ Average Latency │    23.81758ms │ │ │ Average Latency │  7.270212105s │ │ │ Average Latency │   13.723304ms │ │
│                     │ │ P75 Latency     │   24.911171ms │ │ │ P75 Latency     │  8.139682559s │ │ │ P75 Latency     │    14.35972ms │ │
│                     │ │ P50 Latency     │   23.411248ms │ │ │ P50 Latency     │  7.161237157s │ │ │ P50 Latency     │   13.654556ms │ │
│                     │ │ P25 Latency     │   20.899622ms │ │ │ P25 Latency     │  6.642459805s │ │ │ P25 Latency     │   12.925912ms │ │
│                     │ │ TPS             │           335 │ │ │ TPS             │             1 │ │ │ TPS             │           582 │ │
│                     │ └─────────────────┴───────────────┘ │ └─────────────────┴───────────────┘ │ └─────────────────┴───────────────┘ │
│                     │ progress: 100%                      │ progress: 100%                      │ progress: 100%                      │
└─────────────────────┴─────────────────────────────────────┴─────────────────────────────────────┴─────────────────────────────────────┘

Each line represents the execution result of each query under a certain concurrency degree.
Each column represents the execution result of each query at each concurrency degree.
If the query set has performed multiple rounds (in the case of mixed load, the query will be executed continuously if the data load is not finished), only the last round results will be displayed.
If the query execution error or the user interrupts the execution, the progress bar will display the current progress, and the statistics are based on the query that has been executed.

简体中文