MatrixGate, a high-performance data loading tool

Please refer to [MatrixDB Data Access] for the teaching video of this course (https://www.bilibili.com/video/BV1kQ4y1C7Ar?share_source=copy_web)

MatrixGate is abbreviated as mxgate, and is a high-performance data loading tool that comes with MatrixDB.

Using mxgate for data loading performance is much higher than that of native INSERT statements. Because mxgate can communicate directly with segment, there is no master single point bottleneck.

1. Comparison of INSERT and MatrixGate

| Writing method | Advantages | Disadvantages | Applicable scenarios | | --- | --- | --- | | Direct INSERT | Simple interface | Low throughput | Low throughput, hundreds of thousands of data points/second | | MatrixGate | High throughput
Standard real-time | Additional deployment is required, with operation and maintenance costs | High throughput, tens of millions of data points/second |

After testing, using MatrixGate for data import can improve performance by dozens of times compared to competitors. For specific data, please refer to [Temporary Database Insertion Performance Evaluation: MatrixDB is 78 times that of InfluxDB] (https://ymatrix.cn/article/27)

2. How to use MatrixGate

MatrixGate provides the following operating mode:

  • Service mode
  • Command line mode
  • Migration mode

The following shows how to use these two modes to pour data into the data table. The schema of the data table dest is as follows:

CREATE TABLE dest(
    time timestamp,
    c1 int,
    c2 text
)DISTRIBUTED BY(c1);

2.1 Service mode

The service mode has backend processes that are permanently resident, providing HTTP interface to users and submitting time sequence data, which is a common way to use in production environments.

2.1.1 Generate configuration files

Use the service model to first generate configuration files and determine the database connection information, target table and other parameters

mxgate config --db-database test \
            --db-master-host localhost \
            --db-master-port 5432 \
            --db-user mxadmin \
            --target public.dest \
            --time-format raw \
            --delimiter ',' \
            > mxgate.conf

As determined in the above command:

| Parameter name | Description | Value | | --- | --- | | --db-database | Database | test | | --db-master-host | Database host | localhost | | --db-master-port | Database Port | 5432 | | --db-user | Database username | mxadmin | | --target | Target Table | public.dest | | --time-format | Time format | raw (plain text) | | --delimiter | Delimiter | , |

2.1.2 Start MatrixGate

Then, start MatrixGate and specify the configuration file you just generated in the startup parameters:

mxgate start --config mxgate.conf
**********************************************************
 __  __       _        _       ____       _
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: 4.2.0
  Your Copy is Licensed to: yMatrix.cn; 2022-03-01; any
**********************************************************
Launching MatrixGate daemon...
MatrixGate daemon started successfully

2.1.3 Submit data

After the startup is successful, use the curl tool to send HTTP request to submit data.

In production environment, use the HTTP library supported by the programming language to submit data

The test data file rows_header.csv has been prepared, and the content is as follows:

[mxadmin@sdw2 ~]$ cat rows_header.csv
public.dest
2021-01-01 00:00:00,1,a1
2021-01-01 00:00:00,2,a2
2021-01-01 00:00:00,3,a3

When submitting data, the first row must specify the target table name, because the MatrixGate service may have multiple target tables.

Submit data:

curl http://localhost:8086/ -X POST -H 'Content-Type: text/plain' --data-binary "@rows_header.csv"

MatrixGate binds to port 8086 by default, which can be modified through configuration files.

Query the injected data:

test=# select * from dest;
        time         | c1 | c2
--------------------+-------------------------------------------------------------------------------------------------------------
 2021-01-01 00:00:00 | 11 | a11
 2021-01-01 00:00:00 | 12 | a12
 2021-01-01 00:00:00 | 13 | a13
(3 rows)

For more detailed API parameters, please refer to Document

2.1.4 Operation and maintenance management

MatrixGate also provides other operation and maintenance commands for operation and maintenance management.

View Status

mxgate status
**********************************************************
 __  __       _        _       ____       _
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: 4.2.0
  Your Copy is Licensed to: yMatrix.cn; 2022-03-01; any
**********************************************************
PID          15146 alive
Launched At  2021-09-01 14:59:03
Up For       26 seconds
Binary       /usr/local/matrixdb-4.2.0.community/bin/mxgated
Log          /home/mxadmin/gpAdminLogs/matrixgate.2021-09-01_145904.log
Config       /home/mxadmin/mxgate.conf

You can see the service program running status, configuration files and log paths, which are used to track down problems.

Stop service

mxgate stop
**********************************************************
 __  __       _        _       ____       _
|  \/  | __ _| |_ _ __(_)_  __/ ___| __ _| |_ ___
| |\/| |/ _` | __| '__| \ \/ / |  _ / _` | __/ _ \
| |  | | (_| | |_| |  | |>  <| |_| | (_| | ||  __/
|_|  |_|\__,_|\__|_|  |_/_/\_\\____|\__,_|\__\___|
  Version: 4.2.0
  Your Copy is Licensed to: yMatrix.cn; 2022-03-01; any
**********************************************************
PID 15146 stopped

mxgate stop can stop the service.


2.2 Command Line Mode

Command line mode is used to pour data files into one go, and the process exits after the end.

It's still the data file just now. Remove the first row of the target table, only keep the data rows, and execute the following command:

cat rows.csv | mxgate --source stdin --db-database test --db-master-host localhost --db-master-port 5432 --db-user mxadmin --time-format raw --target public.dest --parallel 2  --delimiter ',' 

For more methods of file access, please refer to Document


2.3 Migration Mode

The migration mode is used for high-speed data migration, and supports migration of data tables from other Greenplum5, Greenplum6, and MatrixDB clusters to the current MatrixDB cluster.

The usage is as follows:

mxgate --source transfer \
         --src-host 172.31.41.7 \
         --src-port 5432 \
         --src-db postgres \
         --src-user ec2-user \
         --src-password abc \
         --src-schema public \
         --src-table trans_ao \
         --compress "gzip" \
         --port-base 9129 \
         --local-ip 172.31.33.128 \
         --db-database ttt \
         --target public.trans_ao \
         --format text \
         --time-format raw \
         --use-auto-increment=false

in:

Parameter name Description
--source Function portal, 'transfer' must be specified
--src-host IP address of source library master
--src-port Port number of source library master
--src-user Username for connecting to the source library (superuser is recommended)
--src-password Connection Password
--src-schema schema name of the source table
--src-table Table name of the source table
--compress Transfer method from the source database segment host to this data:
The blank string "" means non-compression, plain text transmission
gzip: Using gzip compression, the linux command gzip that requires the source database must be installed on the segment host
lz4: Using lz4 compression, the linux command lz4 that requires the source database must be installed on the segment host
recommended lz4 > gzip > non-compression
--port-base A batch of ports will be occupied during transmission, and the port range is 9129~
--local-ip The IP address that must be connected to the local machine using the source library
--db-database The database name where the migration target table is located
--target The migration target table name can be in the form of \<schema>.\<table>. If the schema name is not written, the default is public
--format text or csv, CSV is required only if there are complex strings (including newlines, quotes, and separators) in the migrated data. When both text/csv are available in other cases, text mode is preferred
--time-format must be raw in transfer mode
--use-auto-increment When the target table includes a self-increment field of serial type, the fields of this type will be skipped by default in mxgate. This option is added to close the logic of skipping mxgate

Another usage of migration mode is to quickly export data to a file:

mxgate --source transfer \
         --src-host 172.31.41.7 \
         --src-port 5432 \
         --src-db postgres \
         --src-user ec2-user \
         --src-schema public \
         --src-table trans_ao_1 \
         --compress "lz4" \
         --port-base 9129 \
         --local-ip 172.31.33.128 \
         --save-to-dir /tmp/receive/ \
         --db-database ttt \
         --transform nil \
         --writer nil \
         --target trans_ao

Use the --save-to-dir parameter to specify the file storage path.

Note that even if exported to a file, the --db-database and --target parameters need to be given to specify the target library and table, and the target library and table must exist

Filtering migration can be specified by SQL to filter data that needs to be synchronized by using the --src-sql parameter. It can be used in table-to-table migration and table-to-file migration:

mxgate --source transfer \
         --src-host 172.31.41.7 \
         --src-port 5432 \
         --src-db postgres \
         --src-user ec2-user \
         --src-sql "select * from demo where c1 = 'xxxx'" \
         --compress "lz4" \
         --port-base 9129 \
         --local-ip 172.31.33.128 \
         --save-to-dir /tmp/receive/ \
         --db-database ttt \
         --transform nil \
         --writer nil \
         --target trans_ao

For more usage of migration mode, please refer to Document


3. MatrixGate's UPSERT semantic support

When accessing timing data, we may face this scenario:

  • Device data is not sent at once, but is sent in batches. It needs to be merged according to the device number and timestamp as the primary key
  • Device data may be sent repeatedly, and duplicate data should be updated instead of repeated insertion

MatrixGate 4.2 version has added UPSERT semantics to solve the above problem.

3.1 Loading data using UPSERT semantics

3.1.1 Create a data table

CREATE TABLE upsert_demo (
    ts    timestamp
  , tagid int
  , c1    int
  , c2    int
  , UNIQUE(ts, tagid)
) DISTRIBUTED BY (tagid);

Note that in order for the database to use the UPSERT function, UNIQUE constraints must be created on the device id + timestamp of the table.

3.1.2 Prepare data files

upsert_demo1.dat:

2020-11-11|1|10|

upsert_demo2.dat:

2020-11-11|1||20
2020-11-11|2||100
2020-11-11|2|200|

3.1.3 Loading data files

Load upsert_demo1.dat:

cat upsert_demo1.dat|mxgate --source stdin \
  --db-database test \
  --db-master-host localhost \
  --db-master-port 5432 \
  --db-user mxadmin \
  --time-format raw \
  --delimiter "|" \
  --target upsert_demo \
  --upsert-key ts \
  --upsert-key tagid

Query results:

test=# select * from upsert_demo ;
         ts          | tagid | c1 | c2
--------------------+------+------------------------------------------------------------------------------------------------------
 2020-11-11 00:00:00 |     1 | 10 |
(1 row)

Load upsert_demo2.dat:

cat upsert_demo2.dat|mxgate --source stdin \
  --db-database test \
  --db-master-host localhost \
  --db-master-port 5432 \
  --db-user mxadmin \
  --time-format raw \
  --delimiter "|" \
  --target upsert_demo \
  --upsert-key ts \
  --upsert-key tagid

Query results:

test=# select * from upsert_demo ;
         ts          | tagid | c1  | c2
--------------------+-------------------------------------------------------------------------------------------------------------
 2020-11-11 00:00:00 |     1 |  10 |  20
 2020-11-11 00:00:00 |     2 | 200 | 100
(2 rows)

From the results, we can see that the same row data of ts and tagid are merged.

3.2 UPSERT generates the configuration file for service mode

mxgate config \
  --db-database test \
  --db-master-host localhost \
  --db-master-port 5432 \
  --db-user mxadmin \
  --time-format raw \
  --delimiter "|" \
  --target upsert_demo \
  --stream-prepared 0 \
  --upsert-key ts \
  --upsert-key tagid > mxgate.conf

Note: When using the UPSERT feature in the service mode, you need to set the stream-prepared parameter to 0, otherwise it will trigger a deadlock.

3.3 UPSERT usage precautions

  • UNIQUE constraints and --upsert-key must be exactly the same, and there is no requirement for the parameter passing order
  • --upsert-key must define UNIQUE constraints


4. MatrixGate's fault tolerance mechanism

Because MatrixGate uses an external table mechanism internally, microbatches insert data into the target table. Therefore, any piece of data is entered in batches together with other submitted data when it is entered into the database. If any of the data formats are incorrect, the entire batch of data will fail to be stored.

Starting from 4.3, MatrixGate has added a fault tolerance mechanism. A single data format error will not affect other data entry, and will return error data information and record error logs.

Note: Fault tolerance is only for format errors, and if the constraint rules are violated (unique index) it will still fail in batches.

4.1 Error message

Unlike the previous error data that directly returns HTTP 500, the HTTP code returned after fault tolerance is still 200. The response body will contain error line information, for example:

At line: 2
missing data for column "c3"

4.2 Threshold Control

Of course, it is not unlimited tolerance errors, and tolerance thresholds are related to GUC: gp_initial_bad_row_limit. When the number of wrong data rows exceeds 5 * gp_initial_bad_row_limit, batch write failed.

For more detailed usage methods, please refer to MatrixGate.