mxshift

This document introduces the data migration tool mxshift.

1 Description

The mxshift tool currently supports the following features:

  • Full migration: Migrates data table data definition language (DDL) and data from Greenplum 4.3.X/5/6/YMatrix clusters to a new YMatrix cluster on a database-by-database basis.
  • Incremental migration: Migrates DDL and data from selected tables in the target database from Greenplum 5/6/YMatrix clusters to a new YMatrix cluster in parallel, on a per-table basis.
  • Conditional migration: Enables conditional migration of data using WHERE conditions.
  • Supported data distribution strategies: Hash distribution, random distribution, replicated tables (REPLICATED), and Master-only.

2 Parameter Information

This section mainly introduces the parameter information related to the mxshift tool.

2.1 configuration file

To use the mxshift tool, you must first prepare a TOML file. The following is a detailed description of the parameters in the configuration file:

Parameter Name Description Required
[database.source] category
--db-host Source cluster Master node host name (Host) or IP address. The source database cluster refers to the database cluster from which data needs to be migrated Yes
--db-port Port number of the source database Master Yes
--db-database Source database name Yes
--db-user Source database username Yes
--db-password Source database user password Yes
--install-dir Database installation path Yes
--hostname-to-ip Specify the IP address corresponding to the hostname. When the hostnames of the source and target databases are the same, use this configuration to ensure routing accuracy No
[database.target] category
--db-host The hostname (Host) or IP address of the Master node in the target cluster. The target cluster refers to the database cluster into which data is migrated Yes
--db-port The port number of the target database Master Yes
--db-database The name of the target database Yes
--db-user The username of the target database Yes
--db-password The password of the target database user Yes
[scope]Category
--mode Running mode. The following values ​​are allowed: normal, dryrun, fetch, motion.
normal: Default value, refers to the normal data migration process;
dryrun: Verify the connectivity of both databases, use a simple database definition language (DDL) for verification operations, but does not read and write data;
fetch: On the basis of dryrun, the data on the source side is read and sent to the target data node (Segment), but it is discarded directly and does not process it. This option is used to verify the source side reading efficiency and the network transmission efficiency of the source side to the target side;
motion: On the basis of fetch, the data is parsed on the target side database and exchanged to the real target Segment. This option is used to further test the data exchange rate of the target side database
No
--compress-method Compression method: gzip / zstd / lz4 / 0, the default value is 0, that is, it is not compressed.
At present, any compression algorithm is enabled, the mxadmin user of each Segment on the target side needs to log in to each Segment on the source side without password ssh. It is recommended to use dryrun to test it after configuration
No
--gphome GPHOME absolute path to the source database (v5.0.0 is available, v5.0.1 and above must use the install-dir configuration) Yes
--table-list Schema (Schema) and name list of source data table. If it is a partition table, you need to fill in the partition table. If you need to migrate the entire database, the parameter is configured as empty. mxshift will migrate all tables of size not 0 from large to small No
--exclude-table-list Schema and name list of tables to be excluded No
--schema-list Schema of the source database. The priority is that if table-list has a specific table configured, only the tables in table-list are transferred. If not, check whether schema-list is configured. If configured, all tables under the Schema. If not, transmit all tables under [database.source]``db-database` No
--exclude-schema-list Exclude Schema of the source database and do not migrate data to the tables under this Schema. If schema-list and exclude-schema-list are configured, then the Schema will be excluded No
[log]Category
--log-level The default value is info, which will print normal progress information, error and warning information. If changed to debug, more debug information will be output, only for development and fault location No
--no-color Default is false. true will close the color tag No
[controller]Category
--both-way Default is false, and the largest table will be migrated first to achieve the shortest time to migrate the complete data. If both-way is set to true, it will be migrated from both the largest and smallest ends at the same time No
--concurrency The number of tables that support parallel migration, default is 5, no more than 10 No
[transfer]Category
--verify Check whether the number of rows before and after data migration matches, the default is false No
--with-index Do not migrate the index? The default is false. Choose not to perform automatic migration. At this time, mxshift will save the SQL statements for migrating the index in a file under the current mxshift execution path, so that the index can be created manually later. Note: If you are using the MARS2 storage engine to build tables, you must create an index table to take effect. Therefore, while migrating DDL in MARS2 table, the index will be automatically migrated, and will not be affected by this option No
[ddl]Category
--enabled Whether to migrate data table DDL, default is false. After starting, mxshift will prioritize its DDL migration by database before migrating data from a data table.
DDL migration supports database version:
Greenplum6 -> YMatrix4 / YMatrix5;
YMatrix4 -> YMatrix4 / YMatrix5;
YMatrix5 -> YMatrix5
No
--only-ddl Whether only DDL is migrated. The default is false. When configured as false, the data of the relevant table will be automatically migrated after the DDL migration is completed; when configured as true, the program will be exited after the DDL migration is completed, and the table data will not be migrated No
--skip-resource-queue-and-group Configure whether to migrate resource queues (groups). When skip-resource-queue-and-group=true, the resource group and resource queue will be skipped during the DDL migration No
--skip-table-space Configure whether to migrate tablespaces. When skip-table-space=true, tablespace will be skipped during DDL migration No

If you want to implement conditional migration through WHERE conditions, you may use the following parameters:

Category Subcategory Subparameters Sub-subcategory Sub-Subparameters Description Required
[transfer]Category [transfer.table-data-where-sql]Category --enabled If `true`, the source database data is filtered using the WHERE condition. Only data that meets the WHERE condition is migrated to achieve incremental migration. This function is turned off by default. Manual identification is required when enabled. `true` No
--global This parameter refers to the `WHERE` expression of SQL. mxshift will append this conditional expression to all tables that are migrated. If the `WHERE` keyword is not included in the table, it is prompted that `SKIPPED` will automatically skip the table No
[[transfer.table-data-where-sql.override]] Category --where represents a WHERE subexpression with a higher priority than the WHERE expression in global and apply only to a table level that filters a specific table. No
[transfer.table-data-where-sql.override.table]Category --schema Schema name to be migrated No
--name Table name to be migrated No

See the configuration file template below for examples of the above parameters usage.

2.2 Command line parameters

The mxshift command line parameters are as follows:

Parameter name Description
--config Print the default content of the configuration file
-a or --all Perform migration directly without confirmation. Completed tables and empty tables are skipped by default
-c or --config-path mxshift tool global configuration file path
-v or --version mxshift tool version
-l or --show-progress Print the progress of the last migration. If it is not currently migrating, the results of the last migration will be displayed; if it is currently migrating, the progress of the current migration will be displayed. Among them, the value of "STATUS" and its meaning are: Not start: not within the migration range; Skip: skip (migrated or empty table); Processing/Interrupted: is being migrated or interrupted during the migration process; Complete: complete migration; Failed: migration failed
-R or --redo If the table has been migrated, do not skip, remigrate (this option will clear all data from the table in the target library and remigrate)

3 Examples

The complete template file is named config.toml. This template is available by executing mxshift --config.

We give an example template for different usage scenarios below. Please use this as a reference and use the vim config.toml command to modify the file to complete the configuration according to the specific situation.

3.1 Full migration

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ## 2. Target database is un-reachable, and 'ddl.mode' is 'output' */
         # db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
         ## The installation directory of database
         install-dir="/usr/local/greenplum-db-6.7.1"
         [[database.source.hostname-to-ip]]
             ## The content within <> should be replaced with actual information and <> should be removed
             node-hostname="<mdw>" 
             node-ip="<127.0.0.1>"
         [[database.source.hostname-to-ip]]
             node-hostname="<sdw1>"
             node-ip="<127.0.0.2>"
         [[database.source.hostname-to-ip]]
             node-hostname="<sdw2>"
             node-ip="<127.0.0.3>"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ## 2. Target database is un-reachable, and 'ddl.mode' is 'output' */
         # db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"

[scope]
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"       
        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
## Whether to disable data incremental migration, by default, it is true.
# disable-data-increment=true

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verity the number of record of every table
verify=true
with-index=true

[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
        [[ddl.replace]]
        ## Only applicable for the case of migration from Greenplum to YMatrix
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"
## Whether to disable ddl incremental migration, by default, it is true.
# disable-ddl-increment=true

After preparing the TOML file, execute the following command to start mxshift for full migration.

$ mxshift -c config_path_full.toml

3.2 Incremental Migration

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ## 2. Target database is un-reachable, and 'ddl.mode' is 'output' */
         # db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
         ## The installation directory of database
         install-dir="/usr/local/greenplum-db-6.7.1"
         [[database.source.hostname-to-ip]]
             ## The content within <> should be replaced with actual information and <> should be removed
             node-hostname="<mdw>" 
             node-ip="<127.0.0.1>"
         [[database.source.hostname-to-ip]]
             node-hostname="<sdw1>"
             node-ip="<127.0.0.2>"
         [[database.source.hostname-to-ip]]
             node-hostname="<sdw2>"
             node-ip="<127.0.0.3>"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"
        ## Version of database(Please use the result of 'SELECT version();' as value). Required only when
        ## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
        ## 2. Target database is un-reachable, and 'ddl.mode' is 'output' */
         # db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"

[scope]
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"          
        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
## Whether to disable data incremental migration, by default, it is true.
disable-data-increment=false

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verity the number of record of every table
verify=true
with-index=true

[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
        [[ddl.replace]]
        ## Only applicable for the case of migration from Greenplum to YMatrix
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"
## Whether to disable ddl incremental migration, by default, it is true.
disable-ddl-increment=false

After you have prepared the TOML file, execute the following command to start mxshift for incremental migration.

$ mxshift -c config_path_incremental.toml

3.3 Conditional migration

Through the WHERE condition, implement data migration that meets certain conditions or implement expired data not to be migrated:

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"
        ## The installation directory of database
        install-dir="/usr/local/greenplum-db-6.7.1"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"

[scope]
## The compress method for transferring data, methods restricted to 0/gzip/lz4/zstd
compress-method="lz4"
## mode for transferring data from source to target database, and value restricted to normal/dryrun/fetch/motion.
##dryrun for only executing ddl, not transferring data
##fetch for fetching data from source and abandon
##motion for fetching data from source, redistributing and finally abandon
mode="normal"
## Sql for select segment information from source database
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
        [[scope.table-list]]
                schema=......
                name=......
        [[scope.table-list]]
        ......    
        [[scope.exclude-table-list]]
        ......    
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]

[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false

[controller]
## By default, transfer work will start from the largest table. If set 'bothway' it will start from both the largest and the smallest table
both-way=true
## The number of table transferred at the same time
concurrency=3

[transfer]
## Verity the number of record of every table
verify=true
[transfer.table-data-where-sql]
enabled=true
global="txdate >= '2022-10-01' AND batchnum >= 100000000"
[[transfer.table-data-where-sql.override]]
        where="abc > 10"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_1"
        name="table_001"
[[transfer.table-data-where-sql.override]]
        where="tag != 'aabbcc' AND ts > '2022-01-01'"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_2"
        name="another_table"

After you have prepared the TOML file, execute the following command to start mxshift for conditional migration.

$ mxshift -c config_path_conditional.toml

3.4 Migrate replication tables

If you are migrating a database or all replication tables under a schema, you do not need to make additional configurations in config_path.toml. mxshfit will automatically determine which tables are replication tables and execute corresponding migration strategies to prevent repeated migration of data. You only need to configure schema and name in scope.table-list.

   [[scope.table-list]]
                schema=“public1”
                name=“table_replicated1”
   [[scope.table-list]]
                schema=“public2”
                name=“table_replicated2”

After you have prepared the TOML file, execute the following command to start mxshift for copy table migration.

$ mxshift -c config_path_replicated.toml

Notes!
For an introduction to the data migration ideas and complete steps, please refer to [Data Migration] (/doc/5.1/maintain/migrate/guidebook)