This section of the documentation introduces best practices for migrating data from Greenplum 4.3.X/5/6 to YMatrix 5.
Once you decide to carry out an operation this important, prepare fully before you start, both in actual conditions and mentally (problems can appear at any time). Mental preparation is up to each person; for everything else, we list a relatively complete preparation plan, including optional steps, as follows:
| Serial number | Preparation steps | Instructions | Optional |
| --- | --- | --- | --- |
| 1 | Back up source cluster data | Data migration only reads the source cluster data and performs no writes, so it does not involve a risk of data corruption caused by migration | Yes |
| 2 | Install and deploy the target database software | | No, necessary step |
| 3 | Deploy monitoring for the target cluster | Depends on demand | Yes |
| 4 | Prohibit all DDL operations on the business side | This is an important step; failing to do so puts migration execution at risk. Please pay close attention | No, necessary step |
| 5 | Interrupt all business connections | | No, necessary step |
| 6 | Collect source cluster and target cluster information | Software and hardware configuration, source cluster topology, target cluster topology, etc. | No, necessary step |
| 7 | Back up source cluster metadata | DDL, schema names, user information, etc. | No, necessary step |
| 8 | Fix or optimize DDL | Fixing adapts DDL statements between Greenplum and YMatrix to resolve compatibility issues (no fixing is needed when migrating from YMatrix 4 to YMatrix 5); optimizing enables the best database performance from the initial stage as much as possible | Yes, mxshift currently supports automatic migration of DDL, see mxshift |
| 9 | Generate SQL files to modify table fields | | Yes, mxshift currently supports automatic migration of indexes, see mxshift |
| 10 | Add a whitelist on the source cluster's master | | No, necessary step |
| 11 | Create users for the target cluster | | No, necessary step |
| 12 | Restore table structure | | Yes, mxshift currently supports automatic migration of indexes. For details, see mxshift |
| 13 | Create DDL for the target cluster | In YMatrix it is more efficient to re-create indexes after the migration, so when creating DDL for the target cluster before migration it is recommended to create the tables without indexes | Yes, mxshift currently supports automatic migration of DDL, see mxshift |
| 14 | Modify indexed table columns to the C collation | | Yes |
Based on the table above, we give specific examples.
Data migration only reads the source cluster data and performs no writes, so it does not involve a risk of data corruption caused by migration. If you are still worried, or other business needs require the data, you can use the mxbackup tool to perform a parallel backup of the cluster.
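If you do take a backup, a minimal invocation might look like the sketch below. It assumes that mxbackup accepts a gpbackup-style --dbname option; check the mxbackup reference for the exact parameters before running it.

# Hypothetical minimal parallel backup of one database
$ mxbackup --dbname <Source Database Name>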
Notes!
We recommend that you do not deploy mirror nodes (Mirrors) when deploying the target cluster; add them after the migration is complete to improve migration efficiency.
Notes!
The target cluster host names must not duplicate those of the source cluster!
Please refer to the standard cluster deployment document:
Notes!
Before all business-side services are officially stopped, no DDL may be executed on the source Greenplum cluster: creating objects, modifying objects, adding columns, deleting columns, and any CREATE, ALTER, TRUNCATE, or DROP statements are prohibited.
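If you want to confirm that no DDL is still being issued on the source cluster before proceeding, one option is to check the activity view on the source master. The query below is only a sketch and assumes a Greenplum 6 catalog; on Greenplum 4.3/5 the corresponding columns are procpid and current_query.

-- List sessions whose current statement looks like DDL
SELECT pid, usename, query
FROM pg_stat_activity
WHERE query ~* '^(CREATE|ALTER|DROP|TRUNCATE)';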
Modify the pg_hba.conf file on the source Greenplum cluster master.
$ vim pg_hba.conf
Add the business client address in the following format to disable remote access.
host all all <Client IP Address>/<Subnet Mask Bits> reject
Then reload the configuration so that the modified configuration file takes effect.
$ gpstop -u
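To verify that the reject rule has taken effect, you can attempt a connection from a rejected business client host; it should now fail with a pg_hba.conf rejection error (the exact wording varies by version). The placeholders below follow the conventions used elsewhere in this document.

# Run from the business client host; the connection should be rejected
$ psql -h <Source Master IP Address> -p <Source cluster port number> -d <Source Database Name> -U <Business username>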
Collect source cluster and target cluster information, including the number of physical machines, operating system, CPU, memory, disk type, disk usage, network card information, source cluster topology, target cluster topology, database license, resource group configuration, etc., and use it as the scenario requires to prepare thoroughly for migration execution. The following commands may be useful:
Serial number | Command | Purpose |
---|---|---|
1 | free -g | View operating system memory information |
2 | lscpu | View CPU quantity |
3 | cat /etc/system-release | View operating system version information |
4 | uname -a | Output all kernel information in the following order (the -p and -i results are omitted if unknown): kernel name; network node host name; kernel release; kernel version; machine hardware name; processor type (non-portable); hardware platform (non-portable); operating system name |
5 | tail -11 /proc/cpuinfo | View CPU information |
6 | gpcheckperf | Network performance, bandwidth, and disk I/O performance detection (see the example after this table) |
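For example, gpcheckperf can be run from the master against a host file to measure disk I/O, memory bandwidth, and network bandwidth; the host file path and test directories below are placeholders to adapt to your environment.

# -f specifies the host file, -r ds runs the disk and stream (memory bandwidth) tests, -d is the test directory
$ gpcheckperf -f /home/gpadmin/seg_hosts -r ds -d /data/primary
# -r N runs the parallel network bandwidth test
$ gpcheckperf -f /home/gpadmin/seg_hosts -r N -d /tmp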
Under the gpadmin user, use the pg_dump tool to back up the source Greenplum cluster DDL, schema name, user information, etc.
# Backup global user objects
pg_dumpall -g -f global_user.sql
# Backup table structure
pg_dump <Source Database Name> -s -f orig.sql
# Copy a backup
cp orig.sql copy.sql
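As a quick sanity check on the dumps before editing them, you can count the object definitions they contain; the grep patterns below are only illustrative.

# Rough counts of the dumped objects
$ grep -c 'CREATE TABLE' orig.sql
$ grep -c 'CREATE ROLE' global_user.sql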
Generate the SQL file that creates the indexes.
$ cat get_index.sql
WITH soi (oid, toid, SIZE, tschema, tname) AS
( SELECT soioid,
soitableoid,
soisize,
soitableschemaname,
soitablename
FROM gp_toolkit.gp_size_of_index ),
childrel (oid, coid) AS
( SELECT pp.parrelid,
ppr.parchildrelid
FROM pg_partition pp
INNER JOIN pg_partition_rule ppr ON pp.oid = ppr.paroid ),
irn (oid, toid, SIZE, tschema, tname, rn) AS
( SELECT *,
row_number() OVER (
ORDER BY dt.ssize DESC) rn
FROM
( SELECT soi.oid,
soi.toid ,
sum(coalesce(dt2.SIZE, soi.SIZE)) ssize ,
soi.tschema,
soi.tname
FROM soi
LEFT JOIN
( SELECT childrel.oid,
soi.SIZE
FROM soi
INNER JOIN childrel ON soi.toid = childrel.coid ) dt2 ON soi.toid = dt2.oid
GROUP BY 1,
2,
4,
5 ) dt )
SELECT SQL || ';'
FROM
( SELECT pg_get_indexdef(oid) AS SQL ,
(rn % 12 + (rn / 12)::int) % 12 AS orderkey
FROM irn
WHERE toid NOT IN
(SELECT coid
FROM childrel) ) dt
WHERE SQL NOT LIKE 'CREATE UNIQUE INDEX%'
ORDER BY dt.orderkey ;
Execute the above SQL via psql.
psql -d <Source Database Name> -U <Source database super username> -t -f get_index.sql > index.sql
mxshift now supports automatic migration of DDL. For details, please refer to "2 Migration Execution". If you need to manually migrate DDL, refer to this section.
Fixing adapts DDL statements between Greenplum and YMatrix to resolve compatibility issues (no fixing is needed when migrating from YMatrix 4 to YMatrix 5); optimizing enables the best database performance from the initial stage as much as possible. The following are examples of manual fix or optimization commands, executed in the source Greenplum cluster environment:
# Modify the compression algorithm
sed -i 's/quicklz/zstd/g' orig.sql
# Modify the plpython version
sed -i 's/plpythonu/plpython3u/g' orig.sql
# Add the statement that creates the plpython3u language
sed -i '1s/^/create language plpython3u;/' orig.sql
# Delete index creation statement
sed -i -e '/CREATE INDEX.*;/d' -e '/CREATE INDEX/,/;/d' orig.sql
# Optimize the compression algorithm
sed -i 's/compresstype=zlib/compresstype=zstd/g' orig.sql
# Modify compression level
sed -i 's/compresslevel=5/compresslevel=1/g' orig.sql
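Since copy.sql was kept as an untouched copy of the dump, you can review exactly what the above sed commands changed before restoring anything:

# Review the fixes and optimizations applied to orig.sql
$ diff copy.sql orig.sql | less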
For the above corrections or optimizations, the following specific explanation is given:
After the tables are built, the indexed columns need to be set to the C collation so that indexes can be added efficiently and migration time is shortened. The following SQL generates the required ALTER statements.
$ cat collate.sql
SELECT 'alter table '||quote_ident(b.schemaname)||'.'||quote_ident(b.tablename)||' alter column '||quote_ident(b.colname)||' type '||d.coltype||' COLLATE "pg_catalog"."C";'
FROM
(SELECT DISTINCT a.schemaname,
a.tablename,
regexp_split_to_table(replace(a.keyindex,' ',''),',') AS colname
FROM
(SELECT schemaname,
tablename,
rtrim(split_part(indexdef,'(',2),')') AS keyindex,
indexdef
FROM pg_indexes
WHERE schemaname NOT IN ('pg_catalog')
AND indexname NOT LIKE '%pkey'
AND indexdef LIKE 'CREATE INDEX%') a) b
INNER JOIN
(SELECT c.oid,
n.nspname,
c.relname
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace) c ON b.schemaname=c.nspname
AND c.relname=b.tablename
INNER JOIN
(SELECT e.attrelid,a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
(SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid)
FOR 128)
FROM pg_catalog.pg_attrdef d
WHERE d.adrelid = a.attrelid
AND d.adnum = a.attnum
AND a.atthasdef), a.attnotnull,
a.attnum
FROM pg_catalog.pg_attribute a
LEFT OUTER JOIN pg_catalog.pg_attribute_encoding e ON e.attrelid = a.attrelid
AND e.attnum = a.attnum
WHERE a.attnum >0) d ON d.attrelid=c.oid
AND d.attname=b.colname;
Notes!
If the source cluster and the target cluster are running on the same server, just skip this step.
On the master of the source cluster, add the host IP address of the target cluster master to the pg_hba.conf file. The example below uses the IP address 172.16.100.2:
host all all 172.16.100.2/32 md5
On the master of the target cluster, execute the following script to add the host IP address and host name of the source cluster master to the /etc/hosts file on every host of the target cluster. In the example, the host IP address is 172.16.100.195 and the host name is sdw1.
$ cat add_hosts.sh
#!/bin/bash
# Append the source master's IP address and host name to /etc/hosts on every host of the target cluster
for line in `psql -Atc "select distinct hostname from gp_segment_configuration order by datadir desc"`
do
gpssh -h $line -v -e "echo 172.16.100.195 sdw1 >> /etc/hosts"
done
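To confirm that the entry has been written on every host, you can grep /etc/hosts across the cluster; the host file below is a placeholder listing the target cluster hosts.

$ gpssh -f <Host file listing target cluster hosts> -e 'grep sdw1 /etc/hosts'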
Then reload the configuration to allow the modified configuration file to take effect.
Greenplum executes the gpstop command.
$ gpstop -u
YMatrix 5 executes the mxstop command.
$ mxstop -u
Execute the following command in the YMatrix cluster environment to restore the global user objects.
$ psql -h <YMatrix Server IP Address> -p <Target cluster port number> -d <Target database> -U <Target database superuser name> -f global_user.sql
Execute the following command in the YMatrix cluster environment to restore the table structure.
$ psql -h <YMatrix Server IP Address> -p <Target cluster port number> -d <Target database> -U <Target database superuser name> -f orig.sql
Use the backup orig.sql file to restore the table structure in the target YMatrix cluster, then run collate.sql to generate the statements that change the indexed columns to the C collation.
$ time psql -d <Target database name> -f orig.sql > restoreddl.log 2>&1 &
$ time psql -d <Target database name> -f collate.sql
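Note that collate.sql only generates the ALTER TABLE statements; to actually apply them, one approach is to capture the output into a file and then execute that file, as sketched below (the file name alter_collate.sql is just an example).

# Write the generated ALTER ... COLLATE statements to a file, then execute them
$ psql -At -d <Target database name> -f collate.sql -o alter_collate.sql
$ psql -d <Target database name> -f alter_collate.sql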
Notes!
For detailed parameters, please refer to mxshift
First, write the configuration file config_path.toml. An example follows:
[database]
[database.source]
## Name of database
db-database= "testdb"
## Hostname of database master
db-host="sdw3"
## password of database
db-password="xxxx"
## Port of database master
db-port=54322
## user name of database
db-user="gpadmin"
## Version of database(Please use the result of 'SELECT version();' as value). Required only when
## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
## 2. Target database is un-reachable, and 'ddl.mode' is 'output'
# db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
## The installation directory of matrixdb
install-dir="/usr/local/greenplum-db-6.7.1"
[[database.source.hostname-to-ip]]
## The content within <> should be replaced with actual information and <> should be removed
node-hostname="<mdw>"
node-ip="<127.0.0.1>"
[[database.source.hostname-to-ip]]
node-hostname="<sdw1>"
node-ip="<127.0.0.2>"
[[database.source.hostname-to-ip]]
node-hostname="<sdw2>"
node-ip="<127.0.0.3>"
[database.target]
## Name of database
db-database="destdb"
## Hostname of database master
db-host="172.16.100.32"
## password of database
db-password="yyyy"
## Port of database master
db-port=5432
## user name of database
db-user="mxadmin"
## Version of database(Please use the result of 'SELECT version();' as value). Required only when
## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
## 2. Target database is un-reachable, and 'ddl.mode' is 'output'
# db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
[scope]
## The compression method for transferring data; restricted to 0/gzip/lz4/zstd
compress-method="lz4"
## Mode for transferring data from the source to the target database; value restricted to normal/dryrun/fetch/motion.
## dryrun: only execute DDL, do not transfer data
## fetch: fetch data from the source and then discard it
## motion: fetch data from the source, redistribute it, and finally discard it
mode="normal"
## Sql for select segment information from source database
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Sql for select segment information from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
[[scope.table-list]]
schema="test_schema_1"
name="table_001"
[[scope.table-list]]
schema="test_schema_2"
name="table_002"
[[scope.exclude-table-list]]
schema="test_schema_3"
name="table_003"
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
## Whether to disable data incremental migration, by default, it is true.
# disable-data-increment=true
[log]
## The log level, value restricted to: debug/verbose/info.
log-level="info"
## Print log without color.
# no-color=false
[controller]
## By default, the transfer starts from the largest table. If both-way is enabled, it starts from both the largest and the smallest tables
both-way=true
## The number of tables transferred at the same time
concurrency=3
[transfer]
## Verify the record count of every table
verify=true
with-index=true
[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
[[ddl.replace]]
## Only applicable for the case of migration from Greenplum to YMatrix
category="role"
[[ddl.replace.pairs]]
old="gpadmin"
new="mxadmin"
## Whether to disable ddl incremental migration, by default, it is true.
# disable-ddl-increment=true
Then perform data migration on the target YMatrix cluster.
$ mxshift -c config_path.toml
Create the indexes on the target YMatrix cluster.
$ psql -h localhost -p <Target cluster port number> -d <Target database name> -U <Target database super username> -f index.sql >>idx.out 2>&1 &
Update the database statistics on the target YMatrix cluster.
$ export PGPORT=<Target cluster port number>
$ time analyzedb -d <Target database name> -p 10 -a
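As an additional sanity check, you can compare the row counts of a few key tables between the source and the target database; the table name below is a placeholder. Note that mxshift already verifies record counts when verify=true is set in its configuration file.

# Run against both the source and the target database and compare the results
$ psql -d <Database name> -c 'SELECT count(*) FROM <schema>.<table>;'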
Add Mirror on the target cluster YMatrix. The steps are as follows:
# First, check the current cluster instance information
postgres=# SELECT * from gp_segment_configuration order by 1;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /home/mxdata_20220925154450/master/mxseg-1
2 | 0 | p | p | n | u | 6000 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg0
3 | 1 | p | p | n | u | 6001 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg1
4 | 2 | p | p | n | u | 6000 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg2
5 | 3 | p | p | n | u | 6001 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg3
6 | -1 | m | m | s | u | 5432 | sdw1 | sdw1 | /home/mxdata_20220925154450/standby/mxseg-1
(6 rows)
# Create a file with all hostnames
$ cat /home/mxadmin/seg_hosts
sdw1
sdw2
sdw3
sdw4
# Create the Mirror data directories on all hosts in batch via the gpssh command
$ gpssh -f /home/mxadmin/seg_hosts -e 'mkdir -p /home/mxdata_20220925154450/mirror'
# Generate Mirror template file
$ mxaddmirrors -o ./addmirror
# View Mirror template files
$ cat addmirror
# Perform the Add Mirror Operation
$ mxaddmirrors -i addmirror
# Finally, check the cluster instance again
postgres=# SELECT * from gp_segment_configuration order by 1;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /home/mxdata_20220925154450/master/mxseg-1
2 | 0 | p | p | n | u | 6000 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg0
3 | 1 | p | p | s | u | 6001 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg1
4 | 2 | p | p | s | u | 6000 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg2
5 | 3 | p | p | s | u | 6001 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg3
6 | -1 | m | m | s | u | 5432 | sdw1 | sdw1 | /home/mxdata_20220925154450/standby/mxseg-1
7 | 0 | m | m | n | d | 7000 | sdw3 | sdw3 | /home/mxdata_20220925154450/mirror/mxseg0
8 | 1 | m | m | s | u | 7001 | sdw3 | sdw3 | /home/mxdata_20220925154450/mirror/mxseg1
9 | 2 | m | m | s | u | 7000 | sdw2 | sdw2 | /home/mxdata_20220925154450/mirror/mxseg2
10 | 3 | m | m | s | u | 7001 | sdw2 | sdw2 | /home/mxdata_20220925154450/mirror/mxseg3
(10 rows)
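Before restoring business access, you may want to confirm that every mirror is up and has finished synchronizing; a simple check is to look for any segment instance that is down or not yet in synchronized mode (an empty result means all segments are healthy).

postgres=# SELECT * FROM gp_segment_configuration WHERE content >= 0 AND (status = 'd' OR mode <> 's');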
After completing the above steps, restore business access, observe how the business runs, and keep tracking it for a period of time (how long depends on the specific scenario). If everything runs stably, congratulations, the data migration is complete!