Migrating Data from Greenplum 6 to YMatrix 5

This document describes best practices for migrating data from Greenplum 6 to YMatrix 5.

1 Environment Preparation

Before starting a critical operation such as this, thorough preparation is essential, both technical and mental (issues can arise at any time). Mental preparation varies by individual; for the technical side, the checklist below covers the setup comprehensively. Some steps are optional:

| No. | Preparation Step | Description | Optional? |
|----:|------------------|-------------|-----------|
| 1 | Back up source cluster data | Migration performs only read operations on the source cluster; no writes occur, so migration itself poses no risk of data corruption | Yes |
| 2 | Install and deploy target database software | | No, required |
| 3 | Set up monitoring for the target cluster | Based on requirements | Yes |
| 4 | Disable all DDL operations from applications | A critical step: DDL operations during migration introduce risk, so treat it with high priority | No, required |
| 5 | Terminate all application connections | | No, required |
| 6 | Collect information about source and target clusters | Hardware/software configurations, source and target cluster topologies, etc. | No, required |
| 7 | Back up source cluster metadata | DDLs, schema names, user information, etc. | No, required |
| 8 | Adjust or optimize DDLs | Adjustment ensures compatibility between Greenplum 6 and YMatrix DDL syntax; optimization enables optimal database performance from the start | Yes; mxshift supports automatic DDL migration (see mxshift) |
| 9 | Generate SQL file for altering table columns | | Yes; mxshift supports automatic index migration (see mxshift) |
| 10 | Add whitelists on all nodes of source and target clusters | | No, required |
| 11 | Create users on the target cluster | | No, required |
| 12 | Restore table structures | | Yes; mxshift supports automatic DDL migration (see mxshift) |
| 13 | Create DDLs on the target cluster | In YMatrix it is more efficient to create indexes after data migration, so when creating DDLs before migration, exclude index creation statements | Yes; mxshift supports automatic DDL migration (see mxshift) |
| 14 | Change table column collation to C | | Yes |

Below are detailed examples for each step.

1.1 Back Up Source Cluster Data (Optional)

Data migration performs only read operations on the source cluster, so there is no risk of data corruption. However, if you want additional safety or have other business needs involving the data, use the mxbackup tool to perform parallel cluster backup.
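A minimal invocation sketch, assuming mxbackup accepts gpbackup-style flags (verify the exact options with mxbackup --help on your version):

# Parallel backup of the source database; --dbname is an assumption
$ mxbackup --dbname <source_db_name>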

1.2 Install and Deploy Target Database Software

Note!
We recommend not deploying mirror segments during initial cluster deployment. Add mirrors after migration completes to improve migration efficiency.

Note!
Hostnames of the target cluster must not conflict with those of the source cluster.

Refer to the standard cluster deployment documentation.

1.3 Set Up Monitoring for Target Cluster (Optional)

Refer to the monitoring and alerting documentation.

1.4 Disable All DDL Operations from Applications

Note!
Before stopping all application traffic, no DDL operations should be allowed on the source Greenplum 6 cluster. This includes creating or modifying objects, adding or dropping columns, and executing CREATE, ALTER, TRUNCATE, or DROP statements.

1.5 Terminate All Application Connections

On the Master node of the source Greenplum 6 cluster, edit the pg_hba.conf file.

$ vim pg_hba.conf

Add client IP addresses in the following format to disable remote access:

host     all         all         <Client IP Address>/<CIDR Mask>       reject

Reload the configuration to apply changes:

$ gpstop -u
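Note that rejecting new connections does not disconnect sessions that are already established. A minimal sketch for terminating remaining non-admin sessions (review the session list first; the pg_stat_activity column names follow the PostgreSQL 9.4 catalog that Greenplum 6 is based on):

$ psql -d <source_db_name> -Atc "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND usename <> 'gpadmin';"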

1.6 Collect Source and Target Cluster Information

Gather details including number of physical machines, OS, CPU, memory, disk type, disk usage, NIC information, source and target cluster topologies, database license, resource group settings, etc. Adjust according to your specific scenario to ensure full migration readiness. Useful commands include:

| No. | Command | Purpose |
|----:|---------|---------|
| 1 | free -g | View system memory |
| 2 | lscpu | View number of CPUs |
| 3 | cat /etc/system-release | View OS version |
| 4 | uname -a | Output kernel information: kernel name, hostname, kernel release, kernel version, architecture, processor type, platform, OS name |
| 5 | tail -11 /proc/cpuinfo | View CPU details |
| 6 | gpcheckperf | Test network bandwidth and disk I/O performance |
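To gather this information from every host in one pass, a small sketch using gpssh (the seg_hosts file listing all hostnames is an assumption; adjust to your cluster):

$ gpssh -f seg_hosts -e 'free -g; lscpu; cat /etc/system-release; uname -r'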

1.7 Back Up Source Cluster Metadata

As the gpadmin user, use pg_dumpall and pg_dump to back up global objects, DDLs, schema names, and user information from the source Greenplum 6 cluster:

# Back up global user objects
pg_dumpall -g -f global_user.sql

# Back up table schemas
pg_dump <source_db_name> -s -f orig.sql          

# Make a copy for backup
cp orig.sql copy.sql                        

Generate a SQL file for index creation. The query sizes each index (rolling child-partition index sizes up into the parent), skips unique indexes and partition-child indexes, and interleaves large and small indexes across 12 buckets via orderkey so the statements can later be executed in balanced parallel batches:

$ cat get_index.sql
WITH soi (oid, toid, size, tschema, tname) AS (
  SELECT soioid, soitableoid, soisize, soitableschemaname, soitablename
  FROM gp_toolkit.gp_size_of_index
),
childrel (oid, coid) AS (
  SELECT pp.parrelid, ppr.parchildrelid
  FROM pg_partition pp
  INNER JOIN pg_partition_rule ppr ON pp.oid = ppr.paroid
),
irn (oid, toid, size, tschema, tname, rn) AS (
  SELECT *, row_number() OVER (ORDER BY dt.ssize DESC) rn
  FROM (
    SELECT soi.oid, soi.toid,
           sum(coalesce(dt2.size, soi.size)) ssize,
           soi.tschema, soi.tname
    FROM soi
    LEFT JOIN (
      SELECT childrel.oid, soi.size
      FROM soi
      INNER JOIN childrel ON soi.toid = childrel.coid
    ) dt2 ON soi.toid = dt2.oid
    GROUP BY 1, 2, 4, 5
  ) dt
)
SELECT sql || ';'
FROM (
  SELECT pg_get_indexdef(oid) AS sql,
         (rn % 12 + (rn / 12)::int) % 12 AS orderkey
  FROM irn
  WHERE toid NOT IN (SELECT coid FROM childrel)
) dt
WHERE sql NOT LIKE 'CREATE UNIQUE INDEX%'
ORDER BY dt.orderkey;

Execute via psql:

psql -d <source_db_name> -U <superuser> -t -f get_index.sql > index.sql

1.8 Adjust or Optimize DDLs (Optional)

mxshift now supports automatic DDL migration—refer to section "2 Migration Execution". Manual adjustments are described below.

Adjustments resolve compatibility issues between Greenplum 6 and YMatrix DDL syntax (not needed for YMatrix 4 to YMatrix 5 migration). Optimization enables best performance from the beginning. Example commands executed on the source Greenplum 6 cluster:

# Change compression algorithm
sed -i 's/quicklz/zstd/g' orig.sql   

# Update plpython version
sed -i 's/plpythonu/plpython3u/g' orig.sql     

# Add extension creation command (the trailing \n keeps it on its own line)
sed -i '1s/^/create language plpython3u;\n/' orig.sql

# Remove index creation statements
sed -i -e '/CREATE INDEX.*;/d' -e '/CREATE INDEX/,/;/d' orig.sql  

# Optimize compression method
sed -i 's/compresstype=zlib/compresstype=zstd/g' orig.sql  

# Adjust compression level
sed -i 's/compresslevel=5/compresslevel=1/g' orig.sql        

Explanation:

  • Compression algorithm: YMatrix does not support quicklz.
  • plpython version: YMatrix supports plpython3u.
  • Extension creation: Required extensions must be created in YMatrix.
  • Remove index statements: Creating indexes during table creation increases data scans and reduces migration efficiency. We recommend removing them and rebuilding after migration.
  • Optimize compression: zstd is preferred in YMatrix.
  • Compression level: Level 1 is optimal in YMatrix.
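After the rewrites, a quick sanity check confirms that no unsupported syntax remains (empty output means the file is clean):

$ grep -nE 'quicklz|plpythonu|CREATE INDEX' orig.sql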

1.9 Generate SQL File for Altering Table Columns (Optional)

After table creation, set indexed columns to use C collation for faster indexing and reduced migration time.

$ cat collate.sql
SELECT 'alter table '||quote_ident(b.schemaname)||'.'||quote_ident(b.tablename)||' alter column '||quote_ident(b.colname)||' type '||d.coltype||' COLLATE "pg_catalog"."C";'
FROM
  (SELECT DISTINCT a.schemaname,
                   a.tablename,
                   regexp_split_to_table(replace(a.keyindex,' ',''),',') AS colname
   FROM
     (SELECT schemaname,
             tablename,
             rtrim(split_part(indexdef,'(',2),')') AS keyindex,
             indexdef
      FROM pg_indexes
      WHERE schemaname NOT IN ('pg_catalog')
        AND indexname NOT LIKE '%pkey'
        AND indexdef LIKE 'CREATE INDEX%') a) b
INNER JOIN
  (SELECT c.oid,
          n.nspname,
          c.relname
   FROM pg_catalog.pg_class c
   LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace) c ON b.schemaname=c.nspname
AND c.relname=b.tablename
INNER JOIN
  (SELECT e.attrelid,a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
     (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid)
                       FOR 128)
      FROM pg_catalog.pg_attrdef d
      WHERE d.adrelid = a.attrelid
        AND d.adnum = a.attnum
        AND a.atthasdef), a.attnotnull,
                          a.attnum
   FROM pg_catalog.pg_attribute a
   LEFT OUTER JOIN pg_catalog.pg_attribute_encoding e ON e.attrelid = a.attrelid
   AND e.attnum = a.attnum
   WHERE a.attnum >0) d ON d.attrelid=c.oid
AND d.attname=b.colname;
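Like get_index.sql, this script only generates the ALTER statements. Run it through psql against the source cluster and capture the output (the file name alter_collate.sql is an assumption, reused in step 1.14):

psql -d <source_db_name> -t -f collate.sql > alter_collate.sql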

1.10 Add Whitelists on All Nodes of Source and Target Clusters

Note!
Skip this step if source and target clusters run on the same server.

On the Master nodes of both clusters, add all node IP addresses to pg_hba.conf. Example IPs: 172.16.100.2/32, 172.16.100.3/32.

Note!
For multiple hosts, include all host IPs in the script.

$ cat config_hba.sh
#!/bin/bash
for line in $(psql -Atc "select hostname||','||datadir from gp_segment_configuration order by datadir desc")
do
    hostname=$(echo $line | cut -d',' -f1)
    datadir=$(echo $line | cut -d',' -f2)

    gpssh -h $hostname -v -e "echo host    all    all    172.16.100.2/32    md5 >> ${datadir}/pg_hba.conf"
    gpssh -h $hostname -v -e "echo host    all    all    172.16.100.3/32    md5 >> ${datadir}/pg_hba.conf"
done

Also add host IPs and hostnames to /etc/hosts. Example: IP 172.16.100.195, hostname sdw1.

$ cat add_hosts.sh 
#!/bin/bash
for line  in $(psql -Atc "select hostname from gp_segment_configuration")
do
    gpssh -h $line -v -e "echo 172.16.100.195 sdw1 >> /etc/hosts"
done

Reload the configuration to apply the changes. For Greenplum, run:

$ gpstop -u

For YMatrix 5, run:

$ mxstop -u

1.11 Create Users on Target Cluster

Run the following on the YMatrix cluster:

$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f global_user.sql
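To confirm the roles were created, list them on the target (\du is a standard psql meta-command):

$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -c '\du'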

1.12 Create DDLs on Target Cluster (Optional)

Run on YMatrix cluster:

$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f orig.sql 

1.13 Restore Table Structures (Optional)

Use the backed-up orig.sql to restore table structures on the target YMatrix cluster:

$ time psql -d <target_db_name> -f orig.sql > restoreddl.log  2>&1 &   

1.14 Change Table Collation to C (Optional)

Apply the ALTER statements generated in step 1.9 on the target cluster (alter_collate.sql is the file produced there):

$ time psql -d <target_db_name> -f alter_collate.sql

2 Migration Execution

Note!
For detailed parameter descriptions, see mxshift.

First, create a configuration file config_path.toml. Example:

[database]
        [database.source]
        ## Name of database
        db-database= "testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## user name of database
        db-user="gpadmin"

    [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## user name of database
        db-user="mxadmin"

[scope]
compress_method="lz4"
gphome="/usr/local/greenplum-db-6.7.1"
mode="normal"
## Plain keys must appear before the [[scope.*]] table arrays below;
## otherwise TOML assigns them to the last table entry.
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"
        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"

[log]
log-level="info"
## Print log without color.
# no-color=false

[controller]
both-way=true
concurrency=5

[transfer]
verify=true
# with-index=true
[transfer.table-data-where-sql]
enabled=false
global="txdate >= '2022-10-01' AND batchnum >= 100000000"
[[transfer.table-data-where-sql.override]]
        where="abc > 10"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_1"
        name="table_001"
[[transfer.table-data-where-sql.override]]
        where="tag != 'aabbcc' AND ts > '2022-01-01'"
        [transfer.table-data-where-sql.override.table]
        schema="test_schema_2"
        name="another_table"

[ddl]
enabled=true
only-ddl=false
## During the DDL transfer, whether to skip the transfer of resource queue or group, by default, it is true.
# skip-resource-queue-and-group=true
## During the DDL transfer, whether to skip the transfer of tablespace, by default, it is true.
# skip-table-space=true
        [[ddl.replace]]
        ## Only applicable for the case of migration from Greenplum to YMatrix
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"      

Then execute the migration on the target YMatrix cluster:

$ mxshift -c config_path.toml
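mxshift verifies data itself when verify=true is set, but a manual spot check is cheap. A sketch comparing row counts for one table, using the hosts, ports, and table names from the example configuration above:

$ psql -h sdw3 -p 54322 -d testdb -Atc "SELECT count(*) FROM test_schema_1.table_001"
$ psql -h 172.16.100.32 -p 5432 -d destdb -Atc "SELECT count(*) FROM test_schema_1.table_001"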

3 Post-Migration Tasks

3.1 Create Indexes (Optional)

Create indexes on the target YMatrix cluster:

psql -h localhost -p <target_port> -d <target_db_name> -U mxadmin -f index.sql >>idx.out 2>&1 &
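Index creation runs in the background; once it finishes, check the log for failures (empty output means every statement succeeded):

$ grep -iE 'error|fatal' idx.out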

3.2 Run analyzedb Command

Update statistics for the entire database on the target YMatrix cluster:

export PGPORT=<target_port>
time analyzedb -d <target_db_name> -p 10 -a 
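To confirm statistics were refreshed, a sketch inspecting row-count estimates, which ANALYZE stores in pg_class.reltuples (the schema filter is illustrative):

psql -d <target_db_name> -c "SELECT n.nspname, c.relname, c.reltuples FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE c.relkind = 'r' AND n.nspname NOT IN ('pg_catalog', 'information_schema') ORDER BY c.reltuples DESC LIMIT 20;"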


3.3 Add Mirror Segments
Add mirror segments to the target YMatrix cluster. The procedure is as follows:

# First, check the current cluster instance information
postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                   datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | n    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | n    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | n    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
(6 rows)

# Create a file containing all hostnames
$ cat /home/mxadmin/seg_hosts 
sdw1
sdw2
sdw3
sdw4

# Use gpssh to batch create mirror directories on all segment hosts
$ gpssh -f /home/mxadmin/seg_hosts -e 'mkdir -p /home/mxdata_20220925154450/mirror'

# Generate the mirror configuration template file
$ mxaddmirrors -o ./addmirror

# View the generated mirror template file
$ cat addmirror

# Execute the command to add mirrors
$ mxaddmirrors -i addmirror

# Finally, verify the updated cluster configuration
postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                   datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | s    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | s    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | s    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
    7 |       0 | m    | m              | n    | d      | 7000 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg0
    8 |       1 | m    | m              | s    | u      | 7001 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg1
    9 |       2 | m    | m              | s    | u      | 7000 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg2
   10 |       3 | m    | m              | s    | u      | 7001 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg3
(10 rows)

After completing the above steps, resume business operations and monitor system behavior. Observe the cluster for a sustained period (the appropriate duration depends on your workload patterns). If the system runs stably, congratulations: the data migration has been successfully completed!