Migrating Data from Greenplum to YMatrix 6

This document describes best practices for migrating data from Greenplum to YMatrix 6.

1 Environment Preparation

Before starting a critical operation like a migration, thorough preparation is essential, both mentally and technically. Mental readiness varies by individual; for the technical side, we provide the comprehensive checklist below. Some steps are optional:

| No. | Preparation Step | Description | Optional |
|-----|------------------|-------------|----------|
| 1 | Back up source cluster data | Migration only reads from the source cluster and does not modify data, so there is no risk of data corruption due to migration | Yes |
| 2 | Install and deploy target database software | | No, required |
| 3 | Set up monitoring for the target cluster | Optional, based on requirements | Yes |
| 4 | Disable all DDL operations on the application side | This is a critical step that affects migration execution. Please take it seriously | No, required |
| 5 | Terminate all business connections | | No, required |
| 6 | Collect information about source and target clusters | Includes hardware/software configurations, source and target cluster topologies, etc. | No, required |
| 7 | Back up source cluster metadata | Includes DDLs, schema names, user information, etc. | No, required |
| 8 | Adjust or optimize DDLs | Adjustment ensures compatibility between Greenplum and YMatrix DDL syntax. Optimization enables optimal database performance from the start. (Not needed when migrating from YMatrix 4/5 to YMatrix 6) | Yes; mxshift now supports automatic DDL migration, see mxshift |
| 9 | Generate SQL file to alter table columns | | Yes; mxshift now supports automatic index migration, see mxshift |
| 10 | Add whitelist entry on source cluster Master | | No, required |
| 11 | Create users in the target cluster | | No, required |
| 12 | Restore table structure | | Yes; mxshift now supports automatic DDL migration, see mxshift |
| 13 | Create DDLs for the target cluster | In YMatrix, it's more efficient to recreate indexes after migration. Therefore, when creating DDLs before migration, exclude index creation statements. | Yes; mxshift now supports automatic DDL migration, see mxshift |
| 14 | Change table collation to C | | Yes |

Below are detailed examples for each step.

1.1 Back Up Source Cluster Data (Optional)

Migration performs only read operations on the source cluster, so data integrity is not at risk. However, if you want additional safety or have other use cases for the backup, use the mxbackup tool for parallel cluster backup.
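
For example, a full parallel backup might look like the following sketch. This assumes mxbackup follows the gpbackup-style interface; the flag shown is illustrative, so check the mxbackup reference for the exact options:

$ mxbackup --dbname <source_db_name>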

1.2 Install and Deploy Target Database Software

Note!
We recommend not deploying mirror segments during initial cluster setup. Add mirrors after migration completes to improve migration efficiency.

Note!
Hostnames of the target cluster must not duplicate those of the source cluster.

Refer to the standard deployment guides.

1.3 Set Up Monitoring for Target Cluster (Optional)

Refer to the monitoring documentation.

1.4 Disable All DDL Operations on Application Side

Note!
From this point on, even before business traffic is stopped, no DDL operations should be performed on the source Greenplum cluster. This includes creating or modifying objects, adding or dropping columns, and executing CREATE, ALTER, TRUNCATE, or DROP statements.
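
To confirm that no DDL is still in flight before proceeding, you can inspect active statements on the source Master. A minimal sketch (on older Greenplum versions the columns are procpid and current_query instead of pid and query):

SELECT pid, usename, query
FROM pg_stat_activity
WHERE query ~* '^\s*(CREATE|ALTER|DROP|TRUNCATE)';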

1.5 Terminate All Business Connections

On the source Greenplum Master, edit the pg_hba.conf file:

$ vim pg_hba.conf

Add client IP addresses in the following format to block remote access:

host all all <Client IP address>/<CIDR mask> reject
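
For example, to block a hypothetical application subnet 10.10.0.0/16:

host all all 10.10.0.0/16 reject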

Reload the configuration to apply changes:

$ gpstop -u

1.6 Collect Source and Target Cluster Information

Gather details including number of physical hosts, OS, CPU, memory, disk type, disk usage, NIC specs, source and target cluster topologies, database license, resource group settings, etc. Use this information to plan the migration thoroughly. Useful commands include:

| No. | Command | Purpose |
|-----|---------|---------|
| 1 | free -g | View system memory |
| 2 | lscpu | View CPU count |
| 3 | cat /etc/system-release | View OS version |
| 4 | uname -a | Output kernel info: kernel name, hostname, kernel release, version, architecture, processor type, platform, OS name |
| 5 | tail -11 /proc/cpuinfo | View CPU details |
| 6 | gpcheckperf | Test network bandwidth and disk I/O performance |
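
For example, a combined disk and stream test across segment hosts might look like this (a sketch; the host file and data directory paths are illustrative):

$ gpcheckperf -f /home/gpadmin/seg_hosts -r ds -D -d /data/primary

The network bandwidth test can be run separately with -r N for a parallel-pair test between hosts.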

1.7 Back Up Source Cluster Metadata

As the gpadmin user, use pg_dumpall and pg_dump to back up DDLs, schemas, and user information from the source Greenplum cluster:

# Back up global user objects
pg_dumpall -g -f global_user.sql

# Back up table schemas
pg_dump <source_db_name> -s -f orig.sql

# Make a copy for backup
cp orig.sql copy.sql
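
A quick sanity check on the dump files can catch obvious problems early (illustrative):

# Rough count of tables captured in the schema dump
$ grep -c 'CREATE TABLE' orig.sql

# Confirm the global objects dump is non-empty
$ head -5 global_user.sql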

Generate the SQL needed to recreate indexes later. The query below computes each index's size (rolling partition children up to their parent), skips unique indexes and partition child tables, and orders the output so that large and small indexes are interleaved rather than clustered together:

$ cat get_index.sql
WITH soi (oid, toid, size, tschema, tname) AS (
    SELECT soioid, soitableoid, soisize, soitableschemaname, soitablename
    FROM gp_toolkit.gp_size_of_index
),
childrel (oid, coid) AS (
    SELECT pp.parrelid, ppr.parchildrelid
    FROM pg_partition pp
    INNER JOIN pg_partition_rule ppr ON pp.oid = ppr.paroid
),
irn (oid, toid, size, tschema, tname, rn) AS (
    SELECT *, row_number() OVER (ORDER BY dt.ssize DESC) AS rn
    FROM (
        SELECT soi.oid, soi.toid,
               sum(coalesce(dt2.size, soi.size)) AS ssize,
               soi.tschema, soi.tname
        FROM soi
        LEFT JOIN (
            SELECT childrel.oid, soi.size
            FROM soi
            INNER JOIN childrel ON soi.toid = childrel.coid
        ) dt2 ON soi.toid = dt2.oid
        GROUP BY 1, 2, 4, 5
    ) dt
)
SELECT sql || ';'
FROM (
    SELECT pg_get_indexdef(oid) AS sql,
           (rn % 12 + (rn / 12)::int) % 12 AS orderkey
    FROM irn
    WHERE toid NOT IN (SELECT coid FROM childrel)
) dt
WHERE sql NOT LIKE 'CREATE UNIQUE INDEX%'
ORDER BY dt.orderkey;

Execute the above SQL using psql:

psql -d <source_db_name> -U <superuser_name> -t -f get_index.sql > index.sql

1.8 Adjust or Optimize DDLs (Optional)

mxshift now supports automatic DDL migration. See section "2 Migration Execution" for details. Manual adjustment is described below.

Adjustments ensure compatibility between Greenplum and YMatrix DDL syntax. Optimizations aim to achieve best performance early. Example commands to run on the source Greenplum cluster:

# Change compression algorithm
sed -i 's/quicklz/zstd/g' orig.sql

# Update plpython version
sed -i 's/plpythonu/plpython3u/g' orig.sql

# Add language creation statement
sed -i '1s/^/create language plpython3u;/' orig.sql

# Remove index creation statements
sed -i -e '/CREATE INDEX.*;/d' -e '/CREATE INDEX/,/;/d' orig.sql

# Optimize compression method
sed -i 's/compresstype=zlib/compresstype=zstd/g' orig.sql

# Adjust compression level
sed -i 's/compresslevel=5/compresslevel=1/g' orig.sql

Explanation:

  • Change compression algorithm: YMatrix does not support quicklz.
  • Update plpython version: YMatrix supports plpython3u.
  • Add language creation: Required to enable procedural languages in YMatrix.
  • Remove index creation: Creating indexes during table creation reduces migration efficiency. Recreate them after migration.
  • Optimize compression: zstd offers better performance in YMatrix.
  • Adjust compression level: Level 1 provides optimal balance in YMatrix.
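
After editing, a quick check that none of the unsupported settings remain; the command should print nothing if all substitutions were applied:

$ grep -niE 'quicklz|plpythonu|compresstype=zlib' orig.sql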

1.9 Generate SQL to Alter Table Columns (Optional)

After table creation, set indexed columns to use C collation for faster indexing and reduced migration time.

$ cat collate.sql
SELECT 'alter table '||quote_ident(b.schemaname)||'.'||quote_ident(b.tablename)
       ||' alter column '||quote_ident(b.colname)||' type '||d.coltype
       ||' COLLATE "pg_catalog"."C";'
FROM (
    SELECT DISTINCT a.schemaname,
                    a.tablename,
                    regexp_split_to_table(replace(a.keyindex, ' ', ''), ',') AS colname
    FROM (
        SELECT schemaname,
               tablename,
               rtrim(split_part(indexdef, '(', 2), ')') AS keyindex,
               indexdef
        FROM pg_indexes
        WHERE schemaname NOT IN ('pg_catalog')
          AND indexname NOT LIKE '%pkey'
          AND indexdef LIKE 'CREATE INDEX%'
    ) a
) b
INNER JOIN (
    SELECT c.oid, n.nspname, c.relname
    FROM pg_catalog.pg_class c
    LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
) c ON b.schemaname = c.nspname AND c.relname = b.tablename
INNER JOIN (
    SELECT a.attrelid,
           a.attname,
           pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype
    FROM pg_catalog.pg_attribute a
    WHERE a.attnum > 0
) d ON d.attrelid = c.oid AND d.attname = b.colname;

1.10 Add Whitelist Entry on Source Cluster Master

Note!
Skip this step if the source and target clusters run on the same server.

On the source cluster Master, add the target cluster Master’s IP address to the pg_hba.conf file. Example with IP 172.16.100.2:

host    all    all    172.16.100.2/32    md5

On the target cluster Master, run the following script to append the source cluster Master's IP and hostname to /etc/hosts on every host in the target cluster. Example with IP 172.16.100.195 and hostname sdw1:

$ cat add_hosts.sh
#!/bin/bash
# Loop over every host in the target cluster and add the source Master's hosts entry
for line in `psql -Atc "select distinct hostname from gp_segment_configuration"`
do
    gpssh -h $line -v -e "echo 172.16.100.195 sdw1 >> /etc/hosts"
done
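
Afterwards, verify that the source Master's hostname resolves from the target side:

$ ping -c 1 sdw1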

Reload the configuration to apply the changes.

For Greenplum, run:

$ gpstop -u

For YMatrix 5, run:

$ mxstop -u

1.11 Create Users in Target Cluster

Run the following command in the YMatrix environment:

$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f global_user.sql

1.12 Create DDLs for Target Cluster (Optional)

Run in the YMatrix environment:

$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f orig.sql

1.13 Restore Table Structure (Optional)

Use the backed-up orig.sql file to restore table structure in the target YMatrix cluster:

$ time psql -d <target_db_name> -f orig.sql > restoreddl.log 2>&1 &
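
When the restore finishes, scan the log for failures before moving on:

$ grep -iE 'error|fatal' restoreddl.log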

1.14 Change Table Collation to C (Optional)

First generate the ALTER statements by running the collate.sql query from step 1.9 against the target database, then execute the generated file:

$ psql -d <target_db_name> -At -f collate.sql > alter_collate.sql
$ time psql -d <target_db_name> -f alter_collate.sql

2 Migration Execution

Note!
For full parameter details, refer to mxshift.

First, create a configuration file config_path.toml. Example:

[database]
        [database.source]
        ## Name of database
        db-database="testdb"
        ## Hostname of database master
        db-host="sdw3"
        ## Password of database
        db-password="xxxx"
        ## Port of database master
        db-port=54322
        ## User name of database
        db-user="gpadmin"
        ## Version of database (use the result of 'SELECT version();' as the value). Required only when:
        ##     1. the source database is unreachable, 'ddl.only-ddl' is enabled, and 'ddl.mode' is 'input'
        ##     2. the target database is unreachable and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"
        ## Installation directory of the source database software
        install-dir="/usr/local/greenplum-db-6.7.1"
        [[database.source.hostname-to-ip]]
            ## Replace content within <> with actual values and remove the <>
            node-hostname="<mdw>"
            node-ip="<127.0.0.1>"
        [[database.source.hostname-to-ip]]
            node-hostname="<sdw1>"
            node-ip="<127.0.0.2>"
        [[database.source.hostname-to-ip]]
            node-hostname="<sdw2>"
            node-ip="<127.0.0.3>"

        [database.target]
        ## Name of database
        db-database="destdb"
        ## Hostname of database master
        db-host="172.16.100.32"
        ## Password of database
        db-password="yyyy"
        ## Port of database master
        db-port=5432
        ## User name of database
        db-user="mxadmin"
        ## Version of database (use the result of 'SELECT version();' as the value). Required only when:
        ##     1. the source database is unreachable, 'ddl.only-ddl' is enabled, and 'ddl.mode' is 'input'
        ##     2. the target database is unreachable and 'ddl.mode' is 'output'
        # db-version="PostgreSQL 12 (MatrixDB 5.2.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun  5 2023 15:45:24"

[scope]
## Compression method for data transfer: 0/gzip/lz4/zstd
compress-method="lz4"
## Mode for data transfer: normal/dryrun/fetch/motion
## dryrun: execute DDL only, no data transfer
## fetch: extract data from source and discard
## motion: extract, redistribute, then discard
mode="normal"
## SQL to query segment info from source database
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## SQL to query segment info from target database
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## Note: these top-level keys must appear before the [[scope.table-list]] sub-tables below,
## otherwise TOML assigns them to the last sub-table
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
## Whether to disable incremental data migration (default: true)
# disable-data-increment=true
        [[scope.table-list]]
                schema="test_schema_1"
                name="table_001"
        [[scope.table-list]]
                schema="test_schema_2"
                name="table_002"
        [[scope.exclude-table-list]]
                schema="test_schema_3"
                name="table_003"

[log]
## Log level: debug/verbose/info
log-level="info"
## Print log without color
# no-color=false

[controller]
## By default, transfer starts from the largest table; set both-way=true to transfer from the largest and smallest tables simultaneously
both-way=true
## Number of tables to transfer concurrently
concurrency=3

[transfer]
## Verify row count for each table after transfer
verify=true
## Also migrate index definitions along with table data
with-index=true

[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## Skip resource queue/group migration (default: true)
# skip-resource-queue-and-group=true
## Skip tablespace migration (default: true)
# skip-table-space=true
## Disable incremental DDL migration (default: true)
# disable-ddl-increment=true
## The role replacement below is only applicable for Greenplum to YMatrix migration
        [[ddl.replace]]
                category="role"
                [[ddl.replace.pairs]]
                        old="gpadmin"
                        new="mxadmin"

Then execute the migration on the target YMatrix cluster:

$ mxshift -c config_path.toml
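
A trial pass can validate connectivity and DDL compatibility before any data moves, using the dryrun mode documented in the [scope] section above (a sketch, following the sed usage earlier in this guide):

$ sed -i 's/^mode="normal"/mode="dryrun"/' config_path.toml   # DDL only, no data transfer
$ mxshift -c config_path.toml
$ sed -i 's/^mode="dryrun"/mode="normal"/' config_path.toml   # switch back for the real run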

3 Post-Migration Tasks

3.1 Create Indexes (Optional)

On the target YMatrix cluster, recreate indexes:

psql -h localhost -p <target_port> -d <target_db_name> -U mxadmin -f index.sql >>idx.out 2>&1 &

3.2 Run analyzedb

Update statistics for the entire database on the target YMatrix cluster:

export PGPORT=<target_port>
time analyzedb -d <target_db_name> -p 10 -a
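
To confirm the statistics were refreshed, you can check the last recorded ANALYZE times (a quick spot check; pg_stat_last_operation is a Greenplum-lineage catalog, so verify it is present in your YMatrix version):

SELECT objid::regclass AS table_name, statime
FROM pg_stat_last_operation
WHERE staactionname = 'ANALYZE'
ORDER BY statime DESC
LIMIT 10;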


3.3 Add Mirror Segments

Add mirror segments to the target YMatrix cluster. The procedure is as follows:

# First, check the current cluster instance information
postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                   datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | n    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | n    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | n    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
(6 rows)

# Create a file containing all hostnames
$ cat /home/mxadmin/seg_hosts 
sdw1
sdw2
sdw3
sdw4

# Use gpssh to create mirror directories in batch
$ gpssh -f /home/mxadmin/seg_hosts -e 'mkdir -p /home/mxdata_20220925154450/mirror'

# Generate the mirror configuration template file
$ mxaddmirrors -o ./addmirror

# View the generated mirror template file
$ cat addmirror

# Execute the command to add mirrors
$ mxaddmirrors -i addmirror

# Finally, check the cluster instance again
postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                   datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | s    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | s    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | s    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
    7 |       0 | m    | m              | n    | d      | 7000 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg0
    8 |       1 | m    | m              | s    | u      | 7001 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg1
    9 |       2 | m    | m              | s    | u      | 7000 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg2
   10 |       3 | m    | m              | s    | u      | 7001 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg3
(10 rows)

After completing the above steps, restore business access and monitor system operations. Observe the cluster for a period of time (the duration depends on your workload patterns). If the system runs stably, the data migration has been successfully completed. Congratulations!