This document describes best practices for migrating data from Greenplum 4.3.X/5/6 to YMatrix 5.
Before performing any major operation, thorough preparation is essential—both in terms of technical readiness and mental preparedness (as issues may arise at any time). While personal mental preparation varies, we provide a comprehensive checklist for technical setup. Some steps are optional:
| No. | Preparation Step | Description | Optional |
|---|---|---|---|
| 1 | Back up source cluster data | Migration only reads from the source cluster and does not modify data, so there is no risk of data corruption due to migration | Yes |
| 2 | Install and deploy target database software | | No, required |
| 3 | Set up monitoring for the target cluster | Optional, based on requirements | Yes |
| 4 | Disable all DDL operations on the application side | DDL changes during migration can introduce risks, so treat this step with high priority | No, required |
| 5 | Terminate all business connections | | No, required |
| 6 | Collect information about source and target clusters | Hardware/software configurations, source cluster topology, target cluster topology, etc. | No, required |
| 7 | Back up source cluster metadata | DDLs, schema names, user information, etc. | No, required |
| 8 | Adjust or optimize DDLs | Adjustment ensures compatibility between Greenplum and YMatrix DDL syntax. Optimization enables optimal database performance from the start. (Not needed when migrating from YMatrix 4 to YMatrix 5) | Yes, mxshift now supports automatic DDL migration—see mxshift |
| 9 | Generate SQL file for altering table columns | | Yes, mxshift supports automatic index migration—see mxshift |
| 10 | Add whitelists for all nodes in source and target clusters | | No, required |
| 11 | Create users on the target cluster | | No, required |
| 12 | Restore table structures | | Yes, mxshift supports automatic index migration—see mxshift |
| 13 | Create DDLs on the target cluster | In YMatrix, it is more efficient to create indexes after data migration. Therefore, when creating DDLs before migration, exclude index creation statements. | Yes, mxshift supports automatic DDL migration—see mxshift |
| 14 | Change table column collation to C | | Yes |
Below are detailed examples for each step.
Migration involves only read operations on the source cluster, so data integrity is not at risk. However, if you want extra assurance or have other use cases for the data, use the mxbackup tool for parallel cluster backup.
Note!
We recommend deploying the cluster without mirror segments initially. Add mirrors after migration completes to improve migration efficiency.
Note!
Hostnames of the target cluster must not conflict with those of the source cluster.
Refer to standard cluster deployment guides:
See monitoring and alerting documentation:
Note!
Before stopping all business traffic, no DDL operations are allowed on the source Greenplum cluster. This includes creating or modifying objects, adding or dropping columns, and executing `CREATE`, `ALTER`, `TRUNCATE`, or `DROP` statements.
On the Master node of the source Greenplum cluster, edit the pg_hba.conf file:
$ vim pg_hba.conf
Add client IP addresses in the following format to block remote access:
host all all <Client IP Address>/<CIDR Mask> reject
Reload the configuration to apply changes:
$ gpstop -u
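For example, assuming the application clients connect from the 10.10.0.0/16 subnet (an illustrative value), the rule can be appended and the configuration reloaded in one go:
# Client subnet is illustrative; adjust to your environment
$ echo "host all all 10.10.0.0/16 reject" >> $MASTER_DATA_DIRECTORY/pg_hba.conf
$ gpstop -u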
Gather details including number of physical machines, OS, CPU, memory, disk type, disk usage, NIC info, cluster topologies, database license, resource group settings, etc., adjusting as needed per your environment. Useful commands include:
| No. | Command | Purpose |
|---|---|---|
| 1 | free -g | View system memory |
| 2 | lscpu | View number of CPUs |
| 3 | cat /etc/system-release | View OS version |
| 4 | uname -a | Output kernel information in order: kernel name, hostname, kernel release, kernel version, hardware architecture, processor type, platform, OS name (omitted if unknown) |
| 5 | tail -11 /proc/cpuinfo | View CPU details |
| 6 | gpcheckperf | Test network bandwidth and disk I/O performance |
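To gather the same information from every node at once, these commands can be fanned out with gpssh (a sketch; the host file path is illustrative):
$ gpssh -f /home/gpadmin/all_hosts -e 'free -g; lscpu; cat /etc/system-release; uname -a'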
As the gpadmin user, use pg_dump to back up DDLs, schemas, and user information from the source Greenplum cluster:
# Back up global user objects
pg_dumpall -g -f global_user.sql
# Back up table structures
pg_dump <source_db_name> -s -f orig.sql
# Make a backup copy
cp orig.sql copy.sql
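If more than one database needs to be migrated, the structure dump can be repeated per database (a sketch; db1 and db2 are placeholder names):
# Dump each database's table structures into its own file
for db in db1 db2; do pg_dump "$db" -s -f "${db}_orig.sql"; done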
Generate SQL for recreating indexes:
$ cat get_index.sql
WITH soi (oid, toid, SIZE, tschema, tname) AS
( SELECT soioid,
soitableoid,
soisize,
soitableschemaname,
soitablename
FROM gp_toolkit.gp_size_of_index ),
childrel (oid, coid) AS
( SELECT pp.parrelid,
ppr.parchildrelid
FROM pg_partition pp
INNER JOIN pg_partition_rule ppr ON pp.oid = ppr.paroid ),
irn (oid, toid, SIZE, tschema, tname, rn) AS
( SELECT *,
row_number() OVER (
ORDER BY dt.ssize DESC) rn
FROM
( SELECT soi.oid,
soi.toid ,
sum(coalesce(dt2.SIZE, soi.SIZE)) ssize ,
soi.tschema,
soi.tname
FROM soi
LEFT JOIN
( SELECT childrel.oid,
soi.SIZE
FROM soi
INNER JOIN childrel ON soi.toid = childrel.coid ) dt2 ON soi.toid = dt2.oid
GROUP BY 1,
2,
4,
5 ) dt )
SELECT SQL || ';'
FROM
( SELECT pg_get_indexdef(oid) AS SQL ,
(rn % 12 + (rn / 12)::int) % 12 AS orderkey
FROM irn
WHERE toid NOT IN
(SELECT coid
FROM childrel) ) dt
WHERE SQL NOT LIKE 'CREATE UNIQUE INDEX%'
ORDER BY dt.orderkey ;
Execute the above SQL using psql:
psql -d <source_db_name> -U <superuser> -t -f get_index.sql > index.sql
mxshift now supports automatic DDL migration—refer to section "2 Migration Execution". Manual adjustments are described below.
Adjustments resolve compatibility issues between Greenplum and YMatrix DDL syntax (not required for YMatrix-to-YMatrix migrations). Optimizations aim to achieve peak performance early. Example commands executed on the source Greenplum cluster:
# Change compression algorithm
sed -i 's/quicklz/zstd/g' orig.sql
# Update plpython version
sed -i 's/plpythonu/plpython3u/g' orig.sql
# Add plpython3u language creation statement
sed -i '1s/^/create language plpython3u;/' orig.sql
# Remove index creation statements
sed -i -e '/CREATE INDEX.*;/d' -e '/CREATE INDEX/,/;/d' orig.sql
# Optimize compression method
sed -i 's/compresstype=zlib/compresstype=zstd/g' orig.sql
# Adjust compression level
sed -i 's/compresslevel=5/compresslevel=1/g' orig.sql
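After the rewrites, a quick grep can confirm that none of the replaced patterns remain in the adjusted file (a sketch):
# Should print "no remaining occurrences" if all substitutions succeeded
grep -nE 'quicklz|plpythonu|compresstype=zlib' orig.sql || echo "no remaining occurrences"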
Explanation:
- The quicklz compression algorithm is not supported in YMatrix, so it is replaced with zstd.
- plpythonu (Python 2) is replaced with plpython3u (Python 3).
- zstd offers better performance in YMatrix than zlib, so the compression type and level are adjusted accordingly.

After table creation, set indexed columns to use the C collation for faster indexing and reduced migration time:
$ cat collate.sql
SELECT 'alter table '||quote_ident(b.schemaname)||'.'||quote_ident(b.tablename)||' alter column '||quote_ident(b.colname)||' type '||d.coltype||' COLLATE "pg_catalog"."C";'
FROM
(SELECT DISTINCT a.schemaname,
a.tablename,
regexp_split_to_table(replace(a.keyindex,' ',''),',') AS colname
FROM
(SELECT schemaname,
tablename,
rtrim(split_part(indexdef,'(',2),')') AS keyindex,
indexdef
FROM pg_indexes
WHERE schemaname NOT IN ('pg_catalog')
AND indexname NOT LIKE '%pkey'
AND indexdef LIKE 'CREATE INDEX%') a) b
INNER JOIN
(SELECT c.oid,
n.nspname,
c.relname
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace) c ON b.schemaname=c.nspname
AND c.relname=b.tablename
INNER JOIN
(SELECT e.attrelid,a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
(SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid)
FOR 128)
FROM pg_catalog.pg_attrdef d
WHERE d.adrelid = a.attrelid
AND d.adnum = a.attnum
AND a.atthasdef), a.attnotnull,
a.attnum
FROM pg_catalog.pg_attribute a
LEFT OUTER JOIN pg_catalog.pg_attribute_encoding e ON e.attrelid = a.attrelid
AND e.attnum = a.attnum
WHERE a.attnum >0) d ON d.attrelid=c.oid
AND d.attname=b.colname;
Note!
Skip this step if source and target clusters run on the same server.
On the Master nodes of both clusters, add all node IPs to the pg_hba.conf file. Example IPs: 172.16.100.2/32 and 172.16.100.3/32.
Note!
For multiple hosts, list all host IPs in the script.
$ cat config_hba.sh
#!/bin/bash
for line in `psql -Atc "select hostname||','|| datadir
from gp_segment_configuration order by datadir desc"`
do
hostname=`echo $line|awk -F "," '{print $1}'`
datadir=`echo $line|awk -F "," '{print $2}'`
gpssh -h $hostname -v -e "echo host all all 172.16.100.2/32 md5 >> ${datadir}/pg_hba.conf"
gpssh -h $hostname -v -e "echo host all all 172.16.100.3/32 md5 >> ${datadir}/pg_hba.conf"
done
Also add node IPs and hostnames to /etc/hosts. Example: IP 172.16.100.195, hostname sdw1.
$ cat add_hosts.sh
#!/bin/bash
for hostname in `psql -Atc "select distinct hostname from gp_segment_configuration order by 1 desc"`
do
gpssh -h $hostname -v -e "echo 172.16.100.195 sdw1 >> /etc/hosts"
done
Reload configuration to apply changes.
For Greenplum, run:
$ gpstop -u
For YMatrix 5, run:
$ mxstop -u
Run the following command in the YMatrix environment:
$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f global_user.sql
In the YMatrix environment:
$ psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db> -U <target_superuser> -f orig.sql
Use the backed-up orig.sql to restore table structures on the target YMatrix cluster:
$ time psql -d <target_db_name> -f orig.sql > restoreddl.log 2>&1 &
$ time psql -d <target_db_name> -f collate.sql
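Note that collate.sql only generates the ALTER statements; one way to apply them is to capture the generated output and run it (a sketch; the intermediate file name is illustrative):
$ psql -d <target_db_name> -At -f collate.sql > alter_collate.sql
$ psql -d <target_db_name> -f alter_collate.sql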
Note!
For full parameter details, see mxshift
First, create a configuration file config_path.toml. Example:
[database]
[database.source]
## Name of database
db-database= "testdb"
## Hostname of database master
db-host="sdw3"
## password of database
db-password="xxxx"
## Port of database master
db-port=54322
## user name of database
db-user="gpadmin"
## Version of database(Please use the result of 'SELECT version();' as value). Required only when
## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
## 2. Target database is un-reachable, and 'ddl.mode' is 'output'
# db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
## The installation directory of matrixdb
install-dir="/usr/local/greenplum-db-6.7.1"
[[database.source.hostname-to-ip]]
## Replace content within <> with actual values and remove <>
node-hostname="<mdw>"
node-ip="<127.0.0.1>"
[[database.source.hostname-to-ip]]
node-hostname="<sdw1>"
node-ip="<127.0.0.2>"
[[database.source.hostname-to-ip]]
node-hostname="<sdw2>"
node-ip="<127.0.0.3>"
[database.target]
## Name of database
db-database="destdb"
## Hostname of database master
db-host="172.16.100.32"
## password of database
db-password="yyyy"
## Port of database master
db-port=5432
## user name of database
db-user="mxadmin"
## Version of database(Please use the result of 'SELECT version();' as value). Required only when
## 1. Source database is un-reachable, and 'ddl.only-ddl' is enabled and 'ddl.mode' is 'input'
## 2. Target database is un-reachable, and 'ddl.mode' is 'output'
# db-version="PostgreSQL 12 (MatrixDB 5.1.0-enterprise) (Greenplum Database 7.0.0+dev.17410.gedbdb5ef84 build dev) on arm-apple-darwin21.5.0, compiled by Apple clang version 13.0.0 (clang-1300.0.27.3), 64-bit compiled on Jun 5 2023 15:45:24"
[scope]
## Compression method for data transfer: 0/gzip/lz4/zstd
compress-method="lz4"
## Data transfer mode: normal/dryrun/fetch/motion
## dryrun: execute DDL only, no data transfer
## fetch: fetch data from source and discard
## motion: fetch, redistribute, then discard
mode="normal"
## SQL to select segment info from source
# select-source-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
## SQL to select segment info from target
# select-target-segment-sql="SELECT dbid, content, port, hostname FROM gp_segment_configuration WHERE status = 'u' AND role = 'p' ORDER BY CONTENT;"
schema-list=["test_schema_1", "test_schema_2"]
exclude-schema-list=["test_schema_5", "test_schema_8"]
[[scope.table-list]]
schema="test_schema_1"
name="table_001"
[[scope.table-list]]
schema="test_schema_2"
name="table_002"
[[scope.exclude-table-list]]
schema="test_schema_3"
name="table_003"
## Whether to disable incremental data migration (default: true)
# disable-data-increment=true
[log]
## Log level: debug/verbose/info
log-level="info"
## Disable colored log output
# no-color=false
[controller]
## By default, transfer starts from the largest table; set both-way=true to transfer from both the largest and the smallest tables simultaneously
both-way=true
## Number of tables to migrate concurrently
concurrency=3
[transfer]
## Verify record count for each table
verify=true
with-index=true
[ddl]
enabled=true
# file-path="/tmp/mxshift.sql"
# mode="output"
only-ddl=false
## Skip resource queue/group during DDL transfer (default: true)
# skip-resource-queue-and-group=true
## Skip tablespace during DDL transfer (default: true)
# skip-table-space=true
[[ddl.replace]]
## Applies only when migrating from Greenplum to YMatrix
category="role"
[[ddl.replace.pairs]]
old="gpadmin"
new="mxadmin"
## Disable incremental DDL migration (default: true)
# disable-ddl-increment=true
Then execute the migration on the target YMatrix cluster:
$ mxshift -c config_path.toml
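Because a full migration can take hours, mxshift is often started in the background with its output captured to a log (a sketch; the log file name is illustrative):
$ nohup mxshift -c config_path.toml > mxshift_migration.log 2>&1 &
$ tail -f mxshift_migration.log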
On the target YMatrix cluster:
psql -h localhost -p <target_port> -d <target_db_name> -U mxadmin -f index.sql >>idx.out 2>&1 &
Update statistics for the entire database on the target YMatrix cluster:
export PGPORT=<target_port>
time analyzedb -d <target_db_name> -p 10 -a
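Although verify=true already compares record counts during transfer, a manual spot check of a few key tables can provide extra assurance (a sketch; the host placeholders follow the same convention as earlier commands, and the table names are illustrative):
# Compare row counts of selected tables on source and target
for t in test_schema_1.table_001 test_schema_2.table_002
do
  src=`psql -h <source_host> -p <source_port> -d <source_db_name> -Atc "select count(*) from $t"`
  tgt=`psql -h <YMatrix_Server_IP> -p <target_port> -d <target_db_name> -Atc "select count(*) from $t"`
  echo "$t source=$src target=$tgt"
done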
3.3 Adding Mirror Segments
Add mirror segments to the target YMatrix cluster. The procedure is as follows:
# First, check the current cluster instance information
postgres=# SELECT * from gp_segment_configuration order by 1;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /home/mxdata_20220925154450/master/mxseg-1
2 | 0 | p | p | n | u | 6000 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg0
3 | 1 | p | p | n | u | 6001 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg1
4 | 2 | p | p | n | u | 6000 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg2
5 | 3 | p | p | n | u | 6001 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg3
6 | -1 | m | m | s | u | 5432 | sdw1 | sdw1 | /home/mxdata_20220925154450/standby/mxseg-1
(6 rows)
# Create a file containing all hostnames
$ cat /home/mxadmin/seg_hosts
sdw1
sdw2
sdw3
sdw4
# Use gpssh to create mirror directories in batch
$ gpssh -f /home/mxadmin/seg_hosts -e 'mkdir -p /home/mxdata_20220925154450/mirror'
# Generate the mirror configuration template file
$ mxaddmirrors -o ./addmirror
# View the generated mirror template file
$ cat addmirror
# Execute the command to add mirrors
$ mxaddmirrors -i addmirror
# Finally, check the cluster instance again
postgres=# SELECT * from gp_segment_configuration order by 1;
dbid | content | role | preferred_role | mode | status | port | hostname | address | datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
1 | -1 | p | p | n | u | 5432 | mdw | mdw | /home/mxdata_20220925154450/master/mxseg-1
2 | 0 | p | p | n | u | 6000 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg0
3 | 1 | p | p | s | u | 6001 | sdw2 | sdw2 | /home/mxdata_20220925154450/primary/mxseg1
4 | 2 | p | p | s | u | 6000 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg2
5 | 3 | p | p | s | u | 6001 | sdw3 | sdw3 | /home/mxdata_20220925154450/primary/mxseg3
6 | -1 | m | m | s | u | 5432 | sdw1 | sdw1 | /home/mxdata_20220925154450/standby/mxseg-1
7 | 0 | m | m | n | d | 7000 | sdw3 | sdw3 | /home/mxdata_20220925154450/mirror/mxseg0
8 | 1 | m | m | s | u | 7001 | sdw3 | sdw3 | /home/mxdata_20220925154450/mirror/mxseg1
9 | 2 | m | m | s | u | 7000 | sdw2 | sdw2 | /home/mxdata_20220925154450/mirror/mxseg2
10 | 3 | m | m | s | u | 7001 | sdw2 | sdw2 | /home/mxdata_20220925154450/mirror/mxseg3
(10 rows)
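Once mirrors are added, wait until every segment reports status 'u' and every mirrored segment reports mode 's' (synchronized). A quick check (a sketch):
# The count should drop to 0 once synchronization completes
$ psql -d postgres -Atc "select count(*) from gp_segment_configuration where status = 'd' or (content >= 0 and mode <> 's')"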
After completing the above steps, restore business access and monitor system behavior. Observe the cluster for a sustained period (duration depends on specific workload patterns). If the system runs stably, congratulations — the data migration has been successfully completed!