Migrate data from Greenplum 6 to MatrixDB 4

This section of the documentation mainly covers best practices for migrating data from Greenplum 6 to MatrixDB 4.

1 Environmental preparation

Data migration is a significant operation, so be fully prepared before you start, both practically and mentally (problems can appear at any point). Everyone prepares in their own way; below we list a fairly complete plan, including optional steps:

| No. | Preparation step | Notes | Optional |
| --- | --- | --- | --- |
| 1 | Back up source cluster data | Migration only reads the source cluster data and never writes to it, so it carries no risk of data corruption | Yes |
| 2 | Install and deploy the target database software | | No, required |
| 3 | Deploy monitoring for the target cluster | Depends on demand | Yes |
| 4 | Prohibit all DDL operations on the business side | An important step; DDL during migration puts the migration at risk, so please pay close attention | No, required |
| 5 | Disconnect all business connections | | No, required |
| 6 | Collect source cluster and target cluster information | Hardware and software configuration, source cluster topology, target cluster topology, etc. | No, required |
| 7 | Back up source cluster metadata | DDL, schema names, user information, etc. | No, required |
| 8 | Correct or optimize DDL | Correction adapts DDL statements between Greenplum 6 and MatrixDB 4 to resolve compatibility issues (no correction is needed when migrating from MatrixDB 4 to MatrixDB 4); optimization lets the database perform well from the start | No, required |
| 9 | Generate the SQL file that modifies table columns | | Yes |
| 10 | Add a whitelist to all nodes in the source and target clusters | | No, required |
| 11 | Create users for the target cluster | | No, required |
| 12 | Restore the table structure | | No, required |
| 13 | Create DDL for the target cluster | In MatrixDB it is more efficient to re-create indexes after the migration, so when creating DDL for the target cluster before migration, create the statements without indexes | No, required |
| 14 | Alter tables to the C collation | | Yes |

Based on the table above, we give specific examples.

1.1 Backup source cluster data

Data migration only reads the source cluster data and never writes to it, so it carries no risk of corrupting the source data. If you are still concerned, or other business needs require a backup, you can use the mxbackup tool to perform a parallel cluster backup.
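For reference, a minimal invocation might look like the following. The exact options depend on your mxbackup version; the --dbname flag shown here is an assumption modeled on a gpbackup-style interface, so check mxbackup --help before relying on it.

# Hedged sketch: full parallel backup of the source database (flags are assumptions).
$ mxbackup --dbname <Source Database Name>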

1.2 Install and deploy the target database software

> ***Notes!***  
We recommend not deploying mirror nodes (Mirrors) when you deploy the cluster; add them after the migration completes to improve migration efficiency.

> ***Notes!***  
The target cluster host names must not duplicate those of the source cluster!

Please refer to the standard cluster deployment documentation.

1.3 Deploy monitoring for target clusters

Please refer to the monitoring and alerting documentation.

1.4 Prohibit all DDL operations on the business side

> ***Notes!***  
Before all business-side services are formally stopped, no DDL may be executed on the source Greenplum 6 cluster: do not create or modify objects, add or drop columns, or run CREATE, ALTER, TRUNCATE, or DROP statements.
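Before announcing the DDL freeze, it can help to confirm that no DDL is still running on the source cluster. A minimal check via pg_stat_activity (run on the Greenplum 6 master as a superuser; the pattern list below is only illustrative):

# List sessions whose current statement looks like DDL; the result should be empty.
$ psql -d <Source Database Name> -c "
SELECT pid, usename, query
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND (query ILIKE 'CREATE %' OR query ILIKE 'ALTER %'
       OR query ILIKE 'DROP %' OR query ILIKE 'TRUNCATE %');"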

1.5 Disconnect all business connections

Modify the pg_hba.conf file on the source Greenplum 6 cluster master.

$ vim  pg_hba.conf

Add the business client address in the following format to disable remote access.

host     all         all         <client IP address>/<subnet mask bits>       reject

Then reload the configuration so that the modified pg_hba.conf takes effect.

$ gpstop -u
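Note that reloading pg_hba.conf only blocks new connections; sessions that are already established keep running. A minimal sketch for terminating the remaining business sessions (adjust the usename filter to your own administrative accounts):

# Terminate every backend except the current one and gpadmin sessions.
$ psql -d postgres -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND usename <> 'gpadmin';"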

1.6 Collect source cluster and target cluster information

Collect source cluster and target cluster information, including the number of physical machines, operating system, CPU, memory, disk type, disk usage, network card information, source cluster topology, target cluster topology, database license, resource group configuration, and so on, and use it as appropriate for your scenario to prepare thoroughly for the migration. The following commands may be useful:

| No. | Command | Purpose |
| --- | --- | --- |
| 1 | free -g | View operating system memory |
| 2 | lscpu | View the number of CPUs |
| 3 | cat /etc/system-release | View the operating system release |
| 4 | uname -a | Print all system information, in order: kernel name, network node host name, kernel release, kernel version, machine hardware name, processor type (non-portable), hardware platform (non-portable), and operating system; the -p and -i results are omitted when unknown |
| 5 | tail -11 /proc/cpuinfo | View CPU details |
| 6 | gpcheckperf | Check network performance, bandwidth, and disk I/O |
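When the clusters span many hosts, the per-host commands above can be collected in a single pass with gpssh. A minimal sketch, assuming a host file that lists every node (the file name all_hosts is illustrative):

# Collect basic OS, CPU, and memory facts from every host at once.
$ gpssh -f all_hosts -e 'free -g; lscpu; cat /etc/system-release; uname -a' > cluster_info.txt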

1.7 Back up source cluster metadata

Under the gpadmin user, use the pg_dump and pg_dumpall tools to back up the source Greenplum 6 cluster DDL, schema names, user information, and so on.

# Backup global user objects
pg_dumpall -g -f global_user.sql

# Backup table structure
pg_dump <Source Database Name> -s -f orig.sql          

# Copy a backup
cp orig.sql copy.sql                        

Generate the SQL file that re-creates the indexes.

$ cat get_index.sql 
WITH soi (oid, toid, SIZE, tschema, tname) AS
  ( SELECT soioid,
           soitableoid,
           soisize,
           soitableschemaname,
           soitablename
   FROM gp_toolkit.gp_size_of_index ),
     childrel (oid, coid) AS
  ( SELECT pp.parrelid,
           ppr.parchildrelid
   FROM pg_partition pp
   INNER JOIN pg_partition_rule ppr ON pp.oid = ppr.paroid ),
     irn (oid, toid, SIZE, tschema, tname, rn) AS
  ( SELECT *,
           row_number() OVER (
            ORDER BY dt.ssize DESC) rn
   FROM
     ( SELECT soi.oid,
              soi.toid ,
              sum(coalesce(dt2.SIZE, soi.SIZE)) ssize ,
                                                soi.tschema,
                                                soi.tname
      FROM soi
      LEFT JOIN
        ( SELECT childrel.oid,
                 soi.SIZE
         FROM soi
         INNER JOIN childrel ON soi.toid = childrel.coid ) dt2 ON soi.toid = dt2.oid
      GROUP BY 1,
               2,
               4,
               5 ) dt )
SELECT SQL || ';'
FROM
  ( SELECT pg_get_indexdef(oid) AS SQL ,
           (rn % 12 + (rn / 12)::int) % 12 AS orderkey
   FROM irn
   WHERE toid NOT IN
       (SELECT coid
        FROM childrel) ) dt
WHERE SQL NOT LIKE 'CREATE UNIQUE INDEX%'
ORDER BY dt.orderkey ;

Execute the above SQL via psql.

psql -d <Source Database Name> -U <Source database super username> -t -f get_index.sql > index.sql
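Before moving on, a quick sanity check that the generated file really contains index definitions:

# Count the CREATE INDEX statements captured for replay after the migration.
$ grep -c 'CREATE INDEX' index.sql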

1.8 Correct or optimize DDL

The correction adapts the DDL statements between Greenplum 6 and MatrixDB to resolve compatibility issues (no correction is required when migrating data from MatrixDB to MatrixDB), and the optimization lets the database perform well from the start. Currently, mxshift does not correct DDL automatically. The following manual correction and optimization commands are examples executed in the source Greenplum 6 cluster environment:

# Modify the compression algorithm
sed -i 's/quicklz/zstd/g' orig.sql

# Modify the plpython version
sed -i 's/plpythonu/plpython3u/g' orig.sql

# Add the statement that creates the plpython3u language
sed -i '1s/^/create language plpython3u;/' orig.sql

# Delete index creation statements
sed -i -e '/CREATE INDEX.*;/d' -e '/CREATE INDEX/,/;/d' orig.sql

# Optimize the compression algorithm
sed -i 's/compresstype=zlib/compresstype=zstd/g' orig.sql

# Modify the compression level
sed -i 's/compresslevel=5/compresslevel=1/g' orig.sql

For the above corrections or optimizations, the following specific explanation is given:

  • Modify the compression algorithm. MatrixDB does not support the quicklz compression type.
  • Modify the plpython version. MatrixDB supports plpython3.
  • Add the create-language statement. In MatrixDB, some extension plugins (here plpython3u) need to be created explicitly.
  • Delete the index creation statements. Creating the target tables with indexes greatly reduces migration efficiency, so we recommend removing the index statements before creating the tables and re-creating the indexes after the migration completes.
  • Optimize the compression algorithm. The zstd compression type performs better in MatrixDB.
  • Modify the compression level. Compression level 1 performs better in MatrixDB.
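Once the corrections have been applied, a quick grep helps confirm that none of the incompatible patterns remain in orig.sql:

# These patterns should no longer appear; empty output means the corrections took effect.
$ grep -nE 'quicklz|plpythonu|compresstype=zlib|CREATE INDEX' orig.sql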

1.9 Generate SQL files that modify table fields

After the tables are built, the columns that will be indexed need to be altered to the C collation so that indexes can be added efficiently and migration time is reduced.

    $ cat collate.sql
    SELECT 'alter table '||quote_ident(b.schemaname)||'.'||quote_ident(b.tablename)||' alter column '||quote_ident(b.colname)||' type '||d.coltype||' COLLATE "pg_catalog"."C";'
    FROM
    (SELECT DISTINCT a.schemaname,
                     a.tablename,
                     regexp_split_to_table(replace(a.keyindex,' ',''),',') AS colname
     FROM
       (SELECT schemaname,
               tablename,
               rtrim(split_part(indexdef,'(',2),')') AS keyindex,
               indexdef
        FROM pg_indexes
        WHERE schemaname NOT IN ('pg_catalog')
          AND indexname NOT LIKE '%pkey'
          AND indexdef LIKE 'CREATE INDEX%') a) b
    INNER JOIN
    (SELECT c.oid,
            n.nspname,
            c.relname
     FROM pg_catalog.pg_class c
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace) c ON b.schemaname=c.nspname
    AND c.relname=b.tablename
    INNER JOIN
    (SELECT e.attrelid,a.attname, pg_catalog.format_type(a.atttypid, a.atttypmod) AS coltype,
       (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid)
                         FOR 128)
        FROM pg_catalog.pg_attrdef d
        WHERE d.adrelid = a.attrelid
          AND d.adnum = a.attnum
          AND a.atthasdef), a.attnotnull,
                            a.attnum
     FROM pg_catalog.pg_attribute a
     LEFT OUTER JOIN pg_catalog.pg_attribute_encoding e ON e.attrelid = a.attrelid
     AND e.attnum = a.attnum
     WHERE a.attnum >0) d ON d.attrelid=c.oid
    AND d.attname=b.colname;
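As with get_index.sql in section 1.7, this query only prints the ALTER TABLE statements rather than executing them. A hedged sketch of capturing them into a standalone file on the source cluster so they can be applied later (the file name alter_collate.sql is illustrative):

# Capture the generated ALTER TABLE ... COLLATE statements for later execution.
$ psql -d <Source Database Name> -t -f collate.sql > alter_collate.sql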

1.10 Add a whitelist for all nodes in the source cluster and the target cluster

> ***Notes!***  
If the source cluster and the target cluster run on the same servers, skip this step.

On the masters of the source cluster and the target cluster, execute the following command to add the host IP addresses of all nodes of both clusters to the pg_hba.conf file. The example uses 172.16.100.2/32 and 172.16.100.3/32.

> ***Notes!***  
If there are multiple hosts, write all of their IP addresses into the script.

$ cat config_hba.sh

#!/bin/bash
# Append whitelist entries for the source and target hosts (172.16.100.2/32 and
# 172.16.100.3/32 in this example) to the pg_hba.conf of every instance.
for line in `psql -Atc "select hostname||','||datadir from gp_segment_configuration order by datadir desc"`
do
    hostname=`echo $line | awk -F "," '{print $1}'`
    datadir=`echo $line | awk -F "," '{print $2}'`
    gpssh -h $hostname -v -e "echo host    all    all    172.16.100.2/32    md5 >> ${datadir}/pg_hba.conf"
    gpssh -h $hostname -v -e "echo host    all    all    172.16.100.3/32    md5 >> ${datadir}/pg_hba.conf"
done
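After running the script, a quick look at the end of the master's pg_hba.conf confirms that the entries were appended as expected, assuming MASTER_DATA_DIRECTORY is set in the gpadmin environment as in a standard deployment:

# The last lines should be the whitelist entries that were just added.
$ tail -n 2 $MASTER_DATA_DIRECTORY/pg_hba.conf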

On the masters of the source cluster and the target cluster, execute the following command to add the host IP addresses and host names of all nodes of both clusters to the /etc/hosts file. In the example, the host IP address is 172.16.100.195 and the host name is sdw1.

$ cat add_hosts.sh

#!/bin/bash
# Append the host entry (172.16.100.195 sdw1 in this example) to /etc/hosts on every host.
for hostname in `psql -Atc "select distinct hostname from gp_segment_configuration order by 1 desc"`
do
    gpssh -h $hostname -v -e "echo 172.16.100.195 sdw1 >> /etc/hosts"
done

Then reload the configuration so that the modified pg_hba.conf takes effect.

$ gpstop -u

### 1.11 Create a user for the target cluster
Execute the following command in a MatrixDB cluster environment.

$ psql -h <Target Master Host> -p <Port> -d <Target Database Name> -U <Username> -f global_user.sql

### 1.12 Create DDL for the target cluster
Execute the following command in a MatrixDB cluster environment.

$ psql -h <Target Master Host> -p <Port> -d <Target Database Name> -U <Username> -f orig.sql

### 1.13 Restore table structure
Use the backup orig.sql file to restore the table structure in the target cluster MatrixDB.

$ time psql -d <Target Database Name> -f orig.sql > restoreddl.log 2>&1 &
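Once the restore finishes, scan the log for failures before continuing:

# Any ERROR lines reported here should be resolved before migrating data.
$ grep -i error restoreddl.log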

### 1.14 Alter tables to the C collation

$ time psql -d <Database Name> -f collate.sql

## 2 Migration execution
> ***Notes!***  
For detailed parameters, please refer to [mxshift](/doc/4.7/tools/mxshift)

First write the configuration file config_path.json.

{ "gphome": "/usr/local/matrixdb, "mode": "normal", "verify": false, "bothway": true, "concurrency": 6, "log_level": "info", "src_host": "", "src_db": "", "src_user": "", "src_password": "123456", "src_port": , "target_host": "localhost", "target_db": "", "target_user": "", "target_password": "123456", "target_port": }

The data migration is then performed on the target MatrixDB cluster.

$ time ./mxshift -c config_path.json

## 3 Follow-up tasks
### 3.1 Create the indexes
Create the indexes on the target MatrixDB cluster.

$ psql -h localhost -p <Port> -d <Target Database Name> -U <Username> -f index.sql >> idx.out 2>&1 &

### 3.2 Execute the analyzedb command
Update the database statistics on the target MatrixDB cluster.

$ export PGPORT=<Target Cluster Port>
$ time analyzedb -d <Target Database Name> -p 10 -a

### 3.3 Adding Mirror
Add Mirror on the target cluster MatrixDB. The steps are as follows:

First, check the current cluster instance information

postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                  datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | n    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | n    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | n    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
(6 rows)

Create a file with all hostnames

$ cat /home/mxadmin/seg_hosts
sdw1
sdw2
sdw3
sdw4

Create the Mirror directories on all hosts in batch via the gpssh command

$ gpssh -f /home/mxadmin/seg_hosts -e 'mkdir -p /home/mxdata_20220925154450/mirror'

Generate Mirror template file

$ gpaddmirrors -o ./addmirror

View Mirror template files

$ cat addmirror

Perform the add-Mirror operation

$ gpaddmirrors -i addmirror

Finally, check the cluster instance again

postgres=# SELECT * from gp_segment_configuration order by 1;
 dbid | content | role | preferred_role | mode | status | port | hostname | address |                  datadir
------+---------+------+----------------+------+--------+------+----------+---------+---------------------------------------------
    1 |      -1 | p    | p              | n    | u      | 5432 | mdw      | mdw     | /home/mxdata_20220925154450/master/mxseg-1
    2 |       0 | p    | p              | n    | u      | 6000 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg0
    3 |       1 | p    | p              | s    | u      | 6001 | sdw2     | sdw2    | /home/mxdata_20220925154450/primary/mxseg1
    4 |       2 | p    | p              | s    | u      | 6000 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg2
    5 |       3 | p    | p              | s    | u      | 6001 | sdw3     | sdw3    | /home/mxdata_20220925154450/primary/mxseg3
    6 |      -1 | m    | m              | s    | u      | 5432 | sdw1     | sdw1    | /home/mxdata_20220925154450/standby/mxseg-1
    7 |       0 | m    | m              | n    | d      | 7000 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg0
    8 |       1 | m    | m              | s    | u      | 7001 | sdw3     | sdw3    | /home/mxdata_20220925154450/mirror/mxseg1
    9 |       2 | m    | m              | s    | u      | 7000 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg2
   10 |       3 | m    | m              | s    | u      | 7001 | sdw2     | sdw2    | /home/mxdata_20220925154450/mirror/mxseg3
(10 rows)
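gp_segment_configuration lists the new mirrors immediately, but they may show status d (down) until the initial synchronization finishes. gpstate gives a more direct view of the mirror state:

# Show mirror instances and their synchronization state; rerun until all are synchronized.
$ gpstate -m

# Show segments with error conditions; none are expected once synchronization completes.
$ gpstate -e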


After completing the above steps, restore business access and monitor the workload for a period of time (how long depends on your scenario). If everything runs stably, congratulations: the data migration is complete!