Online Expansion

For distributed databases holding massive data, growth inevitably outstrips cluster storage and compute capacity, making expansion essential. MatrixDB supports online expansion, including:

  1. Adding new nodes without downtime
  2. Parallelized table data redistribution
  3. Joins between tables that have already been redistributed and those that have not

Online expansion consists of two steps:

  1. Add new nodes
  2. Redistribute data

1. Adding New Nodes

This phase adds new segment nodes to the cluster and brings them online. However, existing table data remains on the original nodes.

To add new nodes, you must first provide a deployment configuration file. This file can be written manually or automatically generated using the gpexpand tool. Automatic generation is recommended.

1.1 Generate Expansion Configuration File

Run gpexpand:

[mxadmin@mdw ~]$ gpexpand
......
Please refer to the Admin Guide for more information.

Would you like to initiate a new System Expansion Yy|Nn (default=N):

Enter y and press Enter:

> y

You will then be prompted to enter hostnames of new nodes, separated by commas:

Enter a comma separated list of new hosts you want
to add to your array.  Do not include interface hostnames.
**Enter a blank line to only add segments to existing hosts**[]:
> sdw3,sdw4

Before continuing, ensure the new hosts are reachable over the network, passwordless SSH trust is established with the existing nodes, and rsync is installed on them.
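These prerequisites can be pre-checked from the master with a short loop; this is a sketch, and the hostnames sdw3/sdw4 are the examples from this walkthrough, so substitute your own:

```shell
# Pre-flight check: passwordless SSH trust and rsync on each new host.
# BatchMode=yes makes ssh fail fast instead of prompting for a password.
for host in sdw3 sdw4; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" 'command -v rsync' >/dev/null 2>&1; then
        echo "$host: ok"
    else
        echo "$host: check passwordless SSH and rsync installation"
    fi
done
```

A host that prints anything other than "ok" needs SSH trust (or rsync) fixed before running gpexpand.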

How many additional primary segments per host?
The value specifies how many segments to add on each host beyond its current count. To keep the original per-host segment count, enter 0.

How many new primary segments per host do you want to add? (default=0):
> 0

Generating configuration file...

20211102:14:36:24:024562 gpexpand:mdw:mxadmin-[INFO]:-Generating input file...

Input configuration file was written to 'gpexpand_inputfile_20211102_143624'.

Please review the file and make sure that it is correct then re-run
with: gpexpand -i gpexpand_inputfile_20211102_143624

20211102:14:36:24:024562 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

After execution, a configuration file named gpexpand_inputfile_20211102_143624 is created in the current directory:

[mxadmin@mdw ~]$ cat gpexpand_inputfile_20211102_143624
sdw3|sdw3|7002|/home/mxadmin/gpdemo/datadirs/dbfast1/demoDataDir3|5|3|p
sdw3|sdw3|7004|/home/mxadmin/gpdemo/datadirs/dbfast3/demoDataDir5|7|5|p
sdw4|sdw4|7002|/home/mxadmin/gpdemo/datadirs/dbfast1/demoDataDir6|8|6|p
sdw4|sdw4|7003|/home/mxadmin/gpdemo/datadirs/dbfast2/demoDataDir7|9|7|p
sdw4|sdw4|7004|/home/mxadmin/gpdemo/datadirs/dbfast3/demoDataDir8|10|8|p

Each line describes one new segment: hostname, address, port, data directory, dbid, content ID, and role. These fields correspond to columns of the gp_segment_configuration catalog table. You may also write this file by hand in the same format.
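Assuming the input format matches the catalog column-for-column (as in Greenplum 6 and later), the pipe-separated fields can be cross-checked against gp_segment_configuration:

```sql
-- Field order in the input file mirrors these catalog columns:
-- hostname | address | port | datadir | dbid | content | role
SELECT hostname, address, port, datadir, dbid, content, role
FROM gp_segment_configuration
ORDER BY dbid;
```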

1.2 Execute Node Addition

With the configuration file ready, proceed to add the new nodes:

[mxadmin@mdw ~]$ gpexpand -i gpexpand_inputfile_20211102_143624

......
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Unlocking catalog
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Unlocked catalog
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Creating expansion schema
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database postgres
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database template1
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Populating gpexpand.status_detail with data from database mxadmin
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-************************************************
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Initialization of the system expansion complete.
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-To begin table expansion onto the new segments
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-rerun gpexpand
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-************************************************
20211102:14:54:03:025371 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

After successful execution, query the gp_segment_configuration table to confirm the new nodes have been added.
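For example, filtering on the new hostnames used in this walkthrough:

```sql
-- New segments should appear with status 'u' (up).
SELECT dbid, content, role, hostname, port, status
FROM gp_segment_configuration
WHERE hostname IN ('sdw3', 'sdw4')
ORDER BY dbid;
```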

2. Data Redistribution

The second step of online expansion redistributes table data to the new segment nodes, as they initially contain no data.

MatrixDB supports three distribution methods (random, hash, and replicated), and expansion handles each according to its own rule:

  • Replicated tables are fully copied to each new segment.
  • Hash-distributed tables are redistributed based on their distribution key.
  • Randomly distributed tables are not migrated, since they have no distribution pattern to preserve; only newly inserted rows land on the new segments, probabilistically.
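The three policies can be illustrated with hypothetical tables (the table names here are purely illustrative):

```sql
-- Replicated: every segment, old and new, holds a full copy after expansion.
CREATE TABLE dim_region (id int, name text) DISTRIBUTED REPLICATED;

-- Hash: rows are rehashed on the distribution key across the enlarged cluster.
CREATE TABLE fact_sales (id int, amount numeric) DISTRIBUTED BY (id);

-- Random: existing rows stay put; only new inserts may land on new segments.
CREATE TABLE staging_raw (payload text) DISTRIBUTED RANDOMLY;
```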

After the first phase completes, the gpexpand schema is created in the postgres database, containing expansion status information:

postgres=# \d
                     List of relations
  Schema  |        Name        | Type  |  Owner  | Storage
----------+--------------------+-------+---------+---------
 gpexpand | expansion_progress | view  | mxadmin |
 gpexpand | status             | table | mxadmin | heap
 gpexpand | status_detail      | table | mxadmin | heap
(3 rows)

The gpexpand.status_detail table records all tables requiring redistribution and their status:

postgres=# select * from gpexpand.status_detail ;
 dbname  |   fq_name   | table_oid | root_partition_name | rank | external_writable |   status    | expansion_started | expansion_finished | source_bytes
---------+-------------+-----------+---------------------+------+-------------------+-------------+-------------------+--------------------+--------------
 mxadmin | public.t2   |     16388 |                     |    2 | f                 | NOT STARTED |                   |                    |         8192
 mxadmin | public.t1   |     16385 |                     |    2 | f                 | NOT STARTED |                   |                    |        16384
 mxadmin | public.disk |     16391 | public.disk         |    2 | f                 | NOT STARTED |                   |                    |            0
(3 rows)
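gpexpand processes tables in ascending rank order. If MatrixDB follows standard Greenplum behavior here, you can prioritize a table (for example, a large fact table) by lowering its rank before starting redistribution:

```sql
-- Smaller rank = redistributed earlier. Prioritize public.t1.
UPDATE gpexpand.status_detail SET rank = 1 WHERE fq_name = 'public.t1';
```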

Run gpexpand again to start redistribution:

[mxadmin@mdw ~]$ gpexpand
......
gpexpand:mdw:mxadmin-[INFO]:-Querying gpexpand schema for current expansion state
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.t2
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.t2
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.t1
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.t1
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Expanding mxadmin.public.disk
20211102:15:18:27:026291 gpexpand:mdw:mxadmin-[INFO]:-Finished expanding mxadmin.public.disk
20211102:15:18:32:026291 gpexpand:mdw:mxadmin-[INFO]:-EXPANSION COMPLETED SUCCESSFULLY
20211102:15:18:32:026291 gpexpand:mdw:mxadmin-[INFO]:-Exiting...

After redistribution, querying gpexpand.status_detail shows updated status as COMPLETED:

 dbname  |   fq_name   | table_oid | root_partition_name | rank | external_writable |  status   |     expansion_started      |     expansion_finished     | source_bytes
---------+-------------+-----------+---------------------+------+-------------------+-----------+----------------------------+----------------------------+--------------
 mxadmin | public.t2   |     16388 |                     |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.326247 | 2021-11-02 15:18:27.408379 |         8192
 mxadmin | public.t1   |     16385 |                     |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.431481 | 2021-11-02 15:18:27.507591 |        16384
 mxadmin | public.disk |     16391 | public.disk         |    2 | f                 | COMPLETED | 2021-11-02 15:18:27.531727 | 2021-11-02 15:18:27.570559 |            0
(3 rows)
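While redistribution is running, the gpexpand.expansion_progress view (listed in the schema output above) gives a live summary of bytes and tables done versus outstanding:

```sql
SELECT * FROM gpexpand.expansion_progress;
```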

To redistribute multiple tables in parallel, run gpexpand with the -n option (valid values 1-96):

[mxadmin@mdw ~]$ gpexpand -n 4

The -B option instead controls the batch size of remote commands sent during expansion (default 16, maximum 128).

After redistribution completes, run gpexpand -c to clean up intermediate expansion tables:

[mxadmin@mdw ~]$ gpexpand -c
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-local Greenplum Version: 'postgres (MatrixDB) 5.0.0-enterprise~alpha (Greenplum Database) 7.0.0 build dev'
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 12 (MatrixDB 5.0.0-enterprise~alpha) (Greenplum Database 7.0.0 build dev) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5), 64-bit compiled on Oct 25 2021 15:24:16'
20211102:15:24:41:026524 gpexpand:mdw:mxadmin-[INFO]:-Querying gpexpand schema for current expansion state


Do you want to dump the gpexpand.status_detail table to file? Yy|Nn (default=Y):
> n
20211102:15:24:46:026524 gpexpand:mdw:mxadmin-[INFO]:-Removing gpexpand schema
20211102:15:24:46:026524 gpexpand:mdw:mxadmin-[INFO]:-Cleanup Finished.  exiting...

Expansion is now complete.

3. FAQ

  1. Is the catalog consistent between old and new nodes during expansion?
    Yes. The catalog is always consistent. Even if a new node has no user data yet, its catalog is already synchronized.

  2. What impact does expansion have on concurrent queries?
    Since expansion is online, running queries are not interrupted.

    • During node addition, the catalog is briefly locked. DDL operations are blocked during this time, but the lock is released immediately after the new nodes are added.
    • During data redistribution, an exclusive lock is taken on each table as it is moved, blocking reads and writes to that table until its redistribution finishes; tables not currently being redistributed remain accessible.

For detailed usage of gpexpand, see the documentation.