MatrixDB Operation and Maintenance FAQ

This document describes common problems with MatrixDB operation and maintenance.

1 Forgot your password


Problem Description

Forgot your password using MatrixDB.

Solution

  1. Forgot the database password Reset a new password for the database user.
    =# alter user mxadmin with password 'you password';
  2. Forgot the login password of the graphic interface Check out /etc/matrixdb/auth.conf.
    # sudo cat /etc/matrixdb/auth.conf

    2 ssh_exchange_identification: Connection closed by remote host


Problem Description When using the gpssh command or using the ssh login command, you will encounter the following error:

ssh_exchange_identification: Connection closed by remote host

Problem Analysis

The number of ssh connections exceeds the limit.

Solution

  1. Open the sshd configuration file of the target server
    # vim /etc/ssh/sshd_config
  2. Find the MaxStartup parameter configuration

# MaxStartup 10:30:60

Parameter configuration meaning: MaxStartup triple form 10:30:60 10: When the number of connections reach 10, the connection is rejected, but not all of them are rejected. 30: When the number of connections reaching 10, there is a probability that the subsequent connections will be rejected. 60: When the number of connections reaching 60, all subsequent connections are rejected.

  1. Modify the parameters and remove #, MaxStartup 30:30:60.
  2. Restart the sshd service
    systemctl restart sshd 
    # or
    service sshd restart
  3. Log in again using ssh.

3 psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 192.168.247.129:6005 pid=448116)


log

psql: WARNING:  database "testdb" must be vacuumed within 529926866 transactions  (seg35 10.162.115.5:6005 pid=448116)
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
VACUUM

Problem Analysis

The age of the transaction exceeds the limit.

Solution

$ psql -d testdb -c "vacuum freeze;"

4 could not read block 3684552 of temporary file


Problem Description

When running SQL statements, you will encounter the following error:

could not read block 3684552 of temporary file: read only 0 of 8192 bytes

Problem Analysis

  1. Disk-related problems will occur if the server is insufficient during SQL calculation.

  2. This problem also occurs when disk exceptions are performed during SQL calculations.

Solution

  1. Use the following command to observe the disk usage in real time during SQL calculation.

    du -sh 
  2. Use the journalctl command to find error information about the disk.

    // View the total save size of journalctl logs
    journalctl --disk-usage
    // View logs after 20:20 pm on March 30, 2023
    journalctl --since "2023-03-30 20:20:00"
    // View all logs from 00:00:00 on March 1, 2023 to 00:00:00 on March 31, 2023
    journalctl --since "2023-03-01 00:00:00" --until "2023-03-31 00:00:00"
    // You can use "yesterday", "today", "tomorrow" or "now" to get the time period of logs
    journalctl --since yesterday
    // Get the log after 2023-03-25 Save it to the local disk
    journalctl --since "2023-03-25 00:00:00"  > journalctl.log
    // Find logs about disk errors in journalctl.log log
    cat journalctl.log|grep "I/O"

    5 Can MatrixDB modify the server hostname (Hostname)?


    OK.

  3. Modify /etc/hosts Backup saves the /etc/hosts file.

    cp /etc/hosts /tmp/hosts_bak

    Modify the Hostname of MatrixDB in the /etc/hosts file.

  4. Modify MatrixDB metadata information Switch users.

    $ su - mxadmin

    Close MatrixDB.

    $ gpstop -af

    Start the MatrixDB Master node.

    $ gpstart -m

    Go to the MatrixDB command line and run the command.

    PGOPTIONS='-c gp_session_role=utility' psql -U mxadmin postgres
    set allow_system_table_mods='ture';
    update gp_segment_configuration set hostname=<更改之后的主机名>,address=<更改之后的主机名> where address=<原来的主机名>;

    Close MatrixDB.

    $ gpstop -mf

    Start MatrixDB.

    $ gpstart -m
  5. Modify MatrixDB monitoring information Deploying Grafana monitoring requires the following operations:

    psql -d matrixmgr
    matrixmgr=# SELECT mxmgr_remove_all('local');
    matrixmgr=# truncate local.matrix_manager_config;
    matrixmgr=# SELECT mxmgr_init_local();

    Modify the information related to hostname in dashboard.json. For example, the original hostname was dw1 and after modification it was sdw1, all dw1 in dashboard.json needs to be modified to sdw1 and re-imported Dashboard.

Deploy Prometheus monitoring to perform the following operations

matrixmgr=# SELECT mxmgr_remove_exporter();
matrixmgr=# SELECT mxmgr_remove_gate_exporter();
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_exporter();

6 Database startup error FATAL: database "postgres" does not exist


Problem Analysis

The host is absolute, the cluster startup fails after restart, check the database log, and an error is reported:

FATAL","3D000","database ""postgres"" does not exist","The database subdirectory ""base/13376"" is missing

Check the corresponding file directory and find that the 13376 directory is missing.

Problem recurs

  1. Build a cluster
  2. Close the cluster
    [mxadmin@mdw3 base]$ mxstop -af
  3. Delete or change the 13376 directory
    [mxadmin@mdw3 base]$ mv 13376 13376bak
  4. Start the cluster
    [mxadmin@mdw3 ~]$ mxstart -a
    2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Starting MatrixDB cluster with: [no-prompt:true].
    2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Connecting to physical cluster's etcd cluster ...
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Connected to physical cluster's etcd cluster: [http://192.168.8.28:4679].
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting database cluster ID ...
    2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collected database cluster ID: AuWFhsrjyywC4xfMahgyor
    2023-05-24:23:26:50.271 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting cluster info ...
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master instance parameters
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Host         = mdw3
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Port         = 5432
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Directory    = /mxdata_20230514185455/master/mxseg-1
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Segment instances that will be started
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Host   Port    DataDir                                 Role      
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3   6000    /mxdata_20230514185455/primary/mxseg0   PRIMARY   
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3   6001    /mxdata_20230514185455/primary/mxseg1   PRIMARY   
    2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:50.275 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:50.603 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [1/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:51.605 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:51.608 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:51.748 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [2/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:52.749 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:52.752 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:52.781 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [3/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:53.782 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:53.785 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:53.906 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [4/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:54.907 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:54.912 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:54.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [5/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:55.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:55.936 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:55.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [6/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:56.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:56.961 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:57.085 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [7/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:58.086 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:58.089 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:58.114 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [8/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:26:59.115 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:26:59.118 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:26:59.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:27:00.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
    2023-05-24:23:27:00.143 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
    2023-05-24:23:27:00.299 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
    2023-05-24:23:27:00.301 mxstart:mxadmin:mdw3:017121-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))'. Please confirm database cluster has been deployed.
    Error: rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))
  5. Check the log

Solution

  1. A host exception causes the postgres library file of the Master node to be lost. If there is Standby in this cluster, switch Standby to take over the Master and rebuild Standby.
  2. Without Standby configuration, if only the postgres library file is lost, the library files of other nodes can be copied to the Master node without writing to the business table.

Notes!
It is recommended to configure a complete Standby and Mirror mirroring mechanism to ensure the security and availability of the cluster.

7 ERROR: replication slot "internal_wal_replication_slot" already exists


Error message

Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"

STDERR:
    pg_basebackup: error: could not send replication command "CREATE_REPLICATION_SLOT "internal_wal_replication_slot" PHYSICAL RESERVE_WAL": ERROR:  replication slot "internal_wal_replication_slot" already exists

Solution

  1. Query Segment node link information
    =# SELECT E'PGOPTIONS=\'-c gp_role\=utility\' psql -h ' ||hostname ||' -p ' ||port  FROM gp_segment_configuration where content<>-1;
                       ?column?                       
    ----------------------------------------------------------------------------------------------------------------------------------
    PGOPTIONS='-c gp_role=utility' psql -h sdw3 -p 6000
    PGOPTIONS='-c gp_role=utility' psql -h sdw2 -p 6000
    PGOPTIONS='-c gp_role=utility' psql -h sdw1 -p 6000
    (3 rows)
  2. Log in to the first step to query the output node connection information and query the slot information
    SELECT * FROM pg_replication_slots ;
  3. Log in to the first step to query the output node connection information and delete the slot
    SELECT pg_drop_replication_slot('internal_wal_replication_slot');

    8 ERROR: interconnect Error: Could not set up tcp listener socket


Error message

**** con19047721,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,
**** con19047731,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,
**** con19047723,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,

Problem Analysis

  1. Using journalctl, we found that NetworkManager has been restarting the dhclient service, which will cause network interruption during the restart.

    NetworkManager[2285]: <info>  [1691711564.7553] device (ens10f0): state change: config -> ip-config (rea
    NetworkManager[2285]: <info>  [1691711564.7562] dhcp4 (ens10f0): activation: beginning transaction (time
    NetworkManager[2285]: <info>  [1691711564.7599] dhcp4 (ens10f0): dhclient started with pid 88186
    dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 6 (xid=0x34e8d695)
    dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 12 (xid=0x34e8d695)
  2. In CentOS 7, there will be two services related to network configuration: network and NetworkManager. network is configured through a static network interface, while NetworkManager is a dynamic network to realize network management. When these two services are started at the same time, conflicts will arise. NetworkManager will clean up the route when the network is disconnected, causing the network to be disconnected.

  3. Check the NetStat panel on the monitoring to view connection usage.

Solution

  1. Turn off the NetworkManager service
    systemctl stop NetworkManager
    systemctl disable NetworkManager
  2. Check whether other applications have connections not closed.

9 How to add whitelist before using the mxshift tool for migration?


Add the IP segment of the target database to the pg_hba file of all servers where the source database is located.

10 index ron requires 19160 bvtes, maximum size is 8191


Error message

java. lang.RuntimeException: org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 Internal Server Error:
[ERROR: index row requires 19160 bytes, maximum size is 8191 (seg35 10.17.8.132:6012 pid=15508) (SQLSTATE 54000)]

Problem Analysis

The number of row limit for the btree index was inserted.

postgres=# CREATE TABLE test ( x text ) DISTRIBUTED BY(x);
CREATE TABLE
Time: 26.768 ms

postgres=# INSERT INTO test(x) VALUES ( repeat('x', 900000) );
INSERT 0 1
Time: 16.707 ms

postgres=# CREATE INDEX test_x ON test USING btree(x);
ERROR:  index row requires 10328 bytes, maximum size is 8191  (seg0 172.16.172.148:6000 pid=21674)
Time: 16.257 ms

postgres=# CREATE INDEX test_x ON test USING brin(x);
CREATE INDEX
Time: 19.866 ms

Solution

Reduce the length of the field or replace other indexes.