This document describes common problems with YMatrix operation and maintenance.
Problem Description
Forgot your password using YMatrix.
Solution
=# alter user mxadmin with password 'you password';
/etc/matrixdb/auth.conf
.# sudo cat /etc/matrixdb/auth.conf
ssh_exchange_identification: Connection closed by remote host
Problem Description
When using the gpssh
command or using the ssh login command, you will encounter the following error:
ssh_exchange_identification: Connection closed by remote host
Problem Analysis
The number of ssh connections exceeds the limit.
Solution
# vim /etc/ssh/sshd_config
# MaxStartup 10:30:60
Parameter configuration meaning:
MaxStartup
triple form 10:30:60
10: When the number of connections reach 10, the connection is rejected, but not all of them are rejected.
30: When the number of connections reaching 10, there is a probability that the subsequent connections will be rejected.
60: When the number of connections reaching 60, all subsequent connections are rejected.
#
, MaxStartup 30:30:60
.systemctl restart sshd
# or
service sshd restart
psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 192.168.247.129:6005 pid=448116)
log
psql: WARNING: database "testdb" must be vacuumed within 529926866 transactions (seg35 10.162.115.5:6005 pid=448116)
HINT: To avoid a database shutdown, execute a database-wide VACUUM in that database.
You might also need to commit or roll back old prepared transactions, or drop stale replication slots.
VACUUM
Problem Analysis
The age of the transaction exceeds the limit.
Solution
$ psql -d testdb -c "vacuum freeze;"
could not read block 3684552 of temporary file
Problem Description
When running SQL statements, you will encounter the following error:
could not read block 3684552 of temporary file: read only 0 of 8192 bytes
Problem Analysis
Disk-related problems will occur if the server is insufficient during SQL calculation.
This problem also occurs when disk exceptions are performed during SQL calculations.
Solution
Use the following command to observe the disk usage in real time during SQL calculation.
du -sh
Use the journalctl
command to find error information about the disk.
// View the total save size of journalctl logs
journalctl --disk-usage
// View logs after 20:20 pm on March 30, 2023
journalctl --since "2023-03-30 20:20:00"
// View all logs from 00:00:00 on March 1, 2023 to 00:00:00 on March 31, 2023
journalctl --since "2023-03-01 00:00:00" --until "2023-03-31 00:00:00"
// You can use "yesterday", "today", "tomorrow" or "now" to get the time period of logs
journalctl --since yesterday
// Get the log after 2023-03-25 Save it to the local disk
journalctl --since "2023-03-25 00:00:00" > journalctl.log
// Find logs about disk errors in journalctl.log log
cat journalctl.log|grep "I/O"
OK.
Modify /etc/hosts
Backup saves the /etc/hosts
file.
cp /etc/hosts /tmp/hosts_bak
Modify the Hostname of YMatrix in the /etc/hosts
file.
Modify YMatrix metadata information Switch users.
$ su - mxadmin
Close YMatrix.
$ mxstop -af
Start the YMatrix Master node.
$ mxstart -m
Go to the YMatrix command line and run the command.
PGOPTIONS='-c gp_session_role=utility' psql -U mxadmin postgres
set allow_system_table_mods='ture';
update gp_segment_configuration set hostname=<Host name after change>,address=<Host name after change> where address=<original host name>;
Close YMatrix.
$ mxstop -mf
Start YMatrix.
$ mxstart -m
Modify YMatrix monitoring information Deploying Grafana monitoring requires the following operations:
psql -d matrixmgr
matrixmgr=# SELECT mxmgr_remove_all('local');
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_local();
Modify the information related to Hostname in dashboard.json
.
For example, the original Hostname was dw1 and after modification it was sdw1, you need to modify all dw1 in dashboard.json
to sdw1 and re-import Dashboard.
Deploy Prometheus monitoring to perform the following operations
matrixmgr=# SELECT mxmgr_remove_exporter();
matrixmgr=# SELECT mxmgr_remove_gate_exporter();
matrixmgr=# truncate local.matrix_manager_config;
matrixmgr=# SELECT mxmgr_init_exporter();
FATAL: database "postgres" does not exist
Problem Analysis
The host is absolute, the cluster startup fails after restart, check the database log, and an error is reported:
FATAL","3D000","database ""postgres"" does not exist","The database subdirectory ""base/13376"" is missing
Check the corresponding file directory and find that the 13376
directory is missing.
Problem recurs
[mxadmin@mdw3 base]$ mxstop -af
13376
directory[mxadmin@mdw3 base]$ mv 13376 13376bak
[mxadmin@mdw3 ~]$ mxstart -a
2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Starting MatrixDB cluster with: [no-prompt:true].
2023-05-24:23:26:50.264 mxstart:mxadmin:mdw3:017121-[INFO]:-Connecting to physical cluster's etcd cluster ...
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Connected to physical cluster's etcd cluster: [http://192.168.8.28:4679].
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting database cluster ID ...
2023-05-24:23:26:50.269 mxstart:mxadmin:mdw3:017121-[INFO]:-Collected database cluster ID: AuWFhsrjyywC4xfMahgyor
2023-05-24:23:26:50.271 mxstart:mxadmin:mdw3:017121-[INFO]:-Collecting cluster info ...
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master instance parameters
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Host = mdw3
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Port = 5432
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Master Directory = /mxdata_20230514185455/master/mxseg-1
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Segment instances that will be started
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:----------------------------------------
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Host Port DataDir Role
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6000 /mxdata_20230514185455/primary/mxseg0 PRIMARY
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-mdw3 6001 /mxdata_20230514185455/primary/mxseg1 PRIMARY
2023-05-24:23:26:50.272 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:50.275 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:50.603 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [1/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:51.605 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:51.608 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:51.748 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [2/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:52.749 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:52.752 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:52.781 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [3/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:53.782 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:53.785 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:53.906 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [4/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:54.907 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:54.912 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:54.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [5/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:55.933 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:55.936 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:55.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [6/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:56.958 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:56.961 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:57.085 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [7/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:58.086 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:58.089 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:58.114 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [8/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:26:59.115 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:26:59.118 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:26:59.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:27:00.140 mxstart:mxadmin:mdw3:017121-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-24:23:27:00.143 mxstart:mxadmin:mdw3:017121-[INFO]:-Send start to cluster service
2023-05-24:23:27:00.299 mxstart:mxadmin:mdw3:017121-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000)).
2023-05-24:23:27:00.301 mxstart:mxadmin:mdw3:017121-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = query segments_configuration on master failed failed to connect to `host=mdw3 user=mxadmin database=postgres`: server error (FATAL: database "postgres" does not exist (SQLSTATE 3D000))
Solution
postgres
library file of the Master node to be lost. If there is Standby in this cluster, switch Standby to take over the Master and rebuild Standby.Notes!
It is recommended to configure a complete Standby and Mirror [mirror mechanism] (/doc/latest/reference/mirror_distribution_policy) to ensure the safe and available clusters.
rpc error: code = Unknown desc = primary not found in shard state
Problem Description
Error when executing mxstart -a
:
2023-05-31:16:03:25.325 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [9/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.326 mxstart:mxadmin:mdw:026098-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait ...
2023-05-31:16:03:26.335 mxstart:mxadmin:mdw:026098-[INFO]:-Send start to cluster service
2023-05-31:16:03:26.716 mxstart:mxadmin:mdw:026098-[INFO]:-Retry start for [10/10], caused by rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state.
2023-05-31:16:03:26.721 mxstart:mxadmin:mdw:026098-[ERROR]:-MatrixDB cluster start failed: 'rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state'. Please confirm database cluster has been deployed.
Error: rpc error: code = Unknown desc = start shard 1, rpc error: code = Unknown desc = primary not found in shard state
Problem Analysis
Solution
mxpacklogs
tool to collect logs and send them to our pre-sales/after-sales personnel for analysis.Error message
[mxadmin@mdw ~]$ mxaddmirrors setup --plan-file /tmp/mxaddmirrors.plan --mode cli
[20230606:14:37:04][INFO] start: create_user_and_directories on 3 hosts
[20230606:14:37:04][INFO] done
[20230606:14:37:05][INFO] start: initialize_deployer on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: reg_mirrors on 1 host
[20230606:14:37:06][INFO] done
[20230606:14:37:06][INFO] start: init_mirrors on 3 hosts
[20230606:14:37:06][ERROR] init_mirrors: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
[20230606:14:37:06][INFO] revert start: init_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: reg_mirrors on hosts
[20230606:14:37:06][INFO] revert done
[20230606:14:37:06][INFO] revert start: initialize_deployer on hosts
[20230606:14:37:13][INFO] revert done
[20230606:14:37:13][INFO] revert start: create_user_and_directories on 3 hosts
[20230606:14:37:13][INFO] revert done
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not connect to server: FATAL: no pg_hba.conf entry for replication connection from host "192.168.247.131", user "mxadmin", SSL off
Solution
pg_hba.conf
file on all segment
nodes and add whitelist permissionsfor i in `find /mxdata_20230606142513/ -name pg_hba.conf` ; do echo "host all all 0.0.0.0/0 md5" >>$i; done
mxstop -u
ERROR: replication slot "internal_wal_replication_slot" already exists
Error message
Error: do execute: deploy rpc: rpc error: code = Unknown desc = deployer execute: rpc error: code = Unknown desc = error execute "/opt/ymatrix/matrixdb5/bin/pg_basebackup"
STDERR:
pg_basebackup: error: could not send replication command "CREATE_REPLICATION_SLOT "internal_wal_replication_slot" PHYSICAL RESERVE_WAL": ERROR: replication slot "internal_wal_replication_slot" already exists
Solution
=# SELECT E'PGOPTIONS=\'-c gp_role\=utility\' psql -h ' ||hostname ||' -p ' ||port FROM gp_segment_configuration where content<>-1;
?column?
----------------------------------------------------------------------------------------------------------------------------------
PGOPTIONS='-c gp_role=utility' psql -h sdw3 -p 6000
PGOPTIONS='-c gp_role=utility' psql -h sdw2 -p 6000
PGOPTIONS='-c gp_role=utility' psql -h sdw1 -p 6000
(3 rows)
slot
informationSELECT * FROM pg_replication_slots ;
slot
SELECT pg_drop_replication_slot('internal_wal_replication_slot');
ERROR: interconnect Error: Could not set up tcp listener socket
Error message
**** con19047721,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,
**** con19047731,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,
**** con19047723,,seg14,,sx1,"FATAL","58M01","interconnect Error: Could not set up tcp listener socket","bind: Address already in use",,,,0,,"ic_tcp.c",293,
Problem Analysis
Using journalctl
, we found that NetworkManager
has been restarting the dhclient
service, which will cause network interruption during the restart.
NetworkManager[2285]: <info> [1691711564.7553] device (ens10f0): state change: config -> ip-config (rea
NetworkManager[2285]: <info> [1691711564.7562] dhcp4 (ens10f0): activation: beginning transaction (time
NetworkManager[2285]: <info> [1691711564.7599] dhcp4 (ens10f0): dhclient started with pid 88186
dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 6 (xid=0x34e8d695)
dhclient[88186]: DHCPDISCOVER on ens10f0 to 255.255.255.255 port 67 interval 12 (xid=0x34e8d695)
In CentOS 7, there are two services related to network configuration, network
and NetworkManager
, which are configured through a static network interface, while NetworkManager
is a dynamic network to implement network management. However, when these two services are started at the same time, conflicts will arise. NetworkManager
will clean up the route when the network is disconnected, causing the network to be disconnected.
Check the NetStat panel on the monitoring to view connection usage.
Solution
NetworkManager
servicesystemctl stop NetworkManager
systemctl disable NetworkManager
Just configure the target IP segment to the pg_hba
file on the source master machine.
disable-connector = false
is not configured, a random port will be generated for the corresponding Primary instance on the Segment node of the source-side database; if configured as true
, the system will allocate based on the Primary instance ports configured when the source-side cluster is initialized.export MXSHIFT_WORKER_PORT_BASE = <port number>
, and you also need to specify the compression method in the configuration file --compress-method
. Turning on compression will enable the execution log to display the port range after mxshift is executed again. The maximum value in the range varies according to the degree of concurrency and the number of segments.index ron requires 19160 bvtes, maximum size is 8191
Error message
java. lang.RuntimeException: org.springframework.web.client.HttpServerErrorException$InternalServerError: 500 Internal Server Error:
[ERROR: index row requires 19160 bytes, maximum size is 8191 (seg35 10.17.8.132:6012 pid=15508) (SQLSTATE 54000)]
Problem Analysis
The number of row limit for the btree
index was inserted.
postgres=# CREATE TABLE test ( x text ) DISTRIBUTED BY(x);
CREATE TABLE
Time: 26.768 ms
postgres=# INSERT INTO test(x) VALUES ( repeat('x', 900000) );
INSERT 0 1
Time: 16.707 ms
postgres=# CREATE INDEX test_x ON test USING btree(x);
ERROR: index row requires 10328 bytes, maximum size is 8191 (seg0 172.16.172.148:6000 pid=21674)
Time: 16.257 ms
postgres=# CREATE INDEX test_x ON test USING brin(x);
CREATE INDEX
Time: 19.866 ms
Solution
Reduce the length of the field or replace other indexes.