This document describes common problems in cluster deployment.
Graphical interface initialization log:
"error": "execute: do execute: run: initialize_database: 7 errors occurred: *
error execute \"/usr/local/matrixdb-4.5.0.community/bin/initdb\"\n\n STDOUT:
The files belonging to this database system will be owned by user \"mxadmin\".
This user must also own the server process.
The database cluster will be initialized with locale \"en_US.utf8\".\n The default text search configuration will be set to \"english\".
Data page checksums are enabled.
STDERR:
initdb: error: could not access directory \"/data/mxdata_20221104084534/master/mxseg-1\": Permission denied\n * error execute \"/usr/local/matrixdb-4.5.0.community/bin/initdb\"
STDOUT:
The files belonging to this database system will be owned by user \"mxadmin\".
This user must also own the server process.\n\n The database cluster will be initialized with locale \"en_US.utf8\".
The default text search configuration will be set to \"english\".
Data page checksums are enabled.
Problem Analysis
The /data directory is owned by root with mode 700 (drwx------), so only its owner has rwx permissions. The group and other users, including mxadmin, cannot access it.
[root@mdw ~]# ll /
total 36
lrwxrwxrwx. 1 root root 7 Jun 1 19:38 bin -> usr/bin
dr-xr-xr-x. 5 root root 4096 Oct 26 18:28 boot
drwxr-xr-x 20 root root 3200 Oct 26 14:45 dev
drwxr-xr-x. 80 root root 8192 Oct 28 13:53 etc
drwxr-xr-x. 5 root root 8192 Oct 26 18:17 export
drwxr-xr-x. 5 root root 105 Oct 26 18:28 home
drwx------. 5 root root 105 Oct 26 18:28 data
Solution
Change the permissions of the /data directory so that other users can access it.
sudo chmod 755 /data
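When the data directory is nested several levels deep, it can help to print the permissions of every directory on the path, because any parent directory that mxadmin cannot traverse blocks access. The following is a minimal sketch, assuming GNU stat is available; the function name show_path_modes is hypothetical.

```shell
# Print mode, owner and name for the given path and each of its parent directories.
# Assumes GNU coreutils stat (the -c option).
show_path_modes() {
    local path=$1
    while :; do
        stat -c '%A %U:%G %n' "$path"
        # Stop at the filesystem root, or at "." for a relative path.
        if [ "$path" = "/" ] || [ "$path" = "." ]; then
            break
        fi
        path=$(dirname "$path")
    done
}

# Example: show_path_modes /data/mxdata_20221104084534/master
```

Any line in the output that lacks the x bit for "other" users points to a directory that mxadmin cannot enter.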
yum reports an error when installing the matrixdb package: cpio read error
Problem Analysis
The user works on Windows and runs a vm15 virtual machine. The installation package was downloaded on Windows and then dragged into the virtual machine, which truncated the file.
Solution
Use the virtual machine's shared directory mechanism to transfer the file.
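One way to confirm that a transferred package is intact is to compare its checksum on both sides. The following is a minimal sketch, assuming sha256sum is installed in the virtual machine and that the reference checksum was computed on the machine that downloaded the file; verify_sha256 is a hypothetical helper name.

```shell
# Return 0 if the SHA-256 checksum of the file equals the expected value.
verify_sha256() {
    local file=$1 expected=$2
    local actual
    actual=$(sha256sum "$file" | awk '{print $1}')
    [ "$actual" = "$expected" ]
}

# Example (the checksum is a placeholder to be replaced with the real value):
# verify_sha256 matrixdb.rpm "<checksum from the source machine>" || echo "file is damaged"
```

A truncated file produces a different checksum, so the check fails before yum is run.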
An error occurred during initialization:
could not connect to server: No route to host
Is the server running on host "192.168.88.203" and accepting
TCP/IP connections on port 40000?
(seg0 192.168.88.203:40000)
Problem Analysis
On the 203 machine, iptables had been stopped but not disabled, so the firewall started again after the machine was restarted.
The ports are not open by default, so the machines cannot communicate during initialization. The symptom is that initialization hangs and never completes.
Solution
Clear the firewall rules on the 203 machine, then stop and disable the iptables service so that the network is not blocked again after a restart.
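Before retrying the initialization, it may be useful to confirm from the master that the segment port is reachable. The following is a minimal sketch, assuming bash and the coreutils timeout command are installed; port_open is a hypothetical helper name.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so bash must be installed.
port_open() {
    local host=$1 port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example, run from the master:
# port_open 192.168.88.203 40000 && echo reachable || echo blocked
```

The port only answers while a segment process is listening on it, so a failure on an idle host does not by itself prove that the firewall is at fault.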
unknown distribution option:"long_description_content_type'
Problem Analysis
The setuptools version is relatively old.
Solution
sudo python3 -m pip install --upgrade setuptools
Solution
Add the host name, port number and user configuration to the .ssh/config file:
Host mdw
Hostname mdw
Port 29022
User mxadmin
Host sdw1
Hostname sdw1
Port 29022
User mxadmin
Problem Analysis
/etc/hosts contains duplicate entries, for example:
<IP address 1> <hostname 1>
<IP address 1> <hostname 1>
<IP address 2> <hostname 2>
<IP address 2> <hostname 2>
Solution
Delete the extra entries in /etc/hosts:
<IP address 1> <hostname 1>
<IP address 2> <hostname 2>
After this change, initialization completes normally.
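In a long /etc/hosts file, repeated entries are easy to miss. The following is a minimal sketch that lists every entry appearing more than once, ignoring comments and differences in whitespace; the function name duplicate_hosts is hypothetical.

```shell
# Print each host entry that appears more than once in the given file
# (default: /etc/hosts). Comments and blank lines are ignored, and runs of
# whitespace are collapsed so that differently spaced copies still match.
duplicate_hosts() {
    local file=${1:-/etc/hosts}
    awk '{ sub(/#.*/, "") }
         NF { $1 = $1; count[$0]++ }
         END { for (line in count) if (count[line] > 1) print line }' "$file" | sort
}

# Example: duplicate_hosts /etc/hosts
```

An empty result means that no entry is repeated.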
failed to connect to host=mdw user=mxadmin database=postgres: dial error (dial tcp 192.168.247.132:5432: connect: connection refused)
The following error occurred in the graphical interface:
failed to connect to host=mdw user=mxadmin database=postgres: dial error (dial tcp 192.168.247.132:5432: connect: connection refused)
Problem Analysis
MatrixDB was probably installed once before through this browser, and that environment has since been cleaned up. When the graphical interface is loaded again, the /datastream path is added to the URL by default.
For example: http://192.168.247.132:8240/datastream
Solution
Change the datastream keyword in the URL to installer.
For example: http://192.168.247.132:8240/installer
Then use the graphical interface to run the installation again.
When using a graphic interface to install and deploy a MatrixDB cluster, an error was reported for adding nodes:
Host addition failed collect: do collect: unmarshal remote: json: cannot unmarshal string into Go struct field Disk.hardware.disk.ineligibleDesc of type mxi18n.Message
Problem Analysis
The MatrixDB versions installed on the server nodes are inconsistent.
Checking Method
Check the MatrixDB version of each server node in turn.
Check the master node mdw MatrixDB version.
[root@mdw matrixdb]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 12月 9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise
Check the data node sdw1 MatrixDB version.
[root@sdw1 ~]$ ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 12月 22 17:24 /usr/local/matrixdb -> matrixdb-4.6.2.enterprise
Check the data node sdw2 MatrixDB version.
[root@sdw2 ~]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 12月 9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise
Check the data node sdw3 MatrixDB version.
[root@sdw3 ~]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 12月 9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise
Check results
The database version of the sdw1 node is 4.6.2, and the database version of the other nodes is 4.7.5.
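With more than a few nodes, the comparison can be scripted. The following is a minimal sketch, assuming that each input line holds a host name followed by the target of its /usr/local/matrixdb symbolic link, with the master on the first line; the function name find_version_mismatch is hypothetical.

```shell
# Read "host target" lines on stdin. The first line is the reference (master).
# Print every host whose link target differs from the reference.
find_version_mismatch() {
    awk 'NR == 1 { ref = $2; next } $2 != ref { print $1, $2 }'
}

# Example, assuming passwordless ssh to every node:
# for h in mdw sdw1 sdw2 sdw3; do
#     echo "$h $(ssh "$h" readlink /usr/local/matrixdb)"
# done | find_version_mismatch
```

For the nodes shown above, this would report sdw1 with matrixdb-4.6.2.enterprise.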
Solution
Upgrade MatrixDB on the sdw1 node to the same version as the other nodes. The commands are as follows:
Stop Supervisor service.
[root@sdw1 ~]$ systemctl stop matrixdb.supervisor.service
Uninstall the old version of MatrixDB software.
[root@sdw1 ~]$ yum -y remove matrixdb
Install the new version of MatrixDB software.
[root@sdw1 ~]$ yum -y install /home/mxadmin/matrixdb-4.7.5.enterprise-1.el7.x86_64.rpm
Start the Supervisor service.
[root@sdw1 ~]$ systemctl start matrixdb.supervisor.service
Error message
20221223:09:55:10:001626 gpstart:mdw:mxadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /mxdata_20221221165810/master/mxseg-1 -l /mxdata_20221221165810/master/mxseg-1/log/startup.log -w -t 600 -o " -p 5432 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start.... stopped waiting
', stderr='pg_ctl: could not start server
Examine the log output.
'
Problem Analysis
View the log file.
[mxadmin@mdw ~]$ cd /mxdata_20221221165810/master/mxseg-1/log
[mxadmin@mdw log]$ vi startup.log
"FATAL","42501","could not create lock file ""/tmp/.s.PGSQL.5432.lock"": Permission denied",,,,,,,,"CreateLockFile","miscinit.c",994,1 0xd44e33 postgres errstart (elog.c:498)
Check the permissions of the /tmp path. They must be 777, so change them back.
[mxadmin@mdw ~]$ ll / | grep tmp
drw-r-xr-x. 7 root root 8192 12月 23 10:00 tmp
Solution
As the root user, change the /tmp path permissions to 777.
[mxadmin@mdw ~]$ exit
[root@mdw ~]# chmod 777 /tmp
Restart the cluster.
[root@mdw ~]# su - mxadmin
[mxadmin@mdw ~]$ gpstart -a
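The permission check can also be scripted, for example to run it on every host before starting the cluster. The following is a minimal sketch, assuming GNU stat; is_world_writable is a hypothetical helper name. On most Linux distributions /tmp conventionally has mode 1777 (777 plus the sticky bit); the check only looks at the last octal digit, so it accepts both 777 and 1777.

```shell
# Return 0 if "other" users have rwx on the given directory, 1 if not,
# and 2 if the directory cannot be inspected.
is_world_writable() {
    local mode
    mode=$(stat -c '%a' "$1") || return 2
    # The last octal digit holds the permissions for "other" users; 7 means rwx.
    case "$mode" in
        *7) return 0 ;;
        *)  return 1 ;;
    esac
}

# Example: is_world_writable /tmp || echo "/tmp permissions need to be fixed"
```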
Problem Analysis
View the graphical deployment log file.
[mxadmin@mdw ~]$ cd /var/log/matrixdb/
[mxadmin@mdw matrixdb]$ vi mxui.log
[20221223:10:08:43][INFO] id=1; start: system_setup
[20221223:10:08:43][INFO] id=1; done
[20221223:10:08:43][INFO] id=2; start: create_user_and_directories
[20221223:10:08:43][INFO] id=2; done
[20221223:10:08:43][INFO] id=3; start: initialize_database
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; done
[20221223:10:08:45][INFO] id=4; start: launch_matrixdb
[20221223:10:08:45][ERROR] id=4; failed: launch_matrixdb: error execute "/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start"
STDOUT:
waiting for server to start.... stopped waiting
STDERR:
pg_ctl: could not start server
Examine the log output.
[20221223:10:08:45][INFO] id=4; revert start: launch_matrixdb
[20221223:10:08:45][INFO] id=4; revert done
[20221223:10:08:45][INFO] id=3; revert start: initialize_database
[20221223:10:08:45][INFO] id=3; revert done
[20221223:10:08:45][INFO] id=2; revert start: create_user_and_directories
[20221223:10:08:45][INFO] id=2; revert done
[20221223:10:08:45][INFO] id=1; revert start: system_setup
[20221223:10:08:45][INFO] id=1; revert done
{
"error": "execute: do execute: run: launch_matrixdb: error execute \"/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start\"\n\nSTDOUT:\n waiting for server to start.... stopped waiting\n\nSTDERR:\n pg_ctl: could not start server\nExamine the log output.\n"
}
execute: do execute: run: launch_matrixdb: error execute "/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start"
STDOUT:
waiting for server to start.... stopped waiting
STDERR:
pg_ctl: could not start server
Examine the log output.
[GIN] 2022/12/23 - 10:08:45 | 200 | 148.13µs | 192.168.247.2 | GET "/api/installer/log"
The log shows that the graphical deployment had reached the launch_matrixdb step. Reading the surrounding log entries shows that this step runs pg_ctl to start the instance. The instance failed to start, so the whole initialization failed and was rolled back.
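In a long log, the failing step can be located quickly by filtering for the ERROR entries. The following is a minimal sketch, assuming the log line format shown above; the function name failed_steps is hypothetical.

```shell
# Print "<id> <step>" for every failed step recorded in the deployment log
# (default: /var/log/matrixdb/mxui.log).
failed_steps() {
    sed -n 's/^.*\[ERROR\] id=\([0-9]*\); failed: \([a-z_]*\):.*$/\1 \2/p' \
        "${1:-/var/log/matrixdb/mxui.log}"
}

# Example: failed_steps /var/log/matrixdb/mxui.log
```

For the log above, this would print the step id 4 together with launch_matrixdb.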
There are several possible reasons why the instance fails to start. Check them one by one according to the actual situation:
The /tmp permissions are insufficient, so the lock file cannot be created.
Solution
Since the causes of the first three problems are complicated and depend on the specific scenario, they are not described in detail here.
4. As the root user, change the /tmp path permissions to 777.
[root@mdw ~]# chmod 777 /tmp
Error message
[root@sdw4 yum.repos.d]# yum -y install /home/mxadmin/matrixdb-4.7.5.enterprise-1.el7.x86_64.rpm
Loaded plugins: fastestmirror
Examining /home/mxadmin/matrixdb-4.7.5.enterprise-1.el7.x86_64.rpm: matrixdb-4.7.5.enterprise-1.el7.x86_64
Marking /home/mxadmin/matrixdb-4.7.5.enterprise-1.el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package matrixdb.x86_64 0:4.7.5.enterprise-1.el7 will be installed
--> Processing Dependency: sysstat for package: matrixdb-4.7.5.enterprise-1.el7.x86_64
Loading mirror speeds from cached hostfile
--> Finished Dependency Resolution
Error: Package: matrixdb-4.7.5.enterprise-1.el7.x86_64 (/matrixdb-4.7.5.enterprise-1.el7.x86_64)
Requires: sysstat
You could try using --skip-broken to work around the problem
You could try running: rpm -Va --nofiles --nodigest
Problem Analysis
The sysstat package is missing.
Solution
Configure a yum repository, run yum -y install sysstat to install the sysstat tool, and then install the MatrixDB package.
Problem Analysis
After the MatrixDB 4.7.5 installation package is deployed, the supervisor process fails to start. /var/log/messages contains the following exception log.
Dec 22 19:08:59 sdw21 systemd: matrixdb.supervisor.service holdoff time over, scheduling restart.
Dec 22 19:08:59 sdw21 systemd: Stopped MatrixDB Supervisord Daemon.
Dec 22 19:08:59 sdw21 systemd: Started MatrixDB Supervisord Daemon.
Dec 22 19:08:59 sdw21 bash: time="2022-12-22T19:08:59+08:00" level=info msg="load configuration from file" file=/etc/matrixdb/supervisor.conf
Dec 22 19:08:59 sdw21 bash: time="2022-12-22T19:08:59+08:00" level=info msg="load config file over, content "
Dec 22 19:09:09 sdw21 bash: panic: timeout to start gRPC service
Dec 22 19:09:09 sdw21 bash: goroutine 1 [running]:
Dec 22 19:09:09 sdw21 bash: main.runServer()
Dec 22 19:09:09 sdw21 bash: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:151 +0x4f0
Dec 22 19:09:09 sdw21 bash: main.main()
Dec 22 19:09:09 sdw21 bash: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:216 +0x185
Dec 22 19:09:09 sdw21 systemd: matrixdb.supervisor.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 22 19:09:09 sdw21 systemd: Unit matrixdb.supervisor.service entered failed state.
Dec 22 19:09:09 sdw21 systemd: matrixdb.supervisor.service failed.
Dec 22 19:09:14 sdw21 systemd: matrixdb.supervisor.service holdoff time over, scheduling restart.
Dec 22 19:09:14 sdw21 systemd: Stopped MatrixDB Supervisord Daemon.
Dec 22 19:09:14 sdw21 systemd: Started MatrixDB Supervisord Daemon.
Investigation showed that some values in the /etc/sysctl.conf file were too large, which prevented the supervisor from starting. Change them to normal values, as follows.
Solution
Reduce the values in /etc/sysctl.conf to prevent the supervisor from being blocked.
################## value too large, supervisor startup fail
#net.core.rmem_default = 1800262144
#net.core.wmem_default = 1800262144
#net.core.rmem_max = 2000777216
#net.core.wmem_max = 2000777216
##################
net.core.rmem_default = 262144
net.core.wmem_default = 262144
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
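To find such settings on other hosts, the active net.core buffer values can be compared against a limit. The following is a minimal sketch; the function name find_large_buffers is hypothetical, and the default limit of 16777216 is only an assumption taken from the largest replacement value shown above, not a documented threshold.

```shell
# Print "key value" for every active net.core.{r,w}mem_{default,max} setting
# in the given file whose value exceeds the limit (second argument).
find_large_buffers() {
    local file=$1 limit=${2:-16777216}
    awk -F'=' -v limit="$limit" '
        /^[ \t]*#/ { next }    # skip commented-out settings
        $1 ~ /net\.core\.[rw]mem_(default|max)/ {
            gsub(/[ \t]/, "", $1)
            gsub(/[ \t]/, "", $2)
            if ($2 + 0 > limit + 0) print $1, $2
        }' "$file"
}

# Example: find_large_buffers /etc/sysctl.conf
```

After editing /etc/sysctl.conf, running sysctl -p as root reloads the file so that the new values take effect without a reboot.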
Use journalctl --no-pager to capture detailed information, including the crash stack.
Error message
Collection of information failed collect: do collect: hardware: GetDisk: createTempUser: error execute useradd: exit status 1 please create "mxadmin" user manually to workaround this issue useradd: Unable to open /etc/passwd
Problem Analysis
The error message indicates that creating the operating system user mxadmin failed because of a permission problem on the /etc/passwd file.
Solution
Create a test user manually and check the error message.
[root@sdw4 ~]# useradd test1
useradd: cannot open /etc/passwd
Check the permissions of /etc/passwd.
[root@sdw4 ~]# ll /etc/passwd
-rw-r--r-- 1 root root 898 12月 24 01:48 /etc/passwd
The result shows the normal 644 permissions, so there is no problem here.
Check whether /etc/passwd has special attributes.
[root@sdw4 ~]# lsattr /etc/passwd
----i--------- /etc/passwd
The result shows that /etc/passwd has the special attribute "i" (immutable: no user, including root, can modify or delete the file).
Repeat the above steps to check the /etc/group file.
[root@sdw4 ~]# ll /etc/group
-rw-r--r-- 1 root root 460 12月 24 01:48 /etc/group
[root@sdw4 ~]# lsattr /etc/group
----i---------- /etc/group
Remove the special attribute from both /etc/passwd and /etc/group.
[root@sdw4 ~]# chattr -i /etc/passwd
[root@sdw4 ~]# chattr -i /etc/group
Try to create a test user manually again
[root@sdw4 ~]# useradd test1
The user is now created normally.
Delete the test user.
[root@sdw4 ~]# userdel -r test1
Continue installing and deploying MatrixDB using the graphic interface
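The attribute check can be run over all account files at once. The following is a minimal sketch that filters lsattr output for the immutable flag; the function name immutable_files is hypothetical. useradd also writes /etc/shadow and /etc/gshadow, so they are included in the usage example.

```shell
# Read lsattr output on stdin and print the files that carry the
# immutable ("i") attribute.
immutable_files() {
    awk '$1 ~ /i/ { print $2 }'
}

# Example, run as root:
# lsattr /etc/passwd /etc/group /etc/shadow /etc/gshadow | immutable_files
```

Each file printed needs chattr -i before users can be created.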
When initializing with MXUI, the last step reports the following error:
[20221101:14:59:02][ERROR] id=3; failed: initialize_database: error execute "/usr/local/matrixdb-4.7.2.enterprise/bin/initdb"
STDOUT:
The files belonging to this database system will be owned by user "mxadmin".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default text search configuration will be set to "english".
{
"error": "execute: do execute: run: initialize_database: error execute \"/usr/local/matrixdb-4.7.2.enterprise/bin/initdb\"\n\nSTDOUT:\n The files belonging to this database system will be owned by user \"mxadmin\".\nThis user must also own the server process.\n\nThe database cluster will be initialized with locale \"en_US.utf8\".\nThe default text search configuration will be set to \"english\".\n\nData page checksums are enabled.\n\nfixing permissions on existing directory /data/mxdata_20221101121858/primary/mxseg0 ... ok\ncreating subdirectories ... ok\nselecting dynamic shared memory implementation ... posix\nselecting default max_connections ... \nSTDERR:\n initdb: error: initdb: error 256 from: \"/usr/local/matrixdb-4.7.2.enterprise/bin/postgres\" --boot -x0 -F -c max_connections=1500 -c shared_buffers=262144 -c dynamic_shared_memory_type=posix \u003c \"/dev/null\" \u003e \"/dev/null\" 2\u003e\u00261\ninitdb: removing contents of data directory \"/data/mxdata_20221101121858/primary/mxseg0\"\n"
Data page checksums are enabled.
fixing permissions on existing directory /data/mxdata_20221101121858/primary/mxseg0 ... ok
creating subdirectories ... ok
}
selecting dynamic shared memory implementation ... posix
selecting default max_connections ...
STDERR:
initdb: error: initdb: error 256 from: "/usr/local/matrixdb-4.7.2.enterprise/bin/postgres" --boot -x0 -F -c max_connections=1500 -c shared_buffers=262144 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
initdb: removing contents of data directory "/data/mxdata_20221101121858/primary/mxseg0"
Problem Analysis
Solution
Check /etc/hosts.
Use the free -g command to check the memory size.
The graphical client MXUI serves external requests at http://<IP>:8240. To make it directly accessible through a domain name such as http://hostname, you can configure a reverse proxy with Nginx.
For example, to set mxui.ymatrix.cn as the external access address, use the following Nginx configuration.
server
{
    listen 80;
    server_name mxui.ymatrix.cn; # external domain name
    # WebSocket forwarding rules
    location /ws {
        proxy_pass http://127.0.0.1:8240/ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;
    }
    # Web forwarding rules
    location / {
        proxy_pass http://127.0.0.1:8240;
    }
}
Note!
MXUI communicates through the WebSocket API, so a separate forwarding rule must be configured for it to work properly.
The system log /var/log/messages contains the following output:
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:20 sdw37 kernel: nf_conntrack: table full, dropping packet
Problem Analysis
The system parameter net.netfilter.nf_conntrack_max limits connection tracking to 65536 connections by default. This error occurs when the number of connections exceeds that limit. Use the following command to view the current maximum:
cat /proc/sys/net/netfilter/nf_conntrack_max
Solution
Increase the net.netfilter.nf_conntrack_max parameter.
sysctl -w net.netfilter.nf_conntrack_max=655360
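To judge how close a host is to the limit, the current number of tracked connections can be compared with the maximum. The following is a minimal sketch that reads both values from /proc; the function name conntrack_usage is hypothetical, and the optional directory argument exists only so that the function can be tried against sample files.

```shell
# Print the connection tracking table usage as a whole-number percentage.
# Reads nf_conntrack_count and nf_conntrack_max from the given directory
# (default: /proc/sys/net/netfilter).
conntrack_usage() {
    local dir=${1:-/proc/sys/net/netfilter}
    local count max
    count=$(cat "$dir/nf_conntrack_count") || return 1
    max=$(cat "$dir/nf_conntrack_max") || return 1
    echo "$(( count * 100 / max ))"
}

# Example: echo "conntrack table is $(conntrack_usage)% full"
```

A value set with sysctl -w is lost on reboot. To keep it, the same setting can be added to /etc/sysctl.conf as net.netfilter.nf_conntrack_max = 655360.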
Problem Description
After deployment, the graphical interface keeps reporting alarms, but the node status queried from gp_segment_configuration in the database is normal.
Problem Analysis
First, use the gpstate -a command to check the status of the cluster nodes.
If the output information is as follows:
Database status = unkown -- unable to load segment status
Then go to the /home/mxadmin/gpAdminLogs directory and view the log file with a name similar to gpgetstatusingtransition.py_mdw:mxadmin_20230702.log.
In most cases, it can be clearly seen from the error message in the log that the netstat command failed to execute.
Solution
Install the net-tools package, which provides the netstat command.
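A quick check on every host shows whether the command is present before the alarm appears. The following is a minimal sketch; the function name missing_commands is hypothetical.

```shell
# Print each of the given command names that cannot be found in PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Example: missing_commands netstat
```

No output means that every listed command is available.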
LOG: gp_role forced to 'utility' in single-user mode Y.sh: line 1: 11865 Illegal instruction error
Error message
"LogCheckpointEnd","xlog.c",8916, LOG: gp_role forced to 'utility' in single-user mode Y.sh: line 1:
11865 Illegal instruction (core dumped) "/opt/ymatrix/matrixdb-5.0.0+community/bin/postgres" --single -F -O -j -c
gp_role=utility -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null child process exited with exit code 132
initdb: data directory "/mxdata_20231018165815/master/mxseg-1" not removed at user's request * rpc error:
Problem Analysis
The new version of the database uses SIMD instructions for vectorized execution. During installation, the CPU instruction set is checked. If the CPU does not support the required instructions, the above error appears.
Solution
cat /proc/cpuinfo|grep -E "mmx|sse|sse2|ssse3|sse4_1|sse4_2|avx|avx2"
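The grep above prints a line when any one of the listed flags is present, so it does not show which flags are absent. The following is a minimal sketch that checks each flag separately; the function name missing_cpu_flags is hypothetical, and the flag list is simply the one from the grep pattern above, which is an assumption about what the database actually requires.

```shell
# Print each flag from the list that is absent from the first "flags" line
# of the given cpuinfo file (default: /proc/cpuinfo).
missing_cpu_flags() {
    local cpuinfo=${1:-/proc/cpuinfo}
    local flags flag
    flags=$(grep -m1 '^flags' "$cpuinfo" | cut -d: -f2)
    for flag in mmx sse sse2 ssse3 sse4_1 sse4_2 avx avx2; do
        # Pad with spaces so that "sse" does not match inside "sse2".
        case " $flags " in
            *" $flag "*) ;;
            *) echo "$flag" ;;
        esac
    done
}

# Example: missing_cpu_flags
```

An empty result means that all listed flags are reported by the CPU.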
# All ports must be open between the IPs within the cluster; only the service ports are exposed externally.
# Database firewall configuration. This example assumes a database cluster built on three servers: 10.129.38.230, 10.129.38.231 and 10.129.38.232
# Open all ports between all hosts in the cluster, for both TCP and UDP
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" port port="0-65535" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" port port="0-65535" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" port port="0-65535" protocol="tcp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" port port="0-65535" protocol="udp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" port port="0-65535" protocol="udp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" port port="0-65535" protocol="udp" accept'
# Allow ping (ICMP has no ports, so the rule uses a protocol element instead of a port element)
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" protocol value="icmp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" protocol value="icmp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" protocol value="icmp" accept'
# On the master and standby master nodes, expose only the service ports 5432 and 8240 to the public zone
firewall-cmd --zone=public --add-port=5432/tcp --permanent
firewall-cmd --zone=public --add-port=8240/tcp --permanent
# Grafana
firewall-cmd --zone=public --add-port=3000/tcp --permanent
# Reload the firewall to make the permanent configuration take effect
firewall-cmd --reload
systemctl restart firewalld.service
# View firewall rules
firewall-cmd --list-all
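For clusters with more hosts, the per-host rules can be generated in a loop instead of being typed by hand. The following is a minimal sketch that only prints the commands so that they can be reviewed before being run as root; the function name cluster_rules is hypothetical.

```shell
# Print the firewall-cmd commands that open all TCP and UDP ports
# to each of the given cluster host addresses. Nothing is executed.
cluster_rules() {
    local host proto
    for host in "$@"; do
        for proto in tcp udp; do
            printf "firewall-cmd --permanent --add-rich-rule='rule family=\"ipv4\" source address=\"%s\" port port=\"0-65535\" protocol=\"%s\" accept'\n" "$host" "$proto"
        done
    done
}

# Example: cluster_rules 10.129.38.230 10.129.38.231 10.129.38.232
```

After review, the printed commands can be pasted into a root shell, followed by firewall-cmd --reload.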