FAQs on Cluster Deployment

This document describes common problems in cluster deployment.


1 error: could not access directory \"/data/mxdata_20221104084534/master/mxseg-1\": Permission denied


MXUI initialization log:

"error": "execute: do execute: run: initialize_database: 7 errors occurred:
* error execute "/usr/local/matrixdb-4.5.0.community/bin/initdb"

  STDOUT:
    The files belonging to this database system will be owned by user "mxadmin".
    This user must also own the server process.
    The database cluster will be initialized with locale "en_US.utf8".
    The default text search configuration will be set to "english".
    Data page checksums are enabled.

  STDERR:
    initdb: error: could not access directory "/data/mxdata_20221104084534/master/mxseg-1": Permission denied

* error execute "/usr/local/matrixdb-4.5.0.community/bin/initdb"

  STDOUT:
    The files belonging to this database system will be owned by user "mxadmin".
    This user must also own the server process.
    The database cluster will be initialized with locale "en_US.utf8".
    The default text search configuration will be set to "english".
    Data page checksums are enabled.

Problem Analysis

Only the owner of the data directory has rwx permissions, and the group and other users do not have access rights.

[root@mdw ~]# ll /
total 36
lrwxrwxrwx.   1 root    root       7 Jun  1 19:38 bin -> usr/bin
dr-xr-xr-x.   5 root    root    4096 Oct 26 18:28 boot
drwxr-xr-x   20 root    root    3200 Oct 26 14:45 dev
drwxr-xr-x.  80 root    root    8192 Oct 28 13:53 etc
drwxr-xr-x.   5 root    root    8192 Oct 26 18:17 export
drwxr-xr-x.   5 root    root     105 Oct 26 18:28 home
drwx------.   5 root    root     105 Oct 26 18:28 data

Solution

Modify the data directory permissions:

sudo chmod 755 /data 


2 yum reports a cpio read error when installing the matrixdb package


Problem Analysis

The host operating system is Windows, running a vm15 (VMware Workstation 15) virtual machine. The installation package was downloaded on Windows and then dragged and dropped into the virtual machine, which truncated the file.

Solution

Transfer the file using the VMware shared folder mechanism instead.
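To catch a truncated transfer before installing, compare checksums on both sides. A minimal sketch (the file here is a stand-in for the real rpm):

```shell
# Compute a SHA-256 digest of the package on the source machine.
# /tmp/pkg.rpm is a demo file standing in for the downloaded rpm.
printf 'demo-package-bytes' > /tmp/pkg.rpm
sha256sum /tmp/pkg.rpm
# Run the same command on the VM after the transfer and compare the digests;
# a mismatch means the file was truncated or corrupted in transit.
```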


3 could not connect to server: No route to host


An error occurred during initialization:

could not connect to server: No route to host
 Is the server running on host "192.168.88.203" and accepting
 TCP/IP connections on port 40000?
 (seg0 192.168.88.203:40000)

Problem Analysis

iptables had been stopped on the 203 machine, but the service was not disabled, so the firewall started again after the machine rebooted. The required ports are not open by default, which prevents the hosts from communicating during initialization. The symptom is that initialization hangs and never completes.

Solution

Clear the firewall rules on the 203 machine, then stop the iptables service and disable it, so that the firewall does not come back up after a reboot.
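On a systemd-based system the fix can be sketched as follows (assuming the iptables service is in use; adapt if the machine runs firewalld instead):

```shell
# On the 203 machine, as root:
iptables -F                  # clear the current firewall rules
systemctl stop iptables      # stop the firewall service now
systemctl disable iptables   # prevent it from starting again after a reboot
```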


4 setuptools reports an unsupported option: unknown distribution option: "long_description_content_type"


Problem Analysis

The installed setuptools version is too old to recognize this option.

Solution

sudo python3 -m pip install --upgrade setuptools


5 The SSH port is not the default port 22


Solution

Add the host name, port number, and user configuration to the ~/.ssh/config file:

Host mdw
   Hostname mdw
   Port 29022
   User mxadmin
Host sdw1
   Hostname sdw1
   Port 29022
   User mxadmin


6 Graphical interface initialization error: ping <hostname 1> error: lookup multiple ip: <IP address 1>, <IP address 1>; ping <hostname 2> error: lookup multiple ip: <IP address 2>, <IP address 2>


Problem Analysis

Duplicate entries exist in /etc/hosts, for example:

<IP address 1> <hostname 1>
<IP address 1> <hostname 1>
<IP address 2> <hostname 2>
<IP address 2> <hostname 2>
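Duplicate lines can be spotted quickly with a sort | uniq -d pipeline; a sketch that ignores comments and blank lines:

```shell
# Print host entries that appear more than once in /etc/hosts;
# any output indicates redundant lines that should be removed.
grep -v '^[[:space:]]*#' /etc/hosts | sed '/^[[:space:]]*$/d' | sort | uniq -d
```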

Solution

Delete the extra entries in /etc/hosts:

<IP address 1> <hostname 1>
<IP address 2> <hostname 2>

After the modification, initialization completes normally.


7 Graphical deployment of YMatrix fails: failed to connect to host=mdw user=mxadmin database=postgres: dial error (dial tcp 192.168.247.132:5432: connect: connection refused)


The graphical interface reported the following error: failed to connect to host=mdw user=mxadmin database=postgres: dial error (dial tcp 192.168.247.132:5432: connect: connection refused)

Problem Analysis

You have probably installed YMatrix through the browser once before, and that YMatrix environment has since been cleaned up. When the graphical interface is loaded again, the /datastream path is appended to the URL by default. For example: http://192.168.247.132:8240/datastream

Solution

Change the datastream keyword in the URL to installer, for example: http://192.168.247.132:8240/installer, then use the graphical interface again to perform the installation.


8 Failed to add host collect: do collect: unmarshal remote: json: cannot unmarshal string into Go struct field Disk.hardware.disk.ineligibleDesc of type mxi18n.Message


When installing and deploying a YMatrix cluster using the graphical interface, adding a node reported an error:

Host addition failed collect: do collect: unmarshal remote: json: cannot unmarshal string into Go struct field Disk.hardware.disk.ineligibleDesc of type mxi18n.Message

Problem Analysis

The YMatrix versions installed on the server nodes are inconsistent.

Checking Method

Check the YMatrix version of each server node in turn.

Check the YMatrix version on the master node mdw.

[root@mdw matrixdb]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 Dec  9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise

Check the YMatrix version on the data node sdw1.

[root@sdw1 ~]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 Dec 22 17:24 /usr/local/matrixdb -> matrixdb-4.6.2.enterprise

Check the YMatrix version on the data node sdw2.

[root@sdw2 ~]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 Dec  9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise

Check the YMatrix version on the data node sdw3.

[root@sdw3 ~]# ll /usr/local/matrixdb
lrwxrwxrwx 1 root root 25 Dec  9 18:02 /usr/local/matrixdb -> matrixdb-4.7.5.enterprise

Check results

The database version of the sdw1 node is 4.6.2, and the database version of the other nodes is 4.7.5.

Solution

Upgrade the database on the sdw1 node to the same version as the other nodes. The commands are as follows.

Stop the Supervisor service.

[root@sdw1 ~]# systemctl stop matrixdb.supervisor.service

Uninstall the old version of the YMatrix software.

[root@sdw1 ~]# yum -y remove matrixdb

Install the new version of the YMatrix software.

[root@sdw1 ~]# yum -y install /home/mxadmin/matrixdb-4.7.5.enterprise-1.el7.x86_64.rpm

Start the Supervisor service.

[root@sdw1 ~]# systemctl start matrixdb.supervisor.service


9 Cluster startup error


Error message

20221223:09:55:10:001626 mxstart:mdw:mxadmin-[CRITICAL]:-Error occurred: non-zero rc: 1
 Command was: 'env GPSESSID=0000000000 GPERA=None $GPHOME/bin/pg_ctl -D /mxdata_20221221165810/master/mxseg-1 -l /mxdata_20221221165810/master/mxseg-1/log/startup.log -w -t 600 -o " -p 5432 -c gp_role=utility " start'
rc=1, stdout='waiting for server to start.... stopped waiting
', stderr='pg_ctl: could not start server
Examine the log output.
'

Problem Analysis

View the log file.

[mxadmin@mdw ~]$ cd /mxdata_20221221165810/master/mxseg-1/log
[mxadmin@mdw log]$ vi startup.log
"FATAL","42501","could not create lock file ""/tmp/.s.PGSQL.5432.lock"": Permission denied",,,,,,,,"CreateLockFile","miscinit.c",994,1    0xd44e33 postgres errstart (elog.c:498)

Check the permissions of the /tmp path. /tmp must be world-writable with the sticky bit set (mode 1777); here the permissions are wrong, so restore them.

[mxadmin@mdw ~]$ ll / | grep tmp
drw-r-xr-x.   7 root    root    8192 Dec 23 10:00 tmp

Solution

As the root user, change the /tmp path permissions back to 1777.

[mxadmin@mdw ~]$ exit
[root@mdw ~]# chmod 1777 /tmp

Restart the cluster.

[root@mdw ~]# su - mxadmin
[mxadmin@mdw ~]$ mxstart -a
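To confirm the fix took effect, print the octal mode of /tmp (a sketch assuming GNU coreutils stat); on a healthy system it is world-writable, normally with the sticky bit as well (1777):

```shell
# Print the permission bits of /tmp in octal; expect 1777
# (world-writable plus the sticky bit)
stat -c '%a' /tmp
```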


10 Graphical deployment YMatrix error: Optimize operating system configuration... Revoked { "error": "execute: do execute: run: launch_matrixdb: error execute \"/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start\"\n\nSTDOUT:\n waiting for server to start.... stopped waiting\n\nSTDERR:\n pg_ctl: could not start server\nExamine the log output.\n" }


Problem Analysis

View the graphical deployment log file.

[mxadmin@mdw ~]$ cd /var/log/matrixdb/
[mxadmin@mdw matrixdb]$ vi mxui.log 

[20221223:10:08:43][INFO] id=1; start: system_setup
[20221223:10:08:43][INFO] id=1; done
[20221223:10:08:43][INFO] id=2; start: create_user_and_directories
[20221223:10:08:43][INFO] id=2; done
[20221223:10:08:43][INFO] id=3; start: initialize_database
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:44][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; running: 6%
[20221223:10:08:45][INFO] id=3; done
[20221223:10:08:45][INFO] id=4; start: launch_matrixdb
[20221223:10:08:45][ERROR] id=4; failed: launch_matrixdb: error execute "/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start"
STDOUT:
    waiting for server to start.... stopped waiting
STDERR:
    pg_ctl: could not start server
Examine the log output.
[20221223:10:08:45][INFO] id=4; revert start: launch_matrixdb
[20221223:10:08:45][INFO] id=4; revert done
[20221223:10:08:45][INFO] id=3; revert start: initialize_database
[20221223:10:08:45][INFO] id=3; revert done
[20221223:10:08:45][INFO] id=2; revert start: create_user_and_directories
[20221223:10:08:45][INFO] id=2; revert done
[20221223:10:08:45][INFO] id=1; revert start: system_setup
[20221223:10:08:45][INFO] id=1; revert done
{
  "error": "execute: do execute: run: launch_matrixdb: error execute \"/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start\"\n\nSTDOUT:\n    waiting for server to start.... stopped waiting\n\nSTDERR:\n    pg_ctl: could not start server\nExamine the log output.\n"
}
execute: do execute: run: launch_matrixdb: error execute "/usr/local/matrixdb-4.7.5.enterprise/bin/pg_ctl -w -l /mxdata_20221223100549/master/mxseg-1/log/startup.log -D /mxdata_20221223100549/master/mxseg-1 -o -i -p 5432 -c gp_role=utility -m start"

STDOUT:
    waiting for server to start.... stopped waiting

STDERR:
    pg_ctl: could not start server
Examine the log output.

[GIN] 2022/12/23 - 10:08:45 | 200 |      148.13µs |   192.168.247.2 | GET      "/api/installer/log"

The graphical deployment had progressed to the launch_matrixdb step. Locating the relevant operations in the log and reading the context shows that the failure occurred while pg_ctl was starting the instance: the instance failed to start, which caused the entire initialization to fail and roll back.

There are several possible causes for an instance failing to start; check them one by one against the actual situation:

  1. The CPU load is too high.
  2. Memory usage is too high, and the remaining memory is insufficient to start the instance.
  3. The network is unstable.
  4. The /tmp permissions are insufficient, so the lock file cannot be created.

Solution

The causes of the first three problems are complex and must be analyzed case by case; they are not described in detail here.

As the root user, change the /tmp path permissions back to 1777 (world-writable with the sticky bit).

[root@mdw ~]# chmod 1777 /tmp


11 Installing the YMatrix package fails due to missing dependencies


Error message

[root@sdw4 yum.repos.d]# yum -y install /home/mxadmin/matrixdb5-5.0.0.enterprise-1.el7.x86_64.rpm 
Loaded plugins: fastestmirror
Examining /home/mxadmin/matrixdb5-5.0.0.enterprise-1.el7.x86_64.rpm: matrixdb5-5.0.0.enterprise-1.el7.x86_64
Marking /home/mxadmin/matrixdb5-5.0.0.enterprise-1.el7.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package matrixdb5.x86_64 0:5.0.0.enterprise-1.el7 will be installed
--> Processing Dependency: sysstat for package: matrixdb5-5.0.0.enterprise-1.el7.x86_64
Loading mirror speeds from cached hostfile
--> Finished Dependency Resolution
Error: Package: matrixdb5-5.0.0.enterprise-1.el7.x86_64 (/matrixdb5-5.0.0.enterprise-1.el7.x86_64)
          Requires: sysstat
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Problem Analysis

The sysstat package is missing.

Solution

Configure the yum repository, install the sysstat tool with yum -y install sysstat, and then install the YMatrix package.


12 Supervisor fails to start due to abnormal kernel parameter configuration: "panic: timeout to start gRPC service"


Problem Analysis

After the YMatrix deployment completes, the Supervisor process fails to start properly. Checking /var/log/messages reveals the following exception log:

Dec 22 19:08:59 sdw21 systemd: matrixdb.supervisor.service holdoff time over, scheduling restart.
Dec 22 19:08:59 sdw21 systemd: Stopped MatrixDB Supervisord Daemon.
Dec 22 19:08:59 sdw21 systemd: Started MatrixDB Supervisord Daemon.
Dec 22 19:08:59 sdw21 bash: time="2022-12-22T19:08:59+08:00" level=info msg="load configuration from file" file=/etc/matrixdb/supervisor.conf
Dec 22 19:08:59 sdw21 bash: time="2022-12-22T19:08:59+08:00" level=info msg="load config file over, content "
Dec 22 19:09:09 sdw21 bash: panic: timeout to start gRPC service
Dec 22 19:09:09 sdw21 bash: goroutine 1 [running]:
Dec 22 19:09:09 sdw21 bash: main.runServer()
Dec 22 19:09:09 sdw21 bash: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:151 +0x4f0
Dec 22 19:09:09 sdw21 bash: main.main()
Dec 22 19:09:09 sdw21 bash: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:216 +0x185
Dec 22 19:09:09 sdw21 systemd: matrixdb.supervisor.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 22 19:09:09 sdw21 systemd: Unit matrixdb.supervisor.service entered failed state.
Dec 22 19:09:09 sdw21 systemd: matrixdb.supervisor.service failed.
Dec 22 19:09:14 sdw21 systemd: matrixdb.supervisor.service holdoff time over, scheduling restart.
Dec 22 19:09:14 sdw21 systemd: Stopped MatrixDB Supervisord Daemon.
Dec 22 19:09:14 sdw21 systemd: Started MatrixDB Supervisord Daemon.

Investigation found that the values of several variables in the /etc/sysctl.conf file were too large, which kept the Supervisor from running properly. Change them back to normal values, as follows.

Solution

  1. Reduce the kernel parameter configuration in /etc/sysctl.conf so that the Supervisor is not blocked
    ################## value too large, supervisor startup fail
    #net.core.rmem_default = 1800262144
    #net.core.wmem_default = 1800262144
    #net.core.rmem_max = 2000777216
    #net.core.wmem_max = 2000777216
    ############
    net.core.rmem_default = 262144
    net.core.wmem_default = 262144
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
  2. If the Supervisor fails or cannot start, use journalctl --no-pager to capture detailed information, including the crash stack.
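After reducing the values in /etc/sysctl.conf, reload the kernel parameters so the change takes effect, then restart the service (run as root):

```shell
sysctl -p                                      # reload /etc/sysctl.conf
systemctl restart matrixdb.supervisor.service  # restart the Supervisor
```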


13 YMatrix installation fails while collecting information


Error message

Collection of information failed collect: do collect: hardware: GetDisk: createTempUser: error execute useradd: exit status 1 please create "mxadmin" user manually to workaround this issue useradd: Unable to open /etc/passwd

Problem Analysis

Analysis of the error message shows that a permission problem with the /etc/passwd file caused creation of the operating system user mxadmin to fail.

Solution

  1. Create a test user manually and view the error message

    [root@sdw4 ~]# useradd test1
    useradd: cannot open /etc/passwd
  2. Check the /etc/passwd permissions

    [root@sdw4 ~]# ll /etc/passwd
    -rw-r--r-- 1 root root 898 Dec 24 01:48 /etc/passwd

    The result shows normal 644 permissions; nothing wrong here.

  3. Check whether /etc/passwd has special attributes set

    [root@sdw4 ~]# lsattr /etc/passwd
    ----i--------- /etc/passwd

    The result shows that the /etc/passwd file has the immutable attribute "i" set (no user, including root, may modify or delete the file).

  4. Repeat the above steps to check the /etc/group file

    [root@sdw4 ~]# ll /etc/group
    -rw-r--r-- 1 root root 460 Dec 24 01:48 /etc/group
    [root@sdw4 ~]# lsattr /etc/group
    ----i---------- /etc/group
  5. Remove the immutable attribute from the two files /etc/passwd and /etc/group

    [root@sdw4 ~]# chattr -i /etc/passwd
    [root@sdw4 ~]# chattr -i /etc/group
  6. Try to create a test user manually again

    [root@sdw4 ~]# useradd test1

    As a result, users can be created normally.

  7. Delete the test user

    [root@sdw4 ~]# userdel -r test1
  8. Continue installing and deploying YMatrix using the graphical interface


14 Initialization failed, reported failed: initialize_database: error execute "/opt/ymatrix/matrixdb-5.0.0+enterprise/bin/initdb"


When initializing with MXUI, the last step reports the following error:

[20221101:14:59:02][ERROR] id=3; failed: initialize_database: error execute "/opt/ymatrix/matrixdb-5.0.0+enterprise/bin/initdb"
STDOUT:
    The files belonging to this database system will be owned by user "mxadmin".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default text search configuration will be set to "english".
Data page checksums are enabled.
fixing permissions on existing directory /data/mxdata_20221101121858/primary/mxseg0 ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 
STDERR:
    initdb: error: initdb: error 256 from: "/opt/ymatrix/matrixdb-5.0.0+enterprise/bin/postgres" --boot -x0 -F -c max_connections=1500 -c shared_buffers=262144 -c dynamic_shared_memory_type=posix < "/dev/null" > "/dev/null" 2>&1
initdb: removing contents of data directory "/data/mxdata_20221101121858/primary/mxseg0"
{
  "error": "execute: do execute: run: initialize_database: error execute \"/opt/ymatrix/matrixdb-5.0.0+enterprise/bin/initdb\"\n\nSTDOUT:\n    The files belonging to this database system will be owned by user \"mxadmin\".\nThis user must also own the server process.\n\nThe database cluster will be initialized with locale \"en_US.utf8\".\nThe default text search configuration will be set to \"english\".\n\nData page checksums are enabled.\n\nfixing permissions on existing directory /data/mxdata_20221101121858/primary/mxseg0 ... ok\ncreating subdirectories ... ok\nselecting dynamic shared memory implementation ... posix\nselecting default max_connections ... \nSTDERR:\n    initdb: error: initdb: error 256 from: \"/opt/ymatrix/matrixdb-5.0.0+enterprise/bin/postgres\" --boot -x0 -F -c max_connections=1500 -c shared_buffers=262144 -c dynamic_shared_memory_type=posix \u003c \"/dev/null\" \u003e \"/dev/null\" 2\u003e\u00261\ninitdb: removing contents of data directory \"/data/mxdata_20221101121858/primary/mxseg0\"\n"
}

Problem Analysis

  1. The IP bound to the hostname is inconsistent.
  2. Available memory is too low.

Solution

  1. Associate the local IP with the hostname in /etc/hosts.
  2. Use the free -g command to check available memory.


15 Can Nginx be used to configure a domain name for the graphical interface?


Yes.

The graphical client MXUI serves externally at http://<IP>:8240. To make it directly accessible via a domain name such as http://hostname, configure a reverse proxy with Nginx. For example, to use mxui.ymatrix.cn as the external access address, the Nginx configuration is as follows.

server {
    listen 80;
    server_name mxui.ymatrix.cn;  # external domain name

    # WebSocket forwarding rules
    location /ws {
        proxy_pass http://127.0.0.1:8240/ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "Upgrade";
        proxy_set_header Host $host;
    }

    # Web forwarding rules
    location / {
        proxy_pass http://127.0.0.1:8240;
    }
}

Notes!
MXUI communicates over the WebSocket API; forwarding rules must be configured for it separately or it will not work properly.


16 Interconnect error writing an outgoing packet: operation not allowed


The system log /var/log/messages contains the following output:

Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:15 sdw37 kernel: nf_conntrack: table full, dropping packet
Mar 10 06:26:20 sdw37 kernel: nf_conntrack: table full, dropping packet

Problem Analysis

The kernel parameter net.netfilter.nf_conntrack_max defaults to tracking a maximum of 65536 connections. This error occurs when there are a large number of connections. You can view the currently configured maximum with the following command:

cat /proc/sys/net/netfilter/nf_conntrack_max

Solution

Increase the net.netfilter.nf_conntrack_max parameter.

sysctl -w net.netfilter.nf_conntrack_max=655360
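sysctl -w changes only the running kernel; to keep the larger value across reboots, persist it in /etc/sysctl.conf as well (run as root):

```shell
sysctl -w net.netfilter.nf_conntrack_max=655360             # apply immediately
echo 'net.netfilter.nf_conntrack_max = 655360' >> /etc/sysctl.conf
sysctl -p                                                   # reload and verify
```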



17 Due to environment restrictions, I cannot use graphical deployment of YMatrix. Can I deploy from the command line?


Yes. Please deploy by following the steps below.

Notes!
Replace the graphical-interface database deployment section of the standard cluster deployment document with the following commands. These commands must be executed on the Master as the root user.

Switch to root user.

$ sudo su
  1. Collect information. First, generate the initial file (which already contains the local Master host's information).
    # echo "" | /opt/ymatrix/matrixdb5/bin/mxctl setup collect > /tmp/collect.1

    Then, set the number of instances on each host.

    # export MXDEBUG_PRIMARY_PER_HOST=1

    Collect information from each host. Hosts must be collected one by one; the Master has already been collected and does not need to be run again.

    # cat /tmp/collect.1 | /opt/ymatrix/matrixdb5/bin/mxctl setup collect --host sxd1 > /tmp/collect.2
    # cat /tmp/collect.2 | /opt/ymatrix/matrixdb5/bin/mxctl setup collect --host sxd2 > /tmp/collect.3
  2. Check network communication
    # cat /tmp/collect.3 | /opt/ymatrix/matrixdb5/bin/mxctl setup netcheck > /tmp/collect.3c
  3. Set up the Standby node. To set the sxd2 node as the Standby node, there are two options; execute one of them according to your needs:

a. Set sxd2 only to Standby (sxd2 only serves the Standby role and does not store business data).

# tac /tmp/collect.3c | sed '0,/"isSegment":\ true/{s/"isSegment":\ true/"isSegment":\ false/}' | tac > /tmp/collect.3d
# tac /tmp/collect.3d | sed '0,/"isStandby":\ false/{s/"isStandby":\ false/"isStandby":\ true/}' | tac > /tmp/collect.3e

b. Add Standby node instance on sxd2 (sxd2 plays two roles: Segment + Standby).

# tac /tmp/collect.3c | sed '0,/"isStandby":\ false/{s/"isStandby":\ false/"isStandby":\ true/}' | tac > /tmp/collect.3e

In this example we choose to set sxd2 to Segment + Standby.
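The tac | sed | tac pipelines above work by reversing the file, letting sed replace the first match it sees (which is the last match in the original order), and reversing back. A minimal illustration with GNU tac and sed:

```shell
# Replace only the LAST occurrence of "a" in a three-line stream;
# only the final "a" becomes "X"
printf 'a\nb\na\n' | tac | sed '0,/a/{s/a/X/}' | tac
```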

  4. Adjust disk planning (optional)

    # cat /tmp/collect.3e |grep -i disklist
         "diskList": []
         "diskList": []
         "diskList": []

    You can modify diskList by editing the collect.3e file generated in step 3 (for example: "diskList": ["/data1", "/data2", "/data3"]). In this file, diskList defaults to an empty array. If you keep the default, the data will automatically be stored on the disk with the most free space.

  5. Enable Mirror. Set up the mirror configuration in the collect.3e file.

    # cat /tmp/collect.3e

    Modify the Mirror-related settings.

    {
     "versionID": "iAioSDT2QJnXAXRzBfYsx6",
     "genDateTime": "2023-04-12 07:26:49",
     "strategy": {
       "hostRole": {
         "isMaster": true,
         "isSegment": false,
         "isStandby": false
       },
       "highAvailability": {
         "useMirror": "auto",  // "auto" is the default; use "yes"/"no" to force. With 2 or more Segment hosts YMatrix enables the Mirror mechanism automatically; with only 1 it does not
         "mirrorStrategy": "ring",
         "haProvisioner": ""
       }
  6. Generate the final deployment plan

    # cat /tmp/collect.3e | /opt/ymatrix/matrixdb5/bin/mxbox deployer plan | tee /tmp/p4
  7. Cluster deployment

    # cat /tmp/p4 | /opt/ymatrix/matrixdb5/bin/mxbox deployer setup --debug


18 The /etc/hosts file lacks the localhost entry, causing the Supervisor service to fail to start


Error message

collect: do collect: remote host 172.16.100.144: rpc error: code = Unavailable desc = connection error:desc = "transport: Error while dialing dial tcp 172.16.100.144:4617: connect: connection refused"

Problem phenomenon

Installation and deployment of the cluster keeps failing with connection errors, even though the network itself is fine. Checking the supervisor service shows that it periodically alternates between running and failed.

[root@ljb-sdw2 software]# systemctl status matrixdb5.supervisor.service
● matrixdb5.supervisor.service - MatrixDB 5 Supervisord Daemon
   Loaded: loaded (/usr/lib/systemd/system/matrixdb5.supervisor.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2023-06-05 17:46:20 CST; 2s ago
 Main PID: 14587 (supervisord)
    Tasks: 9
   CGroup: /system.slice/matrixdb5.supervisor.service
           └─14587 /opt/ymatrix/matrixdb5/bin/supervisord -c /etc/matrixdb5/supervisor.conf

Jun 05 17:46:20 ljb-sdw2 systemd[1]: Started MatrixDB 5 Supervisord Daemon.
Jun 05 17:46:20 ljb-sdw2 bash[14587]: time="2023-06-05T17:46:20+08:00" level=info msg="load configuration from file" file=/etc/matrixdb5/supervisor.conf
Jun 05 17:46:20 ljb-sdw2 bash[14587]: time="2023-06-05T17:46:20+08:00" level=info msg="load config file over, content "

[root@ljb-sdw2 software]# systemctl status matrixdb5.supervisor.service
● matrixdb5.supervisor.service - MatrixDB 5 Supervisord Daemon
   Loaded: loaded (/usr/lib/systemd/system/matrixdb5.supervisor.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2023-06-05 17:46:23 CST; 4s ago
  Process: 14587 ExecStart=/bin/bash -c PATH="$MXHOME/bin:$PATH" exec "$MXHOME"/bin/supervisord -c "$MX_SUPERVISOR_CONF" (code=exited, status=2)
 Main PID: 14587 (code=exited, status=2)

Jun 05 17:46:23 ljb-sdw2 systemd[1]: Unit matrixdb5.supervisor.service entered failed state.
Jun 05 17:46:23 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service failed.

Checking for cluster log files shows that none have been generated.

[root@ljb-sdw2 software]# cd /var/log/matrixdb5/
[root@ljb-sdw2 matrixdb5]# ls -l
total 0

View the supervisor log in real time.

[root@ljb-sdw2 software]# journalctl -u matrixdb5.supervisor.service  -f     
-- Logs began at Sun 2023-05-07 21:33:02 CST. --
Jun 05 17:52:31 ljb-sdw2 bash[15171]: time="2023-06-05T17:52:31+08:00" level=info msg="load configuration from file" file=/etc/matrixdb5/supervisor.conf
Jun 05 17:52:31 ljb-sdw2 bash[15171]: time="2023-06-05T17:52:31+08:00" level=info msg="load config file over, content "
Jun 05 17:52:34 ljb-sdw2 bash[15171]: panic: timeout to start gRPC service
Jun 05 17:52:34 ljb-sdw2 bash[15171]: goroutine 1 [running]:
Jun 05 17:52:34 ljb-sdw2 bash[15171]: main.runServer()
Jun 05 17:52:34 ljb-sdw2 bash[15171]: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:154 +0x4f0
Jun 05 17:52:34 ljb-sdw2 bash[15171]: main.main()
Jun 05 17:52:34 ljb-sdw2 bash[15171]: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:219 +0x185
Jun 05 17:52:34 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 05 17:52:34 ljb-sdw2 systemd[1]: Unit matrixdb5.supervisor.service entered failed state.
Jun 05 17:52:34 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service failed.
Jun 05 17:52:39 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service holdoff time over, scheduling restart.
Jun 05 17:52:39 ljb-sdw2 systemd[1]: Stopped MatrixDB 5 Supervisord Daemon.
Jun 05 17:52:39 ljb-sdw2 systemd[1]: Started MatrixDB 5 Supervisord Daemon.
Jun 05 17:52:40 ljb-sdw2 bash[15186]: time="2023-06-05T17:52:40+08:00" level=info msg="load configuration from file" file=/etc/matrixdb5/supervisor.conf
Jun 05 17:52:40 ljb-sdw2 bash[15186]: time="2023-06-05T17:52:40+08:00" level=info msg="load config file over, content "
Jun 05 17:52:43 ljb-sdw2 bash[15186]: panic: timeout to start gRPC service
Jun 05 17:52:43 ljb-sdw2 bash[15186]: goroutine 1 [running]:
Jun 05 17:52:43 ljb-sdw2 bash[15186]: main.runServer()
Jun 05 17:52:43 ljb-sdw2 bash[15186]: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:154 +0x4f0
Jun 05 17:52:43 ljb-sdw2 bash[15186]: main.main()
Jun 05 17:52:43 ljb-sdw2 bash[15186]: /home/runner/work/matrixdb-ci/matrixdb-ci/cmd/supervisor/main.go:219 +0x185
Jun 05 17:52:43 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jun 05 17:52:43 ljb-sdw2 systemd[1]: Unit matrixdb5.supervisor.service entered failed state.
Jun 05 17:52:43 ljb-sdw2 systemd[1]: matrixdb5.supervisor.service failed.

Solution

  1. Check the /etc/hosts file and add the localhost configuration
    # vi /etc/hosts
    127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
    ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
  2. Restart supervisor
    # sudo systemctl restart matrixdb5.supervisor.service
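After adding the entries, a quick sanity check that localhost now resolves:

```shell
# Should print a loopback mapping such as "127.0.0.1 localhost";
# empty output means localhost still does not resolve.
getent hosts localhost
```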


19 In the new architecture, how many servers does YMatrix require at minimum for a highly available etcd cluster?


Solution

  1. With a deployment of master node + 2 data nodes, an etcd instance is deployed on every node to form a highly available etcd cluster, so at least 3 servers are required.
  2. With a deployment of master node + Standby node + 1 data node, an etcd instance is likewise deployed on every node, so at least 3 servers are required.
  3. With a deployment of master node + Standby node + 2 data nodes, etcd instances are deployed on the master node, the Standby node, and 1 data node to form the highly available cluster, so at least 4 servers are required.


20 LOG: gp_role forced to 'utility' in single-user mode Y.sh: line 1: 11865 Illegal instruction error


Error message

"LogCheckpointEnd","xlog.c",8916, LOG: gp_role forced to 'utility' in single-user mode Y.sh: line 1:
11865 Illegal instruction (core dumped) "/opt/ymatrix/matrixdb-5.0.0+community/bin/postgres" --single -F -O -j -c
gp_role=utility -c search_path=pg_catalog -c exit_on_error=true template1 > /dev/null child process exited with exit code 132
initdb: data directory "/mxdata_20231018165815/master/mxseg-1" not removed at user's request * rpc error:

Problem Analysis

The new version of the database uses SIMD instructions for vectorized execution. The CPU instruction set is detected during installation; if the required instructions are not supported, the above error message appears.

Solution

  1. Use the following command to check whether the CPU supports the required instruction sets
    cat /proc/cpuinfo|grep -E "mmx|sse|sse2|ssse3|sse4_1|sse4_2|avx|avx2"
  2. If it is a virtual machine, set the CPU operating mode to passthrough, then restart the virtual machine and try again.


21 When the firewall is turned on, what firewall policies should be configured to make the database cluster run normally?


# All ports must be open between the IPs within the cluster; expose only the necessary service ports externally.
# Example firewall configuration, assuming the three servers 10.129.38.230, 10.129.38.231, and 10.129.38.232 form the database cluster.
# Open all ports between all hosts in the cluster, for both TCP and UDP
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" port protocol="tcp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" port protocol="tcp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" port protocol="tcp" port="0-65535" accept'

firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" port protocol="udp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" port protocol="udp" port="0-65535" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" port protocol="udp" port="0-65535" accept'

# Allow ping (ICMP has no ports, so a protocol rule is used instead of a port rule)
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.230" protocol value="icmp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.231" protocol value="icmp" accept'
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="10.129.38.232" protocol value="icmp" accept'

# On the master and standby master nodes, expose only the database service port 5432 and the MXUI port 8240 to the public zone
firewall-cmd --zone=public --add-port=5432/tcp --permanent
firewall-cmd --zone=public --add-port=8240/tcp --permanent

#  Grafana
firewall-cmd --zone=public --add-port=3000/tcp --permanent

# View firewall rules
firewall-cmd --list-all

# Reload the firewall to make the configuration take effect
firewall-cmd --reload
systemctl restart firewalld.service