This document describes common problems with the new YMatrix 5.X architecture.
In the 4.X era, it was supported to restart components such as the graphical interface (MXUI) with `sudo systemctl restart matrixdb.supervisor.service`.
Under the 5.X architecture, however, Supervisor manages many more subprocesses, including etcd, the high-availability services, and the Postmaster processes.
Shutting down or restarting Supervisor at will therefore restarts key processes such as etcd and Postmaster: at best the cluster crashes, and at worst data is lost.
Except when deleting the database cluster, uninstalling the database software, or performing similar teardown operations, do not use this command to restart Supervisor.
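To make the contrast concrete, here is a sketch of the command to avoid and the component-level alternative this document recommends. The 5.X service name (`matrixdb5.supervisor.service`) matches the example output below, but verify it on your own deployment.

```
# DANGEROUS on a running 5.X cluster: restarts etcd, the high-availability
# services, and Postmaster along with Supervisor itself.
sudo systemctl restart matrixdb5.supervisor.service

# Safe alternative: restart only the component you need (here, MXUI)
# through supervisorctl, as shown in the examples further below.
supervisorctl restart mxui
```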
The new architecture is designed with high availability in mind, even under the most stringent network partitions.
The Cluster/Shard services are responsible for managing the state of the database cluster. If a network partition cuts them off from parts of the cluster that are otherwise working properly, they cannot make correct decisions about the cluster's state.
To keep the database cluster available as much as possible under any failure, the services that make cluster-state decisions are themselves highly available. Being able to reach an objective verdict under all kinds of network anomalies, much as a human operator would, is the core reason for introducing the etcd cluster.
Typically, one set of Cluster/Shard services runs on each machine, but across the entire cluster each service has at most one active instance. When the active instance's lease expires, the remaining inactive instances automatically elect one of themselves as the new active instance.
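If you want to observe the etcd cluster that backs these decisions, a stock `etcdctl` client can query member and leader state. This is a minimal sketch assuming `etcdctl` (v3 API) is on the PATH; the endpoint below is a placeholder, since YMatrix may run etcd on non-default ports, so substitute the host:port your deployment actually uses.

```
# Placeholder endpoint: replace with the host:port your etcd listens on.
ETCDCTL_API=3 etcdctl --endpoints=http://mdw3:2379 member list

# 'endpoint status' shows which member is currently the leader.
ETCDCTL_API=3 etcdctl --endpoints=http://mdw3:2379 endpoint status --write-out=table
```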
You can use the [graphical interface for expansion operations](/doc/latest/maintain/mxui_expand).
Reasons for not using matrixdb.supervisor.service
Under the YMatrix 5 architecture, `supervisor` manages more subprocesses, including the etcd cluster, the high-availability services, the `postmaster` processes, and so on. Shutting down or restarting `supervisor` at will restarts key processes such as `etcd` and `postmaster`, which causes database problems. Therefore, when you need to restart an individual component, we recommend using the `supervisorctl` tool, as shown in the examples below.
Single Component Management
To see which processes `supervisor` manages, check the status of the service:
```
[mxadmin@mdw3 ~]$ systemctl status matrixdb5.supervisor.service
● matrixdb5.supervisor.service - MatrixDB 5 Supervisord Daemon
   Loaded: loaded (/usr/lib/systemd/system/matrixdb5.supervisor.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2023-05-24 21:04:35 PDT; 1h 28min ago
  Process: 4426 ExecStop=/bin/bash -c exec "$MXHOME"/bin/supervisorctl shutdown (code=exited, status=1/FAILURE)
 Main PID: 4439 (supervisord)
   Memory: 605.4M
   CGroup: /system.slice/matrixdb5.supervisor.service
           ├─  954 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
           ├─  955 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
           ├─ 3357 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
           ├─ 4439 /opt/ymatrix/matrixdb5/bin/supervisord -c /etc/matrixdb5/supervisor.conf
           ├─ 4451 mxctl telegraf exec --gpname mdw3 --mxui-collector --socket-addr mdw3:51574 --cluster-id AuWFhsrjyywC4xfMahgyor --master-role --dbhost mdw3 --dbport ...
           ├─ 4461 mxctl telegraf exec --gpname mdw3 --mxui-collector --socket-addr mdw3:56639 --cluster-id GFpQhTxkwGqb7qM6iYVA8y --master-role --dbhost mdw3 --dbport ...
           ├─ 4470 /opt/ymatrix/matrixdb5/bin/cylinder -nofile -port 4637 -db-cluster-id AuWFhsrjyywC4xfMahgyor
           ├─ 4479 /opt/ymatrix/matrixdb5/bin/telegraf --config /tmp/mxui_collector_AuWFhsrjyywC4xfMahgyor.conf
           ├─ 4515 /opt/ymatrix/matrixdb5/bin/telegraf --config /tmp/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.conf
           ├─ 4528 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
           ├─ 4539 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
           ├─ 4997 /opt/ymatrix/matrixdb5/bin/mxui
           ├─12093 /usr/lib64/sa/sadc -S DISK 4 2 /tmp/sysstat-3640257011
           └─12094 /usr/lib64/sa/sadc -S DISK 4 2 /tmp/sysstat-3256168522
```
List the components registered with `supervisor`:
```
[mxadmin@mdw3 ~]$ supervisorctl status
Status:
1. pc_id:{group:"mxui_collector_AuWFhsrjyywC4xfMahgyor" name:"mxui_collector_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4451, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui_collector_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/mxui_collector_AuWFhsrjyywC4xfMahgyor.log" pid:4451
2. pc_id:{group:"cylinder_AuWFhsrjyywC4xfMahgyor" name:"cylinder_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4470, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/cylinder_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/cylinder_AuWFhsrjyywC4xfMahgyor.log" pid:4470
3. pc_id:{group:"shard_AuWFhsrjyywC4xfMahgyor" name:"shard_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4477, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/shard_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/shard_AuWFhsrjyywC4xfMahgyor.log" pid:4477
4. pc_id:{group:"mxui" name:"mxui"} describe:"pid 4997, uptime 1:24:43" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui.log" stdout_log_file:"/var/log/matrixdb5/mxui.log" pid:4997
5. pc_id:{group:"replication-1_AuWFhsrjyywC4xfMahgyor" name:"replication-1_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4484, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-1_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-1_AuWFhsrjyywC4xfMahgyor.log" pid:4484
6. pc_id:{group:"replication-3_AuWFhsrjyywC4xfMahgyor" name:"replication-3_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4466, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-3_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-3_AuWFhsrjyywC4xfMahgyor.log" pid:4466
7. pc_id:{group:"etcd" name:"etcd"} describe:"pid 4450, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/mxdata_20230514185455/etcd/log/etcd.log" stdout_log_file:"/mxdata_20230514185455/etcd/log/etcd.log" pid:4450
8. pc_id:{group:"replication-2_AuWFhsrjyywC4xfMahgyor" name:"replication-2_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4453, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-2_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-2_AuWFhsrjyywC4xfMahgyor.log" pid:4453
9. pc_id:{group:"cluster_AuWFhsrjyywC4xfMahgyor" name:"cluster_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4454, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/cluster_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/cluster_AuWFhsrjyywC4xfMahgyor.log" pid:4454
10. pc_id:{group:"deployer" name:"deployer"} describe:"pid 4457, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/deployer.log" stdout_log_file:"/var/log/matrixdb5/deployer.log" pid:4457
11. pc_id:{group:"mxui_collector_GFpQhTxkwGqb7qM6iYVA8y" name:"mxui_collector_GFpQhTxkwGqb7qM6iYVA8y"} describe:"pid 4461, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.log" stdout_log_file:"/var/log/matrixdb5/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.log" pid:4461
```

Restart a single component, for example MXUI:
```
[mxadmin@mdw3 ~]$ supervisorctl restart mxui
Restarted:
1. name:"mxui"
```
It cannot.
The supervisor service does not automatically start a Segment (a PostgreSQL process group); `supervisor` can only start services that are registered with it. The PostgreSQL process group is a descendant of the replication service process and is managed by that process rather than by the supervisor service. The process group is started or stopped only by the cluster management tools (`mxstart`, `mxstop`, `mxrecover`, etc.), which call the replication service API.
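As a sketch of how the process groups are actually started and stopped, the cluster management tools named above are invoked from the shell. The `-a` (non-interactive) flag shown here is an assumption based on the Greenplum-style conventions these tools inherit; check `mxstop --help` on your version before relying on it.

```
# Stop and start the cluster (process groups included) via the cluster
# management tools, which call the replication service API internally.
mxstop -a
mxstart -a

# Bring failed segments back online after a failure.
mxrecover
```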