New architecture FAQ

This document answers common questions about the new YMatrix 5.X architecture.

1 I heard that Supervisor cannot be restarted casually under 5.X. Why is that, and what happens if it is restarted?


In the 4.X era, it was supported to run sudo systemctl restart matrixdb.supervisor.service to restart components such as the graphical interface (MXUI).
Under the 5.X architecture, however, Supervisor manages more subprocesses, including etcd, the high-availability services, and the Postmaster processes.
Shutting down or restarting Supervisor at will therefore restarts key processes such as etcd and Postmaster. At best the cluster crashes; at worst data may be lost.

Therefore, except when deleting the database cluster, uninstalling the database software, or performing similar operations, do not use this command to restart Supervisor.
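
If you only need to restart an individual component, such as the graphical interface, use the supervisorctl tool instead of systemctl, as described in the single-component startup section below. A minimal sketch of the intended workflow, run as the database administrator user:

    # restart only the graphical interface; etcd and Postmaster are left untouched
    [mxadmin@mdw3 ~]$ supervisorctl restart mxui
    # confirm that all registered components are still running
    [mxadmin@mdw3 ~]$ supervisorctl status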

2 Why does each server have a Shard/Cluster service process, but only one is active?


The new architecture is designed with high availability in mind, even under the most severe network partitions.

The Cluster/Shard services are responsible for managing the state of the database cluster. If a network partition cuts a service off from the parts of the cluster that are still healthy, it can no longer make correct decisions about the cluster state.
To keep the database cluster available in as many failure situations as possible, the services that make cluster-state decisions must themselves be highly available. Enabling them to make objective judgments, much as a human operator would, under all kinds of network faults is the core reason for introducing the etcd cluster.
Typically, a set of Cluster/Shard services runs on every machine, but across the entire cluster each service has at most one active instance. When the original active instance expires, the remaining inactive instances automatically elect one of themselves as the new active instance.
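
The local instances of these services can be seen in the Supervisor process list on each host. A sketch based on the supervisorctl status output shown later in this document (the group names embed your own cluster ID, so they will differ; the grep filter is only for readability):

    # show only the cluster-state services running on this host
    [mxadmin@mdw3 ~]$ supervisorctl status | grep -E 'shard_|cluster_'
         3. pc_id:{group:"shard_AuWFhsrjyywC4xfMahgyor" ...} ... state:"Running" ...
         9. pc_id:{group:"cluster_AuWFhsrjyywC4xfMahgyor" ...} ... state:"Running" ...

Note that a Running state only means the local instance is up; whether it is the active instance is decided cluster-wide, through etcd.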

3 How to expand the cluster online under the current 5.X version?


You can use the [graphical interface for the expansion operation](/doc/latest/maintain/mxui_expand).

4 YMatrix 5 single-component startup


Reasons for not using matrixdb.supervisor.service

Under the YMatrix 5 architecture, Supervisor manages more subprocesses, including the etcd cluster, the high-availability services, the Postmaster processes, and so on. Shutting down or restarting Supervisor at will restarts key processes such as etcd and Postmaster, which causes problems for the database.
Therefore, when you need to restart an individual component, we recommend using the supervisorctl tool instead.

Single Component Management

  1. Check the status and information of supervisor
    [mxadmin@mdw3 ~]$ systemctl status matrixdb5.supervisor.service
    ● matrixdb5.supervisor.service - MatrixDB 5 Supervisord Daemon
    Loaded: loaded (/usr/lib/systemd/system/matrixdb5.supervisor.service; enabled; vendor preset: enabled)
    Active: active (running) since Wed 2023-05-24 21:04:35 PDT; 1h 28min ago
    Process: 4426 ExecStop=/bin/bash -c exec "$MXHOME"/bin/supervisorctl shutdown (code=exited, status=1/FAILURE)
    Main PID: 4439 (supervisord)
    Memory: 605.4M
    CGroup: /system.slice/matrixdb5.supervisor.service
            ├─  954 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
            ├─  955 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
            ├─ 3357 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
            ├─ 4439 /opt/ymatrix/matrixdb5/bin/supervisord -c /etc/matrixdb5/supervisor.conf
            ├─ 4451 mxctl telegraf exec --gpname mdw3 --mxui-collector --socket-addr mdw3:51574 --cluster-id AuWFhsrjyywC4xfMahgyor --master-role --dbhost mdw3 --dbport ...
            ├─ 4461 mxctl telegraf exec --gpname mdw3 --mxui-collector --socket-addr mdw3:56639 --cluster-id GFpQhTxkwGqb7qM6iYVA8y --master-role --dbhost mdw3 --dbport ...
            ├─ 4470 /opt/ymatrix/matrixdb5/bin/cylinder -nofile -port 4637 -db-cluster-id AuWFhsrjyywC4xfMahgyor
            ├─ 4479 /opt/ymatrix/matrixdb5/bin/telegraf --config /tmp/mxui_collector_AuWFhsrjyywC4xfMahgyor.conf
            ├─ 4515 /opt/ymatrix/matrixdb5/bin/telegraf --config /tmp/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.conf
            ├─ 4528 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
            ├─ 4539 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
            ├─ 4997 /opt/ymatrix/matrixdb5/bin/mxui
            ├─12093 /usr/lib64/sa/sadc -S DISK 4 2 /tmp/sysstat-3640257011
            └─12094 /usr/lib64/sa/sadc -S DISK 4 2 /tmp/sysstat-3256168522
  2. Check the running status of each component
    [mxadmin@mdw3 ~]$ supervisorctl status
    Status:
         1. pc_id:{group:"mxui_collector_AuWFhsrjyywC4xfMahgyor" name:"mxui_collector_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4451, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui_collector_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/mxui_collector_AuWFhsrjyywC4xfMahgyor.log" pid:4451
         2. pc_id:{group:"cylinder_AuWFhsrjyywC4xfMahgyor" name:"cylinder_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4470, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/cylinder_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/cylinder_AuWFhsrjyywC4xfMahgyor.log" pid:4470
         3. pc_id:{group:"shard_AuWFhsrjyywC4xfMahgyor" name:"shard_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4477, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/shard_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/shard_AuWFhsrjyywC4xfMahgyor.log" pid:4477
         4. pc_id:{group:"mxui" name:"mxui"} describe:"pid 4997, uptime 1:24:43" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui.log" stdout_log_file:"/var/log/matrixdb5/mxui.log" pid:4997
         5. pc_id:{group:"replication-1_AuWFhsrjyywC4xfMahgyor" name:"replication-1_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4484, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-1_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-1_AuWFhsrjyywC4xfMahgyor.log" pid:4484
         6. pc_id:{group:"replication-3_AuWFhsrjyywC4xfMahgyor" name:"replication-3_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4466, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-3_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-3_AuWFhsrjyywC4xfMahgyor.log" pid:4466
         7. pc_id:{group:"etcd" name:"etcd"} describe:"pid 4450, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/mxdata_20230514185455/etcd/log/etcd.log" stdout_log_file:"/mxdata_20230514185455/etcd/log/etcd.log" pid:4450
         8. pc_id:{group:"replication-2_AuWFhsrjyywC4xfMahgyor" name:"replication-2_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4453, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/replication-2_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/replication-2_AuWFhsrjyywC4xfMahgyor.log" pid:4453
         9. pc_id:{group:"cluster_AuWFhsrjyywC4xfMahgyor" name:"cluster_AuWFhsrjyywC4xfMahgyor"} describe:"pid 4454, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/cluster_AuWFhsrjyywC4xfMahgyor.log" stdout_log_file:"/var/log/matrixdb5/cluster_AuWFhsrjyywC4xfMahgyor.log" pid:4454
         10. pc_id:{group:"deployer" name:"deployer"} describe:"pid 4457, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/deployer.log" stdout_log_file:"/var/log/matrixdb5/deployer.log" pid:4457
         11. pc_id:{group:"mxui_collector_GFpQhTxkwGqb7qM6iYVA8y" name:"mxui_collector_GFpQhTxkwGqb7qM6iYVA8y"} describe:"pid 4461, uptime 1:29:17" now:1684992833 state:"Running" log_file:"/var/log/matrixdb5/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.log" stdout_log_file:"/var/log/matrixdb5/mxui_collector_GFpQhTxkwGqb7qM6iYVA8y.log" pid:4461
  3. Restart a component, for example the graphical interface (MXUI)
    [mxadmin@mdw3 ~]$ supervisorctl restart mxui
    Restarted:
         1. name:"mxui"
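  4. (Optional) Confirm that the component is running again by re-running the status command from step 2; the mxui entry should now show a new pid with state "Running"
    [mxadmin@mdw3 ~]$ supervisorctl status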

5 Can a Segment be automatically recovered by Supervisor?


No, it cannot.

The Supervisor service does not automatically start a Segment (that is, a PostgreSQL process group). Supervisor can only start services that are registered with it. The PostgreSQL process group is a descendant of the replication service process and is managed by that service rather than by Supervisor. The process group is started or stopped only by the cluster management tools (mxstart, mxstop, mxrecover, etc.), which call the replication service API.
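
If a Segment does go down, recover it with the cluster management tools rather than waiting for Supervisor. A minimal sketch, assuming default options (flags and behavior may differ between versions, so consult the mxrecover documentation first):

    # recover failed segments; internally this calls the replication service API
    [mxadmin@mdw3 ~]$ mxrecover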