etcd Monitoring

YMatrix 6 introduces etcd to store cluster configuration and state information. etcd is critical: any abnormality may lead to database instability or even system crashes.


This document describes how to install and deploy etcd monitoring. We recommend installing this monitoring setup for all production clusters, as healthy etcd status is a prerequisite for stable database operation.

1. Install Prometheus (Skip if Already Installed)

Visit the Prometheus official website (https://prometheus.io/download/) and download the Prometheus server package (a .tar.gz archive) for your platform.

Upload the downloaded tar package to a Linux server. If possible, run Prometheus on a dedicated server; if resources are limited, you may temporarily place it on the Standby or Master host.

$ tar xvfz prometheus-*.tar.gz

You can move the extracted prometheus-* directory to /usr/local/.
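For example, a minimal sketch of finishing the installation (the target path /usr/local/prometheus is just a convention; adjust it to your environment):

$ sudo mv prometheus-* /usr/local/prometheus
$ /usr/local/prometheus/prometheus --version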

2. Configure Prometheus to Scrape etcd Metrics

Edit the configuration file:

$ cd prometheus-*
$ sudo vi prometheus.yml

Append the following job to the scrape_configs section at the end of the file:

  - job_name: "etcd"
    static_configs:
      - targets: ["172.31.33.128:4679", "172.31.45.253:4679", "172.31.35.134:4679"]

The targets array must be replaced with the addresses of all etcd nodes in your cluster.
You can find this information in the /etc/matrixdb6/physical_cluster.toml file on the Master host:

$ cat /etc/matrixdb6/physical_cluster.toml
cluster_id = '79LhQxjuwmXgSWZCjcdigF'
supervisord_grpc_port = 4617
deployer_port = 4627
etcd_endpoints = ['http://10.0.159.1:4679', 'http://10.0.172.185:4679', 'http://10.0.170.90:4679', 'http://10.0.146.2:4679', 'http://10.0.146.195:4679', 'http://10.0.150.110:4679', 'http://10.0.169.149:4679']
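As a quick sanity check before wiring up Prometheus, confirm that an etcd node actually serves Prometheus metrics on its client port (the address below is from the example output above; replace it with one of your own endpoints):

$ curl -s http://10.0.159.1:4679/metrics | head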

Note!
If the /etc/matrixdb6/physical_cluster.toml file does not exist and no etcd process is found, your cluster was not deployed using the 6.x architecture and does not require etcd monitoring.
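Before starting Prometheus, you can optionally validate the edited configuration with promtool, which ships in the same tarball:

$ ./promtool check config prometheus.yml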

3. Start Prometheus

$ ./prometheus --config.file=prometheus.yml

Note!
Typically, you should run Prometheus as a background system service by configuring it as a systemd unit, as sketched below.
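The unit below is a sketch only: the installation path /usr/local/prometheus and the dedicated prometheus user are assumptions, so adjust them to your environment (create the user first or run as an existing account, and make sure it can write Prometheus's data directory).

# /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
# Prometheus writes its data directory relative to the working directory by default
WorkingDirectory=/usr/local/prometheus
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload systemd and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now prometheus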

4. Install Grafana (Skip if Already Installed)

Refer to the official documentation for installing Grafana.

Note!
Grafana version 8.2.5 or higher is required.
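On an RPM-based system, one possible route is to download the package from the official site and install it locally (a sketch only; take the exact package name and version from the official download page):

$ sudo yum install -y ./grafana-<version>.rpm
$ sudo systemctl enable --now grafana-server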

5. Configure the etcd Dashboard in Grafana

First, log in to the Grafana web interface. The default URL is:

http://<IP_or_domain_of_the_host>:3000

5.1 Configure the Data Source


In the Grafana interface, add a Prometheus data source (Configuration → Data Sources → Add data source) and set its URL to the address of the Prometheus instance deployed earlier, for example http://172.31.23.21:9090 (9090 is Prometheus's default service port).
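Alternatively, the data source can be declared through Grafana's file-based provisioning instead of the UI. A minimal sketch, assuming the default provisioning directory and the example Prometheus address above:

# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://172.31.23.21:9090
    isDefault: true

Restart grafana-server after adding the file so that the data source is picked up.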

5.2 Import Dashboard

Visit the Grafana community:
https://grafana.com/grafana/dashboards/?search=Etcd+Cluster+Overview

The link above lists community-provided dashboards. Open the desired dashboard's page to obtain its ID.

Note!
The dashboard associated with this ID may change over time. At the time of writing the ID is 15308; verify it against your own search results.

In Grafana, open the dashboard Import page and enter the ID 15308.

Then select the data source configured earlier and load the dashboard.

If the Grafana server cannot reach the internet, importing by ID 15308 may fail. In that case, download the dashboard's JSON file on an internet-connected machine and import it manually, for example:
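A sketch of fetching the dashboard JSON on an internet-connected machine (the grafana.com download URL pattern is an assumption; verify it on the dashboard's page, which also offers a Download JSON button):

$ curl -sL -o etcd-dashboard.json https://grafana.com/api/dashboards/15308/revisions/latest/download

Copy etcd-dashboard.json to a machine that can reach Grafana and upload it through the Import page.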

6. Key Metrics in the etcd Dashboard

  • etcd Service Status and Member Count

    • Has Leader must be Yes; otherwise, the etcd service is unavailable.
    • Members count is automatically determined by the number of hosts in the system, with a maximum of 7.
  • Quorum Member List and Process Status

    • DB Capacity indicates available space. The etcd database has a default size limit of 2 GB. If free space runs low, the database may soon fill up and etcd will fail. Consider performing Compact or Defragment operations (similar to VACUUM and VACUUM FULL in PostgreSQL); see the sketch after this list.
    • Mem Used shows memory usage per etcd instance, typically several hundred MB.
  • DB Total Size displays the growth of etcd data over time. An abnormal increase may lead to the database reaching its size limit.
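A compaction and defragmentation sketch with etcdctl, adapted from the etcd maintenance guide (the endpoint address is illustrative; use one of your own etcd endpoints, and rehearse on a non-production cluster first):

$ export ETCDCTL_API=3                                        # only needed for older etcdctl releases
$ ENDPOINT=http://10.0.159.1:4679
$ etcdctl --endpoints=$ENDPOINT endpoint status -w table      # shows leader, DB size, and revision
$ # compact away history older than the current revision (cluster-wide)
$ rev=$(etcdctl --endpoints=$ENDPOINT endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
$ etcdctl --endpoints=$ENDPOINT compact $rev
$ # defragment to return the freed space to the filesystem (run against each member)
$ etcdctl --endpoints=$ENDPOINT defrag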

Note!
For more metrics and detailed explanations, refer to the etcd official documentation.