Custom Monitoring Feature

This document describes how to configure custom monitoring in YMatrix. This feature is supported starting from version v4.8.2.

1 Background

First, you should understand how to deploy Grafana monitoring or Prometheus monitoring.

Taking local Grafana monitoring as an example (matrixdb@mxagent),
one to two minutes after the monitoring deployment completes, you can query the monitoring data in the local.system table under the matrixmgr database.

This data is collected by telegraf and inserted into the table via mxgate. The telegraf and mxgate processes are deployed as background services by the matrixdb function, as shown in the output of the supervisorctl status command:

$ supervisorctl status
Status:
        1. pc_id:{group:"mxui_collector" name:"mxui_collector"} describe:"pid 8223, uptime 2:27:36" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui_collector_5432.log" stdout_log_file:"/var/log/matrixdb/mxui_collector_5432.log" pid:8223
        2. pc_id:{group:"mxmgr_gate_ctrl" name:"mxmgr_gate_ctrl"} describe:"pid 10295, uptime 2:25:28" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" pid:10295
        3. pc_id:{group:"mxmgr_telegraf_ctrl" name:"mxmgr_telegraf_ctrl"} describe:"pid 10350, uptime 2:25:26" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" pid:10350
        4. pc_id:{group:"cylinder" name:"cylinder"} describe:"pid 6038, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/cylinder.log" stdout_log_file:"/var/log/matrixdb/cylinder.log" pid:6038
        5. pc_id:{group:"mxui" name:"mxui"} describe:"pid 6041, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui.log" stdout_log_file:"/var/log/matrixdb/mxui.log" pid:6041

The data stored in the local.system table is primarily used to display cluster monitoring charts on Grafana Dashboards.

The system information collected in the local.system table includes the following 24 categories of runtime metrics. If you have operational experience, you may explore their contents:

matrixmgr=# select distinct category from local.system order by 1;
   category
---------------
 cpu
 disk
 diskio
 kernel
 mem
 net
 netstat
 postgresql
 processes
 sar_cpu
 sar_cpu_util
 sar_disk
 sar_hugepages
 sar_inode
 sar_io
 sar_mem
 sar_network
 sar_paging
 sar_queue
 sar_swap
 sar_swap_util
 sar_task
 swap
 system
(24 rows)

2 Custom Monitoring

Starting from v4.8.2, YMatrix supports user-defined custom monitoring. You can define your own monitoring scripts, which can either be inserted into the local.system table or reported to Prometheus. After successful setup, you can create custom Dashboard panels and alerts in Grafana.

2.1 Writing Custom Scripts

After deploying the monitoring components, you can find the custom monitoring scripts in the following directory: /etc/matrixdb/scripts.

This directory contains:

  • monitor_bootstrap.sh
    This is the entry point for custom monitoring. The content is pre-written and should not be modified by users. The telegraf process calls this script periodically to collect custom monitoring data.

  • monitor_plugins/
    This directory is for user-defined scripts. For security reasons, only the root user has write permissions to this directory.

Below are two example custom monitoring scripts placed in the monitor_plugins/ directory:

  1. Create a script named nic.sh to display NIC (network interface) statistics:
#!/bin/bash
style=grafana
#######################################################################
# NIC statistics
#######################################################################
nic_stats_output() {
    METRIC="net_dev";
    for NIC in $(ls /sys/class/net/ | grep -E 'eth|enp|ens|bond')
    do
        VAL=""
        for f in $(ls /sys/class/net/$NIC/statistics/); do
            v=$(cat /sys/class/net/$NIC/statistics/$f);
            if [ "#$style" == "#prometheus" ];then
                echo "matrixdb,device=$NIC,metric=$f $METRIC=$v"
                continue
            fi
            if [ ! -z $VAL ];then
                VAL+=","
            fi
            VAL+="$f=$v";
        done
        if [ "#$style" == "#grafana" ];then
            echo "$METRIC,device=$NIC $VAL"
        fi
    done
}
nic_stats_output

This script outputs statistics from /sys/class/net/<NIC>/statistics/ on the local host. A standalone run produces output like:

net_dev,device=eth0 collisions=0,multicast=0,rx_bytes=54701811772,rx_compressed=0,rx_crc_errors=0,rx_dropped=0,rx_errors=0,rx_fifo_errors=0,rx_frame_errors=0,rx_length_errors=0,rx_missed_errors=0,rx_nohandler=0,rx_over_errors=0,rx_packets=328974378,tx_aborted_errors=0,tx_bytes=89613462060,tx_carrier_errors=0,tx_compressed=0,tx_dropped=0,tx_errors=0,tx_fifo_errors=0,tx_heartbeat_errors=0,tx_packets=283871697,tx_window_errors=0

Output format rules for the script: each line of data must follow this structure:

  • The first word is the metric name. In the example above, it is net_dev.
  • After the first word, a comma-separated list of key=value tags may appear before the first space. These are labels for the metric, such as device=eth0.
  • After the space, one or more key=value pairs represent the actual metric values.
  1. Create a script named interrupts.sh to display hardware interrupt statistics:
#!/bin/bash
style=grafana
#######################################################################
# Hardware Interrupts
#######################################################################
interrupts_output() {
    PATTERN=$(awk -F ':' '{i++; if(i>2){print $1}}' /proc/net/dev | sed 's/ //g' | tr '\n' '|' | sed 's/|$//')
    egrep "$PATTERN" /proc/interrupts | awk -v style="#$style" \
        '{ for (i=2;i<=NF-2;i++) sum[i]+=$i;}
             END {
             for (i=2;i<=NF-2; i++)
             {
                 if(style=="#prometheus"){
                     print("matrixdb,device=cpu" i-2 " net_interrupts_by_cpu="sum[i]);
                     continue;
                 }
                 val=sprintf(val "cpu" i-2 "=" sum[i]);
                       if(i!=NF-2 )
                     val=sprintf(val ",");
             }
             if(style=="#grafana")
                       print("net_interrupts_by_cpu,device=all " val)
         }'
    egrep "$PATTERN" /proc/interrupts | awk -v style="#$style" \
        '{ for (i=2;i<=NF-2; i++)
               sum+=$i;
               tags=sprintf("%s", $NF);
               if (NR!=1)
                   val=sprintf(val ",");
               val=sprintf(val tags "=" sum);

               if(style=="#prometheus"){
                   print("matrixdb,device=" $NF " net_interrupts_by_queue=" sum)
               }
               sum=0;
         } END{ if(style=="#grafana") print("net_interrupts_by_queue,device=all " val) }'
}
interrupts_output

A single execution produces:

net_interrupts_by_cpu,device=all cpu0=284551104,cpu1=308556439
net_interrupts_by_queue,device=all eth0-Tx-Rx-0=298072844,eth0-Tx-Rx-1=295034700

Note: Scripts placed in the monitor_plugins/ directory must have execute permissions:

$ ls -l /etc/matrixdb/scripts/monitor_plugins/
-rwxr-xr-x 1 root root 1491 Apr 26 12:51 interrupts.sh
-rwxr-xr-x 1 root root  855 Apr 26 12:45 nic.sh

If execute permissions are removed, the script will be effectively disabled and will not be executed during periodic collection.

2.2 Verify Script Effectiveness

After placing the custom monitoring scripts and verifying the output format, you can test them using telegraf --test.

First, locate the telegraf configuration file in /tmp:

$ ls -l /tmp | grep telegraf
-rw-r--r-- 1 root     root      12676 Apr 26 11:24 telegraf_5432.conf

Run telegraf in test mode:

$ sudo /usr/local/matrixdb/bin/telegraf --config /tmp/telegraf_5432.conf --test

The output will include the results from your custom scripts, confirming they are properly loaded:

Once the scripts are in place, they will be automatically executed during each system monitoring collection cycle. Wait 1–2 minutes, then query the local.system table to verify the custom monitoring data has been inserted.

  1. Query local.system to confirm new categories net_dev, net_interrupts_by_cpu, and net_interrupts_by_queue have appeared:
matrixmgr=# SELECT distinct category FROM local.system ORDER BY 1;
        category
-------------------------
 cpu
 disk
 diskio
 kernel
 mem
 net
 net_dev
 net_interrupts_by_cpu
 net_interrupts_by_queue
 netstat
 postgresql
 processes
 sar_cpu
 sar_cpu_util
 sar_disk
 sar_hugepages
 sar_inode
 sar_io
 sar_mem
 sar_network
 sar_paging
 sar_queue
 sar_swap
 sar_swap_util
 sar_task
 swap
 system
(27 rows)

Query the actual collected data:

matrixmgr=# SELECT * FROM local.system
WHERE category IN ('net_interrupts_by_cpu', 'net_interrupts_by_queue', 'net_dev')
ORDER BY ts DESC LIMIT 10;

The output shows that the custom script data has been stored as structured records, ready for analysis:

For Prometheus users, visit http://<prometheus-server>:9090 to verify that the custom metrics are present.

2.3 Add to Grafana Dashboard

After the data is ingested into the database or Prometheus, you can create custom panels in Grafana Dashboards and design visualizations as needed.

This is not the focus of this document, so only basic steps are provided.

As an example for non-Prometheus users, write a SQL query to retrieve monitoring data:

After adding the panel, your custom monitoring data will be visualized, for example as a line chart:

Click the gear icon to configure alerts: