Custom monitoring function

This document describes how to customize monitoring configurations in YMatrix. This feature is supported starting from v4.8.2.

1 Background

First, you need to understand how to deploy Grafana Monitoring or Prometheus Monitoring.

Take local Grafana monitoring as an example, deployed with SELECT mxmgr_init_local();. One or two minutes after the deployment completes, monitoring data begins to arrive in the local.system table of the matrixmgr database, where you can query it.

This information is collected by telegraf and inserted into the table via mxgate. The mxmgr_init_local() function deploys the telegraf and mxgate processes as background services; they appear as mxmgr_telegraf_ctrl and mxmgr_gate_ctrl in the output of the supervisorctl status command:

$ supervisorctl status
Status:
        1. pc_id:{group:"mxui_collector" name:"mxui_collector"} describe:"pid 8223, uptime 2:27:36" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui_collector_5432.log" stdout_log_file:"/var/log/matrixdb/mxui_collector_5432.log" pid:8223
        2. pc_id:{group:"mxmgr_gate_ctrl" name:"mxmgr_gate_ctrl"} describe:"pid 10295, uptime 2:25:28" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" pid:10295
        3. pc_id:{group:"mxmgr_telegraf_ctrl" name:"mxmgr_telegraf_ctrl"} describe:"pid 10350, uptime 2:25:26" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" pid:10350
        4. pc_id:{group:"cylinder" name:"cylinder"} describe:"pid 6038, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/cylinder.log" stdout_log_file:"/var/log/matrixdb/cylinder.log" pid:6038
        5. pc_id:{group:"mxui" name:"mxui"} describe:"pid 6041, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui.log" stdout_log_file:"/var/log/matrixdb/mxui.log" pid:6041
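
To check the two services without reading the full listing, the status output can be filtered. The following is a minimal sketch, assuming the output keeps the name:"..." and state:"..." fields shown above; service_state is a hypothetical helper, not a YMatrix command.

```shell
#!/bin/bash
# service_state NAME (hypothetical helper): read `supervisorctl status` output
# on stdin and print the state recorded for service NAME, or "missing" if no
# line mentions that service.
service_state() {
    local name="$1" line
    # Take the first line that names the service; tolerate "no match".
    line=$(grep -F -m 1 "name:\"$name\"" || true)
    if [ -z "$line" ]; then
        echo "missing"
        return 0
    fi
    # Extract the value inside state:"..."
    sed -n 's/.*state:"\([^"]*\)".*/\1/p' <<< "$line"
}
```

For example, supervisorctl status | service_state mxmgr_telegraf_ctrl prints the state of the telegraf service.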

The information stored in the local.system table is mainly used to display cluster monitoring charts on Grafana's Dashboard.

The local.system table holds the following 24 categories of system operation information. If you have operations and maintenance experience, you can explore the contents on your own:

matrixmgr=# select distinct category from local.system order by 1;
   category
------------------------
 cpu
 disk
 diskio
 kernel
 mem
 net
 netstat
 postgresql
 processes
 sar_cpu
 sar_cpu_util
 sar_disk
 sar_hugepages
 sar_inode
 sar_io
 sar_mem
 sar_network
 sar_paging
 sar_queue
 sar_swap
 sar_swap_util
 sar_task
 swap
 system
(24 rows)

2 Custom monitoring

Starting with v4.8.2, YMatrix lets users develop custom monitoring items. You can define monitoring items yourself and either insert them into the local.system table or report them to Prometheus. Once the data is flowing, you can build your own Dashboard panels and alert rules in Grafana.

2.1 Writing custom scripts

After the above monitoring is deployed, the custom monitoring scripts are located in the following directory: /etc/matrixdb/scripts/.

This directory includes:

  • monitor_bootstrap.sh This file is the entry point for custom monitoring. Its content is already written and users do not need to edit it. The monitoring component telegraf calls this script periodically to output custom monitoring information.
  • monitor_plugins/ User-defined scripts are placed in this directory. For security reasons, only the root user has write permission for this directory.

The following are two custom monitoring script examples, placed in the monitor_plugins/ directory:

  1. Create a nic.sh script to display network interface card (NIC) statistics:
    #!/bin/bash
    style=grafana
    #########################################################################################################################################################################
    # NIC statistics
    #########################################################################################################################################################################
    nic_stats_output() {
     METRIC="net_dev";
     for NIC in  `ip link|grep mtu|awk -F ':' '{print $2}'|grep -vE "lo|docker|bond"`
     do
         VAL=""
         for f in $(ls /sys/class/net/$NIC/statistics/); do
             v=$(cat /sys/class/net/$NIC/statistics/$f);
             if [ "#$style" == "#prometheus" ];then
                 echo "matrixdb,device=$NIC,metric=$f $METRIC=$v"
                 continue
             fi
             if [ -n "$VAL" ];then
                 VAL+=","
             fi
             VAL+="$f=$v";
         done
         if [ "#$style" == "#grafana" ];then
             echo "$METRIC,device=$NIC $VAL"
         fi
     done
    }
    nic_stats_output

    The above script outputs the statistics found under /sys/class/net/ on the machine. When run on its own, it produces output like the following:

    net_dev,device=eth0 collisions=0,multicast=0,rx_bytes=54701811772,rx_compressed=0,rx_crc_errors=0,rx_dropped=0,rx_errors=0,rx_fifo_errors=0,rx_frame_errors=0,rx_length_errors=0,rx_missed_errors=0,rx_nohandler=0,rx_over_errors=0,rx_packets=328974378,tx_aborted_errors=0,tx_bytes=89613462060,tx_carrier_errors=0,tx_compressed=0,tx_dropped=0,tx_errors=0,tx_fifo_errors=0,tx_heartbeat_errors=0,tx_packets=283871697,tx_window_errors=0

    Each line of script output follows these rules:

  • The first word is the name of the monitoring item. In the example above: net_dev
  • Between the comma after the first word and the first space, there are one or more key=value groups. These are the tags that identify what the line describes, such as the device. In the example above: device=eth0
  • After the space, there are one or more key=value pairs, which hold the actual monitoring values.
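
The three rules above can be sketched as a small format check to run on a script's output before deploying it. This is an illustration only: check_metric_line is a hypothetical helper, not part of YMatrix, and it assumes names and values contain no commas, spaces, or equals signs, which may be stricter than what telegraf actually accepts.

```shell
#!/bin/bash
# check_metric_line LINE (hypothetical helper): print "ok" if LINE has the form
#   name,tag=value[,tag=value...] field=value[,field=value...]
# and print "invalid" (returning non-zero) otherwise.
check_metric_line() {
    local line="$1"
    # One key=value group; keys and values may not contain ',', '=' or ' '.
    local kv='[^,= ]+=[^,= ]+'
    # Name, at least one tag, a single space, at least one field.
    local re="^[^,= ]+(,${kv})+ ${kv}(,${kv})*$"
    if [[ "$line" =~ $re ]]; then
        echo "ok"
    else
        echo "invalid"
        return 1
    fi
}
```

A script's output can then be checked line by line, for example: ./nic.sh | while read -r l; do check_metric_line "$l"; done
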
  2. Create an interrupts.sh script to display system interrupt information:

    #!/bin/bash
    style=grafana
    #########################################################################################################################################################################
    # Hardware Interrupts
    #########################################################################################################################################################################
    interrupts_output() {
     PATTERN=$(awk -F ':' '{i++; if(i>2){print $1}}' /proc/net/dev | sed 's/ //g' | tr '\n' '|' | sed 's/|$//')
     grep -E "$PATTERN" /proc/interrupts | awk -v style="#$style" \
         '{ for (i=2;i<=NF-2;i++) sum[i]+=$i;}
              END {
              for (i=2;i<=NF-2; i++)
              {
                  if(style=="#prometheus"){
                      print("matrixdb,device=cpu" i-2 " net_interrupts_by_cpu="sum[i]);
                      continue;
                  }
                  val=sprintf(val "cpu" i-2 "=" sum[i]);
                        if(i!=NF-2 )
                      val=sprintf(val ",");
              }
              if(style=="#grafana")
                        print("net_interrupts_by_cpu,device=all " val)
          }'
     grep -E "$PATTERN" /proc/interrupts | awk -v style="#$style" \
         '{ for (i=2;i<=NF-2; i++)
                sum+=$i;
                tags=sprintf("%s", $NF);
                if (NR!=1)
                    val=sprintf(val ",");
                val=sprintf(val tags "=" sum);
    
                if(style=="#prometheus"){
                    print("matrixdb,device=" $NF " net_interrupts_by_queue=" sum)
                }
                sum=0;
          } END{ if(style=="#grafana") print("net_interrupts_by_queue,device=all " val) }'
    }
    interrupts_output

    The result of a single execution is:

    net_interrupts_by_cpu,device=all cpu0=284551104,cpu1=308556439
    net_interrupts_by_queue,device=all eth0-Tx-Rx-0=298072844,eth0-Tx-Rx-1=295034700

    Note that scripts placed under monitor_plugins/ must have execute permission:

    $ ls -l /etc/matrixdb/scripts/monitor_plugins/
    -rwxr-xr-x 1 root root 1491 Apr 26 12:51 interrupts.sh
    -rwxr-xr-x 1 root root  855 Apr 26 12:45 nic.sh

    Removing the execute permission is equivalent to disabling the script: it will not be executed during periodic collection.
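
To see at a glance which scripts are currently enabled, the execute bit can be inspected for every file in the directory. The sketch below is an assumption-labeled illustration: list_plugins is a hypothetical helper, and it only reports what the current user's permissions allow, not what monitor_bootstrap.sh does internally.

```shell
#!/bin/bash
# list_plugins DIR (hypothetical helper): print each regular file in DIR,
# prefixed with "enabled" if it is executable and "disabled" otherwise.
list_plugins() {
    local dir="$1" f
    for f in "$dir"/*; do
        # Skip subdirectories and the unexpanded glob of an empty directory.
        [ -f "$f" ] || continue
        if [ -x "$f" ]; then
            echo "enabled  $(basename "$f")"
        else
            echo "disabled $(basename "$f")"
        fi
    done
}
```

For example, list_plugins /etc/matrixdb/scripts/monitor_plugins lists both sample scripts as enabled; after sudo chmod -x on nic.sh, that script is listed as disabled.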

2.2 Confirm that the script takes effect

After placing the custom monitoring script, you can run telegraf in test mode to confirm that the output format is correct.

First find the telegraf configuration file name under /tmp:

$ ls -l /tmp | grep telegraf
-rw-r--r-- 1 root     root      12676 Apr 26 11:24 telegraf_5432.conf

Run the telegraf test mode:

$ sudo /usr/local/matrixdb/bin/telegraf --config /tmp/telegraf_5432.conf --test

In the test output, you can see that the output of the custom scripts is already included.
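
Because the test output also contains every built-in metric, it can help to keep only the custom items. The following is a minimal sketch, assuming each metric line starts with the item name, optionally preceded by the "> " prefix that telegraf usually prints in test mode; custom_metric_lines is a hypothetical helper.

```shell
#!/bin/bash
# custom_metric_lines NAME... (hypothetical helper): read telegraf test output
# on stdin and keep only lines whose monitoring item name is one of the NAMEs.
custom_metric_lines() {
    [ "$#" -gt 0 ] || return 1
    local names
    # Join the names with '|' to build a regex alternation.
    names=$(IFS='|'; echo "$*")
    # The trailing comma anchors the full item name (net_dev does not match net).
    grep -E "^(> )?(${names})," || true
}
```

For example: sudo /usr/local/matrixdb/bin/telegraf --config /tmp/telegraf_5432.conf --test | custom_metric_lines net_dev net_interrupts_by_cpu net_interrupts_by_queue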

Once the script is in place, each time telegraf collects system monitoring information it automatically runs the custom script and loads the results into the database. After waiting 1 to 2 minutes, query the local.system table to see the custom monitoring information.

  1. Querying category now shows three more items than before: net_dev, net_interrupts_by_cpu, and net_interrupts_by_queue:
    matrixmgr=# SELECT distinct category FROM local.system ORDER BY 1;
            category
    -------------------------
    cpu
    disk
    diskio
    kernel
    mem
    net
    net_dev
    net_interrupts_by_cpu
    net_interrupts_by_queue
    netstat
    postgresql
    processes
    sar_cpu
    sar_cpu_util
    sar_disk
    sar_hugepages
    sar_inode
    sar_io
    sar_mem
    sar_network
    sar_paging
    sar_queue
    sar_swap
    sar_swap_util
    sar_task
    swap
    system
    (27 rows)

    Query the specific content collected by the custom items:

    matrixmgr=# SELECT * FROM local.system
    WHERE category IN ('net_interrupts_by_cpu', 'net_interrupts_by_queue', 'net_dev')
    ORDER BY ts DESC LIMIT 10;

    The output of the custom scripts has been recorded as structured data in the table, which makes it convenient to query and analyze: ![](https://img.ymatrix.cn/ymatrix_home/Structured Data_1682580020.png)
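
When checking several custom items repeatedly, the query above can be generated from a list of category names. This is a sketch under stated assumptions: build_category_query is a hypothetical helper, and it relies only on the category and ts columns that appear in the query above.

```shell
#!/bin/bash
# build_category_query LIMIT CATEGORY... (hypothetical helper): print a query
# that returns the newest LIMIT rows of local.system for the given categories.
build_category_query() {
    local limit="$1" list="" c
    shift
    [[ "$limit" =~ ^[0-9]+$ ]] || { echo "bad limit: $limit" >&2; return 1; }
    for c in "$@"; do
        # Accept only plain identifiers so nothing needs SQL escaping.
        [[ "$c" =~ ^[A-Za-z0-9_]+$ ]] || { echo "bad category: $c" >&2; return 1; }
        list+="${list:+, }'$c'"
    done
    [ -n "$list" ] || { echo "no categories given" >&2; return 1; }
    echo "SELECT * FROM local.system WHERE category IN ($list) ORDER BY ts DESC LIMIT $limit;"
}
```

The generated statement can be passed to psql, for example: psql -d matrixmgr -c "$(build_category_query 10 net_dev)"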

If you use Prometheus monitoring, you can visit http://<IP>:<port>/metrics to check whether the monitoring information contains the custom items.

2.3 Add to Grafana Panel

After the data is stored in the database or reported to Prometheus, you can create a custom panel on the Grafana Dashboard and design visual charts according to your needs.

This is not the focus of this article, so only simple steps are given.

Here, taking non-Prometheus users as an example, write SQL statements to query the monitoring information: ![](https://img.ymatrix.cn/ymatrix_home/Writing SQL_1682580082.png)

After adding the chart, you can see that our customized monitoring information has been visualized as a line chart: ![](https://img.ymatrix.cn/ymatrix_home/line chart_1682580115.png)

Then click the gear icon to add an alert item:

![](https://img.ymatrix.cn/ymatrix_home/Add Alarm_1682580152.png)