This document describes how to customize monitoring configurations in YMatrix. This feature is supported starting from v4.8.2.
First, you need to understand how to deploy Grafana monitoring or Prometheus monitoring.
This document takes local Grafana monitoring as an example (deployed with SELECT mxmgr_init_local();).
One or two minutes after the monitoring deployment completes, you can see monitoring data arriving in the local.system table under the matrixmgr database.
This information is collected by telegraf and inserted into the table via mxgate. The mxmgr_init_local() function deploys the telegraf and mxgate processes as background services. They appear as the mxmgr_gate_ctrl and mxmgr_telegraf_ctrl services in the output of the supervisorctl status command:
$ supervisorctl status
Status:
1. pc_id:{group:"mxui_collector" name:"mxui_collector"} describe:"pid 8223, uptime 2:27:36" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui_collector_5432.log" stdout_log_file:"/var/log/matrixdb/mxui_collector_5432.log" pid:8223
2. pc_id:{group:"mxmgr_gate_ctrl" name:"mxmgr_gate_ctrl"} describe:"pid 10295, uptime 2:25:28" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_gate_ctrl_5432.log" pid:10295
3. pc_id:{group:"mxmgr_telegraf_ctrl" name:"mxmgr_telegraf_ctrl"} describe:"pid 10350, uptime 2:25:26" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" stdout_log_file:"/var/log/matrixdb/mxmgr_telegraf_ctrl_5432.log" pid:10350
4. pc_id:{group:"cylinder" name:"cylinder"} describe:"pid 6038, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/cylinder.log" stdout_log_file:"/var/log/matrixdb/cylinder.log" pid:6038
5. pc_id:{group:"mxui" name:"mxui"} describe:"pid 6041, uptime 2:33:30" now:1682517012 state:"Running" log_file:"/var/log/matrixdb/mxui.log" stdout_log_file:"/var/log/matrixdb/mxui.log" pid:6041
The information stored in the local.system table is mainly used to display cluster monitoring charts on Grafana dashboards.
The local.system table collects the following 24 categories of system runtime information. If you have extensive operations and maintenance experience, you can explore the contents yourself:
matrixmgr=# select distinct category from local.system order by 1;
category
------------------------
cpu
disk
diskio
kernel
mem
net
netstat
postgresql
processes
sar_cpu
sar_cpu_util
sar_disk
sar_hugepages
sar_inode
sar_io
sar_mem
sar_network
sar_paging
sar_queue
sar_swap
sar_swap_util
sar_task
swap
system
(24 rows)
Starting with v4.8.2, YMatrix lets users develop custom monitoring items. You can define monitoring items yourself and have them inserted into the local.system table or reported to Prometheus. Once that works, you can build your own dashboard panels and alert rules in Grafana.
After the monitoring described above is deployed, you can find the custom monitoring scripts in the following directory: /etc/matrixdb/scripts/.
This directory includes a monitor_plugins/ subdirectory. The following two example custom monitoring scripts are placed in it:
Create a nic.sh script to display network interface card (NIC) statistics:
#!/bin/bash
style=grafana
#########################################################################################################################################################################
# NIC statistics
#########################################################################################################################################################################
nic_stats_output() {
    METRIC="net_dev";
    # Iterate over all network interfaces except loopback, docker and bond devices
    for NIC in `ip link|grep mtu|awk -F ':' '{print $2}'|grep -vE "lo|docker|bond"`
    do
        VAL=""
        for f in $(ls /sys/class/net/$NIC/statistics/); do
            v=$(cat /sys/class/net/$NIC/statistics/$f);
            # Prometheus style: one line per counter
            if [ "#$style" == "#prometheus" ];then
                echo "matrixdb,device=$NIC,metric=$f $METRIC=$v"
                continue
            fi
            # Grafana style: collect all counters into one comma-separated list
            if [ ! -z "$VAL" ];then
                VAL+=","
            fi
            VAL+="$f=$v";
        done
        if [ "#$style" == "#grafana" ];then
            echo "$METRIC,device=$NIC $VAL"
        fi
    done
}
nic_stats_output
The above script outputs the statistics found under /sys/class/net/ on the machine. When run on its own, the output looks like this:
net_dev,device=eth0 collisions=0,multicast=0,rx_bytes=54701811772,rx_compressed=0,rx_crc_errors=0,rx_dropped=0,rx_errors=0,rx_fifo_errors=0,rx_frame_errors=0,rx_length_errors=0,rx_missed_errors=0,rx_nohandler=0,rx_over_errors=0,rx_packets=328974378,tx_aborted_errors=0,tx_bytes=89613462060,tx_carrier_errors=0,tx_compressed=0,tx_dropped=0,tx_errors=0,tx_fifo_errors=0,tx_heartbeat_errors=0,tx_packets=283871697,tx_window_errors=0
Each line of script output follows this rule:
The line begins with the category name of the monitoring item. In the example above, this is net_dev.
The category name is followed by a comma and a key=value tag that identifies the device the line describes. In the example above, this is device=eth0.
After a space come comma-separated key=value pairs, which hold the specific monitoring values.
Create an interrupts.sh script to display system interrupt information:
#!/bin/bash
style=grafana
#########################################################################################################################################################################
# Hardware Interrupts
#########################################################################################################################################################################
interrupts_output() {
    PATTERN=$(awk -F ':' '{i++; if(i>2){print $1}}' /proc/net/dev | sed 's/ //g' | tr '\n' '|' | sed 's/|$//')
    egrep "$PATTERN" /proc/interrupts | awk -v style="#$style" \
    '{ for (i=2;i<=NF-2;i++) sum[i]+=$i;}
    END {
        for (i=2;i<=NF-2; i++)
        {
            if(style=="#prometheus"){
                print("matrixdb,device=cpu" i-2 " net_interrupts_by_cpu="sum[i]);
                continue;
            }
            val=sprintf(val "cpu" i-2 "=" sum[i]);
            if(i!=NF-2 )
                val=sprintf(val ",");
        }
        if(style=="#grafana")
            print("net_interrupts_by_cpu,device=all " val)
    }'
    egrep "$PATTERN" /proc/interrupts | awk -v style="#$style" \
    '{ for (i=2;i<=NF-2; i++)
           sum+=$i;
       tags=sprintf("%s", $NF);
       if (NR!=1)
           val=sprintf(val ",");
       val=sprintf(val tags "=" sum);
       if(style=="#prometheus"){
           print("matrixdb,device=" $NF " net_interrupts_by_queue=" sum)
       }
       sum=0;
    } END{ if(style=="#grafana") print("net_interrupts_by_queue,device=all " val) }'
}
interrupts_output
The result of a single execution is:
net_interrupts_by_cpu,device=all cpu0=284551104,cpu1=308556439
net_interrupts_by_queue,device=all eth0-Tx-Rx-0=298072844,eth0-Tx-Rx-1=295034700
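Other metrics can follow the same pattern as the two scripts above. The following is a minimal sketch of a third plugin. The file name loadavg.sh, the category name load_avg, and the field names load1, load5, and load15 are hypothetical and are not shipped with YMatrix. It reads /proc/loadavg and emits the values in the same two styles used by nic.sh.

```shell
#!/bin/bash
# Hypothetical plugin sketch: the file name (loadavg.sh), the category name
# (load_avg) and the field names are illustrative, not part of YMatrix.
style=grafana

# Print the three load averages passed as arguments in the plugin output format.
load_avg_format() {
    local METRIC="load_avg"
    if [ "#$style" == "#prometheus" ]; then
        # One line per value, mirroring the prometheus branch of nic.sh.
        echo "matrixdb,device=all,metric=load1 $METRIC=$1"
        echo "matrixdb,device=all,metric=load5 $METRIC=$2"
        echo "matrixdb,device=all,metric=load15 $METRIC=$3"
    elif [ "#$style" == "#grafana" ]; then
        # One line holding all values, mirroring the grafana branch of nic.sh.
        echo "$METRIC,device=all load1=$1,load5=$2,load15=$3"
    fi
}

# Read the first three fields of /proc/loadavg and format them.
load_avg_output() {
    local l1 l5 l15 rest
    read -r l1 l5 l15 rest < /proc/loadavg
    load_avg_format "$l1" "$l5" "$l15"
}

# Only emit output on systems that expose /proc/loadavg.
if [ -r /proc/loadavg ]; then
    load_avg_output
fi
```

Keeping the formatting in its own function makes it possible to check the output format with fixed values before the script is placed under monitor_plugins/.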
Note that scripts placed under monitor_plugins/ must have execute permission:
$ ls -l /etc/matrixdb/scripts/monitor_plugins/
-rwxr-xr-x 1 root root 1491 Apr 26 12:51 interrupts.sh
-rwxr-xr-x 1 root root 855 Apr 26 12:45 nic.sh
Removing the execute permission disables a script: it will no longer be executed during periodic collection.
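To see at a glance which scripts would be collected, a small helper along the following lines may be useful. This is a sketch, not a YMatrix tool: the function name list_plugins is hypothetical, and it assumes that "enabled" means nothing more than the file having execute permission, as described above.

```shell
#!/bin/bash
# List the files in a plugin directory and report whether each one is
# executable. Assumption: an executable file counts as an enabled plugin.
list_plugins() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue   # skip subdirectories and an unmatched glob
        if [ -x "$f" ]; then
            echo "enabled  $(basename "$f")"
        else
            echo "disabled $(basename "$f")"
        fi
    done
}

# chmod -x disables a script and chmod +x enables it again, for example:
#   chmod -x /etc/matrixdb/scripts/monitor_plugins/nic.sh
#   chmod +x /etc/matrixdb/scripts/monitor_plugins/nic.sh
list_plugins "${1:-/etc/matrixdb/scripts/monitor_plugins}"
```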
After placing the custom monitoring scripts, confirm that their output format is correct by running telegraf in test mode.
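The line format can also be checked directly, before telegraf is involved. The sketch below reads script output on standard input and flags lines that do not look like a category name, one or more tags, a space, and one or more key=value pairs. The function name check_plugin_output is hypothetical, and the regular expression is an approximation written for this example, not an official grammar.

```shell
#!/bin/bash
# Check that each line on stdin looks like:
#   category,tag=value[,tag=value...] key=value[,key=value...]
# The pattern is an approximation for illustration only.
check_plugin_output() {
    local kv='[^ ,=]+=[^ ,=]+'
    local re="^[A-Za-z_][A-Za-z0-9_]*(,$kv)+ $kv(,$kv)*\$"
    local line status=0
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            echo "OK   $line"
        else
            echo "BAD  $line"
            status=1
        fi
    done
    # Non-zero exit status if any line failed the check.
    return $status
}

# Usage, with the script path from the examples above:
#   /etc/matrixdb/scripts/monitor_plugins/nic.sh | check_plugin_output
```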
To run the test, first find the telegraf configuration file name under /tmp:
$ ls -l /tmp | grep telegraf
-rw-r--r-- 1 root root 12676 Apr 26 11:24 telegraf_5432.conf
Run telegraf in test mode:
$ sudo /usr/local/matrixdb/bin/telegraf --config /tmp/telegraf_5432.conf --test
In the resulting output, you can see that the output of the custom scripts is already included.
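Because the test output covers every collected metric, it can help to filter it down to the custom items. The following sketch assumes that telegraf prints each metric in test mode as a line starting with "> " followed by the metric name, and that the custom category names appear there unchanged. The function name filter_test_output is hypothetical.

```shell
#!/bin/bash
# Keep only the test-mode lines whose metric name is one of the arguments.
# Assumption: each metric is printed as "> name,tags fields timestamp".
filter_test_output() {
    local names
    # Join the arguments with "|" to build a regex alternation.
    names=$(IFS='|'; echo "$*")
    grep -E "^> ($names)[, ]"
}

# Usage, with the command and category names from this document:
#   sudo /usr/local/matrixdb/bin/telegraf --config /tmp/telegraf_5432.conf --test \
#       | filter_test_output net_dev net_interrupts_by_cpu net_interrupts_by_queue
```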
Once the scripts are in place, they are executed automatically each time telegraf collects system monitoring information, and their output is loaded into the database. After waiting 1 to 2 minutes, query the local.system table to see the custom monitoring information.
The category column now shows three more custom items than before (net_dev, net_interrupts_by_cpu, and net_interrupts_by_queue):
matrixmgr=# SELECT distinct category FROM local.system ORDER BY 1;
category
----------------------------------------------------------------------------------------------------------------------------------
cpu
disk
diskio
kernel
mem
net
net_dev
net_interrupts_by_cpu
net_interrupts_by_queue
netstat
postgresql
processes
sar_cpu
sar_cpu_util
sar_disk
sar_hugepages
sar_inode
sar_io
sar_mem
sar_network
sar_paging
sar_queue
sar_swap
sar_swap_util
sar_task
swap
system
(27 rows)
Query the specific content collected by the custom items:
matrixmgr=# SELECT * FROM local.system
WHERE category IN ('net_interrupts_by_cpu', 'net_interrupts_by_queue', 'net_dev')
ORDER BY ts DESC LIMIT 10;
As you can see, the output of the custom scripts has been recorded as structured data in the table, which makes it convenient to query and analyze.
If you use Prometheus monitoring, you can visit http://<IP>:<port>/metrics to check whether the monitoring information contains the custom items.
After the data is stored in the database or reported to Prometheus, you can create custom panels on a Grafana dashboard and design visual charts according to your needs.
This is not the focus of this document, so only brief steps are given.
For non-Prometheus users, for example, write an SQL statement in the panel to query the monitoring information.
After adding the chart, you can see that the custom monitoring information is visualized as a line chart.
Then click the gear icon to add an alert item.
