Interpretation of MatrixDB monitoring parameters

1. Overview section

Overview section The overview section displays the overall operating status of the cluster, including:

| Parameter name | Description | Reference Alarm Threshold | | --- | --- | |Cluster status | Cluster node status, including:
0: Normal
1: No Standby
2: No Mirror
10: Distribution unbalanced (Some nodes are down and restored, and the master-slave role is not rebalanced)
11: There are master-slave asynchronous nodes (Some mirror nodes are not synchronous with primary)
12: Only Master (The cluster only starts the Master node, usually used during diagnosis)
20: Segment downtime (There is an unavailable Segment node, the cluster is not available) | Segment downtime is a serious event, and an alarm is required | |Runtime|Includes MatrixDB's run time since startup and master host operating system run time| | | Version | Version of MatrixDB | | |Connection status |Connection status displays the number of connections in the database system, including: total number of connections, number of blocked connection queries, number of idle connections, number of idle connections in transactions | | |Slow query|In the current system, the number of queries that have been executed for more than 1 day| greater than 0 means that there are particularly slow queries and an alarm is required| |Transactions | Statistics on transaction submission and rollback count | Rollback alarm threshold can be set | |Disk usage |Disk usage and remaining space for master nodes and segment nodes | Alarms are recommended to set directly in node_exporter | |Node status|State of each node, including:
0: UP (Normal)
10: Switched (Role swap, indicating that master-slave switching has occurred and needs to be rebalanced)
11: Resync (Master-slave synchronization)
20: Down (Downtime) | 11 and 20 need to add alarms |

2. Database Performance section

Database Performance Section The Database Performance section demonstrates database performance, including:

| Parameter name | Description | Reference Alarm Threshold | | --- | --- | |Page Hit Ratio|Hit Buffer Ratio when reading a data page| | |Temp Size|Temp file usage| | |Deadlocks|Number of deadlocks|Automatically greater than 0| |Checksum Failures|Number of data page verification failure|Automatically greater than 0| |Sessions Per Database|Number of connections per database| | |Page Cache Hit|blks_hit: Number of hit caches when reading data pages
blks_read: Number of times cache missed and disks to be read| | |Rows Read|Query read and return tuple number| | |Checkpoints|checkpoints trigger times, including:
checkpoints_req: manual trigger
checkpoints_timed: periodic trigger| | |Replication Latency|Master-slave replication delay, unit ms
write_lag: delay in log writing to mirror file cache
flush_lag: delay in log flushing to mirror disk
replay_lag: delay in log playback completion
Top Segment: All nodes write_lag+flush_lag+replay_lag delay and maximum value|Alarm threshold can be set according to the situation| |Rows Insert/Update/Delete|Rows Insert: Insert number of rows
Rows Update: Updating number of rows
Rows Delete: Delete number of rows| | |Checkpoint buffers|Dirty page writing statistics
buffers_checkpoint: checkpoint number of dirty pages
buffers_clean: bgwriter number of dirty pages
buffers_backend: backend process writing number of dirty pages| | |Top 10 Replication Lag Size|Statistics of the delay amount of the Top 10 nodes, the calculation method is the difference between the sent lsn and the replay lsn|The alarm threshold can be set according to the situation|

3. Storage section

Storage section Storage section displays storage-related statistics, including:

| Parameter name | Description | Reference Alarm Threshold | | --- | --- | |Top 10 Databases|Top 10 Databases| | |Top 10 Users|Top 10 User Generated Data Size| | |Top 10 Aging Database|Top10 Database Age (transaction ID less than this value is replaced by Frozen)| | |Top 10 Big Tables|Top 10 Database Table Size| | |Top 10 Big Partitions|Top 10 Partition Table Size| | |Top 10 Growth Today|Top 10 table size increments on the day| | |Top 10 Growth Last 7 Days|Top 10 table size increments in the past 7 days| |