The Overview section displays the overall operating status of the cluster, including:
| Parameter name | Description | Reference Alarm Threshold |
| --- | --- | --- |
| Cluster status | Cluster node status, including:<br>0: Normal<br>1: No Standby<br>2: No Mirror<br>10: Distribution unbalanced (some nodes went down and were restored, but the master-slave roles have not been rebalanced)<br>11: Master-slave asynchronous nodes exist (some mirror nodes are not synchronized with their primary)<br>12: Only Master (only the Master node of the cluster is started; usually used during diagnosis)<br>20: Segment downtime (a Segment node is unavailable and the cluster is not available) | Segment downtime is a serious event and requires an alarm |
| Runtime | Uptime of MatrixDB since startup and uptime of the master host operating system | |
| Version | Version of MatrixDB | |
| Connection status | Number of connections in the database system, including: total connections, connections with blocked queries, idle connections, and idle-in-transaction connections | |
| Slow query | Number of queries in the current system that have been running for more than 1 day | A value greater than 0 means there are particularly slow queries and an alarm is required |
| Transactions | Transaction commit and rollback counts | A rollback alarm threshold can be set |
| Disk usage | Disk usage and remaining space for master nodes and segment nodes | Setting alarms directly in node_exporter is recommended |
| Node status | State of each node, including:<br>0: UP (normal)<br>10: Switched (role swap; a master-slave switch has occurred and the roles need to be rebalanced)<br>11: Resync (master-slave synchronization in progress)<br>20: Down (downtime) | Alarms should be added for 11 and 20 |
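
The status codes in the Overview table can be turned into alert decisions with a small lookup. The following is a minimal sketch, not part of MatrixDB: the names `CLUSTER_STATUS`, `NODE_STATUS`, and `describe` are hypothetical, and treating an unknown code as alarm-worthy is an assumption made here for safety.

```python
# Status codes and labels copied from the Overview table.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Distribution unbalanced",
    11: "Master-slave asynchronous nodes",
    12: "Only Master",
    20: "Segment downtime",
}
NODE_STATUS = {0: "UP", 10: "Switched", 11: "Resync", 20: "Down"}

# Codes that the table marks as requiring an alarm.
CLUSTER_ALARM_CODES = {20}
NODE_ALARM_CODES = {11, 20}


def describe(code, labels, alarm_codes):
    """Return (label, needs_alarm) for a status code.

    Assumption: a code missing from the table is reported as unknown
    and flagged for an alarm.
    """
    label = labels.get(code, f"Unknown ({code})")
    needs_alarm = code in alarm_codes or code not in labels
    return label, needs_alarm
```

For example, `describe(20, CLUSTER_STATUS, CLUSTER_ALARM_CODES)` yields the label for segment downtime together with an alarm flag, while a node status of 10 (Switched) is reported without an alarm.
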

The Database Performance section shows database performance metrics, including:
| Parameter name | Description | Reference Alarm Threshold |
| --- | --- | --- |
| Page Hit Ratio | Buffer hit ratio when reading data pages | |
| Temp Size | Temporary file usage | |
| Deadlocks | Number of deadlocks | Alarm when greater than 0 |
| Checksum Failures | Number of data page checksum verification failures | Alarm when greater than 0 |
|Sessions Per Database|Number of connections per database| |
| Page Cache Hit | blks_hit: number of cache hits when reading data pages<br>blks_read: number of cache misses that required a disk read | |
| Rows Read | Number of tuples read and returned by queries | |
| Checkpoints | Number of times checkpoints were triggered, including:<br>checkpoints_req: triggered on request (for example, manually)<br>checkpoints_timed: triggered periodically | |
| Replication Latency | Master-slave replication delay, in ms<br>write_lag: delay until the log is written to the mirror's file cache<br>flush_lag: delay until the log is flushed to the mirror's disk<br>replay_lag: delay until log replay completes<br>Top Segment: the maximum, across all nodes, of the sum write_lag + flush_lag + replay_lag | The alarm threshold can be set according to the situation |
| Rows Insert/Update/Delete | Rows Insert: number of rows inserted<br>Rows Update: number of rows updated<br>Rows Delete: number of rows deleted | |
| Checkpoint buffers | Dirty page write statistics<br>buffers_checkpoint: number of dirty pages written by checkpoints<br>buffers_clean: number of dirty pages written by bgwriter<br>buffers_backend: number of dirty pages written by backend processes | |
| Top 10 Replication Lag Size | Replication lag size of the top 10 nodes, calculated as the difference between the sent LSN and the replay LSN | The alarm threshold can be set according to the situation |
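
Several of the metrics in the Database Performance table are simple derived values. The sketch below illustrates how they could be computed by hand; it is not the dashboard's actual implementation. It assumes that the hit ratio is `blks_hit / (blks_hit + blks_read)`, that LSNs use the PostgreSQL text form `high/low` in hexadecimal, and that lag values are given in milliseconds. All function names are hypothetical.

```python
def page_hit_ratio(blks_hit: int, blks_read: int) -> float:
    """Fraction of page reads served from cache (assumed formula)."""
    total = blks_hit + blks_read
    return blks_hit / total if total else 0.0


def parse_lsn(lsn: str) -> int:
    """Convert an LSN in 'high/low' hex text form to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Lag size: difference between the sent LSN and the replay LSN."""
    return parse_lsn(sent_lsn) - parse_lsn(replay_lsn)


def top_segment_lag_ms(lags: dict) -> float:
    """Largest write_lag + flush_lag + replay_lag sum across all nodes.

    lags maps a node name to a (write_lag, flush_lag, replay_lag) tuple in ms.
    """
    return max(sum(values) for values in lags.values())
```

A node whose sent LSN is `0/3000060` and whose replay LSN is `0/3000000` would be 96 bytes behind under these assumptions. Any alarm threshold applied to these values remains a site-specific choice, as the table notes.
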

The Storage section displays storage-related statistics, including:
| Parameter name | Description | Reference Alarm Threshold |
| --- | --- | --- |
| Top 10 Databases | Top 10 databases | |
| Top 10 Users | Top 10 users by size of data generated | |
| Top 10 Aging Database | Top 10 databases by age (transaction IDs lower than this value have been replaced by Frozen) | |
| Top 10 Big Tables | Top 10 tables by size | |
| Top 10 Big Partitions | Top 10 partition tables by size | |
| Top 10 Growth Today | Top 10 tables by size growth today | |
| Top 10 Growth Last 7 Days | Top 10 tables by size growth over the past 7 days | |
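
The growth panels rank tables by how much their size has increased over a period. How the dashboard collects the sizes is not described here; the following is only a sketch that assumes two size snapshots (table name to bytes) are available. The function name `top_growth` is hypothetical.

```python
def top_growth(before: dict, after: dict, n: int = 10) -> list:
    """Return the n tables with the largest size increase.

    before/after map a table name to its size in bytes at the start and
    end of the period. Assumption: a table missing from `before` is
    treated as having had size 0.
    """
    growth = {name: size - before.get(name, 0) for name, size in after.items()}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:n]
```

Calling it with snapshots taken at midnight and now would approximate a "growth today" ranking; snapshots 7 days apart would approximate the weekly one.
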