Grafana Monitoring Indicators Interpretation

This document introduces YMatrix-related metrics and reference alarm thresholds in the Grafana monitoring panel.

Alarm level description

  • p0: Must be handled immediately; the cluster is no longer available.
  • p1: Must be handled as soon as possible; if left unhandled for a short time, cluster use may be affected.
  • p2: Needs attention; if left unhandled for a long time, cluster use may be affected.
  • p3: Does not affect cluster use; configure as needed.

Note!
For metrics that have no reference alarm threshold, decide on and configure alarm conditions based on your actual situation.

1 YMatrix Database

1.1 Overview

This section shows the overall operating status of the cluster, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Cluster status | Cluster status code:<br>0: Normal<br>1: No Standby<br>2: No Mirror<br>10: Unbalanced distribution (some nodes have not been rebalanced after going down and recovering)<br>11: Primary/Mirror out of sync (some Mirror nodes are not synchronized with their Primary)<br>12: Master only (only the Master node is started; usually used during diagnosis)<br>20: Segment down (a Segment node is unavailable and the cluster is not available) | short | p0 | 20: a Segment going down is a serious event and requires an alarm |
| uptime | Uptime, including YMatrix run time since startup and Master host operating system run time | seconds (s) | | |
| version | YMatrix version | | | |
| Connection status | Number of connections in the database system: total connections (Total), connections whose queries are blocked (Blocked), idle connections (Idle), and connections idle in a transaction (Idle in TXN) | short | | |
| Long queries | Slow queries: the number of queries in the current system whose execution time exceeds 1 day | short | p3 | Greater than 0 means that particularly slow queries exist and an alarm is required |
| Node status | State of each node:<br>0: UP (normal)<br>10: Switched (roles swapped; a Primary/Mirror switchover has occurred and a rebalance is needed)<br>11: Resync (Primary/Mirror synchronization in progress)<br>20: Down | short | | Add alarms for the values 11 and 20 |
| license_expire_date | Time remaining until the license expires | seconds (s) | p3/p2 | 30 days / 15 days. Expiration makes some components unavailable, so handle it as soon as possible |
| Disk Space in Use | Disk usage of a Master node or Segment node instance | 0-1 | | Setting alarms directly in node_exporter is recommended |
| Available | Free disk space of a Master node or Segment node instance | 0-1 | | Setting alarms directly in node_exporter is recommended |
| CPU | Host CPU usage | 0-1 | | |
| Memory | Memory usage information | 0-1 | | |
| Load | Host load | short | | |
| Transactions | | short | | An alarm threshold can be set on current rollbacks |
| DiskIO | Amount of data written to disk | bytes | | |
| Network | Network data transfer volume | bytes | | |
| Process | Number of processes in each state | short | | |
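
The status codes and license thresholds above can be turned into a small lookup when wiring up alarm rules. The following is a minimal sketch, not part of YMatrix or Grafana: the function names are hypothetical, the labels paraphrase the table, and it assumes that p3 applies at 30 days or less of remaining license time and p2 at 15 days or less.

```python
from typing import Optional, Tuple

# Status codes copied from the "Cluster status" and "Node status" rows.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Unbalanced distribution",
    11: "Primary/Mirror out of sync",
    12: "Master only",
    20: "Segment down",
}
NODE_STATUS = {0: "UP", 10: "Switched", 11: "Resync", 20: "Down"}

# Values the table recommends alarming on.
CLUSTER_ALARM_CODES = {20}
NODE_ALARM_CODES = {11, 20}

SECONDS_PER_DAY = 86_400


def describe_cluster_status(code: int) -> Tuple[str, bool]:
    """Return (label, should_alarm) for a cluster status code."""
    label = CLUSTER_STATUS.get(code, f"Unknown ({code})")
    return label, code in CLUSTER_ALARM_CODES


def describe_node_status(code: int) -> Tuple[str, bool]:
    """Return (label, should_alarm) for a node status code."""
    label = NODE_STATUS.get(code, f"Unknown ({code})")
    return label, code in NODE_ALARM_CODES


def license_alarm_level(seconds_remaining: float) -> Optional[str]:
    """Map license_expire_date (seconds) to an alarm level.

    Assumption: p3 at <= 30 days remaining, p2 at <= 15 days remaining.
    """
    days = seconds_remaining / SECONDS_PER_DAY
    if days <= 15:
        return "p2"
    if days <= 30:
        return "p3"
    return None
```

For example, `describe_node_status(11)` reports the Resync state as alarm-worthy, while `describe_cluster_status(12)` returns the Master-only label without an alarm.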

1.2 Disk Performance

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Top 10 Disk %Util | Top 10 disks by utilization | 0-1 | | Configuring it in node_exporter is recommended |
| Disk Throughput | | bytes | | Configuring it in node_exporter is recommended |
| Disk IOPS | Disk read and write operations (blue: read, orange: write; values are absolute) | I/O ops/sec | | Configuring it in node_exporter is recommended |

1.3 Network Performance

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| NetStat | NetStat | short | | Configuring it in node_exporter is recommended |
| Network Throughput | Network throughput (blue: receive, orange: transmit; values are absolute) | bytes | | Configuring it in node_exporter is recommended |
| Network IO | Network I/O operations (green: receive, yellow: send; values are absolute) | io/s | | Configuring it in node_exporter is recommended |
| Packet Loss/Sec | Number of packets dropped due to insufficient kernel buffer space | short | p3 | Configuring it in node_exporter is recommended |
| Packet Error/Sec | Number of packets that failed to be sent or received | packet/s | | Configuring it in node_exporter is recommended |

1.4 System Performance

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| IO TPS | Total number of transfers per second issued to physical disks. A transfer is an I/O request to a physical device. Multiple logical requests can be merged into one I/O request to the device. The size of a transfer is indeterminate | iops | | |
| Context Switches/Sec | Kernel context switches per second (maximum across hosts and average of all hosts) | short | | |
| Memory | Used - Percentage of memory in use<br>Buff/Cache - Percentage of memory used for buffers and cache | 0-1 | | |
| Page Statistics | PageIn/s - Total size of pages the system reads from disk per second<br>PageOut/s - Total size of pages the system writes to disk per second<br>Note: In kernel versions 2.2.x and earlier, this value is the number of pages, not the total page size | KB | | |
| IO Throughput | Read - Number of blocks read from disk per second<br>Write - Number of blocks written to disk per second<br>In kernel versions 2.4 and newer, a block is equivalent to a sector, that is, 512 bytes. In earlier kernel versions the block size is indeterminate | iops | | |
| Process Forked/Sec | Number of processes forked per second | short | | |
| Commit Memory | Memory needed for the current workload. This value may be greater than 100%, because the kernel usually overcommits memory | 0-1 | p3/p2 | 60% / 80%. Memory is insufficient; if no OOM protection is set, processes may be killed by the OOM killer |
| Page Faults | fault/s - Page faults per second on the system. Page faults do not necessarily cause I/O, because some page faults can be resolved without initiating I/O operations<br>majflt/s - Page faults that require a memory page to be loaded from disk | short | | |
| File Handles | Number of file handles used by the system | short | | |
| Interrupts/Sec | Interrupts | short | | |
| Memory Statistics | frmpg/s - Number of memory pages freed by the system per second. A negative value is the number of pages allocated by the system<br>bufpg/s - Number of additional memory pages used as buffers by the system per second. A negative value means fewer pages are used as buffers<br>campg/s - Number of additional memory pages cached by the system per second. A negative value means fewer pages are cached<br>Note: Depending on the machine architecture, the page size may be 4 KB or 8 KB | page | | |
| Swap Activity | Number of pages swapped in and out of the swap partition per second | page | | |
| Load | Load1 - System load average over the last 1 minute. The load average is the average number of tasks in the runnable, running, and uninterruptible sleep states<br>Load5 - System load average over the last 5 minutes<br>Load15 - System load average over the last 15 minutes | short | p3/p2 | CPU cores × 3 / CPU cores × 5 |
| Run Queue | Length of the run queue, that is, the number of tasks waiting to run (purple: maximum across all hosts, green: average of all hosts) | short | | |
| Hugepage Used | Huge page memory usage | 0-1 | | |
| %vmeff | Ratio of pages reclaimed to pages scanned. A higher value means that most pages are reclaimed and freed after being scanned. A value of 100% means that every scanned page is reclaimed. A low value (less than 30%) means that memory is difficult to free. If no pages are scanned, the value is 0. The value is therefore preferably 0 or 100% | 0-1 | | |
| iNodes | Number of inode handles used by the system | short | p3 | |
| Pseudo Terminals | Number of pseudo terminals used by the system | short | | |
| Unused Cache Entries | Number of unused cache entries in the directory cache (pink: minimum host value, yellow: average of all hosts) | short | | |
| Entropy Available | The system gathers "real" random numbers from various events, such as network activity and hardware random number generators, and feeds them into the kernel entropy pool used by /dev/random. Applications with very high security requirements tend to use /dev/random as their entropy source, or randomness source<br>If /dev/random runs out of available entropy, it cannot provide more randomness, and applications waiting for randomness block until more random material is available | short | | |
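
The Load and Commit Memory thresholds and the %vmeff ratio from the table can be expressed as simple checks. This is a minimal sketch with hypothetical function names. It assumes the Load thresholds mean the CPU core count multiplied by 3 (p3) and by 5 (p2), that an alarm fires at or above a threshold, and that %vmeff is computed from reclaimed and scanned page counts supplied by the caller.

```python
from typing import Optional


def load_alarm_level(load1: float, cpu_cores: int) -> Optional[str]:
    """Classify a 1-minute load average.

    Assumption: p3 at cpu_cores * 3, p2 at cpu_cores * 5.
    """
    if load1 >= cpu_cores * 5:
        return "p2"
    if load1 >= cpu_cores * 3:
        return "p3"
    return None


def commit_alarm_level(commit_ratio: float) -> Optional[str]:
    """Classify Commit Memory on the 0-1 scale (it may exceed 1).

    Thresholds from the table: 0.6 -> p3, 0.8 -> p2.
    """
    if commit_ratio >= 0.8:
        return "p2"
    if commit_ratio >= 0.6:
        return "p3"
    return None


def vmeff(pages_reclaimed: float, pages_scanned: float) -> float:
    """Ratio of pages reclaimed to pages scanned; 0 when nothing was scanned."""
    if pages_scanned == 0:
        return 0.0
    return pages_reclaimed / pages_scanned
```

On an 8-core host, for instance, a Load1 of 25 falls between 24 and 40 and would be classified as p3 under these assumptions.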

2 YMatrix Database

The YMatrix Database interface includes two sections: Database Performance and Storage.

2.1 Database Performance

This section shows database performance, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Page Hit Ratio | Ratio of HEAP table read operations that hit the block cache to the total number of read operations. (The cache only includes the cache maintained for HEAP tables and does not include the operating system cache)<br>The displayed value is the current value, and the curve shows historical values<br>The value should usually be above 90% | 0-1 | | |
| Temp Size | Total amount of data written to temporary files by queries in the database. All temporary files are counted, regardless of why the temporary file was created and regardless of the log_temp_files setting | bytes | | |
| Sessions Per Database | Number of sessions per database | short | p2/p1 | 60% / 80% of the maximum number of connections |
| Activities | Number of sessions in each state | short | | |
| Deadlocks | Number of deadlocks detected | short | | Alarm when greater than 0 |
| Checksum Failures | | short | p3 | |
| Rows Read | Number of rows read | short | | |
| Checkpoints | Checkpoint statistics (orange: number of checkpoints that were explicitly requested, green: number of checkpoints triggered automatically by timeout) | short | | |
| Page Cache Hit | blks_hit: Number of times a data page read hit the cache<br>blks_read: Number of times the cache was missed and the disk had to be read | | | |
| Replication Latency | write_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written it (but not yet flushed or applied it). If a Standby/Mirror is configured, this can be used to gauge the commit delay incurred when synchronous_commit is set to remote_write<br>flush_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written and flushed it (but not yet applied it). If a Standby/Mirror is configured, this can be used to gauge the commit delay incurred when synchronous_commit is set to on<br>replay_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written, flushed, and applied it. If a Standby/Mirror is configured, this can be used to gauge the commit delay incurred when synchronous_commit is set to remote_apply | milliseconds (ms) | p3 | Suggested value: 10s<br>With synchronous Primary/Mirror replication, excessive latency can slow down write transactions |
| Rows Insert/Update/Delete | Rows Insert: Number of rows inserted<br>Rows Update: Number of rows updated<br>Rows Delete: Number of rows deleted | short | | |
| Checkpoint buffers | buffers_checkpoint - Number of buffers written during checkpoints<br>buffers_clean - Number of buffers written by the background writer process<br>buffers_backend - Number of buffers written directly by backend (worker) processes | short | | |
| Top 10 Replication Lag Size | Top 10 by WAL size of replication lag | bytes | p3 | 1GB |
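
As an illustration of how two of these metrics relate to their thresholds, the sketch below derives a hit ratio from the blks_hit and blks_read counters and classifies a session count against the maximum number of connections. The function names are hypothetical, and treating the hit ratio as blks_hit / (blks_hit + blks_read) is an assumption about how the panel is computed rather than something this document states.

```python
from typing import Optional


def page_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Share of page reads served from cache, on the 0-1 scale.

    Assumption: total reads = blks_hit + blks_read.
    Returns None when there have been no reads at all.
    """
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


def sessions_alarm_level(sessions: int, max_connections: int) -> Optional[str]:
    """Classify Sessions Per Database: 60% of max -> p2, 80% of max -> p1."""
    ratio = sessions / max_connections
    if ratio >= 0.8:
        return "p1"
    if ratio >= 0.6:
        return "p2"
    return None
```

With 900 cache hits and 100 disk reads the ratio is 0.9, which sits right at the 90% level the table describes as the usual requirement.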

2.2 Storage

This section displays storage-related statistics, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Top 10 Database | Top 10 databases by size | bytes | | |
| Top 10 Users | Top 10 users by data size | bytes | | |
| Top 10 Aging Database | Top 10 databases by age. When the database age exceeds 2 billion, the database becomes unavailable | short | p2 | 1500000000 |
| Top 10 Big Tables | Top 10 tables by size | bytes | | |
| Top 10 Big Partitions | Top 10 partitioned tables by size | bytes | | |
| Top 10 Growth Today | The 10 tables with the fastest data growth today | bytes | | |
| Top 10 Growth Last 7 Days | The 10 tables with the fastest data growth over the last 7 days | bytes | | |
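
The database age threshold can be checked with a few lines of code once the ages have been collected. This is a minimal sketch with hypothetical names: the 1,500,000,000 alarm value comes from the table, while the 2,000,000,000 limit is an assumed reading of the age limit given there, and the input dictionary stands in for whatever query or exporter supplies the ages.

```python
from typing import Dict, List, Tuple

P2_AGE_THRESHOLD = 1_500_000_000   # reference alarm threshold from the table
ASSUMED_AGE_LIMIT = 2_000_000_000  # assumed point at which the database becomes unavailable


def top_aging_databases(ages: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n oldest databases as (name, age), oldest first."""
    return sorted(ages.items(), key=lambda item: item[1], reverse=True)[:n]


def needs_age_alarm(age: int) -> bool:
    """True when the age has reached the p2 reference threshold."""
    return age >= P2_AGE_THRESHOLD


def age_headroom(age: int) -> int:
    """Remaining age before the assumed limit, never negative."""
    return max(ASSUMED_AGE_LIMIT - age, 0)
```

Given `{"sales": 1_600_000_000, "logs": 200_000_000}`, the `sales` database would trigger the p2 alarm with 400,000,000 of headroom left under the assumed limit.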