Grafana Monitoring Indicators Interpretation

This document describes the YMatrix-related metrics shown in the Grafana monitoring panels and their reference alarm thresholds.

Alarm level description

p0: Must be handled immediately; the cluster is no longer available.
p1: Must be handled as soon as possible; if left unhandled for a short time, cluster use may be affected.
p2: Needs attention; if left unhandled for a long time, cluster use may be affected.
p3: Does not affect cluster use; configure it as needed.

Note!
Some indicators have no reference alarm threshold. For those, judge and configure alarm conditions based on your actual situation.

1 YMatrix Database

1.1 Overview

This section shows the overall operating status of the cluster, including:

Indicator Name	Description	Unit	Level	Reference Alarm Threshold
Cluster status	Cluster node status, one of: 0: Normal 1: No Standby 2: No Mirror 10: Distribution unbalanced (some nodes have not been rebalanced after going down and recovering) 11: Primary/Mirror out of sync (some Mirror nodes are not synchronized with their Primary) 12: Master only (the cluster has started only the Master node, which is usually done for diagnosis) 20: Segment down (a Segment node is unavailable and the cluster is not available)	short	p0	20: A Segment going down is a serious event and requires an alarm
uptime	Running time. Includes the YMatrix running time since startup and the running time of the Master host operating system	seconds(s)
version	YMatrix version
Connection status	Number of connections in the database system, including: total connections (Total), connections whose queries are blocked (Blocked), idle connections (Idle), and connections idle in a transaction (Idle in TXN)	short
Long queries	Slow queries. The number of queries in the current system whose execution time exceeds 1 day	short	p3	A value greater than 0 means there are extremely slow queries, and an alarm is required
Node status	State of each node, one of: 0: UP (normal) 10: Switched (role swap; a Primary/Mirror switch has occurred and a rebalance is needed) 11: Resync (Primary/Mirror synchronization in progress) 20: Down (node is down)	short		Alarms should be added for the values 11 and 20
license_expire_date	Time remaining before the license expires	seconds(s)	p3/p2	Expiration makes some components unavailable, so handle it as soon as possible. 30 days / 15 days
Disk Space in Use	Disk usage of the Master node or Segment node instance	0-1		Setting alarms directly in node_exporter is recommended
Available	Free disk space of the Master node or Segment node instance	0-1		Setting alarms directly in node_exporter is recommended
CPU	Host CPU Usage	0-1
Memory	Memory usage information	0-1
Load	Host Load	short
Transactions		short		An alarm threshold can be set on the current number of rollbacks
DiskIO	Amount of data written to disk	bytes
Network	Amount of data transmitted over the network	bytes
Process	Number of processes in each state	short
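
The status codes and thresholds in the Overview table can be turned into alarm checks. The following is a minimal Python sketch, not part of YMatrix or Grafana: the function and variable names are hypothetical, and it assumes that "30 days / 15 days" for license_expire_date means p3 at 30 days remaining and p2 at 15 days remaining.

```python
from typing import Optional

# Cluster status codes as listed in the Overview table.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Distribution unbalanced",
    11: "Primary/Mirror out of sync",
    12: "Master only",
    20: "Segment down",
}

# Node status values the table says need alarms: 11 (Resync) and 20 (Down).
NODE_ALARM_STATES = {11, 20}

SECONDS_PER_DAY = 86400


def cluster_alarm(status: int) -> Optional[str]:
    """Return 'p0' for cluster status 20 (Segment down), otherwise None."""
    return "p0" if status == 20 else None


def node_needs_alarm(state: int) -> bool:
    """Return True when a node status value should raise an alarm."""
    return state in NODE_ALARM_STATES


def license_alarm(remaining_seconds: float) -> Optional[str]:
    """Map the remaining license time (seconds) to an alarm level.

    Assumption: p3 at 30 days or less, p2 at 15 days or less.
    """
    days = remaining_seconds / SECONDS_PER_DAY
    if days <= 15:
        return "p2"
    if days <= 30:
        return "p3"
    return None
```

For example, license_alarm(20 * 86400) returns "p3" under these assumptions, because 20 days falls between the two thresholds.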

1.2 Disk Performance

Indicator Name	Description	Unit	Reference Alarm Threshold
Top 10 Disk %Util	Top 10 disks by utilization	0-1	Configuring it in node_exporter is recommended
Disk Throughput		bytes	Configuring it in node_exporter is recommended
Disk IOPS	Number of disk reads and writes (blue is read, orange is write; values are absolute)	I/O ops/sec	Configuring it in node_exporter is recommended

1.3 Network Performance

Indicator Name	Description	Unit	Level	Reference Alarm Threshold
NetStat	NetStat	short		Configuring it in node_exporter is recommended
Network Throughput	Network throughput (blue is receive, orange is transmit; values are absolute)	bytes		Configuring it in node_exporter is recommended
Network IO	Number of network I/O operations (green is receive, yellow is send; values are absolute)	io/s		Configuring it in node_exporter is recommended
Packet Loss/Sec	Number of packets lost due to insufficient kernel buffer space	short	p3	Configuring it in node_exporter is recommended
Packet Error/Sec	Number of packets that failed to be sent or received	packet/s		Configuring it in node_exporter is recommended

1.4 System Performance

Indicator Name	Description	Unit	Level	Reference Alarm Threshold
IO TPS	Total number of transfers per second issued to physical disks. A transfer is an I/O request to a physical device. Multiple logical requests can be merged into a single I/O request to the device, so the size of a transfer is indeterminate	iops
Context Switches/Sec	Kernel context switches per second (maximum across hosts and average of all hosts)	short
Memory	Used - Percentage of memory in use Buff/Cache - Percentage of memory used for buffers and cache	0-1
Page Statistics	PageIn/s - Total size of the pages the system reads in from disk per second PageOut/s - Total size of the pages the system writes out to disk per second Note: In kernel versions 2.2.x and earlier, this value is a number of pages, not a total page size	KB
IO Throughput	Read - Number of blocks read from disk per second Write - Number of blocks written to disk per second In kernel versions 2.4 and newer, a block is equivalent to a sector, which is 512 bytes. In earlier kernel versions the block size is indeterminate	iops
Process Forked/Sec	Number of processes forked per second	short
Commit Memory	Memory needed for the current workload. This value may be greater than 100% because the kernel usually overcommits memory	0-1	p3/p2	60% / 80%. When memory is insufficient and no OOM protection is set, processes may be killed by the OOM killer
Page Faults	fault/s - Number of page faults per second in the system. A page fault does not necessarily cause I/O, because some page faults can be resolved without initiating I/O majflt/s - Number of page faults per second that require a memory page to be loaded from disk	short
File Handles	Number of file handles used by the system	short
Interrupts/Sec	Number of interrupts per second	short
Memory Statistics	frmpg/s - Number of memory pages freed by the system per second. A negative value is the number of pages allocated by the system bufpg/s - Number of additional memory pages used as buffers by the system per second. A negative value means fewer pages are used as buffers campg/s - Number of additional memory pages cached by the system per second. A negative value means fewer pages are cached Note: depending on the machine architecture, the page size may be 4KB or 8KB	page
Swap Activity	Number of pages entering and exiting the swap partition per second	page
Load	Load1 - System load average over the last 1 minute. The load average is the average number of tasks in the runnable, running, and uninterruptible sleep states Load5 - System load average over the last 5 minutes Load15 - System load average over the last 15 minutes	short	p3/p2	CPU cores × 3 / CPU cores × 5
Run Queue	Length of the run queue, that is, the number of tasks waiting to run (purple is the maximum value of all hosts, green is the average value of all hosts)	short
Hugepage Used	Huge page memory usage	0-1
%vmeff	Ratio of pages reclaimed to pages scanned. A high value means most pages are reclaimed after being scanned; 100% means every scanned page is reclaimed. A low value (below 30%) means the system has difficulty freeing memory. If no pages are scanned, the value is 0. The value should therefore preferably be 0 or 100%	0-1
iNodes	Number of inode handles used by the system	short	p3
Pseudo Terminals	Number of pseudo terminals used by the system	short
Unused Cache Entries	Number of unused cache entries in the directory cache (pink is the minimum host value, yellow is the average of all hosts)	short
Entropy Available	The system gathers "true" random numbers by monitoring different events, such as network activity and hardware random number generators, and feeds them into the kernel entropy pool used by /dev/random. Applications with very high security requirements tend to use /dev/random as their entropy (randomness) source. If /dev/random runs out of available entropy, it cannot supply more randomness, and applications waiting for it stall until more random material is available	short
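
The Load and Commit Memory thresholds in the table above can be expressed as simple checks. The following is a minimal Python sketch with hypothetical function names. It assumes the Load threshold means the CPU core count multiplied by 3 (p3) and by 5 (p2), and that reaching a threshold exactly already counts as exceeding it; neither detail is stated explicitly in the table.

```python
from typing import Optional


def load_alarm(load: float, cpu_cores: int) -> Optional[str]:
    """Classify a load average against per-core thresholds.

    Assumption: p3 at cpu_cores * 3, p2 at cpu_cores * 5.
    """
    if load >= cpu_cores * 5:
        return "p2"
    if load >= cpu_cores * 3:
        return "p3"
    return None


def commit_memory_alarm(ratio: float) -> Optional[str]:
    """Classify Commit Memory, given as a ratio in the panel's 0-1 unit.

    The ratio may exceed 1.0 because the kernel overcommits memory.
    Thresholds from the table: 60% (p3) and 80% (p2).
    """
    if ratio >= 0.8:
        return "p2"
    if ratio >= 0.6:
        return "p3"
    return None
```

On a host with 8 cores, load_alarm(30, 8) returns "p3" under these assumptions, since 30 lies between 24 and 40.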

2 YMatrix Database

The YMatrix Database interface includes two sections: Database Performance and Storage.

2.1 Database Performance

This section shows database performance, including:

Indicator Name	Description	Unit	Level	Reference Alarm Threshold
Page Hit Ratio	Ratio of HEAP table read operations that hit the block cache to the total number of read operations. (Only the cache maintained for HEAP tables is counted; the operating system cache is not included.) The displayed number is the current value, and the curve shows historical values. This value is usually expected to be above 90%	0-1
Temp Size	Total amount of data written to temporary files by queries in the database. All temporary files are counted, regardless of why they were created and regardless of the log_temp_files setting	bytes
Sessions Per Database	Number of sessions per database	short	p2/p1	60% / 80% of the maximum number of connections
Activities	Number of sessions in various states	short
Deadlocks	Number of deadlocks detected	short		An alarm can be raised when the value is greater than 0
Checksum Failures	NULL	short	p3
Rows Read	Number of rows read	short
Checkpoints	Checkpoint statistics. Orange is the number of checkpoints that were explicitly requested, green is the number of checkpoints generated automatically because of a timeout	short
Page Cache Hit	blks_hit: Number of cache hits when reading data pages blks_read: Number of cache misses that required a read from disk
Replication Latency	write_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written it (but not yet flushed or applied it) flush_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written and flushed it to disk (but not yet applied it). If a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to on replay_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written, flushed, and applied it. If a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to remote_apply	milliseconds(ms)	p3	Suggested value: 10s. With synchronous Primary/Mirror replication, high latency can slow down write transactions
Rows Insert/Update/Delete	Rows Insert: Number of rows inserted Rows Update: Number of rows updated Rows Delete: Number of rows deleted	short
Checkpoint buffers	buffers_checkpoint - Number of buffers written during checkpoints buffers_clean - Number of buffers written by the background writer process buffers_backend - Number of buffers written directly by backend processes	short
Top 10 Replication Lag Size	Top 10 replication lags by WAL size	bytes	p3	1GB
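
Several thresholds in the table above can be checked with a few lines of code. The following is a minimal Python sketch with hypothetical function names. It assumes that the hit ratio is computed as blks_hit / (blks_hit + blks_read), which is the usual way to derive it from these two counters but is not confirmed here as the panel's exact formula, and that "1GB" means 2**30 bytes.

```python
from typing import Optional

GIB = 1024 ** 3  # assumption: the table's "1GB" is read as 2**30 bytes


def page_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Share of page reads served from cache; None when there were no reads."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


def sessions_alarm(sessions: int, max_connections: int) -> Optional[str]:
    """p2 at 60% and p1 at 80% of the maximum number of connections."""
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    if sessions * 100 >= max_connections * 80:
        return "p1"
    if sessions * 100 >= max_connections * 60:
        return "p2"
    return None


def replication_alarm(lag_ms: float, lag_bytes: int) -> Optional[str]:
    """p3 when latency exceeds 10 s (10,000 ms) or lagging WAL exceeds 1 GB."""
    if lag_ms > 10_000 or lag_bytes > GIB:
        return "p3"
    return None
```

A hit ratio below 0.9 from page_hit_ratio would fall short of the "above 90%" expectation stated for Page Hit Ratio.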

2.2 Storage

This section displays storage-related statistics, including:

Indicator Name	Description	Unit	Level	Reference Alarm Threshold
Top 10 Database	Top 10 databases by size	bytes
Top 10 Users	Top 10 users by data size	bytes
Top 10 Aging Database	Top 10 databases by age. When the database age exceeds 2 billion (2,000,000,000), the database becomes unavailable	short	p2	1500000000
Top 10 Big Tables	Top 10 tables by size	bytes
Top 10 Big Partitions	Top 10 partitioned tables by size	bytes
Top 10 Growth Today	The 10 tables with the fastest data growth today	bytes
Top 10 Growth Last 7 Days	The 10 tables with the fastest data growth in the last 7 days	bytes
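
The Top 10 Aging Database threshold leaves a margin before the database becomes unavailable. The following minimal Python sketch shows that margin; the names are hypothetical, and it takes the unavailability limit as 2,000,000,000, which is an assumption about the intended figure.

```python
from typing import Optional

AGE_LIMIT = 2_000_000_000  # assumed age at which the database becomes unavailable
AGE_ALARM = 1_500_000_000  # p2 reference alarm threshold from the table


def aging_alarm(age: int) -> Optional[str]:
    """Return 'p2' once the database age reaches the reference threshold."""
    return "p2" if age >= AGE_ALARM else None


def remaining_headroom(age: int) -> int:
    """Age units left before the assumed limit, never negative."""
    return max(AGE_LIMIT - age, 0)
```

With these values, an alarm at 1,500,000,000 still leaves 500,000,000 age units of headroom to act before the limit.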
