Grafana Monitoring Metrics Interpretation

This document describes the YMatrix-related metrics in the Grafana monitoring dashboard and provides reference alert thresholds.

Alert Level Description

p0: Requires immediate action; the cluster is unavailable.
p1: Requires prompt action; if not addressed shortly, it may affect cluster operations.
p2: Needs attention; if left unresolved for a long time, it may impact cluster performance.
p3: Does not affect cluster operations; configure alerts as needed.

Note!
For metrics without reference alert thresholds, determine and configure alert conditions based on actual circumstances.

1 YMatrix Database

1.1 Overview

This section displays the overall operational status of the cluster, including:

Metric	Description	Unit	Level	Reference Alert Threshold
Cluster Status	Node status of the cluster. 0: Normal; 1: No Standby; 2: No Mirror; 10: Data Imbalance (after a node recovery, primary-mirror roles have not been rebalanced); 11: Unsynchronized Nodes (some Mirror nodes are out of sync with their Primary); 12: Master Only (only the Master node is running, typically used during diagnostics); 20: Segment Down (unavailable Segment nodes exist; cluster is unusable)	short	p0	20: Segment Down is a critical event; alert required
uptime	Uptime. Includes both YMatrix runtime since startup and the operating system uptime of the Master host	seconds (s)
Version	YMatrix version
Connection Status	Displays connection statistics in the database system: total connections (Total), blocked queries (Blocked), idle connections (Idle), idle in transaction (Idle in TXN)	short
Long queries	Long-running queries. Number of queries currently executing for more than 1 day	short	p3	Alert if greater than 0, indicating extremely slow queries
Node Status	Status of each node. 0: UP (normal); 10: Switched (role switch occurred; rebalancing required); 11: Resync (synchronizing between primary and mirror); 20: Down (node down)	short		Alert on values 11 and 20
license_expire_date	Remaining time until LICENSE expiration	seconds (s)	p3/p2	Expiration may cause component failures, so handle it promptly. Alert at 30 days remaining (p3) and again at 15 days remaining (p2)
Disk Space in Use	Disk usage on Master or Segment instances	0-1		Configuring this alert directly in node_exporter is recommended
Available	Free disk space on Master or Segment instances	0-1		Configuring this alert directly in node_exporter is recommended
CPU	Host CPU utilization	0-1
Memory	Memory usage information	0-1
Load	Host load	short
Transactions	Statistics on transaction commits and rollbacks	short		A threshold on the number of rollbacks can be configured
DiskIO	Volume of data written to disk	bytes
Network	Volume of network traffic	bytes
Process	Number of processes in various states	short
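The status codes and license thresholds above can be turned into a small lookup, sketched here in Python. This is only an illustration: the function names are hypothetical, and the mapping of 30 days to p3 and 15 days to p2 is an assumption read from the "p3/p2" level and the "30 days to 15 days" threshold in the table.

```python
from typing import Optional

# Status codes as listed in the Cluster Status row of the table above.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Data Imbalance",
    11: "Unsynchronized Nodes",
    12: "Master Only",
    20: "Segment Down",
}

SECONDS_PER_DAY = 86_400


def cluster_status_alert(code: int) -> Optional[str]:
    """Return "p0" for Segment Down (20); no alert level for other codes."""
    return "p0" if code == 20 else None


def license_alert(remaining_seconds: float) -> Optional[str]:
    """Map the remaining license time (in seconds) to an alert level.

    Assumption: p3 once 30 days or fewer remain, p2 once 15 days or fewer remain.
    """
    days = remaining_seconds / SECONDS_PER_DAY
    if days <= 15:
        return "p2"
    if days <= 30:
        return "p3"
    return None
```

For example, license_alert(20 * SECONDS_PER_DAY) falls in the 15 to 30 day window and returns "p3" under the assumed mapping.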

1.2 Disk Performance

Metric	Description	Unit	Reference Alert Threshold
Top 10 Disk %Util	Top 10 disks by utilization	0-1	Recommended to configure in node_exporter
Disk Throughput	Disk throughput	bytes	Recommended to configure in node_exporter
Disk IOPS	Disk read/write operations (blue for read, orange for write, absolute values)	I/O ops/sec	Recommended to configure in node_exporter

1.3 Network Performance

Metric	Description	Unit	Level	Reference Alert Threshold
NetStat	Network status	short		Recommended to configure in node_exporter
Network Throughput	Network throughput (blue for receive, orange for send, absolute values)	bytes		Recommended to configure in node_exporter
Network IO	Network I/O operations (green for receive, yellow for send, absolute values)	io/s		Recommended to configure in node_exporter
Packet Loss/Sec	Number of packets dropped due to insufficient kernel buffer space	short	p3	Recommended to configure in node_exporter
Packet Error/Sec	Number of failed send/receive packets	packet/s		Recommended to configure in node_exporter

1.4 System Performance

Metric	Description	Unit	Level	Reference Alert Threshold
IO TPS	Total number of physical disk transfers per second. One transfer is an I/O request to a physical device. Multiple logical requests may be merged into one device I/O request. Data volume per transfer is not fixed	iops
Context Switches/Sec	Number of kernel context switches per second (maximum and average across hosts)	short
Memory	Used: percentage of memory used; Buff/Cache: percentage of memory used for buffers and cache	0-1
Page Statistics	PageIn/s: total size of pages read from disk per second; PageOut/s: total size of pages written to disk per second. Note: on kernel versions 2.2.x and earlier, these values represent page counts, not total size	KB
IO Throughput	Read: number of blocks read from disk per second; Write: number of blocks written to disk per second. On kernel 2.4 and later, a block equals a sector (512 bytes); on earlier kernels, block size is variable	iops
Process Forked/Sec	Number of processes forked per second	short
Commit Memory	Memory usage under current load. This value may exceed 100% due to kernel memory overcommitment	0-1	p3/p2	Alert at 60% (p3) and 80% (p2); if OOM protection is not set, processes may be killed by the OOM killer
Page Faults	fault/s: number of page faults per second. Page faults do not necessarily trigger I/O, as some can be resolved without disk access; majflt/s: number of major page faults per second, which require loading pages from disk	short
File Handles	Number of file handles used by the system	short
Interrupts/Sec	Number of interrupts per second	short
Memory Statistics	frmpg/s: number of memory pages freed per second (negative values indicate pages allocated); bufpg/s: additional pages used for buffers per second (negative values indicate fewer pages used); campg/s: additional pages added to cache per second (negative values indicate less caching). Note: page size may be 4KB or 8KB depending on architecture	page
Swap Activity	Number of pages swapped in/out per second	page
Load	Load1: 1-minute average system load, that is, the average number of tasks in runnable, running, or uninterruptible sleep states; Load5: 5-minute average load; Load15: 15-minute average load	short	p3/p2	CPU cores × 3 (p3) / CPU cores × 5 (p2)
Run Queue	Length of the run queue (number of tasks waiting to run). Purple shows maximum across all hosts, Green shows average	short
Hugepage Used	Hugepage memory usage	0-1
%vmeff	Ratio of pages reclaimed to pages scanned. A higher value means most scanned pages are reclaimed. 100% means every scanned page is reclaimed. A low value (<30%) indicates difficulty freeing memory. 0 means no pages were scanned. Ideal values are 0 or 100%	0-1
iNodes	Number of inode handles used by the system	short	p3
Pseudo Terminals	Number of pseudo-terminals used by the system	short
Unused Cache Entries	Number of unused entries in directory cache (Pink shows minimum across hosts, Yellow shows average)	short
Entropy Available	The system collects "true" randomness from various events (e.g., network activity, hardware RNGs) and feeds it into the kernel entropy pool used by /dev/random. High-security applications often use /dev/random as their entropy source. If /dev/random runs out of entropy, it blocks until more randomness is available, potentially halting dependent applications	short
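The Load and Commit Memory thresholds and the %vmeff ratio described above can be expressed as simple checks. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes the lower threshold maps to p3, the higher one to p2, and that a threshold is considered reached at "greater than or equal".

```python
from typing import Optional


def load_alert(load: float, cpu_cores: int) -> Optional[str]:
    """Compare a load average against CPU cores x 3 (p3) and x 5 (p2)."""
    if load >= cpu_cores * 5:
        return "p2"
    if load >= cpu_cores * 3:
        return "p3"
    return None


def commit_memory_alert(commit_ratio: float) -> Optional[str]:
    """Compare committed memory (0-1 scale, may exceed 1) against 60% and 80%."""
    if commit_ratio >= 0.8:
        return "p2"
    if commit_ratio >= 0.6:
        return "p3"
    return None


def vmeff(pages_reclaimed: float, pages_scanned: float) -> float:
    """Ratio of pages reclaimed to pages scanned; 0 when nothing was scanned."""
    if pages_scanned == 0:
        return 0.0
    return pages_reclaimed / pages_scanned
```

On a 16-core host, for instance, load_alert(50, 16) returns "p3" because 50 is at least 48 (16 × 3) but below 80 (16 × 5).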

2 YMatrix Database

The YMatrix Database interface includes two sections: Database Performance and Storage.

2.1 Database Performance

This section displays database performance metrics, including:

Metric	Description	Unit	Level	Reference Alert Threshold
Page Hit Ratio	Ratio of HEAP table read operations that hit the block cache to total read operations. (Cache includes only the HEAP table's internal cache, not OS cache.) The displayed value is current; the curve shows historical values. Typically should be above 90%	0-1
Temp Size	Total volume of data written to temporary files by queries. All temporary files are counted regardless of the reason for creation or log_temp_files setting	bytes
Sessions Per Database	Number of sessions per database	short	p2/p1	Alert at 60% and 80% of max connections
Activities	Number of sessions in various states	short
Deadlocks	Number of deadlocks detected	short		Alert if greater than 0
Checksum Failures	Number of database page checksum failures. NULL if not enabled	short	p3
Rows Read	Number of data rows read	short
Checkpoints	Checkpoint statistics. Orange indicates checkpoints triggered by explicit requests; green indicates automatic checkpoints due to timeout	short
Page Cache Hit	blks_hit: number of cache hits during data page reads; blks_read: number of disk reads due to cache misses
Replication Latency	write_lag: time between the local WAL flush and the Standby/Mirror acknowledging receipt (but not yet flushed or applied). With a Standby/Mirror configured, this measures the commit delay when synchronous_commit is set to remote_write; flush_lag: time between the local WAL flush and the Standby/Mirror acknowledging the flush (but not yet applied). Measures the commit delay when synchronous_commit is set to on; replay_lag: time between the local WAL flush and the Standby/Mirror acknowledging replay (fully applied). Measures the commit delay when synchronous_commit is set to remote_apply	milliseconds (ms)	p3	Recommended threshold: 10s. High replication lag may slow down write transactions
Rows Insert/Update/Delete	Row operation statistics. Rows Insert: number of inserted rows; Rows Update: number of updated rows; Rows Delete: number of deleted rows	short
Checkpoint buffers	buffers_checkpoint: number of buffers written during checkpoints; buffers_clean: number of buffers written by the background writer; buffers_backend: number of buffers written directly by backend processes	short
Top 10 Replication Lag Size	Top 10 WAL sizes by replication lag	bytes	p3	1GB
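As a rough illustration of how several metrics in this table relate to their thresholds, the Python sketch below derives a hit ratio from blks_hit and blks_read and checks the session and replication lag thresholds. The function names are hypothetical. The hit ratio formula (hits divided by hits plus disk reads) is an assumption based on the descriptions above, as is mapping 60% of max connections to p2 and 80% to p1.

```python
from typing import Optional


def page_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Share of page reads served from cache; None if there were no reads."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


def sessions_alert(sessions: int, max_connections: int) -> Optional[str]:
    """Alert at 60% (p2) and 80% (p1) of the maximum number of connections."""
    usage = sessions / max_connections
    if usage >= 0.8:
        return "p1"
    if usage >= 0.6:
        return "p2"
    return None


def replication_lag_alert(lag_ms: float) -> Optional[str]:
    """Return "p3" when the lag exceeds the recommended 10 s threshold."""
    return "p3" if lag_ms > 10_000 else None
```

With 900 cache hits and 100 disk reads, page_hit_ratio(900, 100) gives 0.9, which sits at the 90% level the table describes as typical.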

2.2 Storage

This section displays storage-related statistics, including:

Metric	Description	Unit	Level	Reference Alert Threshold
Top 10 Database	Top 10 largest databases by size	bytes
Top 10 Users	Top 10 users by data volume	bytes
Top 10 Aging Database	Top 10 databases by age. Databases with age exceeding 2 billion may become unusable	short	p2	1500000000
Top 10 Big Tables	Top 10 largest tables by size	bytes
Top 10 Big Partitions	Top 10 largest partitions by size	bytes
Top 10 Growth Today	Top 10 tables with highest data growth today	bytes
Top 10 Growth Last 7 Days	Top 10 tables with highest data growth over the last 7 days	bytes
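The database age threshold in the table can be checked as follows. This is a minimal Python sketch with hypothetical names. It takes the 2 billion figure from the description above as the limit and treats reaching the 1500000000 threshold as the p2 condition, which is an assumption about how the comparison is made.

```python
from typing import Optional

AGE_ALERT_THRESHOLD = 1_500_000_000  # reference alert threshold from the table
AGE_LIMIT = 2_000_000_000            # age beyond which a database may become unusable


def database_age_alert(age: int) -> Optional[str]:
    """Return "p2" once the database age reaches the reference threshold."""
    return "p2" if age >= AGE_ALERT_THRESHOLD else None


def age_headroom(age: int) -> int:
    """Remaining age before the 2 billion limit, never below zero."""
    return max(AGE_LIMIT - age, 0)
```

A database at age 1,600,000,000 would raise a p2 alert and have 400,000,000 of headroom left under these assumptions.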
