Prometheus Monitoring Metrics Interpretation

This document describes the metrics and recommended alert thresholds for YMatrix, MatrixGate, and host node monitoring in the Prometheus monitoring dashboard.

Alert Level Description

p0: Requires immediate action; the cluster is unavailable.
p1: Requires prompt action; cluster functionality may be affected if not resolved shortly.
p2: Requires attention; prolonged inaction may affect cluster operations.
p3: Does not impact cluster operations; configure alerts as needed.

Note!
Metrics without reference alert thresholds should be evaluated and configured based on actual conditions.

1 YMatrix Monitoring Metrics

1.1 Overview

This section displays the overall cluster status, including:

Metric	Description	Unit	Level	Recommended Alert Threshold
Cluster Status	Node status of the cluster. 0: Normal; 1: No Standby; 2: No Mirror; 10: Imbalanced Distribution (after node recovery, primary-mirror roles are not rebalanced); 11: Asynchronous Nodes (some Mirror nodes are out of sync with their Primary); 12: Master Only (only the Master node is running, typically used for diagnostics); 20: Segment Down (some Segment nodes are unavailable and the cluster is unusable)	short	p0	20: Segment Down is critical and requires alerting
Uptime	Runtime since YMatrix started and OS uptime of the Master host	seconds (s)
Version	YMatrix version
Connection Status	Database connection statistics: Total connections, Blocked queries, Idle connections, Idle in transaction	short
Slow Queries	Number of queries running longer than 1 day	short		Alert if greater than 0
Transactions	Statistics on transaction commits and rollbacks	short
Disk Space in Use	Disk usage of Master or Segment instances	0-1
Node Status	Status of each node. 0: UP (normal); 10: Switched (a role switch occurred; rebalancing needed); 11: Resync (synchronizing); 20: Down	short	p2/p1	Alert at p2 if non-zero for more than 5 minutes; alert at p1 for value 20
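
The Node Status rule above can be sketched as a small function. This is a minimal illustration, not YMatrix code: the helper name `node_status_alert` and its arguments are hypothetical, and only the status codes and the 5-minute window come from the table.

```python
from typing import Optional

# Status codes from the Node Status row of the table above.
NODE_UP = 0
NODE_DOWN = 20
NONZERO_GRACE_SECONDS = 5 * 60  # "non-zero for more than 5 minutes"


def node_status_alert(status: int, seconds_in_status: float) -> Optional[str]:
    """Return "p1", "p2", or None for one node (hypothetical helper)."""
    if status == NODE_DOWN:
        return "p1"  # a down node is raised immediately
    if status != NODE_UP and seconds_in_status > NONZERO_GRACE_SECONDS:
        return "p2"  # switched or resyncing for longer than the grace window
    return None
```

A node that has been in Resync (11) for one minute returns no alert, while the same node after six minutes returns p2.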

1.2 Database Performance

This section shows database performance metrics:

Metric	Description	Unit	Level	Recommended Alert Threshold
Page Hit Ratio	Ratio of HEAP table read operations hitting block cache to total reads. (Cache includes only HEAP table's own cache, not OS cache.) Displayed value is current; curve shows historical data. Typically should be above 90%	0-1
Temp Size	Total data volume written to temporary files by queries. All temp files are counted regardless of log_temp_files setting	bytes
Sessions Per Database	Number of sessions per database	short
Activities	Number of sessions in various states	short
Deadlocks	Number of deadlocks	short	p3	YMatrix automatically resolves deadlocks; failed queries can be retried. Alerting can be configured.
Checksum Failures	Number of data page checksum failures. NULL if not enabled	short	p3
Rows Read	Number of rows read	short
Checkpoints	Checkpoint statistics. Orange indicates checkpoints triggered manually. Green indicates checkpoints triggered by timeout	short
Page Cache Hit	blks_hit: number of cache hits during data page reads; blks_read: number of disk reads due to cache misses
Replication Latency	write_lag: time between local WAL flush and the Standby/Mirror acknowledging receipt (not yet flushed or applied); measures commit delay when synchronous_commit is set to remote_write. flush_lag: time between local WAL flush and the Standby/Mirror acknowledging flush (not yet applied); measures commit delay when synchronous_commit is set to on. replay_lag: time between local WAL flush and the Standby/Mirror acknowledging application; measures commit delay when synchronous_commit is set to remote_apply	milliseconds (ms)	p3	With the default synchronous replication between Primary and Mirror, values above 1 s slow down transactions. For asynchronous replication, raise the threshold accordingly.
Rows Insert/Update/Delete	Number of INSERT, UPDATE, or DELETE operations	short
Checkpoint buffers	buffers_checkpoint: buffers written during checkpoints; buffers_clean: buffers written by the background writer; buffers_backend: buffers written directly by backend processes	short
Top 10 Replication Lag Size	Top 10 WAL sizes for replication lag	bytes	p3	In default synchronous replication, >1GB causes slow commits. For async replication, increase threshold accordingly.
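
As a rough illustration of how the Page Hit Ratio relates to the blks_hit and blks_read counters, the sketch below computes the ratio and flags values under the 90% guideline. It assumes the panel is derived from these two counters; the function names are hypothetical, and returning None when there have been no reads is a choice made for this example.

```python
from typing import Optional

TYPICAL_MINIMUM_RATIO = 0.90  # "typically should be above 90%"


def page_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Cache hits divided by all page reads, or None if nothing was read."""
    total_reads = blks_hit + blks_read
    if total_reads == 0:
        return None  # no reads yet, so the ratio is undefined
    return blks_hit / total_reads


def is_below_typical(ratio: Optional[float]) -> bool:
    """True when a defined ratio falls under the 90% guideline."""
    return ratio is not None and ratio < TYPICAL_MINIMUM_RATIO
```

For example, 90 cache hits and 10 disk reads give a ratio of 0.9, which sits exactly on the guideline and is not flagged.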

1.3 Storage

This section shows storage-related statistics:

Metric	Description	Unit	Level	Recommended Alert Threshold
Top 10 Database	Top 10 largest databases	bytes
Top 10 Users	Top 10 users by data size	bytes
Top 10 Aging Database	Top 10 databases by age	short	p2	Maximum database age is about 2.1 billion transactions. The instance stops when only 100 million remain, and logs warn when 500 million remain. Recommend alerting when 600 million and 200 million remain.
Top 10 Big Tables	Top 10 largest tables	bytes
Top 10 Big Partitions	Top 10 largest partitioned tables	bytes
Top 10 Growth Today	Top 10 tables with highest data growth today	bytes
Top 10 Growth Last 7 Days	Top 10 tables with highest data growth in last 7 days	bytes
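
The database age guidance can be turned into a headroom check. The sketch below is an assumption-laden example: it takes the maximum age in the Top 10 Aging Database row to be roughly 2.1 billion transactions and the recommended alert points to be 600 million and 200 million transactions remaining. The constant and function names are hypothetical.

```python
# Assumed limits, read from the Top 10 Aging Database row.
MAX_DATABASE_AGE = 2_100_000_000
FIRST_ALERT_REMAINING = 600_000_000
SECOND_ALERT_REMAINING = 200_000_000


def remaining_age(database_age: int) -> int:
    """Transactions left before the assumed maximum age is reached."""
    return MAX_DATABASE_AGE - database_age


def age_alert_stage(database_age: int) -> int:
    """0 = no alert, 1 = first alert point reached, 2 = second alert point reached."""
    remaining = remaining_age(database_age)
    if remaining <= SECOND_ALERT_REMAINING:
        return 2
    if remaining <= FIRST_ALERT_REMAINING:
        return 1
    return 0
```

Alerting on remaining headroom rather than raw age keeps the two alert points ahead of the log warning and the instance stop, respectively.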

2 MatrixGate Monitoring Metrics

2.1 Basic Information

Metric	Description	Unit	Level	Recommended Alert Threshold
Version	mxgate version
Uptime	mxgate runtime	seconds (s)
Process ID	mxgate backend process PID	short	p2	No PID may indicate mxgate is down

2.2 Task Information

Metric	Description	Unit	Level	Recommended Alert Threshold
Target Table	Target table for data insertion
Total Rows Inserted	Total successfully inserted rows since mxgate started	short
Total Failed Rows	Total failed insertions since mxgate started	short	p3	Set threshold based on requirements
Total Inserted Size	Total data size successfully inserted since mxgate started	short
Concurrency	Concurrency total: maximum concurrency, configured as stream-prepared + 1. Active count: actual working concurrency (some threads may sleep, so this may be lower than the total)	short
Transaction Time Granularity	Time span of data transaction commits	short
Target Table Blocked	Number of blocked target tables	short

2.3 Load Statistics

Metric	Description	Unit	Level	Recommended Alert Threshold
Committed Rows	Number of rows committed for this job	short
Inserted Rows	Number of rows inserted for this job	short
Blocked Rows	Number of blocked rows for this job	short	p3	Set threshold based on requirements
Failed Rows	Number of write failures for this job	short	p3	Set threshold based on requirements
Written Data Size	Total bytes written by this job	bytes

2.4 Latency Statistics

Latency across data ingestion stages, shown as statistical values over time:

max: maximum
min: minimum
95%: average of 95% of data

Metric	Description	Unit	Level	Recommended Alert Threshold
Total Latency	Sum of the following latencies	nanoseconds (ns)	p3	30s
insertStart Latency	Time from executing INSERT to sending first data to Segment	nanoseconds (ns)
write Latency	Time mxgate takes to send this batch to Segment	nanoseconds (ns)
insertDone Latency	Time from sending last data to Segment to completion of INSERT (data redistribution and disk persistence across Segments)	nanoseconds (ns)
commit Latency	Time to execute COMMIT command	nanoseconds (ns)
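
Because the latency panels report nanoseconds while the recommended Total Latency threshold is given in seconds, a unit conversion is needed before comparing. The following sketch assumes Total Latency is the plain sum of the four stage latencies listed above; the function names are hypothetical.

```python
NS_PER_SECOND = 1_000_000_000
TOTAL_LATENCY_P3_SECONDS = 30  # recommended p3 threshold from the table


def total_latency_ns(insert_start_ns: int, write_ns: int,
                     insert_done_ns: int, commit_ns: int) -> int:
    """Sum the four ingestion-stage latencies, all in nanoseconds."""
    return insert_start_ns + write_ns + insert_done_ns + commit_ns


def exceeds_total_latency_threshold(total_ns: int) -> bool:
    """Convert nanoseconds to seconds and compare with the 30 s threshold."""
    return total_ns / NS_PER_SECOND > TOTAL_LATENCY_P3_SECONDS
```

A total of 31 s (31 × 10^9 ns) would trigger the p3 alert, while exactly 30 s would not under this strict comparison.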

2.5 Database Events

Metric	Description	Unit
CHECKPOINT Count	Number of CHECKPOINT executions per minute	short
CHECKPOINT Write Latency	Total time spent writing files to disk during checkpoint, in milliseconds	milliseconds (ms)
CHECKPOINT Sync Latency	Total time spent syncing files to disk during checkpoint, in milliseconds	milliseconds (ms)
Allocated Buffer Count	Number of allocated buffers	short
Written Disk Buffer Count	Three categories: 1. Buffers written during checkpoint 2. Buffers written by background writer 3. Buffers written directly by a backend	short
Dirty Page Flush Limit Reached	Number of times background writer stopped due to exceeding buffer write limit	short
Primary-Standby WAL Lag	WAL delay between Master and Standby or Primary and Mirror	bytes
Primary-Standby Latency	Time delay between Master and Standby or Primary and Mirror	milliseconds (ms)
Target Table Blocking Trend	Four categories: 1. Lock-related 2. Replication-related 3. Resource group-related 4. Resource queue-related	short

3 Host Node Monitoring

3.1 Quick CPU / Mem / Disk

Metric	Description	Unit	Level	Recommended Alert Threshold
CPU Busy	Percentage of time all CPU cores are busy	0-1
Sys Load (5m avg)	Average CPU load across all cores over 5 minutes	0-1	p3/p2	CPU cores × 3 / CPU cores × 5
Sys Load (15m avg)	Average CPU load across all cores over 15 minutes	0-1	p3/p2	CPU cores × 3 / CPU cores × 5
RAM Used	Used memory (total - free - buffer/cache)	0-1
SWAP Used	Used swap memory	0-1	p3	80%
Root FS Used	Root filesystem usage	0-1	p3/p2	60%/80%
CPU Cores	Number of physical CPU cores	short
RootFS Total	Total root filesystem space	bytes	p3/p2	60%/80%
Uptime	System uptime	seconds (s)
RAM Total	Total memory size	bytes
SWAP Total	Swap partition size	bytes
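
The two-level thresholds in this table follow one pattern: a lower limit for p3 and a higher limit for p2. The sketch below shows that pattern, assuming the Sys Load thresholds are multiples of the core count (cores × 3 for p3, cores × 5 for p2) and that reaching a limit counts as exceeding it. All function names are hypothetical.

```python
from typing import Optional


def two_tier_alert(value: float, p3_limit: float, p2_limit: float) -> Optional[str]:
    """Return "p2", "p3", or None, checking the more severe limit first."""
    if value >= p2_limit:
        return "p2"
    if value >= p3_limit:
        return "p3"
    return None


def sys_load_alert(load_avg: float, cpu_cores: int) -> Optional[str]:
    # Assumption: limits scale with the number of CPU cores.
    return two_tier_alert(load_avg, cpu_cores * 3, cpu_cores * 5)


def root_fs_alert(used_ratio: float) -> Optional[str]:
    # Root FS Used is reported as a 0-1 ratio; limits are 60% and 80%.
    return two_tier_alert(used_ratio, 0.60, 0.80)
```

On an 8-core host, a load average of 24 reaches the p3 limit and a load average of 40 reaches the p2 limit.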

3.2 Basic CPU / Mem / Disk

Metric	Description	Unit	Level	Recommended Alert Threshold
CPU Basic	Basic CPU information from /proc/stat	0-1
Memory Basic	Basic memory information	bytes
Network Traffic Basic	Basic network info per interface	bit	p3/p2	60% / 80% of NIC max bandwidth
Disk Space Used Basic	Disk usage percentage of all mounted filesystems	0-1	p3/p2	60% / 80% disk usage
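
The network threshold is expressed relative to the NIC's maximum bandwidth, so traffic and link speed must be in the same unit before dividing. The sketch below assumes traffic is measured in bits per second and the link speed in bytes per second (as the Speed row in section 3.13 suggests); the function names are hypothetical.

```python
from typing import Optional

BITS_PER_BYTE = 8


def nic_utilization(traffic_bits_per_sec: float,
                    nic_speed_bytes_per_sec: float) -> float:
    """Fraction (0-1) of the NIC's maximum bandwidth currently in use."""
    return traffic_bits_per_sec / (nic_speed_bytes_per_sec * BITS_PER_BYTE)


def network_alert(utilization: float) -> Optional[str]:
    """p3 at 60% of maximum bandwidth, p2 at 80%."""
    if utilization >= 0.80:
        return "p2"
    if utilization >= 0.60:
        return "p3"
    return None
```

For a 1 Gbit/s interface (125,000,000 bytes per second), 600 Mbit/s of traffic gives a utilization of 0.6 and a p3 alert.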

3.3 CPU / Memory / Net / Disk

Metric	Description	Unit	Level	Recommended Alert Threshold
CPU	CPU time spent in kernel mode	short
Memory Stack	Memory stack from /proc/meminfo	bytes
Network Traffic	Transfer rate per network interface	bytes/sec
Disk Space Used	Disk space used on all mounted filesystems	bytes
Disk IOps	Disk read/write operations	I/O ops/sec (iops)
I/O Usage Read / Write	Disk read/write throughput	bytes
I/O Utilization	I/O utilization	0-1	p3/p2	60% / 80%
CPU spent seconds in guests (VMs)	Time spent running a guest with nice value	milliseconds (ms)

3.4 Memory Meminfo

Metric	Description	Unit	Level	Recommended Alert Threshold
Memory Active / Inactive	Frequently/recently used vs. less used memory
Memory Active / Inactive Detail	Inactive_file - File-backed pages not accessed recently (LRU_INACTIVE_FILE) Inactive_anon - Anonymous pages not accessed recently (LRU_INACTIVE_ANON) Active_file - Recently accessed file-backed pages (LRU_ACTIVE_FILE) Active_anon - Recently accessed anonymous pages (LRU_ACTIVE_ANON)	bytes
Memory Shared and Mapped	Mapped - Memory used by mapped cache pages (Mapped) Shmem - Shared memory (Shmem)	bytes
Memory Vmalloc	VmallocChunk - Largest contiguous vmalloc memory block VmallocTotal - Total vmalloc memory available VmallocUsed - Total vmalloc memory used	bytes
Memory Anonymous	Active_anon - Recently used anonymous virtual memory pages (nr_active_anon) Active_file - Recently used file-backed virtual memory pages (nr_active_file)	bytes
Memory HugePages Counter	HugePages_Free - Free HugePages count HugePages_Rsvd - Reserved HugePages (requested but not yet allocated) HugePages_Surp - Surplus HugePages beyond configured resident count	bytes
Memory DirectMap	DirectMap1G - Memory mapped with 1G pages DirectMap2M - Memory mapped with 2M pages DirectMap4K - Memory mapped with 4K pages	bytes
Memory NFS	NFS Unstable - Pages sent to NFS server but not yet written to disk	bytes
Memory Committed	Currently allocated memory (including allocated but unused); total allocatable memory	bytes	p3/p2	60% / 80%
Memory Writeback and Dirty	Writeback - Pages actively being written back to disk WritebackTmp - Memory used by FUSE for temporary write buffers Dirty - Data pending write to disk	bytes
Memory Slab	Reclaimable - Reclaimable slab memory (nr_slab_reclaimable) Unreclaimable - Non-reclaimable slab memory (nr_slab_unreclaimable)	bytes
Memory Bounce	Bounce - Memory used by bounce buffers	bytes
Memory Kernel / CPU	KernelStack - Kernel stack size (resident, non-reclaimable) PerCPU - Memory allocated per CPU for module loading	bytes
Memory HugePages Size	HugePages - Total number of HugePages Hugepagesize - Size per HugePage	bytes
Memory Unevictable MLocked	Unevictable - Non-reclaimable memory MLocked - Memory locked by mlock()	bytes

3.5 Memory Vmstat

Metric	Description	Unit	Level	Recommended Alert Threshold
Memory Pages In / Out	Pagesin - Rate of data read from disk to physical memory (5-minute average) Pagesout - Rate of data written from physical memory to disk (5-minute average)	short
Memory Page Faults	Pgfault - Average minor and major page faults (5-minute average) Pgmajfault - Average major page faults Pgminfault - Average minor page faults	short
Memory Pages Swap In / Out	Pswpin - Rate of data swapped in from disk (5-minute average) Pswpout - Rate of data swapped out to disk (5-minute average)	short
OOM Killer	Number of OOM Killer invocations	short	p3	Alert on any change

3.6 System Timesync

Metric	Description	Unit
Time Synchronized Drift	Estimated error (seconds); time offset between local system and reference clock; maximum error (seconds)	short
Time Synchronized Status	Whether the clock is synchronized with a reliable server; estimated error (seconds)	short
Time PLL Adjust	Phase-locked loop time adjustment	short
Time Misc	Seconds between clock ticks; TAI (International Atomic Time) offset	short

3.7 System Processes

Metric	Description	Unit	Level	Recommended Alert Threshold
Processes Status	Processes blocked: number of currently blocked tasks (procs_blocked); Processes in runnable state: number of tasks in the run queue (procs_running)	short	p3	blocked: 10
Processes Forks	Processes forks per second: number of processes created per second	short
PIDS Number and Limit	Current number of running processes; maximum process limit on host	short	p3/p2	15000 / 20000
Processes Memory	Virtual memory used by processes; maximum virtual memory available to processes	bytes
Process schedule stats Running / Waiting	Time to start a process; CPU wait time	ms
Threads Number and Limit	Total number of threads; maximum thread limit on host	short

3.8 System Misc

Metric	Description	Unit
Context Switches / Interrupts	Context switches - Average number of CPU context switches (5-minute average) Interrupts - Average total interrupts handled (5-minute average)	short
Interrupts Detail	List of soft interrupts and their average counts (5-minute average)	short
Entropy	Available for random number generation	short
File Descriptors	Maximum open file descriptors; current open file descriptors	short
Schedule timeslices executed by each cpu	Time slices scheduled per CPU	short
CPU time spent in user and system contexts	CPU time in user and system contexts	short

3.9 Hardware Misc

Metric	Description	Unit
Hardware temperature monitor	Hardware temperature monitoring	Celsius (℃)
Power supply	Whether powered	short
Throttle cooling device	Cooling device status	short

3.10 Systemd

Metric	Description	Unit	Level	Recommended Alert Threshold
Systemd Sockets	Total accepted connections on sockets	short
Systemd Units State	inactive - Inactive Systemd units failed - Failed Systemd units deactivating - Deactivating units active - Active units activating - Activating units	short

3.11 Storage Disk

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Disk IOps Completed	Read completions per second on each disk partition; write completions per second on each disk partition	I/O ops/sec (iops)
Disk Average Wait Time	Average wait time for reads on each disk; average wait time for writes on each disk	Milliseconds (ms)	p3	1s
Disk R/W Merged	Merged read operations completed per second on each disk partition; merged write operations completed per second on each disk partition	I/O ops/sec (iops)
Instantaneous Queue Size	Instantaneous queue size; number of requests pending at sample time. Increases when requests are queued to the request_queue, decreases as requests complete	short
Disk R/W Data	Bytes read per second from each disk partition; bytes written per second to each disk partition	bytes/sec
Average Queue Size	Average queue length of requests issued to the device	short
Time Spent Doing I/Os	Percentage of elapsed time during which I/O requests were issued to the device (device bandwidth utilization). For devices that process requests serially, nearing 100% indicates saturation. For parallel devices such as RAID arrays and modern SSDs, this value does not necessarily reflect performance limits.	0-1
Disk IOps Discards completed / merged	Completed discard operations per second; merged discard operations per second	I/O ops/sec (iops)

3.12 Storage Filesystem

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Filesystem space available	Available, free, and used space on mounted filesystems	bytes	p3/p2	60%/80%
File Descriptor	Maximum open file descriptors - Maximum number of open file descriptors Open file descriptors - Number of currently open file descriptors	short
Filesystem in ReadOnly / Error	Filesystems mounted in read-only mode; Device error count - Number of device errors	short	p3
File Nodes Free	Free file nodes: Number of inodes remaining on mounted filesystems	short	p3	60%
File Nodes Size	File nodes total: Total number of inodes on mounted filesystems	short

3.13 Network Traffic

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Network traffic by Packets	Receive - Total packets received per second across all interfaces Transmit - Total packets transmitted per second across all interfaces	packets/sec
Network Traffic Drop	Receive drop - Total dropped received packets per second per interface Transmit drop - Total dropped transmitted packets per second per interface	packets/sec	p3	100
Network Traffic Multicast	Receive multicast - Multicast packets received per second on each interface	packets/sec
Network Traffic Frame	Receive frame - Frames received per second on each interface	packets/sec
Network Traffic Colls	Transmit colls - Number of collisions detected on each interface	short
ARP Entries	ARP entries - Count of entries in the ARP table per interface	short
Speed	Speed - Maximum bandwidth of the network interface	bytes
Softnet Packets	Processed - Number of packets processed per CPU Dropped - Number of packets dropped per CPU
Network Operational Status	Physical link state - Physical connectivity status of each NIC	short
Network Traffic Errors	Receive errors - Total erroneous packets received per second on each interface Transmit errors - Total erroneous packets transmitted per second on each interface	packets/sec	p3	100
Network Traffic Compressed	Receive compressed - Compressed packets received per second on each interface Transmit compressed - Compressed packets transmitted per second on each interface	packets/sec
Network traffic Fifo	Receive fifo - FIFO packets received per second on each interface Transmit fifo - FIFO packets transmitted per second on each interface	packets/sec
Network Traffic Carrier	Statistic transmit_carrier - Number of carrier losses detected by each interface	short
NF Contrack	NF conntrack entries - Number of tracked connections NF conntrack limit - Maximum allowed tracked connections	short
MTU	Maximum size of packets that can be received on each interface	bytes
Queue Length	Length of the transmission queue for each interface	short
Softnet Out of Quota	Backlog status per CPU	0-1

3.14 Network Sockstat

Metric Name	Description	Unit
Sockstat TCP	TCP_alloc - Number of allocated TCP sockets (established, sk_buff assigned) TCP_inuse - Number of TCP sockets currently in use (listening) TCP_mem - TCP socket buffer usage TCP_orphan - Number of orphaned (not associated with any process) TCP connections (useless, pending destruction) TCP_tw - Number of TCP connections waiting to close	short
Sockstats FRG / RAW	FRAG_inuse - Number of Frag sockets in use FRAG_memory - Frag buffer usage RAW_inuse - Number of Raw sockets in use	short
Sockstat Used	Sockets_used - Total number of sockets used across all protocols	short
Sockstat UDP	UDPLITE_inuse - Number of UDP-Lite sockets in use	short
Sockstat Memory Size	TCP_mem_bytes - TCP socket buffer size in bytes UDP_mem_bytes - UDP socket buffer size in bytes	bytes

3.15 Network Netstat

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Netstat IP In / Out Octets	InOctets - Number of octets received OutOctets - Number of octets transmitted	short
ICMP In / Out	InMsgs - Number of ICMP messages received (includes icmpInErrors) OutMsgs - Number of ICMP messages attempted to send (includes icmpOutErrors)	short
UDP In / Out	InDatagrams - Average UDP datagrams received (over 5 minutes) OutDatagrams - Average UDP datagrams sent (over 5 minutes)	short
TCP In / Out	InSegs - Segments received, including erroneous ones. Includes segments received on currently established connections OutSegs - Segments sent, including those on current connections, excluding segments containing only retransmitted octets	short
TCP Connections	CurrEstab - Number of TCP connections in ESTABLISHED or CLOSE-WAIT state	short
TCP Direct Transition	ActiveOpens - Number of TCP connections transitioning directly from CLOSED to SYN-SENT PassiveOpens - Number of TCP connections transitioning directly from LISTEN to SYN-RCVD	short
Netstat IP Forwarding	Forwarding - Number of IP packets forwarded	short
ICMP Errors	InErrors - ICMP messages received with ICMP-specific errors (e.g., invalid checksum, incorrect length)	short
UDP Errors	InCsumErrors - Average number of UDP packets with checksum errors (over 5 minutes) InErrors - Average number of incoming UDP packets undeliverable for reasons other than missing listener (over 5 minutes) RcvbufErrors - Average number of UDP packets dropped due to receive buffer overflow (over 5 minutes) SndbufErrors - Average number of UDP packets dropped due to send buffer overflow (over 5 minutes) NoPorts - Average number of UDP packets received on unknown ports (over 5 minutes)	short	p3	100
TCP Errors	ListenOverflows - Number of times the listen queue of a socket overflowed ListenDrops - Number of SYNs ignored on LISTEN sockets TCPSynRetrans - Retransmissions of SYN or SYN/ACK to break connection establishment, including fast and timeout retransmissions RetransSegs - Number of retransmitted segments (segments containing one or more previously transmitted octets) InErrs - Segments received with errors (e.g., incorrect TCP checksum) OutRsts - Segments sent with the RST flag	short	p3	100
TCP SyncCookie	SyncookiesFailed - Number of invalid SYN cookies received SyncookiesRecv - Number of SYN cookies received SyncookiesSent - Number of SYN cookies sent	short

3.16 Node Exporter

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Node Exporter Scrape Time	Duration of each collector scrape	seconds
Node Exporter Scrape	Number of collectors functioning normally	short

4 YMatrix Host ext

Metric Name	Description	Unit
Host 5-minute Load	5-minute load average across selected hosts	short
Host Memory Usage Ratio	Memory usage percentage across selected hosts	0-1
CPU Busy Percentage	CPU utilization percentage	0-1
Disk I/O Utilization	Disk I/O usage rate	0-1
Free Space Utilization	Free disk space utilization on selected hosts	0-1
Network Traffic Sent	Network traffic transmitted by selected hosts	bit
Network Traffic Received	Network traffic received by selected hosts	bit
SWAP Usage	SWAP usage on selected hosts	0-1

Metric Name	Description	Unit
net dev	Network device status	short
softnet_stat	Memory usage percentage across selected hosts	short
hardirq_cpu	Number of CPU hardware interrupts	short
hardirq_cpu_pie	Pie chart of CPU hardware interrupts	short
hardirq_quene	Number of hardware interrupts per device	short
hardirq_quene_pie	Pie chart of hardware interrupts per device	short
softirq_rx	Number of software interrupts for data reception	short
softirq_rx_pie	Pie chart of software interrupts for data reception	short
softirq_tx	Number of software interrupts for data transmission	short
softirq_tx_pie	Pie chart of software interrupts for data transmission	short
ip	Packet receive/transmit statistics at IP layer	short
udp	Packet receive/transmit statistics for UDP protocol	short

5 YMatrix Database ext

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
license expiration time	Remaining time until the license expires	seconds (s)	p3/p2	Alert at p3 if fewer than 15 days remain; alert at p2 if fewer than 7 days remain, and contact YMatrix promptly to renew the license
Missing partition strategy for range tables	Range partitioned tables missing APM partition strategy configuration	short	p2	Must be addressed promptly; otherwise data will be written to default partition, affecting performance
Range partition table creation count	Number of delayed new partitions in range partitioned tables	short	p2	Must be addressed promptly; otherwise data will be written to default partition, affecting performance
mars table max runs	Internal metric for MARS2	short	p3/p2	Alert at p3 if the value exceeds 1500 and monitor the trend; alert at p2 if it exceeds 1800. Write performance degrades significantly, or writes become impossible, when the value reaches 2039
Max block_items value	Instantaneous batch write count by mxgate	short
YMatrix Total Process Count	Total number of postgres-related processes on selected hosts	short	p2	Prevent excessive process count which may lead to memory exhaustion; configure as needed
Duplicate Index Count	Number of duplicate indexes; consider removing unnecessary ones	short	p3
matrixgate Connection Count	Total number of connections to mxgate processes	short
24-Hour Data Volume Change	Total data change over the last 24 hours	bytes
Top10 Subpartition Count	Top 10 tables with highest number of subpartitions. Configure as needed to avoid excessive subtables, which may impact query performance and consume more memory	bytes
Top10 Schema Size	Top 10 schemas by total size	bytes
Top10 System Table Size	Top 10 system tables by total size	bytes
Top10 Default Partition Table Size	Top 10 default partition tables by size	bytes	p3	Alert if default partition is too large; normally, default partitions should not contain data
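
Two of the rows above carry explicit numeric thresholds that are easy to get wrong because of units and strict comparisons. The sketch below assumes the license metric is exposed as remaining seconds, as the Unit column states, and that "exceeds" means strictly greater than; the function names are hypothetical.

```python
from typing import Optional

SECONDS_PER_DAY = 86_400


def license_alert(seconds_remaining: float) -> Optional[str]:
    """p3 with fewer than 15 days left, p2 with fewer than 7 days left."""
    days_remaining = seconds_remaining / SECONDS_PER_DAY
    if days_remaining < 7:
        return "p2"
    if days_remaining < 15:
        return "p3"
    return None


def mars_runs_alert(max_runs: int) -> Optional[str]:
    """p3 above 1500 runs, p2 above 1800 runs."""
    if max_runs > 1800:
        return "p2"
    if max_runs > 1500:
        return "p3"
    return None
```

With 10 days of license remaining the result is p3; a MARS2 table at exactly 1500 runs does not alert yet.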

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
mars2 table max runs details	Trend chart of runs for MARS2 tables	short
Database Connection Details	Grouped by database, client address, and application_name	short
24-Hour Database Space Change	Database size change over 24 hours for each database	short
Total Query Duration Query	Total query execution time across database stages	milliseconds (ms)	p3	Configure as needed; investigate if total time changes significantly
Host YMatrix Process Trend	Trend of total postgres process count per host	short

Metric Name	Description	Unit	Severity Level	Recommended Alert Threshold
Table Bloat Details	Lists tables where dead tuples / live tuples > 1.1	short
Top 100 Process RSS Details	Top 100 postgres processes sorted by RSS (memory usage)	short
Slow Query Monitoring	Statistics on slow SQL queries executed in the database	none	p3
Total Duration Query Monitoring	Statistics on total SQL execution time	milliseconds (ms)
Duration Statistics Chart (seconds)	Total SQL execution time aggregated every 5 minutes	milliseconds (ms)
Long Transaction Metrics	Details of long-running transactions on Master/Segment	none	p3
Lock Wait Information	Details of database lock waits at data collection time	none	p3	Configure as needed; consider alerting for locks lasting over 10 minutes