Prometheus Monitoring Metrics Interpretation

This document describes the metrics and recommended alert thresholds for YMatrix, MatrixGate, and host node monitoring in the Prometheus monitoring dashboard.

Alert Level Description

  • p0: Requires immediate action; the cluster is unavailable.
  • p1: Requires prompt action; cluster functionality may be affected if not resolved shortly.
  • p2: Requires attention; prolonged inaction may affect cluster operations.
  • p3: Does not impact cluster operations; configure alerts as needed.

Note!
Metrics without reference alert thresholds should be evaluated and configured based on actual conditions.

1 YMatrix Monitoring Metrics

1.1 Overview

This section displays the overall cluster status, including:

Metric Description Unit Level Recommended Alert Threshold
Cluster Status Node status of the cluster, including:
0: Normal
1: No Standby
2: No Mirror
10: Imbalanced Distribution (after node recovery, primary-mirror roles are not rebalanced)
11: Asynchronous Nodes (some Mirror nodes are out of sync with Primary)
12: Master Only (only Master node is running, typically used for diagnostics)
20: Segment Down (unavailable Segment nodes, cluster is unusable)
short p0 20: Segment Down is critical and requires alerting
Uptime Runtime since YMatrix started and OS uptime of the Master host seconds (s)
Version YMatrix version
Connection Status Database connection statistics: Total connections, Blocked queries, Idle connections, Idle in transaction short
Slow Queries Number of queries running longer than 1 day short Alert if greater than 0
Transactions Statistics on transaction commits and rollbacks short
Disk Space in Use Disk usage of Master or Segment instances 0-1
Node Status Status of each node:
0: UP (normal)
10: Switched (role switch occurred; rebalancing needed)
11: Resync (synchronizing)
20: Down (down)
short p2/p1 Alert at p2 if non-zero for more than 5 minutes
Alert at p1 for value 20

1.2 Database Performance

This section shows database performance metrics:

Metric Description Unit Level Recommended Alert Threshold
Page Hit Ratio Ratio of HEAP table read operations hitting block cache to total reads. (Cache includes only HEAP table's own cache, not OS cache.)
Displayed value is current; curve shows historical data.
Typically should be above 90%
0-1
Temp Size Total data volume written to temporary files by queries. All temp files are counted regardless of log_temp_files setting bytes
Sessions Per Database Number of sessions per database short
Activities Number of sessions in various states short
Deadlocks Number of deadlocks short p3 YMatrix automatically resolves deadlocks; failed queries can be retried. Alerting can be configured.
Checksum Failures Number of data page checksum failures. NULL if not enabled short p3
Rows Read Number of rows read short
Checkpoints Checkpoint statistics.Orange indicates checkpoints triggered manually.Green indicates checkpoints triggered by timeout short
Page Cache Hit blks_hit: number of cache hits during data page reads
blks_read: number of disk reads due to cache misses
Replication Latency write_lag - Time between local WAL flush and Standby/Mirror acknowledging receipt (but not yet flushed or applied). Measures commit delay when synchronous_commit is set to remote_write
flush_lag - Time between local WAL flush and Standby/Mirror acknowledging flush (but not yet applied). Measures commit delay when synchronous_commit is set to on
replay_lag - Time between local WAL flush and Standby/Mirror acknowledging application. Measures commit delay when synchronous_commit is set to remote_apply
milliseconds (ms) p3 In default synchronous replication between Primary and Mirror, values >1s slow down transactions. For async replication, increase threshold accordingly.
Rows Insert/Update/Delete Number of INSERT, UPDATE, or DELETE operations short
Checkpoint buffers buffers_checkpoint: buffers written during checkpoint
buffers_clean: buffers written by background writer
buffers_backend: buffers written directly by backend processes
short
Top 10 Replication Lag Size Top 10 WAL sizes for replication lag bytes p3 In default synchronous replication, >1GB causes slow commits. For async replication, increase threshold accordingly.

1.3 Storage

This section shows storage-related statistics:

Metric Description Unit Level Recommended Alert Threshold
Top 10 Database Top 10 largest databases bytes
Top 10 Users Top 10 users by data size bytes
Top 10 Aging Database Top 10 databases by age short p2 Maximum database age is 21E. Instance stops when only 1E remains. Logs warn at 5E. Recommend alerting at 6E and 2E.
Top 10 Big Tables Top 10 largest tables bytes
Top 10 Big Partitions Top 10 largest partitioned tables bytes
Top 10 Growth Today Top 10 tables with highest data growth today bytes
Top 10 Growth Last 7 Days Top 10 tables with highest data growth in last 7 days bytes

2 MatrixGate Monitoring Metrics

2.1 Basic Information

Metric Description Unit Level Recommended Alert Threshold
Version mxgate version
Uptime mxgate runtime seconds (s)
Process ID mxgate backend process PID short p2 No PID may indicate mxgate is down

2.2 Task Information

Metric Description Unit Level Recommended Alert Threshold
Target Table Target table for data insertion
Total Rows Inserted Total successfully inserted rows since mxgate started short
Total Failed Rows Total failed insertions since mxgate started short p3 Set threshold based on requirements
Total Inserted Size Total data size successfully inserted since mxgate started short
Concurrency Concurrency total: configured as stream - prepared + 1 (maximum concurrency)
Active count: actual working concurrency (some threads may sleep, so actual concurrency may be lower)
short
Transaction Time Granularity Time span of data transaction commits short
Target Table Blocked Number of blocked target tables short

2.3 Load Statistics

Metric Description Unit Level Recommended Alert Threshold
Committed Rows Number of rows committed for this job short
Inserted Rows Number of rows inserted for this job short
Blocked Rows Number of blocked rows for this job short p3 Set threshold based on requirements
Failed Rows Number of write failures for this job short p3 Set threshold based on requirements
Written Data Size Total bytes written by this job bytes

2.4 Latency Statistics

Latency across data ingestion stages, shown as statistical values over time:

  • max: maximum
  • min: minimum
  • 95%: average of 95% of data
Metric Description Unit Level Recommended Alert Threshold
Total Latency Sum of the following latencies nanoseconds (ns) p3 30s
insertStart Latency Time from executing INSERT to sending first data to Segment nanoseconds (ns)
write Latency Time mxgate takes to send this batch to Segment nanoseconds (ns)
insertDone Latency Time from sending last data to Segment to completion of INSERT (data redistribution and disk persistence across Segments) nanoseconds (ns)
commit Latency Time to execute COMMIT command nanoseconds (ns)

2.5 Database Events

Metric Description Unit Level Recommended Alert Threshold
CHECKPOINT Count Number of CHECKPOINT executions per minute short
CHECKPOINT Write Latency Total time spent writing files to disk during checkpoint, in milliseconds milliseconds (ms)
CHECKPOINT Sync Latency Total time spent syncing files to disk during checkpoint, in milliseconds milliseconds (ms)
Allocated Buffer Count Number of allocated buffers short
Written Disk Buffer Count Three categories:
1. Buffers written during checkpoint
2. Buffers written by background writer
3. Buffers written directly by a backend
short
Dirty Page Flush Limit Reached Number of times background writer stopped due to exceeding buffer write limit short
Primary-Standby WAL Lag WAL delay between Master and Standby or Primary and Mirror bytes
Primary-Standby Latency Time delay between Master and Standby or Primary and Mirror milliseconds (ms)
Target Table Blocking Trend Four categories:
1. Lock-related
2. Replication-related
3. Resource group-related
4. Resource queue-related
short

3 Host Node Monitoring

3.1 Quick CPU / Mem / Disk

Metric Description Unit Level Recommended Alert Threshold
CPU Busy Percentage of time all CPU cores are busy 0-1
Sys Load (5m avg) Average CPU load across all cores over 5 minutes 0-1 p3/p2 CPU cores 3 / CPU cores 5
Sys Load (15m avg) Average CPU load across all cores over 15 minutes 0-1 p3/p2 CPU cores 3 / CPU cores 5
RAM Used Used memory (total - free - buffer/cache) 0-1
SWAP Used Used swap memory 0-1 p3 80%
Root FS Used Root filesystem usage 0-1 p3/p2 60%/80%
CPU Cores Number of physical CPU cores short
RootFS Total Total root filesystem space bytes p3/p2 60%/80%
Uptime System uptime seconds (s)
RAM Total Total memory size bytes
SWAP Total Swap partition size bytes

3.2 Basic CPU / Mem / Disk

Metric Description Unit Level Recommended Alert Threshold
CPU Basic Basic CPU information from /proc/stat 0-1
Memory Basic Basic memory information bytes
Network Traffic Basic Basic network info per interface bit p3/p2 60% / 80% of NIC max bandwidth
Disk Space Used Basic Disk usage percentage of all mounted filesystems 0-1 p3 60% / 80% disk usage

3.3 CPU / Memory / Net / Disk

Metric Description Unit Level Recommended Alert Threshold
CPU CPU time spent in kernel mode short
Memory Stack Memory stack from /proc/meminfo bytes
Network Traffic Transfer rate per network interface bytes/sec
Disk Space Used Disk space used on all mounted filesystems bytes
Disk IOps Disk read/write operations I/O ops/sec (iops)
I/O Usage Read / Write Disk read/write throughput bytes
I/O Utilization I/O utilization 0-1 p3/p2 60% / 80%
CPU spent seconds in guests (VMs) Time spent running a guest with nice value milliseconds (ms)

3.4 Memory Meminfo

Metric Description Unit Level Recommended Alert Threshold
Memory Active / Inactive Frequently/recently used vs. less used memory
Memory Active / Inactive Detail Inactive_file - File-backed pages not accessed recently (LRU_INACTIVE_FILE)
Inactive_anon - Anonymous pages not accessed recently (LRU_INACTIVE_ANON)
Active_file - Recently accessed file-backed pages (LRU_ACTIVE_FILE)
Active_anon - Recently accessed anonymous pages (LRU_ACTIVE_ANON)
bytes
Memory Shared and Mapped Mapped - Memory used by mapped cache pages (Mapped)
Shmem - Shared memory (Shmem)
bytes
Memory Vmalloc VmallocChunk - Largest contiguous vmalloc memory block
VmallocTotal - Total vmalloc memory available
VmallocUsed - Total vmalloc memory used
bytes
Memory Anonymous Active_anon - Recently used anonymous virtual memory pages (nr_active_anon)
Active_file - Recently used file-backed virtual memory pages (nr_active_file)
bytes
Memory HugePages Counter HugePages_Free - Free HugePages count
HugePages_Rsvd - Reserved HugePages (requested but not yet allocated)
HugePages_Surp - Surplus HugePages beyond configured resident count
bytes
Memory DirectMap DirectMap1G - Memory mapped with 1G pages
DirectMap2M - Memory mapped with 2M pages
DirectMap4K - Memory mapped with 4K pages
bytes
Memory NFS NFS Unstable - Pages sent to NFS server but not yet written to disk bytes
Memory Committed Currently allocated memory (including allocated but unused)
Total allocatable memory
bytes p3/p2 60% / 80%
Memory Writeback and Dirty Writeback - Pages actively being written back to disk
WritebackTmp - Memory used by FUSE for temporary write buffers
Dirty - Data pending write to disk
bytes
Memory Slab Reclaimable - Reclaimable slab memory (nr_slab_reclaimable)
Unreclaimable - Non-reclaimable slab memory (nr_slab_unreclaimable)
bytes
Memory Bounce Bounce - Memory used by bounce buffers bytes
Memory Kernel / CPU KernelStack - Kernel stack size (resident, non-reclaimable)
PerCPU - Memory allocated per CPU for module loading
bytes
Memory HugePages Size HugePages - Total number of HugePages
Hugepagesize - Size per HugePage
bytes
Memory Unevictable MLocked Unevictable - Non-reclaimable memory
MLocked - Memory locked by mlock()
bytes

3.5 Memory Vmstat

Metric Description Unit Level Recommended Alert Threshold
Memory Pages In / Out Pagesin - Rate of data read from disk to physical memory (5-minute average)
Pagesout - Rate of data written from physical memory to disk (5-minute average)
short
Memory Page Faults Pgfault - Average minor and major page faults (5-minute average)
Pgmajfault - Average major page faults
Pgminfault - Average minor page faults
short
Memory Pages Swap In / Out Pswpin - Rate of data swapped in from disk (5-minute average)
Pswpout - Rate of data swapped out to disk (5-minute average)
short
OOM Killer Number of OOM Killer invocations short p3 Alert on any change

3.6 System Timesync

Metric Description Unit Level Recommended Alert Threshold
Time Synchronized Drift Estimated error (seconds)
Time offset between local system and reference clock
Maximum error (seconds)
short
Time Synchronized Status Whether clock is synchronized with a reliable server
Estimated error (seconds)
short
Time PLL Adjust Phase-locked loop time adjustment short
Time Misc Seconds between clock ticks
TAI (International Atomic Time) offset
short

3.7 System Processes

Metric Description Unit Level Recommended Alert Threshold
Processes Status Processes blocked - Number of currently blocked tasks (procs_blocked)
Processes in runnable state - Number of tasks in run queue (procs_running)
short p3 blocked: 10
Processes Forks Processes forks per second - Number of processes created per second short
PIDS Number and Limit Current number of running processes
Maximum process limit on host
short p3/p2 15000 / 20000
Processes Memory Virtual memory used by processes
Maximum virtual memory available to processes
bytes
Process schedule stats Running / Waiting Time to start a process
CPU wait time
ms
Threads Number and Limit Total number of threads
Maximum thread limit on host
short

3.8 System Misc

Metric Description Unit Level Recommended Alert Threshold
Context Switches / Interrupts Context switches - Average number of CPU context switches (5-minute average)
Interrupts - Average total interrupts handled (5-minute average)
short
Interrupts Detail List of soft interrupts and their average counts (5-minute average) short
Entropy Available for random number generation short
File Descriptors Maximum open file descriptors
Current open file descriptors
short
Schedule timeslices executed by each cpu Time slices scheduled per CPU short
CPU time spent in user and system contexts CPU time in user and system contexts short

3.9 Hardware Misc

Metric Description Unit Level Recommended Alert Threshold
Hardware temperature monitor Hardware temperature monitoring Celsius (℃)
Power supply Whether powered short
Throttle cooling device Cooling device status short

3.10 Systemd

Metric Description Unit Level Recommended Alert Threshold
Systemd Sockets Total accepted connections on sockets short
Systemd Units State inactive - Inactive Systemd units
failed - Failed Systemd units
deactivating - Deactivating units
active - Active units
activating - Activating units
short

--- SPLIT ---

3.11 Storage Disk

Metric Name Description Unit Severity Level Recommended Alert Threshold
Disk IOps Completed Number of read completions per second on each disk partition
Number of write completions per second on each disk partition
I/O ops/sec (iops)
Disk Average Wait Time Average wait time for reads on each disk
Average wait time for writes on each disk
Milliseconds (ms) p3 1s
Disk R/W Merged Number of merged read operations completed per second on each disk partition
Number of merged write operations completed per second on each disk partition
I/O ops/sec (iops)
Instantaneous Queue Size Instantaneous queue size; number of requests pending at sample time. Increases when requests are queued to the request_queue, decreases as requests complete short
Disk R/W Data Number of bytes read per second from each disk partition
Number of bytes written per second to each disk partition
bytes/sec
Average Queue Size Average queue length of requests issued to the device short
Time Spent Doing I/Os Percentage of elapsed time during which I/O requests were issued to the device (device bandwidth utilization). For devices that process requests serially, nearing 100% indicates saturation. For parallel devices such as RAID arrays and modern SSDs, this value does not necessarily reflect performance limits. 0-1
Disk IOps Discards completed / merged Disk Discards completed IOPS
Disk Discards merged IOPS
I/O ops/sec (iops)

3.12 Storage Filesystem

Metric Name Description Unit Severity Level Recommended Alert Threshold
Filesystem space available Available space on mounted filesystems
Free space on mounted filesystems
Used space on mounted filesystems
bytes p3/p2 60%/80%
File Descriptor Maximum open file descriptors - Maximum number of open file descriptors
Open file descriptors - Number of currently open file descriptors
short
Filesystem in ReadOnly / Error Filesystems mounted in read-only mode
Device error count - Number of device errors
short p3
File Nodes Free Free file nodes: Number of inodes remaining on mounted filesystems short p3 60%
FIle Nodes Size File nodes total: Total number of inodes on mounted filesystems short

3.13 Network Traffic

Metric Name Description Unit Severity Level Recommended Alert Threshold
Network traffic by Packets Receive - Total packets received per second across all interfaces
Transmit - Total packets transmitted per second across all interfaces
packets/sec
Network Traffic Drop Receive drop - Total dropped received packets per second per interface
Transmit drop - Total dropped transmitted packets per second per interface
packets/sec p3 100
Network Traffic Multicast Receive multicast - Multicast packets received per second on each interface packets/sec
Network Traffic Frame Receive frame - Frames received per second on each interface packets/sec
Network Traffic Colls Transmit colls - Number of collisions detected on each interface short
ARP Entries ARP entries - Count of entries in the ARP table per interface short
Speed Speed - Maximum bandwidth of the network interface bytes
Softnet Packets Processed - Number of packets processed per CPU
Dropped - Number of packets dropped per CPU
Network Operational Status Physical link state - Physical connectivity status of each NIC short
Network Traffic Errors Receive errors - Total erroneous packets received per second on each interface
Transmit errors - Total erroneous packets transmitted per second on each interface
packets/sec p3 100
Network Traffic Compressed Receive compressed - Compressed packets received per second on each interface
Transmit compressed - Compressed packets transmitted per second on each interface
packets/sec
Network traffic Fifo Receive fifo - FIFO packets received per second on each interface
Transmit fifo - FIFO packets transmitted per second on each interface
packets/sec
Network Traffic Carrier Statistic transmit_carrier - Number of carrier losses detected by each interface short
NF Contrack NF conntrack entries - Number of tracked connections
NF conntrack limit - Maximum allowed tracked connections
short
MTU Maximum size of packets that can be received on each interface bytes
Queue Length Length of the transmission queue for each interface short
Softnet Out of Quota Backlog status per CPU 0-1

3.14 Network Sockstat

Metric Name Description Unit Severity Level Recommended Alert Threshold
Sockstat TCP TCP_alloc - Number of allocated TCP sockets (established, sk_buff assigned)
TCP_inuse - Number of TCP sockets currently in use (listening)
TCP_mem - TCP socket buffer usage
TCP_orphan - Number of orphaned (not associated with any process) TCP connections (useless, pending destruction)
TCP_tw - Number of TCP connections waiting to close
short
Sockstats FRG / RAW FRAG_inuse - Number of Frag sockets in use
FRAG_memory - Frag buffer usage
RAW_inuse - Number of Raw sockets in use
short
Sockstat Used Sockets_used - Total number of sockets used across all protocols short
Sockstat UDP UDPLITE_inuse - Number of UDP-Lite sockets in use short
Sockstat Memory Size TCP_mem_bytes - TCP socket buffer size in bytes
UDP_mem_bytes - UDP socket buffer size in bytes
bytes

3.15 Network Netstat

Metric Name Description Unit Severity Level Recommended Alert Threshold
Netstat IP In / Out Octets InOctets - Number of octets received
OutOctets - Number of octets transmitted
short
ICMP In / Out InMsgs - Number of ICMP messages received (includes icmpInErrors)
OutMsgs - Number of ICMP messages attempted to send (includes icmpOutErrors)
short
UDP In / Out InDatagrams - Average UDP datagrams received (over 5 minutes)
OutDatagrams - Average UDP datagrams sent (over 5 minutes)
short
TCP In / Out InSegs - Segments received, including erroneous ones. Includes segments received on currently established connections
OutSegs - Segments sent, including those on current connections, excluding segments containing only retransmitted octets
short
TCP Connections CurrEstab - Number of TCP connections in ESTABLISHED or CLOSE-WAIT state short
TCP Direct Transition ActiveOpens - Number of TCP connections transitioning directly from CLOSED to SYN-SENT
PassiveOpens - Number of TCP connections transitioning directly from LISTEN to SYN-RCVD
short
Netstat IP Forwarding Forwarding - Number of IP packets forwarded short
ICMP Errors InErrors - ICMP messages received with ICMP-specific errors (e.g., invalid checksum, incorrect length) short
UDP Errors InCsumErrors - Average number of UDP packets with checksum errors (over 5 minutes)
InErrors - Average number of incoming UDP packets undeliverable for reasons other than missing listener (over 5 minutes)
RcvbufErrors - Average number of UDP packets dropped due to receive buffer overflow (over 5 minutes)
SndbufErrors - Average number of UDP packets dropped due to send buffer overflow (over 5 minutes)
NoPorts - Average number of UDP packets received on unknown ports (over 5 minutes)
short p3 100
TCP Errors ListenOverflows - Number of times the listen queue of a socket overflowed
ListenDrops - Number of SYNs ignored on LISTEN sockets
TCPSynRetrans - Retransmissions of SYN or SYN/ACK to break connection establishment, including fast and timeout retransmissions
RetransSegs - Number of retransmitted segments (segments containing one or more previously transmitted octets)
InErrs - Segments received with errors (e.g., incorrect TCP checksum)
OutRsts - Segments sent with the RST flag
short p3 100
TCP SyncCookie SyncookiesFailed - Number of invalid SYN cookies received
SyncookiesRecv - Number of SYN cookies received
SyncookiesSent - Number of SYN cookies sent
short

3.16 Node Exporter

Metric Name Description Unit Severity Level Recommended Alert Threshold
Node Exporter Scrape Time Duration of each collector scrape seconds
Node Exporter Scrape Number of collectors functioning normally short

4 YMatrix Host ext

Metric Name Description Unit Severity Level Recommended Alert Threshold
Host 5-minute Load 5-minute load average across selected hosts short
Host Memory Usage Ratio Memory usage percentage across selected hosts 0-1
CPU Busy Percentage CPU utilization percentage 0-1
Disk I/O Utilization Disk I/O usage rate 0-1
Free Space Utilization Free disk space utilization on selected hosts 0-1
Network Traffic Sent Network traffic transmitted by selected hosts bit
Network Traffic Received Network traffic received by selected hosts bit
SWAP Usage SWAP usage on selected hosts 0-1

Metric Name Description Unit Severity Level Recommended Alert Threshold
net dev Network device status short
softnet_stat Memory usage percentage across selected hosts short
hardirq_cpu Number of CPU hardware interrupts short
hardirq_cpu_pie Pie chart of CPU hardware interrupts short
hardirq_quene Number of hardware interrupts per device short
hardirq_quene_pie Pie chart of hardware interrupts per device short
softirq_rx Number of software interrupts for data reception short
softirq_rx_pie Pie chart of software interrupts for data reception short
softirq_tx Number of software interrupts for data transmission short
softirq_tx_pie Pie chart of software interrupts for data transmission short
ip Packet receive/transmit statistics at IP layer short
udp Packet receive/transmit statistics for UDP protocol short

5 YMatrix Database ext

Metric Name Description Unit Severity Level Recommended Alert Threshold
license expiration time Remaining time until LICENSE expires seconds (s) p3/p2 Alert p3 if less than 15 days remaining
Alert p2 if less than 7 days remaining; contact YMatrix promptly to renew LICENSE
Missing partition strategy for range tables Range partitioned tables missing APM partition strategy configuration short p2 Must be addressed promptly; otherwise data will be written to default partition, affecting performance
Range partition table creation count Number of delayed new partitions in range partitioned tables short p2 Must be addressed promptly; otherwise data will be written to default partition, affecting performance
mars table max runs Internal metric for MARS2 short p3/p2 Alert p3 if exceeds 1500; monitor trend
Alert p2 if exceeds 1800
Write performance degrades significantly or becomes impossible when value reaches 2039
Max block_items value Instantaneous batch write count by mxgate short
YMatrix Total Process Count Total number of postgres-related processes on selected hosts short p2 Prevent excessive process count which may lead to memory exhaustion; configure as needed
Duplicate Index Count Number of duplicate indexes; consider removing unnecessary ones short p3
matrixgate Connection Count Total number of connections to mxgate processes short
24-Hour Data Volume Change Total data change over the last 24 hours bytes
Top10 Subpartition Count Top 10 tables with highest number of subpartitions. Configure as needed to avoid excessive subtables, which may impact query performance and consume more memory bytes
Top10 Schema Size Top 10 schemas by total size bytes
Top10 System Table Size Top 10 system tables by total size bytes
Top10 Default Partition Table Size Top 10 default partition tables by size bytes p3 Alert if default partition is too large; normally, default partitions should not contain data

Metric Name Description Unit Severity Level Recommended Alert Threshold
mars2 table max runs details Trend chart of runs for MARS2 tables short
Database Connection Details Grouped by database, client address, and application_name short
24-Hour Database Space Change Database size change over 24 hours for each database short
Total Query Duration Query Total query execution time across database stages milliseconds (ms) p3 Configure as needed; investigate if total time changes significantly
Host YMatrix Process Trend Trend of total postgres process count per host short

Metric Name Description Unit Severity Level Recommended Alert Threshold
Table Bloat Details Lists tables where dead tuples / live tuples > 1.1 short
Top 100 Process RSS Details Top 100 postgres processes sorted by RSS (memory usage) short
Slow Query Monitoring Statistics on slow SQL queries executed in the database none p3
Total Duration Query Monitoring Statistics on total SQL execution time milliseconds (ms)
Duration Statistics Chart (seconds) Total SQL execution time aggregated every 5 minutes milliseconds (ms)
Long Transaction Metrics Details of long-running transactions on Master/Segment none p3
Lock Wait Information Details of database lock waits at data collection time none p3 Configure as needed; consider alerting for locks lasting over 10 minutes