Prometheus Monitoring Parameter Interpretation

This document describes the metrics shown in the Prometheus monitoring panels for YMatrix, MatrixGate, and host nodes, together with their reference alarm thresholds.

Alarm level description

  • p0: Must be handled immediately; the cluster is no longer available.
  • p1: Must be handled as soon as possible; if it is left unhandled even for a short time, cluster use may be affected.
  • p2: Needs attention; if it is left unhandled for a long time, cluster use may be affected.
  • p3: Does not affect cluster use; configure as needed.

Note:
For metrics that have no reference alarm threshold, decide on and configure the alarm conditions according to your actual situation.

1 YMatrix Monitoring Metrics

1.1 Overview

This section shows the overall operating status of the cluster, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Cluster status | Cluster status code:<br>0: Normal<br>1: No Standby<br>2: No Mirror<br>10: Distribution unbalanced (some nodes have not been rebalanced after going down and recovering)<br>11: Primary and Mirror out of sync (some Mirror nodes are not synchronized with their Primary)<br>12: Master only (the cluster started only the Master node, usually for diagnosis)<br>20: Segment down (a Segment node is unavailable and the cluster is unavailable) | short | p0 | 20: a Segment being down is a serious event and requires an alarm |
| Runtime | Time YMatrix has been running since startup, and uptime of the Master host operating system | seconds (s) | | |
| version | YMatrix version | | | |
| Connection status | Number of connections in the database system: total connections (Total), connections whose queries are blocked (Blocked), idle connections (Idle), and connections idle in a transaction (Idle in TXN) | short | | |
| Slow query number | Number of queries in the current system whose execution time exceeds 1 day | short | | A value greater than 0 means there is an exceptionally slow query and an alarm is required |
| Transactions | Transaction commit and rollback counts | short | | |
| Disk Space in Use | Disk usage of the Master or Segment node instance | 0-1 | | |
| Node status | State of each node:<br>0: UP (normal)<br>10: Switched (roles swapped; a failover has occurred and the node needs to be rebalanced)<br>11: Resync (Primary and Mirror are resynchronizing)<br>20: Down | short | p2/p1 | Alarm if the value has been non-zero for more than 5 minutes<br>Raise the alarm to p1 when the value is 20 |
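
The reference thresholds for the two status indicators can be sketched as a small rule set. This is only an illustration: the function names are hypothetical (they are not part of YMatrix or Prometheus), and it assumes that the 5-minute rule applies to every non-zero node status, with status 20 raising the level from p2 to p1.

```python
# Hypothetical helpers (not part of YMatrix): they only encode the
# reference thresholds listed in the Overview table.
FIVE_MINUTES = 5 * 60  # seconds


def cluster_alarm_level(status_code):
    """Return "p0" when the cluster status is 20 (Segment down), else None."""
    return "p0" if status_code == 20 else None


def node_alarm_level(status_code, nonzero_seconds):
    """Return the alarm level for a node status held for `nonzero_seconds`.

    Assumption: the 5-minute rule applies to every non-zero status, and
    status 20 (Down) raises the level from p2 to p1.
    """
    if status_code == 0 or nonzero_seconds <= FIVE_MINUTES:
        return None
    return "p1" if status_code == 20 else "p2"
```

For example, `node_alarm_level(10, 301)` yields `"p2"`, while a node that has been Down (20) for ten minutes yields `"p1"`.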

1.2 Database Performance

This section shows database performance, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Page Hit Ratio | Ratio of HEAP table read operations that hit the block cache to the total number of read operations (the cache here is only the one maintained for HEAP tables and does not include the operating system cache)<br>The displayed value is the current value, and the curve shows historical values<br>The value should usually be above 90% | 0-1 | | |
| Temp Size | Total amount of data written to temporary files by queries in the database. All temporary files are counted, regardless of why they were created and regardless of the log_temp_files setting | bytes | | |
| Sessions Per Database | Number of sessions per database | short | | |
| Activities | Number of sessions in each state | short | | |
| Deadlocks | Number of deadlocks | short | p3 | When a deadlock occurs, YMatrix resolves it automatically. Failed queries can be retried; configure an alarm if needed |
| Checksum Failures | NULL | short | p3 | |
| Rows Read | Number of data rows read | short | | |
| Checkpoints | Checkpoint statistics. Orange is the number of checkpoints that were explicitly requested; green is the number of checkpoints generated automatically due to timeout | short | | |
| Page Cache Hit | blks_hit: Number of cache hits when reading data pages<br>blks_read: Number of cache misses that required a read from disk | | | |
| Replication Latency | write_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written it (but not yet flushed or applied it). If a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to remote_write<br>flush_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written and flushed it (but not yet applied it). If a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to on<br>replay_lag - Time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written, flushed, and applied it. If a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to remote_apply | milliseconds (ms) | p3 | By default, Primary and Mirror use synchronous replication, so a lag greater than 1 s makes transaction commits very slow. With asynchronous replication, the alarm threshold can be adjusted as appropriate |
| Rows Insert/Update/Delete | Number of rows inserted, updated, or deleted | short | | |
| Checkpoint buffers | buffers_checkpoint - Number of buffers written during checkpoints<br>buffers_clean - Number of buffers written by the background writer<br>buffers_backend - Number of buffers written directly by backend processes | short | | |
| Top 10 Replication Lag Size | Top 10 replication lags by WAL size | bytes | p3 | By default, Primary and Mirror use synchronous replication, so a lag greater than 1 GB makes transaction commits very slow. With asynchronous replication, the alarm threshold can be adjusted as appropriate |
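
As a rough illustration of how the Page Hit Ratio relates to the blks_hit and blks_read counters, the sketch below assumes the ratio is computed as hits divided by hits plus disk reads. The function names are hypothetical, and the exact query used by the panel is not shown in this document.

```python
# Hypothetical helpers: a sketch of the hit-ratio arithmetic, assuming
# ratio = blks_hit / (blks_hit + blks_read).
def page_hit_ratio(blks_hit, blks_read):
    """Return the cache hit ratio in the range 0-1, or None with no reads."""
    total = blks_hit + blks_read
    if total == 0:
        return None  # no read activity yet, so the ratio is undefined
    return blks_hit / total


def below_usual_floor(ratio, floor=0.90):
    """True when the ratio is under the 90% level the table calls usual."""
    return ratio is not None and ratio < floor
```

With 90 hits and 10 disk reads the ratio is 0.9, which sits exactly at the usual floor and is therefore not flagged.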

1.3 Storage

This section displays storage-related statistics, including:

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Top 10 Database | Top 10 databases by size | bytes | | |
| Top 10 Users | Top 10 users by data size | bytes | | |
| Top 10 Aging Database | Top 10 databases by age | short | p2 | The maximum database age is 21E (E = 100 million, so about 2.1 billion). When only 1E remains, the YMatrix instance is forced to stop. When 5E remains, a warning appears in the log. The recommended alarm thresholds are 6E and 2E |
| Top 10 Big Tables | Top 10 tables by size | bytes | | |
| Top 10 Big Partitions | Top 10 partitioned tables by size | bytes | | |
| Top 10 Growth Today | The 10 tables with the fastest data growth today | bytes | | |
| Top 10 Growth Last 7 Days | The 10 tables with the fastest data growth over the last 7 days | bytes | | |

2 MatrixGate Monitoring Metrics

2.1 Basic Information

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| version | mxgate version number | | | |
| Run time | mxgate run time | seconds (s) | | |
| Process number | PID of the mxgate background process | short | p2 | If there is no process number, mxgate may be down |

2.2 Task information

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Target table | Target table into which this task inserts data | | | |
| Total number of rows in the database | Total number of rows successfully loaded into the database since mxgate started | short | | |
| Total number of error rows | Total number of rows that failed to load since mxgate started | short | p3 | Set the alarm threshold according to the situation |
| Total inlet size | Amount of data this task has successfully loaded since mxgate started | short | | |
| Concurrency | Total concurrency: the value of the configuration item stream-prepared + 1, which is the upper limit of concurrency<br>Number of work: the concurrency actually at work. Some threads go dormant, so this may be lower than the configured value | short | | |
| Transaction time granularity | Time span of each data transaction commit | short | | |
| Target table blocking | Number of times the target table was blocked | short | | |

2.3 Load Statistics

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Number of submitted rows | Number of rows submitted by the job | short | | |
| Number of rows entered into the database | | short | | |
| Number of blocked rows | Number of blocked rows | short | p3 | Set the alarm threshold according to the situation |
| Failed rows | Number of rows the job failed to write | short | p3 | Set the alarm threshold according to the situation |
| Written data amount | Total number of bytes written by this job | bytes | | |

2.4 Delay Statistics

The delay of each data loading stage is reported as statistics over a period of time, including:

  • max: maximum value
  • min: minimum value
  • 95%: average of 95% of the data

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Total delay statistics | Sum of the delays listed below | nanoseconds (ns) | p3 | 30s |
| insertStart Delay Stat | Delay from executing INSERT to sending the first row to the Segment | nanoseconds (ns) | | |
| write Delay Statistics | Time mxgate takes to send this batch of data to the Segment | nanoseconds (ns) | | |
| insertDone Delay Statistics | Delay from sending the last row to the Segment until the INSERT statement has finished (the data is redistributed among the Segments, and the stage ends once it has been written to disk) | nanoseconds (ns) | | |
| commit delay statistics | Delay of executing the commit command | nanoseconds (ns) | | |
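
Because the delays are reported in nanoseconds while the reference threshold is given in seconds, a unit conversion is needed before comparing them. The following is a minimal sketch with hypothetical function names; it assumes the total is the plain sum of the four stage delays, as the table describes.

```python
# Hypothetical helpers: sum the four stage delays (in nanoseconds) and
# compare the total against the 30 s reference threshold.
NS_PER_SECOND = 1_000_000_000


def total_delay_ns(insert_start_ns, write_ns, insert_done_ns, commit_ns):
    """Total delay as the sum of the insertStart, write, insertDone, commit stages."""
    return insert_start_ns + write_ns + insert_done_ns + commit_ns


def exceeds_threshold(delay_ns, threshold_seconds=30):
    """True when a delay in nanoseconds is above the threshold in seconds."""
    return delay_ns > threshold_seconds * NS_PER_SECOND
```

A batch whose four stages take 1 s, 20 s, 8 s, and 2 s totals 31 s and would therefore cross the p3 threshold.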

2.5 Database Events

Indicator Name Description Unit Level Reference Alarm Threshold
| CHECKPOINT times | Number of CHECKPOINTs executed within one minute | short | | |
| CHECKPOINT Write Delay | Total time spent in the part of checkpoint processing where files are written to disk | milliseconds (ms) | | |
| CHECKPOINT Synchronization Delay | Total time spent in the part of checkpoint processing where files are synchronized to disk | milliseconds (ms) | | |
| Number of cache blocks applied | Number of buffers allocated | short | | |
| Number of cache blocks written to disk | Divided into three categories:<br>1. Number of buffers written during checkpoints<br>2. Number of buffers written by the background writer<br>3. Number of buffers written directly by a backend | short | | |
| The number of times the dirty page has reached the upper limit | Number of times the background writer stopped a cleaning scan because it had written too many buffers | short | | |
| Master-slave latency logs | WAL lag between Master and Standby or between Primary and Mirror | bytes | | |
| Master-slave delay time | Delay between Master and Standby or between Primary and Mirror | milliseconds (ms) | | |
| Target table blocking event trend chart | Divided into four categories:<br>1. Lock related<br>2. Copy related<br>3. Resource group related<br>4. Resource queue related | short | | |

3 Host node monitoring

3.1 Quick CPU / Mem / Disk

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| CPU Busy | Proportion of time that all CPU cores are busy | 0-1 | | |
| Sys Load (5m avg) | Average load of all CPU cores over 5 minutes | 0-1 | p3/p2 | CPU cores × 3 / CPU cores × 5 |
| Sys Load (15m avg) | Average load of all CPU cores over 15 minutes | 0-1 | p3/p2 | CPU cores × 3 / CPU cores × 5 |
| RAM Used | Proportion of memory in use (total memory - free memory - memory occupied by Buffers and Cached) | 0-1 | | |
| SWAP Used | Proportion of swap space in use | 0-1 | p3 | 80% |
| Root FS Used | Root file system usage | 0-1 | p3/p2 | 60%/80% |
| CPU Cores | Number of physical CPU cores | short | | |
| RootFS Total | Total space of the root file system | bytes | p3/p2 | |
| Uptime | | seconds (s) | | |
| RAM Total | Total memory size | bytes | | |
| SWAP Total | Size of the swap partition | bytes | | |
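
The RAM Used formula and the load thresholds can be written out as follows. This is a sketch under two assumptions: the function names are hypothetical, and the threshold "CPU cores 3 / CPU cores 5" is read as 3 times and 5 times the number of CPU cores.

```python
# Hypothetical helpers illustrating the Quick CPU / Mem / Disk table.
def ram_used_ratio(total, free, buffers, cached):
    """RAM Used as a 0-1 ratio: (total - free - buffers - cached) / total."""
    return (total - free - buffers - cached) / total


def load_alarm_level(load_avg, cpu_cores):
    """Assumption: p3 above cores * 3 and p2 above cores * 5."""
    if load_avg > cpu_cores * 5:
        return "p2"
    if load_avg > cpu_cores * 3:
        return "p3"
    return None
```

On a host with 8 cores, a load average of 30 falls between 24 and 40 and would be reported as p3 under this reading.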

3.2 Basic CPU / Mem / Disk

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| CPU Basic | Basic CPU information, from /proc/stat | 0-1 | | |
| Memory Basic | Basic memory information | bytes | | |
| Network Traffic Basic | Basic network traffic information for each interface | bit | p3/p2 | 60% / 80% of the network card's maximum bandwidth |
| Disk Space Used Basic | Space usage ratio of all mounted file systems | 0-1 | p3 | Disk usage of 60% / 80% |

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| CPU | Percentage of time the CPU spends executing processes in kernel mode | short | | |
| Memory Stack | Memory stack, from /proc/meminfo | bytes | | |
| Network Traffic | Transmission rate of each network interface | bytes/sec | | |
| Disk Space Used | Space used on all mounted file systems | bytes | | |
| Disk IOps | Disk reads and writes | I/O ops/sec (iops) | | |
| I/O Usage Read/Write | Disk read and write rate | bytes | | |
| I/O Utilization | | 0-1 | p3/p2 | 60% / 80% |
| CPU spent seconds in guests (VMs) | Time spent running a guest with a nice value | milliseconds (ms) | | |

3.4 Memory Meminfo

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Memory Active / Inactive | Recently used memory / less recently used memory | | | |
| Memory Active / Inactive Detail | Inactive_file - File-backed memory pages on the LRU list that have not been accessed for a long time (/proc/meminfo LRU_INACTIVE_FILE)<br>Inactive_anon - Anonymous pages and swap cache (including tmpfs) that have not been accessed for a long time (/proc/meminfo LRU_INACTIVE_ANON)<br>Active_file - File-backed memory pages on the LRU list that have been accessed recently (/proc/meminfo LRU_ACTIVE_FILE)<br>Active_anon - Anonymous pages and swap cache (including tmpfs) that have been accessed recently (/proc/meminfo LRU_ACTIVE_ANON) | bytes | | |
| Memory Shared and Mapped | Mapped - Memory occupied by mapped cache pages (/proc/meminfo Mapped)<br>Shmem - Shared memory (/proc/meminfo Shmem) | bytes | | |
| Memory Vmalloc | VmallocChunk - Largest contiguous block of vmalloc memory that can be allocated (/proc/meminfo VmallocChunk)<br>VmallocTotal - Total size of the vmalloc memory area (/proc/meminfo VmallocTotal)<br>VmallocUsed - Amount of vmalloc memory in use (/proc/meminfo VmallocUsed) | bytes | | |
| Memory Anonymous | Active_anon - Recently used anonymous virtual memory pages (/proc/vmstat nr_active_anon)<br>Active_file - Recently used file virtual memory pages (/proc/vmstat nr_active_file) | bytes | | |
| Memory HugePages Counter | HugePages_Free - Total number of free HugePages currently held by the system (/proc/meminfo HugePages_Free)<br>HugePages_Rsvd - Total number of HugePages currently reserved by the system, that is, pages a program has requested but that the system has not actually allocated because the program has not yet read from or written to them (/proc/meminfo HugePages_Rsvd)<br>HugePages_Surp - Number of HugePages in excess of the persistent HugePages count configured for the system (/proc/meminfo HugePages_Surp) | bytes | | |
| Memory DirectMap | DirectMap1G - Amount of memory mapped with 1 GB pages<br>DirectMap2M - Amount of memory mapped with 2 MB pages<br>DirectMap4K - Amount of memory mapped with 4 kB pages | bytes | | |
| Memory NFS | NFS Unstable - Cache pages sent to the NFS server but not yet written to disk | bytes | | |
| Memory Commission | Amount of memory the system has currently allocated, including memory that has been allocated but not yet used<br>Amount of memory the system can currently allocate | bytes | p3/p2 | 60%/80% |
| Memory Writeback and Dirty | Writeback - Cache pages that are actively being written back to disk (/proc/meminfo Writeback)<br>WritebackTmp - Memory used by FUSE for temporary writeback buffers (/proc/meminfo WritebackTmp)<br>Dirty - Amount of data waiting to be written back to disk (/proc/meminfo Dirty) | bytes | | |
| Memory Slab | Reclaimable - Reclaimable slab virtual memory pages (/proc/vmstat nr_slab_reclaimable)<br>Unreclaimable - Unreclaimable slab virtual memory pages (/proc/vmstat nr_slab_unreclaimable) | bytes | | |
| Memory Bounce | Bounce - Memory occupied by bounce buffers (/proc/meminfo Bounce) | bytes | | |
| Memory Kernel / CPU | KernelStack - Kernel stack size (resident memory, not reclaimable)<br>PerCPU - Per-CPU memory allocated for loadable modules | bytes | | |
| Memory HugePages Size | HugePages - Total number of HugePages currently held by the system (/proc/meminfo HugePages)<br>Hugepagesize - Size of each HugePage (/proc/meminfo Hugepagesize) | bytes | | |
| Memory Unevictable MLocked | Unevictable - Memory that cannot be reclaimed (/proc/meminfo Unevictable)<br>MLocked - Memory locked by the mlock() system call (/proc/meminfo MLocked) | bytes | | |
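
Most of the values in this section come from /proc/meminfo, which lists one `Name: value [kB]` pair per line. The sketch below parses text in that layout and derives a commit ratio for the 60%/80% threshold. The sample values are made up, the function name is hypothetical, and mapping the two "Memory Commission" series to the Committed_AS and CommitLimit fields is an assumption.

```python
# Hypothetical parser for /proc/meminfo-style text. Sample values are made up.
def parse_meminfo(text):
    """Return {field: value}; values given in kB are converted to bytes."""
    result = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip lines that are not "Name: value" pairs
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024
        result[name.strip()] = value
    return result


SAMPLE = """\
MemTotal:       16384000 kB
Committed_AS:    8192000 kB
CommitLimit:    10240000 kB
HugePages_Free:        0
"""

info = parse_meminfo(SAMPLE)
# Assumption: committed memory relative to the commit limit is what the
# 60%/80% thresholds refer to.
commit_ratio = info["Committed_AS"] / info["CommitLimit"]
```

On a real host the same function could be applied to the contents of `/proc/meminfo` instead of the sample string.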

3.5 Memory Vmstat

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Memory Pages In / Out | Pagesin - Rate at which data is read from disk into physical memory (over 5 minutes) (/proc/vmstat pgpgin)<br>Pagesout - Rate at which data is written from physical memory to disk (over 5 minutes) (/proc/vmstat pgpgout) | short | | |
| Memory Page Faults | Pgfault - Average number of major and minor page faults (over 5 minutes) (/proc/vmstat pgfault)<br>Pgmajfault - Average number of major page faults (over 5 minutes) (/proc/vmstat pgmajfault)<br>Pgminfault - Average number of minor page faults (over 5 minutes) | short | | |
| Memory Pages Swap In / Out | Pswpin - Rate at which data is loaded into memory from the disk swap area (over 5 minutes) (/proc/vmstat pswpin)<br>Pswpout - Rate at which data is moved from memory to the disk swap area (over 5 minutes) (/proc/vmstat pswpout) | short | | |
| OOM Killer | Number of OOM Killer invocations | short | p3 | Any change in this value should be avoided |
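
The /proc/vmstat values behind these panels are cumulative counters, so a "rate over 5 minutes" is derived from the difference between two samples. The following is a simplified sketch with a hypothetical function name; it is not the exact calculation Prometheus performs, and its handling of counter resets is deliberately minimal.

```python
# Hypothetical helper: a simplified per-second rate from two samples of a
# cumulative counter such as pgpgin, taken `window_seconds` apart.
def per_second_rate(first_value, last_value, window_seconds=300):
    """Return the average increase per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    delta = last_value - first_value
    if delta < 0:
        # The counter restarted (for example after a reboot); count only
        # what has accumulated since the reset.
        delta = last_value
    return delta / window_seconds
```

For two pgpgin samples of 1000 and 4000 taken 300 seconds apart, the result is 10 pages per second.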

3.6 System Timesync

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Time Syncronized Drift | Estimated error (seconds)<br>Time offset between the local system and the reference clock<br>Maximum error (seconds) | short | | |
| Time Syncronized Status | Whether the clock is synchronized with a reliable server<br>Estimated error (seconds) | short | | |
| Time PLL Adjust | Phase-locked loop time adjustment | short | | |
| Time Misc | Seconds between clock ticks<br>International Atomic Time (TAI) offset | short | | |

3.7 System Processes

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Processes Status | Processes blocked - Number of tasks currently blocked (/proc/stat procs_blocked)<br>Processes in runnable state - Number of tasks currently running (/proc/stat procs_running) | short | p3 | blocked: 10 |
| Processes Forks | Processes forks second - Number of processes created per second | short | | |
| PIDs Number and Limit | Number of processes currently running on the host<br>Maximum number of processes allowed on the host | short | p3/p2 | 15000/20000 |
| Processes Memory | Virtual memory occupied by processes<br>Maximum virtual memory available to processes | bytes | | |
| Process schedule stats Running / Waiting | Time spent running processes<br>Time processes spent waiting for the CPU | ms | | |
| Threads Number and Limit | Current total number of threads<br>Maximum number of threads on the host | short | | |

3.8 System Misc

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Context Switches / Interrupts | Context switches - Average number of CPU context switches (over 5 minutes)<br>Interrupts - Average total number of interrupts serviced (over 5 minutes) | short | | |
| Interrupts Detail | The system's current interrupt list and the average count for each interrupt number (over 5 minutes) (/proc/interrupts) | short | | |
| Entropy | Entropy available to the random number generator | short | | |
| File Descriptors | Maximum number of open file descriptors<br>Number of open file descriptors | short | | |
| Schedule timeslices executed by each cpu | Schedule timeslices executed by each CPU | short | | |
| CPU time spent in user and system contexts | | short | | |

3.9 Hardware Misc

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Hardware temperature monitor | Hardware temperature monitoring | Celsius (℃) | | |
| Power supply | Whether power is being supplied | short | | |
| Throttle cooling device | Cooling device status | short | | |

3.10 Systemd

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Systemd Sockets | Total number of connections accepted by sockets | short | | |
| Systemd Units State | inactive - Inactive systemd units<br>failed - Failed systemd units<br>deactivating - Systemd units being deactivated<br>active - Active systemd units<br>activating - Systemd units being activated | short | | |

3.11 Storage Disk

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Disk IOps Completed | Reads completed - Number of reads completed per second per disk partition<br>Writes completed - Number of writes completed per second per disk partition | I/O ops/sec (iops) | | |
| Disk Average Wait Time | Read wait time avg - Average read wait time per disk<br>Write wait time avg - Average write wait time per disk | milliseconds (ms) | p3 | 1s |
| Disk R/W Merged | Read merged - Number of merged reads per second per disk partition<br>Write merged - Number of merged writes per second per disk partition | I/O ops/sec (iops) | | |
| Instantaneous Queue Size | Number of outstanding requests at the time of sampling. It is incremented when a request is handed to the appropriate request_queue structure and decremented when the request completes | short | | |
| Disk R/W Data | Read bytes - Number of bytes read per second per disk partition<br>Written bytes - Number of bytes written per second per disk partition | bytes/sec | | |
| Average Queue Size | Average queue length of requests issued to the device | short | | |
| Time Spent Doing I/Os | Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization of the device). For devices that serve requests serially, saturation occurs when this value is close to 100%. For devices that serve requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits | 0-1 | | |
| Disk IOps Discards completed / merged | Discards completed - Completed discard operations per second<br>Discards merged - Merged discard operations per second | I/O ops/sec (iops) | | |
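
An average wait time of this kind is commonly obtained by dividing the time spent on I/O during an interval by the number of operations completed in the same interval. The sketch below assumes that calculation; the function names are hypothetical and the panel's actual query is not given in this document.

```python
# Hypothetical helpers: average wait per operation over one sampling
# interval, assuming wait = time spent on I/O / operations completed.
def average_wait_ms(delta_io_seconds, delta_ops_completed):
    """Return the average wait time per completed operation in milliseconds."""
    if delta_ops_completed == 0:
        return 0.0  # no operations completed, so there is nothing to average
    return delta_io_seconds / delta_ops_completed * 1000


def wait_exceeds_threshold(wait_ms, threshold_ms=1000):
    """True when the average wait is above the 1 s reference threshold."""
    return wait_ms > threshold_ms
```

If a disk spent 2 seconds completing 1000 reads during the interval, the average read wait is about 2 ms, far below the 1 s threshold.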

3.12 Storage Filesystem

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Filesystem space available | Available space on the mounted file system<br>Remaining space on the mounted file system<br>Used space on the mounted file system | bytes | p3/p2 | 60%/80% |
| File Descriptor | Maximum open file descriptors - Maximum number of open file descriptors<br>Open file descriptors - Number of open file descriptors | short | | |
| Filesystem in ReadOnly / Error | ReadOnly - File systems mounted in read-only mode<br>Device error - Number of device errors | short | p3 | |
| File Nodes Free | Free file nodes: number of free inodes on the mounted file system | short | p3 | 60% |
| File Nodes Size | File nodes total: total number of file nodes on the mounted file system | short | | |
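
The 60%/80% file system thresholds can be expressed as a two-step check on the used ratio. This is a minimal sketch with hypothetical function names; it assumes the used ratio is derived from the total and available sizes and that a value equal to a threshold already triggers the alarm.

```python
# Hypothetical helpers for the 60% (p3) / 80% (p2) file system thresholds.
def used_ratio(size_bytes, avail_bytes):
    """Used share of a file system as a 0-1 ratio."""
    return 1 - avail_bytes / size_bytes


def usage_alarm_level(ratio, p3_at=0.60, p2_at=0.80):
    """Assumption: reaching a threshold is enough to trigger its level."""
    if ratio >= p2_at:
        return "p2"
    if ratio >= p3_at:
        return "p3"
    return None
```

A file system of 100 GB with 25 GB available has a used ratio of 0.75 and would be reported as p3.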

3.13 Network Traffic

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Network traffic by Packets | Receive - Total number of packets received per second by each interface<br>Transmit - Total number of packets sent per second by each interface | packets/sec | | |
| Network Traffic Drop | Receive drop - Total number of received packets dropped per second by each interface<br>Transmit drop - Total number of sent packets dropped per second by each interface | packets/sec | p3 | 100 |
| Network Traffic Multicast | Receive multicast - Number of multicast packets received per second by each interface | packets/sec | | |
| Network Traffic Frame | Receive frame - Number of frames received per second by each interface | packets/sec | | |
| Network Traffic Colls | Transmit colls - Number of collisions detected on each interface | short | | |
| ARP Entries | ARP entries - Number of entries in the ARP table for each interface | short | | |
| Speed | Speed - Maximum bandwidth of the network card | bytes | | |
| Softnet Packets | Processed - Number of packets processed by each CPU<br>Dropped - Number of packets dropped by each CPU | | | |
| Network Operational Status | Physical link state of each network card | short | | |
| Network Traffic Errors | Receive errors - Total number of error packets received per second by each interface<br>Transmit errors - Total number of error packets sent per second by each interface | packets/sec | p3 | 100 |
| Network Traffic Compressed | Receive compressed - Total number of compressed packets received per second by each interface<br>Transmit compressed - Total number of compressed packets sent per second by each interface | packets/sec | | |
| Network traffic Fifo | Receive fifo - Total number of FIFO packets received per second by each interface<br>Transmit fifo - Total number of FIFO packets sent per second by each interface | packets/sec | | |
| Network Traffic Carrier | Statistic transmit_carrier - Number of carrier losses detected by each interface | short | | |
| NF Contrack | NF conntrack entries - Number of tracked connections<br>NF conntrack limit | short | | |
| MTU | Maximum packet size for each interface | bytes | | |
| Queue Length | Transmit queue length for each interface | short | | |
| Softnet Out of Quota | Backlog of each CPU | 0-1 | | |

3.14 Network Sockstat

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Sockstat TCP | TCP_alloc - Number of TCP sockets allocated (established, with an sk_buff requested)<br>TCP_inuse - Number of TCP sockets in use (listening)<br>TCP_mem - Memory used by TCP sockets<br>TCP_orphan - Number of orphaned TCP connections (not belonging to any process)<br>TCP_tw - Number of TCP connections waiting to be closed | short | | |
| Sockstat FRAG / RAW | FRAG_inuse - Number of Frag sockets in use<br>FRAG_memory - Frag buffers in use<br>RAW_inuse - Number of Raw sockets in use | short | | |
| Sockstat Used | Sockets_used - Total number of sockets in use across all protocols | short | | |
| Sockstat UDP | UDPLITE_inuse - Number of UDP-Lite sockets in use | short | | |
| Sockstat Memory Size | TCP_mem_bytes - TCP socket buffer size in bytes<br>UDP_mem_bytes - UDP socket buffer size in bytes | bytes | | |

3.15 Network Netstat

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Netstat IP In / Out Octets | InOctets - Number of octets received<br>OutOctets - Number of octets sent | short | | |
| ICMP In / Out | InMsgs - Messages received; this counter includes all messages counted by icmpInErrors<br>OutMsgs - Messages that this host attempted to send; this counter includes all messages counted by icmpOutErrors | short | | |
| UDP In / Out | InDatagrams - Average number of UDP packets received (over 5 minutes)<br>OutDatagrams - Average number of UDP packets sent (over 5 minutes) | short | | |
| TCP In / Out | InSegs - Segments received, including those received in error. This count includes segments received on currently established connections<br>OutSegs - Segments sent, including those on current connections but excluding those that contain only retransmitted octets | short | | |
| TCP Connections | CurrEstab - Number of TCP connections whose current state is ESTABLISHED or CLOSE-WAIT | short | | |
| TCP Direct Transition | ActiveOpens - Number of TCP connections that have made a direct transition from the CLOSED state to the SYN-SENT state<br>PassiveOpens - Number of TCP connections that have made a direct transition from the LISTEN state to the SYN-RCVD state | short | | |
| Netstat IP Forwarding | Forwarding - IP forwarding packet count | short | | |
| ICMP Errors | InErrors - Messages received and determined to have ICMP-specific errors (bad ICMP checksum, bad length, etc.) | short | | |
| UDP Errors | InCsumErrors - Average number of UDP packets with checksum errors (over 5 minutes)<br>InErrors - Average number of UDP packets that could not be delivered to the application for reasons other than no listener on the destination port (over 5 minutes)<br>RcvbufErrors - Average number of UDP packets lost because the receive buffer overflowed (over 5 minutes)<br>SndbufErrors - Average number of UDP packets lost because the send buffer overflowed (over 5 minutes)<br>NoPorts - Average number of UDP packets received for an unknown port (over 5 minutes) | short | p3 | 100 |
| TCP Errors | ListenOverflows - Number of times a socket's listen queue overflowed<br>ListenDrops - Number of SYNs to LISTEN sockets that were ignored<br>TCPSynRetrans - Number of SYN and SYN/ACK retransmissions, used to break retransmissions down into SYN, fast, and timeout retransmissions<br>RetransSegs - Number of segments retransmitted, that is, the number of TCP segments transmitted that contain one or more previously transmitted octets<br>InErrs - Segments received in error (e.g., bad TCP checksum)<br>OutRsts - Segments sent with the RST flag | short | p3 | 100 |
| TCP SynCookie | SyncookiesFailed - Number of invalid SYN cookies received<br>SyncookiesRecv - Number of SYN cookies received<br>SyncookiesSent - Number of SYN cookies sent | short | | |

3.16 Node Exporter

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Node Exporter Scrape Time | Scrape duration of each collector | seconds | | |
| Node Exporter Scrape | Working status of each collector | short | | |

4 YMatrix Host ext

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Host five minutes load | Five-minute load of all selected hosts | short | | |
| Host Memory Percentage | Memory usage percentage of all selected hosts | 0-1 | | |
| CPU Busy Percent | CPU busy percentage | 0-1 | | |
| Disk I/O Usage | Disk I/O usage | 0-1 | | |
| Remaining space utilization | Remaining space ratio of the selected hosts | 0-1 | | |
| Send network traffic | Network traffic sent by the selected hosts | bit | | |
| Receive network traffic | Network traffic received by the selected hosts | bit | | |
| SWAP Usage | SWAP usage of the selected hosts | 0-1 | | |

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| net dev | Network device status | short | | |
| softnet_stat | Displays the percentage of memory usage of all selected hosts | short | | |
| hardirq_cpu | CPU hardware interrupts | short | | |
| hardirq_cpu_pie | Pie chart of CPU hardware interrupt counts | short | | |
| hardirq_quene | Number of hardware interrupts for each device | short | | |
| hardirq_quene_pie | Pie chart of hardware interrupts for each device | short | | |
| softirq_rx | Number of software interrupts for data reception | short | | |
| softirq_rx_pie | Pie chart of software interrupts for data reception | short | | |
| softirq_tx | Number of software interrupts for data transmission | short | | |
| softirq_tx_pie | Pie chart of software interrupts for data transmission | short | | |
| ip | Transmission and reception statistics for the IP network-layer protocol | short | | |
| udp | Transmission and reception statistics for the UDP protocol | short | | |

5 YMatrix Database ext

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| license Expiration time | Time remaining until the LICENSE expires | seconds (s) | p3/p2 | Alarm at p3 when less than 15 days remain<br>Alarm at p2 when less than 7 days remain. Contact YMatrix promptly to replace the LICENSE |
| Missing partition policy range table | Range-partitioned tables that have no APM partition policy configured | short | p2 | Handle promptly; otherwise data is written to the default partition, which affects performance |
| Range partition table creation number | Number of range-partitioned tables whose new partitions are delayed | short | p2 | Handle promptly; otherwise data is written to the default partition, which affects performance |
| mars table maximum runs | MARS2 internal metric | short | p3/p2 | Above 1500: alarm at p3 and watch whether the value keeps rising<br>Above 1800: alarm at p2<br>When the value reaches 2039, writes become slow or even impossible |
| Maximum block_items value | Number of batches mxgate writes at a given instant | short | | |
| YMatrix Total number of processes | Number of postgres-related processes on the selected host | short | p2 | Guards against too many processes, which would cause insufficient memory; configure as needed |
| Repeat index number | Number of duplicate indexes; consider dropping indexes that are not needed | short | p3 | |
| matrixgate connections | Total number of mxgate process connections | short | | |
| 24-hour data total change value | Total change in data volume over the last 24 hours | bytes | | |
| Top10 Number of subpartitions | Top 10 tables by number of subpartitions. Configure as needed to avoid too many child tables, which affect query performance and occupy more memory | bytes | | |
| Top10 Mode Size | Top 10 schemas by total size | bytes | | |
| Top10 System table size | Top 10 system tables by total size | bytes | | |
| Top10 Default partition table size | Top 10 default partition tables by size | bytes | p3 | An overly large default partition requires an alarm. Normally, no data should exist in the default partition |
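
Two of the thresholds above are tiered, and the license metric is reported in seconds while its thresholds are given in days. The sketch below shows one way to encode both; the function names are hypothetical and the code only restates the reference thresholds from the table.

```python
# Hypothetical helpers encoding the tiered thresholds for the license
# expiration time (in seconds) and the MARS2 maximum runs.
SECONDS_PER_DAY = 86_400


def license_alarm_level(remaining_seconds):
    """p2 under 7 days remaining, p3 under 15 days, otherwise None."""
    remaining_days = remaining_seconds / SECONDS_PER_DAY
    if remaining_days < 7:
        return "p2"
    if remaining_days < 15:
        return "p3"
    return None


def mars_runs_alarm_level(max_runs):
    """p2 above 1800 runs, p3 above 1500 runs, otherwise None."""
    if max_runs > 1800:
        return "p2"
    if max_runs > 1500:
        return "p3"
    return None
```

A license with 10 days left (864000 seconds) maps to p3, and a MARS2 table at 1801 runs maps to p2.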

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| mars2 table maximum runs Details | Trend chart of MARS2 table runs | short | | |
| Database connection details | Grouped by database, client address, and application_name | short | | |
| 24-hour database space changes | Changes in database size over 24 hours | short | | |
| Total time-consuming query | Total time spent in each stage of database queries | milliseconds (ms) | p3 | Configure as needed; pay attention when the total time changes sharply |
| Host YMatrix Process Trend Chart | Trend chart of the total number of postgres processes on each host | short | | |

| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
| Table expansion details | Lists tables whose ratio of dead tuples to live tuples is greater than 1.1 | short | | |
| Top 100 Process RSS Details | Top 100 postgres processes by memory usage, sorted by RSS | short | | |
| Slow query monitoring | Statistics on slow SQL executed in the database | none | p3 | |
| Total time-consuming query monitoring | Total SQL execution time | milliseconds (ms) | | |
| Time-consuming statistics graph (seconds) | Total SQL execution time within each five-minute window | milliseconds (ms) | | |
| Long Things Indicators | Details of long transactions on the Master and Segments | none | p3 | |
| Lock waiting information | Lists database lock wait details at the moment the information was collected | none | p3 | Configure as needed; for example, alert on lock waits of more than 10 minutes |
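
The table expansion criterion (dead tuples divided by live tuples greater than 1.1) can be checked as below. The function name is hypothetical, and treating a table with dead tuples but no live tuples as expanded is an assumption made only to avoid dividing by zero.

```python
# Hypothetical helper for the "dead tuples / live tuples > 1.1" criterion.
def is_table_expanded(dead_tuples, live_tuples, threshold=1.1):
    """True when the dead-to-live tuple ratio is above the threshold."""
    if live_tuples == 0:
        # Assumption: with no live tuples, any dead tuple counts as expansion.
        return dead_tuples > 0
    return dead_tuples / live_tuples > threshold
```

A table with 1200 dead tuples and 1000 live tuples has a ratio of 1.2 and would appear in the list.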