This document describes the indicators shown in the Prometheus monitoring panels for YMatrix, MatrixGate, and host nodes, together with their reference alarm thresholds.
Alarm level description
Note!
For indicators that have no reference alarm threshold, decide on and configure alarm conditions based on your actual environment.
This section shows the overall operating status of the cluster, including:
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Cluster status | Overall cluster status: 0: Normal; 1: No Standby; 2: No Mirror; 10: Distribution unbalanced (some nodes have not been rebalanced after going down and recovering); 11: Primary and Mirror out of sync (some Mirror nodes are not synchronized with their Primary); 12: Master only (only the Master node is started, usually for diagnosis); 20: Segment down (a Segment node is unavailable and the cluster is not available) | short | p0 | 20: Segment downtime is a serious event and requires an alarm |
Runtime | Includes YMatrix's run time since startup and Master host operating system run time | seconds(s) | ||
version | YMatrix version | |||
Connection status | Connection status displays the number of connections in the database system, including: Total number of connections (Total), number of connection queries blocked (Blocked), number of idle connections (Idle), number of idle in transactions (Idle in TXN) | short | ||
Slow query number | Number of queries in the current system whose execution time exceeds 1 day | short | | Greater than 0 means there is a particularly slow query, and an alarm is required |
Transactions | Transaction commit and rollback count statistics | short | ||
Disk Space in Use | Disk usage. Disk usage for Master node or Segment node instance | 0-1 | ||
Node status | State of each node: 0: Up (normal); 10: Switched (roles swapped; a Primary/Mirror switchover has occurred and a rebalance is needed); 11: Resync (Primary and Mirror are resynchronizing); 20: Down | short | p2/p1 | Alarm when the value has been non-zero for more than 5 minutes; when the value is 20, raise the alarm to p1 |
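
The status codes and alarm levels in the table above can be expressed as a small lookup. The following Python sketch is illustrative only: the function names and the idea of passing the status together with the time it has been non-zero are assumptions, not part of YMatrix or Prometheus, and the node rule follows one reading of the threshold text (the 5-minute condition applies first, then a value of 20 escalates to p1).

```python
from typing import Optional

# Cluster status codes as listed in the table above.
CLUSTER_STATUS = {
    0: "Normal",
    1: "No Standby",
    2: "No Mirror",
    10: "Distribution unbalanced",
    11: "Primary and Mirror out of sync",
    12: "Master only",
    20: "Segment down",
}


def cluster_alarm_level(status: int) -> Optional[str]:
    """Return "p0" when a Segment is down (20), otherwise no alarm."""
    return "p0" if status == 20 else None


def node_alarm_level(status: int, nonzero_seconds: float) -> Optional[str]:
    """Return the alarm level for one node.

    Assumption: an alarm is raised only after the status has been
    non-zero for more than 5 minutes; status 20 (Down) is then p1,
    any other non-zero status is p2.
    """
    if status == 0 or nonzero_seconds <= 5 * 60:
        return None
    return "p1" if status == 20 else "p2"
```

For example, `node_alarm_level(10, 600)` yields `"p2"`, while the same duration with status 20 yields `"p1"`.
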
This section shows database performance, including:
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Page Hit Ratio | Ratio of HEAP table reads that hit the block cache to the total number of reads. The cache counted here is only the one maintained for HEAP tables and does not include the operating system cache. The displayed value is the current value, and the curve shows the history. The value should usually be above 90% | 0-1 | ||
Temp Size | Total amount of data written to temporary files by queries in the database. All temporary files are counted, regardless of why they were created and regardless of the log_temp_files setting | bytes | ||
Sessions Per Database | Number of sessions per database | short | ||
Activities | Number of sessions in various states | short | ||
Deadlocks | Number of deadlocks | short | p3 | When a deadlock occurs, YMatrix resolves it automatically. Failed queries can be retried, and alarms can be configured |
Checksum Failures | NULL | short | p3 | |
Rows Read | Read data row count | short | ||
Checkpoints | Checkpoint statistics. Orange is the number of checkpoints that were explicitly requested; green is the number of checkpoints generated automatically due to timeout | short | ||
Page Cache Hit | blks_hit: number of cache hits when reading data pages; blks_read: number of cache misses that required a disk read | |||
Replication Latency | write_lag: time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written it (but not yet flushed or applied it); if a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to remote_write. flush_lag: time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written and flushed it (but not yet applied it); if a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to on. replay_lag: time elapsed between flushing the latest WAL locally and receiving notification that the Standby/Mirror has written, flushed, and applied it; if a Standby/Mirror is configured, this measures the commit delay incurred when synchronous_commit is set to remote_apply | milliseconds(ms) | p3 | By default, Primary and Mirror replicate synchronously, so a latency above 1 s makes transaction commits very slow. With asynchronous replication, the alarm threshold can be adjusted accordingly |
Rows Insert/Update/Delete | Number of data INSERT or UPDATE or DELETE | short | ||
Checkpoint buffers | buffers_checkpoint: number of buffers written during checkpoints; buffers_clean: number of buffers written by the background writer process; buffers_backend: number of buffers written directly by worker processes | short | ||
Top 10 Replication Lag Size | Top 10 Replication Delay WAL Size | bytes | p3 | By default, the primary and Mirror are replicated synchronously. If it is greater than 1GB, transaction commits will become very slow. If it is asynchronous replication, the alarm threshold can be adjusted appropriately |
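
As a rough illustration of how the cache counters and replication thresholds in this table relate, the Python sketch below derives a hit ratio from `blks_hit` and `blks_read` and checks the two p3 replication thresholds. It assumes the ratio is hits divided by hits plus reads, and reads "1GB" as 2^30 bytes; the function names are hypothetical and the panel's actual queries are not given in this document.

```python
from typing import Optional

GIB = 1024 ** 3  # assumption: "1GB" is interpreted as 2**30 bytes


def page_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Share of page reads served from cache, or None when nothing was read."""
    total = blks_hit + blks_read
    if total == 0:
        return None
    return blks_hit / total


def replication_alarm(lag_ms: float, lag_bytes: int) -> Optional[str]:
    """Return "p3" when latency exceeds 1 s or lag size exceeds 1 GB."""
    if lag_ms > 1000 or lag_bytes > GIB:
        return "p3"
    return None
```

A ratio below 0.9 from `page_hit_ratio` would fall under the "usually above 90%" guidance for Page Hit Ratio.
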
This section displays storage-related statistics, including:
| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
|Top 10 Database|Top 10 databases by size|bytes| | |
|Top 10 Users|Top 10 users by data size|bytes| | |
|Top 10 Aging Database|Top 10 databases by age|short|p2|The maximum database age is 2.1 billion. When only 100 million remain, the YMatrix instance is forcibly stopped. When 500 million remain, a warning is written to the log. It is recommended to configure alarms at 600 million and 200 million.|
|Top 10 Big Tables|Top 10 tables by size|bytes| | |
|Top 10 Big Partitions|Top 10 partitioned tables by size|bytes| | |
|Top 10 Growth Today|Top 10 tables by data growth today|bytes| | |
|Top 10 Growth Last 7 Days|Top 10 tables by data growth over the last 7 days|bytes| | |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
version | mxgate version number | |||
Run time | mxgate run time | seconds(s) | ||
Process number | PID of the mxgate background process | short | p2 | If there is no process number, mxgate may be down |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Target table | Target table to which this task data is inserted | |||
Total number of rows in the database | Total number of rows successfully written to the database since mxgate started | short | ||
Total number of error rows | Total number of rows that failed to be written to the database since mxgate started | short | p3 | Alarm threshold can be set according to the situation |
Total inlet size | Size of the data successfully written by this task since mxgate started | short | ||
Concurrency | Total concurrency: the configured upper limit of concurrency, equal to the stream-prepared configuration item + 1. Working concurrency: the number of threads actually working; some threads go dormant, so this may be lower than the configured value | short | ||
Transaction time granularity | Time span of data transaction commit | short | ||
Target table blocking | Number of target table blocking | short |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Number of submitted rows | The job submitted rows | short | ||
Number of rows entered into the database | | short | ||
Number of blocked rows | Number of blocked rows | short | p3 | Alarm threshold can be set according to the situation |
Failed rows | The number of failed rows to write to the job | short | p3 | The alarm threshold can be set according to the situation |
Written data amount | Total number of bytes written by this job | bytes |
The delays of each stage of data entry are statistical values for a period of time, including:
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Total delay statistics | This delay is the sum of the following delays | nanoseconds(ns) | p3 | 30s |
insertStart Delay Stat | Delay from INSERT execution to the first piece of data sent to the Segment | nanoseconds(ns) | ||
write Delay Statistics | mxgate Time-consuming to send this batch of data to the Segment | nanoseconds(ns) | ||
insertDone Delay Statistics | Delay from sending the last piece of data to the Segment until the INSERT statement finishes executing (the data is redistributed among the Segments and the stage ends when it has been written to disk) | nanoseconds(ns) | ||
commit delay statistics | Delay of execution of commit command | nanoseconds(ns) |
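
Because the total delay is reported in nanoseconds while its reference threshold is 30 s, a unit conversion is needed when configuring the alarm. A minimal Python sketch, assuming the four stage delays are available as integers in nanoseconds (the function and parameter names are hypothetical):

```python
from typing import Optional

NS_PER_SECOND = 1_000_000_000
TOTAL_DELAY_THRESHOLD_NS = 30 * NS_PER_SECOND  # 30 s reference threshold


def total_delay_ns(insert_start_ns: int, write_ns: int,
                   insert_done_ns: int, commit_ns: int) -> int:
    """Total delay is the sum of the four stage delays."""
    return insert_start_ns + write_ns + insert_done_ns + commit_ns


def total_delay_alarm(total_ns: int) -> Optional[str]:
    """Return "p3" when the total delay exceeds 30 s."""
    return "p3" if total_ns > TOTAL_DELAY_THRESHOLD_NS else None
```
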
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
CHECKPOINT times | Number of CHECKPOINT execution within one minute | short | ||
CHECKPOINT Write Delay | Total time spent on the checkpoint processing portion of the file being written to disk, in milliseconds | milliseconds(ms) | ||
CHECKPOINT Synchronization Delay | Total time spent on the checkpoint processing portion of the file being synced to disk, in milliseconds | milliseconds(ms) | ||
Number of cache blocks applied | Number of buffers allocated | short | ||
Number of cache blocks written to disk | Divided into three categories: 1. Number of buffers written during checkpoints; 2. Number of buffers written by the background writer process; 3. Number of buffers written directly by a backend | short | ||
The number of times the dirty page has reached the upper limit | The number of times the background writing process stops cleaning due to too many buffers being written | short | ||
Master-slave latency logs | WAL latency between Master and Standby or Primary and Mirror | bytes | ||
Master-slave delay time | Delay time between Master and Standby or Primary and Mirror | milliseconds(ms) | ||
Target table blocking event trend chart | Divided into four categories: 1. Lock-related; 2. Replication-related; 3. Resource group-related; 4. Resource queue-related | short | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
CPU Busy | Collection of Busy status proportion of all CPU cores | 0-1 | ||
Sys Load (5m avg) | Average load of all CPU cores over 5 minutes | 0-1 | p3/p2 | CPU cores × 3 / CPU cores × 5 |
Sys Load (15m avg) | Average load of all CPU cores over 15 minutes | 0-1 | p3/p2 | CPU cores × 3 / CPU cores × 5 |
RAM Used | Size of used memory (total memory - free memory size - memory size occupied by Buffer cache and Cached cache) | 0-1 | ||
SWAP Used | Size of used swap memory | 0-1 | p3 | 80% |
Root FS Used | root file system usage | 0-1 | p3/p2 | 60%/80% |
CPU Cores | Physical CPU cores | short | ||
RootFS Total | Total root file system space | bytes | p3/p2 | |
Uptime | | seconds(s) | ||
RAM Total | Memory Size | bytes | ||
SWAP Total | Size of swap partition | bytes |
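
The RAM Used description above (total memory minus free memory minus Buffer and Cached memory) can be checked against `/proc/meminfo` by hand. The following Python sketch parses meminfo-style text and returns the used fraction on the 0-1 scale; it is an illustration under the assumption that the standard `MemTotal`, `MemFree`, `Buffers`, and `Cached` fields are used, and the panel's actual query may differ.

```python
def parse_meminfo(text: str) -> dict:
    """Parse "/proc/meminfo"-style lines into {field: value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[name.strip()] = int(fields[0])  # most fields are in kB
    return values


def ram_used_ratio(meminfo: dict) -> float:
    """Used = total - free - buffers - cached, as a fraction of total."""
    total = meminfo["MemTotal"]
    used = total - meminfo["MemFree"] - meminfo["Buffers"] - meminfo["Cached"]
    return used / total


# Sample values are made up for illustration.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         4000000 kB
Buffers:         1000000 kB
Cached:          3000000 kB
"""
```

On a Linux host, the same functions can be applied to the contents of `/proc/meminfo` instead of `SAMPLE`.
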
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
CPU Basic | CPU Basic Information /proc/stat | 0-1 | ||
Memory Basic | Memory Basic Information | bytes | ||
Network Traffic Basic | Basic network information for each interface | bit | p3/p2 | 60% / 80% of the network card's maximum bandwidth |
Disk Space Used Basic | Used space ratio of all mounted file systems | 0-1 | p3 | Disk usage 60% / 80% |
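
Several host indicators share the same two-level pattern: p3 at 60% and p2 at 80%. The Python sketch below shows one way to express it, including the conversion of network traffic into a fraction of the card's maximum bandwidth. The function names are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption, since this document does not say whether the boundary value itself triggers the alarm.

```python
from typing import Optional


def two_tier_alarm(ratio: float, p3_at: float = 0.60,
                   p2_at: float = 0.80) -> Optional[str]:
    """Map a 0-1 usage ratio to "p3", "p2", or None."""
    if ratio >= p2_at:
        return "p2"
    if ratio >= p3_at:
        return "p3"
    return None


def bandwidth_ratio(bits_per_second: float,
                    link_speed_bits_per_second: float) -> float:
    """Traffic as a fraction of the network card's maximum bandwidth."""
    return bits_per_second / link_speed_bits_per_second
```

For instance, 700 Mbit/s on a 1 Gbit/s card gives a ratio of 0.7, which falls in the p3 band.
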
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
CPU | Percentage of processes executed by CPU in kernel mode | short | ||
Memory Stack | Memory Stack/proc/meminfo | bytes | ||
Network Traffic | Transmission rate of each network interface | bytes/sec | ||
Disk Space Used | Disk Space Size of All Mounted File Systems | bytes | ||
Disk IOps | Disk Read and Write | I/O ops/sec(iops) | ||
I/O Usage Read/Write | Disk Read and Write Rate | bytes | ||
I/O Utilization | | 0-1 | p3/p2 | 60% / 80% |
CPU spent seconds in guests (VMs) | Time spent running a guest with nice value | milliseconds(ms) |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Memory Active / Inactive | Recently used / less recently used memory | |||
Memory Active / Inactive Detail | Inactive_file: file-backed memory pages on the LRU list that have not been accessed for a long time (/proc/meminfo LRU_INACTIVE_FILE); Inactive_anon: anonymous pages and swap cache (including tmpfs) that have not been accessed for a long time (/proc/meminfo LRU_INACTIVE_ANON); Active_file: file-backed memory pages on the LRU list that were accessed recently (/proc/meminfo LRU_ACTIVE_FILE); Active_anon: anonymous pages and swap cache (including tmpfs) that were accessed recently (/proc/meminfo LRU_ACTIVE_ANON) | bytes | ||
Memory Shared and Mapped | Mapped: memory occupied by mapped cache pages (/proc/meminfo Mapped); Shmem: shared memory (/proc/meminfo Shared) | bytes | ||
Memory Vmalloc | VmallocChunk: largest contiguous block of memory that vmalloc can allocate (/proc/meminfo VmallocChunk); VmallocTotal: total size of the vmalloc memory area (/proc/meminfo VmallocTotal); VmallocUsed: size of the vmalloc memory area already in use (/proc/meminfo VmallocUsed) | bytes | ||
Memory Anonymous | Active_anon: recently used anonymous virtual memory pages (/proc/vmstat nr_active_anon); Active_file: recently used file virtual memory pages (/proc/vmstat nr_active_file) | bytes | ||
Memory HugePages Counter | HugePages_Free: number of free HugePages in the system (/proc/meminfo HugePages_Free); HugePages_Rsvd: number of HugePages reserved by the system, that is, pages a program has requested but that have not actually been allocated because the program has not yet read or written them (/proc/meminfo HugePages_Rsvd); HugePages_Surp: number of HugePages in excess of the configured persistent pool (/proc/meminfo HugePages_Surp) | bytes | ||
Memory DirectMap | DirectMap1G: amount of memory mapped with 1 GB pages; DirectMap2M: amount of memory mapped with 2 MB pages; DirectMap4K: amount of memory mapped with 4 kB pages | bytes | ||
Memory NFS | NFS Unstable - Cache page sent to NFS server but not written to the hard disk | bytes | ||
Memory Commission | The amount of memory the system has currently allocated, including memory that has been allocated but not yet used; and the amount of memory the system can currently allocate | bytes | p3/p2 | 60%/80% |
Memory Writeback and Dirty | Writeback: cache pages being actively written back to disk (/proc/meminfo Writeback); WritebackTmp: memory used by FUSE for temporary writeback buffers (/proc/meminfo WritebackTmp); Dirty: amount of data waiting to be written back to disk (/proc/meminfo Dirty) | bytes | ||
Memory Slab | Reclaimable: reclaimable slab virtual memory pages (/proc/vmstat nr_slab_reclaimable); Unreclaimable: unreclaimable slab virtual memory pages (/proc/vmstat nr_slab_unreclaimable) | bytes | ||
Memory Bounce | Bounce: memory occupied by bounce buffers (/proc/meminfo Bounce) | bytes | ||
Memory Kernel / CPU | KernelStack: kernel stack size (resident memory, not reclaimable); PerCPU: memory allocated for each CPU when loading modules | bytes | ||
Memory HugePages Size | HugePages: total number of HugePages in the system (/proc/meminfo HugePages); Hugepagesize: size of each HugePage (/proc/meminfo Hugepagesize) | bytes | ||
Memory Unevictable MLocked | Unevictable: memory that cannot be reclaimed (/proc/meminfo Unevictable); MLocked: memory locked by the mlock() system call (/proc/meminfo MLocked) | bytes | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Memory Pages In / Out | Pagesin: rate at which data is read from disk into physical memory (within 5 minutes) (/proc/vmstat pgpgin); Pagesout: rate at which data is written from physical memory to disk (within 5 minutes) (/proc/vmstat pgpgout) | short | ||
Memory Page Faults | Pgfault: average number of major and minor page faults (within 5 minutes) (/proc/vmstat pgfault); Pgmajfault: average number of major page faults (within 5 minutes) (/proc/vmstat pgmajfault); Pgminfault: average number of minor page faults (within 5 minutes) | short | ||
Memory Pages Swap In / Out | Pswpin: rate at which data is loaded into memory from the disk swap area (within 5 minutes) (/proc/vmstat pswpin); Pswpout: rate at which data is moved from memory to the disk swap area (within 5 minutes) (/proc/vmstat pswpout) | short | ||
OOM Killer | Number of OOM Killer invocations | short | p3 | Avoid any change in this value |
| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
|Time Syncronized Drift|Estimated error (seconds); time offset between the local system and the reference clock; maximum error (seconds)|short| | |
|Time Syncronized Status|Whether the clock is synchronized with a reliable server; estimated error (seconds)|short| | |
|Time PLL Adjust|Phase-locked loop time adjustment|short| | |
|Time Misc|Seconds between clock ticks; International Atomic Time (TAI) offset|short| | |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Processes Status | Processes blocked: number of currently blocked tasks (/proc/stat procs_blocked); Processes in runnable state: number of currently running tasks (/proc/stat procs_running) | short | p3 | blocked: 10 |
Processes Forks | Processes forks second: number of processes created per second | short | ||
PIDS Number and Limit | Number of processes currently running on the host; maximum number of processes the host allows | short | p3/p2 | 15000/20000 |
Processes Memory | Virtual memory occupied by processes; maximum virtual memory available to processes | bytes | ||
Process schedule stats Running / Waiting | Time taken to start a process; time processes spend waiting for the CPU | ms | ||
Threads Number and Limit | Current total number of threads; maximum number of threads on the host | short | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Context Switches / Interrupts | Context switches: average number of CPU context switches (within 5 minutes); Interrupts: average total number of interrupts serviced (within 5 minutes) | short | ||
Interrupts Detail | List of the system's interrupts and the average count for each interrupt number (within 5 minutes), from /proc/interrupts | short | ||
Entropy | Entropy available to the random number generator | short | ||
File Descriptors | Maximum number of open file descriptors; number of open file descriptors | short | ||
Schedule timeslices executed by each cpu | Schedule timeslices executed by each CPU | short | ||
CPU time spent in user and system contexts | | short | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Hardware temperature monitor | Hardware temperature monitoring | Celsius(℃) | ||
Power supply | Whether it is powered | short | ||
Throttle cooling device | Cooling device status | short |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Systemd Sockets | sockets Total number of accepted connections | short | ||
Systemd Units State | inactive: inactive Systemd units; failed: failed Systemd units; deactivated: deactivated Systemd units; active: busy Systemd units; activated: Systemd units being activated | short | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Disk IOps Completed | Reads completed: number of reads completed per second on each disk partition; Writes completed: number of writes completed per second on each disk partition | I/O ops/sec (iops) | ||
Disk Average Wait Time | Read wait time avg: average read wait time per disk; Write wait time avg: average write wait time per disk | milliseconds(ms) | p3 | 1s |
Disk R/W Merged | Read merged: number of merged reads per second on each disk partition; Write merged: number of merged writes per second on each disk partition | I/O ops/sec (iops) | ||
Instantaneous Queue Size | Number of requests still unprocessed at sampling time. Incremented when a request is handed to the corresponding request_queue structure and decremented when the request completes | short | ||
Disk R/W Data | Read bytes: number of bytes read per second on each disk partition; Written bytes: number of bytes written per second on each disk partition | bytes/sec | ||
Average Queue Size | Average queue length of requests issued to the device | short | ||
Time Spent Doing I/Os | Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization of the device). For devices that serve requests serially, saturation occurs when the value is close to 100%. For devices that serve requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits | 0-1 | ||
Disk IOps Discards completed / merged | Discards completed: IOPS of completed discards; Discards merged: IOPS of merged discards | I/O ops/sec(iops) | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Filesystem space available | Available space, remaining space, and occupied space of mounted file systems | bytes | p3/p2 | 60%/80% |
File Descriptor | Maximum open file descriptors: maximum number of open file descriptors; Open file descriptors: number of open file descriptors | short | ||
Filesystem in ReadOnly / Error | ReadOnly: file systems mounted in read-only mode; Device error: number of device errors | short | p3 | |
File Nodes Free | Free file nodes: number of remaining inodes on mounted file systems | short | p3 | 60% |
File Nodes Size | File nodes total: total number of file nodes (inodes) on mounted file systems | short | ||
| Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
| --- | --- | --- | --- | --- |
|Network traffic by Packets|Receive: total number of packets received per second by each interface; Transmit: total number of packets sent per second by each interface|packets/sec| | |
|Network Traffic Drop|Receive drop: total number of received packets dropped per second by each interface; Transmit drop: total number of sent packets dropped per second by each interface|packets/sec|p3|100|
|Network Traffic Multicast|Receive multicast: number of multicast packets received per second by each interface|packets/sec| | |
|Network Traffic Frame|Receive frame: number of frames received per second by each interface|packets/sec| | |
|Network Traffic Colls|Transmit colls: number of collisions detected on each interface|short| | |
|ARP Entries|ARP entries: number of entries in the ARP table for each interface|short| | |
|Speed|Speed: maximum bandwidth of the network card|bytes| | |
|Softnet Packets|Processed: number of packets processed by each CPU; Dropped: number of packets dropped by each CPU| | | |
|Network Operational Status|Physical link state of each network card|short| | |
|Network Traffic Errors|Receive errors: total number of error packets received per second by each interface; Transmit errors: total number of error packets sent per second by each interface|packets/sec|p3|100|
|Network Traffic Compressed|Receive compressed: total number of compressed packets received per second by each interface; Transmit compressed: total number of compressed packets sent per second by each interface|packets/sec| | |
|Network traffic Fifo|Receive fifo: total number of FIFO packets received per second by each interface; Transmit fifo: total number of FIFO packets sent per second by each interface|packets/sec| | |
|Network Traffic Carrier|Statistic transmit_carrier: number of carrier losses detected by each interface|short| | |
|NF Contrack|NF conntrack entries: number of tracked connections; NF conntrack limit|short| | |
|MTU|Maximum packet size of each interface|bytes| | |
|Queue Length|Travel queue length for each structure|short| | |
|Softnet Out of Quota|CPU Backlog of Various |0-1| | |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Sockstat TCP | TCP_alloc: number of TCP sockets allocated (established, with sk_buff requested); TCP_inuse: number of TCP sockets in use (listening); TCP_mem: amount of memory used by TCP socket buffers; TCP_orphan: number of orphaned TCP connections (not belonging to any process); TCP_tw: number of TCP connections waiting to be closed | short | ||
Sockstat FRAG / RAW | FRAG_inuse: number of Frag sockets in use; FRAG_memory: Frag buffers in use; RAW_inuse: number of Raw sockets in use | short | ||
Sockstat Used | Sockets_used: total number of sockets in use across all protocols | short | ||
Sockstat UDP | UDPLITE_inuse: number of UDP-Lite sockets in use | short | ||
Sockstat Memory Size | TCP_mem_bytes: TCP socket buffer size in bytes; UDP_mem_bytes: UDP socket buffer size in bytes | bytes | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Netstat IP In / Out Octets | InOctets: number of octets received; OutOctets: number of octets sent | short | ||
ICMP In / Out | InMsgs: ICMP messages received, including all those counted by icmpInErrors; OutMsgs: ICMP messages the host attempted to send, including all those counted by icmpOutErrors | short | ||
UDP In / Out | InDatagrams: average number of UDP packets received (within 5 minutes); OutDatagrams: average number of UDP packets sent (within 5 minutes) | short | ||
TCP In / Out | InSegs: segments received, including those received in error; this count includes segments received on currently established connections. OutSegs: segments sent, including those on current connections but excluding those containing only retransmitted octets | short | ||
TCP Connections | CurrEstab: number of TCP connections whose current state is ESTABLISHED or CLOSE-WAIT | short | ||
TCP Direct Transition | ActiveOpens: number of TCP connections that made a direct transition from the CLOSED state to the SYN-SENT state; PassiveOpens: number of TCP connections that made a direct transition from the LISTEN state to the SYN-RCVD state | short | ||
Netstat IP Forwarding | Forwarding: number of IP packets forwarded | short | ||
ICMP Errors | InErrors: messages received and determined to have an ICMP-specific error (bad ICMP checksum, wrong length, etc.) | short | ||
UDP Errors | InCsumErrors: average number of UDP packets with checksum errors (within 5 minutes); InErrors: average number of UDP packets that could not be delivered to the application layer for reasons other than no listener on the local port (within 5 minutes); RcvbufErrors: average number of UDP packets that overflowed the receive buffer (within 5 minutes); SndbufErrors: average number of UDP packets that overflowed the send buffer (within 5 minutes); NoPorts: average number of UDP packets received for an unknown port (within 5 minutes) | short | p3 | 100 |
TCP Errors | ListenOverflows: number of times a socket's listen queue overflowed; ListenDrops: number of SYNs to LISTEN sockets that were ignored; TCPSynRetrans: SYN and SYN/ACK retransmissions, used to break retransmissions down into SYN, fast, and timeout retransmissions; RetransSegs: number of segments retransmitted, that is, transmitted TCP segments containing one or more previously transmitted octets; InErrs: segments received in error (for example, bad TCP checksum); OutRsts: segments sent with the RST flag | short | p3 | 100 |
TCP SyncCookie | SyncookiesFailed: number of invalid SYN cookies received; SyncookiesRecv: number of SYN cookies received; SyncookiesSent: number of SYN cookies sent | short | ||
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Node Exporter Scrape Time | Duration of individual collectors | seconds | ||
Node Exporter Scrape | Normal working number of each collector | short |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Host five minutes load | Five-minute load of all selected hosts | short | ||
Host Memory Percentage | Memory usage percentage of all selected hosts | 0-1 | ||
CPU Busy Percent | CPU busy percentage | 0-1 | ||
Disk I/O Usage | Disk I/O usage | 0-1 | ||
Remaining space utilization | Remaining space utilization of the selected hosts | 0-1 | ||
Send network traffic | Network traffic sent by the selected hosts | bit | ||
Receive network traffic | Network traffic received by the selected hosts | bit | ||
SWAP Usage | SWAP usage of the selected hosts | 0-1 |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
net dev | Net device status | short | ||
softnet_stat | Displays the percentage of memory usage selected by all hosts | short | ||
hardirq_cpu | CPU hardware interrupts | short | ||
hardirq_cpu_pie | CPU hardware interrupt number pie chart | short | ||
hardirq_quene | Number of hard interrupts for each device | short | ||
hardirq_quene_pie | Pie chart of hard interrupts for each device | short | ||
softirq_rx | Number of data reception software interrupts | short | ||
softirq_rx_pie | Number of data reception software interrupts pie chart | short | ||
softirq_tx | Number of data transmission software interrupts | short | ||
softirq_tx_pie | Number of data transmission software interrupts pie chart | short | ||
ip | IP network layer protocol transmission and reception situation | short | ||
udp | UDP network protocol transmission and reception situation | short |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
license Expiration time | Time remaining until the LICENSE expires | seconds(s) | p3/p2 | Less than 15 days remaining: alarm p3. Less than 7 days remaining: alarm p2. Contact YMatrix in time to replace the LICENSE |
Missing partition policy range table | Range-partitioned tables that have no APM partition policy configured | short | p2 | Must be handled promptly; otherwise data is written to the default partition, which affects performance |
Range partition table creation number | Number of range-partitioned tables whose new partitions are delayed in being created | short | p2 | Must be handled promptly; otherwise data is written to the default partition, which affects performance |
mars table maximum runs | MARS2 internal indicator | short | p3/p2 | Above 1500: alarm p3, and watch whether the value is still rising. Above 1800: alarm p2. When the value reaches 2039, writes become slow or even impossible |
Maximum block_items value | mxgate number of instantaneous batches written | short | ||
YMatrix Total number of processes | Number of postgres-related processes on the selected host | short | p2 | Too many processes can cause insufficient memory; configure as needed |
Repeat index number | Number of duplicate indexes; consider deleting indexes that are not needed | short | p3 | |
matrixgate connections | mxgate total number of connections to process | short | ||
24-hour data total change value | The last 24-hour data total change | bytes | ||
Top10 Number of subpartitions | Top 10 tables by number of subpartitions. Configure as needed to avoid too many subtables, which affect query performance and use more memory | bytes | ||
Top10 Mode Size | Top 10 schemas by total size | bytes | ||
Top10 System table size | Top 10 system tables by total size | bytes | ||
Top10 Default partition table size | Top 10 default partitions by size | bytes | p3 | An oversized default partition requires an alarm. Normally, no data should be in the default partition |
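
The license indicator is reported in seconds while its thresholds are given in days, and the MARS2 runs indicator has two escalating levels. The Python sketch below shows both checks; the function names are hypothetical, and the comparisons follow the wording of the table ("less than" for the license, "over" for runs).

```python
from typing import Optional

SECONDS_PER_DAY = 24 * 60 * 60


def license_alarm(remaining_seconds: float) -> Optional[str]:
    """p2 below 7 days remaining, p3 below 15 days remaining."""
    if remaining_seconds < 7 * SECONDS_PER_DAY:
        return "p2"
    if remaining_seconds < 15 * SECONDS_PER_DAY:
        return "p3"
    return None


def mars_runs_alarm(max_runs: int) -> Optional[str]:
    """p3 above 1500 runs, p2 above 1800 runs."""
    if max_runs > 1800:
        return "p2"
    if max_runs > 1500:
        return "p3"
    return None
```
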
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
mars2 table maximum runs Details | MARS2 table runs Trend Chart | short | ||
Database connection details | Group by database, client address, application_name | short | ||
24-hour database space changes | 24-hour database size changes | short | ||
Total time-consuming query | Total time spent in each stage of database queries | milliseconds (ms) | p3 | Configure as needed; pay attention when the total time changes significantly |
Host YMatrix Process Trend Chart | Total Number of Host Postgres Process Trend Chart | short |
Indicator Name | Description | Unit | Level | Reference Alarm Threshold |
---|---|---|---|---|
Table expansion details | Lists tables whose ratio of dead tuples to live tuples is greater than 1.1 | short | ||
Top 100 Process RSS Details | Top 100 postgres processes by memory usage, sorted by RSS | short | ||
Slow query monitoring | Slow SQL statements executed in the database | none | p3 | |
Total time-consuming query monitoring | Total execution time of SQL statements | milliseconds (ms) | ||
Time-consuming statistics graph (seconds) | Total SQL execution time within each five-minute interval | milliseconds (ms) | ||
Long transaction indicators | Details of long transactions on the Master/Segments | none | p3 | |
Lock waiting information | Details of database lock waits at the moment the information was collected | none | p3 | Configure as needed; for example, alert on lock waits that last more than 10 minutes |