MatrixUI Health Monitoring

This document introduces the cluster health monitoring functionality of the graphical user interface.

The YMatrix database executes a large number of SQL statements to support daily operations. Along the way it may run into hardware problems such as network failures, or lock waits caused by concurrent transactions. If not addressed promptly, these problems can slow client responses or cause outright errors, reducing operational efficiency. The health monitoring functionality of the graphical interface helps you quickly identify abnormal behavior in the database cluster.

Health monitoring regularly checks the relevant database system tables for each monitoring item to verify whether query execution states match business expectations. If a deviation from the expected state is detected, an alert notification is sent immediately. You can view these notifications in the graphical interface. If checking the page constantly is inconvenient, you can enable email notifications to receive alerts more promptly.

1 Preparation

In a browser, enter the IP address of the machine where MatrixGate is located (by default, the Master's IP) followed by the port number to log in to the graphical interface.

http://<IP>:8240

2 Basic Configuration

After successfully logging in, enter the Health Monitoring page.

2.1 Email Configuration

Email configuration is optional. If you complete it, you will receive alert notifications by email.

Graphical Interface Domain Name
Each email includes a link to the graphical interface so that you can quickly view detailed alert information. If the email recipient cannot access the default domain name, you must modify this field.
SMTP Server Address
The SMTP server address consists of a host (domain name or IP address) and a port number. Example: smtp.example.com:465.

Common Third-Party Email Servers

Alibaba Cloud (Aliyun) Email service address notes:
Personal Edition: First enable the SMTP service; refer to the documentation. For the SMTP service address and port number, refer to the documentation.
Enterprise Edition: The email administrator must enable the SMTP service; refer to the documentation. For the SMTP service address and port number, refer to the documentation.
Google Email service address notes:
First enable the IMAP or POP service; refer to the documentation.
NetEase Email service address notes:
Personal Edition: First enable the SMTP service; refer to the [documentation](https://help.mail.163.com/faqDetail.do?code=d7a5dc8471cd0c0e8b4b8f4f8e49998b374173cfe9171305fa1ce630d7f67ac2cda80145a1742516).
Enterprise Edition: The SMTP service is enabled by default. To verify the service status, refer to the documentation. For the SMTP service address and port number, refer to the documentation.
QQ Email service address notes:
Personal Edition: First enable the SMTP service; refer to the documentation. For the SMTP service address and port number, refer to the documentation.
Enterprise Edition: For the steps to enable the SMTP service, refer to the documentation. For the SMTP service address and port number, refer to the documentation.

Notes!
If your organization hosts its own email service, consult the email administrator or email service provider.
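
The host:port form of the SMTP server address described above can be checked before it is entered into the form. The following is a minimal Python sketch, not part of MatrixUI; the function name `split_smtp_address` is hypothetical and the validation rules (non-empty host, numeric port between 1 and 65535) are assumptions.

```python
from typing import Tuple


def split_smtp_address(address: str) -> Tuple[str, int]:
    """Split an address of the form 'host:port' into (host, port).

    Raises ValueError if the host or port is missing or the port is invalid.
    """
    # Split on the last colon so that the port is always the final component.
    host, sep, port_text = address.strip().rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected 'host:port', got {address!r}")
    if not port_text.isdigit():
        raise ValueError(f"port must be a number, got {port_text!r}")
    port = int(port_text)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


if __name__ == "__main__":
    print(split_smtp_address("smtp.example.com:465"))
```

A value without a port, such as `smtp.example.com`, raises `ValueError` in this sketch, which mirrors the requirement that the field contain both parts.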

Username
The account used to authenticate with the SMTP server. This field is optional and is only needed when the SMTP server requires a username for authentication. The username is typically the full email address of the sending account.
Password
The password for the SMTP username. This field is optional and only required when the SMTP server requires both a username and password for authentication.

Common Third-Party Email Servers

Alibaba Cloud Email:
Use the email login password, that is, the password of the email account entered as the username.
Google Email:
Use the email login password, that is, the password of the email account entered as the username.
NetEase Email:
Personal Edition: An authorization code must be used as the password; refer to the documentation.
Enterprise Edition: By default, the password is the email login password. If the administrator has enabled the client authorization code feature, ask the administrator how to obtain an authorization code.
QQ Email:
Personal Edition: An authorization code must be used as the password; refer to the documentation.
Enterprise Edition: By default, the password is the email login password. If the administrator has enabled secure login, an authorization code is required; refer to the documentation.

Notes!
If your organization hosts its own email service, consult the email administrator or email service provider.

Sender
If you use a third-party email service, this field should match the Username field. If you use a self-hosted email service, enter the sender's email address.
Recipient
Enter the recipient's email address; multiple addresses can be entered.
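
Before entering the fields above, it can help to confirm that the SMTP settings work outside the graphical interface. The following is a minimal Python sketch using the standard `smtplib` module, not MatrixUI's own sending logic. The function names are hypothetical, and it assumes the server accepts implicit TLS connections (as is common on port 465); a server that expects STARTTLS or plain SMTP would need a different connection class.

```python
import smtplib
import ssl
from email.message import EmailMessage
from typing import List, Optional


def build_test_message(sender: str, recipients: List[str]) -> EmailMessage:
    """Build a short test email addressed to one or more recipients."""
    msg = EmailMessage()
    msg["Subject"] = "SMTP settings check"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("Test message sent to verify SMTP settings.")
    return msg


def send_test_message(
    host: str,
    port: int,
    sender: str,
    recipients: List[str],
    username: Optional[str] = None,
    password: Optional[str] = None,
) -> None:
    """Send the test email; raises an smtplib or OS error on failure."""
    msg = build_test_message(sender, recipients)
    context = ssl.create_default_context()
    # Assumption: the server speaks implicit TLS on this port.
    with smtplib.SMTP_SSL(host, port, context=context, timeout=10) as server:
        # Username and password are optional, as in the form above.
        if username and password:
            server.login(username, password)
        server.send_message(msg)
```

Calling `send_test_message("smtp.example.com", 465, sender, recipients, username, password)` with your own values either delivers the test email or raises an exception that indicates which setting to revisit. For NetEase and QQ personal mailboxes, the password argument would be the authorization code described above.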

2.2 Email Notifications

If you have configured email, you will receive an email whenever an event meets the failure conditions of a monitoring item.

2.3 Event History

Regardless of whether you have configured email, the “Event History” section shows records of cluster events that met the failure conditions of a monitoring item.

3 Monitoring Items

The monitoring items provided by YMatrix are listed below:

No.	Monitoring item	Description
1	Cluster unavailable	Periodically run the query SELECT * FROM gp_dist_random('gp_id'); to verify cluster availability. If this query fails three consecutive times, the cluster is most likely down, possibly due to simultaneous failure of the primary and mirror segments, network outages, power failures, or hardware issues.
2	Segment failure	When a primary segment fails, the corresponding mirror segment’s host becomes resource-skewed, its load increases, and query latency rises; in severe cases the skewed node may exhaust memory and render the cluster unavailable. When a mirror segment fails, the cluster’s high availability is reduced; if the corresponding primary segment subsequently fails, the cluster becomes unavailable.
3	Query/transaction duration exceeds 12 hours	Long-running queries/transactions can monopolize large amounts of memory and CPU, slowing database responsiveness and potentially triggering OOM (out-of-memory) conditions; they may also delay the VACUUM process.
4	Transaction idle in transaction for more than 1 hour	A transaction remaining idle in transaction for an extended period blocks most queries on its affected tables and prevents VACUUM from reclaiming dead tuples, causing table bloat.
5	Single query/transaction blocks more than 5 other queries for over 15 minutes	When one query/transaction blocks many others for a prolonged time, cascading waits can occur, severely degrading service responsiveness.
6	Query requesting Exclusive or AccessExclusive lock blocked for more than 15 minutes	A query waiting longer than 15 minutes for an Exclusive or AccessExclusive table-level lock can create a backlog of blocked queries, hurting overall responsiveness.
7	Query/transaction holding Exclusive or AccessExclusive lock for more than 2 hours	A query/transaction that holds an Exclusive or AccessExclusive table-level lock for an extended period blocks every query that touches the locked table, degrading responsiveness.
8	Transaction holding Exclusive or AccessExclusive lock and idle in transaction for more than 15 minutes	A transaction that holds an Exclusive or AccessExclusive table-level lock and remains idle in transaction for 15 minutes blocks most queries on the affected tables, reducing responsiveness.
9	Disk	Quickly enable/disable monitoring checks including “disk full,” “disk space below 20%,” “disk will exhaust within 7 days,” and “abnormal growth in the past 24 hours.” Click “Edit” to tailor thresholds to business needs.
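
Item 1 in the table above raises an alert only after the availability query fails three times in a row. The following is a minimal Python sketch of that rule, not MatrixUI's implementation. The `probe` callable is hypothetical and stands in for running the query and returning True on success; resetting the counter after a successful run is an assumption drawn from the word "consecutive".

```python
from typing import Callable


class ConsecutiveFailureCheck:
    """Signal an alert after `threshold` consecutive probe failures."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe          # hypothetical stand-in for the SQL check
        self.threshold = threshold  # 3, per item 1 in the table
        self.failures = 0

    def run_once(self) -> bool:
        """Run the probe once; return True when the alert condition is met."""
        try:
            ok = self.probe()
        except Exception:
            # A probe that raises (for example, a connection error) is a failure.
            ok = False
        # Any success resets the streak; a failure extends it.
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold
```

Requiring several consecutive failures means that a single failed check, for example one caused by a brief network interruption, does not raise a cluster-unavailable alert on its own.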

Monitoring items are enabled by default and can be turned on or off as needed.

If the default parameters of a monitoring item do not meet your business needs, you can edit them.
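
To illustrate how editable thresholds interact with the checks, the following minimal Python sketch applies the default limits of items 3 and 4 (12 hours and 1 hour) to a list of sessions. It is an illustration under stated assumptions, not MatrixUI's implementation: the `Thresholds` and `Session` types and the rule names are hypothetical, and the session fields are modeled on the `pid`, `state`, `xact_start`, and `state_change` columns of PostgreSQL's `pg_stat_activity` view, which this document does not name as the actual source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Thresholds:
    # Defaults taken from items 3 and 4 in the table; both can be overridden.
    max_transaction_age: timedelta = timedelta(hours=12)
    max_idle_in_transaction: timedelta = timedelta(hours=1)


@dataclass
class Session:
    pid: int
    state: str               # e.g. "active" or "idle in transaction"
    xact_start: datetime     # when the current transaction began
    state_change: datetime   # when the session entered its current state


def find_violations(
    sessions: List[Session],
    now: datetime,
    limits: Optional[Thresholds] = None,
) -> List[Tuple[int, str]]:
    """Return (pid, rule) pairs for sessions that exceed a threshold."""
    limits = limits or Thresholds()
    found: List[Tuple[int, str]] = []
    for s in sessions:
        if now - s.xact_start > limits.max_transaction_age:
            found.append((s.pid, "long_transaction"))
        if (s.state == "idle in transaction"
                and now - s.state_change > limits.max_idle_in_transaction):
            found.append((s.pid, "idle_in_transaction"))
    return found
```

Passing a custom `Thresholds` value, such as `Thresholds(max_idle_in_transaction=timedelta(hours=2))`, plays the role of editing a monitoring item's parameters in this sketch.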

More
To configure and use Grafana alerts, refer to Grafana Cluster Alerts.