MatrixUI Health Monitoring

This document describes the cluster health monitoring feature of the graphical interface.

In daily operation, a YMatrix database executes a large volume of SQL statements, and problems such as hardware failures (for example, network outages) or lock contention between concurrent transactions can occur. Left unaddressed, these problems slow client responses or cause outright errors, which hurts business efficiency. The graphical health monitoring feature helps you quickly identify abnormal behavior in the database cluster.

Health monitoring periodically queries the relevant system catalog tables for each detection item and evaluates whether query execution states match what the business expects. When it detects an unexpected state, it sends a notification immediately. Notifications can be viewed in the graphical interface. If checking the web page is inconvenient, you can also configure email notifications to receive alerts sooner.

1 Prerequisites

In your browser, enter the IP address of the machine where MatrixGate is running (by default, the Master host's IP) followed by the port number to log in to the graphical interface.

http://<IP>:8240

2 Basic Configuration

After logging in successfully, navigate to the Health Monitoring page.

2.1 Email Configuration

You may choose whether to configure email settings based on your needs. Once configured, you will receive alert notifications via email.

  1. Graphical Interface Domain Name
    Each alert email includes a link back to the graphical interface so that recipients can quickly open the detailed alert information. If recipients cannot reach the default domain, change this field accordingly.

  2. SMTP Server Address
    The SMTP server address consists of a host (domain name or IP address) and a port number. Example: smtp.example.com:465.

Common third-party email service addresses:

Note!
If the email service is self-hosted, consult your email administrator or service provider.

  3. Username
    The account used to authenticate to the SMTP server. This field is optional; fill it in only when the SMTP server requires username-based authentication. The value is typically a full email address.

  4. Password
    The password for the SMTP user. This field is optional; fill it in only when the SMTP server requires both a username and a password for authentication.

  • Common third-party email services:
    • Alibaba Cloud Mail: Use the login password of the mailbox.
    • Google Mail: Use the login password of the mailbox.
    • NetEase Mail:
      • Personal Edition: Use an authorization code instead of the login password; see documentation.
      • Enterprise Edition: Default is login password. If the administrator has enabled client authorization codes, contact them for details.
    • QQ Mail:
      • Personal Edition: Use an authorization code; see documentation.
      • Enterprise Edition: Default is login password. If secure login is enabled, use an authorization code; see documentation.

Note!
For self-hosted email services, consult your email administrator or service provider.

  5. Sender
    For third-party email services, this field should match the "Username".
    For self-hosted services, enter the sender email address.

  6. Recipients
    Enter one or more recipient email addresses.
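
Before saving the form, it can help to confirm the SMTP values outside the graphical interface. The following is a minimal sketch using Python's standard `smtplib`, not the mechanism MatrixUI itself uses. The function names are hypothetical, and implicit TLS (`SMTP_SSL`) is an assumption based on the example port 465; a server on another port may need a different connection mode.

```python
import smtplib
from email.message import EmailMessage


def parse_smtp_address(address: str) -> tuple[str, int]:
    """Split a 'host:port' string such as 'smtp.example.com:465'."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return host, int(port)


def build_test_message(sender: str, recipients: list[str]) -> EmailMessage:
    """Build a plain-text test message addressed to every recipient."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "SMTP settings test"
    msg.set_content("If you can read this, the SMTP settings work.")
    return msg


def send_test_message(address, username, password, sender, recipients):
    """Send the test message over implicit TLS (assumed because of port 465)."""
    host, port = parse_smtp_address(address)
    with smtplib.SMTP_SSL(host, port, timeout=10) as server:
        if username:  # username and password are optional in the form
            server.login(username, password)
        server.send_message(build_test_message(sender, recipients))
```

Calling `send_test_message` with the same values you plan to enter in the form should deliver one message to each recipient; an authentication error here usually means the account needs an authorization code rather than its login password.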

2.2 Email Notifications

If you have configured email settings, you will receive an alert email whenever an event meets the failure condition of any detection item.

2.3 Event History

Regardless of whether email notifications are configured, you can view historical records of events that met detection failure conditions under Event History.

3 Monitoring Items

The following is a list of monitoring items provided by YMatrix:

| No. | Monitoring Item | Description |
| --- | --- | --- |
| 1 | Cluster Unavailable | Periodically runs the query `SELECT * FROM gp_dist_random('gp_id');` to check cluster availability. If this query fails three times consecutively, the cluster is likely down. Possible causes include a primary Segment and its mirror Segment failing simultaneously, network failure, power outage, or hardware failure. |
| 2 | Segment Failure | A failed primary Segment causes resource skew on the corresponding mirror Segment host. The mirror Segment’s host experiences increased load, slowing queries. In severe cases, memory exhaustion on the skewed node may render the cluster unavailable. A failed mirror Segment reduces high availability. If the corresponding primary Segment then fails, the cluster becomes unavailable. |
| 3 | Query/Transaction Running Over 12 Hours | Long-running queries or transactions consume excessive memory and CPU resources, degrading database response performance and potentially triggering OOM (out-of-memory). They may also delay VACUUM processes. |
| 4 | Transaction Idle in Transaction for Over 1 Hour | A transaction remaining idle in transaction state for a long time blocks most queries involving its tables and prevents VACUUM from reclaiming dead rows, leading to table bloat. |
| 5 | A Single Query/Transaction Blocks More Than 5 Others for Over 15 Minutes | If a query or transaction blocks many others for a prolonged period, it may trigger cascading blockages among other statements, reducing service responsiveness. |
| 6 | Query Requesting Exclusive or AccessExclusive Lock Blocked for Over 15 Minutes | A query requesting an Exclusive or AccessExclusive table-level lock, if blocked for a long duration, may cause a backlog of blocked queries, affecting response efficiency. |
| 7 | Query/Transaction Holding Exclusive or AccessExclusive Lock for Over 2 Hours | A query or transaction holding an Exclusive or AccessExclusive table-level lock for a long time blocks all queries accessing the locked table, impacting service responsiveness. |
| 8 | Transaction Holding Exclusive or AccessExclusive Lock in Idle-in-Transaction State for Over 15 Minutes | A transaction holding an Exclusive or AccessExclusive lock while idle in transaction for 15 minutes blocks most queries on related tables, affecting service responsiveness. |
| 9 | Disk | You can quickly enable or disable disk monitoring options including: “Disk Full”, “Disk Space Below 20%”, “Disk Will Be Exhausted Within 7 Days”, and “Abnormal Disk Growth in Last 24 Hours”. Click the “Edit” button to adjust thresholds according to business needs. |
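
The "three consecutive failures" rule of the Cluster Unavailable item can be illustrated with a small counter. This is a sketch of the general technique only; YMatrix's internal implementation is not described here, and the class name, the `probe` callable, and the reset-on-success behavior are assumptions. In practice `probe` would run the availability query and return whether it succeeded.

```python
from typing import Callable


class ConsecutiveFailureDetector:
    """Report a failure only after `threshold` failed probes in a row."""

    def __init__(self, probe: Callable[[], bool], threshold: int = 3) -> None:
        self.probe = probe          # hypothetical: returns True if the query succeeded
        self.threshold = threshold  # 3 matches the rule described above
        self.failures = 0

    def check(self) -> bool:
        """Run the probe once; return True when the failure streak reaches the threshold."""
        try:
            ok = self.probe()
        except Exception:
            ok = False  # a raised error counts as a failed probe
        # A single success resets the streak (an assumption of this sketch).
        self.failures = 0 if ok else self.failures + 1
        return self.failures >= self.threshold
```

Requiring a streak rather than a single failure keeps a transient network hiccup from raising an alert, at the cost of detecting a real outage a few check intervals later.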

All items are enabled by default but can be toggled as needed.

If the default parameters do not suit your business scenario, you can edit them.
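
To make the effect of editing a parameter concrete, the sketch below models two of the time-based items (running over 12 hours, idle in transaction for over 1 hour) as adjustable thresholds applied to a list of sessions. The `Thresholds` and `Session` types and the `violations` function are hypothetical; the sketch also assumes that "running" is measured from the start of the transaction. The state string follows the `state` values reported by PostgreSQL's `pg_stat_activity` view.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Thresholds:
    # Defaults mirror the values listed in the monitoring items.
    long_running: timedelta = timedelta(hours=12)
    idle_in_transaction: timedelta = timedelta(hours=1)


@dataclass(frozen=True)
class Session:
    pid: int
    state: str             # e.g. "active" or "idle in transaction"
    xact_age: timedelta    # time since the transaction started
    state_age: timedelta   # time spent in the current state


def violations(sessions, thresholds=Thresholds()):
    """Return (pid, reason) pairs for sessions that exceed a threshold."""
    found = []
    for s in sessions:
        if s.xact_age > thresholds.long_running:
            found.append((s.pid, "running too long"))
        if s.state == "idle in transaction" and s.state_age > thresholds.idle_in_transaction:
            found.append((s.pid, "idle in transaction too long"))
    return found
```

Passing, for example, `Thresholds(idle_in_transaction=timedelta(hours=2))` corresponds to relaxing that item's parameter: a session that has been idle in transaction for 90 minutes is reported under the default but not under the edited value.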

More
For Grafana alert configuration, refer to Grafana Cluster Alerts.