Health Monitoring

This document describes the cluster health monitoring feature of the graphical interface.

In daily operation, a YMatrix database executes a large volume of SQL statements. Along the way, problems such as hardware failures (for example, network outages) or lock contention between concurrent transactions can occur. If not addressed promptly, they can slow client responses or cause outright errors, reducing business efficiency. The graphical health monitoring feature helps you quickly identify such abnormal behavior in the database cluster.

Health monitoring periodically queries the relevant system catalog tables for each detection item and verifies whether query execution meets business expectations. When it detects a deviation, it immediately generates an alert. Alerts can be viewed in the graphical interface; if checking the web page is inconvenient, you can also choose to receive alerts by email for more timely notification.
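The check-and-alert cycle described above can be pictured with the following minimal Python sketch. It is not YMatrix's implementation: the MonitoringItem and Alert types, the run_cycle function, and the notify callback are hypothetical names, and each check is assumed to return None when the item is healthy or a short message describing the deviation. In a real deployment, a scheduler would call run_cycle at a fixed interval.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class MonitoringItem:
    """Hypothetical monitoring item: a name plus a check function."""
    name: str
    # Returns None when the item is healthy, otherwise a description of the deviation.
    check: Callable[[], Optional[str]]
    enabled: bool = True


@dataclass
class Alert:
    item: str
    message: str
    raised_at: datetime


def run_cycle(items: List[MonitoringItem],
              history: List[Alert],
              notify: Optional[Callable[[Alert], None]] = None) -> List[Alert]:
    """Run every enabled check once and return the alerts raised in this cycle."""
    raised = []
    for item in items:
        if not item.enabled:          # disabled items are skipped entirely
            continue
        problem = item.check()
        if problem is None:           # behavior matches expectations
            continue
        alert = Alert(item.name, problem, datetime.now(timezone.utc))
        history.append(alert)         # always recorded, so it is viewable in the UI
        if notify is not None:        # email delivery is optional
            notify(alert)
        raised.append(alert)
    return raised


# Usage: one healthy item, one failing item, and one disabled item.
history: List[Alert] = []
items = [
    MonitoringItem("Cluster Unavailable", lambda: None),
    MonitoringItem("Segment Failure", lambda: "primary Segment 3 is down"),
    MonitoringItem("Long-running query", lambda: "pid 4242", enabled=False),
]
raised = run_cycle(items, history, notify=lambda a: print(f"[{a.item}] {a.message}"))
```

Keeping the history append separate from the optional notify call mirrors the behavior described in this document: alerts are always recorded for the graphical interface, while email is an extra delivery channel.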

1 Prerequisites

To log in to the graphical interface, enter the IP address and port of the machine where MatrixGate runs (by default, the Master host) in your browser:

http://<IP>:8240  

2 Health Monitoring

After successful login, navigate to the Health Monitoring page.

2.1 Email Configuration

You may choose whether to configure an email server based on your needs. Once configured, you will receive alert notifications via email.

  1. Graphical Interface Domain Name
    To facilitate quick access to alert details, the email includes a link that redirects to the graphical interface. If recipients cannot access the default domain, modify this field accordingly.

  2. SMTP Server Address
    The SMTP server address consists of a host name or IP address and a port number, for example smtp.example.com:465.

SMTP settings for common third-party email services:

  1. Alibaba Cloud Mail

  2. Google Mail (Gmail)
    Enable IMAP or POP access first; see the provider's documentation.

  3. NetEase Mail

    • Personal Edition: Enable the SMTP service first; see the provider's documentation.
    • Enterprise Edition: The SMTP service is enabled by default. To verify its status, see the provider's documentation.
      For the SMTP server address and port, see the provider's documentation.
  4. QQ Mail

Note!
If the email service is self-hosted, consult your email administrator or service provider.

  3. Username
    The account used to authenticate with the SMTP server. Optional; required only if the SMTP server requires username authentication. Example: a full email address.

  4. Password
    The password for the SMTP username. Optional; required only if the SMTP server requires password authentication.

  • Common Third-Party Email Services
    1. Alibaba Cloud Mail: Use the mailbox login password.
    2. Google Mail: Use the mailbox login password.
    3. NetEase Mail:
      • Personal Edition: Use an authorization code as the password; see the provider's documentation.
      • Enterprise Edition: The login password is used by default. If the administrator has enabled client authorization codes, contact the administrator for details.
    4. QQ Mail:
      • Personal Edition: Use an authorization code as the password; see the provider's documentation.
      • Enterprise Edition: The login password is used by default. If secure login is enabled, use an authorization code; see the provider's documentation.

Note!
If the email service is self-hosted, consult your email administrator or service provider.

  5. Sender
    For third-party email services, this field should match the Username.
    For self-hosted email services, enter the sender's email address.

  6. Recipients
    Enter one or more recipient email addresses.
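The SMTP Server Address, Username, and Password fields map naturally onto Python's standard smtplib module. The following is a minimal sketch of how such a configuration might be used to open a connection; the function names are hypothetical, IPv6 literals are not handled, and choosing implicit TLS on port 465 versus STARTTLS on other ports follows common convention rather than anything this document states about YMatrix.

```python
import smtplib
from typing import Optional, Tuple


def parse_smtp_address(address: str) -> Tuple[str, int]:
    """Split 'host:port' (host name or IPv4 address) into its parts."""
    host, sep, port_text = address.strip().rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {address!r}")
    port = int(port_text)  # raises ValueError for a non-numeric port
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return host, port


def open_smtp(address: str, username: Optional[str] = None,
              password: Optional[str] = None,
              timeout: float = 10.0) -> smtplib.SMTP:
    """Connect to the SMTP server, encrypting the session where possible."""
    host, port = parse_smtp_address(address)
    if port == 465:
        # Port 465 conventionally uses implicit TLS from the first byte.
        conn = smtplib.SMTP_SSL(host, port, timeout=timeout)
    else:
        conn = smtplib.SMTP(host, port, timeout=timeout)
        conn.ehlo()
        if conn.has_extn("starttls"):
            conn.starttls()
            conn.ehlo()  # re-identify over the encrypted channel
    # Username and Password are optional; log in only when a username is set.
    if username:
        conn.login(username, password or "")
    return conn
```

For example, parse_smtp_address("smtp.example.com:465") yields ("smtp.example.com", 465), and open_smtp would then use an implicit-TLS connection.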

2.2 Monitoring Items

The list shows all monitoring items currently provided by YMatrix. All items are enabled by default. You can disable or enable them as needed.

If the default parameters do not suit your business scenario, you can modify them.
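To make the idea of adjustable parameters concrete, here is a minimal Python sketch for monitoring items 3 and 4 (queries or transactions running over 12 hours, and transactions idle in transaction for over 1 hour), assuming each check evaluates a snapshot of session activity. The Thresholds, Session, and find_violations names are hypothetical; the xact_start and state_change fields are borrowed from PostgreSQL's pg_stat_activity view, and whether YMatrix measures durations exactly this way is an assumption.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical parameters; defaults mirror monitoring items 3 and 4."""
    long_running: timedelta = timedelta(hours=12)
    idle_in_transaction: timedelta = timedelta(hours=1)


@dataclass
class Session:
    """A snapshot of one backend, loosely modeled on pg_stat_activity columns."""
    pid: int
    state: str
    xact_start: Optional[datetime]    # start of the current transaction, if any
    state_change: Optional[datetime]  # when `state` last changed


def find_violations(sessions: List[Session], now: datetime,
                    limits: Thresholds = Thresholds()) -> List[Tuple[int, str]]:
    """Return (pid, rule) pairs for sessions that exceed a threshold."""
    violations = []
    for s in sessions:
        if s.xact_start is not None and now - s.xact_start > limits.long_running:
            violations.append((s.pid, "long_running"))
        if (s.state == "idle in transaction" and s.state_change is not None
                and now - s.state_change > limits.idle_in_transaction):
            violations.append((s.pid, "idle_in_transaction"))
    return violations


now = datetime(2024, 1, 1, 12, 0)
sessions = [
    Session(101, "active", now - timedelta(hours=13), now - timedelta(hours=13)),
    Session(102, "idle in transaction", now - timedelta(minutes=90), now - timedelta(minutes=70)),
    Session(103, "active", now - timedelta(minutes=5), now - timedelta(minutes=5)),
]
print(find_violations(sessions, now))  # [(101, 'long_running'), (102, 'idle_in_transaction')]

# Relax the idle threshold to 2 hours, e.g. for long interactive sessions.
relaxed = replace(Thresholds(), idle_in_transaction=timedelta(hours=2))
print(find_violations(sessions, now, relaxed))  # [(101, 'long_running')]
```

Using a frozen dataclass with dataclasses.replace keeps the defaults intact while letting each deployment override only the values that do not fit its workload.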

  1. Cluster Unavailable
    Periodically runs SELECT * FROM gp_dist_random('gp_id'); to verify cluster availability. If this query fails three times in a row, the cluster is likely down. Possible causes include failure of both a primary Segment and its mirror Segment, network issues, power failure, or hardware faults.

  2. Segment Failure
    A failed primary Segment causes resource skew on the corresponding mirror Segment: the mirror Segment's host carries extra load, which slows queries. In severe cases, memory may be exhausted and the cluster may become unavailable.
    A failed mirror Segment reduces high availability. If the corresponding primary Segment also fails, the cluster becomes unavailable.

  3. Query/Transaction Running Over 12 Hours
    Long-running queries or transactions consume substantial memory and CPU, slowing database response and potentially causing OOM (out-of-memory) errors. They can also delay VACUUM.

  4. Transaction in "idle in transaction" State Over 1 Hour
    A transaction left in the "idle in transaction" state for a long time blocks most queries on the tables it touches and prevents VACUUM from reclaiming dead rows, causing table bloat.

  5. Single Query/Transaction Blocking More Than 5 Others for Over 15 Minutes
    When one query or transaction blocks many others for a long time, the blocking can cascade and reduce service responsiveness.

  6. Query Requesting an Exclusive or AccessExclusive Lock Blocked Over 15 Minutes
    A query that waits a long time for an Exclusive or AccessExclusive table lock can cause a backlog of blocked queries queued behind it, slowing responses.

  7. Query/Transaction Holding an Exclusive or AccessExclusive Lock for Over 2 Hours
    A query or transaction that holds an Exclusive or AccessExclusive table lock for a long time blocks other queries on the locked table (an AccessExclusive lock blocks even plain SELECT statements), degrading performance.

  8. Transaction Holding an Exclusive or AccessExclusive Lock in "idle in transaction" State Over 15 Minutes
    A transaction that holds an Exclusive or AccessExclusive lock while in the "idle in transaction" state for over 15 minutes blocks most queries on the affected tables, slowing responses.
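The consecutive-failure rule for Cluster Unavailable can be sketched in Python as follows. AvailabilityProbe and its probe callable are hypothetical; in practice the probe would run SELECT * FROM gp_dist_random('gp_id'); through a database driver and raise on error, which this sketch replaces with a simulated sequence of results. Requiring several failures in a row avoids alerting on a single transient error, such as one dropped connection.

```python
from typing import Callable


class AvailabilityProbe:
    """Alert only after `threshold` consecutive probe failures (3 by default)."""

    def __init__(self, probe: Callable[[], None], threshold: int = 3) -> None:
        self.probe = probe
        self.threshold = threshold
        self.consecutive_failures = 0

    def tick(self) -> bool:
        """Run the probe once; return True when the failure streak reaches the threshold."""
        try:
            self.probe()
        except Exception:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0  # any success resets the streak
        # Comparing with == fires once per outage rather than on every later failure.
        return self.consecutive_failures == self.threshold


# Simulated probe results: True = query succeeded, False = query failed.
results = iter([True, False, False, False, False, True])


def fake_probe() -> None:
    if not next(results):
        raise ConnectionError("cluster unreachable")


monitor = AvailabilityProbe(fake_probe)
print([monitor.tick() for _ in range(6)])  # [False, False, False, True, False, False]
```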

2.3 Email Notifications

If you have configured an email server, you will receive an email alert when a condition defined in any monitoring item is met.
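As a rough illustration of what such an alert email might contain, the following minimal Python sketch assembles a message with the standard email.message module and includes a link built from the Graphical Interface Domain Name so that recipients can jump to the alert details. The build_alert_email name, the subject format, and the body layout are hypothetical; the sketch only applies the configuration rules from section 2.1 (at least one recipient, and Sender matching Username for third-party services).

```python
import smtplib
from email.message import EmailMessage
from typing import List, Optional


def build_alert_email(sender: str, recipients: List[str], item: str,
                      details: str, ui_url: str,
                      username: Optional[str] = None,
                      third_party: bool = True) -> EmailMessage:
    """Assemble an alert email that links back to the graphical interface."""
    if not recipients:
        raise ValueError("at least one recipient is required")
    # For third-party email services, Sender should equal Username.
    if third_party and username and sender != username:
        raise ValueError("for third-party services, Sender should match Username")
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"[Health Monitoring] {item}"  # hypothetical subject format
    msg.set_content(f"{details}\n\nView alert details: {ui_url}")
    return msg


def send_alert(conn: smtplib.SMTP, msg: EmailMessage) -> None:
    """Send over an already opened (and, if required, authenticated) connection."""
    conn.send_message(msg)
```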

2.4 Event History

Regardless of email configuration, you can view historical records of cluster events that triggered monitoring alerts under Event History.