Daily Cleaning

This document describes how to correctly and effectively clean up and maintain YMatrix database tables and its basic principles.

1 What is the cleaning and maintenance of YMatrix?

YMatrix strongly recommends periodic cleaning and maintenance of database tables in production systems.

Periodic cleaning is implemented by the VACUUM command, which is mainly used to recycle the storage space occupied by dead tuples. In normal YMatrix operations, deleted or discarded tuples by updating are not physically deleted from their tables, they remain until VACUUM is finished. Therefore, it is necessary to execute VACUUM periodically, especially for frequently updated tables.

In most cases, it is enough to have the automatic cleaning daemon (Autovacuum) perform the cleaning, and you may need to adjust the automatic cleaning parameters described in it to get the best results. However, in some cases, you also need to be able to supplement or replace the activities of the background process using manually managed commands such as VACUUM. This is usually performed using the [cron script] (https://en.wikipedia.org/wiki/Cron) or the task scheduler script.

2 Why do we need to carry out periodic cleaning and maintenance?

Each table in the YMatrix database must be cleaned and maintained regularly for the following reasons:

  • Restore or reuse disk space occupied by updated or deleted rows. In YMatrix, the UPDATE or DELETE of a row at once will not immediately remove the old version of the row. This approach is necessary to benefit from multi-version concurrency control (MVCC): it cannot be deleted when an older version may still be visible to other transactions. But in the end, no transaction will be interested in an outdated or deleted version of the row. The space it takes must be recycled for new rows, which avoids unlimited growth in disk space requirements.
  • Update statistics used by the optimizer.
  • Update visibility map, which speeds up scans with index only.
  • Protect old data from being lost due to transaction ID revolving or multi-transaction ID revolving.

3 How to effectively clean and maintain?

3.1 Manual cleaning

Manual cleaning is done by running the VACUUM SQL statement.

For example, clean a table:

=# VACUUM <tablename>;

Clean all tables:

=# VACUUM;

In addition to the standard VACUUM statement, you can also use the VACUUM FULL statement for cleaning and maintenance. VACUUM FULL can reclaim more disk space but runs more slowly. In addition, the standard form of VACUUM can be run in parallel with production database operations (commands such as SELECT, INSERT, UPDATE, and DELETE will continue to work properly, but you cannot use ALTER TABLE and other commands to update the table definition during the cleaning). But VACUUM FULL requires an exclusive lock on the table where it works, so it cannot be paralleled with other uses of this table. Therefore, usually administrators should strive to use the standard VACUUM and avoid VACUUM FULL.

For different storage tables, the degree of blocking of other SQL statements varies as they execute the VACUUM clean command:

| Storage Engine | INSERT | UPDATE | DELETE | ALTER | SELECT | Physical File Size | |--|--|--|--|--|--|--|--| | HEAP | No blocking | No blocking | No blocking | Blocking | No blocking | Not changing | | AO | No blocking | No blocking | No blocking | Blocking | No blocking | Shrinking | | MARS2/MARS3 | No blocking | No blocking | No blocking | Blocking | No blocking | Shrinking |

Notes!
VACUUM generates a large amount of I/O traffic, which will cause poor performance of other active sessions. You can adjust some configuration parameters to perform the performance impact caused by background cleaning activities. See [Cost-based cleaning delay] (/doc/latest/reference/configuration_parameters/resource_consumption_parameters#vacuum).

3.2 Automatic cleaning

Automatic cleaning is done through the Autovacuum daemon, with the purpose of automatically executing the VACUUM and ANALYZE commands. When enabled:

  • Autovacuum checks tables that have been inserted, updated or deleted in large quantities. The interval between the two check operations is defined by the parameter autovacuum_naptime.
  • The Autovacuum daemon is actually composed of multiple processes. So how many processes can be run at the same time are defined by the autovacuum_max_workers parameter, and 3 worker processes are called by default for work.
  • When does the Autovacuum process trigger VACUUM:
    • If the actual number of dead tuples in a table exceeds the valid threshold calculated by the following formula due to an update or delete operation, the table will become an Autovacuum candidate table: autovacuum_vacuum_scale_factor * The total number of tuples in the table + autovacuum_vacuum_threshold.
  • When does the Autovacuum process trigger ANALYZE:
    • Since the end of the last analysis operation, any table with the number of tuples that have performed the insert/delete/update operations exceeding the valid threshold calculated by the formula below is eligible for Autovacuum analysis: autovacuum_analyze_scale_factor * The total number of tuples of this table + autovacuum_analyze_threshold.

Turn on the automatic cleaning function and reload the configuration file to take effect:

$ gpconfig -c autovacuum -v on
$ mxstop -u

Notes!
It is necessary to ensure that the track_counts TO on (SET track_counts TO on;) is automatically cleaned up the process. This parameter is enabled by default.

Notes!
For details on automatic cleaning related configuration parameters, please refer to [Automatic Cleanup Parameters] (/doc/latest/reference/configuration_parameters/automatic_vacuum_parameters).