ANALYZE

Collect statistics about a database.

ANALYZE [VERBOSE] [table [ (column [, ...] ) ]]

ANALYZE [VERBOSE] {root_partition_table_name|leaf_partition_table_name} [ (column [, ...] )] 

ANALYZE [VERBOSE] ROOTPARTITION {ALL | root_partition_table_name [ (column [, ...] )]}

Description

ANALYZE collects statistics about the contents of tables in the database and stores the results in the system table pg_statistic. YMatrix then uses these statistics to help determine the most efficient execution plan for a query.
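As a minimal sketch of how the stored statistics can be inspected, the query below reads the standard pg_stats view, which presents pg_statistic in a readable form. The table name mytable and the schema public are placeholders chosen for illustration.

```sql
-- Refresh statistics for a placeholder table, then look at what was stored.
ANALYZE mytable;

-- pg_stats is the human-readable view over pg_statistic.
SELECT attname,            -- column name
       null_frac,          -- fraction of NULL entries
       n_distinct,         -- estimated number of distinct values
       most_common_vals    -- list of most common values, if any
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'mytable';
```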

If no parameters are specified, ANALYZE collects statistics for every table in the current database. You can specify a table name to collect statistics for a single table, and you can additionally specify a list of column names, in which case statistics are collected only for those columns.
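The three forms described above might look as follows. The table mytable and the columns col_a and col_b are hypothetical names used only for illustration.

```sql
-- Every table in the current database (external tables are skipped).
ANALYZE;

-- A single table.
ANALYZE mytable;

-- Only the listed columns of one table.
ANALYZE mytable (col_a, col_b);
```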

ANALYZE does not collect statistics for external tables.

For partitioned tables, ANALYZE collects additional statistics on the leaf partitions, namely HyperLogLog (HLL) statistics. HLL statistics are used to derive number of distinct values (NDV) estimates for queries against partitioned tables.

  • When NDV estimates are aggregated across multiple leaf partitions, HLL statistics produce more accurate NDV estimates than the standard table statistics.
  • When updating HLL statistics, ANALYZE needs to be run only on the leaf partitions that have changed. For example, ANALYZE is required if the data in a leaf partition has changed, or if the leaf partition has been exchanged with another table.

Parameters

{ root_partition_table_name | leaf_partition_table_name } [ (column [, ...] ) ]

  • Collects statistics for partitioned tables, including HLL statistics. HLL statistics are collected only on leaf partitions.
    ANALYZE root_partition_table_name collects statistics on all leaf partitions and on the root partition.
    ANALYZE leaf_partition_table_name collects statistics on that leaf partition.
    By default, if you specify a leaf partition and all other leaf partitions have statistics, ANALYZE updates the root partition statistics. If not all leaf partitions have statistics, ANALYZE logs information about the leaf partitions that do not have statistics.

ROOTPARTITION [ALL]

  • Collects statistics only on the root partition of partitioned tables, based on the data in the partitioned table. If possible, ANALYZE uses leaf partition statistics to generate the root partition statistics. Otherwise, ANALYZE collects the statistics by sampling leaf partition data. In that case statistics are not collected on the leaf partitions; their data is only sampled. HLL statistics are not collected.
    When you specify ROOTPARTITION, you must specify either ALL or the name of a partitioned table.
    If you specify ALL with ROOTPARTITION, YMatrix collects statistics for the root partition of every partitioned table in the database. If the database contains no partitioned tables, a message stating that there are no partitioned tables is returned. No statistics are collected for tables that are not partitioned.
    If you specify a table name with ROOTPARTITION and the table is not a partitioned table, no statistics are collected for the table and a warning message is returned.
    The ROOTPARTITION clause is not valid with VACUUM ANALYZE. The command VACUUM ANALYZE ROOTPARTITION returns an error.
    The time to run ANALYZE ROOTPARTITION is similar to the time to analyze a non-partitioned table with the same data, because ANALYZE ROOTPARTITION only samples the leaf partition data.
    For the partitioned table sales_curr_yr, this example command collects statistics only on the root partition of the partitioned table:
    ANALYZE ROOTPARTITION sales_curr_yr;
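For comparison, the ALL form described above could be used as sketched here. It takes no table name and covers every partitioned table in the current database.

```sql
-- Collect root partition statistics for all partitioned tables
-- in the current database. Non-partitioned tables are not analyzed.
ANALYZE ROOTPARTITION ALL;

-- Same, with progress messages.
ANALYZE VERBOSE ROOTPARTITION ALL;
```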

VERBOSE

  • Enables the display of progress messages. When specified, ANALYZE reports the following information:
    • The table that is being processed.
    • The query that is run to generate the sample table.
    • The column for which statistics are being computed.
    • The queries that are issued to collect the different statistics for a single column.
    • The statistics that are collected.
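A short sketch of requesting progress messages follows. The names mytable and col_a are placeholders, and the exact message text depends on the server version, so no output is shown.

```sql
-- Progress messages for a whole table.
ANALYZE VERBOSE mytable;

-- Progress messages while analyzing a single column.
ANALYZE VERBOSE mytable (col_a);
```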

table

  • The name (optionally schema-qualified) of a specific table to analyze. If omitted, all regular tables in the current database (but not external tables) are analyzed.

column

  • The name of a specific column to analyze. Defaults to all columns.

Notes

Foreign tables are analyzed only when they are explicitly selected. Not all foreign data wrappers support ANALYZE. If the table's wrapper does not support ANALYZE, the command prints a warning and does nothing.

It is a good idea to run ANALYZE periodically, or immediately after making major changes to the contents of a table. Accurate statistics help YMatrix choose the most appropriate query plan and thereby improve query processing speed. A common strategy for read-mostly databases is to run VACUUM and ANALYZE once a day during a low-usage time of day. (This is not sufficient if there is heavy update activity.) You can check for tables with missing statistics using the gp_stats_missing view in the gp_toolkit schema:

SELECT * from gp_toolkit.gp_stats_missing;

ANALYZE requires a SHARE UPDATE EXCLUSIVE lock on the target table. This lock conflicts with the following lock modes: SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, and ACCESS EXCLUSIVE.

If you run ANALYZE on a table that does not contain data, statistics are not collected for that table. For example, if you perform a TRUNCATE operation on a table with statistics and then run ANALYZE on that table, the statistics will not change.

For a partitioned table, specifying which part of the table to analyze, the root partition or the subpartitions (leaf partition tables), can be useful if the table has a large number of partitions that have already been analyzed and only a few leaf partitions have changed.

  • When you run ANALYZE on the root partition table, statistics are collected for all leaf partitions. Leaf partitions are the lowest-level tables in the hierarchy of child tables that YMatrix creates for use by the partitioned table.
  • When you run ANALYZE on a leaf partition, statistics are collected only for that leaf partition and the root partition. If the data in the leaf partition has changed (for example, you made significant updates to the leaf partition data or you exchanged the leaf partition), you can run ANALYZE on the leaf partition to collect table statistics. By default, if all other leaf partitions have statistics, the command updates the root partition statistics.
    For example, if you collected statistics on a partitioned table with a large number of partitions and then updated data in only a few leaf partitions, you can run ANALYZE on only those partitions to update the statistics for those partitions and for the root partition.
  • When you run ANALYZE on a child table that is not a leaf partition, statistics are not collected.
    For example, you can create a partitioned table with partitions for the years 2006 to 2016 and subpartitions for each month of each year. If you run ANALYZE on the child table for the year 2013, no statistics are collected. If you run ANALYZE on the leaf partition for March 2013, statistics are collected only for that leaf partition.
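The scenario above can be sketched as follows. The names sales, sales_2013, and sales_2013_03 are hypothetical and stand for the root table, the child table for the year 2013, and the leaf partition for March 2013. The actual child table names in a given database depend on how the partitions were created.

```sql
-- Root partition table: statistics are collected for all leaf partitions.
ANALYZE sales;

-- Mid-level child table (not a leaf): no statistics are collected.
ANALYZE sales_2013;

-- Leaf partition: statistics are collected for this leaf only; by default
-- the root statistics are also updated if every other leaf has statistics.
ANALYZE sales_2013_03;
```

After a load that touches only a few months, analyzing just those leaf partitions avoids rescanning the partitions whose data has not changed.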

For a partitioned table that contains a partition that has been exchanged to use an external table, ANALYZE does not collect statistics for the external table partition:

  • If ANALYZE is run on an external table partition, the partition is not analyzed.
  • If ANALYZE or ANALYZE ROOTPARTITION is run on the root partition, the external table partition is not sampled and the root table statistics do not include the external table partition.
  • If the VERBOSE clause is specified, an informational message is displayed: skipping external table.

The YMatrix server configuration parameter optimizer_analyze_root_partition affects when statistics are collected on the root partition of a partitioned table. If the parameter is on (the default), the ROOTPARTITION keyword is not required to collect statistics on the root partition when you run ANALYZE. Root partition statistics are collected when you run ANALYZE on the root partition, or when you run ANALYZE on a leaf partition of the partitioned table and the other leaf partitions have statistics. If the parameter is off, you must run ANALYZE ROOTPARTITION to collect root partition statistics.
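A minimal sketch of working with this parameter is shown below. It assumes the parameter can be changed at the session level with SET, which should be verified for your release, and it reuses the sales_curr_yr table name from the ROOTPARTITION example.

```sql
-- Check the current setting (on by default).
SHOW optimizer_analyze_root_partition;

-- Assumption: the parameter is settable per session.
SET optimizer_analyze_root_partition = off;

-- With the parameter off, root partition statistics must be
-- requested explicitly.
ANALYZE ROOTPARTITION sales_curr_yr;
```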

The statistics collected by ANALYZE usually include a list of some of the most common values in each column and a histogram showing the approximate data distribution in each column. One or both of these can be omitted if ANALYZE deems them uninteresting (for example, in a unique-key column there are no common values) or if the column data type does not support the appropriate operators.

For large tables, ANALYZE takes a random sample of the table contents rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This may result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism causes the query optimizer to choose a different query plan between runs of ANALYZE. To avoid this, raise the amount of statistics collected by ANALYZE by adjusting the default_statistics_target configuration parameter, or on a column-by-column basis by setting the per-column statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS (see ALTER TABLE).

The target value sets the maximum number of entries in the most-common-values list and the maximum number of bins in the histogram. The default target value is 100, but it can be adjusted up or down to trade off the accuracy of planner estimates against the time taken by ANALYZE and the amount of space occupied in pg_statistic. In particular, setting the statistics target to zero disables collection of statistics for that column. This may be useful for columns that are never used in the WHERE, GROUP BY, or ORDER BY clauses of queries, because the planner has no use for statistics on such columns.
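The following sketch shows both ways of adjusting the statistics target. It uses the standard PostgreSQL form ALTER TABLE ... ALTER COLUMN ... SET STATISTICS for the per-column target. The table mytable, the columns col_a and notes, and the target values are illustrative only.

```sql
-- Inspect and raise the default target for the current session.
SHOW default_statistics_target;
SET default_statistics_target = 200;
ANALYZE mytable;

-- Raise the target for one column that needs finer-grained statistics.
ALTER TABLE mytable ALTER COLUMN col_a SET STATISTICS 500;

-- Disable statistics collection for a column that never appears in
-- WHERE, GROUP BY, or ORDER BY clauses.
ALTER TABLE mytable ALTER COLUMN notes SET STATISTICS 0;

-- The new per-column targets take effect on the next ANALYZE.
ANALYZE mytable;
```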

The largest statistics target among the columns being analyzed determines the number of table rows sampled to prepare the statistics. Increasing the target causes a proportional increase in the time and space needed to run ANALYZE.

One of the values estimated by ANALYZE is the number of distinct values that appear in each column. Because only a subset of the rows is examined, this estimate can sometimes be quite inaccurate, even with the largest possible statistics target. If this inaccuracy leads to bad query plans, you can determine a more accurate value manually and install it with ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...) (see ALTER TABLE).
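A sketch of overriding the distinct-value estimate follows. It uses the n_distinct attribute option from current PostgreSQL and assumes that YMatrix accepts the same syntax. The table mytable, the column customer_id, and the value 125000 are placeholders.

```sql
-- See what ANALYZE currently estimates.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'mytable' AND attname = 'customer_id';

-- Determine the real figure (this scans the whole table).
SELECT count(DISTINCT customer_id) FROM mytable;

-- Install the manually determined value; 125000 is a placeholder.
ALTER TABLE mytable ALTER COLUMN customer_id SET (n_distinct = 125000);

-- The override is applied the next time the column is analyzed.
ANALYZE mytable (customer_id);
```

In PostgreSQL a negative n_distinct value is interpreted as a fraction of the row count, which suits columns whose distinct count grows with the table.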

When YMatrix runs ANALYZE to collect statistics for a table and detects that all the sampled table data pages are empty (contain no valid data), it displays a message stating that a VACUUM FULL operation should be performed. If the sampled pages are empty, the table statistics will be inaccurate. Pages become empty after a large number of changes to the table, for example after deleting a large number of rows. A VACUUM FULL operation removes the empty pages and allows ANALYZE to collect accurate statistics.
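The recovery sequence described above can be sketched as follows, with mytable as a placeholder name. Note that VACUUM FULL takes an ACCESS EXCLUSIVE lock and rewrites the table, so it is best scheduled in a maintenance window.

```sql
-- Reclaim the empty pages left behind by large deletes.
VACUUM FULL mytable;

-- Re-collect statistics now that the sampled pages contain data.
ANALYZE mytable;
```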

If a table has no statistics, the server configuration parameter gp_enable_relsize_collection controls whether the Postgres query optimizer uses a default statistics file or estimates the size of the table using the pg_relation_size function. By default, if statistics are not available, the Postgres query optimizer uses the default statistics file to estimate the number of rows.

Example

Collect statistics for the table mytable:

ANALYZE mytable;

Compatibility

There is no ANALYZE statement in the SQL standard.

See also

VACUUM