Understanding Query Plans

This document introduces the basic structure of query plans and how to read them, and provides examples and explanations of both single-node and parallel query plans.

1 What is a Query Plan?

A query plan (also known as an execution plan or query execution plan) is an important part of a complete query process. It is selected and generated by the optimizer, and provides a detailed description of how a SQL statement is executed in the database.

In centralized databases (such as PostgreSQL and other single-node databases), the database node instance has all the data, and the query process occurs entirely on the same machine. Generating and executing a single-node query plan is sufficient to meet query requirements.

In distributed databases (such as Greenplum and YMatrix), data is distributed across multiple nodes. Users need to specify the distribution strategy and distribution key when creating tables. Data is distributed across multiple Segment node instances based on the distribution strategy and key, with each Segment node instance holding only a portion of the data.

Therefore, a single-node query plan alone cannot complete such a query; a parallel query plan (also called a distributed query plan) is required. Based on how the data is distributed, it determines how to parallelize computation efficiently and how to bring computation closer to the data, that is, performing as much computation as possible on the Segment node instances and avoiding computation on the Master, to ensure the best query performance.
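
For reference, the distribution strategy and distribution key are declared when the table is created. Below is a minimal, hypothetical sketch in Greenplum/YMatrix-style syntax (table and column names are illustrative only):

-- Rows are hash-distributed across Segments by the cid column.
CREATE TABLE orders_demo (
  oid int,
  cid int,
  amount numeric
) DISTRIBUTED BY (cid);

-- Alternatively, DISTRIBUTED RANDOMLY spreads rows round-robin with no distribution key.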

2 How is a Query Plan Generated?

In YMatrix, after data is loaded into a table, you first need to run the ANALYZE command to collect statistics. Regularly running ANALYZE on the entire database is also necessary. Only with up-to-date statistics can the optimizer choose an accurate query plan and thereby improve query processing speed.
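
For example (standard PostgreSQL/Greenplum syntax; the table name is illustrative):

-- Collect statistics for a single table.
ANALYZE sale;

-- Collect statistics for every table in the current database.
ANALYZE;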

With statistics collected, you can use the EXPLAIN command to view the query plan the optimizer generates for a given SQL statement. The optimizer typically considers multiple possible query plans and selects the one it estimates to be cheapest to execute the query. By analyzing and understanding the query plan, developers and database administrators (DBAs) can optimize query performance and identify potential bottlenecks.
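
A minimal sketch of how EXPLAIN is typically used (EXPLAIN ANALYZE additionally executes the statement, so use it with care for statements that modify data; the query is illustrative):

-- Show the plan the optimizer chose, without running the query.
EXPLAIN SELECT * FROM sale WHERE cid = 3;

-- Run the query and additionally report actual row counts and timings per node.
EXPLAIN ANALYZE SELECT * FROM sale WHERE cid = 3;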

3 What Does a Query Plan Consist Of?

A query plan is a tree composed of plan nodes. Each node represents a query operation, such as a table scan, join, or aggregation, and passes its result to its parent node for further processing.

Each node in the query plan contains detailed information about its execution, such as execution order, access method, degree of parallelism, and estimated cost. The tree is read from the leaf nodes up to the root node; by traversing the entire tree, you can see how the database will execute the query.

In distributed query plans, the most common plan node types, together with their scalar and vectorized operators, are as follows:

Scan Plan Nodes: responsible for scanning tables or indexes in the database.

  Scalar operators:
  • SeqScan: Sequentially scans all rows in a table.
  • IndexScan: Traverses an index to retrieve rows from a table.
  • BitmapIndexScan & BitmapHeapScan: Usually used together. The bitmap index scan scans the index itself and generates a bitmap (index size affects bitmap granularity); the bitmap heap scan then scans the data table based on that bitmap. Note the difference from a plain index scan: an index scan reads one index entry at a time and immediately checks the corresponding table row, alternating between the index and the data table, which suits precise lookups. A bitmap scan works in two distinct stages: it first retrieves all qualifying index entries from the index and sorts them in memory, then accesses the data table based on those entries, which suits locating large sets of rows.
  • DynamicSeqScan: Uses a partition selection function to choose partitions.
  • ParallelSeq: Used when parallel execution is enabled; the degree of parallelism is usually set equal to the number of CPU cores for the highest efficiency.

  Vectorized operator:
  • MxVScan: Sequential scan performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.

Join Plan Nodes: responsible for joining two or more tables.

  Scalar operators:
  • HashJoin: Builds a hash table from the smaller table, using the join column as the hash key, then scans the larger table, computes the hash key of its join column, and probes the hash table for rows with the same hash key. HashCond shows the columns being joined.
  • NestedLoopJoin: Iterates over the rows of the larger dataset and, on each iteration, scans the rows of the smaller dataset. A nested loop join requires broadcasting one of the tables so that all rows of one table can be compared with all rows of the other; however, if the distribution key is the same as the join key, the nested loop join can be performed locally without broadcasting.
  • MergeJoin: Sorts both datasets and merges them.

  Vectorized operator:
  • MxVHashJoin: Hash join performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.

Other Plan Nodes:

  Scalar operators:
  • Materialize: Caches the results of a subquery, i.e., the tuples produced during the first execution are cached for reuse by upper nodes.
  • Sort: Sorts the tuples returned by the lower node.
  • Group: Groups sorted tuples from the lower node.
  • Agg: Executes aggregate functions.
  • Unique: Performs deduplication.
  • Hash: Performs hash operations; auxiliary node for HashJoin.
  • Limit: Handles LIMIT/OFFSET clauses.
  • WindowAgg: Handles window functions.
  • LockRows: Locks the selected rows, according to the FOR SHARE/FOR UPDATE clause.
  • SetOp: Combines multiple datasets using set operations such as union (UNION), intersection (INTERSECT), and difference (EXCEPT/MINUS).
  • SubqueryScan: Evaluates a subquery.
  • Append: Merges multiple result sets into one.
  • Result: Handles conditional expressions that only need to be computed once, or INSERT ... VALUES statements that contain only a VALUES clause.
  • GroupAgg: First sorts the data on the aggregation columns so that tuples with equal aggregation columns are adjacent, then traverses the sorted data to perform deduplication and compute the aggregate functions.
  • HashAgg: Requires no sorting; groups and aggregates the data by building a hash table.
  • Broadcast Motion: Each Segment sends its own rows to all other Segments, so that every Segment node instance has a complete local copy of the table data.
  • Redistribute Motion: Each Segment rehashes its data on the redistribution key and sends it to the corresponding Segment.
  • Gather Motion: Result data from all Segments is assembled into a single stream. For most query plans, this is the final operation.

  Vectorized operators:
  • MxVMotion: Gather Motion implemented by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVSort: Sort performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVResult: Result performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVSubqueryScan: Subquery scan performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVLimit: Limit performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVHashAgg: HashAgg performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVGroupAgg: GroupAgg performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.
  • MxVAppend: Append performed by the vectorized execution engine on MARS2/MARS3/AOCO tables.

The above nodes can be combined as needed to form a complete query plan.
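
As an illustration of the scan operators above, which one the planner chooses depends on available indexes, predicate selectivity, and statistics. A hypothetical sketch (table, column, and index names are made up for this example):

-- A simple table with an index on column c.
CREATE TABLE t (id int, c int, payload text);
CREATE INDEX t_c_idx ON t (c);
ANALYZE t;

-- Depending on how many rows match, the planner may choose a SeqScan,
-- an IndexScan, or a BitmapIndexScan plus BitmapHeapScan for this query.
EXPLAIN SELECT * FROM t WHERE c = 42;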

4 How to Read a Query Plan

Due to its nested tree structure, a query plan should be read from bottom to top.

Here, we provide examples of both a single-node query plan and a parallel query plan for illustration.

4.1 Single-Node Query Plan

First, create two tables. (Example tables are created in PostgreSQL 9.2, a single-node database)

CREATE TABLE sale (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
);

CREATE TABLE customer (
  cid int,
  cname text,
  cwechat text
);

Insert some test data:

INSERT INTO sale (cid, pid, pnum, qty, date) VALUES
(1, 1, 1, 10.0, '2023-07-20 08:37:06'),
(2, 3, 6, 35.0, '2023-07-15 18:22:00'),
(3, 2, 1, 3.0, '2023-07-17 11:37:00'),
(4, 1, 2, 10.0, '2023-07-20 10:37:09'),
(5, 3, 2, 35.0, '2023-07-17 08:12:06'),
(6, 1, 1, 10.0, '2023-07-02 12:10:23'),
(7, 3, 1, 35.0, '2023-07-07 08:07:00'),
(8, 5, 3, 99.0, '2023-07-20 10:37:06'),
(9, 3, 1, 35.0, '2023-07-20 15:30:00'),
(10, 3, 1, 35.0, '2023-07-20 09:00:00');

INSERT INTO customer (cid, cname, cwechat) VALUES
(1, 'kepler', 'kepler1997'),
(2, 'amiee', 'amiee1995'),
(3, 'lila', 'lila2002'),
(4, 'cici', 'cici1982'),
(5, 'damien', 'damien1983'),
(6, 'ariana', 'ariana1990'),
(7, 'kanye', 'kanye1960'),
(8, 'taylor', 'taylor1996'),
(9, 'michael', 'mike2005'),
(10, 'ray', 'ray1957');

Now we want to join the sale and customer tables on the shared cid key to calculate total sales per customer, and analyze the query plan for this query.

Before generating a query plan, statistics must be collected:

ANALYZE sale;
ANALYZE customer;

Generate a query plan:

EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount 
FROM sale s, customer c 
WHERE s.cid = c.cid 
GROUP BY c.cname;

Query Plan Output:

HashAggregate  (cost=2.56..2.66 rows=10 width=14)
  Group Key: c.cname
  ->  Hash Join  (cost=1.23..2.46 rows=10 width=18)
        Hash Cond: (s.cid = c.cid)
        ->  Seq Scan on sale s  (cost=0.00..1.10 rows=10 width=16)
        ->  Hash  (cost=1.10..1.10 rows=10 width=10)
              ->  Seq Scan on customer c  (cost=0.00..1.10 rows=10 width=10)

The query plan has two main nodes: Hash Join and HashAggregate. Each node includes three estimates: cost, rows, and width. These help us gauge, in advance, the execution cost, the result set size, and the amount of data to be transferred and processed.

Query Plan Tree Structure: ![](https://img.ymatrix.cn/ymatrix_home/stand-alone plan tree (screenshot)_1691547703.png)

Reading a query plan requires understanding its nested structure: each node is a sub-operation, and data flows and is computed from the bottom up. The optimizer first plans the scan operations. Scanning the customer table costs 1.10; the scanned rows are then used to build a hash table that is kept in memory (the Hash node). Meanwhile, the sale table is scanned, the hash value of its cid column is computed, and the cid join key is probed against the hash table. At this point, the startup cost and total cost are 1.23 and 2.46, respectively. Finally, all rows are grouped and aggregated by the cname column using a hash table, bringing the total query cost to 2.66.
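
As a side note, one way to confirm that the optimizer weighed alternative plans is to temporarily disable a plan type for the current session using standard PostgreSQL planner settings (a sketch for experimentation only, not for production use):

-- With hash joins disabled, the planner must cost another join strategy
-- (for example, a merge join or nested-loop join) for the same query.
SET enable_hashjoin = off;
EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount
FROM sale s, customer c
WHERE s.cid = c.cid
GROUP BY c.cname;
RESET enable_hashjoin;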

Detailed Interpretation of Plan Components (see operator table above for details):

  • Seq Scan: Sequential scan. Used when the table is not indexed or most data needs to be retrieved.
  • Hash: Builds a hash table. Although shown as a separate node, it is part of the Hash Join node.
  • Hash Cond: Shows the join condition.
  • Hash Join: See operator table.
  • Group Key: The aggregation key.
  • HashAggregate: See operator table.
  • cost: Estimated cost of the plan. Consists of two parts:
    • Startup cost: Cost to return the first row. For seq scans, it's near zero; for sorts, it's higher.
    • Total cost: Cost to return all rows. Note: this is an estimate, not actual execution time (an EXPLAIN ANALYZE sketch for comparing estimates with actual values follows this list).
  • rows: Estimated number of rows output by the node. Reflects WHERE clause selectivity.
  • width: Width in bytes of all columns output by the node.
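
Because cost, rows, and width are estimates only, EXPLAIN ANALYZE (which actually executes the statement) can be used to compare them with real values; a minimal sketch for the query above:

-- Each plan node additionally reports "actual time=... rows=... loops=...",
-- so the estimated row counts can be compared with the actual ones.
EXPLAIN ANALYZE
SELECT c.cname, sum(s.pnum * s.qty) AS amount
FROM sale s, customer c
WHERE s.cid = c.cid
GROUP BY c.cname;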

4.2 Parallel Query Plan

As above, first create two tables. (The sample tables are created with the MARS3 storage engine in the distributed database YMatrix 5.1.)

Before creating a MARS3 table, you first need to create the matrixts extension, which mainly provides time-series-related functionality.

CREATE EXTENSION matrixts;

CREATE TABLE sale (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
) USING MARS3
DISTRIBUTED BY (cid,date)
ORDER BY (cid,date);

CREATE TABLE customer (
  cid int,
  cname text,
  cwechat text
) USING MARS3
DISTRIBUTED BY (cid)
ORDER BY (cid);

Insert the same test data and collect statistics as above.

Then generate a query plan using the same SQL statement. Since YMatrix enables vectorization by default, many vectorized operators will be used in the query plan, which helps improve query performance.

EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount 
FROM sale s, customer c 
WHERE s.cid = c.cid 
GROUP BY c.cname;

Query Plan Output:

Gather Motion 4:1  (slice1; segments: 4)  (cost=72.26..72.44 rows=10 width=14)
  ->  GroupAggregate  (cost=72.26..72.32 rows=2 width=14)
        Group Key: c.cname
        ->  Sort  (cost=72.26..72.27 rows=2 width=18)
              Sort Key: c.cname
              ->  Redistribute Motion 4:4  (slice2; segments: 4)  (cost=0.00..72.25 rows=2 width=18)
                    Hash Key: c.cname
                    ->  Nested Loop  (cost=0.00..72.20 rows=2 width=18)
                          Join Filter: (s.cid = c.cid)
                          ->  Custom Scan (MxVMotion) Redistribute Motion 4:4  (slice3; segments: 4)  (cost=0.00..36.07 rows=2 width=16)
                                Hash Key: s.cid
                                ->  Custom Scan (MxVScan) on sale s  (cost=0.00..36.02 rows=2 width=16)
                          ->  Materialize  (cost=0.00..36.04 rows=2 width=10)
                                ->  Custom Scan (MxVScan) on customer c  (cost=0.00..36.02 rows=2 width=10)
 Optimizer: Postgres query optimizer

The query plan has six main nodes: Materialize, Redistribute Motion, Nested Loop, Sort, GroupAggregate, and Gather Motion. Each node again includes the three estimates cost, rows, and width, which help us gauge the execution cost, result set size, and data transmission/processing complexity in advance.

Query Plan Tree Structure: ![](https://img.ymatrix.cn/ymatrix_home/parallel planning tree (screenshot 2)_1691547677.png)

Reading from the bottom up, the optimizer first plans a Materialize operation: scanning and materializing the customer table costs about 36. Because the customer table is distributed across Segments by cid while the sale table is distributed by (cid, date), the sale table must be redistributed by cid before the matching tuples of the two tables can be joined locally on each Segment. After redistribution, a Nested Loop join is performed with the join filter s.cid = c.cid; at this point the cumulative cost of the plan is about 72, so most of the work has already been done. Because the query groups results by the cname column, the optimizer plans a second redistribution (hashed on cname), sorts by cname, and then completes the grouping and aggregation (GroupAggregate, total cost 72.32). Finally, the Gather Motion node (total cost 72.44) collects the results from all Segments and sends them to the Master.
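
To illustrate the effect of the distribution key, the following hypothetical variant of the sale table is distributed by cid alone, matching the customer table's distribution key. With matching distribution keys the join can run locally on each Segment, so the optimizer may no longer need to redistribute the sale data; the exact plan still depends on statistics and optimizer settings.

-- Hypothetical variant: same columns as sale, but distributed by cid only,
-- so sale rows and the matching customer rows land on the same Segment.
CREATE TABLE sale_by_cid (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
) USING MARS3
DISTRIBUTED BY (cid)
ORDER BY (cid);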

Detailed Interpretation of Plan Components (see operator table above for details):

  • Materialize: See operator table.
  • MxVScan: See operator table.
  • MxVMotion: See operator table.
  • Hash Key: In a Redistribute Motion, the column(s) whose hash value determines which Segment each row is sent to.
  • Nested Loop: See operator table.
  • Join Filter: Specifies the join condition.
  • Redistribute Motion: Moves tuples between Segment instances to complete the join.
  • Sort: See operator table.
  • Sort Key: The sort key.
  • GroupAggregate: See operator table.
  • Group Key: The aggregation key.
  • Gather Motion: Indicates that the Segment instances send their results back to the Master, which returns them to the client. Whenever a query plan contains Motion nodes, it also has an implicit slice at its top level. Not every query plan requires this operator; for example, a CREATE TABLE x AS SELECT ... statement does not, because its tuples are sent to the newly created table rather than to the Master.
  • segments: Number of data node instances executing the slice (a query for checking how a table's rows are spread across them follows this list).
  • slice: In a distributed database, a query plan is divided into multiple slices, each responsible for a part of the plan, so that those parts can be processed in parallel by the Segments. Slice boundaries are Motion nodes: wherever a Motion appears, the plan is cut into a sending side and a receiving side.
  • cost: See single-node query plan analysis.
  • rows: See single-node query plan analysis.
  • width: See single-node query plan analysis.
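
To check how a table's rows are actually spread across Segment instances (and therefore how much work each slice has to do per instance), Greenplum-compatible databases expose a gp_segment_id system column on each table. Assuming this column is available in YMatrix, a quick sketch:

-- Count the rows of the sale table stored on each Segment instance.
SELECT gp_segment_id, count(*) AS rows_on_segment
FROM sale
GROUP BY gp_segment_id
ORDER BY gp_segment_id;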

The schematic diagram of the Slice division between Segment 1 and Segment 2 in this query plan is as follows:

![](https://img.ymatrix.cn/ymatrix_home/slice (screenshot fix)_1691548044.png)