Understanding Query Plans

This document introduces the basic components of query plans and how to read them. It provides examples and explanations for both single-node and parallel (distributed) query plans.

1 What Is a Query Plan?

A query plan, also known as an execution plan or query execution plan, is a critical component in the process of executing a SQL statement. Generated by the query optimizer, it describes in detail how a given SQL statement will be executed within the database.

In centralized databases (such as PostgreSQL and other single-node systems), all data resides on one node instance, and the entire query execution occurs locally. A single-node query plan suffices for such environments.

However, in distributed databases like Greenplum and YMatrix, data is spread across multiple nodes. When creating tables, users must specify a distribution strategy and distribution key. Data is then distributed across multiple MXSegment instances based on this key, with each segment holding only a portion of the total dataset.
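
For illustration, a distribution key is declared when the table is created; rows are then hashed on that key to a segment, or spread randomly when no suitable key exists. The table and column names below are hypothetical, and the Greenplum-style DISTRIBUTED RANDOMLY clause is assumed to be available alongside the DISTRIBUTED BY clause shown later in this document.

-- Hash-distribute rows across MXSegment instances by the cid column.
=# CREATE TABLE orders (cid int, amount float) DISTRIBUTED BY (cid);

-- Spread rows across segments without a distribution key (round-robin style).
=# CREATE TABLE events (payload text) DISTRIBUTED RANDOMLY;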

As a result, traditional single-node query plans are insufficient. A new type of plan—called a parallel query plan (or distributed query plan)—must be used. This plan leverages data distribution to enable efficient parallel computation and ensures that processing occurs as close to the data as possible. Specifically, computations should occur on MXSegment instances rather than on the MXMaster whenever feasible, minimizing data movement and maximizing query performance.

2 How Is a Query Plan Generated?

After loading data into YMatrix, you must first run the ANALYZE command to collect statistics. Periodic full-database ANALYZE operations are also recommended. Accurate statistics are essential for the optimizer to make informed decisions when selecting the most efficient query plan, thereby improving query performance.
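
For example, statistics can be collected for a single table or, with no arguments, for every table in the current database (the table name sale matches the examples later in this document):

-- Collect statistics for one table.
=# ANALYZE sale;

-- Collect statistics for every table in the current database.
=# ANALYZE;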

Once statistics are collected, use the EXPLAIN command to view the logical query plan generated by the optimizer for a given SQL statement. The optimizer typically evaluates multiple potential plans and selects the optimal one for execution. Analyzing and understanding these plans helps developers and database administrators (DBAs) optimize query performance and identify bottlenecks.
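
For example, prefixing a statement with EXPLAIN prints the estimated plan without executing the statement (the table name t below is only a placeholder):

-- Show the plan the optimizer chose; the statement itself is not run.
=# EXPLAIN SELECT * FROM t WHERE id = 3;

-- EXPLAIN VERBOSE additionally shows the output columns of every plan node.
=# EXPLAIN VERBOSE SELECT * FROM t WHERE id = 3;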

3 What Does a Query Plan Consist Of?

A query plan has a tree structure composed of multiple plan nodes. Each node represents a specific operation—such as table scan, join, or aggregation—and passes its result to the parent node for further processing.

Each node contains detailed execution information, including execution order, access method, concurrency level, and estimated cost. The plan tree starts at leaf nodes and ends at the root node. Traversing the tree from bottom to top reveals the complete execution flow.

In distributed query plans, common plan node types and their associated operators are listed below:

Scan Plan Nodes: scan tables or indexes.

  Scalar operators:
  • SeqScan: Sequentially scans all rows in a table.
  • IndexScan: Traverses an index to fetch rows from the table.
  • BitmapHeapScan & BitmapIndexScan: Often used together. The bitmap index scan reads qualifying index entries and builds a bitmap (granularity depends on index size); the bitmap heap scan then uses this bitmap to access the table. Note: an index scan retrieves one entry at a time and immediately checks it against the query conditions, alternating between index and table access, which makes it ideal for precise lookups. A bitmap scan works in two distinct phases: it first collects all matching index entries and sorts them in memory, then accesses the table, which makes it better suited to broad-range queries.
  • DynamicSeqScan: Uses a partition selection function to choose partitions.
  • ParallelSeq: Parallel sequential scan. When parallelism is enabled, setting the number of parallel workers to the number of CPU cores gives the best efficiency.

  Vectorized operator:
  • MxVScan: Vectorized sequential scan for MARS2/AOCO tables.

Join Plan Nodes: combine two or more tables.

  Scalar operators:
  • HashJoin: Builds a hash table from the smaller table using the join column as the hash key, then probes it with rows from the larger table. HashCond shows the joined columns.
  • NestedLoopJoin: Iterates over the rows of one dataset and scans the other for matches on each iteration. Requires broadcasting one table unless both tables share the same distribution key, in which case the join runs locally without data redistribution.
  • MergeJoin: Sorts both datasets and merges them.

  Vectorized operator:
  • MxVHashJoin: Vectorized hash join for MARS2/AOCO tables.

Other Plan Nodes: the remaining operations, including sorting, aggregation, set operations, and data movement (Motion).

  Scalar operators:
  • Materialize: Caches results of subqueries for reuse; stores output tuples upon first execution.
  • Sort: Orders tuples returned from child nodes.
  • Group: Groups sorted tuples from lower nodes.
  • Agg: Applies aggregate functions.
  • Unique: Removes duplicate rows.
  • Hash: Builds a hash table, typically used as a helper for HashJoin.
  • Limit: Handles LIMIT/OFFSET clauses.
  • WindowAgg: Processes window functions.
  • LockRows: Locks selected rows, optionally filtered by FOR SHARE/FOR UPDATE.
  • SetOp: Performs set operations across multiple datasets: union (UNION), intersection (INTERSECT), and difference (EXCEPT/MINUS).
  • SubqueryScan: Evaluates subqueries.
  • Append: Combines multiple result sets into one.
  • Result: Handles expressions that are computed only once, or simple statements consisting solely of an INSERT ... VALUES clause.
  • GroupAgg: First sorts the data by the grouping columns so identical values are adjacent, then aggregates while scanning.
  • HashAgg: Groups and aggregates without sorting by building a hash table.
  • Broadcast Motion: Each segment sends its rows to all other segments, resulting in a full local copy of the data on every segment.
  • Redistribute Motion: Each segment recalculates the data distribution based on a redistribution key and sends rows to the appropriate target segment.
  • Gather Motion: Results from all segments are assembled into a single stream. This is usually the final step in most query plans.

  Vectorized operators:
  • MxVMotion: Vectorized Motion (e.g., Gather or Redistribute Motion) for MARS2/AOCO tables.
  • MxVSort: Vectorized sort for MARS2/AOCO tables.
  • MxVResult: Vectorized Result operation.
  • MxVSubqueryScan: Vectorized subquery scan.
  • MxVLimit: Vectorized Limit operation.
  • MxVHashAgg: Vectorized HashAgg.
  • MxVGroupAgg: Vectorized GroupAgg.
  • MxVAppend: Vectorized Append.

These nodes can be combined as needed to form a complete query plan.
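
As a simple illustration of how these nodes stack, the following query (using the sale table defined later in this document) would typically produce a plan in which a scan node feeds a Sort node, which in turn feeds a Limit node. The exact operators chosen depend on the table type, available indexes, statistics, and optimizer settings, so the comment below is only a sketch of the expected shape.

-- A plan for this query typically reads, bottom to top:
--   scan of sale  ->  Sort (by qty DESC)  ->  Limit (3 rows)
-- with a Gather Motion on top in a distributed plan.
=# EXPLAIN SELECT cid, qty FROM sale ORDER BY qty DESC LIMIT 3;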

4 How to Read a Query Plan

Due to its nested tree structure, a query plan must be read from bottom to top.

Below are examples of a single-node query plan and a parallel query plan.

4.1 Single-Node Query Plan

First, create two tables. (The example uses PostgreSQL 9.2.)

=# CREATE TABLE sale (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
);
=# CREATE TABLE customer (
  cid int,
  cname text,
  cwechat text
);

Insert test data.

=# INSERT INTO sale (cid, pid, pnum, qty, date) VALUES
(1, 1, 1, 10.0, '2023-07-20 08:37:06'),
(2, 3, 6, 35.0, '2023-07-15 18:22:00'),
(3, 2, 1, 3.0, '2023-07-17 11:37:00'),
(4, 1, 2, 10.0, '2023-07-20 10:37:09'),
(5, 3, 2, 35.0, '2023-07-17 08:12:06'),
(6, 1, 1, 10.0, '2023-07-02 12:10:23'),
(7, 3, 1, 35.0, '2023-07-07 08:07:00'),
(8, 5, 3, 99.0, '2023-07-20 10:37:06'),
(9, 3, 1, 35.0, '2023-07-20 15:30:00'),
(10, 3, 1, 35.0, '2023-07-20 09:00:00');
=# INSERT INTO customer (cid, cname, cwechat) VALUES
(1, 'kepler', 'kepler1997'),
(2, 'amiee', 'amiee1995'),
(3, 'lila', 'lila2002'),
(4, 'cici', 'cici1982'),
(5, 'damien', 'damien1983'),
(6, 'ariana', 'ariana1990'),
(7, 'kanye', 'kanye1960'),
(8, 'taylor', 'taylor1996'),
(9, 'michael', 'mike2005'),
(10, 'ray', 'ray1957');

Now compute the total sales amount per customer by joining the sale and customer tables on their common cid key, and analyze the query plan.

First, collect statistics.

=# ANALYZE sale;
=# ANALYZE customer;

Generate the query plan.

=# EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount FROM sale s, customer c WHERE s.cid = c.cid GROUP BY c.cname;
                                  QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate  (cost=2.56..2.66 rows=10 width=14)
   Group Key: c.cname
   ->  Hash Join  (cost=1.23..2.46 rows=10 width=18)
         Hash Cond: (s.cid = c.cid)
         ->  Seq Scan on sale s  (cost=0.00..1.10 rows=10 width=16)
         ->  Hash  (cost=1.10..1.10 rows=10 width=10)
               ->  Seq Scan on customer c  (cost=0.00..1.10 rows=10 width=10)
(7 rows)

This plan has two main nodes: Hash Join and HashAggregate. Each node includes three cost estimates: cost, rows, and width. These help predict execution time, result size, and processing complexity.

The plan tree structure is shown below:

Understanding a query plan requires recognizing its nesting: each node is a sub-operation, and data flows upward from child to parent. The optimizer begins with a sequential scan of the customer table, costing 1.10, and builds an in-memory hash table from its rows using cid as the hash key. The sale table is then scanned sequentially, and each of its rows is probed against the hash table on the join key cid. At this stage, the startup and total costs are 1.23 and 2.46, respectively. Finally, the results are grouped by cname and aggregated, yielding a total cost of 2.66.
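
To see how the optimizer's choice depends on the strategies it is allowed to use, hash aggregation can be temporarily discouraged for the session, which typically pushes the plan toward a sort-based GroupAggregate. This is only a sketch: enable_hashagg is a standard PostgreSQL planner setting, and the resulting plan may differ depending on version and statistics.

-- Discourage HashAggregate for this session and re-check the plan.
=# SET enable_hashagg = off;
=# EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount
   FROM sale s, customer c WHERE s.cid = c.cid GROUP BY c.cname;
-- Restore the default afterwards.
=# RESET enable_hashagg;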

Detailed explanation of plan components (see operator table above):

  • Seq Scan: Sequential scan. Used when no index exists or when most of the table needs to be read.
  • Hash: Builds a hash table. Though shown as a separate node, it is part of the Hash Join operation.
  • Hash Cond: Specifies the join condition.
  • Hash Join: See operator table.
  • Group Key: Column used for grouping.
  • HashAggregate: Equivalent to HashAgg; see operator table.
  • cost: Estimated execution cost, split into two parts by "...":
    • First value: Startup cost—the cost to return the first row. For Seq Scan, this is near zero; for Sort, it's higher due to pre-processing.
    • Second value: Total cost—the cost to return all rows. Note: Cost does not represent time.
  • rows: Number of rows output by the node. May be less than scanned rows due to filtering by WHERE. Total cost assumes all rows are retrieved, though this may not happen (e.g., with LIMIT). The EXPLAIN ANALYZE example after this list shows actual row counts next to these estimates.
  • width: Total byte size of output columns.
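
Because these numbers are only estimates, it is often useful to compare them with actual execution. EXPLAIN ANALYZE runs the statement and adds actual time, row counts, and loop counts to each node. Note that the statement really is executed, so use it carefully with data-modifying queries.

-- Execute the query and report actual rows and timings next to the estimates.
=# EXPLAIN ANALYZE SELECT c.cname, sum(s.pnum * s.qty) AS amount
   FROM sale s, customer c WHERE s.cid = c.cid GROUP BY c.cname;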

4.2 Parallel Query Plan

As before, create two tables.

Before creating MARS2 tables, install the matrixts extension for time-series functionality.

=# CREATE EXTENSION matrixts;
=# CREATE TABLE sale (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
) USING MARS2
DISTRIBUTED BY (cid,date);
=# CREATE INDEX ON sale USING mars2_btree (cid,date);
=# CREATE TABLE customer (
  cid int,
  cname text,
  cwechat text
) USING MARS2
DISTRIBUTED BY (cid);
=# CREATE INDEX ON customer USING mars2_btree (cid);

Insert the same test data and collect statistics.

Now generate the query plan using the same SQL. Since YMatrix enables vectorization by default, many vectorized operators appear in the plan, enhancing performance.

=# EXPLAIN SELECT c.cname, sum(s.pnum * s.qty) AS amount FROM sale s, customer c WHERE s.cid = c.cid GROUP BY c.cname;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Gather Motion 4:1  (slice1; segments: 4)  (cost=72.26..72.44 rows=10 width=14)
   ->  GroupAggregate  (cost=72.26..72.32 rows=2 width=14)
         Group Key: c.cname
         ->  Sort  (cost=72.26..72.27 rows=2 width=18)
               Sort Key: c.cname
               ->  Redistribute Motion 4:4  (slice2; segments: 4)  (cost=0.00..72.25 rows=2 width=18)
                     Hash Key: c.cname
                     ->  Nested Loop  (cost=0.00..72.20 rows=2 width=18)
                           Join Filter: (s.cid = c.cid)
                           ->  Custom Scan (MxVMotion) Redistribute Motion 4:4  (slice3; segments: 4)  (cost=0.00..36.07 rows=2 width=16)
                                 Hash Key: s.cid
                                 ->  Custom Scan (MxVScan) on sale s  (cost=0.00..36.02 rows=2 width=16)
                           ->  Materialize  (cost=0.00..36.04 rows=2 width=10)
                                 ->  Custom Scan (MxVScan) on customer c  (cost=0.00..36.02 rows=2 width=10)
 Optimizer: Postgres query optimizer
(15 rows)

This plan has six main nodes: Materialize, Redistribute Motion, Nested Loop, Sort, GroupAggregate, and Gather Motion. Each includes cost, row count, and width estimates.

The plan tree structure is shown below:

The optimizer starts with a materialization operation: scanning and materializing the customer table costs approximately 36. Because the customer table is distributed by cid and the sale table by (cid,date), the sale table must be redistributed by cid so that related rows reside on the same segment, enabling local joins. After redistribution, a nested loop join applies the filter condition s.cid = c.cid; by this point the accumulated cost reaches about 72, indicating that the join dominates the plan's resource usage. Since the query groups by cname, a second redistribution occurs, followed by sorting on cname and group aggregation (total cost 72.32). Finally, the Gather Motion node collects the results from all segments (total cost 72.44) and returns them to the MXMaster.
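
The first Redistribute Motion exists only because the two tables do not share a distribution key. As a sketch (not a general recommendation, since the best distribution key depends on the full workload), distributing the sale data by cid alone co-locates matching rows of the two tables, and the optimizer can then typically join them locally without redistributing sale. The table name sale_by_cid below is hypothetical.

-- Hypothetical variant of the sale table, distributed by cid only so that
-- its rows land on the same segments as the matching customer rows.
=# CREATE TABLE sale_by_cid (
  cid int,
  pid int,
  pnum int,
  qty float,
  date timestamp
) USING MARS2
DISTRIBUTED BY (cid);
=# CREATE INDEX ON sale_by_cid USING mars2_btree (cid);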

Explanation of key components (see operator table above):

  • Materialize: See description above.
  • MxVScan: Vectorized scan for MARS2/AOCO tables.
  • MxVMotion: Vectorized motion operator.
  • Hash Key: Column used to build a hash table.
  • Nested Loop: See description above.
  • Join Filter: Condition applied during the join.
  • Redistribute Motion: Moves tuples between MXSegment instances to align data for joins.
  • Sort: See description above.
  • Sort Key: Column used for sorting.
  • GroupAggregate: Aggregates after grouping.
  • Group Key: Column used for grouping.
  • Gather Motion: Collects results from all segments and sends them to the MXMaster for client delivery. Any plan containing Motion is sliced implicitly. The top-level slice here is slice1. Not all queries require this—e.g., CREATE TABLE x AS SELECT... writes directly to a table, bypassing the MXMaster.
  • segments: Number of data node instances; the example after this list shows how to inspect them.
  • slice: In distributed systems, query plans are divided into slices, each processed in parallel by segments. Slices are separated by Motion operators—each Motion splits the plan into sender and receiver sides.
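
To see how many segment instances participate in these slices, the segment layout can be inspected from the system catalog. This assumes the Greenplum-compatible gp_segment_configuration catalog is available, which is typically the case for Greenplum-derived systems such as YMatrix.

-- List segment instances; rows with content >= 0 are data segments,
-- and content = -1 is the master.
=# SELECT content, role, hostname, port
   FROM gp_segment_configuration ORDER BY content;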

Slice division for Segment 1 and Segment 2 is illustrated below: