Data Distribution Policies

YMatrix supports three data distribution policies: HASH, RANDOM, and REPLICATED. The policy chosen for a table determines how its rows are stored across the segment nodes of the cluster, which directly affects query performance and cluster stability.

Distribution Policy Overview

| Distribution Method | Use Case | Advantages | Disadvantages |
|---|---|---|---|
| HASH | Large tables involved in equality joins on the distribution key | Rows with the same key value reside on the same node, enabling efficient equality joins | Poor randomness in key values may cause data skew |
| RANDOM | Large tables that do not require joins and lack a suitable distribution key | More uniform data distribution, reducing the risk of skew | No predictable data placement; the optimizer cannot leverage distribution patterns for optimization |
| REPLICATED | Small tables that are infrequently updated but frequently joined or indexed | A full copy of the table exists on every node | Consumes more storage space |

Hash Distribution (HASH)

How It Works

  • One or more columns are specified as the distribution key when creating the table.
  • When inserting data, the system computes a hash value based on the distribution key and maps the row to a specific segment node.
  • NULL values are hashed like any other value, so every row with a NULL key is assigned to the same segment. Placement stays consistent, but a key column containing many NULLs can concentrate rows on one segment.
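
The routing steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not YMatrix's implementation: the SHA-256 hash, the modulo mapping, the `segment_for` helper, and the four-segment cluster size are all assumptions chosen to keep the example deterministic.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size, not a YMatrix default


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index (illustrative only)."""
    # repr() gives NULL (None) a stable byte form, so it is hashed like any other value.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# (distribution key, payload) pairs; the key 1001 appears twice, one key is NULL.
rows = [(1001, "alice"), (1002, "bob"), (1001, "alice-2"), (None, "unknown")]
for key, payload in rows:
    print(f"key={key!r:>6} payload={payload:<8} -> segment {segment_for(key)}")
```

Both rows with key 1001 print the same segment number, and the NULL key is routed deterministically as well. A multi-column distribution key could be modeled by passing a tuple such as `(1001, "EU")` as `key`.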

Key Advantages

  • Uniform data distribution: A high-cardinality distribution key prevents hotspots caused by excessive data on a single node.
  • Excellent query performance: An equality predicate on the distribution key (WHERE key_column = value) can be routed directly to the segment that holds the key, without scanning the other segments.
  • Efficient JOINs: When tables are joined on their shared distribution key, matching rows already reside on the same segment, so inter-segment data movement is minimized and joins run faster.
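
To illustrate why a shared distribution key makes joins cheap, the following sketch distributes two small tables by hashing their first column and then joins them one segment at a time. The helper names (`segment_for`, `distribute`, `local_join`), the SHA-256 hash, and the sample data are hypothetical; a real planner is far more sophisticated.

```python
import hashlib
from collections import defaultdict


def segment_for(key, num_segments):
    """Stand-in hash routing (SHA-256 is an assumption, not the engine's hash)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(rows, num_segments):
    """Place each row on the segment chosen by hashing its first column."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[0], num_segments)].append(row)
    return segments


def local_join(left, right, num_segments):
    """Join segment by segment on the first column; no row leaves its segment."""
    result = []
    for seg in range(num_segments):
        right_by_key = defaultdict(list)
        for right_row in right.get(seg, []):
            right_by_key[right_row[0]].append(right_row)
        for left_row in left.get(seg, []):
            for right_row in right_by_key.get(left_row[0], []):
                result.append(left_row + right_row[1:])
    return result


N = 4  # hypothetical segment count
customers = [(cid, f"customer-{cid}") for cid in range(1, 6)]  # (customer_id, name)
orders = [(1, "order-a"), (3, "order-b"), (3, "order-c"), (5, "order-d")]  # (customer_id, order)

joined = local_join(distribute(customers, N), distribute(orders, N), N)
for row in sorted(joined):
    print(row)
```

Every order finds its customer even though each segment only looks at its own rows, because equal keys always hash to the same segment.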

Use Cases

  • Most business tables (entity tables, fact tables), especially those with frequent queries and large data volumes.
  • Scenarios requiring frequent equality queries or multi-table JOINs (e.g., financial systems, e-commerce order systems).
  • Tables with a clear, high-cardinality query dimension.
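
The importance of a high-cardinality key can be seen in a small simulation. The sketch below, which assumes a hypothetical four-segment cluster and uses SHA-256 as a stand-in hash, counts rows per segment for a key with 10,000 distinct values and for a key with only three distinct values.

```python
import hashlib
from collections import Counter


def segment_for(key, num_segments):
    """Stand-in hash routing (SHA-256 is an assumption, not the engine's hash)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_per_segment(keys, num_segments):
    """Return the number of rows each segment would hold for the given key values."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    return [counts.get(seg, 0) for seg in range(num_segments)]


N = 4  # hypothetical segment count
order_ids = list(range(10_000))  # high cardinality: every row has its own key
statuses = [("NEW", "PAID", "SHIPPED")[i % 3] for i in range(10_000)]  # 3 distinct keys

high = rows_per_segment(order_ids, N)
low = rows_per_segment(statuses, N)
print("high-cardinality key:", high)
print("low-cardinality key: ", low)
```

With only three distinct key values, at most three of the four segments can receive rows, so at least one segment stays empty while the others carry the whole table. The high-cardinality key spreads rows close to evenly.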

Random Distribution (RANDOM)

How It Works

  • Data is randomly assigned to segment nodes during insertion.
  • No distribution key is required; the system spreads rows approximately evenly across nodes regardless of the values they contain.
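
As a rough model of the behavior described above, the sketch below assigns each inserted row to a segment chosen at random and counts the rows per segment. The `assign_randomly` helper, the fixed seed, and the four-segment cluster are assumptions for illustration; the actual placement algorithm used by the database is not specified here.

```python
import random
from collections import Counter


def assign_randomly(num_rows, num_segments, seed=0):
    """Pick a segment at random for each row; row contents are never inspected."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    counts = Counter(rng.randrange(num_segments) for _ in range(num_rows))
    return [counts.get(seg, 0) for seg in range(num_segments)]


counts = assign_randomly(10_000, 4)
print("rows per segment:", counts)
print("largest deviation from the mean:", max(abs(c - 2_500) for c in counts))
```

Because placement ignores the row values, the spread stays close to even no matter how skewed the data is, but it is statistically even rather than exactly equal.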

Key Advantages

  • Near-uniform data distribution that does not depend on data values, so skewed values cannot create hotspots.
  • Simple configuration: no distribution key needs to be designed.

Use Cases

  • ETL temporary tables and intermediate processing tables.
  • Test tables or small tables with low query frequency.
  • Tables without clear query dimensions and no need for multi-table joins.

Notes

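  • JOINs involving a randomly distributed table generally require moving rows between segments at query time, because matching rows are not guaranteed to be on the same node. This lowers join performance. Not suitable for core business tables.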
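
The cost of this redistribution can be estimated with a small simulation. The sketch below places rows at random and then counts how many are not already on the segment that a hash of the join key would require. It models only a hash-based redistribution under assumed names (`segment_for`, `rows_to_move`), a SHA-256 stand-in hash, and a four-segment cluster; a real planner may instead broadcast the smaller table.

```python
import hashlib
import random


def segment_for(key, num_segments):
    """Stand-in hash routing (SHA-256 is an assumption, not the engine's hash)."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_to_move(keys, num_segments, seed=0):
    """Count randomly placed rows that are not on the segment their join key hashes to."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    moved = 0
    for key in keys:
        current = rng.randrange(num_segments)    # where random distribution put the row
        target = segment_for(key, num_segments)  # where a hash join on the key needs it
        if current != target:
            moved += 1
    return moved


N = 4  # hypothetical segment count
keys = list(range(10_000))
moved = rows_to_move(keys, N)
print(f"{moved} of {len(keys)} rows ({moved / len(keys):.0%}) must cross segments")
```

With N segments, roughly (N - 1) / N of the rows end up on the wrong segment for the join, so the share of data that must move grows as the cluster gets larger.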

Replicated Distribution (REPLICATED)

How It Works

  • A complete copy of the table is stored on every segment node in the cluster.
  • Queries can read data directly from the local node without inter-node communication.

Key Advantages

  • Fast local reads and joins: Every segment holds the full table, so joining against it requires no data movement, leading to fast response times.

Use Cases

  • Small tables (typically under 100 MB), such as dimension tables, code lookup tables, or configuration tables.
  • Frequently queried small tables (e.g., charts of accounts, user role tables).
  • Dimension tables that join with many other tables (e.g., GIS reference tables).

Notes

  • High storage overhead: Data redundancy scales linearly with the number of nodes.
  • Lower write performance: All node copies must be updated synchronously, making it unsuitable for tables with frequent writes.
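
The linear storage growth noted above is simple arithmetic, sketched below. The function names are hypothetical, and the calculation deliberately ignores mirrors, compression, and indexes.

```python
def replicated_footprint_mb(table_mb, num_segments):
    """Total cluster storage when every segment keeps a full copy of the table."""
    return table_mb * num_segments


def distributed_footprint_mb(table_mb):
    """Total cluster storage when each row is stored on exactly one segment."""
    return table_mb


TABLE_MB = 100  # the rule-of-thumb upper bound for a replicated table
for segments in (4, 16, 64):
    total = replicated_footprint_mb(TABLE_MB, segments)
    print(f"{segments:>2} segments: replicated {total} MB, "
          f"hash/random {distributed_footprint_mb(TABLE_MB)} MB")
```

The same multiplier applies to writes: each inserted or updated row must be applied once per segment, which is why the policy suits small, rarely changing tables.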