Data Distribution Policies

YMatrix supports three core data distribution policies. The policy chosen for a table determines how its rows are stored across the cluster's segment nodes, which directly affects query performance and cluster stability.

Distribution Policy Overview

Distribution Method	Use Case	Advantages	Disadvantages
HASH	Large tables involved in equality joins on the distribution key	Rows with the same key value reside on the same node, enabling efficient equality joins	A low-cardinality or unevenly distributed key may cause data skew
RANDOM	Large tables that do not require joins and lack a suitable distribution key	More uniform data distribution, reducing the risk of skew	No predictable data placement; the optimizer cannot leverage distribution patterns for optimization
REPLICATED	Small tables that are infrequently updated but frequently joined or indexed	A full copy of the table exists on every node, so joins against it need no data movement	Consumes more storage space; every write is applied to every copy

Hash Distribution (HASH)

How It Works

One or more columns are specified as the distribution key when creating the table.
When inserting data, the system computes a hash value based on the distribution key and maps the row to a specific segment node.
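NULL values are hashed and assigned to a specific segment like any other value, so distribution behavior stays consistent. All rows whose distribution key is NULL therefore land on the same segment, and a key column with many NULLs can cause skew.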
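The mapping described above can be sketched in a few lines of Python. This is an illustration only: the SHA-256 based `segment_for` helper, the four-segment cluster, and the sample rows are assumptions made for the sketch, not the hash function or layout YMatrix uses internally.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    SHA-256 is a stand-in here; the database uses its own hash function.
    """
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


rows = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 102},
    {"order_id": 3, "customer_id": 101},
    {"order_id": 4, "customer_id": None},  # None plays the role of SQL NULL
    {"order_id": 5, "customer_id": None},
]

# Distribute on customer_id: each row goes to the segment its key hashes to.
placement = {}
for row in rows:
    seg = segment_for(row["customer_id"])
    placement.setdefault(seg, []).append(row["order_id"])

for seg in sorted(placement):
    print(f"segment {seg}: orders {placement[seg]}")
```

Orders 1 and 3 share a key and always end up together, and so do the two NULL-key orders 4 and 5. Which segment index they receive depends on the hash function and is not meaningful in itself.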

Key Advantages

Uniform data distribution: A high-cardinality distribution key prevents hotspots caused by excessive data on a single node.
Excellent query performance: Equality queries on the distribution key (WHERE distribution_key = value) can directly target the relevant segment without scanning other nodes.
Efficient JOINs: When tables are joined on a shared distribution key, matching rows already reside on the same segment, so inter-segment data movement is minimized and joins run faster.
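To make the join and lookup advantages concrete, the following sketch distributes two toy tables on the same key and joins them segment by segment. The table contents, the `distribute` and `join` helpers, and the SHA-256 stand-in hash are all hypothetical; the point is only that the per-segment results add up to the full join.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    # SHA-256 stands in for the database's real hash function.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(rows, key):
    """Split rows into per-segment lists by hashing the given column."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key])].append(row)
    return segments


def join(left, right, key):
    """Plain nested-loop equality join on one column."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


customers = [{"customer_id": c, "name": f"customer-{c}"} for c in range(1, 21)]
orders = [{"order_id": o, "customer_id": 1 + o % 20} for o in range(100)]

# Both tables use customer_id as the distribution key.
customer_segs = distribute(customers, "customer_id")
order_segs = distribute(orders, "customer_id")

# Each segment joins only its own slices; no rows cross segments.
local_results = []
for seg in range(NUM_SEGMENTS):
    local_results.extend(join(order_segs[seg], customer_segs[seg], "customer_id"))

global_results = join(orders, customers, "customer_id")
print(len(local_results), len(global_results))  # prints: 100 100

# An equality lookup on the key only needs to visit one segment.
target = segment_for(7)
hits = [o for o in order_segs[target] if o["customer_id"] == 7]
print(len(hits))  # prints: 5
```

If `orders` were distributed on `order_id` instead, the per-segment join would miss matches unless rows were moved first, which is the data movement a shared distribution key avoids.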

Use Cases

Most business tables (entity tables, fact tables), especially those with frequent queries and large data volumes.
Scenarios requiring frequent equality queries or multi-table JOINs (e.g., financial systems, e-commerce order systems).
Tables with a clear, high-cardinality query dimension.
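The preference for a high-cardinality key can be checked with a rough skew estimate before a table is created. The sketch below compares a hypothetical `order_id` column (one distinct value per row) with a hypothetical three-valued `status` column on an assumed eight-segment cluster, again using SHA-256 as a stand-in hash.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 8  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    # SHA-256 stands in for the database's real hash function.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def rows_per_segment(keys):
    """Count how many rows each segment would receive for these key values."""
    counts = Counter(segment_for(k) for k in keys)
    return [counts.get(seg, 0) for seg in range(NUM_SEGMENTS)]


def skew_ratio(counts):
    """Largest segment's row count divided by the mean; 1.0 is perfectly even."""
    mean = sum(counts) / len(counts)
    return max(counts) / mean


n_rows = 80_000
order_ids = range(n_rows)  # high cardinality: every value distinct
statuses = [("paid", "shipped", "cancelled")[i % 3] for i in range(n_rows)]

by_order_id = rows_per_segment(order_ids)
by_status = rows_per_segment(statuses)

print("order_id:", by_order_id, f"skew {skew_ratio(by_order_id):.2f}")
print("status:  ", by_status, f"skew {skew_ratio(by_status):.2f}")
```

With only three distinct values, `status` can occupy at most three of the eight segments, so at least five stay empty and the fullest one holds well over twice the average load. The `order_id` key stays close to a ratio of 1.0.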

Random Distribution (RANDOM)

How It Works

Each row is assigned to a segment node at insertion time, independent of its column values.
No distribution key is required; rows spread roughly evenly across nodes, but the placement of any given row is not predictable.
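A small simulation shows what random placement looks like in practice. It assumes each row is placed by an independent uniform draw over eight hypothetical segments; the real system may use a different scheme, so treat the numbers as illustrative.

```python
import random
from collections import Counter

NUM_SEGMENTS = 8   # hypothetical cluster size
N_ROWS = 100_000

rng = random.Random(42)  # fixed seed so the run is repeatable

# Each inserted row goes to a segment chosen without looking at its contents.
counts = Counter(rng.randrange(NUM_SEGMENTS) for _ in range(N_ROWS))
per_segment = [counts[seg] for seg in range(NUM_SEGMENTS)]

mean = N_ROWS / NUM_SEGMENTS
worst_deviation = max(abs(c - mean) for c in per_segment) / mean

print(per_segment)
print(f"worst deviation from the mean: {worst_deviation:.2%}")
```

The per-segment counts sit close to the mean but are not identical, and nothing about a row's values indicates which segment holds it.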

Key Advantages

Near-uniform data distribution regardless of the data's values, so hotspots are unlikely.
Simple configuration: No distribution key needs to be designed.

Use Cases

ETL temporary tables and intermediate processing tables.
Test tables or small tables with low query frequency.
Tables with no clear query dimension that do not need multi-table joins.

Notes

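Because a row's location is unrelated to its join key, JOINs on randomly distributed tables require rows to be redistributed (or broadcast) across nodes at query time, which lowers performance. Random distribution is therefore not suitable for core business tables.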
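The cost of that redistribution can be estimated with a short sketch. It assumes eight hypothetical segments, uniform random placement, and a SHA-256 stand-in for the hash that a join on `key` would use to co-locate rows.

```python
import hashlib
import random

NUM_SEGMENTS = 8   # hypothetical cluster size
N_ROWS = 20_000


def segment_for(key, num_segments=NUM_SEGMENTS):
    # SHA-256 stands in for the database's real hash function.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


rng = random.Random(7)  # fixed seed so the run is repeatable

# Randomly distributed table: the current segment is unrelated to the key.
random_placed = [(key, rng.randrange(NUM_SEGMENTS)) for key in range(N_ROWS)]

# Hash-distributed table: every row already sits where its key hashes to.
hash_placed = [(key, segment_for(key)) for key in range(N_ROWS)]


def rows_to_move(placed):
    """Rows whose current segment differs from the one a join on key needs."""
    return sum(1 for key, seg in placed if seg != segment_for(key))


moved_random = rows_to_move(random_placed)
moved_hash = rows_to_move(hash_placed)

print(f"random: {moved_random / N_ROWS:.1%} of rows must move")
print(f"hash:   {moved_hash / N_ROWS:.1%} of rows must move")
```

Under these assumptions roughly (n - 1) / n of the randomly placed rows are on the wrong segment, which is about 87.5% for eight segments, while the hash-distributed table needs no movement at all.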

Replicated Distribution (REPLICATED)

How It Works

A complete copy of the table is stored on every segment node in the cluster.
Queries can read data directly from the local node without inter-node communication.
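The effect of having a full copy on every segment can be sketched as follows. The `regions` dimension table, the `orders` fact table, and the four-segment layout are hypothetical; the fact table is deliberately distributed on a column other than the join column.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    # SHA-256 stands in for the database's real hash function.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# Small dimension table: region_id -> region name.
regions = {1: "north", 2: "south", 3: "east"}
orders = [{"order_id": o, "region_id": 1 + o % 3} for o in range(30)]

# The fact table is hash-distributed on order_id, not on the join column.
order_segs = [[] for _ in range(NUM_SEGMENTS)]
for row in orders:
    order_segs[segment_for(row["order_id"])].append(row)

# Replicated dimension: every segment holds the same full copy.
region_copies = [dict(regions) for _ in range(NUM_SEGMENTS)]

joined = []
for seg in range(NUM_SEGMENTS):
    for row in order_segs[seg]:
        # The lookup hits this segment's own copy; nothing crosses segments.
        joined.append((row["order_id"], region_copies[seg][row["region_id"]]))

print(len(joined))  # prints: 30
```

Every order finds its region locally even though the orders are not distributed on `region_id`, which is why replication suits small tables that are joined from many directions.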

Key Advantages

Fast local reads and joins: Each node answers from its own copy, so no data movement is required and response times are short.

Use Cases

Small tables (typically under 100 MB), such as dimension tables, code lookup tables, or configuration tables.
Frequently queried small tables (e.g., charts of accounts, user role tables).
Dimension tables that join with many other tables (e.g., GIS reference tables).

Notes

High storage overhead: Data redundancy scales linearly with the number of nodes.
Lower write performance: All node copies must be updated synchronously, making it unsuitable for tables with frequent writes.
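Both notes reduce to simple arithmetic, sketched below. The function names and the example sizes are illustrative, and the calculation ignores mirrors, compression, and indexes.

```python
def replicated_footprint_mb(table_mb, num_segments):
    """Total storage for a replicated table: one full copy per segment."""
    return table_mb * num_segments


def physical_row_writes(logical_rows, num_segments):
    """Row writes performed when every copy of a replicated table is updated."""
    return logical_rows * num_segments


TABLE_MB = 100  # hypothetical dimension table size

for segments in (4, 16, 64):
    total = replicated_footprint_mb(TABLE_MB, segments)
    writes = physical_row_writes(1_000, segments)
    print(f"{segments:>3} segments: {total} MB stored, "
          f"{writes} row writes per 1000 inserted rows")
```

A 100 MB table on a 64-segment cluster occupies 6400 MB in total, and inserting 1000 rows costs 64000 row writes, which is why replication is best kept to small, rarely updated tables.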