Data Distribution Policies
YMatrix supports three core data distribution policies. The policy chosen for a table determines how its rows are stored across cluster nodes, which directly affects query performance and cluster stability.
Distribution Policy Overview
| Distribution Method | Use Case | Advantages | Disadvantages |
| --- | --- | --- | --- |
| HASH | Large tables involved in equality joins on the distribution key | Rows with the same key value reside on the same node, enabling efficient equality joins | Poor randomness in key values may cause data skew |
| RANDOM | Large tables that do not require joins and lack a suitable distribution key | More uniform data distribution, reducing the risk of skew | No predictable data placement; the optimizer cannot leverage distribution patterns for optimization |
| REPLICATED | Small tables that are infrequently updated but frequently joined or indexed | A full copy of the table exists on every node | Consumes more storage space |
Hash Distribution (HASH)
How It Works
- One or more columns are specified as the distribution key when creating the table.
- When inserting data, the system computes a hash value based on the distribution key and maps the row to a specific segment node.
- NULL distribution key values are hashed like any other value, so all rows with a NULL key are consistently assigned to the same segment.
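The placement rule above can be sketched in a few lines of Python. This is an illustration only: the `segment_for` helper, the four-segment cluster, and the use of SHA-256 are assumptions made for the example, not YMatrix's internal hash function or API.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    """Map a distribution-key value to a segment index.

    SHA-256 is a stand-in hash; the hash YMatrix uses internally differs.
    """
    # None plays the role of SQL NULL and is hashed like any other value.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# (distribution key, payload) rows; two rows share key 101, two have a NULL key.
rows = [(101, "alice"), (102, "bob"), (101, "carol"), (None, "dave"), (None, "erin")]

# Group rows by the segment their distribution key hashes to.
placement = {}
for key, name in rows:
    placement.setdefault(segment_for(key), []).append((key, name))

for segment, stored in sorted(placement.items()):
    print(f"segment {segment}: {stored}")
```

Because the mapping depends only on the key value, both rows with key 101 end up in the same segment, and so do both NULL-key rows.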
Key Advantages
- Uniform data distribution: A high-cardinality distribution key prevents hotspots caused by excessive data on a single node.
- Excellent query performance: Equality queries on the distribution key (`WHERE column = value`) can directly target the relevant segment without scanning other nodes.
- Efficient JOINs: When joining multiple tables on the same distribution key, inter-segment data movement is minimized, improving join speed.
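To illustrate why equality lookups and joins on a shared distribution key need no inter-segment data movement, the following sketch distributes two small tables by the same key and joins them segment by segment. The table contents, the `distribute` helper, and the SHA-256 stand-in hash are hypothetical and chosen only for the example.

```python
import hashlib

NUM_SEGMENTS = 4  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    # Stand-in hash, not the function YMatrix uses internally.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def distribute(rows, key_index):
    """Place each row on the segment chosen by hashing its distribution key."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments


customers = [(1, "alice"), (2, "bob"), (3, "carol")]            # (customer_id, name)
orders = [(10, 1, 25.0), (11, 3, 40.0), (12, 1, 5.5), (13, 2, 12.0)]  # (order_id, customer_id, amount)

# Both tables use customer_id as the distribution key.
customer_segments = distribute(customers, key_index=0)
order_segments = distribute(orders, key_index=1)

# Join each segment's local rows only; no row is read from another segment.
joined = []
for local_customers, local_orders in zip(customer_segments, order_segments):
    names = dict(local_customers)
    for order_id, customer_id, amount in local_orders:
        # A KeyError here would mean matching rows were not co-located.
        joined.append((order_id, names[customer_id], amount))

# An equality filter on the distribution key only needs one segment.
target = segment_for(1)
orders_for_customer_1 = [o for o in order_segments[target] if o[1] == 1]
```

Every order finds its customer locally because equal key values hash to the same segment in both tables.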
Use Cases
- Most business tables (entity tables, fact tables), especially those with frequent queries and large data volumes.
- Scenarios requiring frequent equality queries or multi-table JOINs (e.g., financial systems, e-commerce order systems).
- Tables with a clear, high-cardinality query dimension.
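The emphasis on high-cardinality keys can be made concrete with a rough skew measurement. The sketch below compares a key with 10,000 distinct values against a key with only three. The `skew_ratio` metric, the eight-segment cluster, and the SHA-256 stand-in hash are assumptions for illustration, not a YMatrix utility.

```python
import hashlib
from collections import Counter

NUM_SEGMENTS = 8  # hypothetical cluster size
ROWS = 10_000


def segment_for(key, num_segments=NUM_SEGMENTS):
    # Stand-in hash, not the function YMatrix uses internally.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def skew_ratio(keys, num_segments=NUM_SEGMENTS):
    """Rows on the fullest segment divided by the ideal even share (1.0 is perfectly even)."""
    counts = Counter(segment_for(k, num_segments) for k in keys)
    ideal = len(keys) / num_segments
    return max(counts.values()) / ideal


# High-cardinality key: every row has its own order id.
order_ids = list(range(ROWS))

# Low-cardinality key: only three distinct status values.
statuses = ["NEW", "PAID", "SHIPPED"]
status_keys = [statuses[i % 3] for i in range(ROWS)]

high_cardinality_skew = skew_ratio(order_ids)
low_cardinality_skew = skew_ratio(status_keys)

print(f"order_id key skew: {high_cardinality_skew:.2f}")
print(f"status key skew:   {low_cardinality_skew:.2f}")
```

With three distinct values, at most three of the eight segments can hold data, so the fullest segment carries well over twice its even share regardless of the hash function used.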
Random Distribution (RANDOM)
How It Works
- Data is randomly assigned to segment nodes during insertion.
- No distribution key is required; the system automatically ensures even distribution across nodes.
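As a rough model of this behavior, the sketch below assigns rows to segments with a pseudo-random generator and ignores the row contents entirely. The exact assignment mechanism YMatrix uses is not described here, so `random.Random` and the eight-segment cluster are assumptions for illustration.

```python
import random
from collections import Counter

NUM_SEGMENTS = 8          # hypothetical cluster size
rng = random.Random(42)   # seeded only to make the sketch repeatable

# Every row carries the same status value, which would skew a hash distribution.
rows = [("PAID", i) for i in range(10_000)]

# Placement ignores the row contents entirely.
placement = [rng.randrange(NUM_SEGMENTS) for _ in rows]

counts = Counter(placement)
skew = max(counts.values()) / (len(rows) / NUM_SEGMENTS)
segments_used = len(counts)

print(f"rows per segment: {sorted(counts.items())}")
print(f"skew ratio: {skew:.2f}, segments used: {segments_used}")
```

Rows with identical values are scattered over all segments, which keeps the load even but means the location of any given value cannot be predicted.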
Key Advantages
- Near-uniform data distribution with no hotspots, regardless of the values in the data.
- Simple configuration: no distribution key needs to be designed.
Use Cases
- ETL temporary tables and intermediate processing tables.
- Test tables or small tables with low query frequency.
- Tables without clear query dimensions and no need for multi-table joins.
Notes
- JOINs involving randomly distributed tables require redistributing data across all nodes, which lowers join performance. This policy is therefore not suitable for core business tables.
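The cost of that redistribution can be estimated with a small simulation. The sketch below places rows at random, then counts how many would have to move so that equal join keys sit on the same segment. The hash-based target placement, the row data, and the helper name are assumptions for illustration; the plan a real optimizer chooses may differ.

```python
import hashlib
import random

NUM_SEGMENTS = 4          # hypothetical cluster size
rng = random.Random(7)    # seeded only to make the sketch repeatable


def hash_segment(key, num_segments=NUM_SEGMENTS):
    # Stand-in hash, not the function YMatrix uses internally.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# (customer_id, payload) rows, currently stored on randomly chosen segments.
rows = [(i % 50, f"row-{i}") for i in range(1_000)]
current_segment = [rng.randrange(NUM_SEGMENTS) for _ in rows]

# To join on customer_id, each row must reach the segment its key hashes to.
moved = sum(
    1
    for (customer_id, _), segment in zip(rows, current_segment)
    if hash_segment(customer_id) != segment
)
fraction_moved = moved / len(rows)

print(f"{moved} of {len(rows)} rows must move ({fraction_moved:.0%})")
```

A random placement matches the hash-determined segment with probability 1/N, so on average (N-1)/N of the rows move; the fraction grows as the cluster gets larger.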
Replicated Distribution (REPLICATED)
How It Works
- A complete copy of the table is stored on every segment node in the cluster.
- Queries can read data directly from the local node without inter-node communication.
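A minimal sketch of this idea: a small dimension table is copied to every segment, so a fact table distributed on an unrelated key can still be joined locally. The table contents, the three-segment cluster, and the SHA-256 stand-in hash are hypothetical.

```python
import hashlib

NUM_SEGMENTS = 3  # hypothetical cluster size


def segment_for(key, num_segments=NUM_SEGMENTS):
    # Stand-in hash, not the function YMatrix uses internally.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


# Small dimension table: region_id -> region name.
regions = {1: "north", 2: "south", 3: "east"}

# Fact table rows: (order_id, region_id, amount), distributed by order_id.
orders = [(100, 1, 9.5), (101, 2, 20.0), (102, 1, 3.0), (103, 3, 7.25), (104, 2, 1.0)]

# Replicated distribution: each segment keeps its own full copy of the dimension table.
region_copies = [dict(regions) for _ in range(NUM_SEGMENTS)]

order_segments = [[] for _ in range(NUM_SEGMENTS)]
for order in orders:
    order_segments[segment_for(order[0])].append(order)

# Each segment joins its local orders against its local copy of regions.
joined = []
for local_regions, local_orders in zip(region_copies, order_segments):
    for order_id, region_id, amount in local_orders:
        joined.append((order_id, local_regions[region_id], amount))
```

The join key (`region_id`) is not the fact table's distribution key, yet no rows move, because every lookup is answered by the local copy.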
Key Advantages
- Extremely high single-node query performance: No data movement is required, leading to fast response times.
Use Cases
- Small tables (typically under 100 MB), such as dimension tables, code lookup tables, or configuration tables.
- Frequently queried small tables (e.g., charts of accounts, user role tables).
- Dimension tables that join with many other tables (e.g., GIS reference tables).
Notes
- High storage overhead: Data redundancy scales linearly with the number of nodes.
- Lower write performance: All node copies must be updated synchronously, making it unsuitable for tables with frequent writes.
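The two costs above scale with cluster size in a simple way, as the following back-of-the-envelope sketch shows. The function names are hypothetical, and the model deliberately ignores compression, mirroring, and per-row overhead.

```python
def replicated_footprint_mb(table_size_mb, num_segments):
    """Total storage used when every segment keeps a full copy of the table."""
    return table_size_mb * num_segments


def physical_writes(logical_writes, num_segments):
    """Each logical write must be applied to the copy on every segment."""
    return logical_writes * num_segments


TABLE_MB = 100  # the small-table guideline mentioned above

for segments in (4, 16, 64):
    footprint = replicated_footprint_mb(TABLE_MB, segments)
    writes = physical_writes(1_000, segments)
    print(f"{segments:>2} segments: {footprint} MB stored, {writes} physical writes per 1000 updates")
```

Under this model, a 100 MB table on a 64-segment cluster occupies 6,400 MB in total, which is why the policy is best reserved for small, rarely updated tables.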