Depending on the characteristics of each storage engine, you can flexibly create tables to suit different scenario requirements. The following examples illustrate this.
MARS2 tables depend on the `matrixts` extension. Before creating a table, you first need to create this extension in the database that will use the storage engine.
> ***Notes!***
The `matrixts` extension takes effect at the database level: it needs to be created only once per database, and repeated creation is unnecessary.

```sql
CREATE EXTENSION matrixts;
```
When creating a table, use `USING MARS2` to specify the storage engine. The basic table-creation syntax is as follows:

```sql
CREATE TABLE mars2(
    time timestamp with time zone,
    tag_id int,
    i4 int4,
    i8 int8
) USING MARS2;
```
We extend this into the following Internet of Vehicles (IoV) time-series example for further explanation.
```sql
CREATE TABLE vehicle_basic_data_mars2(
    daq_time timestamp encoding (minmax),
    vin varchar(32) COLLATE "C" encoding (minmax),
    lng float encoding (minmax),
    lat float encoding (minmax),
    speed float,
    license_template varchar(16),
    flag integer
) USING MARS2
WITH (compresstype=zstd, compresslevel=3)
DISTRIBUTED BY (vin)
PARTITION BY RANGE (daq_time)
(
    START ('2022-07-01 00:00:00') INCLUSIVE
    END ('2022-08-01 00:00:00') EXCLUSIVE
    EVERY (INTERVAL '1 day'),
    DEFAULT PARTITION others
);
```
After creating a MARS2 table, you must also create an index of type `mars2_btree` before the table can be read from and written to. The purpose of index sorting is to place data of the same dimension, or with similar characteristics, as physically close together as possible, reducing the number of I/O seeks and improving query efficiency. The choice of sort keys should therefore match the dominant query pattern of your business. For example, in a time-series scenario, if the requirement is single-device point queries, the sort key should be the device identifier (vin); if the requirement is detail queries, aggregation queries, or multi-device queries within a time range, the sort key should be the device identifier plus the timestamp (vin, daq_time).
```sql
CREATE INDEX idx_mars2
    ON vehicle_basic_data_mars2
    USING mars2_btree(vin, daq_time)
    WITH (uniquemode = true);
```
Based on the specific parameters and clauses in the example table above, we offer the following best-practice guidance:
> ***Notes!***
We hope you will base your designs on these suggestions, but do not apply them blindly. Time-series scenarios vary endlessly, and analyzing your specific situation remains essential.
1. Select the sort key
When designing a table, you must consider the choice of sort keys. We recommend selecting the field that identifies the data subject; if there is a timestamp, add it alongside the identifier field as part of the sort key. In the example table this is `(vin, daq_time)`.
The sort key is part of the index, so queries on the data subject can locate the storage position through the index and avoid a full-table scan. This optimizes queries based on the data subject and stores the data in subject order: data from the same subject is stored contiguously, which reduces the number of I/O operations and improves query speed, as the query below illustrates.
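For example, a point query on a single device within a time range can be answered through the `mars2_btree` index instead of a full-table scan (an illustrative query; the VIN value is made up):

```sql
-- Point query on one device; the sort key (vin, daq_time)
-- lets the index locate the matching rows directly.
SELECT daq_time, lng, lat, speed
FROM vehicle_basic_data_mars2
WHERE vin = 'LSV0TEST000000001'
  AND daq_time >= '2022-07-05 00:00:00'
  AND daq_time <  '2022-07-06 00:00:00';
```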
2. encoding (minmax)
This option is used to improve query performance.
First, we recommend adding the minmax option to all sort keys: because sort keys are ordered, minmax filtering works well even during a full-table scan.
Second, add this option to fields used in aggregate queries such as min, max, avg, sum, and count, or in `WHERE` filter conditions.
Whether to add this option should be decided from the business scenario. Do not add it to every field indiscriminately, as that incurs substantial CPU and disk I/O overhead when writing data.
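As an illustration, the following aggregation over one day can use the minmax metadata on `daq_time`, `lng`, and `lat` to skip blocks whose value ranges cannot match (a sketch against the example table):

```sql
-- Range filter plus min/max aggregates; minmax block
-- metadata lets the scan skip non-matching blocks.
SELECT min(lng), max(lng), min(lat), max(lat)
FROM vehicle_basic_data_mars2
WHERE daq_time >= '2022-07-01 00:00:00'
  AND daq_time <  '2022-07-02 00:00:00';
```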
3. COLLATE "C"
This option is added only on the device-identifier field; no other field needs it. It improves sorting and query speed for text types.
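For instance, sorting on the `vin` column under the "C" collation compares raw bytes rather than applying locale-aware rules (an illustrative query):

```sql
-- Byte-wise comparison under COLLATE "C" makes sorting
-- and filtering on vin cheaper than locale-aware collation.
SELECT DISTINCT vin
FROM vehicle_basic_data_mars2
ORDER BY vin
LIMIT 10;
```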
4. USING MARS2
This is the fixed syntax for specifying a MARS2 table; do not change its wording or letter case.
5. WITH (compresstype=zstd, compresslevel=3)
This specifies zstd as the compression algorithm; rle_type, zlib, and lz4 are also available. For this table we recommend zstd, with the compression level set to 3. The example uses a general-purpose compression scheme; for custom compression schemes, see "2 Using Compression" below.
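As a sketch, a table could instead trade compression ratio for faster writes with a lighter algorithm (illustrative settings only, not a recommendation for this workload):

```sql
-- Hypothetical variant: lz4 typically compresses faster than
-- zstd but yields a lower compression ratio. Like any MARS2
-- table, it would still need a mars2_btree index before use.
CREATE TABLE vehicle_basic_data_lz4(
    daq_time timestamp encoding (minmax),
    vin varchar(32) COLLATE "C" encoding (minmax),
    speed float
) USING MARS2
WITH (compresstype=lz4)
DISTRIBUTED BY (vin);
```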
6. DISTRIBUTED BY (vin)
Use this clause to select the distribution key.
The distribution key determines how data is distributed across the data nodes (segments), which has a major impact on performance. The main selection criteria are:
(1) Avoid data skew, so that data is evenly distributed and queries can run in parallel as fully as possible;
(2) For multi-table joins, keep the join keys consistent with the distribution keys from the start of the design to avoid data movement.
In the example we use the device-identifier field as the distribution key, so that querying and computing over a single device's data avoids the performance loss of redistributing data between nodes, as the sketch below shows.
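For instance, if a dimension table is distributed on the same key, a join on `vin` can be executed locally on each segment (the `vehicle_info` table here is hypothetical):

```sql
-- Hypothetical dimension table sharing the distribution key.
CREATE TABLE vehicle_info(
    vin varchar(32) COLLATE "C",
    model varchar(64)
) DISTRIBUTED BY (vin);

-- The join key matches both distribution keys, so no rows
-- need to be redistributed between segments.
SELECT b.vin, i.model, avg(b.speed) AS avg_speed
FROM vehicle_basic_data_mars2 b
JOIN vehicle_info i ON b.vin = i.vin
GROUP BY b.vin, i.model;
```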
7. PARTITION BY RANGE (daq_time)
This specifies the partition key of the partitioned table.
We recommend choosing time as the partition key. On the one hand, it can work with [automatic partition management](/doc/5.0/datamodel/apm) to expire old data automatically; on the other hand, partition pruning can filter out the time partitions a query is not interested in.
We recommend sizing partitions at roughly one million to ten million rows each. Too few rows per partition produces too many partitions and consumes memory; too many rows per partition filters poorly and makes data expiration inconvenient.
In the example we use the time at which the device collected the data as the partition key, because a large share of queries filter on collection time. For example, to query the last day's data and run calculations on it, you add the filter condition `WHERE daq_time >= CURRENT_DATE - INTERVAL '1 day'`, so the database can quickly determine which partition sub-tables hold the data and locate it, as shown below.
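An illustrative query that benefits from partition pruning on `daq_time`:

```sql
-- Only the daily partitions overlapping the last day are
-- scanned; all other partitions are pruned.
SELECT vin, count(*) AS samples, avg(speed) AS avg_speed
FROM vehicle_basic_data_mars2
WHERE daq_time >= CURRENT_DATE - INTERVAL '1 day'
GROUP BY vin;
```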
8. START ('2022-07-01 00:00:00') INCLUSIVE END ('2022-08-01 00:00:00') EXCLUSIVE
The START...END clause, together with the INCLUSIVE and EXCLUSIVE keywords, creates sub-partition tables covering the range from midnight on 2022-07-01 (inclusive) up to, but not including, midnight on 2022-08-01.
9. EVERY (INTERVAL '1 day')
Each sub-partition table spans 1 day. Besides day, you can also use hour, month, year, and so on, chosen according to data volume. For example, if your server receives one million to ten million rows per day, '1 day' is the best interval; if daily volume is only in the hundreds of thousands of rows, '1 month' is very suitable; and if daily volume is merely in the tens of thousands, one sub-partition per year ('1 year') is enough. A monthly variant is sketched below.
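A sketch of a lower-volume table partitioned by month instead of by day (the table and its columns are hypothetical):

```sql
-- Monthly partitions keep each partition near the recommended
-- 1M-10M rows when daily volume is in the hundreds of thousands.
CREATE TABLE vehicle_events_monthly(
    daq_time timestamp encoding (minmax),
    vin varchar(32) COLLATE "C" encoding (minmax),
    event varchar(64)
) USING MARS2
DISTRIBUTED BY (vin)
PARTITION BY RANGE (daq_time)
(
    START ('2022-01-01 00:00:00') INCLUSIVE
    END ('2023-01-01 00:00:00') EXCLUSIVE
    EVERY (INTERVAL '1 month'),
    DEFAULT PARTITION others
);

-- A MARS2 table still needs its mars2_btree index before use.
CREATE INDEX idx_events_monthly
    ON vehicle_events_monthly
    USING mars2_btree(vin, daq_time);
```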
10. DEFAULT PARTITION others
This defines the default partition. Data that cannot be matched to any other partition's sub-table is stored in this partition.
> ***Notes!***
When a device reports multiple rows for the same time point in a batch, MARS2 can merge rows from the same device (the `vin` value in this table) at the same time (the `daq_time` value in this table). The merge feature requires that `uniquemode=true` be specified explicitly when creating the index, because this option defaults to `false`. For example, with `uniquemode=true`, if device 'A01' reports 3 rows at '2022-01-01 00:00:00', the earlier two rows are overwritten by the last one and only a single row is kept; with the default `uniquemode=false`, all 3 rows reported by device 'A01' at '2022-01-01 00:00:00' are kept without any processing.
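A sketch of the merge behavior under `uniquemode=true` (hypothetical rows inserted into the example table):

```sql
-- Three rows share the same (vin, daq_time) key; with
-- uniquemode=true on the mars2_btree index, the last
-- write wins and a single row remains.
INSERT INTO vehicle_basic_data_mars2 (daq_time, vin, speed)
VALUES ('2022-07-01 08:00:00', 'A01', 60.0),
       ('2022-07-01 08:00:00', 'A01', 61.0),
       ('2022-07-01 08:00:00', 'A01', 62.0);

-- Expected to return one row with speed = 62.0.
SELECT daq_time, vin, speed
FROM vehicle_basic_data_mars2
WHERE vin = 'A01';
```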
## 2 HEAP table / AO table
### 2.1 HEAP table
HEAP is the default storage engine in YMatrix: if you do not specify a storage engine when creating a table, the table you create is a HEAP table.
```sql
CREATE TABLE disk_heap(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
) DISTRIBUTED BY (tag_id);
```
### 2.2 AO table
#### 2.2.1 AORO table
The AORO table is the row-oriented AO (append-optimized) table. Row storage is the default orientation for AO tables.
AORO supports table-level compression, not column-level compression.
```sql
CREATE TABLE disk_aoro(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
) WITH (appendonly=true, orientation=row, compresstype=zlib, compresslevel=4)
DISTRIBUTED BY (tag_id);
```
#### 2.2.2 AOCO table
The AOCO table is the column-oriented AO table.
AOCO supports table-level compression and column-level compression.
```sql
CREATE TABLE disk_aoco(
    time timestamp with time zone,
    tag_id int ENCODING(compresstype=zlib, compresslevel=5),
    read float,
    write float
) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=6)
DISTRIBUTED BY (tag_id);
```
### 2.3 Best practices for choosing between them
The best practices for deciding between HEAP and AO tables are:
- Use the HEAP storage engine for tables and partitions that receive frequent updates, deletes, and single-row inserts
- Use the HEAP storage engine for tables and partitions that receive concurrent updates, deletes, and inserts
- Use the AO storage engine for tables and partitions that are rarely updated after the initial load and afterwards receive only batch operations. **Never perform single-row updates, deletes, or inserts on an AO table. Concurrent batch inserts are acceptable, but never perform concurrent batch updates or deletes.**
- Space occupied by updated and deleted rows in an AO table is not reclaimed and reused as effectively as in a HEAP table, so the AO storage engine is not suitable for frequently updated tables. It is designed for large tables that are loaded once, rarely updated, and frequently queried analytically, as sketched below.
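As an illustration of the intended AO workflow, data is loaded in large batches rather than row by row (the archive table here is hypothetical; `disk_heap` is the HEAP example above):

```sql
-- Hypothetical archive table: AO storage suits bulk-loaded,
-- rarely updated, analytics-oriented data.
CREATE TABLE disk_ao_history(
    time timestamp with time zone,
    tag_id int,
    read float,
    write float
) WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (tag_id);

-- Move a whole batch out of the HEAP working table in one statement.
INSERT INTO disk_ao_history
SELECT time, tag_id, read, write
FROM disk_heap
WHERE time < now() - INTERVAL '7 days';
```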