Data Modeling

In any database, the data follows a certain model, such as relational model, KV model, etc. The hyper-converged timing database MatrixDB, based on the relational model, adds extensible data types, making it convenient for relationship operations, and is not limited by the fixed relationship model, which is both efficient and flexible.

Please refer to [MatrixDB Data Modeling and Spatial Distribution Model] for the teaching video of this course (https://www.bilibili.com/video/BV133411r7WH?share_source=copy_web)

1. Data Paradigm

To access timing data, you must first know that the timing data contains the following information:

  • Equipment Information
  • Indicator information

Therefore, when doing data access, MatrixDB needs the following data table:

  • Device information table, namely tags table
  • Equipment indicator table

Among them, the device information table includes device ID and other attributes, such as name, location, etc. The indicator table is used to store the device ID and all indicators. Two tables

The main reason why two tables need to be separated rather than exist together is:

  • The value of the equipment attribute field is relatively fixed, unlike the indicators that change in real time, there is no need to save a separate copy for each indicator, so that the design is more in line with the database design paradigm
  • The time series data access is large, and storing device attributes into IDs to identify them can reduce the network bandwidth consumed when sending data
  • Attribute fields such as device name are usually variable-length strings, which are not conducive to subsequent compression.

This design conforms to the relationship model design paradigm, but introduces a new question: how to generate device IDs. The following two methods are usually used:

  • Use the auto-increment primary key value of the device information table
  • Generate through device information hashing

2. Example

The following is an example of how to access timing data. Suppose the speed, temperature and humidity of the fan are to be collected. Fan equipment includes attributes such as name, number, area, coordinates, etc.

2.1 Data table

According to the above method, you need to prepare the following two tables:

  1. tags table, that is, fan information table
Field Type Description
tag_id int tag_id is the only id of the fan, and use the auto-increment primary key (SERIAL type)
name text fan name
serial_number text fan serial number
region text Flower belongs to the area
longitude float4 fan longitude
latitude float4 fan latitude
  1. Fan index table
Field Type Description
tag_id int fan id
time timestamp timestamp of time series data
speed float4 fan speed
temperature float4 fan temperature
humidity float4 fan humidity

2.2 Create databases and tables

2.2.1 Creating a database

Like other databases, in MatrixDB, data tables are organized by libraries, so first create the database stats.

You can connect to the default database postgres and then create it using CREATE DATABASE:

[mxadmin@mdw ~]$ psql postgres
psql (12)
Type "help" for help.

postgres=# CREATE DATABASE stats;
CREATE DATABASE

You can also directly create using MatrixDB's command line tool createb:

[mxadmin@mdw ~]$ createdb stats

2.2.2 Connect to the database

After creating the database, you have to connect it to use it. In MatrixDB, each connection can only belong to one fixed library.

You can connect to the database parameters by executing the psql command:

[mxadmin@mdw ~]$ psql stats
psql (12)
Type "help" for help.

stats=#

You can also switch to the connection to the new database from other database connections using the \c meta command:

postgres=# \c stats
You are now connected to database "stats" as user "mxadmin".

2.2.3 Create a table

After connecting to the database, start creating data tables:

  1. Create tags table:
    CREATE TABLE tags(
     tag_id serial,
     name text,
     serial_number text,
     region text,
     longitude float4,
     latitude float4
    )
    Distributed replicated;
  2. Create a metric table:
    CREATE TABLE metrics(
     time timestamp with time zone,
     tag_id int,
     speed float4,
     temperature float4,
     humidity float4
    )
    Distributed by (tag_id);

    From the above table creation statement, you will find that the distribution methods of the two tables are different.

  • The tags table is used as Distributed replicated
  • metrics is used as Distributed by (tag_id)

because:

  • The device information is relatively small, and connections are often required, so the distribution method is set to copy
  • The data volume of the index table is large, and tag_id is used as the distribution key, so that the data of the same device is distributed on the same node

For a more detailed introduction to data distribution, please refer to [Data Distribution Model] (/doc/latest/datamodel/data_distribute)

  1. Create a unique index for the tag table:

Because the device information in the tag table must be unique, it is necessary to establish a unique index for the information identifying the unique attributes of the device to facilitate the application to maintain the device information. Here it is assumed that the device number is used to mark the uniqueness of the device, so a unique index is established as follows:

CREATE UNIQUE INDEX ON tags(serial_number);

2.2.4 Write data

When the application party is accessing the indicator data, it needs to follow the following steps:

  • Convert device ID to device ID
  • Send device ID and metric information to MatrixDB

So how do you convert the device ID into an ID? The device ID is not an attribute of the device itself and will not be sent with the indicator data. It needs to be converted by the application party.

The easiest way is to check the library according to the device number every time, and read it directly if it exists; insert a new device and generate an ID if it does not exist.

This method requires that each metric be checked before inserting it, greatly increasing the database load and affecting throughput. Therefore, the recommended method is to maintain a memory hash table on the application side. When each piece of data arrives, query the device ID from the memory hash table. If it exists, it will be used directly; if it does not exist, it will be inserted into the library to obtain the device ID and update the hash table. Applicant

2.3 Data Query

Since the device information and indicator data are stored separately in two tables. When doing query, if you want to obtain both indicator statistics and device information, you need to make connections based on the device ID.

Data Connection

The following SQL statistics on the average temperature, average humidity and maximum speed of each fan on '2021-07-01':

SELECT tags.name,
        AVG(metrics.temperature),
        AVG(metrics.humidity),
        MAX(metrics.speed)
    FROM tags JOIN metrics USING (tag_id)
    WHERE metrics.time >= '2021-07-01'
        AND metrics.time < '2021-07-02'
    GROUP BY tags.tag_id, tags.name;

3. Extensible fields

From the above introduction, we can see that MatrixDB is a relational database and uses the paradigm of relational databases when modeling data. Although the relationship model is easy to use, it also faces some challenges when doing timing acquisition. Consider the following scenarios:

  • Too many metrics to collect, exceeding the maximum 1600 column limit for PostgreSQL
  • The collection index sets of different models of equipment vary greatly, resulting in a large number of column values ​​being NULL when returning data
  • The indicator set cannot be predicted, that is, the table schema may change frequently

In response to the above problem, the MatrixDB technical team developed mxkv custom data type and provided a kv key-value storage structure, which perfectly solved the problem in the above scenario.

As shown in the figure below, mxkv is a key-value type, and any number of key-values ​​can be stored internally. There is no limit on the number of keys, and new keys can be added arbitrarily. For uncertain indicators, just store them in the mxkv field. The actual storage space overhead depends on the number and size of the key value.

mxkv

For use of mxkv, please refer to mxkv