File Ingestion

File ingestion refers to importing existing text data files into a target table. Text files are typically in CSV format.

Create a CSV file:

$ vi rows.csv

The test CSV file has the following format. The first row contains column names, and subsequent rows contain data. The three columns are of timestamp, integer, and string types respectively:

time,c1,c2
2021-01-01 00:00:00,1,a1
2021-01-01 00:00:00,2,a2
2021-01-01 00:00:00,3,a3

The target database is test, and the schema of the target table dest matches the CSV file format:

=# CREATE TABLE dest(
    time timestamp,
    c1 int,
    c2 text
)USING MARS3
 DISTRIBUTED BY(c1)
 ORDER BY(time,c1);

Below are several common methods for importing file contents into the target table:


1. COPY

COPY is a built-in SQL command in YMatrix that imports data files located on the master node (Master) into the target table.

First, connect to the target database using psql, then execute the COPY command:

[mxadmin@mdw ~]$ psql test
psql (12)
Type "help" for help.

test=# COPY dest FROM '/home/mxadmin/rows.csv' DELIMITER ',' HEADER;
COPY 3

The DELIMITER parameter specifies the field delimiter. The HEADER option indicates that the header line should be skipped during import.

The COPY method is simple to use, but has the following limitations:

  • The data file must reside on the Master node.
  • Data is inserted through the Master, which performs similarly to batch INSERT operations. This can lead to performance issues with large datasets.

2. MatrixGate

MatrixGate uses YMatrix's high-speed data ingestion tool mxgate to import data.

Compared to COPY, mxgate offers better performance for large volumes of data and allows deployment of data files independent from the Master node.

$ tail -n +2 rows.csv | mxgate --source stdin --db-database test --db-master-host localhost --db-master-port 5432 --db-user mxadmin --time-format raw --target public.dest --parallel 2 --delimiter ',' 

The above command uses a pipe to redirect the data portion of rows.csv (starting from the second line) into mxgate. mxgate ingests the data from stdin into the dest table in the test database.

Note!
Since mxgate does not require a file header during ingestion, the tail -n +2 command is used to output the file content starting from the second line.

For more information about mxgate, refer to the documentation.