Define a new external table.
CREATE [READABLE] EXTERNAL [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION ('file://seghost[:port]/path/file' [, ...])
| ('gpfdist://filehost[:port]/file_pattern[#transform=trans_name]'
[, ...]
| ('gpfdists://filehost[:port]/file_pattern[#transform=trans_name]'
[, ...])
| ('pxf://path-to-data?PROFILE=profile_name[&SERVER=server_name][&custom-option=value[...]]'))
| ('s3://S3_endpoint[:port]/bucket_name/[S3_prefix] [region=S3-region] [config=config_file | config_server=url]')
[ON MASTER]
FORMAT 'TEXT'
[( [HEADER]
[DELIMITER [AS] 'delimiter' | 'OFF']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CSV'
[( [HEADER]
[QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] 'escape']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CUSTOM' (Formatter=<formatter_specifications>)
[ OPTIONS ( key 'value' [, ...] ) ]
[ ENCODING 'encoding' ]
[ [LOG ERRORS [PERSISTENTLY]] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE [READABLE] EXTERNAL WEB [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION ('http://webhost[:port]/path/file' [, ...])
| EXECUTE 'command' [ON ALL
| MASTER
| number_of_segments
| HOST ['segment_hostname']
| SEGMENT segment_id ]
FORMAT 'TEXT'
[( [HEADER]
[DELIMITER [AS] 'delimiter' | 'OFF']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CSV'
[( [HEADER]
[QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE NOT NULL column [, ...]]
[ESCAPE [AS] 'escape']
[NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
[FILL MISSING FIELDS] )]
| 'CUSTOM' (Formatter=<formatter specifications>)
[ OPTIONS ( key 'value' [, ...] ) ]
[ ENCODING 'encoding' ]
[ [LOG ERRORS [PERSISTENTLY]] SEGMENT REJECT LIMIT count
[ROWS | PERCENT] ]
CREATE WRITABLE EXTERNAL [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION('gpfdist://outputhost[:port]/filename[#transform=trans_name]'
[, ...])
| ('gpfdists://outputhost[:port]/file_pattern[#transform=trans_name]'
[, ...])
FORMAT 'TEXT'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF'] )]
| 'CSV'
[([QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE QUOTE column [, ...]] | * ]
[ESCAPE [AS] 'escape'] )]
| 'CUSTOM' (Formatter=<formatter specifications>)
[ OPTIONS ( key 'value' [, ...] ) ]
[ ENCODING 'write_encoding' ]
[ DISTRIBUTED BY ({column [opclass]}, [ ... ] ) | DISTRIBUTED RANDOMLY ]
CREATE WRITABLE EXTERNAL [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
LOCATION('s3://S3_endpoint[:port]/bucket_name/[S3_prefix] [region=S3-region] [config=config_file | config_server=url]')
[ON MASTER]
FORMAT 'TEXT'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF'] )]
| 'CSV'
[([QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE QUOTE column [, ...]] | * ]
[ESCAPE [AS] 'escape'] )]
CREATE WRITABLE EXTERNAL WEB [TEMPORARY | TEMP] TABLE table_name
( column_name data_type [, ...] | LIKE other_table )
EXECUTE 'command' [ON ALL]
FORMAT 'TEXT'
[( [DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[ESCAPE [AS] 'escape' | 'OFF'] )]
| 'CSV'
[([QUOTE [AS] 'quote']
[DELIMITER [AS] 'delimiter']
[NULL [AS] 'null string']
[FORCE QUOTE column [, ...]] | * ]
[ESCAPE [AS] 'escape'] )]
| 'CUSTOM' (Formatter=<formatter specifications>)
[ OPTIONS ( key 'value' [, ...] ) ]
[ ENCODING 'write_encoding' ]
[ DISTRIBUTED BY ({column [opclass]}, [ ... ] ) | DISTRIBUTED RANDOMLY ]
CREATE EXTERNAL TABLE
or CREATE EXTERNAL WEB TABLE
creates a new readable external table definition in a YMatrix database. Readable external tables are commonly used for fast parallel data loading. After defining an external table, you can query its data directly (or in parallel) using SQL commands. For example, you can select, join, or sort data from external tables. Views can also be created for external tables. DML operations (UPDATE, INSERT, DELETE, or TRUNCATE) are not allowed on readable external tables, and indexes cannot be created on them.
CREATE WRITABLE EXTERNAL TABLE
or CREATE WRITABLE EXTERNAL WEB TABLE
creates a new writable external table definition in a YMatrix database. Writable external tables are commonly used to offload data from the database to a set of files or named pipes. Writable external web tables can also be used to output data to executable programs. Writable external tables can also serve as output targets for YMatrix parallel MapReduce computations. Once a writable external table is defined, data can be selected from database tables and inserted into the writable external table. Writable external tables only allow INSERT operations—SELECT, UPDATE, DELETE, or TRUNCATE are not permitted.
The main difference between regular external tables and external web tables is their data source. Regular readable external tables access static flat files, while external web tables access dynamic data sources—on web servers or through executing OS commands or scripts.
READABLE | WRITABLE: Specifies the type of external table. The default is readable. Readable external tables are used for loading data into YMatrix. Writable external tables are used for unloading data.
WEB: Creates a readable or writable external web table definition in YMatrix. Readable external web tables can access files via http://
protocol or data via OS commands. Writable external web tables output data to executable programs that can accept input data streams. External web tables cannot be rescanned during query execution.
TEMPORARY | TEMP: If specified, creates a temporary readable or writable external table definition in YMatrix. Temporary external tables reside in a special schema. The schema name cannot be specified when creating the table. Temporary external tables are automatically dropped at the end of the session. While the temporary table exists, the current session will not see any existing permanent table with the same name unless the permanent table is referenced by its schema-qualified name.
table_name: The name of the new external table.
column_name: The name of the column to be created in the external table definition. Unlike regular tables, external tables do not have column constraints or defaults, so do not specify them.
LIKE other_table: The LIKE
clause specifies a table from which the new external table will automatically copy all column names, data types, and YMatrix distribution policies. Any column constraints or default values specified in the original table are not copied to the new external table definition.
data_type: The data type of the column.
LOCATION ('protocol://[host[:port]]/path/file' [, ...]): For readable external tables, specifies the URI of the external data source used to populate the external or web table. Regular readable external tables allow gpfdist
or file
protocols. External web tables allow http
protocol. If the port is omitted, the port for http
and gpfdist
protocols is assumed to be 8080. For gpfdist
protocol, the path is relative to the directory from which gpfdist
serves files (specified when starting the gpfdist
program). Additionally, gpfdist
can use wildcards or other C-style pattern matching (e.g., space characters as [[:space:]]
) to represent multiple files in a directory.
ON MASTER: Restricts all table-related operations to the YMatrix master. Only readable and writable external tables created with s3
or custom protocols are allowed. gpfdist
, gpfdists
, pxf
, and file
protocols do not support ON MASTER
.
EXECUTE 'command' [ON ...]: Allowed only for readable external web tables or writable external tables. For readable external web tables, specifies the OS command to be executed by segment instances. The command
can be a single OS command or script. The ON
clause specifies which segment instances will execute the given command.
Default is ON ALL
. The command will be executed by every active primary instance on all segment hosts in the YMatrix database system. If the command executes a script, the script must be located in the same path on all segment hosts and executable by the YMatrix superuser (gpadmin
).
ON MASTER
runs the command only on the master host.
ON number
indicates that the command will be executed on the specified number of segments. Specific segments are randomly selected by the YMatrix database system at runtime. If the command executes a script, the script must be located in the same path on all segment hosts and executable by the YMatrix superuser (gpadmin
).
HOST
indicates that the command will be executed once per segment host (once per host, regardless of the number of active primary instances on each host).
HOST segment_hostname
indicates that the command will be executed by all active primary instances on the specified segment host.
SEGMENT segment_id
indicates that the command will be executed once by the specified segment. You can determine the ID of a segment instance by examining the content
number in the system catalog table gp_segment_configuration
. The content ID of the YMatrix master database is always -1.
For writable external tables, the command
specified in the EXECUTE
clause must be prepared to receive data. Since all segments sending data will write their output to the specified command or program, the only available option for the ON
clause is ON ALL
.
FORMAT 'TEXT | CSV' (options): When the FORMAT
clause identifies delimited text (TEXT) or comma-separated values (CSV) format, the formatting options are similar to those available for the PostgreSQL COPY command. If the data in the file does not use the default column delimiter, escape character, null string, etc., you must specify additional formatting options so that the YMatrix database can correctly read the data from the external file.
FORMAT 'CUSTOM' (formatter=formatter_specification): Specifies a custom data format. formatter_specification
specifies the function used to format the data, followed by comma-separated parameters for the formatting function. The length of the formatter specification (including the string Formatter=) can be up to approximately 50K bytes.
DELIMITER: Specifies an ASCII character used to separate columns within each row (line) of data. The default is the tab character in TEXT mode and comma in CSV mode. In TEXT mode for regular readable external tables, the delimiter can be set to OFF for special cases where non-structured data is loaded into a single-column table.
NULL: Specifies the string that represents a NULL value. In TEXT mode, the default is \N
(backslash-N), and in CSV mode, the default is an unquoted empty value. For cases where you do not want to distinguish between NULL values and empty strings, you may prefer an empty string even in TEXT mode. Any data item matching this string in external tables and web tables will be interpreted as a NULL value.
ESCAPE: Specifies a single character used for C-style escape sequences (e.g., \n
, \t
, \100
) and to escape data characters that might otherwise be interpreted as row or column delimiters. Ensure that the escape character chosen is not used in the actual column data. For TEXT-formatted files, the default escape character is \
(backslash), and for CSV-formatted files, it is "
. However, you can specify another character to represent the escape. You can also disable escaping in TEXT-formatted files by specifying the value 'OFF'
for the escape character. This is useful for data such as web log data in TEXT format with many embedded backslashes that are not intended as escapes.
NEWLINE: Specifies the newline character used in the data file—LF
(Line Feed, 0x0A), CR
(Carriage Return, 0x0D), or CRLF
(Carriage Return plus Line Feed, 0x0D 0x0A). If not specified, the YMatrix database segment will detect the newline type by looking at the first line of data received and using the first encountered newline type.
HEADER: For readable external tables, specifies that the first row of the data file is a header row (containing the names of the table columns) and should not be included as part of the table's data. If multiple data source files are used, all files must have a header row.
QUOTE: Specifies the quotation character for CSV mode. The default is the double quote ("
).
FORCE NOT NULL: In CSV mode, processing each specified column is as if it were referenced, so it is not a NULL value. For the default empty string in CSV mode (not existing between two delimiters), this will cause the missing value to be evaluated as a zero-length string.
FORCE QUOTE: In writable external tables using CSV mode, forces quotes around all non-NULL values in each specified column. If *
is specified, non-NULL values are quoted in all columns. NULL output is never quoted.
FILL MISSING FIELDS: For readable external tables, when FILL MISSING FIELDS
is specified in TEXT and CSV modes, if a row of data is missing or the end of a row of data is missing, the missing trailing field value is set to NULL (instead of reporting an error). Blank lines, fields with NOT NULL constraints, and delimiters at the end of the line will still report an error.
OPTIONS key 'value'[, key 'value' ...]: Optional parameters. When the protocol is used as an external table protocol for an external table, the parameter and value specification is set as a key-value pair for a custom data access protocol. The custom data access protocol is responsible for processing and validating key-value pairs.
ENCODING 'encoding': Character set encoding for external tables. Specify string constants (e.g., 'SQL_ASCII'), integer-encoded numbers, or DEFAULT to use the default client encoding.
LOG ERRORS [PERSISTENTLY]: This optional clause can record information about rows with malformed formatting before the SEGMENT REJECT LIMIT clause. Error log information is stored internally and can be accessed through the built-in SQL function gp_read_error_log()
of the YMatrix database.
SEGMENT REJECT LIMIT count [ROWS | PERCENT]: Runs the COPY FROM operation in single-line error isolation mode. If input lines are incorrectly formatted, they will be discarded as long as the reject limit count is not reached on any YMatrix segment instance during the load operation. The rejection limit count can be specified as a number of rows (default) or a percentage of the total number of rows (1-100). If PERCENT is used, each segment starts calculating the percentage of wrong rows only after the number of rows specified by the parameter gp_reject_percent_threshold
has been processed. The default value of gp_reject_percent_threshold
is 300 rows. Constraint errors (such as violations of NOT NULL, CHECK, or UNIQUE constraints) will still be processed in "all or nothing" input mode. If the limit is not reached, all correct rows are loaded and all wrong rows are discarded. Note: When reading an external table, if SEGMENT REJECT LIMIT is not triggered first or is not specified, the YMatrix database limits the number of initial rows that may contain malformed errors. If the first 1000 lines are rejected, the COPY operation will stop and roll back. The limit on the initial rejected row count can be changed using the YMatrix database server configuration parameter gp_initial_bad_row_limit
.
DISTRIBUTED BY ({column [opclass]}, [ ... ]) | DISTRIBUTED RANDOMLY: A YMatrix database distribution policy for declaring writable external tables. By default, writable external tables are randomly distributed. If the source table from which the data is exported has a hash distribution policy and the same distribution key column and operator class opclass
are defined for the writable external table, the need to move rows across the interconnect is eliminated, thereby improving unload performance. When you issue an unload command such as INSERT INTO wex_table SELECT * FROM source_table
, if both tables have the same hash distribution policy, these unloaded rows can be sent directly from the segment to the output location.
Start the gpfdist
file server program in the background on port 8081 to serve files in directory /var/data/staging
:
gpfdist -p 8081 -d /var/data/staging -l /home/mxadmin/log &
Create a readable external table named ext_customer
using the gpfdist
protocol and all text format files (*.txt
) found in the gpfdist
directory. Use vertical bar (|
) as the column delimiter and blank as NULL to format the file. Also access the external table in single-line error isolation mode:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.txt' )
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
LOG ERRORS SEGMENT REJECT LIMIT 5;
Create the same readable external table definition as above, but using CSV format:
CREATE EXTERNAL TABLE ext_customer
(id int, name text, sponsor text)
LOCATION ( 'gpfdist://filehost:8081/*.csv' )
FORMAT 'CSV' ( DELIMITER ',' );
Using the file
protocol and some CSV format files with header rows, create a readable external table called ext_expenses
:
CREATE EXTERNAL TABLE ext_expenses (name text, date date,
amount float4, category text, description text)
LOCATION (
'file://seghost1/dbfast/external/expenses1.csv',
'file://seghost1/dbfast/external/expenses2.csv',
'file://seghost2/dbfast/external/expenses3.csv',
'file://seghost2/dbfast/external/expenses4.csv',
'file://seghost3/dbfast/external/expenses5.csv',
'file://seghost3/dbfast/external/expenses6.csv'
)
FORMAT 'CSV' ( HEADER );
Create a readable external web table, and each table host executes a script once:
CREATE EXTERNAL WEB TABLE log_output (linenum int, message
text) EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
FORMAT 'TEXT' (DELIMITER '|');
Create a writable external table named sales_out
that uses gpfdist
to write output data to a file named sales.out
. Use vertical bar (|
) as the column delimiter and blank as NULL to format the file:
CREATE WRITABLE EXTERNAL TABLE sales_out (LIKE sales)
LOCATION ('gpfdist://etl1:8081/sales.out')
FORMAT 'TEXT' ( DELIMITER '|' NULL ' ')
DISTRIBUTED BY (txn_id);
Create a writable external web table, pipe the output data received by segment to an executable script named to_adreport_etl.sh
:
CREATE WRITABLE EXTERNAL WEB TABLE campaign_out
(LIKE campaign)
EXECUTE '/var/unload_scripts/to_adreport_etl.sh'
FORMAT 'TEXT' (DELIMITER '|');
Use the writable external table defined above to unload selected data:
INSERT INTO campaign_out SELECT * FROM campaign WHERE
customer_id=123;
When you specify the LOG ERRORS
clause, the YMatrix database captures an error that occurs when reading external table data. You can view and manage captured error log data.
Use the built-in SQL function gp_read_error_log('table_name')
. It requires SELECT privileges on table_name
.
If error log data exists in the specified table, new error log data is appended to the existing error log data. Error log information is not copied to mirror.
Use the built-in SQL function gp_truncate_error_log('table_name')
to delete the error log data of table_name
. It requires table owner privileges.
When multiple YMatrix external tables are defined using the gpfdist
, gpfdists
, or file
protocols and accessing the same named pipe in the Linux system, YMatrix restricts access to the named pipe to a single reader. If the second reader tries to access the named pipe, an error is returned.
CREATE EXTERNAL TABLE
is a YMatrix database extension. The SQL standard does not provide for external tables.