june 30, 2023
Loading data from external file sources with gpfdist
Greenplum provides extensive functionality for working with data in external sources, and the source, volume and purpose of use must be taken into account when choosing how to access the data.
To download large amounts of data from files located on remote hosts, the Greenplum documentation recommends using gpfdist.What is it and why it is a good choice will be discussed in this post.
First we will have a short overview of external tables, gpfdist and gpload. We will go over the Best Practices for loading data and finally give an example of the loading algorithm.
External tables
An external table is a table whose metadata is stored in the database, but whose data itself is outside the database. External tables allow you to work with sources outside Greenplum as if they were tables in the database.
In Greenplum, an external table is created using the CREATE EXTERNAL TABLE sentence with the LOCATION and FORMAT parameters.
In the case of gpfdist, the LOCATION will look like this
LOCATION(‘gpfdist://<hostname>:[<port>]/<file_pattern>’)
When using a secure connection, gpfdist changes to gpfdists.
gpfdist can use file masks. For example, if file_pattern is specified as *.txt, all files with the .txt extension will be loaded.
When more than one file corresponds to a mask, the loading is sequential, one file at a time.
The FORMAT parameter defines the format of the source file and can take the following values: CSV, TEXT, CUSTOM. When CSV or TEXT is specified, the text formatting characteristics can be specified further in parentheses:
FORMAT ‘TEXT’
[( [HEADER]
[DELIMITER [AS] ‘<delimiter>’ | ‘OFF’]
[NULL [AS] ‘<null string>’]
[ESCAPE [AS] ‘<escape>’ | ‘OFF’]
[NEWLINE [ AS ] ‘LF’ | ‘CR’ | ‘CRLF’]
[FILL MISSING FIELDS] )]
| ‘CSV’
[( [HEADER]
[QUOTE [AS] ‘<quote>’]
[DELIMITER [AS] ‘<delimiter>’]
[NULL [AS] ‘<null string>’]
[FORCE NOT NULL <column> [, …]]
[ESCAPE [AS] ‘<escape>’]
[NEWLINE [ AS ] ‘LF’ | ‘CR’ | ‘CRLF’]
[FILL MISSING FIELDS] )]
| ‘CUSTOM’ (Formatter=<<formatter_specifications>>)
[ ENCODING ‘<encoding>’ ]
[ [LOG ERRORS [PERSISTENTLY]] SEGMENT REJECT LIMIT <count>
[ROWS | PERCENT] ]
- HEADER — the presence of a header line
- DELIMITER — character used to separate data. By default, for CSV it is a comma; for TEXT it is a tab character. When loading unformatted text into a single-column table, you can specify OFF.
- NULL — character used to represent NULL values. By default, for CSV — unquoted empty string; for TEXT — \N
- ESCAPE is a symbol used to escape sequences: \n, \t, \100, etc. The default is backslash.
- NEWLINE is a line feed character. By default it is defined by the first line. The following options are possible: CR, LF, CRLF
- QUOTE — text limiter in CSV. Double quotes by default.
- FORCE NOT NULL — In CSV, listed columns are treated as if they were enclosed in quotes. The default NULL value, a blank string without quotes, will be interpreted as an empty string.
- FILL MISSING FIELD — Fills the missing fields at the end of the line with NULL values instead of raising an error.
ENCODING allows you to specify the encoding of the source file. A list of character sets supported in Greenplum is provided at:
https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-character_sets.html
External table documentation is available for review at:
gpfdist
- -d — local path to the directory
- -p — the port through which the utility will work (by default 8080)
- -l is an optional parameter that specifies where to write the logs.
A single instance of gpfdist transfers data at up to 200 MB/s and supports up to 64 connections from Greenplum segments by default.
The number of segments served by a single gpfdist instance is controlled by the configuration parameter gp_external_max_segs, whose value can be changed in the postgresql.conf file on the Greenplum master note. The value of this parameter should always be an even number multiple of the number of gpfdist instances.
Figure 1 Example of two gpfdist instances running on different network interfaces of the same ETL host
With gpfdist it is possible to perform data transformation — for this you need to write a transformation specification using any suitable tool, create a YAML file with the configuration and pass it in the utility parameters. This allows you to work with data formats such as JSON or XML, for example.
When reading compressed files with the extensions .gz, .bz2 or .zst, the files are decompressed «on the fly». However, it should be noted that compressed files are not supported on Windows platforms.
The parameters of the utility can be found in the documentation at
https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/utility_guide-ref-gpfdist.html
gpload
Since gpload is essentially a wrapper for gpfdist, we won’t give it too much attention within the scope of this article and will only mention the main points.
The following steps are automated with gpload:
- Running gpfdist processes
- Create a temporary external table based on the data in the source file and the format specified in the control file.
- INSERT, UPDATE or MERGE operations to update data in the target table
- Deleting a temporary external table
- Ending gpfdist processes.
Best Practices
- Remove indexes before loading. Creating an index is faster than incremental updates when adding rows. You can increase the value of the maintenance_work_mem configuration parameter by the time the index is built, if necessary. However, this may affect the loading speed. Delete and create indexes only when end users are not working with the system.
- If loading was done to a hip table, use VACUUM. Also VACUUM should be used on system tables. Be sure to use ANALYZE on the updated table. Up-to-date statistics allow the optimizer to make the best decisions, which has a major impact on query performance.
- If an error occurs during the download, use VACUUM to free up disk space.
- Disable the automatic collection of statistics at boot time by setting the configuration parameter gp_autostats_mode to NONE.
- Distribute the data as evenly as possible among the hosts and the gpfdist instances running on them, to prevent a performance hit due to a single overloaded item.
Download Example
- Place the files in different directories, for example, example_new_1.txt will be located in /var/load_1, and example_new_2.txt in /var/load_2
- Run gpfdist instances on different ports in each directory.
gpfdist -d /var/load_1 -p 8081
CREATE EXTERNAL TABLE ext_fact_example (like dim_goods_test)
LOCATION (
‘gpfdist://192.168.1.10:8081/example_new_csv_1.txt’,
‘gpfdist://192.168.1.10:8082/example_new_csv_2.txt’
)
FORMAT ‘TEXT’ (
HEADER
DELIMITER AS ‘|’
NULL »
NEWLINE ‘CRLF’
)
ENCODING ‘UTF8’;
3. Create an external table
Assume the data are in a prepared form, so when you create an external table, you can copy the structure of the target table using like
4. Disable automatic collection of statistics by setting the configuration parameter gp_autostats_mode to NONE
5. Delete existing indexes on the target table
DROP INDEX fact_example_id_idx
6. Insert data into the target table
INSERT INTO fact_example
SELECT * FROM ext_fact_example
7. Return database to pre-boot state: restore gp_autostats_mode value, recreate deleted indexes.
CREATE INDEX fact_example_id_idx ON fact_example (id)
8. If the download was to a hip table, you should run VACUUM on it. You should also use VACUUM on system tables. It is not necessary to use VACUUM on append-oriented tables. ANALYZE should be used on all tables.
9. In the case of loading a large amount of data, it is worth checking whether there is a skew.
This can be done with the following query:
SELECT gp_segment_id, count(*)
FROM fact_example
GROUP BY gp_segment_id
ORDER BY 1;
Conclusion
So, in this article we discussed how to use gpfdist and, just as importantly, why you should use it.
We hope that we were able to give an insight into the mechanisms of external tables, gpfdist and the principles of loading large amounts of data into Greenplum.
Sources
https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/utility_guide-ref-gpfdist.html
https://greenplum.docs.pivotal.io/6-4/best_practices/data_loading.html
https://gpdb.docs.pivotal.io/6-3/ref_guide/character_sets.html
https://gpdb.docs.pivotal.io/6-3/utility_guide/ref/gpload.html