
June 30, 2023

Loading data from external file sources with gpfdist

Greenplum provides extensive functionality for working with data in external sources; the source, volume and intended use of the data must all be taken into account when choosing how to access it.

To load large amounts of data from files located on remote hosts, the Greenplum documentation recommends using gpfdist. What it is and why it is a good choice will be discussed in this post.

First we give a short overview of external tables, gpfdist and gpload. We then go over the best practices for loading data and finally walk through an example loading algorithm.

 

External tables

An external table is a table whose metadata is stored in the database, but whose data itself is outside the database. External tables allow you to work with sources outside Greenplum as if they were tables in the database.

In Greenplum, an external table is created using the CREATE EXTERNAL TABLE statement with the LOCATION and FORMAT parameters.

In the case of gpfdist, the LOCATION clause will look like this:

LOCATION('gpfdist://<hostname>:[<port>]/<file_pattern>')

When using a secure connection, gpfdist changes to gpfdists. 

gpfdist can use file masks. For example, if file_pattern is specified as *.txt, all files with the .txt extension will be loaded.

When more than one file corresponds to a mask, the loading is sequential, one file at a time.
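
For instance, a minimal external table that reads every .txt file served by a single gpfdist instance might look like this (the table and host names are illustrative):

CREATE EXTERNAL TABLE ext_sales (LIKE sales)
LOCATION ('gpfdist://etl-host:8080/*.txt')
FORMAT 'TEXT' (DELIMITER '|');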

The FORMAT parameter defines the format of the source file and can take the following values: CSV, TEXT or CUSTOM. When CSV or TEXT is specified, the text formatting characteristics can be specified further in parentheses:

FORMAT 'TEXT'
       [( [HEADER]
          [DELIMITER [AS] '<delimiter>' | 'OFF']
          [NULL [AS] '<null string>']
          [ESCAPE [AS] '<escape>' | 'OFF']
          [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
          [FILL MISSING FIELDS] )]
      | 'CSV'
       [( [HEADER]
          [QUOTE [AS] '<quote>']
          [DELIMITER [AS] '<delimiter>']
          [NULL [AS] '<null string>']
          [FORCE NOT NULL <column> [, ...]]
          [ESCAPE [AS] '<escape>']
          [NEWLINE [ AS ] 'LF' | 'CR' | 'CRLF']
          [FILL MISSING FIELDS] )]
      | 'CUSTOM' (Formatter=<formatter_specifications>)
[ ENCODING '<encoding>' ]
  [ [LOG ERRORS [PERSISTENTLY]] SEGMENT REJECT LIMIT <count>
  [ROWS | PERCENT] ]

  • HEADER — the presence of a header line.
  • DELIMITER — character used to separate data. By default, for CSV it is a comma; for TEXT it is a tab character. When loading unformatted text into a single-column table, you can specify OFF.
  • NULL — string used to represent NULL values. By default, for CSV it is an unquoted empty string; for TEXT it is \N.
  • ESCAPE — character used for escape sequences such as \n, \t, \100. The default is a backslash.
  • NEWLINE — the end-of-line sequence. By default it is determined from the first line of data. Possible values: LF, CR, CRLF.
  • QUOTE — quotation character for CSV. Double quotes by default.
  • FORCE NOT NULL — in CSV, the listed columns are treated as if they were enclosed in quotes, so the default NULL value (an unquoted empty string) is interpreted as an empty string rather than NULL.
  • FILL MISSING FIELDS — fills missing fields at the end of a line with NULL values instead of raising an error.
 
When selecting the CUSTOM format, specify the function used for the conversion in parentheses and list its parameters separated by commas.

ENCODING allows you to specify the encoding of the source file. A list of character sets supported in Greenplum is provided at: 

https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-character_sets.html 

With the SEGMENT REJECT LIMIT option you can set the number (absolute or as a percentage) of rows per segment in which loading errors are allowed. If the number of errors exceeds the set value, the transaction is cancelled.
If the LOG ERRORS SEGMENT REJECT LIMIT construct is used, errors that occur are recorded and available for viewing with gp_read_error_log(). The PERSISTENTLY keyword preserves the error information after the external table has been dropped.
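
For example, a CSV external table that tolerates up to 10 bad rows per segment and logs them could be declared as follows (the table and host names are illustrative):

CREATE EXTERNAL TABLE ext_orders (LIKE orders)
LOCATION ('gpfdist://etl-host:8080/orders*.csv')
FORMAT 'CSV' (HEADER DELIMITER ',' NULL '')
LOG ERRORS SEGMENT REJECT LIMIT 10 ROWS;

-- after a load attempt, inspect the rejected rows
SELECT * FROM gp_read_error_log('ext_orders');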
 
 

gpfdist

 gpfdist is a file-server utility that uses the HTTP protocol to transfer data in parallel to or from Greenplum. The advantage of gpfdist is that all segments of the Greenplum cluster are used in parallel to work with external sources.
 
Installation only requires copying the utility to the ETL server from which the data will be loaded and adding gpfdist to the PATH variable. After that, you just run gpfdist against the directory with the files you want to make available:
gpfdist -d <directory with data files> -p <connection port> -l <path to log file> 
Where:
  • -d — local path to the directory with data files
  • -p — the port the utility will listen on (8080 by default)
  • -l — an optional parameter that specifies where to write the logs
 
Done! The running gpfdist instance now listens on the specified port, makes the files in the directory available to external tables, and logs its operation.
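
For example, to serve files from /var/load on port 8081 and write logs to /var/log/gpfdist.log (all paths are illustrative):

gpfdist -d /var/load -p 8081 -l /var/log/gpfdist.log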
 
 It is important to note that gpfdist is compatible within one major version of Greenplum — gpfdist for Greenplum 5.x.x will not work with Greenplum 6.x.x.
Hereinafter, an instance will be understood as one running process of the gpfdist utility, and a host as a server dedicated to ETL processes.

 A single instance of gpfdist transfers data at up to 200 MB/s and supports up to 64 connections from Greenplum segments by default.

The number of segments served by a single gpfdist instance is controlled by the configuration parameter gp_external_max_segs, whose value can be changed in the postgresql.conf file on the Greenplum master node. The value of this parameter should always be an even multiple of the number of gpfdist instances.
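
Assuming a standard installation, one way to change the parameter is with the gpconfig utility followed by a configuration reload:

gpconfig -c gp_external_max_segs -v 64
gpstop -u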

gpfdist instances can run on multiple hosts, and multiple instances can run on each host; with proper design this achieves high load and unload speeds by maximizing network bandwidth and Greenplum parallelism.

 

Figure 1. Example of two gpfdist instances running on different network interfaces of the same ETL host

With gpfdist it is also possible to transform data during loading: you write a transformation program using any suitable tool, describe it in a YAML configuration file, and pass that file in the utility's parameters. This allows you to work with data formats such as JSON or XML.
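
A minimal sketch of such a configuration, assuming a hypothetical input_transform.sh script that converts XML files to pipe-delimited text:

---
VERSION: 1.0.0.1
TRANSFORMATIONS:
  xml_to_text:
    TYPE: input
    COMMAND: /bin/bash input_transform.sh %filename%

The file is then passed to the utility as gpfdist -c <config.yaml>, and the transformation is referenced in the external table's LOCATION clause by appending #transform=xml_to_text to the URI.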

When reading compressed files with the extensions .gz, .bz2 or .zst, the files are decompressed "on the fly". However, it should be noted that compressed files are not supported on Windows platforms.

The full list of the utility's parameters can be found in the documentation at:

https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/utility_guide-ref-gpfdist.html  

 

gpload

gpload is a Python script that automates ETL tasks using gpfdist and external tables.

 Since gpload is essentially a wrapper for gpfdist, we won’t give it too much attention within the scope of this article and will only mention the main points.

The following steps are automated with gpload:

  1. Starting gpfdist processes
  2. Creating a temporary external table based on the data in the source file and the format specified in the control file
  3. Performing INSERT, UPDATE or MERGE operations to update data in the target table
  4. Dropping the temporary external table
  5. Stopping the gpfdist processes
 
All gpload actions take place within a single transaction. 
To start gpload you need to write a control file in YAML 1.1 format.
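A minimal sketch of such a control file, with all connection details, paths and table names hypothetical:

VERSION: 1.0.0.1
DATABASE: mydb
USER: gpadmin
HOST: mdw
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl-host
         PORT: 8081
         FILE:
           - /var/load/*.txt
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: public.fact_example
    - MODE: insert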
The structure of the control file and a description of the parameters are given in the documentation at:

https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/utility_guide-ref-gpload.html
 

Best Practices

There are some recommendations for improving the performance of loading and of subsequent queries:
 
  • Remove indexes before loading: creating an index on already-loaded data is faster than updating it incrementally as rows are added. If necessary, you can raise the maintenance_work_mem configuration parameter for the duration of index creation; note, however, that this may affect loading speed. Drop and create indexes only when end users are not working with the system (see the snippet after this list).
  • If loading was done into a heap table, use VACUUM afterwards. VACUUM should also be used on the system catalog tables. Be sure to use ANALYZE on the updated table: up-to-date statistics allow the optimizer to make the best decisions, which has a major impact on query performance.
  • If an error occurs during loading, use VACUUM to free up disk space.
  • Disable automatic statistics collection during loading by setting the configuration parameter gp_autostats_mode to NONE.
  • Distribute the data as evenly as possible among the hosts and the gpfdist instances running on them, to prevent a performance hit due to a single overloaded component.
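
A minimal sketch of the index recommendation, with hypothetical table and index names:

SET maintenance_work_mem = '1GB';               -- optionally raise for the index build
DROP INDEX my_table_id_idx;                     -- drop before loading
-- ... load the data ...
CREATE INDEX my_table_id_idx ON my_table (id);  -- recreate after loading
ANALYZE my_table;                               -- refresh optimizer statistics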
 

Loading Example

Suppose there is a table fact_example with an index fact_example_id_idx. We need to load into it data from two text files, example_new_1.txt and example_new_2.txt, both located on the same host with the address 192.168.1.10.
The files have a header line, NULL values are represented by the empty string '', the delimiter is the '|' character, the CR+LF sequence is used for line breaks, and the encoding is UTF-8.
The loading steps are as follows:
  1. Place the files in different directories; for example, example_new_1.txt in /var/load_1 and example_new_2.txt in /var/load_2.
  2. Run a gpfdist instance in each directory, each on its own port:

gpfdist -d /var/load_1 -p 8081

gpfdist -d /var/load_2 -p 8082
 

   3. Create an external table

Assume the data is already in the required form, so when creating the external table you can copy the structure of the target table using LIKE:

CREATE EXTERNAL TABLE ext_fact_example (LIKE fact_example)
LOCATION (
'gpfdist://192.168.1.10:8081/example_new_1.txt',
'gpfdist://192.168.1.10:8082/example_new_2.txt'
)
FORMAT 'TEXT' (
HEADER
DELIMITER AS '|'
NULL ''
NEWLINE 'CRLF'
)
ENCODING 'UTF8';

   4. Disable automatic collection of statistics by setting the configuration parameter gp_autostats_mode to NONE
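
For the current session, this can be done as follows:

SET gp_autostats_mode TO 'none';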

   5. Delete existing indexes on the target table

 

 

DROP INDEX fact_example_id_idx;

   6. Insert data into the target table

 

INSERT INTO fact_example
SELECT * FROM ext_fact_example;

   7. Return the database to its pre-load state: restore the gp_autostats_mode value and recreate the deleted indexes.
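
If the parameter was changed only at session level, it can be restored like this:

RESET gp_autostats_mode;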

 

 

CREATE INDEX fact_example_id_idx ON fact_example (id);

   8. If the load was into a heap table, you should run VACUUM on it. You should also use VACUUM on the system catalog tables. It is not necessary to use VACUUM on append-optimized tables. ANALYZE should be used on all tables.
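
Assuming fact_example is a heap table:

VACUUM fact_example;
ANALYZE fact_example;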

   9. When loading a large amount of data, it is worth checking whether the data is skewed across segments.

This can be done with the following query:

 

 

 

SELECT gp_segment_id, count(*)
FROM fact_example
GROUP BY gp_segment_id
ORDER BY 1;
