june 08, 2023
Uploading XML files to Greenplum
This article will describe the approach to loading xml-files into Greenplum. The task was to develop a data warehouse in Greenplum and to load fact and reference data directly from xml-files, since the data were provided in xml-files.
Many people store data in xml files, so the need to load data from files of this type is a common task.
This article suggests an approach where the files are located directly on the database cluster and there is no need to use the gpfdist utility. The functions pg_read_file and pg_read_binary_file exist for reading files.
The pg_read_file function returns a text file fragment. If the offset and length parameters are omitted, the entire contents of the file are returned. Bytes read from the file are treated as server encoded characters.
The pg_read_binary_file function is similar to pg_read_file, but its result is of type bytea; consequently, no encoding checks are performed. Combined with convert_from, this function can be used to read files in any encoding.
To get an xml type value from a text string the xmlparse function is used.
The approach used was the getxmldocument function, which returns an xml-type value and the input parameters are the relative path to the file and the encoding name.
CREATE OR REPLACE FUNCTION public.getxmldocument(p_filename varchar, charset varchar)
RETURNS xml
LANGUAGE sql
SECURITY DEFINER
VOLATILE
AS $$
SELECT
XMLPARSE(DOCUMENT convert_from(
pg_read_binary_file($1), $2));
$$
EXECUTE ON ANY;
To use functions like pg_read_file, you must be a root. If you are not, ask your administrator for this role, if for internal reasons you cannot use it, ask the superuser to create a function similar to the one above and give other users rights to use it. To learn more about roles, read the short Managing Roles and Privileges documentation section.
After writing a function to get the data from the file, in this case in the format type xml, you can already carry out specific actions, depending on the task, for example:
1) Create a temporary table;
2) Load the data into the temporary table using the load from file function above;
3) If the file has a complex structure, you may have to use subqueries;
4) Perform some actions with the data from the temporary table.
Since Greenplum is based on PosgreSQL, it has the same XML functionality: it queries XML documents using XPath 1.0 language.
In this work we will write data to a table. The functions unnest and xpath will be used for this purpose. The xpath function returns an array of XML values corresponding to the set of nodes obtained by calculating the XPath expression. And unnest expands the array into a set of strings (the array elements are read in storage order).
The structure of the XML file for the example:
<?xml version=»1.0″ encoding=»UTF-8″?>
<root><success>1</success><type>warehouse</type><count>31</count>
<warehouses>
<warehouse>
<id>1</id>
<number/>
<name>Organic Thames, Backerstreet 31</name>
</warehouse>
<warehouse>
<id>2</id>
<number/>
<name>Backerstreet 31, quarantine</name>
</warehouse>
…
<warehouse>
<id>24</id>
<number>1</number>
<name>THAMESPROM</name>
</warehouse>
…
</warehouses></root>
An example of a table load is shown below.
INSERT INTO tempstore(
store_id,
«name»,
store_number
)
select
cast(cast((xpath(‘//id/text()’,wayp.pt))[1] as varchar) as int) as store_id,
cast((xpath(‘//name/text()’,wayp.pt))[1] as varchar(500)) as «name»,
cast((xpath(‘//number/text()’,wayp.pt))[1] as varchar(100)) as store_number
from(
select
unnest(xpath(‘//warehouse’,g)) as pt
from getxmldocument(‘virtualpos/stores.xml’,‘windows-1251’) as g) as wayp;
This download was split into a subquery due to the presence of empty number items in the document.
This tutorial has demonstrated how to load an xml file into a table in Greenplum. The full steps of data loading and distribution in Greenplum are not shown here, because the purpose of the article is to show a possible variant of data loading from a file.