October 14, 2025
DWH ETL process optimisation: improving performance tenfold
When developing a data warehouse (DWH) for business analytics, one of the key tasks is the efficient extraction of data from various sources, such as other databases and CSV, HTML, and JSON files. Given the large volumes of data received daily, it is critically important to optimise ETL execution time: in practice, business users need up-to-date data available at the start of the working day.
Data extraction challenges
In the process of implementing a DWH on platforms such as MS SQL Server, PostgreSQL, or Greenplum, issues arose related to the increasing time required to extract data from external database systems with which the DWH interacts via server-to-server connections. When such connections are used to retrieve reference or “lightweight” data (tables or datasets up to 5 million rows), there are no performance problems. However, when extracting business activity data exceeding 50 million rows, data integration performance decreases significantly.
Practical cases
1. MS SQL Server DWH – PostgreSQL as the data source
A DWH built on the MS SQL Server platform, as part of its scheduled update policy, connects daily between 00:00 and 03:00 to a PostgreSQL source server. During each session it retrieves over 80 million rows of transaction data, which consumes approximately 1.5–2 hours of the total integration window.
2. Greenplum DWH – Oracle as the data source
In the case of a Greenplum-based DWH, the system performs a daily extraction of more than 150 million rows of goods movement data from an Oracle source server. This process takes around 3 hours to complete.
Development of a data integration driver
Recognising that the standard integration tooling used with MS SQL Server, PostgreSQL, and Greenplum (server-to-server connections as well as external tools such as Airflow, Spark, and NiFi) was insufficient for efficient data integration, a decision was made to develop a universal solution.
A driver written in C# was created to integrate data from various sources (Oracle, PostgreSQL, MS SQL Server, Greenplum) into target database tables. It is built around four key concepts aimed at optimising integration performance (a simplified sketch of how they fit together follows the list):
- Multithreading;
- Asynchronous processing;
- In-memory data handling;
- Batch insertion.
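The actual driver code is not published here, but the following minimal C# sketch shows how these four ideas could fit together, assuming a bounded channel as the hand-off between one asynchronous reader and several insert workers; all names (EtlPipeline, RunAsync, the delegate parameters) are illustrative rather than taken from the real implementation.

```csharp
// Illustrative pipeline skeleton (not the actual driver code): one asynchronous
// producer reads the source in batches, several workers bulk-insert the batches
// into the target. The producer/consumer delegates are sketched further below.
using System;
using System.Data;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class EtlPipeline
{
    public static async Task RunAsync(
        Func<ChannelWriter<DataTable>, Task> readBatchesAsync, // source-read sketch below
        Func<DataTable, Task> bulkInsertAsync,                 // bulk-load sketch below
        int insertWorkers = 4,
        int maxBufferedBatches = 8)
    {
        // A bounded channel caps memory use: the reader pauses whenever the
        // insert workers fall behind by more than maxBufferedBatches batches.
        var channel = Channel.CreateBounded<DataTable>(maxBufferedBatches);

        // Producer: stream rows from the source and publish them as batches.
        var producer = Task.Run(async () =>
        {
            await readBatchesAsync(channel.Writer);
            channel.Writer.Complete();
        });

        // Consumers: each worker drains batches and performs one bulk insert per batch.
        var workers = Enumerable.Range(0, insertWorkers)
            .Select(_ => Task.Run(async () =>
            {
                await foreach (var batch in channel.Reader.ReadAllAsync())
                {
                    await bulkInsertAsync(batch);
                    batch.Dispose(); // release the in-memory batch promptly
                }
            }))
            .ToList();

        workers.Add(producer);
        await Task.WhenAll(workers);
    }
}
```

Multithreading comes from the parallel insert workers, asynchrony from the non-blocking reads and writes, the channel keeps the batches in memory, and each DataTable hand-off corresponds to one insertion batch.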
Driver operation mechanism
The driver is deployed locally on the server hosting the DWH and is integrated into the database as a callable stored procedure. The following parameters are passed to it (see the parameter sketch after the list):
- Local path to the driver executable file;
- Type of destination DBMS (Oracle / PostgreSQL / MS SQL Server / Greenplum);
- Type of source DBMS (Oracle / PostgreSQL / MS SQL Server / Greenplum);
- Connection string to the source database;
- SQL query for extracting the required dataset;
- Connection string to the target database;
- Name of the target table for dataset insertion;
- Number of rows to insert in a single integration batch.
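On MS SQL Server such an executable is typically launched from a stored procedure via xp_cmdshell or a CLR procedure, and on PostgreSQL or Greenplum via an untrusted procedural language; whichever mechanism is used, the parameters listed above end up on the driver's command line. A hypothetical C# mapping (argument names and order are illustrative, not the driver's actual contract):

```csharp
// Hypothetical mapping of the stored-procedure parameters onto the driver's
// command line; the real driver's argument names and order may differ.
public sealed record DriverOptions(
    string TargetDbms,              // Oracle | PostgreSQL | MS SQL Server | Greenplum
    string SourceDbms,              // Oracle | PostgreSQL | MS SQL Server | Greenplum
    string SourceConnectionString,  // connection string to the source database
    string SourceQuery,             // SQL query extracting the required dataset
    string TargetConnectionString,  // connection string to the target database
    string TargetTable,             // name of the target table for insertion
    int    BatchSize)               // rows per integration batch
{
    public static DriverOptions FromArgs(string[] args) => new(
        args[0], args[1], args[2], args[3], args[4], args[5], int.Parse(args[6]));
}
```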
The driver then creates a connection object to the source database server and executes the provided SQL query to retrieve the dataset.
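A sketch of this extraction step, written against the provider-agnostic ADO.NET base classes (DbConnection, DbCommand, DbDataReader) so that the same code could serve Oracle, PostgreSQL, MS SQL Server, or Greenplum sources; the batch size and output channel come from the pipeline skeleton above, and the class and method names are again hypothetical:

```csharp
// Illustrative producer: stream the source result set and cut it into
// fixed-size DataTable batches held in memory.
using System.Data;
using System.Data.Common;
using System.Threading.Channels;
using System.Threading.Tasks;

public static class SourceReader
{
    public static async Task ReadBatchesAsync(
        DbConnection sourceConnection,        // built from the source connection string
        string query,                         // SQL query passed to the stored procedure
        int batchSize,                        // rows per integration batch
        ChannelWriter<DataTable> output)
    {
        await sourceConnection.OpenAsync();

        using var command = sourceConnection.CreateCommand();
        command.CommandText = query;
        command.CommandTimeout = 0;           // extractions may run for a long time

        using var reader = await command.ExecuteReaderAsync();

        var batch = CreateBatchTable(reader);
        var values = new object[reader.FieldCount];

        while (await reader.ReadAsync())
        {
            reader.GetValues(values);
            batch.Rows.Add(values);

            if (batch.Rows.Count >= batchSize)
            {
                await output.WriteAsync(batch);    // hand the batch to the insert workers
                batch = CreateBatchTable(reader);  // start a fresh in-memory batch
            }
        }

        if (batch.Rows.Count > 0)
            await output.WriteAsync(batch);        // flush the final partial batch
    }

    private static DataTable CreateBatchTable(DbDataReader reader)
    {
        // Copy the result-set structure so each batch carries its own schema.
        var table = new DataTable();
        for (var i = 0; i < reader.FieldCount; i++)
            table.Columns.Add(reader.GetName(i), reader.GetFieldType(i));
        return table;
    }
}
```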
Holding the data in memory, the driver then performs batch insertions into the target table using the built-in, high-performance data-loading mechanisms specific to each platform (see the sketch after the list):
- BCP for MS SQL Server;
- SQL*Loader / External Tables for Oracle;
- COPY for PostgreSQL;
- gpload / gpfdist for Greenplum.
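From .NET these loaders are usually reached through the corresponding client libraries rather than the command-line utilities themselves: SqlBulkCopy drives the same bulk-load interface as BCP, Npgsql exposes COPY ... FROM STDIN directly (Greenplum also accepts COPY through its master node, although gpload/gpfdist remain the parallel-loading route), and ODP.NET offers OracleBulkCopy and array binding for Oracle. The sketch below covers two of the targets and is an assumption about how the driver might perform this step, not its actual code:

```csharp
// Illustrative insert workers for two of the target platforms.
using System;
using System.Data;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;   // SqlBulkCopy
using Npgsql;                     // COPY ... FROM STDIN

public static class TargetWriters
{
    // MS SQL Server: SqlBulkCopy streams the batch through the bulk-load API.
    public static async Task BulkInsertSqlServerAsync(
        string targetConnectionString, string targetTable, DataTable batch)
    {
        using var bulk = new SqlBulkCopy(targetConnectionString)
        {
            DestinationTableName = targetTable,
            BatchSize = batch.Rows.Count,
            BulkCopyTimeout = 0
        };
        await bulk.WriteToServerAsync(batch);
    }

    // PostgreSQL (and, via the master node, Greenplum): binary COPY FROM STDIN.
    public static async Task BulkInsertPostgresAsync(
        string targetConnectionString, string targetTable, DataTable batch)
    {
        await using var conn = new NpgsqlConnection(targetConnectionString);
        await conn.OpenAsync();

        var columns = string.Join(", ",
            batch.Columns.Cast<DataColumn>().Select(c => "\"" + c.ColumnName + "\""));

        using var importer = conn.BeginBinaryImport(
            $"COPY {targetTable} ({columns}) FROM STDIN (FORMAT BINARY)");

        foreach (DataRow row in batch.Rows)
        {
            importer.StartRow();
            foreach (var value in row.ItemArray)
            {
                // Some column types may need an explicit NpgsqlDbType here.
                if (value is DBNull) importer.WriteNull();
                else importer.Write(value);
            }
        }
        importer.Complete();
    }
}
```

The pieces would then be wired together by passing SourceReader.ReadBatchesAsync as the producer and the appropriate TargetWriters method, bound to the target connection string and table from DriverOptions, as the consumer in EtlPipeline.RunAsync.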
Results and conclusions
The use of the developed driver reduced data extraction time from the source by up to a factor of ten: operations that previously took about an hour now complete in 4–7 minutes.
However, achieving maximum performance comes at the cost of increased memory consumption. Because the driver depends heavily on available system resources, it is best not to run other resource-intensive processes in parallel with it, and the update schedule should be configured so that the driver runs in a dedicated slot with a minimal number of concurrent updates.
Thus, the development of a specialised data integration driver has significantly improved the performance of ETL processes within the data warehouse, enabling faster and more efficient access to up-to-date information for business analysts.