
May 14, 2025

Analytical DBMS: Greenplum or ClickHouse?

Among the most popular analytical DBMS solutions on today’s market, two open-source projects stand out as particularly interesting: Greenplum and ClickHouse. Both of these DBMSs handle a wide range of data analytics use cases very well, and when used together in a single stack, they can provide near-instantaneous processing of thousands of analytical queries simultaneously. This article will discuss the key architectural and functional features of Greenplum and ClickHouse that should be taken into account in order to use them effectively within a single infrastructure.

Why Greenplum alone is not enough

Before we became familiar with ClickHouse, when clients approached us for advice on planning an optimal warehouse architecture, we most often settled on the following classic setup:

  • Data sources
  • Data loading via ETL tools or CDC (Change Data Capture)
  • Product based on Greenplum as the target DBMS
  • ELT solution
  • A tool for creating a repository for loosely structured and unstructured data
  • BI systems
  • End users

This architecture works quite efficiently when providing data access to several hundred users. Under such a load, a mid-range Greenplum cluster is capable of delivering response times for data marts ranging from a few seconds to tens of seconds — depending on various conditions. However, at one point, two clients approached us, each with thousands of active users, and both required query processing times of no more than two seconds!

It became clear that Greenplum alone would not be sufficient in this case. The DBMS cannot handle more than about six hundred concurrent connections, and it is not at its best with queries that require very short response times (e.g., under one second). Of course, you can improve performance through various optimisations, but only up to a certain limit: handling such queries is simply not Greenplum’s strongest suit.

That’s when we turned our attention to ClickHouse and discovered the following:

  • ClickHouse does not support transactions;
  • It does not fully comply with ANSI SQL 2008;
  • Its support for local and distributed JOINs is limited;
  • And it is not compatible with PostgreSQL. But!

This DBMS is extremely fast! Our tests showed that on comparable hardware, ClickHouse significantly outperformed Greenplum in response time when working with WHERE clauses and aggregations — provided there were no JOINs or complex transformations involved.

Taking these characteristics into account, we decided to implement the following solution: Greenplum would be responsible for building data marts using ELT/ETL tools, while ClickHouse would handle all further processing of queries against those marts.

To implement this solution, we needed not only to ensure fast and accurate data transfer from Greenplum to ClickHouse, but also to establish efficient interaction between the two systems. In other words, we had to take into account the specific features of each DBMS and come up with a way to quickly deliver data from Greenplum to ClickHouse (after all, you can’t simply move tables from one database to another!). We found a way. But before explaining how, let’s first look at how Greenplum and ClickHouse are structured.

MPP DBMS — Greenplum

Greenplum is an analytical, distributed, massively parallel open-source database management system based on PostgreSQL. It is ideally suited for analytical processing of unlimited data volumes and for building large, reliable, and scalable data warehouses.

This DBMS is highly reliable. Thanks to its MPP architecture and powerful optimisation algorithms, Greenplum can process complex analytical queries very quickly, even when working with massive datasets. Greenplum implements a Shared-Nothing architecture (no shared resources). Each node in the cluster participates in all computational operations and has its own dedicated resources for executing them — memory, operating system, CPU, and storage. This allows the DBMS to efficiently parallelise workloads when handling analytical queries, automatically isolate processes from different users, and thus allocate cluster resources effectively.

Greenplum architecture

The Greenplum architecture consists of multiple instances of an object-relational PostgreSQL database that operate together as a single DBMS. The main architectural components are:

1. Master host (or master server). This is where the master instance runs: the primary PostgreSQL instance. It serves as the entry point into Greenplum: users connect to it, and all SQL queries are directed here. The master does not store any user data and plays almost no role in actual data processing. It accepts incoming connections, parses and plans queries, and then dispatches the work to the segments, coordinating its execution with the other database instances.

2. Standby master instance (secondary master instance). A PostgreSQL instance used when the primary master becomes unavailable.

3. Primary segment instances on segment hosts. A single segment host can run one or more (typically between two and eight) logical segments: PostgreSQL instances that store and process data. Each segment contains a portion of the data, and the processes serving that data run within the corresponding segment instance. Segment hosts can only be accessed through the master. All segment hosts are assumed to be configured identically, since optimal Greenplum performance is achieved through even distribution of both data and workload.

4. Mirror segment instance ("mirror"). A PostgreSQL instance that acts as a mirror of one of the logical segments. Each logical segment can have only one mirror.

5. Interconnect network (network layer of the Greenplum architecture). Communication between individual PostgreSQL instances is ensured via interconnects — connections between segments. Using multiple interconnect networks increases channel throughput and provides fault tolerance for the cluster (in case one network fails, traffic is redistributed across the remaining ones).
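To make the even-distribution point concrete, here is a minimal, illustrative sketch (table and column names are placeholders, not from our projects): the distribution key determines which segment each row lands on, and an even spread across segments is what keeps the workload balanced.

    -- Rows are hash-distributed across all primary segments by customer_id.
    CREATE TABLE sales (
        sale_id     bigint,
        customer_id bigint,
        amount      numeric(12,2),
        sale_date   date
    ) DISTRIBUTED BY (customer_id);

    -- Quick check of how evenly the rows landed on the segments.
    SELECT gp_segment_id, count(*) AS rows_on_segment
    FROM sales
    GROUP BY gp_segment_id
    ORDER BY gp_segment_id;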

Strengths of Greenplum:

  • Support for ANSI SQL 2008 with extensions from ANSI SQL 2012 (including OLAP and more).
  • Ability to perform both local and distributed JOIN operations (see the example after this list).
  • Support for the relational data model and compatibility with PostgreSQL, and therefore with all BI and ETL systems. Familiar SQL syntax that allows for cost-effective ETL development.
  • Compliance with ACID requirements:
      • Atomicity: a guarantee that each transaction is either completed fully or not at all.
      • Consistency: a guarantee that the database remains consistent (valid) before and after the transaction.
      • Isolation: a guarantee that concurrent transactions do not interfere with one another’s outcomes.
      • Durability: a guarantee that once a transaction has been confirmed by the system, its changes will persist and not be lost.
  • Fault tolerance achieved through mirroring of each logical segment and the presence of a standby master server.
  • Horizontal scalability: the ability to add new servers and segments with minimal database downtime. The more nodes added, the faster Greenplum performs.
  • Ability to work with data from multiple sources with minimal preprocessing.
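To illustrate the JOIN point above: when two tables are distributed by the same join key, Greenplum can execute the join locally on each segment, with no data movement across the interconnect. A hedged sketch, reusing the illustrative sales table from the earlier example:

    -- customers shares the distribution key with sales, so the join is
    -- collocated: each segment joins only its own slice of the data.
    CREATE TABLE customers (
        customer_id bigint,
        region      text
    ) DISTRIBUTED BY (customer_id);

    SELECT c.region, sum(s.amount) AS revenue
    FROM sales s
    JOIN customers c USING (customer_id)
    GROUP BY c.region;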

Typical use cases for Greenplum:

  1. Regular reporting systems.
  2. Predictive analytics: analysis of current and historical data or events to forecast future outcomes. For example, in direct marketing, targeted advertising, investment risk management, and fraud detection.
  3. Ad-hoc analytics: market segmentation, product concept development, advertising effectiveness, company or brand potential, sales channels, and sales performance.
  4. Marketing analysis: target audience profiling, competitor analysis, pricing strategies.
  5. Financial scoring: assessment of borrowers’ creditworthiness, prediction of likely financial behaviour, and more.
  6. Customer base analysis: ABC analysis (segmentation by sales volume and profit), XYZ analysis (segmentation by purchase frequency or transaction rate).
  7. Logistics data analysis: delivery times, transportation costs, and storage expenses.

Analytical DBMS — ClickHouse

ClickHouse is a high-performance, column-oriented, open-source DBMS designed for processing analytical queries against structured big data in real time.

The most notable feature of ClickHouse is its extremely fast SQL query execution speed for read operations. It is among the fastest columnar OLAP DBMSs for its typical query patterns and can scale up to tens of trillions of records with total data volumes reaching several petabytes. ClickHouse allows you to store data streams without preliminary aggregation and quickly generate reports across any dimensions, while enabling interactive execution of analytical queries on data updated in real time.

In certain use cases, ClickHouse also offers better resource efficiency compared to other systems. In particular, unlike many other popular columnar DBMSs for big data processing, ClickHouse is optimised for HDD storage — meaning it can handle datasets that exceed available RAM capacity. This helps reduce system operating costs, as hard drives are cheaper than RAM.

Among other notable limitations: ClickHouse does not support transactions or point UPDATE and DELETE operations, and it lacks a full-featured query optimizer.

ClickHouse architecture

ClickHouse has a different architecture compared to Greenplum. In this DBMS, a single instance is installed on each of several servers, and user queries are sent directly to these servers. For data replication and execution of distributed DDL queries, ClickHouse uses the ClickHouse Keeper coordination service — an alternative to the widely known open-source coordination service Apache ZooKeeper.

A brief note on the similarities and differences between these solutions.

ZooKeeper is implemented in Java, whereas ClickHouse Keeper is written in C++. ClickHouse Keeper provides the same guarantees as ZooKeeper (linearizable writes, non-linearizable reads), and it also includes a compatible client-server protocol so that standard ZooKeeper clients can be used with ClickHouse Keeper. However, snapshots and logs in ClickHouse Keeper have formats that are incompatible with ZooKeeper (although the data can be converted). The inter-server protocol is also not compatible with ZooKeeper, which means it’s not possible to create a mixed ZooKeeper / ClickHouse Keeper cluster. Access control in ClickHouse Keeper is implemented similarly to ZooKeeper. Overall, ClickHouse Keeper can be used as a drop-in replacement for ZooKeeper or as an internal component of the ClickHouse server.

The high performance of ClickHouse is achieved through several architectural features:

  1. Vectorised processing of column chunks — data is not only stored in columns, but also processed in vectors or parts of columns. This reduces scheduling overhead and allows more efficient use of the CPU.
  2. Physical sorting of data by primary key — enables fast retrieval of specific individual values or ranges of values.
  3. Parallel execution of operations across multiple CPU cores on a single server.
  4. Distributed computing across a cluster through sharding.
  5. Support for approximate calculations using aggregate functions — reduces disk I/O and further increases processing speed.
  6. Data sampling — allows queries to be executed on subsets of data, as well as aggregation over a limited number of random keys.
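Several of these features can be seen together in a short, illustrative sketch (names and values are placeholders): the table below is physically sorted by its primary key, declares a sampling expression, and an approximate aggregate then runs over a tenth of the data.

    -- MergeTree table physically sorted by (site_id, event_time, hash);
    -- the SAMPLE BY expression must be part of the sorting key.
    CREATE TABLE events (
        event_time DateTime,
        site_id    UInt32,
        user_id    UInt64,
        url        String
    ) ENGINE = MergeTree
    ORDER BY (site_id, event_time, intHash32(user_id))
    SAMPLE BY intHash32(user_id);

    -- Approximate distinct count over a 10% sample of the data.
    SELECT uniq(user_id)
    FROM events SAMPLE 1/10
    WHERE site_id = 42;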

Features of ClickHouse

  • Extremely high query processing speed!
    ClickHouse processes typical OLAP queries up to hundreds of times faster than traditional row-oriented systems: a single query can scan billions of rows and gigabytes of data per second on each server.
  • Linear scalability
    To improve database performance, simply add more nodes to the cluster. ClickHouse also supports cross-data-centre replication, which synchronises data across data centres, ensuring fault tolerance and high-speed data access; ClickHouse Keeper coordinates the replication.
  • High availability and fault tolerance
    All cluster data in ClickHouse remains available for reading and writing. As a fully decentralised system, ClickHouse can be used to build a DBMS with any required level of fault tolerance. For example, by placing data across three data centres and enabling replication, you can create a cluster with a replication factor of 2 that survives a full outage of one data centre.
  • Asynchronous replication by default
    ClickHouse uses asynchronous replication by default: data written to any replica is eventually copied to all the others. Replication runs in the background, usually completes within a few seconds, brings replicas back to an identical state, and recovers automatically after failures. If stronger guarantees are needed, ClickHouse supports quorum writes (the insert_quorum setting): a write is considered successful only once the data has been written to a quorum of servers. This ensures linearisability and simulates synchronous replicas. If the number of successfully updated replicas does not reach the required quorum, the write is rejected, and ClickHouse removes the block from all replicas to preserve data integrity.
  • Effective data compression
    Because ClickHouse stores data by column, similar values sit next to each other and compress far better than in row-based databases. Thanks to modern compression algorithms such as LZ4 and ZSTD, data not only occupies less storage space but is also processed significantly faster.
  • SQL support and extensibility
    ClickHouse supports a dialect of SQL close to ANSI SQL, extended with features such as:
      • arrays and nested data structures;
      • probabilistic data structures;
      • URI functions;
      • the ability to connect external key-value stores.

Also, thanks to the HTTP interface, users can write custom connectors in any programming language. With special integration table engines, ClickHouse can use various external storage systems and databases as data sources.

  • Support for arrays and tuples
Arrays and tuples are supported natively in ClickHouse. The DBMS allows creation of tables where, for example, one column contains an array (a set of values of the same type) or a tuple (a fixed collection of fields, each with its own type).
  • Integration capabilities
    ClickHouse can connect to any database via JDBC and use its tables as its own. It can also connect to other ClickHouse instances without additional configuration and offers ready-made integrations with other systems. Connection tools include command-line interface, HTTP API, JDBC and ODBC drivers, and numerous wrappers (for Python, PHP, NodeJS, Perl, Ruby and R).
  • Easy installation
    On Ubuntu and Debian Linux operating systems, ClickHouse can be installed from pre-built packages using a few simple commands and requires no complex setup. For distributed operation, ClickHouse Keeper is required.
  • No UPDATE or DELETE statements
    ClickHouse does, however, provide mutation operations:
    ALTER TABLE [db.]table UPDATE column1 = expr1 [, …] WHERE filter_expr
    and
    ALTER TABLE [db.]table [ON CLUSTER cluster] DELETE WHERE filter_expr.
    Additionally, ClickHouse includes special table engines, CollapsingMergeTree and ReplacingMergeTree, which under certain conditions allow business logic for modifying and deleting data to be implemented (see the sketch after this list).
  • Working with tables
    The MergeTree table engine enables:
  • non-blocking reads and writes;
  • insertion of millions of rows per second;
  • partitioning, replication, and sampling of data;
  • creation of tables and databases at runtime;
  • loading data and executing queries without reconfiguring or restarting the server.
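As a hedged sketch of the mutation syntax and the ReplacingMergeTree approach mentioned in the list above (table and column names are illustrative):

    -- Mutations run asynchronously, rewriting affected data parts in the
    -- background rather than updating rows in place.
    ALTER TABLE db.user_profiles UPDATE status = 'inactive'
        WHERE last_seen < '2024-01-01';
    ALTER TABLE db.user_profiles DELETE WHERE is_test = 1;

    -- ReplacingMergeTree keeps the latest row version per sorting key;
    -- background merges eventually drop the older versions.
    CREATE TABLE db.user_profiles_rmt (
        user_id    UInt64,
        status     String,
        updated_at DateTime
    ) ENGINE = ReplacingMergeTree(updated_at)
    ORDER BY user_id;

    -- FINAL forces deduplication at query time, at a performance cost.
    SELECT * FROM db.user_profiles_rmt FINAL WHERE user_id = 42;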

Typical use cases for ClickHouse

  1. Mobile app analytics: analysis of download and registration numbers, user activity and engagement, session duration, number of invitations sent, etc.
  2. Web analytics: traffic sources, bounce rates, new, returning, and repeat visitors, session length, average number of page views per visit, conversion rates, user actions, and devices used (mobile or desktop).
  3. Advertising and real-time bidding (RTB, a technology for buying and selling ad impressions via auctions).
  4. Retail and e-commerce: demand forecasting, inventory tracking and analysis, collection and analysis of online purchase data, and more.
  5. Business intelligence, banking and financial operations: customer data, finance-related metrics, product performance.
  6. Monitoring of various technical and business metrics.
  7. Telecommunications and information security: collecting and analysing data on current threats, providing recommendations for prevention, identifying vulnerabilities in IT infrastructure, tracking software usage and licence expiration dates.
  8. Online gaming: active users, session duration, churn rate, cost per install, payments, etc.
  9. IoT (Internet of Things) and industrial sensor data: processing data from industrial robots, real-time production line monitoring.

Greenplum vs ClickHouse (Comparison Table)

ACID (transactionality)

Greenplum: Good. ClickHouse: None.
Greenplum can run multiple queries within a single transaction and provides full ACID isolation. In ClickHouse, there is no concept of transactions at all.

Performance on WHERE-clause queries

Greenplum: Good. ClickHouse: Excellent!
Greenplum: response time varies from a few seconds to tens of seconds, depending on data volume. ClickHouse: extremely fast even without indexes; response time is under 189 milliseconds.

Full scans

Greenplum: Good. ClickHouse: Excellent!
A full scan means reading the entire database and then sending a large result set to an external resource. Greenplum handles this well, but ClickHouse performs significantly faster thanks to its aggregate and lambda functions.

ANSI SQL support

Greenplum: Good. ClickHouse: Minimal.
Greenplum supports ANSI SQL 2008 with 2012 extensions (OLAP and more) and generally aligns with PostgreSQL 9.4 (as of December 2019). ClickHouse has very limited ANSI SQL support.

JOIN operations

Greenplum: Good. ClickHouse: Limited.
Greenplum works correctly with both local and distributed JOINs, although indexes are not always used for joins. In ClickHouse, JOIN support is weak. First, the right-hand side of a JOIN must fit into memory on a single server, so two tables larger than available memory cannot be joined. Second, the query planner cannot determine which JOIN type to use, so users must specify it explicitly (GLOBAL, ANY, and so on); users unaware of this may end up with inconsistent results.

Resource management

Greenplum: Good. ClickHouse: Fair.
Greenplum features a highly flexible resource management system: specific resources can be assigned to user groups and can even be borrowed between groups based on defined rules. This is achieved through the newer resource group mechanism (the older resource queue mechanism is less stable).

Disaster recovery

Greenplum: Limited. ClickHouse: Fair.
Greenplum currently lacks built-in disaster recovery features, although its write-ahead logs are now compatible with PostgreSQL logs, allowing recovery on another cluster in a different data centre. In ClickHouse, disaster recovery is better developed: shards and replicas are managed at the table level rather than for the entire cluster, enabling configurations where some tables reside in one data centre and others in another, and keeping query results consistent even during network partitions.

Security

Greenplum: Good. ClickHouse: Minimal, but improving.
Greenplum inherits the classic PostgreSQL security model: users, groups, permissions, inheritance, data encryption and SSL connections, as well as row-level and column-level access control. ClickHouse offers little of this out of the box, although recent releases have introduced improved authentication mechanisms.

Custom code execution

Greenplum: Fair. ClickHouse: Fair.
By "custom code" we mean running non-standard SQL: long user-defined instructions written using SQL logic. Greenplum has procedural language support that can call Python, but data can only be passed row by row. You can therefore build powerful machine-learning functions that process data at high speed, yet the result has to be returned one row at a time: faster than plain Python, but slower than batch processing in systems like Hadoop. In ClickHouse, support for procedural languages is currently less developed than in Greenplum. Some first steps have been taken in this direction: lambda functions are available for describing custom user logic, and a user can compile their own function in C and link it to ClickHouse, but this process is complex and costly.

Greenplum performs best in the following scenarios:

  • ETL and ELT data loading processes — classic data transformation tasks that are most effective when executed using in-database processing.
  • DWH transformations — all other types of data transformations, including key generation and similar operations.
  • Large JOIN operations — efficient joining of large tables, especially when performed on a predefined distribution key. This DBMS handles large table joins well, both locally and across distributed systems.
  • Ad-hoc deep dive analysis — in-depth analysis based on custom queries, where it’s not known in advance which data users will need, but access must be provided and resources allocated for processing.
  • Analytical functions in procedural languages, including those using ready-made algorithm libraries such as MADlib.

ClickHouse, on the other hand, is most effective for:

  • Wide data marts: serving fast data marts. If marts built in Greenplum are exported to ClickHouse and then receive uniform, predictable user queries, those queries are processed in milliseconds.
  • Wide fact tables: handling wide denormalised fact tables, when large volumes of raw data must be stored, aggregated, and then queried through the aggregates. ClickHouse enables fast complex aggregations, with query execution times measured in milliseconds.
  • Ad-hoc queries: handling simple, specific queries against data marts when vast amounts of raw data must be stored.
  • Full-scan operations: performing full scans, especially with filters applied, as well as working with structured logs and events.

Tkhemali Connector for fast data delivery

We solved the issue of delivering data from Greenplum to ClickHouse by developing a custom connector, which we named Tkhemali.

This tool ensures consistent data transfer, even in the event of transaction rollbacks. It enabled us to achieve very high interaction speeds: early tests on standard SATA drives showed insertion rates of up to 1 GB/s per Greenplum–ClickHouse server pair.

We also ensured that data transfer speed scales reliably with the growth of both Greenplum and ClickHouse clusters. Importantly, the connector guarantees correct data insertion regardless of ClickHouse’s current load. Thanks to Tkhemali, tens of thousands of users can simultaneously access data marts originally built in Greenplum — but served through ClickHouse.
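For contrast, the simplest built-in route is ClickHouse’s postgresql() table function, which can read a Greenplum table over the PostgreSQL wire protocol. It is convenient for small marts, but it pulls data through a single connection and comes nowhere near the throughput described above (host, credentials, and table names below are placeholders):

    -- Naive one-off copy of a Greenplum mart into ClickHouse over the
    -- PostgreSQL wire protocol; illustration only, not the Tkhemali path.
    INSERT INTO marts.sales_mart
    SELECT *
    FROM postgresql('gp-master:5432', 'dwh', 'sales_mart', 'etl_user', 'secret');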

In one of our upcoming articles, we’ll provide a detailed overview of the Tkhemali connector, including its capabilities, key features, and architecture.

For more information, please get in touch: info@lasmart.biz