info@lasmart.biz

January 15, 2026

Introduction to Lakehouse

In modern business, companies are increasingly faced with the need to combine large volumes of data from different sources — from sales and logistics systems to marketing reports and user activity. For this data to deliver real value, it is not enough merely to store it; it must also be possible to extract analytical insights from it quickly.

Traditional Data Warehouses (DWH) handle analytics well, but are limited when working with unstructured data and can be expensive when storing and processing very large volumes of raw data. Data Lakes, on the other hand, allow any type of data to be stored in its raw form, but are complex to maintain and do not provide transactional consistency. At the intersection of these two approaches, the Lakehouse architecture emerged.

A Lakehouse is a modern approach to organising data storage that combines the strengths of Data Lakes and Data Warehouses.

In short, a Data Lake is a flexible file-based repository for raw data of any format, while a Data Warehouse is a structured store designed for analytics. A Lakehouse brings these advantages together: the scalability and flexibility of a data lake combined with the reliability, transactional guarantees, and query convenience typical of data warehouses. In this article, we will examine how a Lakehouse works in order to understand in practice what benefits it can offer for business analytics.

At the core of the Lakehouse concept is the idea of a single unified space in which raw data and analytically prepared tables coexist side by side. The storage itself is a combination of file-based and object storage, together with a metadata and management layer that imposes a tabular structure on these files.

In practice, a Lakehouse structure may look as follows:

  • a directory (bucket) containing source data — raw extracts from upstream systems (transactions, sales, and so on);
  • directories with prepared and organised data and tables — where data is stored in a format optimised for querying;
  • system (technical) files — metadata, schema definitions, table versions, and snapshots that ensure data integrity and manageability.
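As a rough illustration, such a bucket might be laid out as follows (the directory names here are hypothetical, chosen to match the examples later in the article):

```
lakehouse/                  <- bucket: the logical container for all objects
├── raw/                    <- raw extracts from upstream systems
└── cheques/                <- an Iceberg table built on top of the files
    ├── data/               <- Parquet data files, organised by partition
    └── metadata/           <- schema, snapshots, manifests, versions
```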

As a first step, we create a logical container for the data — a bucket (lakehouse). This is the main storage directory that holds all objects, both raw and processed.

Figure 1 – Lakehouse bucket structure

In the storage, you can see that inside the bucket (lakehouse) there is a folder named cheques. This is a table created in the Iceberg format — not merely a collection of directories and files, but a logical table built on top of file-based storage. Iceberg organises data in such a way that, alongside the primary data files, auxiliary metadata files are created. These metadata files contain information about the table structure, its versions, snapshots, and partitions, ensuring consistency and manageability.
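For context, an Iceberg table such as cheques can be created through Spark SQL. The following is a sketch only: the catalog name, schema, and column list are assumptions, not the exact definition used in this example.

```sql
-- Hypothetical table definition; catalog, schema, and column
-- names are illustrative assumptions.
CREATE TABLE lakehouse.db.cheques (
    cheque_id BIGINT,
    store_id  INT,
    amount    DECIMAL(10, 2),
    Year      INT,
    Month     INT
)
USING iceberg
PARTITIONED BY (Year, Month);
```

Declaring the partition columns up front is what later lets the engine skip files that cannot match a query's filter.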

Figure 2 – Contents of the cheques folder

It is precisely through this type of organisation that a Lakehouse brings together the worlds of file-based storage and tabular analytics. The key optimisation mechanism here is metadata, which controls partitioning and indexing. These mechanisms can be applied both to pre-organised datasets and to portions of raw data during the preparation stage, significantly speeding up query execution.

Let us take a closer look at the metadata folder. This is where the service information is stored that makes an Iceberg table “smart” — it knows which files it consists of, how it is partitioned, which versions exist, and what the current state of the data is. Several types of files can be found here:

  • manifest files (*-m0.avro) — contain lists of Parquet files belonging to a specific snapshot (that is, a particular version of the table);
  • snapshot files (snap-*.avro) — describe the state of the table at the moment a snapshot is created: which files were added, modified, or removed;
  • metadata.json (v1.metadata.json) — the main table configuration files that store the data schema, partitioning information, snapshot lists, and general table properties;
  • version-hint.text — a small technical file indicating the current version of the table.
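To give a sense of what such a metadata file contains, here is a heavily trimmed sketch of a metadata.json; the field names follow the Iceberg table format, but the values are illustrative, not taken from this table:

```json
{
  "format-version": 2,
  "table-uuid": "…",
  "location": "s3://lakehouse/cheques",
  "current-snapshot-id": 1234567890,
  "schemas": [ "…column names and types…" ],
  "partition-specs": [ "…e.g. Year, Month…" ],
  "snapshots": [ "…one entry per table version…" ]
}
```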

Figure 3 – Contents of the metadata folder

The Iceberg operating principle is based on versioning: any change — whether data is added, updated, or deleted — does not overwrite existing files but creates a new version of the table. One of the key characteristics is that versioning and data changes operate at the partition level. This means that when data is modified, new files are created only for the affected partition, while the rest of the table remains unchanged. As a result, Iceberg preserves the complete history of changes and allows rollback to any previous state if required.
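This version history is directly queryable. In Spark SQL, for example, a table's snapshots can be listed through its metadata table, and a previous state can be read back with time travel. The catalog and table names below are assumptions, and the snapshot id is illustrative:

```sql
-- List all snapshots (versions) of the table.
SELECT snapshot_id, committed_at, operation
FROM lakehouse.db.cheques.snapshots;

-- Read the table as it was at a given snapshot (illustrative id).
SELECT *
FROM lakehouse.db.cheques
VERSION AS OF 1234567890;
```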

However, efficient operation critically depends on correct data partitioning. If large files are not split into partitions or if partitioning is implemented incorrectly (for example, not aligned with ETL process requirements), the system will be forced to create copies of massive data volumes for every minor change. This leads to a sharp drop in performance and inefficient resource usage when working with billions of records.

In addition to the metadata folder, the structure also contains a data directory, which stores the main data files in Parquet format. Parquet is a columnar storage format optimised for analytical workloads. It allows reading only the required columns rather than entire files, significantly speeding up query execution. Thanks to this, Iceberg combines the benefits of a file-based approach (flexibility and scalability) with the efficiency typical of analytical data warehouses.

Figure 4 – A Parquet file inside an Iceberg table, stored in a partition corresponding to a specific year and month

Thus, an Iceberg table combines two key layers: the data stored in Parquet files and the metadata that describes their structure, versions, and partitions. Thanks to this, an analytical engine does not work with each file directly, but instead operates through the intelligent Iceberg layer, which knows exactly which files need to be read to execute a specific query. In practice, this is especially noticeable when working with large data volumes.

In this example, Apache Spark was used to execute analytical queries — a distributed computing engine that enables parallel processing of large datasets. Spark integrates with Iceberg and can interact with it via SQL queries as if working with ordinary tables. To see in practice how Iceberg optimises data reads, let us run two queries with identical conditions but different syntax. At first glance, they do the same thing — count the number of receipts per store for June 2023 — but their execution times differ.

Query without using partitions

In this case, the filter is applied after all data has been read, because the Year and Month fields were cast to string types, and Iceberg cannot use partitions for optimisation. As a result, Spark effectively reads all table files, which takes more time.
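The query itself is shown only as a screenshot, but it might look like the sketch below (the table and column names are assumptions carried over from earlier examples). The casts on the partition columns are what defeat pruning:

```sql
-- Casting partition columns to strings prevents Iceberg
-- from pruning partitions: Spark ends up scanning all files.
SELECT store_id, COUNT(*) AS cheque_count
FROM lakehouse.db.cheques
WHERE CAST(Year AS STRING) = '2023'
  AND CAST(Month AS STRING) = '6'
GROUP BY store_id;
```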

Figure 5 – Result of executing the query without using partitions

Query using partitions

Here, the filter is written correctly — Spark passes it to Iceberg, which then uses partition information to read only the required files. As a result, the query runs significantly faster because Spark processes only a small subset of the data.
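A sketch of the partition-friendly form, using the same assumed names: filtering directly on the native Year and Month columns lets Iceberg prune the scan down to a single partition.

```sql
-- Filters on the native partition columns are pushed down,
-- so only the June 2023 partition's files are read.
SELECT store_id, COUNT(*) AS cheque_count
FROM lakehouse.db.cheques
WHERE Year = 2023 AND Month = 6
GROUP BY store_id;
```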

Figure 6 – Result of executing the query using partitions

This example clearly demonstrates one of the key advantages of the Lakehouse approach: analytical tools such as Spark do not work with data directly, but instead operate through the intelligent Iceberg layer, which determines on its own which parts of a table need to be read. Thanks to Iceberg and the partitioning mechanism, queries are processed faster, the load on compute resources is reduced, and analytics becomes more flexible and responsive. This is especially important when dealing with billions of rows of data — it is precisely in such scenarios that the Lakehouse truly shows its strength.

Conclusion

The Lakehouse architecture is a modern and flexible approach to data organisation. It combines the benefits of storing large volumes of information with advanced analytical processing capabilities, simplifying work with data and making it more accessible for decision-making. Lakehouse demonstrates that even complex datasets can be processed quickly, reliably, and at scale.

For more information, please get in touch: info@lasmart.biz