
October 14, 2024

Key nuances of building a Kylin cube

Not long ago, our company started working with a new Apache technology stack. After several successful projects, we would like to share our development experience and some of our insights.

Apache Kylin is a distributed storage system for pre-calculated and aggregated data designed for complex analysis, specifically Online Analytical Processing (OLAP) cubes. A key step in developing cubes is configuring dimension types and aggregation groups. Apache Kylin offers extensive documentation on configuring and optimising the cube-building process. This article explores the patterns we identified in how dimension types and aggregation groups affect cube build time and query latency, and proposes an optimal configuration for a cube with more than 30 dimensions.

The development takes place in an environment deployed from the Apache Kylin Docker image (apachekylin/apache-kylin-standalone:4.0.0) and configured according to the capabilities of the working server (the Apache Kylin configuration is detailed in previous articles). The solution discussed in this article is based on a data model that includes one fact table containing an average of 1.5 million rows per month and three dimension tables. The Apache Kylin cube builder treats each column in a dimension table as a separate entity, so the model includes a total of 51 dimension columns (hereafter referred to as "dimensions").
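For orientation, the examples later in this article refer to a simplified, hypothetical version of this star schema; the table and column names are illustrative and do not reproduce the project's actual model.

    -- Hypothetical star schema used for the examples in this article
    CREATE TABLE sales_fact (
        sale_date    DATE,
        item_id      BIGINT,          -- FK to item_dim
        pharmacy_id  BIGINT,          -- FK to pharmacy_dim
        supplier_id  BIGINT,          -- FK to supplier_dim
        quantity     DECIMAL(18, 2),
        amount       DECIMAL(18, 2)
    );

    CREATE TABLE item_dim (
        item_id      BIGINT,          -- PK, always equal to the fact-table FK
        item_sku     VARCHAR(64),
        barcode      VARCHAR(64),
        vat_rate     DECIMAL(5, 2)
    );
    -- pharmacy_dim and supplier_dim follow the same pattern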

The Apache Kylin cube builder offers two types of dimensions: normal and derived. Our experiments showed that queries against normal dimensions return data roughly twice as fast as the same queries against derived dimensions. Kylin limits the number of normal dimensions in a cube to 63. According to the Apache Kylin documentation, N normal dimensions can theoretically produce 2ᴺ combinations of dimensions (cuboids). The cube processing time and the amount of computational resources required are directly proportional to the number of cuboids, so to optimise the build it’s necessary to define dimension types carefully and reduce the number of dimension combinations.

Before determining dimension types, it’s important to understand which SQL queries will most frequently be sent to the cube. The cube discussed in this article serves as the data source for reports in SuperSet and Kylin MDX. While it’s difficult to form a pool of the most frequent queries for Kylin MDX, this is not an issue for dashboards in SuperSet. Therefore, the SuperSet queries were used as the basis for testing cube response times.

The derived dimensions selected for testing were those that:

  1. Are not used in SuperSet reports and are rarely used in reports based on the Kylin MDX dataset (e.g., pharmacy opening/closing date, sales area/total area of the pharmacy, VAT rate);
  2. Are not used in SuperSet reports and can be derived from another dimension, the foreign key (FK); in other words, they have a specific mapping to it (e.g., item SKU, primary barcode), as illustrated in the query sketch after this list;
  3. Are primary keys (PK) of dimension tables: since the FK and PK values are always identical, Kylin can apply filters and grouping to the FK first and transparently substitute the PK.
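
To make rules 2 and 3 concrete, here is a sketch (using the hypothetical schema above) of a query that touches only a derived attribute. Kylin does not need a cuboid that physically contains item_sku: it can aggregate by the FK and resolve the derived column from the dimension-table snapshot at query time.

    -- item_sku is a derived dimension: the cuboid stores only item_id
    SELECT d.item_sku,
           SUM(f.amount) AS total_amount
    FROM   sales_fact f
    JOIN   item_dim   d ON f.item_id = d.item_id
    GROUP  BY d.item_sku;

The price of this flexibility is the extra lookup and post-aggregation at query time, which is consistent with the roughly twofold slowdown we observed for derived dimensions.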

Following the principles for selecting dimension types outlined above, the cube was configured with 13 derived dimensions and 38 normal dimensions (Figure 1).

Figure 1. Configuration of cube dimension types
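
To put the 2ᴺ bound into perspective (a back-of-the-envelope calculation rather than a measured figure): with all 51 columns declared normal, the theoretical upper limit would be 2⁵¹ ≈ 2.3 × 10¹⁵ cuboids, while the chosen 38 normal dimensions still allow up to 2³⁸ ≈ 2.7 × 10¹¹, far more than can reasonably be built.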

The next essential step in cube development is configuring aggregation groups to reduce the number of cuboids.

The optimal solution in our case is to create two aggregation groups:

  1. Includes the main dimension fields used in SuperSet reports;
  2. Includes all other dimension fields used in Kylin MDX.

As a result of testing various aggregation group configurations, the following general rules were identified for their formation:

  • Mandatory Dimensions are left empty.
  • Hierarchy Dimensions contain all existing hierarchies. In the hierarchies, dimensions with the lowest cardinality are selected first, followed by those with higher cardinality (examples are shown in Figures 2 and 3).

Figure 2. Example configuration of Hierarchy Dimensions for aggregation group No. 1

Figure 3. Example configuration of Hierarchy Dimensions for aggregation group No. 2

  • Joint Dimensions group together dimensions that belong to the same table or are rarely used in SQL queries (examples are shown in Figures 4 and 5). It’s important that Hierarchy and Joint Dimensions contain non-overlapping dimensions.

Figure 4. Example configuration of Joint Dimensions for aggregation group No. 1

Figure 5. Example configuration of Joint Dimensions for aggregation group No. 2
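
For reference, this is how the first group might look in the form Kylin stores it, the aggregation_groups section of the cube descriptor. The field names follow Kylin's cube-descriptor format, while the dimension names are illustrative and the lists are heavily abridged:

    "aggregation_groups": [
      {
        "includes": ["SALE_DATE", "PHARMACY_REGION", "PHARMACY_CITY", "PHARMACY_ID",
                     "ITEM_CATEGORY", "ITEM_GROUP", "ITEM_ID",
                     "SUPPLIER_NAME", "SUPPLIER_TYPE"],
        "select_rule": {
          "mandatory_dims": [],
          "hierarchy_dims": [["PHARMACY_REGION", "PHARMACY_CITY", "PHARMACY_ID"],
                             ["ITEM_CATEGORY", "ITEM_GROUP", "ITEM_ID"]],
          "joint_dims": [["SUPPLIER_NAME", "SUPPLIER_TYPE"]]
        }
      }
    ]

The second aggregation group is defined in the same way, listing the dimensions used only in Kylin MDX.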

Pre-calculated data is stored in Parquet files. Before saving the data to HDFS, Kylin performs a redistribution (a Spark dataframe transformation) and partitions the data according to the shard by column. This positively affects cube query execution time, reducing it by about 30% compared to the same solution without a shard by column.
The most effective choice for the shard by column is the dimension with the highest cardinality (Figure 6). The order of the remaining dimensions should follow Kylin's general recommendations.

Figure 6. Defining shard by item_id

The primary goal of data manipulation is to calculate measures. Apache Kylin provides a limited set of aggregation functions, as shown in Figure 7.

Figure 7. Apache Kylin aggregation functions
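
As an illustration, using the hypothetical schema introduced earlier, a typical query that the cube answers entirely from pre-aggregated measures might look like this:

    SELECT f.pharmacy_id,
           SUM(f.amount)             AS revenue,
           COUNT(DISTINCT f.item_id) AS items_sold
    FROM   sales_fact f
    GROUP  BY f.pharmacy_id;

SUM is returned directly from the stored measure, while the result of COUNT(DISTINCT ...) depends on how the measure's return type is configured, as discussed below.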

Special attention should be given to the count_distinct function, specifically the return type configured for it: the approximate options are based on HyperLogLog, while exact results require the precise (bitmap-based) return type. To obtain accurate results, the measure should be configured as shown in Figure 8.

Figure 8. Measure configuration with the count_distinct aggregation function

The calculation of count_distinct in Kylin can be optimised by defining the minimum global dictionary partition size in the "Configuration Overwrites" step:
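
A sketch of such an override (the property belongs to the Kylin 4 global dictionary settings, assuming the default globalV2 dictionary is used; the value shown is purely illustrative):

    kylin.dictionary.globalV2-min-hash-partitions=10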

Additionally, to use Spark Cubing, the following configuration must be set:
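
The exact settings depend on the Kylin version and the server's resources; as a sketch, cube-level Spark build settings are passed through with the kylin.engine.spark-conf.* prefix, for example:

    kylin.engine.spark-conf.spark.executor.instances=4
    kylin.engine.spark-conf.spark.executor.cores=4
    kylin.engine.spark-conf.spark.sql.shuffle.partitions=200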

It’s also necessary to specify the cube-building algorithm and the amount of memory allocated for this process (Figure 9):

Figure 9. Modifying Kylin properties for a specific cube
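
As an illustration only: kylin.cube.algorithm is the long-standing Kylin property for selecting the build algorithm (auto, layer or inmem), and the memory values below are placeholders rather than the figures used in the project:

    kylin.cube.algorithm=inmem
    kylin.engine.spark-conf.spark.driver.memory=2G
    kylin.engine.spark-conf.spark.executor.memory=8G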

Following the above principles when designing the Kylin cube for the solution discussed yields a more even distribution of cuboids (Figure 10), reduces the build time for a one-month segment of roughly 1.5 million rows to 4-5 minutes, and keeps the average and maximum SQL query latencies at around 0.3 and 1.2 seconds, respectively.

Figure 10. Cuboid distribution for a specific Kylin cube

For more information, please get in touch at info@lasmart.biz.
