April 21, 2024
Algorithm for calculating the required computational resources for data processing in Kylin
Apache Kylin is a distributed store of precomputed, aggregated data for complex analysis, that is, online analytical processing (OLAP) cubes.
Data pre-aggregation starts when a request to build a cube segment is submitted, which generates a Spark job. The cube configuration lets you define the number of executors (processes) and the amount of RAM allocated to each executor. When a Spark job starts, the resources required for its execution are configured automatically; the developer should treat these values as a guideline, not as settings to apply as-is. For example, Spark may request 20 executors, but if the job runs on a single server with 10 cores, requesting that many will not help.
Thus, when building the cube, the developer should consider both the recommended parameters obtained from the Spark job auto-configuration and the available cluster/server resources. The formula for calculating the required amount of cluster/server RAM is shown below:
M = n * (Me + MeO) + (Md + MdO) + Ms
M – the cluster/server RAM size;
n – the number of Spark executors;
Me – the RAM allocated to each Spark executor;
MeO – the RAM allocated to each Spark executor for overhead;
Md – the RAM allocated to the Spark driver;
MdO – the RAM allocated to the Spark driver for overhead;
Ms – the RAM allocated to Sparder.
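The formula can be written as a small helper for trying out different configurations. This is a minimal sketch: the function name `required_ram_gb` and the sample values are hypothetical and are not taken from the test cluster.

```python
def required_ram_gb(n, me, meo, md, mdo, ms):
    """Return M = n * (Me + MeO) + (Md + MdO) + Ms, with all sizes in GB."""
    executors_total = n * (me + meo)   # memory for all Spark executors
    driver_total = md + mdo            # memory for the Spark driver
    return executors_total + driver_total + ms


# Hypothetical illustrative values, not the configuration from this article.
example = required_ram_gb(n=2, me=4, meo=1, md=2, mdo=1, ms=8)
print(example)  # 2 * (4 + 1) + (2 + 1) + 8 = 21
```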
In order to obtain optimal configuration parameters and minimise unused reserved compute resources for data processing in Kylin, testing was performed. The starting configuration of Hadoop and Spark was set so that about 90% of resources were used.
For a single-node cluster with 150 GB of RAM, 981.46 GB of storage, and a 10-core processor, the parameters described above were set as follows:
- The total RAM capacity of the cluster is 125 GB;
- 4 GB plus 2 GB of overhead are allocated to the Spark driver;
- 8 executors are allocated to the Spark job;
- Each executor is allocated 8 GB plus 2 GB of overhead;
- 21 GB are allocated to Sparder.
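As an illustration, the driver and executor values from this list map onto standard Spark properties roughly as follows. This is a sketch under assumptions: the helper `spark_build_conf` is hypothetical, and the `kylin.engine.spark-conf.` prefix is assumed to be how the Kylin version in use passes Spark settings to build jobs, so check it against your installation. Sparder's 21 GB is left out because the article does not say how it is split between Sparder's own settings.

```python
def spark_build_conf(executors, executor_gb, executor_overhead_gb,
                     driver_gb, driver_overhead_gb, prefix=""):
    """Render Spark memory settings as 'key=value' lines (sizes in GB)."""
    props = {
        "spark.executor.instances": str(executors),
        "spark.executor.memory": f"{executor_gb}g",
        "spark.executor.memoryOverhead": f"{executor_overhead_gb}g",
        "spark.driver.memory": f"{driver_gb}g",
        "spark.driver.memoryOverhead": f"{driver_overhead_gb}g",
    }
    return [f"{prefix}{key}={value}" for key, value in props.items()]


# Starting configuration from the list above; the prefix is an assumption.
for line in spark_build_conf(8, 8, 2, 4, 2, prefix="kylin.engine.spark-conf."):
    print(line)
```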
The number of Spark job executors is one of the key parameters affecting cube build time. Testing showed that the more executor processes run in parallel during the cube build, the shorter the execution time. Each executor must be allocated at least one physical processor core, so the number of executors cannot exceed the number of cores. In addition to the executors, the Spark driver is launched at job startup, and the Sparder executor is already running in the cluster. With 10 cores, one taken by the driver and one by Sparder, the optimal number of Spark executors for this server is 8.
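The core-counting argument can be sketched in a few lines. The function names are hypothetical, and the default of two reserved cores (one for the Spark driver, one for Sparder) is the assumption used in this article, not a general Spark rule.

```python
def max_executors(physical_cores, reserved_processes=2):
    """Cores left for executors after reserving one per non-executor process."""
    available = physical_cores - reserved_processes
    if available < 1:
        raise ValueError("not enough cores for even one executor")
    return available


def usable_executors(requested, physical_cores, reserved_processes=2):
    """Cap the number Spark asks for at what the server can actually run."""
    return min(requested, max_executors(physical_cores, reserved_processes))


print(max_executors(10))         # 8
print(usable_executors(20, 10))  # 8: a request for 20 is capped by the cores
```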
Resource usage while building cube segments on a Linux server can be monitored with the console command top. Figure 1 shows the top output during the Spark job: about 10 GB of memory (VIRT) is allocated for each executor, while only 4-5 GB (RES) is actually used. Note that the Hadoop Resource Manager displays rss+swap memory metrics (Figure 2), which correspond to the VIRT column in Figure 1 and do not show the resources actually used.
Figure 1 — monitoring of server resources during cube processing
Figure 2 — Resource Manager resource monitoring during cube processing
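To compare allocated and used memory numerically rather than by eye, the VIRT and RES fields from top can be converted to a common unit. The sketch below assumes the procps-style format, where a plain number is in KiB and scaled values carry an `m`, `g` or `t` suffix; the helper names and sample values are hypothetical and only approximate the figures quoted above.

```python
# Multipliers that convert a top memory field to KiB (no suffix means KiB).
_UNITS_KIB = {"": 1, "k": 1, "m": 1024, "g": 1024 ** 2, "t": 1024 ** 3}


def top_field_to_gib(field):
    """Convert a top memory field such as '10.0g' or '4718592' to GiB."""
    field = field.strip().lower()
    suffix = field[-1] if field[-1].isalpha() else ""
    number = field[:-1] if suffix else field
    return float(number) * _UNITS_KIB[suffix] / 1024 ** 2


def resident_share(virt, res):
    """Fraction of the allocated (VIRT) memory that is resident (RES)."""
    return top_field_to_gib(res) / top_field_to_gib(virt)


# Hypothetical executor row: about 10 GB allocated, about 4.5 GB resident.
print(round(resident_share("10.0g", "4.5g"), 2))  # 0.45
```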
In addition to monitoring server resources during job execution, review Spark's recommended auto-configuration, which is recorded in the log file (Figures 3 and 4).
Figure 3 — cube construction log
Figure 4 — Spark auto configuration
The analysis of the test results and Spark's auto-configuration shows that it is reasonable to allocate 5 GB to each executor plus 2 GB of memory overhead. Resource utilisation with this Spark configuration is shown in Figures 5 and 6. Notably, reducing the RAM allocated to each executor also reduced the cube segment processing time.
Figure 5 — monitoring of server resources during cube processing
Figure 6 — monitoring of server resources during cube processing
With the optimal Spark configuration for the server's resources determined, the minimum required server RAM for Apache Kylin is:
8 * (5 GB + 2 GB) + (4 GB + 2 GB) + 21 GB = 83 GB
Figure 5 shows that of the 83 GB of RAM allocated, about 63 GB is used, which leaves a reserve of resources. Thus, for the current tasks the server with 150 GB of RAM is oversized; a server with 83 GB of RAM and comparable CPU power would be sufficient.
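The figures in this section can be cross-checked with a short calculation. This is a sketch that reuses the RAM formula from the beginning of the article; the function and variable names are hypothetical, while the numbers are the ones quoted in the text.

```python
def required_ram_gb(n, me, meo, md, mdo, ms):
    """Return M = n * (Me + MeO) + (Md + MdO) + Ms, with all sizes in GB."""
    return n * (me + meo) + (md + mdo) + ms


# Starting configuration: 8 executors with 8+2 GB, driver 4+2 GB, Sparder 21 GB.
starting_gb = required_ram_gb(n=8, me=8, meo=2, md=4, mdo=2, ms=21)
# Tuned configuration: executor memory reduced to 5+2 GB, the rest unchanged.
tuned_gb = required_ram_gb(n=8, me=5, meo=2, md=4, mdo=2, ms=21)

observed_usage_gb = 63   # approximate usage reported for Figure 5
server_ram_gb = 150      # physical RAM of the test server

print(starting_gb)                    # 107
print(tuned_gb)                       # 83
print(starting_gb - tuned_gb)         # 24 GB saved by the tuning
print(tuned_gb - observed_usage_gb)   # 20 GB of reserve within the allocation
print(server_ram_gb - tuned_gb)       # 67 GB of server RAM left unallocated
```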