May 25, 2023
Configuring Kylin (Spark) Components
Not so long ago our company started mastering the Apache technology stack. After several successful projects, we want to share our development experience and a few noteworthy details.
After deploying the Apache Kylin docker image published by the developers (apachekylin/apache-kylin-standalone:4.0.0) and working on analytics solutions, we ran into the resource limits that Kylin sets by default. In this post we discuss options for changing the Spark configuration on a single-node cluster with 150 GB of RAM, 981.46 GB of disk storage, and a 10-core processor with a base clock speed of 2.40 GHz.
To change the Spark configuration, connect to the server, open a shell inside the Kylin container, find the required file, and edit it with vi (the commands are shown in Figure 1). All Spark build-engine settings are managed in the kylin.properties file under the «kylin.engine.spark-conf» prefix. The rest of this post discusses the configuration parameters we changed relative to the defaults.
Figure 1 — Commands to enter the Kylin container and find the file by name
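For reference, the commands behind Figure 1 look roughly like this (a minimal sketch: it assumes the container is named kylin, as in Listing 3 below, and the path printed by find is a placeholder you should substitute):

sudo docker exec -it kylin bash              # open a shell inside the Kylin container
find / -name kylin.properties 2>/dev/null    # locate the configuration file
vi <path-from-find>/kylin.properties         # edit the file found above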
To arrive at the Spark configuration, we tested various options and found the optimal settings for our particular server. Here are the parameters we changed (Listing 1):
1) spark.driver.memory — the amount of memory used by the driver process, which initializes the SparkContext. Optimal value: 4G.
2) spark.driver.memoryOverhead — the amount of non-heap memory allocated to the driver process in cluster mode. This memory accounts for things like VM overheads, interned strings, and other native overheads, and tends to grow with container size (usually 6-10% of it). Optimal value: 2G.
3) spark.executor.memory — the amount of memory used by each executor process. Optimal value: 8G.
4) spark.executor.memoryOverhead — the amount of additional memory allocated to each executor process, covering the same kinds of overheads as the driver setting. Optimal value: 2G.
5) spark.executor.cores — the number of virtual cores to use on each executor. Optimal value: 1.
6) spark.executor.instances — the number of executor processes to launch. Optimal value: 8.
7) spark.hadoop.dfs.replication — the HDFS replication factor. Optimal value: 1.
8) spark.sql.shuffle.partitions — the default number of partitions used when shuffling data for joins or aggregations. Optimal value: 16.
Listing 1 — Spark configuration parameters
kylin.engine.spark-conf.spark.driver.memory=4G
kylin.engine.spark-conf.spark.driver.memoryOverhead=2G
kylin.engine.spark-conf.spark.executor.memory=8G
kylin.engine.spark-conf.spark.executor.memoryOverhead=2G
kylin.engine.spark-conf.spark.executor.cores=1
kylin.engine.spark-conf.spark.executor.instances=8
kylin.engine.spark-conf.spark.hadoop.dfs.replication=1
kylin.engine.spark-conf.spark.sql.shuffle.partitions=16
It is important to understand that, with the configuration from Listing 1, a cube build first allocates 4G (+2G overhead) of RAM to the driver, and the job is then divided into 16 tasks executed by eight executors, each using 8G (+2G overhead) of RAM. A cube build therefore uses no more than 6G + 8 × 10G = 86G of server RAM. In addition, the Spark application itself takes another 25G (Figure 2). The available cluster (server) resources are specified when configuring Hadoop (described in the previous article). If the resources required to start a cube-build job exceed the available cluster resources, the Application Manager keeps the job waiting until resources in the cluster are freed.
Figure 2 — Resources used by the Spark application
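To see how these numbers add up on a live cluster, you can query YARN from inside the container (a sketch, assuming the Hadoop CLI bundled in the standalone image; the same information is shown in the ResourceManager web UI, which listens on port 8088 by default):

yarn application -list    # running applications, their state and queue
yarn node -list           # node report: running containers per node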
Sparder (SparderContext) is the distributed query engine of Kylin 4, implemented as a server-side Spark application. Its configuration is managed in the kylin.properties file under the «kylin.query.spark-conf» prefix (Listing 2).
Listing 2 — Sparder configuration parameters
kylin.query.spark-conf.spark.driver.memory=4G
kylin.query.spark-conf.spark.driver.memoryOverhead=2G
kylin.query.spark-conf.spark.executor.memory=16G
#kylin.query.spark-conf.spark.executor.instances=8
kylin.query.spark-conf.spark.executor.memoryOverhead=4G
kylin.query.spark-conf.spark.executor.cores=1
kylin.query.spark-conf.spark.sql.shuffle.partitions=16
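After changing the query configuration, it is convenient to check that Sparder answers queries. One option is Kylin's query REST API (a sketch: port 7070 and the default ADMIN/KYLIN credentials are assumed, and my_project / my_table are placeholders for your own project and table):

curl -X POST http://localhost:7070/kylin/api/query \
  -H 'Content-Type: application/json' \
  -u ADMIN:KYLIN \
  -d '{"sql": "select count(*) from my_table", "project": "my_project"}'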
For the changes to take effect, either restart the container (Listing 3) or reload the configuration (Reload Config) in the System section of the Kylin web interface (Figure 3).
Listing 3 — Restarting the container
sudo docker restart kylin
Figure 3 — The System section of the Kylin web interface
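Note that after a restart it takes a while for all the bundled services to come back up. A simple way to check on the container (assuming the web UI stays on the default port 7070) is:

sudo docker logs --tail 20 kylin    # watch the startup messages
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7070/kylin    # probe the web UI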
Source List:
https://kylin.apache.org/docs31/tutorial/cube_spark.html
https://spark.apache.org/docs/latest/configuration.html
https://kylin.apache.org/docs/