May 11, 2023
Kylin cube segment processing automation via NiFi
Open-source software has been growing in popularity and adoption. Our company has followed this trend and started working with the Apache technology stack. After several successful projects, we want to share our development experience and some of the techniques we used.
While developing an analytics solution on Apache Kylin, we had difficulty automating the refresh of Kylin cube segments after data is loaded into the repository. Performing these actions manually is not a problem during development, but it becomes one once the solution is deployed to production.
We use the open-source Apache NiFi to automate data processing operations. The task is to start cube segment processing automatically through the Kylin REST API, called from NiFi's ExecuteScript processor.
The REST API request code was generated with Postman, which offers a choice of target programming languages. Python and Ruby are the languages supported by both ExecuteScript (NiFi) and Postman. Testing showed that making HTTP requests from Python requires installing additional libraries, whereas Ruby works out of the box. This decided the choice of language: Ruby.
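To illustrate the "out of the box" point, here is a minimal sketch that builds a request using only Ruby's standard library. The helper name build_request, the host, the cube name, the credentials and the endpoint path /kylin/api/cubes/<cube>/rebuild are assumptions made for this example, so check the path against the Kylin version in use. The request is only constructed here, not sent.

```ruby
require 'net/http'
require 'uri'
require 'json'

# Hypothetical helper: builds (but does not send) a PUT request for one cube.
# base_url, user and password are placeholders, not values from the article.
# The endpoint path is an assumption; verify it for your Kylin version.
def build_request(base_url, cube_name, user, password, payload)
  uri = URI.parse("#{base_url}/kylin/api/cubes/#{cube_name}/rebuild")
  request = Net::HTTP::Put.new(uri)
  request.basic_auth(user, password)             # sets the Authorization header
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(payload)
  [uri, request]
end

uri, request = build_request(
  'http://kylin.example.com:7070', 'sales_cube', 'ADMIN', 'KYLIN',
  { 'startTime' => 0, 'endTime' => 1, 'buildType' => 'BUILD' }
)
puts "#{request.method} #{uri.path}"
```

Sending the request would then be a single call such as Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }, again with no extra libraries.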
In addition to the request code itself, we needed an algorithm that issues requests for several cubes, as well as a way to pass the start and end dates of each segment's data period.
Information about the cubes is stored in a two-dimensional array holding each cube's name and the average processing time of one of its segments. A loop walks through the array: the cube name is substituted into the URL string, and the average segment processing time is passed to a sleep call at the end of each iteration so that segments are processed sequentially. (Kylin is configured so that a single Spark job uses all server resources; a job started in parallel would go to sleep for a random period.)
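The loop described above can be sketched as follows. The cube names, the timings in seconds, the base URL and the endpoint path are hypothetical values chosen for illustration, and the block passed to process_cubes stands in for the actual HTTP call. The pause is injectable so the loop can be tried without really waiting.

```ruby
# Hypothetical cube list: [cube name, average segment build time in seconds].
CUBES = [
  ['sales_cube', 600],
  ['stock_cube', 300]
].freeze

# Walks the list in order. For each cube it builds the URL, yields it to the
# caller's block (where the HTTP request would be made), then pauses for the
# cube's average build time so the next build starts only afterwards.
def process_cubes(cubes, base_url, pause: ->(seconds) { sleep(seconds) })
  cubes.each do |cube_name, avg_seconds|
    url = "#{base_url}/kylin/api/cubes/#{cube_name}/rebuild" # assumed path
    yield cube_name, url
    pause.call(avg_seconds)
  end
end

# Dry run: print what would happen instead of sleeping or calling Kylin.
dry_pause = ->(seconds) { puts "  would sleep #{seconds}s" }
process_cubes(CUBES, 'http://kylin.example.com:7070', pause: dry_pause) do |name, url|
  puts "#{name}: PUT #{url}"
end
```

A fixed pause is the simplest way to serialize the builds, but it depends on the average times staying accurate. Polling the job status would be more robust, at the cost of extra requests.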
To determine the start and end dates of a segment's data period, we first set the start date to the first day of the current month: take the year and month of the current date and set the day to 1. We then set n, the number of segments to update in each cube. For each cube, a loop runs from n down to 0; on each iteration the loop counter is the number of months subtracted from the start date to obtain startTime, and one month fewer is subtracted to obtain endTime. In the body of the HTTP request the dates are converted to Unix timestamps in milliseconds. The argument "buildType": "BUILD" can be used both to update an existing segment and to create a new one. The code is shown in Listing 1.
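Before the full listing, the date arithmetic on its own can be sketched as below. The helper names first_of_month, to_millis and segment_windows are hypothetical, and treating each date as midnight UTC is an assumption of this sketch. Because the loop runs from n down to 0 inclusive, it yields n + 1 monthly windows, the last one covering the current month.

```ruby
require 'date'
require 'json'

# First day of the month that contains `today`.
def first_of_month(today)
  Date.new(today.year, today.month, 1)
end

# Midnight UTC of the given date as a Unix timestamp in milliseconds.
# Using UTC is an assumption of this sketch.
def to_millis(date)
  Time.utc(date.year, date.month, date.day).to_i * 1000
end

# Request bodies for monthly windows, from n months back up to the
# current month, oldest first.
def segment_windows(today, n)
  start_date = first_of_month(today)
  n.downto(0).map do |months_back|
    segment_start = start_date << months_back # go back `months_back` months
    segment_end = segment_start >> 1          # i.e. one month fewer subtracted
    { 'startTime' => to_millis(segment_start),
      'endTime' => to_millis(segment_end),
      'buildType' => 'BUILD' }
  end
end

segment_windows(Date.today, 2).each { |body| puts JSON.generate(body) }
```

Each hash printed here is one request body. In the full script it would be sent once per cube and per window.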
Listing 1. Script for automating the processing of Kylin cube segments