June 23, 2023
Apache Sqoop - installation and configuration
Analysts often have to group and merge data, and these operations are resource-intensive in relational databases. Online analytical processing (OLAP) is a technology for organizing large commercial databases and supporting complex analysis: it makes it possible to run complex analytical queries without negatively impacting transactional systems. OLAP data is pre-calculated and aggregated, which speeds up analysis.
Apache Kylin is an open-source distributed analytical data warehouse designed to provide OLAP (online analytical processing) in the Big Data era. Distributed computing and data storage provide a number of benefits, such as scalability, fault tolerance and load balancing.
When building a Kylin OLAP cube, you need to arrange for populating the Data Lake (Hadoop) from which data will be processed into the cube. One of the most common ways to populate Hadoop is to import data from relational databases. To cover this case, Apache has developed dedicated software: Apache Sqoop.
Apache Sqoop(TM) is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop achieves high import and export performance by using the MapReduce framework, which parallelizes data processing.
In our company, Sqoop was used to implement data integration from SQL Server into the Hadoop Distributed File System (HDFS). HDFS was deployed in the kylin docker container, based on the apachekylin/apache-kylin-standalone:kylin-4.0.1-mondrian image from Docker Hub; Sqoop was installed in the same container. The detailed installation and configuration steps for Sqoop are described below.
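All of the commands below are executed inside the running container. A minimal sketch of opening a shell in it (the container name kylin follows our setup and may differ in yours):
docker exec -it kylin bash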
1. Downloading the Sqoop archive into the docker container. To download the Sqoop archive into the current directory, use the following command: wget https://archive.apache.org/dist/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
Figure 1: Downloading the Sqoop archive into the current directory.
2. Extracting the Sqoop archive. To extract the Sqoop archive to the current directory, use the following command: tar -xvf sqoop-1.4.4.bin__hadoop-2.0.4-alpha.tar.gz
Figure 2: Extracting the Sqoop archive to the current directory.
3. Transferring the Sqoop directory to the working directory. To move the Sqoop directory from the current directory to the working directory, use the following command: mv $PWD/sqoop-1.4.4.bin__hadoop-2.0.4-alpha /home/admin
Figure 3: Transferring the Sqoop directory to the working directory.
4. Configuring bashrc. To configure the Sqoop environment, add the following lines to the file ~/.bashrc:
export SQOOP_HOME=/home/admin/sqoop-1.4.4.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
Figure 4: Configuring bashrc.
Apply the changes by reloading the bashrc file: source ~/.bashrc
Figure 5: Running bashrc.
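For reference, both export lines can also be appended and applied in a single step (a sketch assuming the same installation path as above):
cat >> ~/.bashrc <<'EOF'
export SQOOP_HOME=/home/admin/sqoop-1.4.4.bin__hadoop-2.0.4-alpha
export PATH=$PATH:$SQOOP_HOME/bin
EOF
source ~/.bashrc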
5. Linking Sqoop to Hadoop. To link Sqoop to Hadoop, the following commands must be run:
5.1 Go to the $SQOOP_HOME/conf directory: cd $SQOOP_HOME/conf
5.2 Using the sqoop-env-template.sh template, create the working configuration file sqoop-env.sh: mv sqoop-env-template.sh sqoop-env.sh
Figure 6: Creating sqoop-env.sh.
5.3 Open sqoop-env.sh and edit the following lines:
5.3.1 export HADOOP_COMMON_HOME=*path to the Hadoop root directory*
5.3.2 export HADOOP_MAPRED_HOME=*path to the Hadoop root directory*
Figure 7: Linking Sqoop to Hadoop.
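A minimal sketch of the resulting sqoop-env.sh; the Hadoop path shown here is an assumption and should be replaced with the actual Hadoop root directory inside your container (for example, the value of $HADOOP_HOME):
# sqoop-env.sh
export HADOOP_COMMON_HOME=/opt/hadoop   # assumed path, adjust to your installation
export HADOOP_MAPRED_HOME=/opt/hadoop   # assumed path, adjust to your installation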
6. Checking that the Sqoop installation is correct. To verify that Sqoop has been installed, run the following command: sqoop version
Figure 8: Checking the Sqoop installation.
7. Installing the JDBC driver. To implement integration between a relational database and Hadoop via Sqoop, you must place the JDBC driver for the corresponding DBMS in the following directory: $SQOOP_HOME/lib.
In our case, the JDBC driver for SQL Server, sqljdbc4-2.0.jar, was placed in that directory.
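A minimal sketch of placing the driver (the source path of the jar is illustrative; use wherever you downloaded it to):
cp /path/to/sqljdbc4-2.0.jar $SQOOP_HOME/lib/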
8. Example of using Sqoop:
sqoop import \
  --connect "jdbc:sqlserver://IP:Port;database=databaseName" \
  --username XXXXXX \
  --password XXXXXX \
  --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
  --table "tableName" \
  --target-dir "hdfs path" \
  -m 1
In this example, all data from the specified SQL Server table is imported into the specified HDFS directory.
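After the import completes, the result can be inspected directly in HDFS; with -m 1 Sqoop runs a single mapper and writes one output file (part-m-00000). A minimal check, where "hdfs path" is the directory passed to --target-dir above:
hdfs dfs -ls "hdfs path"
hdfs dfs -cat "hdfs path"/part-m-00000 | head
Increasing the -m value (together with --split-by on a suitable column) lets Sqoop import the table with several parallel mappers, which is how the MapReduce-based parallelism mentioned above is put to work.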