PRATHYUSHA ENGINEERING COLLEGE
DEPARTMENT OF ARTIFICIAL
INTELLIGENCE AND DATA SCIENCE
LAB MANUAL
for
CCS334 – BIGDATA ANALYTICS
(Regulation 2021, V Semester, Theory cum Practical
Course)
ACADEMIC YEAR: 2023 – 2024
(ODD Semester)
Exp.1 Downloading and installing Hadoop, Understanding different Hadoop modes,
Startup scripts, Configuration files.
AIM:
To Download and install Hadoop, Understanding different Hadoop modes, Startup
scripts, Configuration files.
Procedure:
Step 1 : Install Java Development Kit
The default Ubuntu repositories contain Java 8 and Java 11 both. But, Install Java 8 because
hive only works on this version. Use the following command to install it.
$sudo apt update&&sudo apt install openjdk-8-jdk
Step 2 : Verify the Java version
Once installed, verify the installed version of Java with the following command:
$ java -version
Output:
Step 3: Install SSH
SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality, and allows
for efficient distributed processing of data across the cluster.
$sudo apt install ssh
Step 4 : Create the hadoop user :
All the Hadoop components will run as the user that you create for Apache Hadoop, and the
user will also be used for logging in to Hadoop‟s web interface.
Run the command to create user and set password:
$ sudo adduser hadoop
Output:
Step 5 : Switch user
Switch to the newly created hadoop user:
$ su - hadoop
Step 6 : Configure SSH
Now configure password-less SSH access for the newly created hadoop user, so didn‟t enter
the key to save file and passphrase. Generate an SSH keypair (generate Public and Private
Key Pairs)first
$ssh-keygen -t rsa
Step 7 : Set permissions :
Next, append the generated public keys from id_rsa.pub to authorized_keys and set proper
permission:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
Step 8 : SSH to the localhost
Next, verify the password less SSH authentication with the following command:
$ ssh localhost
You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes and hit
Enter to authenticate the localhost:
Step 9 : Switch user
Again switch to hadoop. So, First, change the user to hadoop with the following command:
$ su–hadoop
Step 10 : Install hadoop
Next, download the latest version of Hadoop using the wget command:
$ wgethttps://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once downloaded, extract the downloaded file:
$ tar -xvzf hadoop-3.3.6.tar.gz
Next, rename the extracted directory to hadoop:
$ mv hadoop-3.3.6 hadoop
Next, you will need to configure Hadoop and Java Environment Variables on your system.
Open the ~/.bashrc file in your favorite text editor. Use nano editior , to pasting the code we
use ctrl+shift+v for saving the file ctrl+x and ctrl+y ,then hit enter:
Next, you will need to configure Hadoop and Java Environment Variables on your system.
Open the ~/.bashrc file in your favorite text editor:
$ nano ~/.bashrc
Append the below lines to file.
Save and close the file. Then, activate the environment variables with the following
command:
s$ source ~/.bashrc
Next, open the Hadoop environment variable file:
$ nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the “export JAVA_HOME” and configure it.
JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Save and close the file when you are finished.
Step 11 : Configuring Hadoop :
First, you will need to create the namenode and datanode directories inside the Hadoop user
home directory. Run the following command to create both directories:
$ cd hadoop/
$mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
Next, edit the core-site.xml file and update with your system hostname:
$nano $HADOOP_HOME/etc/hadoop/core-site.xml
Change the following name as per your system hostname:
Save and close the file.
Then, edit the hdfs-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Change the NameNode and DataNode directory paths as shown below:
Then, edit the mapred-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes:
Then, edit the yarn-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes:
Save the file and close it .
Step 12 – Start Hadoop Cluster
Before starting the Hadoop cluster. You will need to format the Namenode as a hadoop user.
Run the following command to format the Hadoop Namenode:
$hdfsnamenode –format
Once the namenode directory is successfully formatted with hdfs file system, you will see the
message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully
formatted “
Then start the Hadoop cluster with the following command.
$ start-all.sh
You can now check the status of all Hadoop services using the jps command:
$ jps
Step 13 – Access Hadoop Namenode and Resource Manager
First we need to know our ipaddress,In Ubuntu we need to install net-tools to run ipconfig
command,
If you installing net-tools for the first time switch to default user:
$sudo apt install net-tools
Then run ifconfig command to know our ip address:
ifconfig
Here my ip address is 192.168.1.6.
To access the Namenode, open your web browser and visit the URL http://your-server-
ip:9870.
You should see the following screen:
http://192.168.1.6:9870
To access Resource Manage, open your web browser and visit the URL http://your-server-
ip:8088. You should see the following screen:
http://192.168.16:8088
Step 14 – Verify the Hadoop Cluster
At this point, the Hadoop cluster is installed and configured. Next, we will create some
directories in the HDFS filesystem to test the Hadoop.
Let‟s create some directories in the HDFS filesystem using the following command:
$ hdfsdfs -mkdir /test1
$ hdfsdfs -mkdir /logs
Next, run the following command to list the above directory:
$ hdfs dfs -ls /
You should get the following output:
Also, put some files to hadoop file system. For the example, putting log files from host
machine to hadoop file system.
$ hdfs dfs -put /var/log/* /logs/
You can also verify the above files and directory in the Hadoop Namenode web interface.
Go to the web interface, click on the Utilities => Browse the file system. You should see your
directories which you have created earlier in the following screen:
Step 15 – Stop Hadoop Cluster
To stop the Hadoop all services, run the following command:
$ stop-all.sh
Result:
The step-by-step installation and configuration of Hadoop on Ubutu linux system have been
successfully completed.
Exp 2: Hadoop Implementation of file management tasks, such as Adding files
and directories, retrieving files and Deleting files
AIM:
To implement the file management tasks, such as Adding files and directories, retrieving files
and Deleting files.
DESCRIPTION: -
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running
on top of the underlying filesystem of the operating system. HDFS keeps track of where the
data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command
line utilities that work similarly to the Linux file commands and serve as your primary
interface with HDFS.
The most common file management tasks in Hadoop, which include:
● Adding files and directories to HDFS
● Retrieving files from HDFS to local filesystem
● Deleting files from HDFS
PROCEDURE:
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM
HDFS
Step-1: Adding Files and Directories to HDFS Before you can run Hadoop programs on data
stored in HDFS, you„ll need to put the data into HDFS first. Let„s create a directory and put a
file in it. HDFS has a default working directory of /user/$USER, where $USER is your login
username. This directory isn„t automatically created for you, though, so let„s create it with
the mkdir command. Login with your hadoop user
Firstly, we start those Hadoop service by running this command on terminal :
start-all.sh
For the purpose of illustration, we use chuck. You should substitute your user name in
the example commands.
hadoop fs -mkdir /chuck
hadoop fs -put example.txt /chuck
Step-2 : Retrieving Files from HDFS The Hadoop command get copies files from HDFS
back to the local filesystem. To retrieve example.txt, we can run the following command.
hadoop fs –cat /chuck/example.txt
● Command for creating a directory in hdfs is “hdfs dfs -mkdir /lendicse”
● Adding files to the directory is done through the command
“hdfs dfs –mkdir sanjay_english.txt
Step-3:
Deleting Files from HDFS
hadoop fs -rm example.txt
The above command will delete the file named as “example.txt”
Step-4:
Copying Data from NFS to HDFS
First create set of glossaries as text file.
nano glossary
Put your glossary text in their Copying from directory command is
hdfs dfs -copyFromLocal /home/hadoop/gloassary /sanjay_english
View the file by using the command “hdfs dfs -cat /sanjay_english/glossary”
● Command for listing of items in Hadoop is hdfs dfs -ls hdfs://localhost:9000/
● Command for Deleting files is “hdfs dfs –rmdir /lendicse
EXPECTED OUTPUT:
Result:
Thus, the implementation of the file management tasks, such as Adding files and directories,
retrieving files and Deleting files in hadoop have been successfully completed.
EXP 3: Implement of Matrix Multiplication with Hadoop Map Reduce
AIM:
To implement the Matrix Multiplication with Hadoop Map Reduce
Procedure:
Step 1: Before writing the code let‟s first create matrices and put them in HDFS.
● Create two files M1, M2 and put the matrix values. (seperate columns with spaces and
rows with a line break)
Save the matrices using nano command
$nano m1
123
456
$ nano m2
78
9 10
11 12
● Put the above files to HDFS at location /user/path/to/matrices/
hdfsdfs -mkdir /user/path/to/matrices
hdfsdfs -put /path/to/M1 /user/path/to/matrices/
hdfsdfs -put /path/to/M2 /user/path/to/matrices/
Step 2: Mapper Function:
Create a mapper function to process the input chunks and generate intermediate key-
value pairs. In the context of matrix multiplication, the mapper processes submatrices A(i,k)
and B(k,j), generating pairs (j, (i,k,B(k,j))) and (i, (j,k,A(i,k))), with i, j, and k as indices.
#!/usr/bin/env python3
import sys
m_r = 2
m_c = 3
n_r = 3
n_c = 2
i=0
for line in sys.stdin:
el = list(map(int, line.split())
if i< m_r:
for j in range(len(el)):
for k in range(n_c):
print("{0}\t{1}\t{2}\t{3}".format(i, k, j, el[j]))
else:
for j in range(len(el)):
for k in range(m_r):
print("{0}\t{1}\t{2}\t{3}".format(k, j, i - m_r, el[j])
i=i+1
Step 3: Reducer Function:
Develop a reducer function to aggregate the intermediate key-value pairs into the final
output. For matrix multiplication, the reducer combines pairs (j, [(i,k,B(k,j)), ...]) and (i,
[(j,k,A(i,k)), ...]) to compute C(i,j) = sum(A(i,k) * B(k,j)).
#!/usr/bin/env python3
import sys
m_r = 2
m_c = 3
n_r = 3
n_c = 2
matrix = []
for row in range(m_r):
r = []
for col in range(n_c):
s=0
for el in range(m_c):
mul = 1
for num in range(2):
line = input() # Read input from standard input
n = list(map(int, line.split('\t')))[-1]
mul *= n
s += mul
r.append(s)
matrix.append(r)
# Print the matrix
for row in matrix:
print('\t'.join([str(x) for x in row]))
Step 4: Running the Map-Reduce Job on Hadoop
You can run the map reduce job and view the result by the following code (considering you
have already put input files in HDFS)
To download the Hadoop jar file, use the following command:
Wget https://jar-download.com/download-handling.php
$ chmod +x /path/to/Mapper.py
$ chmod +x /path/to/Reducer.py
$ hadoop jar /path/to/hadoop-streaming.jar
-input /user/path/to/matrices/
-output /user/path/to/mat_output
-mapper /path/to/Mapper.py
-reducer /path/to/Reducer.py
This will take some time as Hadoop do its mapping and reducing work. After the successful
completion of the above process view the output by:
hdfs dfs -cat /user/path/to/mat_output/*
Output:
Result:
Thus, the matrix multiplication using map reduce program has been successfully completed.
EXP 4: Run a basic Word Count Map Reduce program to understand Map
Reduce Paradigm.
AIM:
To run a basic Word Count MapReduce program.
Procedure:
Step 1: Create Data File:
Create a file named "word_count_data.txt" and populate it with text data that you wish to
analyse.
Login with your hadoop user.
nano word_count.txt
Output:
Step 2: Mapper Logic - mapper.py:
Create a file named "mapper.py" to implement the logic for the mapper. The mapper
will read input data from STDIN, split lines into words, and output each word with its count.
nano mapper.py
# Copy and paste the mapper.py code
#!/usr/bin/env python3
# import sys because we need to read and write data to STDIN and STDOUT
import sys
# reading entire line from STDIN (standard input)
for line insys.stdin:
# to remove leading and trailing whitespace
line = line.strip()
# split the line into words
words = line.split()
# we are looping over the words array and printing the word
# with the count of 1 to the STDOUT
for word in words:
# write the results to STDOUT (standard output);
# what we output here will be the input for the
# Reduce step, i.e. the input for reducer.py
print('%s\t%s' % (word, 1))
Here in the above program #! is known as shebang and used for interpreting the script.
The file will be run using the command we are specifying.
Step 3: Reducer Logic - reducer.py:
Create a file named "reducer.py" to implement the logic for the reducer. The reducer
will aggregate the occurrences of each word and generate the final output.
nano reducer.py
# Copy and paste the reducer.py code
reducer.py
#!/usr/bin/env python3
from operator importitemgetter
import sys
current_word = None
current_count = 0
word = None
for line insys.stdin:
line = line.strip()
word, count = line.split('\t', 1)
try:
count = int(count)
exceptValueError:
continue
ifcurrent_word == word:
current_count += count
else:
ifcurrent_word:
print('%s\t%s' % (current_word, current_count))
current_count = count
current_word = word
if current_word == word:
print('%s\t%s' % (current_word, current_count))
Step 4: Prepare Hadoop Environment:
Start the Hadoop daemons and create a directory in HDFS to store your data.
start-all.sh
hdfsdfs -mkdir /word_count_in_python
hdfsdfs -copyFromLocal /path/to/word_count.txt/word_count_in_python
Step 6: Make Python Files Executable:
Give executable permissions to your mapper.py and reducer.py files.
chmod 777 mapper.py reducer.py
Step 7: Run Word Count using Hadoop Streaming:
Download the latest hadoop-streaming jar file and place it in a location you can
easily access.
Then run the Word Count program using Hadoop Streaming.
hadoop jar /path/to/hadoop-streaming-3.3.6.jar \
-input /word_count_in_python/word_count_data.txt \
-output /word_count_in_python/new_output \
-mapper /path/to/mapper.py \
-reducer /path/to/reducer.py
Step 8: Check Output:
Check the output of the Word Count program in the specified HDFS output
directory.
hdfs dfs -cat /word_count_in_python/new_output/part-00000
Result:
Thus, the program for basic Word Count Map Reduce has been executed successfully.
Exp5: Installation of Hive on Ubuntu
Aim:
To Download and install Hive, Understanding Startup scripts, Configuration files.
Procedure:
Step 1: Download and extract it
Download the Apache hive and extract it use tar, the commands given below:
$wgethttps://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
$ tar –xvf apache-hive-3.1.2-bin.tar.gz
Step 2: Place different configuration properties in Apache Hive
In this step, we are going to do two things
o Placing Hive Home path in bashrc file
$nano .bashrc
And append the below lines in it
2. Exporting Hadoop path in Hive-config.sh (To communicate with the Hadoop eco
system we are defining Hadoop Home path in hive config field) Open the hive-
config.sh as shown in below
$cd apache-hive-3.1.2-bin/bin
$cp hive-env.sh.template hive-env.sh
$nano hive-env.sh
Append the below commands on it
export HADOOP_HOME=/home/Hadoop/Hadoop
export HIVE_CONF_DIR=/home/Hadoop/apache-hive-3.1.2/conf
Step 3: Install mysql
1. Install mysql in Ubuntu by running this command:
$sudo apt update
$sudo apt install mysql-server
2. Alter username and password for MySQLby running below commands:
$sudomysql
Pops command line interface for MySQLand run the below SQL queries to change username
and set password
mysql> SELECT user, host, plugin FROM mysql.user WHERE user = 'root';
mysql> ALTER USER 'root'@'localhost' IDENTIFIED WITH 'mysql_native_password' BY
'your_new_password';
mysql> FLUSH PRIVILEGES;
Step 4:Config hive-site.xml
Config the hive-site.xml by appending this xml code and change the username and password
according to your MySQL.
$cd apache-hive-3.1.2-bin/bin
$cp hive-default.xml.template hive-site.xml
$nano hive-site.xml
Append these lines into it
Replace root as your username of MySQL
Replaceyour_new_password as with your password of MySQL
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_new_password</value>
</property>
<property>
<name>datanucleus.autoCreateSchema</name>
<value>true</value>
</property>
<property>
<name>datanucleus.fixedDatastore</name>
<value>true</value>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>True</value>
</property>
</configuration>
Step 5: Setup MySQL java connector:
First, you'll need to download the MySQL Connector/J, which is the JDBC driver for
MySQL. You can download it from the below link
https://drive.google.com/file/d/1QFhB7Kvcat7a4LzDRe6GcmZva1yAxKz-
/view?usp=drive_link
Copy the downloaded MySQL Connector/J JAR file to the Hive library directory. By default,
the Hive library directory is usually located at/path/to/apache-hive-3.1.2/lib/on Ubuntu. Use
the following command to copy the JAR file:
$sudo cp /path/to/mysql-connector-java-8.0.15.jar /path/to/apache-hive-3.1.2/lib/
Replace /path/to/ with the actual path to the JAR file.
Step 6:Initialize the Hive Metastore Schema:
Run the following command to initialize the Hive metastore schema:
$$HIVE_HOME/bin/schematool -initSchema -dbTypemysql
Step 7: Start hive:
You can test Hive by running the Hive shell: Copy code hive You should be able to run Hive
queries, and metadata will be stored in your MySQL database.
$hive
Result:
Thus, the Apache Hive installation is completed successfully on Ubuntu.
Exp 6: Design and test various schema models to optimize data storage and retrieval
Using Hive.
Aim:
To Design and test various schema models to optimize data storage and retrieval Using
Hbase.
Procedure:
Step 1: Start Hive
Open a terminal and start Hive by running:
$hive
Step 2: Create a Database
Create a new database in Hive:
hive>CREATE DATABASE financials;
Step 3: Use the Database:
Switch to the newly created database:
hive>use financials;
Step 4: Create a Table:
Create a simple table in your database:
hive>CREATE TABLE finance_table( id INT, name STRING );
Step 5: Load Sample Data:
You can insert sample data into the table:
hive>INSERT INTO finance_tableVALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
Step 6: Query Your Data
Use SQL-like queries to retrieve data from your table:
hive>CREATE VIEW myview AS SELECT name, id FROM finance_table;
Step 7: View the data:
To see the data in the view, you would need to query the view
hive>SELECT*FROM myview;
Step 8: Describe a Table:
You can describe the structure of a table using the DESCRIBE command:
hive>DESCRIBE finance_table;
Step 9: Alter a Table:
You can alter the table structure by adding a new column:
hive>ALTER TABLE finance_table ADD COLUMNS (age INT);
Step 10: Quit Hive:
To exit the Hive CLI, simply type:
hive>quit;
Result:
Thus, the usage of various commands in Hive has been successfully completed.
Exp 7: Installation of Hbase on Ubuntu
Aim:
To download and install Hbase, Understanding different Hbase modes, Startup
scripts, Configuration files.
Procedure:
Step 1. Please verify if Hadoop is installed.
Step 2. Please verify if Java is installed.
Step 3. Please download HBase 2.5.5 from the below link.
$wget https://downloads.apache.org/hbase/2.3.0/hbase-2.5.5-bin.tar.gz
Step 4. Let us extract the tar file using the below command and rename the folder to HBase
to make it meaningful.
$tar -xzf hbase-2.5.5-bin.tar.gz
$mv hbase-2.5.5hbase
Step 5. Now edit (hbase-env.sh) configuration file which is present under conf in hbase and
add JAVA path as mentioned below.
/hbase/conf$ nano hbase-env.sh
Now put JAVA path.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/
Step 6. After this edit (.bashrc) file to update the environment variable of Apache HBase so
that it can be accessed from any directory.
nano .bashrc
Add below path.
export HBASE_HOME= /home/hadoop/hbase
export PATH=$PATH:$HBASE_HOME/bin
Step 7. Now config hbase-site.xml by append these xml code .
cd ..
mkdirhbasfile
mkdir zookeeper
chmod 777 hbasfile zookeeper
cd hbase/conf
nano hbase-site.xml
<property>
<name>hbase.rootdir</name>
<value>file:///home/hadoop/hbasfile</value>
</property>
<property>
<name>hbase.zookeeper.property.datadir</name>
<value>/home/Hadoop/zookeeper</value>
</property>
Step 8. Now start Apache HBase and verify it using the below commands.
cd hbase
bin/start-hbase.sh
jps
Step 9. After this we can see Hbase services are running from the JPS command, now let us
start the HBase shell using the below command.
bin/hbase shell
Result:
Thus, the Apache HBase installation is completed successfully on Ubuntu.
Exp 8: Design and test various schema models to optimize data storage and
retrieval Using Hbase.
Aim:
To design and test various schema models to optimize data storage and retrieval Using
Hbase.
Procedure:
Step-by-step schema design experiment with HBase using sample commands:
1. Start HBase:
- If HBase is not running, start it with the following command:
Start -hbase.sh
hbase shell
2. Create a Table:
Create a sample HBase table named "student" with a column family "info" using the
HBase shell:
hbase:002:0>create 'student', 'info'
3. Insert Data:
- Insert sample data into the "student" table. Let's assume you're storing student information
with student IDs as row keys:
hbase:003:0>put 'student', '1', 'info:name', 'Alice'
hbase:004:0>put 'student', '1', 'info:age', '20'
hbase:005:0>put 'student', '2', 'info:name', 'Bob'
hbase:006:0>put 'student', '2', 'info:age', '22'
4.Query Data:
- Retrieve student information using the row key:
hbase:008:0>get 'student', '1'
5. Add Another Column Family:
- You decide to add a new column family "grades" to store student grades.
hbase:00.9:0>alter 'student', {NAME => 'grades', VERSIONS => 1}
6. Insert Data into the New Column Family:
- Insert sample grade data into the "grades" column family:
put 'student', '1', 'grades:math', 'A'
put 'student', '2', 'grades:math', 'B'
7. Query Data from Multiple Column Families:
- Retrieve student information and grades for a student:
get 'student', '1', {COLUMN => ['info', 'grades']}
8. Row Key Design:
- Realize that using student IDs as row keys is not efficient for querying all students.
Redesign the row key to use a composite key that includes the year of enrollment:
put 'student', '2023_1', 'info:name', 'Carol'
put 'student', '2023_2', 'info:name', 'David'
9. Query Data with New Row Key:
- Retrieve student information with the updated row key:
hbase:00.10:0>get 'student', '2023_1'
10. Cleanup and Stop HBase:
- When you're done, you can drop the "student" table and stop HBase:
hbase:017:0>disable 'student'
hbase:018:0>drop 'student'
Result:
Thus, the execution of various Hbase commands has been successfully completed.
Exp 9: Practice importing and exporting data from various databases.
Aim:
To practice importing and exporting data from various databases.
Procedure:
SQOOP is basically used to transfer data from relational databases such as MySQL, Oracle
to data warehouses such as Hadoop HDFS (Hadoop File System). Thus, when data is
transferred from a relational database to HDFS, we say we are importing data. Otherwise,
when we transfer data from HDFS to relational databases, we say we are exporting data.
Note: To import or export, the order of columns in both MySQL and Hive should be the
same.
i)Installation of Sqoop:
Step 1: Download the stable version of Apache SqoopWebsite
URL https://archive.apache.org/dist/sqoop/1.4.7/
$wgethttps://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 2: Unzip the downloaded file using the tar command
$tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 3: Edit the .bashrc file by using the command
$nano .bashrc
$source .bashrc
Step 4: Set up libraries for sqoop:
Download commons lang from the given link and extract it
$wgethttps://dlcdn.apache.org/commons/lang/binaries/commons-lang-2.6-bin.tar.gz
$tar -xvfcommons-lang-2.6-bin.tar.gz
Copy the commons lang.jar file into sqoop-1.4.7/lib.
Next, Copy the mysql-java-connector from hive lib and paste it in sqoop-1.4.7/lib.
Step 5: Check the installed sqoop version using the below command
$sqoop version
ii)Importing data from MySQL to HDFS
In order to store data into HDFS, we make use of Apache Hive which provides an SQL-like
interface between the user and the Hadoop distributed file system (HDFS) which integrates
Hadoop. We perform the following steps:
Step 1: Login into MySQL
$mysql -u root -pyour_new_passwordSt
Step 2: Create a database and table and insert data.
mysql>create database importtohadoop;
mysql>use importtohadoop;
mysql>create table album_details(album_namevarchar(65), year int, artist varchar(65));
(“Rohan”,10,123456789,”Lucknow”);
Insert the values using this command
mysql> insert into album_detailsvalues
Step 3: Create a database and table in the hive where data should be imported.
hive>create database album_hive;
hive>use album_hive;
hive>create table album_details_hive(album_name varchar(65), year int, artist
varchar(65));
Step 4 : Run this command on terminal :
$sqoop export --connect "jdbc:mysql://127.0.0.1:3306/importtohadoop?useSSL=false" \
--username root --password your_new_password \
--table album_details\
--hive-import –hive-table album_hive.album_hive_table \
--m 1 \
Step 5: Check-in hive if data is imported successfully or not.
hive> use album_hive;
hive> select*from album_hive_table;
iii)To export data into MySQL from HDFS, perform the following steps:
Step 1: Create a database and table in the hive.
hive>create table hive_table_export(album_namevarchar(65),company, year int, artist
varchar(65));
Step 2: Insert data into the hive table.
hive>insert into hive_table_export values
hive>select*from hive_table_export;
Step 3: Create a database and table in MySQL in which data should be exported.
Step 4: Run the following command on Terminal.
$sqoop export –connect
"jdbc:mysql://127.0.0.1:3306/mysql_export?useSSL=false" \
--username root --password your_new_password \
--table mysql_table_export \
--export-dir /user/hive/warehouse/hive_export.db/hive_table_export \
--input-fields-terminated-by ',' \
--m 1 \
--driver com.mysql.cj.jdbc.Driver
Step 5: Check-in MySQL if data is exported successfully or not.
mysql> select*from mysql_table_export;
Result:
Importing and exporting data between MySQL and Hive implemented successfully.