Exp 1: Downloading and installing Hadoop, Understanding different Hadoop modes, Startup scripts, Configuration files.
AIM:
To download and install Hadoop, and to understand the different Hadoop modes, startup scripts, and configuration files.
Procedure:
Step 1: Install Java Development Kit
The default Ubuntu repositories contain both Java 8 and Java 11. Install Java 8, because Hive works only with this version. Use the following command to install it:
$sudo apt update && sudo apt install openjdk-8-jdk
Step 2 : Verify the Java version
Once installed, verify the installed version of Java with the following command:
$ java -version
Output:
Step 3: Install SSH
SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality, and allows
for efficient distributed processing of data across the cluster.
$sudo apt install ssh
Step 4: Create the hadoop user :
All the Hadoop components will run as the user that you create for Apache Hadoop, and
the user will also be used for logging in to Hadoop’s web interface.
Run the following command to create the user and set a password:
$sudo adduser hadoop
Output:
Step 5 : Switch user
Switch to the newly created hadoop user:
$ su - hadoop
Step 6 : Configure SSH
Now configure password-less SSH access for the newly created hadoop user. When prompted for the key file location and passphrase, just press Enter to accept the defaults. First, generate an SSH key pair (public and private keys):
$ssh-keygen -t rsa
Step 7: Set permissions:
Next, append the generated public keys from id_rsa.pub to authorized_keys and set proper
permission:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
Step 8 : SSH to the localhost
Next, verify the password-less SSH authentication with the following command:
$ ssh localhost
You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes and
hit Enter to authenticate the localhost:
Step 9 : Switch user
Switch back to the hadoop user with the following command:
$ su - hadoop
Step 10: Install Hadoop
Next, download Hadoop 3.3.6 using the wget command:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once downloaded, extract the downloaded file:
$ tar -xvzf hadoop-3.3.6.tar.gz
Next, rename the extracted directory to hadoop:
$ mv hadoop-3.3.6 hadoop
Next, you will need to configure the Hadoop and Java environment variables on your system.
Open the ~/.bashrc file in your favorite text editor (nano is used here; paste with Ctrl+Shift+V, then save with Ctrl+X followed by Y and Enter):
$ nano ~/.bashrc
Append the following lines to the file.
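(The exact lines are not reproduced in the original manual. A typical set, assuming Hadoop was extracted to /home/hadoop/hadoop and OpenJDK 8 is installed at the path used earlier, is:)
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64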
Save and close the file. Then, activate the environment variables with the following
command:
$ source ~/.bashrc
Next, open the Hadoop environment variable file:
$ nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the “export JAVA_HOME” line and configure it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Save and close the file when you are finished.
Step 11 : Configuring Hadoop :
First, you will need to create the namenode and datanode directories inside the Hadoop
user home directory. Run the following command to create both directories:
$cd hadoop/
$mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}
Next, edit the core-site.xml file and update with your system hostname:
$nano $HADOOP_HOME/etc/hadoop/core-site.xml
Change the following name as per your system hostname:
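(The XML itself is not reproduced in the original. A typical entry, given as an assumption, points fs.defaultFS at the NameNode on port 9000; replace localhost with your hostname:)
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>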
Save and close the file.
Then, edit the hdfs-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
Change the NameNode and DataNode directory paths as shown below:
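(The XML is not reproduced in the original. A typical configuration matching the directories created above, given as an assumption, is:)
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>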
Then, edit the mapred-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/mapred-site.xml
Make the following changes:
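(The XML is not reproduced in the original. A minimal typical change, given as an assumption, tells MapReduce to run on YARN:)
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>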
Then, edit the yarn-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
Make the following changes:
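(The XML is not reproduced in the original. A minimal typical change, given as an assumption, enables the MapReduce shuffle service in the NodeManager:)
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>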
Save the file and close it .
Step 12 – Start Hadoop Cluster
Before starting the Hadoop cluster, you will need to format the NameNode as the hadoop user.
Run the following command to format the Hadoop NameNode:
$ hdfs namenode -format
Once the NameNode directory is successfully formatted with the HDFS file system, you will see the message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted”.
Then start the Hadoop cluster with the following command.
$ start-all.sh
You can now check the status of all Hadoop services using the jps command:
$ jps
Step 13 – Access Hadoop Namenode and Resource Manager
First, we need to know our IP address. On Ubuntu, the ifconfig command is provided by the net-tools package.
If you are installing net-tools for the first time, switch to a sudo-capable user and run:
$ sudo apt install net-tools
Then run the ifconfig command to find the IP address:
$ ifconfig
Here my ip address is 192.168.1.6.
To access the NameNode, open your web browser and visit the URL http://your-server-ip:9870. You should see the following screen:
http://192.168.1.6:9870
To access the Resource Manager, open your web browser and visit the URL http://your-server-ip:8088. You should see the following screen:
http://192.168.1.6:8088
Step 14 – Verify the Hadoop Cluster
At this point, the Hadoop cluster is installed and configured. Next, we will create some
directories in the HDFS filesystem to test the Hadoop.
Let’s create some directories in the HDFS filesystem using the following command:
$ hdfs dfs -mkdir /test1
$ hdfs dfs -mkdir /logs
Next, run the following command to list the above directory:
$ hdfs dfs -ls /
You should get the following output:
Also, put some files into the Hadoop file system. For example, put the log files from the host machine into the Hadoop file system:
$ hdfs dfs -put /var/log/* /logs/
You can also verify the above files and directory in the Hadoop Namenode web interface.
Go to the web interface, click on the Utilities => Browse the file system. You should see your
directories which you have created earlier in the following screen:
Step 15 – Stop Hadoop Cluster
To stop all the Hadoop services, run the following command:
$ stop-all.sh
Result:
The step-by-step installation and configuration of Hadoop on an Ubuntu Linux system has been successfully completed.
Exp 2: Hadoop Implementation of file management tasks, such as
Adding files and directories, retrieving files and Deleting files
AIM:
To implement the file management tasks, such as Adding files and directories, retrieving
files and Deleting files.
DESCRIPTION: -
HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running
on top of the underlying filesystem of the operating system. HDFS keeps track of where the
data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command
line utilities that work similarly to the Linux file commands and serve as your primary
interface with HDFS.
The most common file management tasks in Hadoop, which include:
● Adding files and directories to HDFS
● Retrieving files from HDFS to local filesystem
● Deleting files from HDFS
PROCEDURE:
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE
DATA FROM HDFS
Step-1: Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login username. This directory isn't automatically created for you, though, so let's create it with the mkdir command. Log in with your hadoop user.
First, start the Hadoop services by running this command in the terminal:
start-all.sh
For the purpose of illustration, we use chuck. You should substitute your user name in
the example commands.
hadoop fs -mkdir /chuck
hadoop fs -put example.txt /chuck
Step-2: Retrieving Files from HDFS
The Hadoop get command copies files from HDFS back to the local filesystem, and the cat command prints a file's contents. To view example.txt, we can run the following command:
hadoop fs -cat /chuck/example.txt
Step-3: Deleting Files from HDFS
hadoop fs -rm /chuck/example.txt
● The command for creating a directory in HDFS is “hdfs dfs -mkdir /lendicse”.
● Another directory can be added with the command “hdfs dfs -mkdir /sanjay_english”.
Step-4: Copying Data from the Local Filesystem to HDFS
First, create a set of glossaries as a text file:
nano glossary
Put your glossary text in it. The command for copying it from the local directory into HDFS is
“hdfs dfs -copyFromLocal /home/hadoop/glossary /sanjay_english”.
View the file by using the command “hdfs dfs -cat /sanjay_english/glossary”.
● The command for listing items in Hadoop is “hdfs dfs -ls hdfs://localhost:9000/”.
● The command for deleting the directory is “hdfs dfs -rmdir /lendicse”.
EXPECTED OUTPUT:
Result:
Thus, the implementation of the file management tasks, such as adding files and directories, retrieving files and deleting files in Hadoop, has been successfully completed.
EXP 3: Implementation of Matrix Multiplication with Hadoop MapReduce
AIM:
To implement Matrix Multiplication with Hadoop MapReduce.
Procedure:
Step 1: Before writing the code, let's first create the matrices and put them in HDFS.
● Create two files M1, M2 and put the matrix values in them (separate columns with spaces and rows with a line break).
Save the matrices using the nano editor:
$ nano M1
1 2 3
4 5 6
$ nano M2
7 8
9 10
11 12
● Put the above files to HDFS at location /user/path/to/matrices/
hdfs dfs -mkdir /user/path/to/matrices
hdfs dfs -put /path/to/M1 /user/path/to/matrices/
hdfs dfs -put /path/to/M2 /user/path/to/matrices/
Step 2: Mapper Function:
Create a mapper function to process the input chunks and
generate intermediate key-value pairs. In the context of matrix
multiplication, the mapper processes submatrices A(i,k) and B(k,j),
generating pairs (j, (i,k,B(k,j))) and (i, (j,k,A(i,k))), with i, j, and k as
indices.
#!/usr/bin/env python3
import sys

# Dimensions of M (m_r x m_c) and N (n_r x n_c)
m_r = 2
m_c = 3
n_r = 3
n_c = 2

i = 0
for line in sys.stdin:
    el = list(map(int, line.split()))
    if i < m_r:
        # Row i of M: emit one record per result column k
        for j in range(len(el)):
            for k in range(n_c):
                print("{0}\t{1}\t{2}\t{3}".format(i, k, j, el[j]))
    else:
        # Row (i - m_r) of N: emit one record per result row k
        for j in range(len(el)):
            for k in range(m_r):
                print("{0}\t{1}\t{2}\t{3}".format(k, j, i - m_r, el[j]))
    i = i + 1
Step 3: Reducer Function:
Develop a reducer function to aggregate the intermediate key-
value pairs into the final output. For matrix multiplication, the reducer
combines pairs (j, [(i,k,B(k,j)), ...]) and (i, [(j,k,A(i,k)), ...]) to compute
C(i,j) = sum(A(i,k) * B(k,j)).
#!/usr/bin/env python3
import sys

# Dimensions of M (m_r x m_c) and N (n_r x n_c); the result C is m_r x n_c
m_r = 2
m_c = 3
n_r = 3
n_c = 2

matrix = []
for row in range(m_r):
    r = []
    for col in range(n_c):
        s = 0
        for el in range(m_c):
            mul = 1
            # Each consecutive pair of lines carries the matching
            # elements of M and N for this cell of the result
            for num in range(2):
                line = input()  # Read input from standard input
                n = list(map(int, line.split('\t')))[-1]
                mul *= n
            s += mul
        r.append(s)
    matrix.append(r)

# Print the result matrix
for row in matrix:
    print('\t'.join([str(x) for x in row]))
Step 4: Running the Map-Reduce Job on Hadoop
You can run the map reduce job and view the result by the
following code (considering you have already put input
files in HDFS)
The Hadoop streaming jar is bundled with Hadoop under $HADOOP_HOME/share/hadoop/tools/lib/; alternatively, it can be downloaded with the following command:
wget https://jar-download.com/download-handling.php
$ chmod +x /path/to/Mapper.py
$ chmod +x /path/to/Reducer.py
$ hadoop jar /path/to/hadoop-streaming.jar \
-input /user/path/to/matrices/ \
-output /user/path/to/mat_output \
-mapper /path/to/Mapper.py \
-reducer /path/to/Reducer.py
This will take some time as Hadoop does its mapping and reducing work. After the above process completes successfully, view the output with:
hdfs dfs -cat /user/path/to/mat_output/*
Output:
Result:
Thus, the matrix multiplication MapReduce program has been executed successfully.
EXP 4: Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
AIM:
To run a basic Word Count MapReduce program.
Introduction:
Hadoop Streaming is a feature that comes with Hadoop and
allows users or developers to use various different languages for
writing MapReduce programs like Python, C++, Ruby, etc. It
supports all the languages that can read from standard input and
write to standard output. We will be implementing Python with
Hadoop Streaming and will observe how it works.
We will implement the word count problem in python to understand
Hadoop Streaming. We will be creating mapper.py and reducer.py to
perform map and reduce tasks.
Procedure:
Step 1: Create Data File:
Create a file named "word_count_data.txt" and populate it with text data that you wish to analyse.
Log in with your hadoop user.
nano word_count_data.txt
Step 2: Mapper Logic - mapper.py:
Create a file named "mapper.py" to implement the logic for
the mapper. The mapper will read input data from STDIN, split lines
into words, and output each word with its count.
nano mapper.py
# Copy and paste the mapper.py code
#!/usr/bin/env python3
# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read each line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # loop over the words array and print each word with a count of 1
    # to STDOUT; what we output here will be the input for the Reduce
    # step, i.e. the input for reducer.py
    for word in words:
        print('%s\t%s' % (word, 1))
In the above program, #! is known as a shebang and tells the system which interpreter to use when the script is run.
Step 3: Reducer Logic - reducer.py:
Create a file named "reducer.py" to implement the logic for the reducer.
The reducer will aggregate the occurrences of each word and
generate the final output.
nano reducer.py
# Copy and paste the reducer.py code
#!/usr/bin/env python3
from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue
    # Hadoop sorts the mapper output by key, so all counts for the
    # same word arrive one after another
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
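Before submitting the job to Hadoop, the mapper and reducer can be sanity-checked locally by piping the data file through them with a sort in between (this local test is an addition, not part of the original procedure):
cat word_count_data.txt | python3 mapper.py | sort -k1,1 | python3 reducer.py
The output should already show the per-word counts; the Hadoop Streaming job below performs the same pipeline in a distributed fashion.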
Step 4: Prepare Hadoop Environment:
Start the Hadoop daemons and create a directory in HDFS to store your
data.
start-all.sh
hdfs dfs -mkdir /word_count_in_python
hdfs dfs -copyFromLocal /path/to/word_count_data.txt /word_count_in_python
Step 5: Make Python Files Executable:
Give executable permissions to your mapper.py and reducer.py files.
chmod 777 mapper.py reducer.py
Step 6: Run Word Count using Hadoop Streaming:
Download the latest hadoop-streaming jar file and place it in a
location you can easily access.
Then run the Word Count program using Hadoop Streaming.
hadoop jar /path/to/hadoop-streaming-3.3.6.jar \
-input /word_count_in_python/word_count_data.txt \
-output /word_count_in_python/new_output \
-mapper /path/to/mapper.py \
-reducer /path/to/reducer.py
Step 7: Check Output:
Check the output of the Word Count program in the specified
HDFS output directory.
hdfs dfs -cat /word_count_in_python/new_output/part-00000
output:
Result:
Thus, the program for basic Word Count Map Reduce has been executed
successfully.
Exp 5: Installation of Hive on Ubuntu
Aim:
To download and install Hive, and to understand its startup scripts and configuration files.
Procedure:
Step 1: Download and extract it
Download Apache Hive and extract it using tar, with the commands given below:
$ wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
$ tar -xvf apache-hive-3.1.3-bin.tar.gz
Step 2: Place different configuration properties in Apache Hive
In this step, we are going to do two things
1. Placing Hive Home path in bashrc file
$nano .bashrc
And append the below lines in it
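(The lines themselves are not reproduced in the original. A typical pair, assuming Hive was extracted to the hadoop user's home directory, is:)
export HIVE_HOME=/home/hadoop/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin
Then reload the file with: $ source .bashrc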
2. Exporting the Hadoop path in hive-env.sh (to communicate with the Hadoop ecosystem we define the Hadoop home path in the Hive configuration). Open hive-env.sh as shown below:
$ cd apache-hive-3.1.3-bin/conf
$ cp hive-env.sh.template hive-env.sh
$ nano hive-env.sh
Append the below lines to it:
export HADOOP_HOME=/home/hadoop/hadoop
export HIVE_CONF_DIR=/home/hadoop/apache-hive-3.1.3-bin/conf
export HADOOP_TMP_DIR=/tmp/hadoop-${USER}
Step 3: Install mysql
To download the MySQL Connector JAR file using wget, you can use
the following command:
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-9.0.0.tar.gz
Once downloaded, extract the .tar.gz file:
tar -xzf mysql-connector-j-9.0.0.tar.gz
Then move the .jar file to the Hive lib directory:
mv mysql-connector-j-9.0.0/mysql-connector-j-9.0.0.jar /home/hadoop/apache-hive-3.1.3-bin/lib
1. Install mysql in Ubuntu by running this command:
$sudo apt update
$sudo apt install mysql-server
2. Alter username and password for MySQL by running below
commands:
$sudo mysql
This opens the MySQL command-line interface; run the below SQL queries to change the authentication plugin and set a password:
mysql> SELECT user, host, plugin FROM mysql.user WHERE user = 'root';
mysql> ALTER USER 'root'@'localhost' IDENTIFIED WITH
'mysql_native_password' BY 'your_new_password';
mysql> FLUSH PRIVILEGES;
Step 4: Configure hive-site.xml
Configure hive-site.xml by appending the XML code below, changing the username and password according to your MySQL setup.
$ cd apache-hive-3.1.3-bin/conf
$ cp hive-default.xml.template hive-site.xml
$ nano hive-site.xml
Append these lines into it.
Replace root with your MySQL username.
Replace your_new_password with your MySQL password.
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_new_password</value>
</property>
</configuration>
Step 5: Move the File Using sudo
Run the following command to move hive-site.xml from your Downloads folder to the
Hive configuration directory:
sudo mv /home/hasif/Downloads/hive-site.xml
/home/hadoop/apache-hive-3.1.3-bin/conf/
Verify the Move
After moving the file, you can check whether it is in the correct directory by running:
ls /home/hadoop/apache-hive-3.1.3-bin/conf
You should see hive-site.xml listed there.
Step 6: Create directories:
Create Local Directories: You can create the required directories on your
local filesystem with the following commands:
$mkdir -p /tmp/hadoop
$mkdir -p /tmp/hive
$mkdir -p /tmp/hive/local
This creates the necessary directories for Hadoop and Hive.
Create Directories in HDFS: To create the required directories in HDFS, use
these commands:
$hdfs dfs -mkdir -p /user/hive/warehouse
$hdfs dfs -chmod 755 /user/hive/warehouse
Step 7: Initialize the Hive Metastore Schema:
Run the following command to initialize the Hive metastore schema:
$ cd /home/hadoop/apache-hive-3.1.3-bin/bin
$./schematool -initSchema -dbType mysql
Step 8: Start Hive:
You can test Hive by running the Hive shell:
$ ./hive
You should be able to run Hive queries, and the metadata will be stored in your MySQL database.
Result:
Apache Hive installation has been completed successfully on Ubuntu.
Exp 6: Design and test various schema models to optimize data storage and
retrieval Using Hive.
Aim:
To design and test various schema models to optimize data storage and retrieval using Hive.
Procedure:
Step 1: Start Hive
Open a terminal and start Hive by running:
$hive
Step 2: Create a Database
Create a new database in Hive:
hive>CREATE DATABASE financials;
Step 3: Use the Database:
Switch to the newly created database:
hive> use financials;
Step 4: Create a Table:
Create a simple table in your database:
hive>CREATE TABLE finance_table ( id INT, name STRING );
Step 5: Load Sample Data:
You can insert sample data into the table:
hive> INSERT INTO finance_table VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Charlie');
Step 6: Query Your Data
Use SQL-like queries to retrieve data from your table:
hive>CREATE VIEW myview AS SELECT name, id FROM
finance_table;
Step 7: View the data:
To see the data in the view, you would need to query the view
hive>SELECT*FROM myview;
Step 8: Describe a Table:
You can describe the structure of a table using the DESCRIBE
command:
hive>DESCRIBE finance_table;
Step 9: Alter a Table:
You can alter the table structure by adding a new column:
hive> ALTER TABLE finance_table ADD COLUMNS (age INT);
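As a further schema model aimed at faster retrieval, a partitioned variant of the table can be tried. This is a sketch; the table name and the year partition column are illustrative and not part of the original procedure:
hive> CREATE TABLE finance_partitioned (id INT, name STRING) PARTITIONED BY (year INT);
hive> INSERT INTO finance_partitioned PARTITION (year=2023) VALUES (1, 'Alice');
hive> SELECT * FROM finance_partitioned WHERE year = 2023;
Queries that filter on the partition column read only the matching partition's files, which reduces retrieval time on large tables.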
Step 10: Quit Hive:
To exit the Hive CLI, simply type:
hive>quit;
Result:
These commands demonstrate a simple schema design experiment with
Hive. You can adapt and extend these steps to test and refine your schema
for more complex use cases.
Exp 7: Installation of Hbase on Ubuntu
Aim:
To download and install HBase, and to understand the different HBase modes, startup scripts, and configuration files.
Procedure:
Step 1. Please verify if Hadoop is installed.
Step 2. Please verify if Java is installed.
Step 3. Please download HBase 2.5.10 from the below link.
$ wget https://archive.apache.org/dist/hbase/2.5.10/hbase-2.5.10-bin.tar.gz
Step 4. Let us extract the tar file using the below command and
rename the folder to HBase to make it meaningful.
$ tar -zxvf hbase-2.5.10-bin.tar.gz
$ sudo mv hbase-2.5.10 /usr/local/hbase
Step 5. Now edit the hbase-env.sh configuration file, which is present under conf in the HBase directory, and add the Java path as mentioned below.
$ nano /usr/local/hbase/conf/hbase-env.sh
Now put JAVA path.
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Step 6. After this edit (.bashrc) file to update the environment
variable of Apache HBase so that it can be accessed from any
directory.
nano .bashrc
Add below path.
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
Step 7. Now configure hbase-site.xml by appending the XML code below.
$nano $HBASE_HOME/conf/hbase-site.xml
Add the following configuration inside the <configuration> tags for standalone
mode:
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase/data</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
</property>
Save and exit the file.
Step 8. Now start Apache HBase and verify it using the below
commands.
cd /usr/local/hbase
bin/start-hbase.sh
jps
Step 9. After this we can see Hbase services are running from
the JPS command, now let us start the HBase shell using the
below command.
bin/hbase shell
Access the HBase Web UI
To ensure that HBase is running properly, you can check the HBase Web UI.
Open a web browser and go to:
http://127.0.0.1:16010
Result:
Apache HBase installation has been completed successfully on Ubuntu.
Exp 8: Design and test various schema models to optimize data storage and
retrieval Using Hbase.
Aim:
To Design and test various schema models to optimize data
storage and retrieval Using Hbase.
Procedure:
The following is a step-by-step schema design experiment with HBase using sample commands:
1. Start HBase:
If HBase is not running, start it with the following
command: start-hbase.sh
hbase shell
2. Create a Table:
Create a sample HBase table named "student" with a
column family "info" using the HBase shell:
hbase:002:0>create 'student', 'info'
3. Insert Data:
- Insert sample data into the "student" table. Let's
assume you're storing student information with student
IDs as row keys:
hbase:003:0>put 'student', '1', 'info:name', 'Alice'
hbase:004:0>put 'student', '1', 'info:age', '20'
hbase:005:0>put 'student', '2', 'info:name', 'Bob'
hbase:006:0>put 'student', '2', 'info:age', '22'
4. Query Data:
- Retrieve student information using the row key:
hbase:008:0>get 'student', '1'
5. Add Another Column Family:
- You decide to add a new column family "grades" to store
student grades.
hbase:009:0>alter 'student', {NAME => 'grades', VERSIONS => 1}
6. Insert Data into the New Column Family:
- Insert sample grade data into the "grades" column family:
put 'student', '1', 'grades:math', 'A'
put 'student', '2', 'grades:math', 'B'
7. Query Data from Multiple Column Families:
- Retrieve student information and grades for a student:
get 'student', '1', {COLUMN => ['info', 'grades']}
8. Row Key Design Experiment:
- Realize that using student IDs as row keys is not efficient
for querying all students. Redesign the row key to use a
composite key that includes the year of enrollment:
put 'student', '2023_1', 'info:name', 'Carol'
put 'student', '2023_2', 'info:name', 'David'
9. Query Data with New Row Key:
- Retrieve student information with the updated row key:
hbase:010:0>get 'student', '2023_1'
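A prefix scan over the composite keys shows the benefit of this design when retrieving a whole enrollment year (an added illustration using the standard HBase shell scan filter):
scan 'student', {ROWPREFIXFILTER => '2023'}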
10. Cleanup and Stop HBase:
- When you're done, you can drop the "student" table and
stop HBase: hbase:017:0>disable 'student'
hbase:018:0>drop 'student'
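(The stop command itself is not shown in the original; HBase provides a stop script for this:)
stop-hbase.sh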
Result:
These commands demonstrate a simple schema design experiment with HBase.
You can adapt and extend these steps to test and refine your schema for more
complex use cases.
Exp 9: Practice importing and exporting data from various databases.
Aim:
To Practice importing and exporting data from various databases.
Procedure:
SQOOP is basically used to transfer data from relational databases such as MySQL and Oracle to data warehouses such as Hadoop HDFS (Hadoop Distributed File System). Thus, when data is transferred from a relational database to HDFS, we say we are importing data. Conversely, when we transfer data from HDFS to a relational database, we say we are exporting data.
Note: To import or export, the order of columns in both MySQL
and Hive should be the same.
i) Installation of Sqoop:
Step 1: Download the stable version of Apache Sqoop
Website URL https://archive.apache.org/dist/sqoop/1.4.7/
$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 2: Unzip the downloaded file using the tar command
$ tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 3: Edit the .bashrc file by using the command
$nano .bashrc
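(The lines to append are not reproduced in the original. A typical pair, assuming Sqoop was extracted under the hadoop user's home directory, is:)
export SQOOP_HOME=/home/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin
Then reload the file: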
$source .bashrc
Step 4: Set up libraries for sqoop:
Download Apache Commons Lang from the given link and extract it:
$ wget https://dlcdn.apache.org/commons/lang/binaries/commons-lang-2.6-bin.tar.gz
$ tar -xvf commons-lang-2.6-bin.tar.gz
Copy the commons-lang-2.6.jar file into sqoop-1.4.7/lib.
Next, copy the mysql-connector-j jar from the Hive lib directory and paste it into sqoop-1.4.7/lib.
Step 5: Check the installed sqoop version using the below
command
$sqoop version
ii) Importing data from MySQL to HDFS
In order to store data into HDFS, we make use of Apache Hive
which provides an SQL-like interface between the user and the
Hadoop distributed file system (HDFS) which integrates
Hadoop. We perform the following steps:
Step 1: Login into MySQL
$mysql -u root -pyour_new_password
Step 2: Create a database and table and insert data.
mysql>create database importtohadoop;
mysql>use importtohadoop;
mysql> create table album_details(album_name varchar(65),
year int, artist varchar(65));
Insert a few rows into the table with an INSERT statement, for example as shown below.
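(The rows used in the original are not reproduced; illustrative values matching the three columns could be:)
mysql> insert into album_details values ('AlbumOne', 2020, 'ArtistOne'), ('AlbumTwo', 2021, 'ArtistTwo');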
Step 3: Create a database and table in the hive where data should be
imported.
hive>create database album_hive;
hive>use album_hive;
hive>create table album_hive_table(album_name varchar(65), year int, artist varchar(65));
Step 4: Run this command on terminal :
$ sqoop import --connect
"jdbc:mysql://127.0.0.1:3306/importtohadoop?useSSL=false"
\
--username root -P \
--table album_details \
--hive-import \
--hive-table album_hive.album_hive_table \
--m 1
Step 5: Check in Hive whether the data was imported successfully:
hive> use album_hive;
hive> select * from album_hive_table;
iii) To export data into MySQL from HDFS, perform the following steps:
Step 1: Create a database and table in the hive.
hive> create table hive_table_export(album_name varchar(65), company varchar(65), year int, artist varchar(65));
Step 2: Insert data into the hive table.
hive> insert into hive_table_export values (...);
hive> select * from hive_table_export;
Step 3: Create a database and table in MySQL in which data
should be exported.
Step 4: Run the following command on Terminal.
$ sqoop export --connect
"jdbc:mysql://127.0.0.1:3306/mysql_export?useSSL=false" \
--username root --password your_new_password \
--table mysql_table_export \
--export-dir /user/hive/warehouse/hive_export.db/hive_table_export \
--input-fields-terminated-by ',' \
--m 1 \
--driver com.mysql.cj.jdbc.Driver
Step 5: Check in MySQL whether the data was exported successfully:
mysql> select * from mysql_table_export;
Result:
Importing and exporting data between MySQL and Hive has been implemented successfully.
Additional Experiments
Exp 10: Factorial MapReduce program in Python
Aim:
To implement a factorial MapReduce program in Python.
Procedure:
Step 1: Create the Input File
1. Open the Terminal.
2. Create a file named numbers.txt with the numbers you want to calculate factorials
for (e.g., 3, 5, 7, and 10):
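(The command itself is not reproduced in the original; one way to create the file, given as an assumption, is:)
printf "3\n5\n7\n10\n" > numbers.txt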
This command:
Creates a file named numbers.txt.
Adds each number on a new line.
Verify the file contents:
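(The verification command is not shown in the original; presumably:)
cat numbers.txt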
Expected Output:
Step 2: Write the Mapper Code
1. Create the Mapper file:
nano mapper.py
This opens a new file named mapper.py in the Nano text editor.
2. Paste the following code into mapper.py:
# mapper.py
import sys
import math

# Read one number per line and emit "<number>\t<factorial>"
for line in sys.stdin:
    num = int(line.strip())
    print(f"{num}\t{math.factorial(num)}")
Save and Exit:
In Nano, press CTRL + X, then Y, and Enter to save and close the file.
Step 3: Create the Reducer file:
nano reducer.py
This opens a new file named reducer.py in the Nano text editor.
2. Paste the following code into reducer.py:
# reducer.py
import sys

for line in sys.stdin:
    # Read the tab-separated number and factorial
    num, fact = line.strip().split("\t")
    # Output in a readable format
    print(f"Factorial of {num} is {fact}")
Step 4: Run the MapReduce Program
To run the program and see the output:
1. Run the following command:
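(The command is not reproduced in the original; following the same local pipeline pattern used in Exp 11, it would be:)
cat numbers.txt | python3 mapper.py | sort | python3 reducer.py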
OUTPUT:
Result:
Thus, the factorial MapReduce program has been implemented and executed successfully.
Exp 11: MapReduce that performs log analysis to count the number of
requests each IP address has made from a server log.
Aim:
To implement a MapReduce program in Python that analyzes a server
log file and counts the number of requests made by each IP address.
Procedure:
Step 1: Prepare the Log File
1. Create a sample log file named server_log.txt with entries similar to those found in a typical web server log (a few illustrative entries are shown after this list):
o Each line represents a server request, and the IP address is the first part of each entry.
2. Save and close the file.
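(Illustrative entries, given only as an example of the expected format; any log lines whose first field is an IP address will work:)
192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.168.1.10 - - [10/Oct/2023:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1024
10.0.0.5 - - [10/Oct/2023:13:57:12 +0000] "POST /login HTTP/1.1" 302 512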
Step 2: Write the Mapper Code
1. Create a Python file named mapper.py:
nano mapper.py
2. Paste the following code into mapper.py:
# mapper.py
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        ip_address = line.split()[0]
        print(f"{ip_address}\t1")
Step 3: Write the Reducer Code
1. Create another Python file named reducer.py:
nano reducer.py
2. Paste the following code into reducer.py:
# reducer.py
import sys

current_ip = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    ip_address, count = line.split("\t")  # Split IP address and count
    count = int(count)
    if current_ip == ip_address:
        current_count += count
    else:
        if current_ip:
            print(f"{current_ip}\t{current_count}")  # Output the count for the previous IP
        current_ip = ip_address
        current_count = count

if current_ip == ip_address:
    print(f"{current_ip}\t{current_count}")  # Output the count for the last IP
Step 4: Run the MapReduce Program
1. Run the following command to execute the MapReduce program:
cat server_log.txt | python3 mapper.py | sort | python3 reducer.py
OUTPUT:
Result:
Thus, the log-analysis MapReduce program has been implemented and executed successfully.