Big Data Record

The document provides a comprehensive guide on downloading, installing, and configuring Hadoop on an Ubuntu system, including steps for setting up Java, SSH, and Hadoop environment variables. It also details file management tasks within Hadoop's HDFS, such as adding, retrieving, and deleting files, as well as implementing matrix multiplication and a basic word count using MapReduce. The procedures include necessary commands and scripts for each task, ensuring successful execution and verification of the Hadoop cluster.


Exp 1: Downloading and installing Hadoop, Understanding
different Hadoop modes, Startup scripts, Configuration files

AIM:
To download and install Hadoop, and to understand the different Hadoop modes, startup scripts, and configuration files.

Procedure:
Step 1: Install Java Development Kit
The default Ubuntu repositories contain both Java 8 and Java 11. However, install Java 8, because Hive works only on this version. Use the following command to install it:

$sudo apt update && sudo apt install openjdk-8-jdk

Step 2 : Verify the Java version


Once installed, verify the installed version of Java with the following command:
$ java -version
Output:

Step 3: Install SSH


SSH (Secure Shell) installation is vital for Hadoop as it enables secure communication
between nodes in the Hadoop cluster. This ensures data integrity, confidentiality, and allows
for efficient distributed processing of data across the cluster.
$sudo apt install ssh
Step 4: Create the hadoop user
All the Hadoop components will run as the user that you create for Apache Hadoop, and the user will also be used for logging in to Hadoop's web interface.
Run the following command to create the user and set a password:

$sudo adduser hadoop


Output:

Step 5 : Switch user


Switch to the newly created hadoop user:
$ su - hadoop
Step 6: Configure SSH
Now configure password-less SSH access for the newly created hadoop user; when prompted for the key file location and passphrase, just press Enter to accept the defaults. First, generate an SSH key pair (public and private keys):

$ssh-keygen -t rsa

Step 7: Set permissions:


Next, append the generated public keys from id_rsa.pub to authorized_keys and set proper
permission:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 640 ~/.ssh/authorized_keys
Step 8: SSH to the localhost
Next, verify the password-less SSH authentication with the following command:
$ ssh localhost
You will be asked to authenticate hosts by adding RSA keys to known hosts. Type yes and
hit Enter to authenticate the localhost:

Step 9: Switch user

Switch back to the hadoop user with the following command:
$ su - hadoop
Step 10: Install Hadoop
Next, download the latest version of Hadoop using the wget command:
$ wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Once downloaded, extract the downloaded file:
$ tar -xvzf hadoop-3.3.6.tar.gz
Next, rename the extracted directory to hadoop:
$ mv hadoop-3.3.6 hadoop

Next, you will need to configure the Hadoop and Java environment variables on your system. Open the ~/.bashrc file in your favorite text editor (in nano, paste with Ctrl+Shift+V; to save, press Ctrl+X, then Y, then Enter):
$ nano ~/.bashrc

Append the below lines to the file.
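The exact lines are not reproduced in this record; a typical set of entries for this setup (assuming Hadoop was extracted to /home/hadoop/hadoop and OpenJDK 8 is installed at the path used above; adjust the paths to your system) is:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin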

Save and close the file. Then, activate the environment variables with the following
command:
$ source ~/.bashrc
Next, open the Hadoop environment variable file:
$ nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh
Search for the “export JAVA_HOME” line and configure it:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
Save and close the file when you are finished.
Step 11: Configuring Hadoop
First, you will need to create the namenode and datanode directories inside the Hadoop
user home directory. Run the following command to create both directories:
$cd hadoop/
$mkdir -p ~/hadoopdata/hdfs/{namenode,datanode}

● Next, edit the core-site.xml file and update it with your system hostname:

$nano $HADOOP_HOME/etc/hadoop/core-site.xml
Change the following name as per your system hostname:
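The XML itself is not reproduced here; a minimal sketch for a single-node setup (assuming the NameNode listens on localhost, port 9000; replace localhost with your hostname if needed) is:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>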

Save and close the file.


Then, edit the hdfs-site.xml file:
$nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml
● Change the NameNode and DataNode directory paths as shown below:
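The XML is not reproduced here; a minimal sketch (assuming a replication factor of 1 for a single node and the namenode/datanode directories created above) is:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
</property>
</configuration>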

● Then, edit the mapred-site.xml file:


$nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

● Make the following changes:
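The XML is not reproduced here; a minimal sketch that tells MapReduce to run on YARN is:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>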

● Then, edit the yarn-site.xml file:


$nano $HADOOP_HOME/etc/hadoop/yarn-site.xml
● Make the following changes:
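The XML is not reproduced here; a minimal sketch that enables the MapReduce shuffle service is:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>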
Save the file and close it.

Step 12 – Start Hadoop Cluster


Before starting the Hadoop cluster, you will need to format the NameNode as the hadoop user.
Run the following command to format the Hadoop NameNode:
$ hdfs namenode -format
Once the namenode directory is successfully formatted with the HDFS file system, you will see the message “Storage directory /home/hadoop/hadoopdata/hdfs/namenode has been successfully formatted”.
Then start the Hadoop cluster with the following command.
$ start-all.sh

You can now check the status of all Hadoop services using the jps command:

$ jps

Step 13 – Access Hadoop Namenode and Resource Manager


● First, we need to know our IP address. On Ubuntu, we need to install net-tools to run the ifconfig command. If you are installing net-tools for the first time, switch to the default (sudo-capable) user:
$ sudo apt install net-tools
● Then run the ifconfig command to find our IP address:
$ ifconfig

Here my ip address is 192.168.1.6.


● To access the NameNode, open your web browser and visit the URL http://your-server-ip:9870. You should see the following screen:
http://192.168.1.6:9870

To access the Resource Manager, open your web browser and visit the URL http://your-server-ip:8088. You should see the following screen:

http://192.168.1.6:8088
Step 14 – Verify the Hadoop Cluster
At this point, the Hadoop cluster is installed and configured. Next, we will create some
directories in the HDFS filesystem to test the Hadoop.

Let’s create some directories in the HDFS filesystem using the following command:

$ hdfs dfs -mkdir /test1


$ hdfs dfs -mkdir /logs

Next, run the following command to list the above directory:

$ hdfs dfs -ls /


You should get the following output:

Also, put some files into the Hadoop file system. For example, put the log files from the host machine into the Hadoop file system.

$ hdfs dfs -put /var/log/* /logs/


You can also verify the above files and directory in the Hadoop Namenode web interface.

Go to the web interface and click Utilities => Browse the file system. You should see the directories you created earlier in the following screen:

Step 15 – Stop Hadoop Cluster


To stop all the Hadoop services, run the following command:

$ stop-all.sh

Result:
The step-by-step installation and configuration of Hadoop on an Ubuntu Linux system has been successfully completed.
Exp 2: Hadoop Implementation of file management tasks, such as
Adding files and directories, retrieving files and Deleting files

AIM:

To implement the file management tasks, such as Adding files and directories, retrieving
files and Deleting files.

DESCRIPTION: -

HDFS is a scalable distributed filesystem designed to scale to petabytes of data while running
on top of the underlying filesystem of the operating system. HDFS keeps track of where the
data resides in a network by associating the name of its rack (or network switch) with the
dataset. This allows Hadoop to efficiently schedule tasks to those nodes that contain data, or
which are nearest to it, optimizing bandwidth utilization. Hadoop provides a set of command
line utilities that work similarly to the Linux file commands and serve as your primary
interface with HDFS.

The most common file management tasks in Hadoop, which include:

● Adding files and directories to HDFS


● Retrieving files from HDFS to local filesystem
● Deleting files from HDFS

PROCEDURE:

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE


DATA FROM HDFS

Step-1: Adding Files and Directories to HDFS
Before you can run Hadoop programs on data stored in HDFS, you'll need to put the data into HDFS first. Let's create a directory and put a file in it. HDFS has a default working directory of /user/$USER, where $USER is your login username. This directory isn't automatically created for you, though, so let's create it with the mkdir command. Log in with your hadoop user.

First, we start the Hadoop services by running this command on the terminal:

start-all.sh

For the purpose of illustration, we use chuck. You should substitute your user name in
the example commands.
hadoop fs -mkdir /chuck

hadoop fs -put example.txt /chuck


Step-2: Retrieving Files from HDFS
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays a file's contents on the terminal. To view example.txt, we can run the following command:

hadoop fs -cat /chuck/example.txt

Step-3: Deleting Files from HDFS
hadoop fs -rm /chuck/example.txt
● Command for creating a directory in HDFS is “hdfs dfs -mkdir /lendicse”.
● Adding a directory is done through the command “hdfs dfs -mkdir /sanjay_english”.

Step-4:
Copying Data from NFS to HDFS
First create set of glossaries as text file.
nano glossary

Put your glossary text in it. The command for copying from the local directory is:
“hdfs dfs -copyFromLocal /home/hadoop/glossary /sanjay_english”
View the file by using the command “hdfs dfs -cat /sanjay_english/glossary”.

● Command for listing items in Hadoop is “hdfs dfs -ls hdfs://localhost:9000/”.

● Command for deleting an empty directory is “hdfs dfs -rmdir /lendicse”.


EXPECTED OUTPUT:

Result:
Thus, the implementation of the file management tasks, such as adding files and directories, retrieving files, and deleting files in Hadoop, has been successfully completed.
EXP 3: Implement of Matrix Multiplication with Hadoop Map Reduce

AIM:
To implement the Matrix Multiplication with Hadoop Map Reduce

Procedure:

Step 1: Before writing the code, let's first create the matrices and put them in HDFS.

● Create two files, M1 and M2, and put the matrix values in them (separate columns with spaces and rows with a line break).

Save the matrices using the nano command:
$ nano M1
1 2 3
4 5 6
$ nano M2
7 8
9 10
11 12

● Put the above files to HDFS at location /user/path/to/matrices/


hdfs dfs -mkdir /user/path/to/matrices
hdfs dfs -put /path/to/M1 /user/path/to/matrices/
hdfs dfs -put /path/to/M2 /user/path/to/matrices/

Step 2: Mapper Function:


Create a mapper function to process the input chunks and
generate intermediate key-value pairs. In the context of matrix
multiplication, the mapper processes submatrices A(i,k) and B(k,j),
generating pairs (j, (i,k,B(k,j))) and (i, (j,k,A(i,k))), with i, j, and k as
indices.
#!/usr/bin/env python3
import sys

# Dimensions: M is m_r x m_c, N is n_r x n_c
m_r = 2
m_c = 3
n_r = 3
n_c = 2

i = 0
for line in sys.stdin:
    el = list(map(int, line.split()))
    if i < m_r:
        # Row of M: emit its elements once for every column of the result
        for j in range(len(el)):
            for k in range(n_c):
                print("{0}\t{1}\t{2}\t{3}".format(i, k, j, el[j]))
    else:
        # Row of N: emit its elements once for every row of the result
        for j in range(len(el)):
            for k in range(m_r):
                print("{0}\t{1}\t{2}\t{3}".format(k, j, i - m_r, el[j]))
    i = i + 1
Step 3: Reducer Function:
Develop a reducer function to aggregate the intermediate key-
value pairs into the final output. For matrix multiplication, the reducer
combines pairs (j, [(i,k,B(k,j)), ...]) and (i, [(j,k,A(i,k)), ...]) to compute
C(i,j) = sum(A(i,k) * B(k,j)).

#!/usr/bin/env python3
import sys

# Dimensions: M is m_r x m_c, N is n_r x n_c
m_r = 2
m_c = 3
n_r = 3
n_c = 2

matrix = []
for row in range(m_r):
    r = []
    for col in range(n_c):
        s = 0
        for el in range(m_c):
            mul = 1
            for num in range(2):
                line = input()  # Read one mapper record from standard input
                n = list(map(int, line.split('\t')))[-1]
                mul *= n
            s += mul
        r.append(s)
    matrix.append(r)

# Print the resulting matrix
for row in matrix:
    print('\t'.join([str(x) for x in row]))

Step 4: Running the Map-Reduce Job on Hadoop

You can run the MapReduce job and view the result with the following commands (considering you have already put the input files in HDFS).

The hadoop-streaming jar file is bundled with Hadoop itself; you can find it under $HADOOP_HOME/share/hadoop/tools/lib/ (for example, hadoop-streaming-3.3.6.jar) and use that path in the command below. First, make the mapper and reducer scripts executable:
$ chmod +x /path/to/Mapper.py
$ chmod +x /path/to/Reducer.py
$ hadoop jar /path/to/hadoop-streaming.jar \
-input /user/path/to/matrices/ \
-output /user/path/to/mat_output \
-mapper /path/to/Mapper.py \
-reducer /path/to/Reducer.py
This will take some time as Hadoop does its mapping and reducing work. After the successful completion of the above process, view the output with:
hdfs dfs -cat /user/path/to/mat_output/*
Output:

Result:
Thus, the matrix multiplication MapReduce program has been executed successfully.
EXP 4: Run a basic Word Count Map Reduce program to understand
Map Reduce Paradigm.
AIM:
To run a basic Word Count MapReduce program.

Introduction:
Hadoop Streaming is a feature that comes with Hadoop and allows users or developers to use various languages, such as Python, C++, or Ruby, for writing MapReduce programs. It supports any language that can read from standard input and write to standard output. We will implement Python with Hadoop Streaming and observe how it works.
We will implement the word count problem in python to understand
Hadoop Streaming. We will be creating mapper.py and reducer.py to
perform map and reduce tasks.

Procedure:
Step 1: Create Data File:
Create a file named "word_count_data.txt" and populate it with text data that you wish to analyse.
Log in with your hadoop user.

nano word_count_data.txt
Step 2: Mapper Logic - mapper.py:
Create a file named "mapper.py" to implement the logic for
the mapper. The mapper will read input data from STDIN, split lines
into words, and output each word with its count.

nano mapper.py
# Copy and paste the mapper.py code

#!/usr/bin/env python3

# import sys because we need to read and write data to STDIN and STDOUT
import sys

# read each line from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()

    # loop over the words array and print each word
    # with a count of 1 to STDOUT
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        print('%s\t%s' % (word, 1))

Here, in the above program, #! is known as the shebang and is used to tell the system which interpreter should run the script.
Step 3: Reducer Logic - reducer.py:
Create a file named "reducer.py" to implement the logic for the reducer.
The reducer will aggregate the occurrences of each word and
generate the final output.

nano reducer.py
# Copy and paste the reducer.py code

#!/usr/bin/env python3

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently ignore this line
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write the result for the previous word to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# output the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
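Before running on the cluster, you can sanity-check the two scripts locally with a plain shell pipeline (assuming word_count_data.txt, mapper.py, and reducer.py are in the current directory):

cat word_count_data.txt | python3 mapper.py | sort | python3 reducer.py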

Step 4: Prepare Hadoop Environment:


Start the Hadoop daemons and create a directory in HDFS to store your
data.

start-all.sh

hdfs dfs -mkdir /word_count_in_python


hdfs dfs -copyFromLocal /path/to/word_count_data.txt /word_count_in_python

Step 5: Make Python Files Executable:


Give executable permissions to your mapper.py and reducer.py files.

chmod 777 mapper.py reducer.py


Step 6: Run Word Count using Hadoop Streaming:
Download the latest hadoop-streaming jar file and place it in a location you can easily access (it is also bundled with Hadoop under $HADOOP_HOME/share/hadoop/tools/lib/).

Then run the Word Count program using Hadoop Streaming.

hadoop jar /path/to/hadoop-streaming-3.3.6.jar \


-input /word_count_in_python/word_count_data.txt \
-output /word_count_in_python/new_output \
-mapper /path/to/mapper.py \
-reducer /path/to/reducer.py

Step 7: Check Output:


Check the output of the Word Count program in the specified
HDFS output directory.
hdfs dfs -cat /word_count_in_python/new_output/part-00000

output:

Result:
Thus, the basic Word Count MapReduce program has been executed successfully.
Exp 5: Installation of Hive on Ubuntu

Aim:
To download and install Hive, and to understand the startup scripts and configuration files.
Procedure:
Step 1: Download and extract Hive

Download Apache Hive and extract it using tar, with the commands given below:
$ wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
$ tar -xvf apache-hive-3.1.3-bin.tar.gz

Step 2: Place different configuration properties in Apache Hive

In this step, we are going to do two things

1. Placing Hive Home path in bashrc file

$nano .bashrc

And append the below lines to it:
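The lines themselves are not reproduced in this record; a typical addition (assuming Hive was extracted to /home/hadoop/apache-hive-3.1.3-bin; adjust the path to your system) is:

export HIVE_HOME=/home/hadoop/apache-hive-3.1.3-bin
export PATH=$PATH:$HIVE_HOME/bin

Save the file and reload it with source ~/.bashrc so that the hive command is available on the PATH.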

2. Exporting the Hadoop path in hive-env.sh (to communicate with the Hadoop ecosystem, we define the Hadoop home path in the Hive configuration). Open hive-env.sh as shown below:

$ cd apache-hive-3.1.3-bin/conf

$ cp hive-env.sh.template hive-env.sh

$ nano hive-env.sh

Append the below lines to it:

export HADOOP_HOME=/home/hadoop/hadoop

export HIVE_CONF_DIR=/home/hadoop/apache-hive-3.1.3-bin/conf

export HADOOP_TMP_DIR=/tmp/hadoop-${USER}

Step 3: Install mysql

To download the MySQL Connector JAR file using wget, you can use
the following command:

wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-9.0.0.tar.gz

Once downloaded, extract the .tar.gz file:

tar -xzf mysql-connector-j-9.0.0.tar.gz

Then move the .jar file to the Hive lib directory:


mv mysql-connector-j-9.0.0/mysql-connector-j-9.0.0.jar /home/hadoop/apache-hive-3.1.3-bin/lib

1. Install mysql in Ubuntu by running this command:


$sudo apt update
$sudo apt install mysql-server

2. Alter the username and password for MySQL by running the commands below:
$ sudo mysql

This opens the MySQL command-line interface; run the below SQL queries to change the root user's authentication plugin and set a password:

mysql> SELECT user, host, plugin FROM mysql.user WHERE user = 'root';

mysql> ALTER USER 'root'@'localhost' IDENTIFIED WITH 'mysql_native_password' BY 'your_new_password';

mysql> FLUSH PRIVILEGES;

Step 4: Configure hive-site.xml


Configure hive-site.xml by appending the XML code below, changing the username and password according to your MySQL setup.

$ cd apache-hive-3.1.3-bin/conf

$cp hive-default.xml.template hive-site.xml


$nano hive-site.xml

Append these lines into it.

Replace root with your MySQL username.
Replace your_new_password with your MySQL password.
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>your_new_password</value>
</property>

</configuration>

Step 5: Move the File Using sudo

Run the following command to move hive-site.xml from your Downloads folder to the
Hive configuration directory:
sudo mv /home/hasif/Downloads/hive-site.xml /home/hadoop/apache-hive-3.1.3-bin/conf/

Verify the Move

After moving the file, you can check if it is in the correct directory by running:

ls /home/hadoop/apache-hive-3.1.3-bin/conf

You should see hive-site.xml listed there.

Step 6: Create directories:

● Create Local Directories: You can create the required directories on your local filesystem with the following commands:

$mkdir -p /tmp/hadoop
$mkdir -p /tmp/hive
$mkdir -p /tmp/hive/local

This creates the necessary directories for Hadoop and Hive.

● Create Directories in HDFS: To create the required directories in HDFS, use these commands:

$hdfs dfs -mkdir -p /user/hive/warehouse


$hdfs dfs -chmod 755 /user/hive/warehouse

Step 7: Initialize the Hive Metastore Schema:


Run the following command to initialize the Hive metastore schema:
$ cd /home/hadoop/apache-hive-3.1.3-bin/bin
$./schematool -initSchema -dbType mysql
Step 8: Start Hive:
You can test Hive by running the Hive shell. You should be able to run Hive queries, and the metadata will be stored in your MySQL database.
$ ./hive

Result:

Apache Hive installation has been completed successfully on the Ubuntu system.


Exp 6: Design and test various schema models to optimize data storage and
retrieval Using Hive.

Aim:
To design and test various schema models to optimize data storage and retrieval using Hive.

Procedure:
Step 1: Start Hive
Open a terminal and start Hive by running:
$hive
Step 2: Create a Database
Create a new database in Hive:
hive>CREATE DATABASE financials;

Step 3: Use the Database:


Switch to the newly created database:
hive> use financials;
Step 4: Create a Table:
Create a simple table in your database:
hive>CREATE TABLE finance_table ( id INT, name STRING );
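Since the aim is to test schema models that optimize storage and retrieval, you can also sketch a partitioned variant of the same table (a hypothetical example that is not part of the original record; partitioning by year lets Hive scan only the relevant partitions for year-filtered queries):

hive> CREATE TABLE finance_partitioned (id INT, name STRING) PARTITIONED BY (year INT) STORED AS ORC;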

Step 5: Load Sample Data:


You can insert sample data into the table:
hive> INSERT INTO finance_table VALUES (1, 'Alice'),
(2, 'Bob'),
(3,
'Charlie');

Step 6: Query Your Data


Use SQL-like queries to retrieve data from your table, for example by creating a view:
hive> CREATE VIEW myview AS SELECT name, id FROM finance_table;
Step 7: View the data:
To see the data in the view, you need to query the view:
hive> SELECT * FROM myview;

Step 8: Describe a Table:


You can describe the structure of a table using the DESCRIBE
command:
hive>DESCRIBE finance_table;

Step 9: Alter a Table:


You can alter the table structure by adding a new column:
hive> ALTER TABLE finance_table ADD COLUMNS (age INT);
Step 10: Quit Hive:
To exit the Hive CLI, simply type:
hive>quit;

Result:
These commands demonstrate a simple schema design experiment with
Hive. You can adapt and extend these steps to test and refine your schema
for more complex use cases.
Exp 7: Installation of Hbase on Ubuntu

Aim:
To download and install HBase, and to understand the different HBase modes, startup scripts, and configuration files.

Procedure:

Step 1. Please verify if Hadoop is installed.

Step 2. Please verify if Java is installed.

Step 3. Please download HBase 2.5.10 from the below link.

$ wget https://archive.apache.org/dist/hbase/2.5.10/hbase-2.5.10-bin.tar.gz

Step 4. Let us extract the tar file using the below command and
rename the folder to HBase to make it meaningful.
$ tar -zxvf hbase-2.5.10-bin.tar.gz
$ sudo mv hbase-2.5.10 /usr/local/hbase
Step 5. Now edit the hbase-env.sh configuration file, which is present under conf in the HBase directory, and add the JAVA path as mentioned below.
$ nano /usr/local/hbase/conf/hbase-env.sh

Now put JAVA path.

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Step 6. After this edit (.bashrc) file to update the environment
variable of Apache HBase so that it can be accessed from any
directory.
nano .bashrc

Add below path.

export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
Step 7. Now configure hbase-site.xml by appending this XML code.

$nano $HBASE_HOME/conf/hbase-site.xml

Add the following configuration inside the <configuration> tags:
<property>
<name>hbase.rootdir</name>
<value>file:///usr/local/hbase/data</value>
</property>

<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>

<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/hbase/zookeeper</value>
</property>

Save and exit the file.


Step 8. Now start Apache HBase and verify it using the below
commands.
cd /usr/local/hbase

bin/start-hbase.sh

jps

Step 9. After this we can see Hbase services are running from
the JPS command, now let us start the HBase shell using the
below command.
bin/hbase shell

Access the HBase Web UI


To ensure that HBase is running properly, you can check the HBase Web UI.
Open a web browser and go to:
http://127.0.0.1:16010
Result:

Apache HBase installation has been completed successfully on the Ubuntu system.


Exp 8: Design and test various schema models to optimize data storage and
retrieval Using Hbase.
Aim:
To design and test various schema models to optimize data storage and retrieval using HBase.

Procedure:
Here is a step-by-step schema design experiment with HBase using sample commands:

1. Start HBase:
If HBase is not running, start it and open the HBase shell with the following commands:
start-hbase.sh
hbase shell

2. Create a Table:
Create a sample HBase table named "student" with a
column family "info" using the HBase shell:
hbase:002:0>create 'student', 'info'
3. Insert Data:
- Insert sample data into the "student" table. Let's
assume you're storing student information with student
IDs as row keys:
hbase:003:0>put 'student', '1', 'info:name', 'Alice'

hbase:004:0>put 'student', '1', 'info:age', '20'

hbase:005:0>put 'student', '2', 'info:name', 'Bob'

hbase:006:0>put 'student', '2', 'info:age', '22'

4. Query Data:
- Retrieve student information using the row key:
hbase:008:0> get 'student', '1'
5. Add Another Column Family:
- You decide to add a new column family "grades" to store
student grades.
hbase:009:0> alter 'student', {NAME => 'grades', VERSIONS => 1}

6. Insert Data into the New Column Family:


- Insert sample grade data into the "grades" column family:

put 'student', '1', 'grades:math', 'A'

put 'student', '2', 'grades:math', 'B'

7. Query Data from Multiple Column Families:


- Retrieve student information and grades for a student:
get 'student', '1', {COLUMN => ['info', 'grades']}

8. Row Key Design Experiment:


- Realize that using student IDs as row keys is not efficient
for querying all students. Redesign the row key to use a
composite key that includes the year of enrollment:

put 'student', '2023_1', 'info:name', 'Carol'

put 'student', '2023_2', 'info:name', 'David'

9. Query Data with New Row Key:


- Retrieve student information with the updated row key:

hbase:010:0> get 'student', '2023_1'


10. Cleanup and Stop HBase:
- When you're done, you can drop the "student" table and stop HBase:
hbase:017:0> disable 'student'
hbase:018:0> drop 'student'
stop-hbase.sh

Result:
These commands demonstrate a simple schema design experiment with HBase.
You can adapt and extend these steps to test and refine your schema for more
complex use cases.
Exp 9: Practice importing and exporting data from various databases.
Aim:
To practice importing and exporting data from various databases.

Procedure:
SQOOP is basically used to transfer data from relational
databases such as MySQL, Oracle to data warehouses such as
Hadoop HDFS (Hadoop File System). Thus, when data is
transferred from a relational database to HDFS, we say we are
importing data. Otherwise, when we transfer data from HDFS
to relational databases, we say we are exporting data.
Note: To import or export, the order of columns in both MySQL
and Hive should be the same.

i) Installation of Sqoop:
Step 1: Download the stable version of Apache Sqoop
Website URL https://archive.apache.org/dist/sqoop/1.4.7/

$ wget https://archive.apache.org/dist/sqoop/1.4.7/sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz
Step 2: Unzip the downloaded file using the tar command

$ tar -xvzf sqoop-1.4.7.bin__hadoop-2.6.0.tar.gz

Step 3: Edit the .bashrc file by using the command

$nano .bashrc
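The exact lines are not shown in this record; a typical addition (assuming Sqoop was extracted to /home/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0; adjust the path to your system) is:

export SQOOP_HOME=/home/hadoop/sqoop-1.4.7.bin__hadoop-2.6.0
export PATH=$PATH:$SQOOP_HOME/bin

Save the file, then reload it with the source command below.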

$source .bashrc

Step 4: Set up libraries for sqoop:


Download commons lang from the given link and extract it

$ wget https://dlcdn.apache.org/commons/lang/binaries/commons-lang-2.6-bin.tar.gz

$tar -xvf commons-lang-2.6-bin.tar.gz

Copy the commons-lang-2.6.jar file into sqoop-1.4.7/lib.

Next, copy the mysql-connector jar from the Hive lib directory and paste it into sqoop-1.4.7/lib.

Step 5: Check the installed sqoop version using the below


command

$sqoop version

ii) Importing data from MySQL to HDFS


In order to store data into HDFS, we make use of Apache Hive, which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS). We perform the following steps:

Step 1: Log in to MySQL

$mysql -u root -pyour_new_password

Step 2: Create a database and table and insert data.


mysql>create database importtohadoop;
mysql>use importtohadoop;
mysql> create table album_details(album_name varchar(65),
year int, artist varchar(65));
Insert the values using this command:
mysql> insert into album_details values
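The specific rows are not reproduced in this record; a purely hypothetical illustration of the statement is:

mysql> insert into album_details values ('AlbumOne', 2001, 'ArtistOne'), ('AlbumTwo', 2002, 'ArtistTwo');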
Step 3: Create a database and table in the hive where data should be
imported.
hive>create database album_hive;

hive>use album_hive;
hive>create table album_details_hive(album_name
varchar(65), year int, artist varchar(65));
Step 4: Run this command on terminal :
$ sqoop import --connect
"jdbc:mysql://127.0.0.1:3306/importtohadoop?useSSL=false"
\
--username root -P \
--table album_details \
--hive-import \
--hive-table album_hive.album_hive_table \
--m 1
Step 5: Check in Hive whether the data has been imported successfully or not.
hive> use album_hive;
hive> select * from album_hive_table;

iii) To export data into MySQL from HDFS, perform the following steps:
Step 1: Create a database and table in Hive.
hive> create table hive_table_export(album_name varchar(65), company varchar(65), year int, artist varchar(65));
Step 2: Insert data into the Hive table and view it.
hive> insert into hive_table_export values
hive> select * from hive_table_export;

Step 3: Create a database and table in MySQL into which the data should be exported.
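The exact statements are not shown in this record; a hedged sketch, using the database and table names referenced in the sqoop command below and columns mirroring the Hive table, is:

mysql> create database mysql_export;
mysql> use mysql_export;
mysql> create table mysql_table_export(album_name varchar(65), company varchar(65), year int, artist varchar(65));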
Step 4: Run the following command on Terminal.
$ sqoop export --connect
"jdbc:mysql://127.0.0.1:3306/mysql_export?useSSL=false" \
--username root --password your_new_password \
--table mysql_table_export \
--export-dir /user/hive/warehouse/hive_export.db/hive_table_export \
--input-fields-terminated-by ',' \
--m 1 \
--driver com.mysql.cj.jdbc.Driver
Step 5: Check in MySQL whether the data has been exported successfully or not.
mysql> select * from mysql_table_export;

Result:
Importing and exporting data between MySQL and Hive implemented
successfully.
Additional Experiments

Exp 10: Factorial MapReduce program in Python

Aim:
To implement a factorial MapReduce program in Python.

Procedure:
Step 1: Create the Input File

1. Open the Terminal.


2. Create a file named numbers.txt with the numbers you want to calculate factorials for (e.g., 3, 5, 7, and 10):
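The exact command is not reproduced in this record; one simple possibility is:

echo -e "3\n5\n7\n10" > numbers.txt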

This command:

● Creates a file named numbers.txt.
● Adds each number on a new line.

Verify the file contents (for example, with cat numbers.txt):

Expected Output:
Step 2: Write the Mapper Code

1. Create the Mapper file:

nano mapper.py

This opens a new file named mapper.py in the Nano text editor.

2. Paste the following code into mapper.py:

# mapper.py
import sys
import math

for line in sys.stdin:
    num = int(line.strip())                  # parse the number from the input line
    print(f"{num}\t{math.factorial(num)}")   # emit number<TAB>factorial

Save and Exit:

 In Nano, press CTRL + X, then Y, and Enter to save and close the file.

Step 3: Create the Reducer file:


nano reducer.py
This opens a new file named reducer.py in the Nano text editor.

3. Paste the following code into reducer.py:

# reducer.py
import sys

for line in sys.stdin:
    num, fact = line.strip().split("\t")    # Read the tab-separated number and factorial
    print(f"Factorial of {num} is {fact}")  # Output in a readable format

Step 4: Run the MapReduce Program

To run the program and see the output:

1. Run the following command:
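The command itself is not reproduced in this record; following the same local-pipeline pattern used in the next experiment (assuming numbers.txt, mapper.py, and reducer.py are in the current directory), it would be:

cat numbers.txt | python3 mapper.py | sort | python3 reducer.py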


OUTPUT:

Result:
Thus, the factorial MapReduce program was implemented and executed successfully.
Exp 11: MapReduce that performs log analysis to count the number of
requests each IP address has made from a server log.

Aim:

To implement a MapReduce program in Python that analyzes a server log file and counts the number of requests made by each IP address.

Procedure:

Step 1: Prepare the Log File

1. Create a sample log file named server_log.txt with entries similar to those found in a typical web server log:
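The sample entries are not reproduced in this record; a hypothetical illustration in common web-server log style is:

192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.168.1.11 - - [10/Oct/2023:13:55:40 +0000] "GET /about.html HTTP/1.1" 200 1045
192.168.1.10 - - [10/Oct/2023:13:56:02 +0000] "POST /login HTTP/1.1" 302 512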

o Each line represents a server request, and the IP address is the first part of each entry.
2. Save and close the file.

Step 2: Write the Mapper Code

1. Create a Python file named mapper.py:

nano mapper.py
2. Paste the following code into mapper.py:

# mapper.py
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        ip_address = line.split()[0]   # the IP address is the first field of each entry
        print(f"{ip_address}\t1")      # emit ip<TAB>1 for every request

Step 3: Write the Reducer Code

1. Create another Python file named reducer.py:

nano reducer.py

2. Paste the following code into reducer.py:

# reducer.py
import sys

current_ip = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    ip_address, count = line.split("\t")  # Split IP address and count
    count = int(count)

    if current_ip == ip_address:
        current_count += count
    else:
        if current_ip:
            print(f"{current_ip}\t{current_count}")  # Output the count for the previous IP
        current_ip = ip_address
        current_count = count

if current_ip == ip_address:
    print(f"{current_ip}\t{current_count}")  # Output the count for the last IP

Step 4: Run the MapReduce Program

1. Run the following command to execute the MapReduce program:

cat server_log.txt | python3 mapper.py | sort | python3 reducer.py
OUTPUT:

Result:
Thus, the log-analysis MapReduce program was implemented and executed successfully.
