MEERUT INSTITUTE OF ENGINEERING AND TECHNOLOGY, MEERUT

LAB MANUAL
(Big Data and Analytics Lab)

DR. A.P.J. ABDUL KALAM TECHNICAL UNIVERSITY


LUCKNOW

B.TECH. THIRD YEAR


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING - DS



Vision of Institute

To be an outstanding institution in the country imparting technical education, providing need-based, value-based
and career-based programs and producing self-reliant, self-sufficient technocrats capable of meeting new
challenges.

Mission of Institute

The mission of the institute is to educate young aspirants in various technical fields to fulfill the global requirement for
human resources by providing sustainable quality education, training and an invigorating environment, besides
molding them into skilled, competent and socially responsible citizens who will lead the building of a powerful
nation.



Course Objectives:

This course is designed to:

1. Get familiar with Hadoop distributions, configuring Hadoop and performing file management tasks
2. Experiment with MapReduce in Hadoop frameworks
3. Implement MapReduce programs in a variety of applications
4. Explore MapReduce support for debugging
5. Understand different approaches for building Hadoop MapReduce programs for real-time applications

List of Experiments:

1. Downloading and installing Hadoop, understanding different Hadoop modes, startup scripts and
configuration files.

2. Implementation of file management in Hadoop

➢ Adding files and directories
➢ Retrieving files
➢ Deleting files

3. Matrix multiplication with Hadoop MapReduce

4. To run a Word Count MapReduce program to understand the MapReduce paradigm

5. Implementation of K-means clustering using MapReduce

6. Installation of HIVE along with a practical example

7. Installing HBase and Thrift along with a practical example

8. Write PIG commands. Write a Pig Latin script to sort, group and filter your data

9. Run the PIG Latin script to find word count

10. Run the PIG Latin script to find the maximum temperature for each year.



Text Books:

1. Tom White, “Hadoop: The Definitive Guide”, Fourth Edition, O’Reilly Media, 2015.

Reference Books:

1. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons, 2007.
2. Pete Warden, Big Data Glossary, O’Reilly, 2011.
3. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer, 2007.
4. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big
Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Publishing, 2012.
5. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge
University Press, 2012.

Course Outcomes:

Upon completion of the course, the students should be able to:

1. Configure Hadoop and perform file management tasks (L2)
2. Apply MapReduce programs to real-time problems like word count, weather datasets and
sales of a company (L3)
3. Critically analyze huge data sets using the Hadoop Distributed File System and MapReduce (L5)
4. Apply different data processing tools like Pig, Hive and Spark (L6)


Lab: 1
Downloading and installing Hadoop, understanding different Hadoop
modes, startup scripts and configuration files.

AIM: To Install Apache Hadoop.


Introduction to Big Data Architecture:
➢ A big data architecture is designed to handle the ingestion, processing, and analysis of data that
is too large or complex for traditional database systems.
Components of Big Data Architecture
➢ Data sources. All big data solutions start with one or more data sources.
➢ Data storage. Data for batch processing operations is typically stored in a distributed file store
that can hold high volumes of large files in various formats
➢ Batch processing. Because the data sets are so large, often a big data solution must process data
files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
Usually these jobs involve reading source files, processing them, and writing the output to new files.
➢ Real-time message ingestion. If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing.
➢ Stream processing. After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis
➢ Analytical data store. Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools
➢ Analysis and reporting. The goal of most big data solutions is to provide insights into the data
through analysis and reporting.
➢ Orchestration. Most big data solutions consist of repeated data processing operations,
encapsulated in work flows that transform source data, move data between multiple sources and
sinks, load the processed data into an analytical data store, or push the results straight to a report
or dashboard.

❖ Introduction of Hadoop Architecture:


➢ Apache Hadoop offers a scalable, flexible and reliable distributed computing big data framework
for a cluster of systems with storage capacity and local computing power by leveraging commodity
hardware.
➢ Hadoop follows a master-slave architecture for the transformation and analysis of large datasets
using the Hadoop MapReduce paradigm. The following Hadoop components play a vital role in
the Hadoop architecture:
➢ Hadoop Common – the libraries and utilities used by other Hadoop modules
➢ Hadoop Distributed File System (HDFS) – the Java-based scalable system that stores data
across multiple machines without prior organization.
➢ YARN – (Yet Another Resource Negotiator) provides resource management for the processes
running on Hadoop.
➢ MapReduce – a parallel processing software framework. It is comprised of two steps. In the Map
step, a master node takes the input and partitions it into smaller sub-problems, which it then
distributes to worker nodes. After the map step has taken place, the master node takes the
answers to all of the sub-problems and combines them to produce the output.
EXERCISE:
1) What do you know about the term “Big Data” and what are the applications of Big Data?
2) What is Internet of things?
3) What are the challenges of using Hadoop?
4) Tell us how big data and Hadoop are related to each other.
5) How would you transform unstructured data into structured data?

Hadoop can run in 3 different modes.


1. Standalone (Local) Mode
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process.
Instead of HDFS, this mode utilizes the local file system. This mode is useful for debugging, and there
is no need to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters or slaves. Standalone
mode is usually the fastest mode in Hadoop.
In standalone mode none of the daemons run, i.e. NameNode, DataNode, Secondary NameNode,
JobTracker and TaskTracker. We use the JobTracker and TaskTracker for processing purposes in
Hadoop 1; in Hadoop 2 we use the Resource Manager and Node Manager instead. Standalone mode also
means that we are installing Hadoop on only a single system. By default, Hadoop is made to run in
this standalone mode, also called local mode. We mainly use Hadoop in this mode for the purpose
of learning, testing, and debugging.
Hadoop works fastest in this mode among all three modes. HDFS (Hadoop Distributed File System),
one of the major components of Hadoop used for storage, is not utilized in this mode. You can think
of HDFS as similar to the file systems available for Windows, i.e. NTFS (New Technology File System)
and FAT32 (File Allocation Table with 32-bit entries). When Hadoop works in this mode there is no
need to configure the files hdfs-site.xml, mapred-site.xml and core-site.xml for the Hadoop environment.
In this mode, all of your processes run in a single JVM (Java Virtual Machine), and
this mode can only be used for small development purposes.

2. Pseudo-Distributed Mode (Single node)


Hadoop can also run on a single node in pseudo-distributed mode. In this mode, each daemon
runs in a separate Java process, and custom configuration is required (core-site.xml,
hdfs-site.xml, mapred-site.xml). Here HDFS is utilized for input and output. This mode of deployment
is useful for testing and debugging purposes.
In pseudo-distributed mode we also use only a single node, but the main thing is that the cluster
is simulated, which means that all the processes inside the cluster run independently of each
other. All the daemons, that is NameNode, DataNode, Secondary NameNode, Resource Manager,
Node Manager, etc., run as separate processes on separate JVMs (Java Virtual Machines),
or we can say run as different Java processes; that is why it is called pseudo-distributed.
One thing we should remember is that as we are using only a single-node setup, all the master
and slave processes are handled by the single system. The NameNode and Resource Manager are used
as masters, and the DataNode and Node Manager are used as slaves. A Secondary NameNode is also
used as a master; its purpose is just to keep an hourly backup of the NameNode. In this mode,
Hadoop is used both for development and for debugging purposes.
HDFS (Hadoop Distributed File System) is utilized for managing the input and output
processes.


We need to change the configuration files mapred-site.xml, core-site.xml and hdfs-site.xml to set up the environment.
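For illustration, a minimal pseudo-distributed configuration usually sets the default file system, the HDFS replication factor and the MapReduce framework. The snippets below are only a sketch; property values such as the NameNode port 9000 are common conventions and are not taken from this manual.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml (when running MapReduce on YARN in Hadoop 2.x):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>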

3. Fully Distributed Mode


This is the production mode of Hadoop. In this mode typically one machine in the cluster is
designated as the NameNode and another as the Resource Manager exclusively; these are the masters. All
other nodes act as DataNode and Node Manager; these are the slaves. Configuration parameters
and the environment need to be specified for the Hadoop daemons.
This mode offers fully distributed computing capability, reliability, fault tolerance and scalability.
This is the most important mode, in which multiple nodes are used: a few of them run the master
daemons, namely the NameNode and Resource Manager, and the rest of them run the slave
daemons, namely the DataNode and Node Manager. Here Hadoop runs on a cluster of
machines or nodes, and the data that is used is distributed across the different nodes. This is
the actual production mode of Hadoop; let's understand this mode in a better way in
physical terminology.
In the other modes you download Hadoop as a tar or zip file, install it on your
system and run all the processes on a single system; here, in fully distributed mode, we
extract this tar or zip file on each of the nodes in the Hadoop cluster and then use a
particular node for a particular process. Once you distribute the processes among the
nodes, you define which nodes work as masters and which of them work as slaves.


Startup scripts
The $HADOOP_INSTALL/hadoop/bin directory contains some scripts used to launch Hadoop
DFS and Hadoop Map/Reduce daemons. These are:
start-dfs.sh - Starts the Hadoop DFS daemons, the namenode and datanodes. Use this
before start-mapred.sh
stop-dfs.sh - Stops the Hadoop DFS daemons.
start-mapred.sh - Starts the Hadoop Map/Reduce daemons, the jobtracker and tasktrackers.
stop-mapred.sh - Stops the Hadoop Map/Reduce daemons.
start-all.sh - Starts all Hadoop daemons, the namenode, datanodes, the jobtracker and
tasktrackers. Deprecated; use start-dfs.sh then start-mapred.sh
stop-all.sh - Stops all Hadoop daemons. Deprecated; use stop-mapred.sh then stop-dfs.sh

Configuration files

The $HADOOP_INSTALL/hadoop/conf directory contains some configuration files for Hadoop.


These are:
hadoop-env.sh - This file contains some environment variable settings used by Hadoop. You can
use these to affect some aspects of Hadoop daemon behavior, such as where log files are stored, the
maximum amount of heap used etc. The only variable you should need to change in this file is
JAVA_HOME, which specifies the path to the Java 1.5.x installation used by Hadoop.
slaves - This file lists the hosts, one per line, where the Hadoop slave daemons (datanodes and
tasktrackers) will run. By default this contains the single entry localhost.
hadoop-default.xml - This file contains generic default settings for Hadoop daemons and
Map/Reduce jobs. Do not modify this file.
mapred-default.xml - This file contains site specific settings for the Hadoop Map/Reduce daemons
and jobs. The file is empty by default. Putting configuration properties in this file will override
Map/Reduce settings in the hadoop-default.xml file. Use this file to tailor the behavior of
Map/Reduce on your site.
hadoop-site.xml - This file contains site specific settings for all Hadoop daemons and
Map/Reduce jobs. This file is empty by default. Settings in this file override those in hadoop-
default.xml and mapred-default.xml. This file should contain settings that must be respected by all
servers and clients in a Hadoop installation, for instance, the location of the namenode and the
jobtracker.

Hadoop software can be installed in three modes of operation.

Hadoop is a Java-based programming framework that supports the processing and storage of
extremely large datasets on a cluster of inexpensive machines. It was the first major open source
project in the big data playing field and is sponsored by the Apache Software Foundation.

Hadoop-2.7.3 is comprised of four main layers:

➢ Hadoop Common is the collection of utilities and libraries that support other Hadoop
modules.
➢ HDFS, which stands for Hadoop Distributed File System, is responsible for persisting
data to disk.
➢ YARN, short for Yet Another Resource Negotiator, is the "operating system" for HDFS.
➢ MapReduce is the original processing model for Hadoop clusters. It distributes work
within the cluster or map, then organizes and reduces the results from the nodes into a
response to a query. Many other processing models are available for the 2.x version of
Hadoop.
Hadoop clusters are relatively complex to set up, so the project includes a stand-alone mode
which is suitable for learning about Hadoop, performing simple operations, and debugging.

Procedure:

we'll install Hadoop in stand-alone mode and run one of the example MapReduce
programs it includes to verify the installation.

Prerequisites:

Step1: Installing Java 8 version.


Openjdk version "1.8.0_91"
OpenJDK Runtime Environment (build 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.91-b14, mixed mode)
This output verifies that OpenJDK has been successfully installed.
Note: Set the JAVA_HOME environment variable to point to the Java installation path.

Step2: Installing Hadoop


With Java in place, we'll visit the Apache Hadoop Releases page to find the most
recent stable release. Follow the binary for the current release:


Download Hadoop from www.hadoop.apache.org


Procedure to Run Hadoop

1. Install Apache Hadoop 2.2.0 in Microsoft Windows OS

If Apache Hadoop 2.2.0 is not already installed then follow the post Build, Install,
Configure and Run Apache Hadoop 2.2.0 in Microsoft Windows OS.

2. Start HDFS (Namenode and Datanode) and YARN (Resource Manager and Node
Manager)

Run following commands.


Command Prompt
C:\Users\abhijitg>cd c:\hadoop
c:\hadoop>sbin\start-dfs
c:\hadoop>sbin\start-yarn
starting yarn daemons

Namenode, Datanode, Resource Manager and Node Manager will be started in a few
minutes and will be ready to execute Hadoop MapReduce jobs in the single-node (pseudo-distributed
mode) cluster.

Resource Manager & Node Manager:


Run wordcount MapReduce job

Now we'll run wordcount MapReduce job available


in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-
2.2.0.jar

Create a text file with some content. We'll pass this file as input to
the wordcount MapReduce job for counting words.
C:\file1.txt
Install Hadoop

Run Hadoop Wordcount Mapreduce Example

Create a directory (say 'input') in HDFS to keep all the text files (say 'file1.txt') to be used for
counting words.
C:\Users\abhijitg>cd c:\hadoop
C:\hadoop>bin\hdfs dfs -mkdir input

Copy the text file(say 'file1.txt') from local disk to the newly created 'input' directory in HDFS.

C:\hadoop>bin\hdfs dfs -copyFromLocal c:/file1.txt input


Check content of the copied file.

C:\hadoop>hdfs dfs -ls input


Found 1 items
-rw-r--r-- 1 ABHIJITG supergroup 55 2014-02-03 13:19 input/file1.txt

C:\hadoop>bin\hdfs dfs -cat input/file1.txt


Install Hadoop
Run Hadoop Wordcount Mapreduce Example

Run the wordcount MapReduce job provided


in %HADOOP_HOME%\share\hadoop\mapreduce\hadoop-mapreduce-examples-2.2.0.jar

C:\hadoop>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount input output
14/02/03 13:22:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:03 INFO input.FileInputFormat: Total input paths to process : 1
14/02/03 13:22:03 INFO mapreduce.JobSubmitter: number of splits:1
:
:
14/02/03 13:22:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1391412385921_0002
14/02/03 13:22:04 INFO impl.YarnClientImpl: Submitted application application_1391412385921_0002 to ResourceManager at /0.0.0.0:8032
14/02/03 13:22:04 INFO mapreduce.Job: The url to track the job: http://ABHIJITG:8088/proxy/application_1391412385921_0002/
14/02/03 13:22:04 INFO mapreduce.Job: Running job: job_1391412385921_0002
14/02/03 13:22:14 INFO mapreduce.Job: Job job_1391412385921_0002 running in uber mode : false
14/02/03 13:22:14 INFO mapreduce.Job: map 0% reduce 0%
14/02/03 13:22:22 INFO mapreduce.Job: map 100% reduce 0%
14/02/03 13:22:30 INFO mapreduce.Job: map 100% reduce 100%
14/02/03 13:22:30 INFO mapreduce.Job: Job job_1391412385921_0002 completed successfully
14/02/03 13:22:31 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=89
        FILE: Number of bytes written=160142
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=171
        HDFS: Number of bytes written=59
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Data-local map tasks=1
        Total time spent by all maps in occupied slots (ms)=5657
        Total time spent by all reduces in occupied slots (ms)=6128
    Map-Reduce Framework
        Map input records=2
        Map output records=7
        Map output bytes=82
        Map output materialized bytes=89
        Input split bytes=116
        Combine input records=7
        Combine output records=6
        Reduce input groups=6
        Reduce shuffle bytes=89
        Reduce input records=6
        Reduce output records=6
        Spilled Records=12
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=145
        CPU time spent (ms)=1418
        Physical memory (bytes) snapshot=368246784
        Virtual memory (bytes) snapshot=513716224
        Total committed heap usage (bytes)=307757056
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=55
    File Output Format Counters
        Bytes Written=59
http://abhijitg:8088/cluster

Result: We've installed Hadoop in stand-alone mode and verified it by running an example program it provided.

Other method to install Hadoop

Installation of HADOOP
1-Download virtual box
➢ Virtual box-oracle (https://www.virtualbox.org/wiki/Downloads)
➢ VirtualBox 7.0.6 platform packages
➢ Windows hosts

2- Google: Hortonworks HDP sandbox (Cloudera)

➢ Go to Hortonworks HDP -> Download Now
➢ Choose Installation Type: VirtualBox
➢ Fill in the form (choose e.g. the training option)
➢ After that, click: download older version: 2.5.0 (11 GB)

3- Connect VirtualBox to HDP

➢ Open VirtualBox
➢ Click Import
➢ Choose the path of the HDP 2.5.0 setup file
➢ It will take time to import (approx. 2 min)


Page of Ambari:
➢ Open Settings
➢ Reduce the base memory (to less than half); this is to avoid the GURU MEDITATION ERROR
✓ Then start the machine (main page of VirtualBox)
✓ After some time, approx. 3-4 min, it will show the http:// link
✓ Connect to localhost: http://127.0.0.1:8888/ or
✓ http://127.0.0.1:8080/#/main/dashboard/metrics
✓ Log in to Ambari
✓ Username: maria_dev
✓ Password: maria_dev

5- Hadoop Main page


Lab NO: 2 Implementation of file management in Hadoop


➢ Adding files and Directories
➢ Retrieving file
➢ Deleting files

AIM: Implementation of file management in Hadoop


With growing data velocity the data size easily outgrows the storage limit of a machine.
A solution would be to store the data across a network of machines. Such filesystems are
called distributed filesystems. Since data is stored across a network all the complications
of a network come in.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with streaming data access pattern and it runs on commodity hardware. Let’s
elaborate the terms:
• Extremely large files: Here we are talking about data in the range of petabytes (1000
TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be processed any
number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features which specially distinguishes HDFS from other file
systems.
Nodes: Master and slave nodes typically form the HDFS cluster.
1. NameNode (master node):
• Manages all the slave nodes and assigns work to them.
• It executes filesystem namespace operations like opening, closing and
renaming files and directories.
• It should be deployed on reliable hardware with a high-end configuration, not
on commodity hardware.
2. DataNode(SlaveNode):
• Actual worker nodes, who do the actual work like reading, writing,
processing etc.
• They also perform creation, deletion, and replication upon instruction from
the master.
• They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in background.
• Namenodes:
• Run on the master node.
• Store metadata (data about data) like file path, the number of blocks, block
Ids. etc.
• Require high amount of RAM.
• Store meta-data in RAM for fast retrieval i.e to reduce seek time. Though a
persistent copy of it is kept on disk.
• DataNodes:
• Run on slave nodes.

• Require high memory as data is actually stored here.


Data storage in HDFS: Now let's see how the data is stored in a distributed manner.
Let's assume that a 100 TB file is inserted. The master node (NameNode) first divides the
file into blocks (for illustration, say 10 TB each; the default block size is 128 MB in Hadoop 2.x
and above). These blocks are then stored across different DataNodes (slave nodes).
DataNodes replicate the blocks among themselves, and the information about which blocks
they contain is sent to the master. The default replication factor is 3, meaning 3 replicas are
created for each block (including the original). In hdfs-site.xml we can increase or decrease
the replication factor, i.e. we can edit its configuration there.
Note: MasterNode has the record of everything, it knows the location and info of each
and every single data nodes and the blocks they contain, i.e. nothing is done without the
permission of masternode.
Why divide the file into blocks?
Answer: Let's assume that we don't divide it; now it's very difficult to store a 100 TB file
on a single machine. Even if we store it, each read and write operation on that whole file
is going to take a very high seek time. But if we have multiple blocks of size 128 MB then it
becomes easy to perform various read and write operations on them compared to doing it
on the whole file at once. So we divide the file to have faster data access, i.e. reduced seek
time.
Why replicate the blocks in data nodes while storing?
Answer: Let’s assume we don’t replicate and only one yellow block is present on datanode
D1. Now if the data node D1 crashes we will lose the block and which willmake the
overall data inconsistent and faulty. So we replicate the blocks to achieve fault-
tolerance.
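As a small worked example using the defaults mentioned above: a 1 GB (1024 MB) file is split into 1024 / 128 = 8 blocks, and with a replication factor of 3 the cluster stores 8 x 3 = 24 block replicas in total, spread across the DataNodes.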
Terms related to HDFS:
• HeartBeat: It is the signal that the datanode continuously sends to the namenode. If the
namenode doesn't receive a heartbeat from a datanode then it will consider it dead.
• Balancing: If a datanode crashes, the blocks present on it are gone too and those
blocks will be under-replicated compared to the remaining blocks. Here the master
node (namenode) will give a signal to the datanodes containing replicas of those lost blocks
to replicate them, so that the overall distribution of blocks is balanced.
• Replication: It is done by the datanode.
Note: No two replicas of the same block are present on the same datanode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available as the same block is present at multiple datanodes.
• Even if multiple datanodes are down we can still do our work, thus making it
highly reliable.
• High fault tolerance.
Limitations: Though HDFS provide many features there are some areas where it
doesn’t work well.
• Low latency data access: Applications that require low-latency access to data i.e in
the range of milliseconds will not work well with HDFS, because HDFS is designed
keeping in mind that we need high-throughput of data even at the cost of latency.
• Small file problem: Having lots of small files will result in lots of seeks and lots
of movement from one datanode to another datanode to retrieve each small file, this
whole process is a very inefficient data access pattern.


File Management tasks in Hadoop


1. Create a directory in HDFS at given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2

2. List the contents of a directory.


Usage :
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode

3. Upload and download a file in HDFS.


Upload: hadoop fs -put:
Copy single src file, or multiple src files from local file system to the Hadoop data file system
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download: hadoop fs -get:
Copies/Downloads files to the local file system
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See contents of a file


Same as unix cat command:
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt


5. Copy a file from source to destination


This command allows multiple sources as well in which case the destination must be a directory.
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

6. Copy a file from/To Local file system to HDFS


copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

Similar to put command, except that the source is restricted to a local file reference.
copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.

7. Move file from source to destination.


Note:- Moving files across filesystem is not permitted.
Usage :
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.


Remove files specified as argument. Deletes directory only when it is empty

Usage :

hadoop fs -rm <arg> Example:


hadoop fs -rm /user/saurzcode/dir1/abc.txt

9. Recursive version of delete.


Usage :
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/

10. Display last few lines of a file.


Similar to tail command in Unix.
Usage :
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt


11. Display the aggregate length of a file.


Usage :
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
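The same file management tasks can also be performed programmatically through the Hadoop FileSystem Java API. The following is a minimal sketch (the paths reuse the examples above and are purely illustrative; it assumes the program runs where the Hadoop configuration files are on the classpath, so the cluster address is picked up from core-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileManagement {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // 1. Create a directory (like: hadoop fs -mkdir)
        fs.mkdirs(new Path("/user/saurzcode/dir1"));

        // 2. Upload a local file (like: hadoop fs -put / -copyFromLocal)
        fs.copyFromLocalFile(new Path("/home/saurzcode/Samplefile.txt"),
                             new Path("/user/saurzcode/dir1/Samplefile.txt"));

        // 3. List the contents of a directory (like: hadoop fs -ls)
        for (FileStatus status : fs.listStatus(new Path("/user/saurzcode/dir1"))) {
            System.out.println(status.getPath());
        }

        // 4. Download a file to the local file system (like: hadoop fs -get / -copyToLocal)
        fs.copyToLocalFile(new Path("/user/saurzcode/dir1/Samplefile.txt"),
                           new Path("/home/saurzcode/Samplefile-copy.txt"));

        // 5. Delete a file or directory recursively (like: hadoop fs -rm / -rmr)
        fs.delete(new Path("/user/saurzcode/dir1"), true);

        fs.close();
    }
}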


Lab NO: 3
Matrix Multiplication with Hadoop MapReduce

AIM: To understand matrix multiplication in Hadoop MapReduce


MapReduce is a technique in which a huge program is subdivided into small tasks that run in parallel
to make computation faster and save time; it is mostly used in distributed systems. It has 2 important
parts:
Mapper: It takes raw data input and organizes it into key-value pairs. For example, in a dictionary
you search for the word “Data” and its associated meaning is “facts and statistics collected together
for reference or analysis”. Here the key is “Data” and the value associated with it is “facts and
statistics collected together for reference or analysis”.

Reducer: It is responsible for processing data in parallel and producing the final output.
Let us consider the matrix multiplication example to visualize MapReduce. Consider the following
matrix:

2×2 matrices A and B

Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2. Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of
columns(k)=2. Each cell of the matrix is labelled as Aij and Bij. Ex. element 3 in matrix A is called
A21 i.e. 2nd-row 1st column. Now One step matrix multiplication has 1 mapper and 1 reducer. The
Formula is:
Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i
Therefore computing the mapper for Matrix A:
# k, i, j computes the number of times it occurs.
# Here all are 2, therefore when k=1, i can have
# 2 values 1 & 2, each case can have 2 further
# values of j=1 and j=2. Substituting all values
# in formula

k=1 i=1 j=1 ((1, 1), (A, 1, 1))


j=2 ((1, 1), (A, 2, 2))
i=2 j=1 ((2, 1), (A, 1, 3))
j=2 ((2, 1), (A, 2, 4))


k=2 i=1 j=1 ((1, 2), (A, 1, 1))


j=2 ((1, 2), (A, 2, 2))
i=2 j=1 ((2, 2), (A, 1, 3))
j=2 ((2, 2), (A, 2, 4))
Computing the mapper for Matrix B
i=1 j=1 k=1 ((1, 1), (B, 1, 5))
k=2 ((1, 2), (B, 1, 6))
j=2 k=1 ((1, 1), (B, 2, 7))
k=2 ((1, 2), (B, 2, 8))

i=2 j=1 k=1 ((2, 1), (B, 1, 5))


k=2 ((2, 2), (B, 1, 6))
j=2 k=1 ((2, 1), (B, 2, 7))
k=2 ((2, 2), (B, 2, 8))
The formula for Reducer is:
Reducer(k, v)=(i, k)=>Make sorted Alist and Blist
(i, k) => Summation (Aij * Bjk)) for j
Output =>((i, k), sum)
Therefore computing the reducer:
# We can observe from Mapper computation
# that 4 pairs are common (1, 1), (1, 2),
# (2, 1) and (2, 2)
# Make a list separate for Matrix A &
# B with adjoining values taken from
# Mapper step above:

(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}


Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(1*5) + (2*7)] =19 ------- (i)

(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(1*6) + (2*8)] =22 ------- (ii)

(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 5), (B, 2, 7)}
Now Aij x Bjk: [(3*5) + (4*7)] =43 ------- (iii)

(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}


Blist ={(B, 1, 6), (B, 2, 8)}
Now Aij x Bjk: [(3*6) + (4*8)] =50 ------- (iv)

From (i), (ii), (iii) and (iv) we conclude that


((1, 1), 19)
((1, 2), 22)
((2, 1), 43)
((2, 2), 50)
Therefore the final matrix is:

| 19  22 |
| 43  50 |

Final output of matrix multiplication.

Code for mapper and Reducer


import java.io.IOException;
import java.util.*;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TwoStepMatrixMultiplication {

    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line: MatrixName,rowIndex,columnIndex,value
            String line = value.toString();
            String[] indicesAndValue = line.split(",");
            Text outputKey = new Text();
            Text outputValue = new Text();
            if (indicesAndValue[0].equals("A")) {
                // For A(i,j): key = j (the join index), value = "A,i,value"
                outputKey.set(indicesAndValue[2]);
                outputValue.set("A," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            } else {
                // For B(j,k): key = j (the join index), value = "B,k,value"
                outputKey.set(indicesAndValue[1]);
                outputValue.set("B," + indicesAndValue[2] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            String[] value;
            ArrayList<Entry<Integer, Float>> listA = new ArrayList<Entry<Integer, Float>>();
            ArrayList<Entry<Integer, Float>> listB = new ArrayList<Entry<Integer, Float>>();
            for (Text val : values) {
                value = val.toString().split(",");
                if (value[0].equals("A")) {
                    listA.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
                } else {
                    listB.add(new SimpleEntry<Integer, Float>(Integer.parseInt(value[1]), Float.parseFloat(value[2])));
                }
            }
            String i;
            float a_ij;
            String k;
            float b_jk;
            Text outputValue = new Text();
            for (Entry<Integer, Float> a : listA) {
                i = Integer.toString(a.getKey());
                a_ij = a.getValue();
                for (Entry<Integer, Float> b : listB) {
                    k = Integer.toString(b.getKey());
                    b_jk = b.getValue();
                    // Emit one partial product a_ij * b_jk for output cell (i, k)
                    outputValue.set(i + "," + k + "," + Float.toString(a_ij * b_jk));
                    context.write(null, outputValue);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        Job job = new Job(conf, "MatrixMatrixMultiplicationTwoSteps");
        job.setJarByClass(TwoStepMatrixMultiplication.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path("hdfs://127.0.0.1:9000/matrixin"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://127.0.0.1:9000/matrixout"));

        job.waitForCompletion(true);
    }
}
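A note on using this job (the input format below is inferred from the mapper's line.split(",") logic; it is not stated explicitly in the manual): each line of the files under /matrixin names the matrix followed by the row index, column index and cell value. For the 2x2 example above, the input could be:

A,1,1,1
A,1,2,2
A,2,1,3
A,2,2,4
B,1,1,5
B,1,2,6
B,2,1,7
B,2,2,8

Also note that the reducer above emits one partial product Aij * Bjk per line rather than the summed cell values; as the class name TwoStepMatrixMultiplication suggests, a second job (or a final aggregation step) is still needed to sum the partial products for each (i, k) cell and obtain the final matrix shown earlier.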


Lab NO: 4 To run a Word Count MapReduce program to understand


the MapReduce paradigm

AIM: To Develop a MapReduce program by using Word count


FOR THIS WE NEED 3 PIECES OF SOFTWARE:
1- HADOOP INSTALLATION
2- PUTTY TO CONNECT TO HADOOP
3- ECLIPSE TO BUILD THE JAR FILE (JAVA)

Step 1: Create a text file on which MapReduce works.

Ex: hi Tarun Hi
Tarun ram ram
Hi Tarun ram
Hi Hi Hi

Create a directory (in the case of Hadoop, create it under /user/maria_dev/, e.g. a folder named dataflow) and verify it:

Ex: hadoop fs -ls /user/maria_dev/dataflow

Step 2: Upload the file (word_count) into this folder, either directly in Hadoop (e.g. through the Ambari Files view) or by using the put command.

Step 3: Run the command hadoop fs -ls

It will show the file.

Step 4: Show the contents of the file by using the cat command:

Ex: hadoop dfs -cat /user/maria_dev/word_count


Part 2: Create the JAR file of the MapReduce word count program by using the Eclipse software
Step 1: New -> Java Project -> Next -> Finish (mapreduce)
Step 2: In the Java project, import all packages of YARN, HDFS etc.
Step 3: Write the word count MapReduce program


Map Function – It takes a set of data and converts it into another set of data, where individual
elements are broken down into tuples (Key-Value pair).
Example – (Map function in Word Count)

Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN

Output
Convert into another set of data
(Key,Value)

(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples
into a smaller set of tuples.

Example – (Reduce function in Word Count)


Input Set of Tuples
(output of Map function)

(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),
(TRAIN,1),(BUS,1),

(buS,1),(caR,1),(CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output Converts into smaller set of tuples


(BUS,7), (CAR,7), (TRAIN,4)

Work Flow of Program


Workflow of MapReduce consists of 5 steps


1. Splitting – The splitting parameter can be anything, e.g. splitting by space,
comma, semicolon, or even by a new line (‘\n’).
2. Mapping – as explained above
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to
group them in the “Reduce Phase”, data with the same KEY should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase
5. Combining – The last phase where all the data (individual result sets from each
cluster) is combined together to form a result

Now Let’s See the Word Count Program in Java

Make sure that Hadoop is installed on your system with the Java JDK.

Steps to follow:

Step 1. Open Eclipse> File > New > Java Project > (Name it – MRProgramsDemo) > Finish
Step 2. Right Click > New > Package ( Name it - PackageDemo) > Finish Step 3.
Right Click on Package > New > Class (Name it - WordCount) Step 4. Add
Following Reference Libraries –

Right Click on Project > Build Path > Add External JARs

• /usr/lib/hadoop-0.20/hadoop-core.jar
• /usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar

Program: Step 5. Type following Program :

package PackageDemo;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration c = new Configuration();
        String[] files = new GenericOptionsParser(c, args).getRemainingArgs();
        Path input = new Path(files[0]);
        Path output = new Path(files[1]);
        Job j = new Job(c, "wordcount");
        j.setJarByClass(WordCount.class);
        j.setMapperClass(MapForWordCount.class);
        j.setReducerClass(ReduceForWordCount.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(j, input);
        FileOutputFormat.setOutputPath(j, output);
        System.exit(j.waitForCompletion(true) ? 0 : 1);
    }

    // Mapper: splits each line on commas and emits (WORD, 1) for every word
    public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(",");
            for (String word : words) {
                Text outputKey = new Text(word.toUpperCase().trim());
                IntWritable outputValue = new IntWritable(1);
                con.write(outputKey, outputValue);
            }
        }
    }

    // Reducer: sums the counts for each word and emits (WORD, total)
    public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            con.write(word, new IntWritable(sum));
        }
    }
}

Make Jar File


Right Click on Project> Export> Select export destination as Jar File > next> Finish


To Move this into Hadoop directly, open the terminal and enter the following
commands:
[training@localhost ~]$ hadoop fs -put wordcountFile wordCountFile

Run Jar file


(Hadoop jar jarfilename.jar packageName.ClassName PathToInputTextFile
PathToOutputDirectry)

[training@localhost ~]$ Hadoop jar MRProgramsDemo.jar


PackageDemo.WordCount wordCountFile MRDir1

Result: Open Result

[training@localhost ~]$ hadoop fs -ls MRDir1


Found 3 items
-rw-r--r-- 1 training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_SUCCESS
drwxr-xr-x - training supergroup
0 2016-02-23 03:36 /user/training/MRDir1/_logs
-rw-r--r-- 1 training supergroup
20 2016-02-23 03:36 /user/training/MRDir1/part-r-00000
[training@localhost ~]$ hadoop fs -cat MRDir1/part-r-00000
BUS 7
CAR 4
TRAIN 6


Lab NO: 5
Implementation of K-means clustering using MapReduce

AIM: Implementation of K-means clustering using MapReduce.

K-Means is a clustering algorithm that partitions a set of data points into k clusters. The k-means
clustering algorithm is commonly used on large data sets, and because of the characteristics of the
algorithm it is a good candidate for parallelization. The aim of this project is to implement a
framework in Java for performing k-means clustering using Hadoop MapReduce.

Pseudocode
The classical k-means algorithm works as an iterative process in which at each iteration it computes
the distance between the data points and the centroids, which are randomly initialized at the beginning
of the algorithm.
We decided to design this algorithm as a MapReduce workflow. A single stage of MapReduce
roughly corresponds to a single iteration of the classical algorithm. As in the classical algorithm, at
the first stage the centroids are randomly sampled from the set of data points. The map function
takes as input a data point and the list of centroids, computes the distance between the point and
each centroid and emits the point and the closest centroid. The reduce function collects all the points
belonging to a cluster, computes the new centroid and emits it. At the end of each stage, it finds
a new approximation of the centroids, which are used for the next iteration. The workflow continues
until the distance between each centroid of the previous stage and the corresponding centroid of the
current stage drops below a given threshold.
centroids = k random sampled points from the dataset.

do:

Map:
- Given a point and the set of centroids.
- Calculate the distance between the point and each centroid.
- Emit the point and the closest centroid.

Reduce:
- Given the centroid and the points belonging to its cluster.
- Calculate the new centroid as the arithmetic mean position of the points.
- Emit the new centroid.

prev_centroids = centroids.
centroids = new_centroids.

while prev_centroids - centroids > threshold.


Mapper
The mapper calculates the distance between the data point and each centroid. Then emits the index
of the closest centroid and the data point.
class MAPPER
method MAP(file_offset, point)
min_distance = POSITIVE_INFINITY
closest_centroid = -1
for all centroid in list_of_centroids
distance = distance(centroid, point)
if (distance < min_distance)
closest_centroid = index_of(centroid)
min_distance = distance
EMIT(closest_centroid, point)

Combiner
At each stage we need to sum the data points belonging to a cluster to calculate the centroid
(arithmetic mean of points). Since the sum is an associative and commutative function, our
algorithm can benefit from the use of a combiner to reduce the amount of data to be transmitted to
the reducers.
class COMBINER
method COMBINER(centroid_index, list_of_points)
point_sum.number_of_points = 0
point_sum = 0
for all point in list_of_points:
point_sum += point
point_sum.number_of_points += 1
EMIT(centroid_index, point_sum)
We implemented the combiner only in the Hadoop algorithm.

Reducer
The reducer calculates the new approximation of the centroid and emits it. The result of the
MapReduce stage will be the same even if the combiner is not called by the Hadoop framework.
class REDUCER
method REDUCER(centroid_index, list_of_partial_sums)
point_sum = 0
point_sum.number_of_points = 0
for all partial_sum in list_of_partial_sums:
point_sum += partial_sum
point_sum.number_of_points += partial_sum.number_of_points
centroid_value = point_sum / point_sum.number_of_points
EMIT(centroid_index, centroid_value)
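As an illustration of how the mapper pseudocode could translate into Hadoop Java code, here is a minimal sketch. It assumes that each input line holds one data point as comma-separated coordinates and that the driver passes the current centroids through the job Configuration under a property named "centroids" (coordinates separated by commas, centroids separated by semicolons); both conventions are assumptions made for this sketch, not requirements of the algorithm.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private double[][] centroids;   // current centroids, parsed once per map task

    @Override
    protected void setup(Context context) {
        // Assumed format: "x1,y1;x2,y2;..." stored by the driver under the "centroids" key
        Configuration conf = context.getConfiguration();
        String[] parts = conf.get("centroids").split(";");
        centroids = new double[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            String[] coords = parts[i].split(",");
            centroids[i] = new double[coords.length];
            for (int d = 0; d < coords.length; d++) {
                centroids[i][d] = Double.parseDouble(coords[d]);
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Parse the data point from a comma-separated line
        String[] coords = value.toString().split(",");
        double[] point = new double[coords.length];
        for (int d = 0; d < coords.length; d++) {
            point[d] = Double.parseDouble(coords[d]);
        }

        // Find the closest centroid (squared Euclidean distance)
        double minDistance = Double.POSITIVE_INFINITY;
        int closestCentroid = -1;
        for (int i = 0; i < centroids.length; i++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[i][d];
                distance += diff * diff;
            }
            if (distance < minDistance) {
                minDistance = distance;
                closestCentroid = i;
            }
        }

        // Emit (index of the closest centroid, the point itself), as in the pseudocode
        context.write(new IntWritable(closestCentroid), value);
    }
}

The combiner and reducer would accumulate coordinate-wise sums and point counts per centroid index, exactly as in the pseudocode above, and the driver would compare the new centroids with the previous ones to decide whether to start another iteration.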


Lab NO: 6
Installation of HIVE along with a practical example

AIM: Use a Hive query to find movie ratings from a data set
1. Install Java: Apache Hive requires Java to run. Check that Java is installed on your
system by running the command "java -version" in the command prompt. If it's not installed,
download and install Java on your system.

2. Download Hive: Download the latest version of Apache Hive from the official website
(https://hive.apache.org/downloads.html).

3. Extract the downloaded file: Extract the downloaded file to a directory of your choice.

4. Configure Hive: Go to the conf directory in the Hive installation folder and edit the hive-site.xml
file. This file contains configuration settings for Hive. Configure the file according to your needs,
such as setting the location of Hadoop and the metastore database.

5. Start Hadoop: Hive requires Hadoop to run. Start the Hadoop daemons by running the command
"start-all.sh" in the bin directory of your Hadoop installation.

6. Start Hive: To start Hive, run the command "hive" in the bin directory of your Hive installation.

7. Test Hive: Once you have started Hive, run some sample queries to test that everything is
working correctly.

Hive Commands:

Data Definition Language (DDL)

DDL statements are used to build and modify the tables and other objects in the database.

DDL Command - Function
CREATE - It is used to create a table or database
SHOW - It is used to show databases, tables, properties, etc.
ALTER - It is used to make changes to the existing table
DESCRIBE - It describes the table columns
TRUNCATE - Used to permanently truncate and delete the rows of a table
DELETE - Deletes the table data, but it can be restored
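As an alternative to the Ambari upload wizard used in the steps below, the same ratings table could be created and loaded from the Hive shell with plain HiveQL. The statements below are only a sketch: they assume u.data is tab-separated and already copied into HDFS, and the two column names simply mirror the ones used in the upload step (adjust the column list to match the actual columns in your copy of u.data).

CREATE TABLE ratings (movie_id INT, ratingCount INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA INPATH '/user/maria_dev/u.data' INTO TABLE ratings;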

Using Hive & Uploading Data

1- Click: Hive View
2- Click: Upload Table

Click: Setting Button Ahead of File Type

1- Select Number 9: TAB (horizontal tab)
2- Then Close

Choose File: u.data

It will look like this:-

1- Change Table Name To – ratings

2- Change Column1 To – movie_id


3- Change Column2 To – ratingCount
4- Click Upload Table – After few minutes, you may need to refresh it.

1- Go to Hive
2- Paste this Query & Execute:-
SELECT movie_id, count(movie_id) as ratingCount
FROM ratings
GROUP BY movie_id
ORDER BY ratingCount
DESC;


Result will look like this:


Click on the graph button:


1- Drag movie_id to x
2- Drag ratingcount to y


Result:


Lab No: 7 Installing HBase and Thrift along with a practical example

AIM: Installing HBase and Thrift along with a practical example


1. Download the HBase distribution package from the official Apache HBase website.
2. Extract the downloaded package to a directory of your choice.
3. Set the `HBASE_HOME` environment variable to the path where you extracted the package.
4. Update the `PATH` environment variable to include the `$HBASE_HOME/bin` directory.
5. Configure the `hbase-site.xml` file located in the `$HBASE_HOME/conf` directory to suit your
needs.
6. If you plan to use HBase with Hadoop, make sure that the Hadoop configuration files are also
in the `$HBASE_HOME/conf` directory.
7. Start the HBase server by running the `start-hbase.sh` script located in the
`$HBASE_HOME/bin` directory.
8. Verify that HBase is running by accessing the HBase shell using the `hbase shell` command.

Note that these are general steps and the specifics of installation may vary depending on your
operating system and the version of HBase you are installing. It is always a good idea to consult
the official documentation and any relevant installation guides specific to your setup.

General commands

In Hbase, general commands are categorized into following commands

• Status
• Version
• Table_help ( scan, drop, get, put, disable, etc.)
• Whoami

To enter the HBase shell, first of all we have to execute the command mentioned below:

hbase shell

Once we enter the HBase shell, we can execute all the shell commands mentioned below. With
the help of these commands, we can perform all types of table operations in the HBase shell mode.

Let us look into all of these commands and their usage one by one with an example.

Status
Syntax:status

This command will give details about the system status like a number of servers present in the
cluster, active server count, and average load value. You can also pass any particular parameters
depending on how detailed status you want to know about the system. The parameters can be
‘summary’, ‘simple’, or ‘detailed’, the default parameter provided is “summary”.

hbase(main):001:0>status
hbase(main):002:0>status 'simple'
hbase(main):003:0>status 'summary'
hbase(main):004:0> status 'detailed'


When we execute the status command, it will give information about the number of servers present,
the dead servers and the average load of the servers; here in the screenshot it shows information like 1 live
server, 1 dead server, and a 7.0000 average load.

Version
Syntax: version

• This command will display the currently used HBase version in command mode
• If you run version command, it will give output as shown above

Table help
Syntax:table_help

This command guides

• What and how to use table-referenced commands


• It will provide different HBase shell command usages and its syntaxes
• Here in the screen shot above, its shows the syntax to “create” and “get_table” command
with its usage. We can manipulate the table via these commands once the table gets created
in HBase.
• It will give table manipulations commands like put, get and all other commands
information.


Syntax: whoami

This command “whoami” is used to return the current HBase user information from the HBase
cluster.

It will provide information like

• Groups present in HBase


• The user information, for example in this case “hduser” represent the user name as shown
in screen shot

TTL(Time To Live) – Attribute

In HBase, Column families can be set to time values in seconds using TTL. HBase will
automatically delete rows once the expiration time is reached. This attribute applies to all versions
of a row – even the current version too.

The TTL time encoded in the HBase for the row is specified in UTC. This attribute used with table
management commands.

Important differences between TTL handling and Column family TTLs are below

• Cell TTLs are expressed in units of milliseconds instead of seconds.


• A cell TTLs cannot extend the effective lifetime of a cell beyond a Column Family level
TTL setting.

Tables Managements commands

These commands will allow programmers to create tables and table schemas with rows and column
families.

The following are Table Management commands

• Create
• List
• Describe
• Disable
• Disable_all
• Enable
• Enable_all
• Drop


• Drop_all
• Show_filters
• Alter
• Alter_status
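To illustrate a few of these commands together, a short HBase shell session might look like the following sketch (the table name 'emp' and column family 'personal' are made-up examples, not part of this lab's data):

hbase(main):001:0> create 'emp', 'personal'
hbase(main):002:0> list
hbase(main):003:0> put 'emp', 'row1', 'personal:name', 'raju'
hbase(main):004:0> get 'emp', 'row1'
hbase(main):005:0> scan 'emp'
hbase(main):006:0> disable 'emp'
hbase(main):007:0> drop 'emp'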


Lab No: 8
Write PIG commands. Write a Pig Latin script to sort, group and filter your data

AIM: Write PIG commands. Write a Pig Latin script to sort, group and filter your data

Commonly used Pig Latin commands:

1. LOAD: Loads data from a specified data source into Pig. Example: input_data = LOAD
'data.txt' USING PigStorage(',') AS (id: int, name: chararray, age:
int);
2. FILTER: Filters the data based on a specified condition. Example: filtered_data =
FILTER input_data BY age > 25;
3. GROUP: Groups the data based on one or more fields. Example: grouped_data =
GROUP input_data BY name;
4. FOREACH: Applies transformations or computations on each record of the data.
Example: transformed_data = FOREACH input_data GENERATE id, name, age *
2 AS doubled_age;
5. ORDER: Sorts the data based on one or more fields. Example: sorted_data = ORDER
input_data BY age DESC;
6. DISTINCT: Removes duplicate records from the data. Example: distinct_data =
DISTINCT input_data;
7. JOIN: Performs a join operation between two or more relations based on a common field.
Example: joined_data = JOIN input_data BY id, other_data BY id;
8. STORE: Stores the data into a specified location or data source. Example: STORE
input_data INTO 'output.txt' USING PigStorage(',');
9. DUMP: Displays the data on the screen or console. Example: DUMP input_data;
10. DESCRIBE: Provides the schema information of a relation. Example: DESCRIBE
input_data;

These are just a few examples of Pig Latin commands. Pig provides a wide range of operators
and functions for data manipulation and analysis.


Write pig Latin script sort, group and filter in your data

In this script, we assume that the input data is in a CSV file called "input_data.csv" with
columns: id, name, age, and salary. The script loads the data, filters it to include only records
where age is greater than 25, groups the filtered data by name, calculates the average salary for
each group, sorts the groups based on the average salary in descending order, stores the sorted
data in a file called "output_data", and finally displays the sorted data using the DUMP
command.
-- Load the input data
input_data = LOAD 'input_data.csv' USING PigStorage(',') AS (id: int, name: chararray, age: int,
salary: float);

-- Filter the data to include only records where age is greater than 25
filtered_data = FILTER input_data BY age > 25;

-- Group the filtered data by name


grouped_data = GROUP filtered_data BY name;

-- Calculate the average salary for each group


avg_salary = FOREACH grouped_data GENERATE group AS name, AVG(filtered_data.salary)
AS average_salary;

-- Sort the grouped data by average salary in descending order


sorted_data = ORDER avg_salary BY average_salary DESC;

-- Store the sorted data


STORE sorted_data INTO 'output_data' USING PigStorage(',');

-- Display the sorted data


DUMP sorted_data;


Lab No: 9
Run the PIG Latin script to find word count

AIM: Run the PIG Latin script to find word count

In this script, we assume that the input data is in a text file called "input.txt". The script loads the
data, tokenizes it by splitting on whitespace, groups the tokenized data by word, counts the
occurrences of each word, stores the word count results in a file called "output.txt", and finally
displays the word count results using the DUMP command.

To run this script, you would typically save it in a file with a .pig extension (e.g., wordcount.pig),
and then execute it using the Pig Latin interpreter or tool provided by Apache Pig. The specific
steps to run the script may vary depending on your Pig installation and environment.
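For instance, assuming Pig is installed and on the PATH, the script below could be run in local mode with:

pig -x local wordcount.pig

or submitted to the Hadoop cluster by omitting the -x local option.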

-- Load the input data


input_data = LOAD 'input.txt' USING TextLoader();

-- Tokenize the input data by splitting on whitespace


tokenized_data = FOREACH input_data GENERATE FLATTEN(TOKENIZE($0)) AS word;

-- Group the tokenized data by word


grouped_data = GROUP tokenized_data BY word;

-- Count the occurrences of each word


word_count = FOREACH grouped_data GENERATE group AS word, COUNT(tokenized_data)
AS count;

-- Store the word count results


STORE word_count INTO 'output.txt' USING PigStorage(',');

-- Display the word count results


DUMP word_count;


Lab No: 10
Run the PIG Latin script to find the maximum temperature for each year.

AIM: Run the PIG Latin script to find the maximum temperature for each year.

To find the maximum temperature for each year using Pig Latin, you can follow this example
script assuming your input data is in the format (year: int, temperature: float):

In this script, we assume that the input data is in a CSV file called "input.txt" with columns: year
and temperature. The script loads the data, groups it by year, calculates the maximum temperature
for each year using the MAX function, stores the results in a file called "output.txt", and finally
displays the results using the DUMP command.

To run this script, save it in a file with a .pig extension (e.g., maxtemp.pig), and then execute it
using the Pig Latin interpreter or tool provided by Apache Pig. Make sure to replace 'input.txt' and
'output.txt' with the actual file paths in your system.

-- Load the input data


input_data = LOAD 'input.txt' USING PigStorage(',') AS (year: int, temperature:
float);

-- Group the data by year


grouped_data = GROUP input_data BY year;

-- Calculate the maximum temperature for each year


max_temp_by_year = FOREACH grouped_data GENERATE group AS year,
MAX(input_data.temperature) AS max_temperature;

-- Store the results


STORE max_temp_by_year INTO 'output.txt' USING PigStorage(',');

-- Display the results


DUMP max_temp_by_year;
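As a quick illustration (the sample values are invented purely to show the expected format), an input.txt containing:

1950,0.5
1950,22.1
1951,29.7
1951,13.0

would produce output records such as:

1950,22.1
1951,29.7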
