Hadoop Ecosystem & Node Communication

The document discusses how communication occurs between the namenode and datanodes in Hadoop. It explains that the datanode initiates all communication by sending heartbeats to the namenode. The heartbeats include statistics and block reports. The namenode responds with commands for the datanode like instructions for block replication. It also describes the read and write operations in HDFS and how data flows from clients to the datanode pipeline and is replicated across multiple datanodes.

Uploaded by

Mukul Mishra

Q1: Explain the Hadoop Ecosystem and discuss how communication takes place between the namenode and datanodes.

What is Hadoop?

Hadoop is a framework that allows for the distributed processing of large datasets across clusters
of computers using simple programming models. Its main characteristics are:

 Economical: It is highly economical, as ordinary computers can be used for data processing.
 Reliable: It stores copies of the data on different machines and is resistant to hardware failure.
 Scalable: It scales both horizontally and vertically; adding a few extra nodes scales the cluster out.
 Flexible: You can store as much structured and unstructured data as you need and decide how to use it later.

Hadoop Ecosystem

Hadoop has an ecosystem that has evolved from its three core components: processing,
resource management, and storage.

The Hadoop ecosystem is continuously growing to meet the needs of Big Data. It comprises the following
components:

 HDFS: Hadoop Distributed File System
 HBase
 Sqoop
 Flume
 Spark
 Pig
 Impala
 Hive
 Cloudera Search
 Oozie
 Hue
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
HDFS:

 HDFS is the primary component of the Hadoop ecosystem. It is responsible for storing
large data sets of structured or unstructured data across various nodes and for
maintaining the metadata in the form of log files.
 HDFS consists of two core components:
1. NameNode
2. DataNode
NameNode/Master Node

 Stores metadata for the files, like the directory structure of a typical FS.
 The server holding the NameNode instance is quite crucial, as there is only one.
 Keeps a transaction log for file adds, deletes, etc. The log covers only metadata, not whole
blocks or file streams.
 Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode/Slave Node

 Stores the actual data in HDFS


 Can run on any underlying filesystem (ext3/4, NTFS, etc)
 Notifies NameNode of what blocks it has
 The NameNode decides replica placement: with a replication factor of 3, two replicas share one rack and the third goes to a different rack

YARN:

 Yet Another Resource Negotiator: as the name implies, YARN manages resources across the
cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Master
 The Resource Manager has the privilege of allocating resources for the applications in the system,
whereas Node Managers handle the allocation of resources such as CPU, memory, and
bandwidth on each machine and report back to the Resource Manager. The Application
Master works as an interface between the Resource Manager and the Node Managers and
performs negotiations between the two as required.

MapReduce:

 Using distributed and parallel algorithms, MapReduce makes it possible to
carry over the processing logic and helps in writing applications that transform big data
sets into manageable ones.
 MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of the data, thereby organizing it into
groups. Map() generates key-value pairs as its result, which are later processed by
the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped
data. Put simply, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
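
The Map() and Reduce() roles described above can be illustrated with a small word count. This is only a single-process sketch in plain Python, not the Hadoop MapReduce API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step stands in for the work the framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map(): emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce(): aggregate all values for one key into a single tuple."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)  # {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```

Six mapped pairs go in and four tuples come out, which is the "smaller set of tuples" that Reduce() is meant to produce.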

Communication between NameNode and DataNode

How do a NameNode and DataNode communicate with each other?

All communication between the NameNode and a DataNode is initiated by the DataNode and
answered by the NameNode. The NameNode never initiates communication with
a DataNode, although its responses may include commands that
cause the DataNode to send further communications.

The DataNode sends information to the NameNode through four major interfaces defined in
DatanodeProtocol:

1. DataNode Registration. The DataNode informs the NameNode of its existence. The NameNode
returns its registration id, which is a parameter of the other DataNode functions.
Registration is triggered when a new DataNode is started, an old one is restarted, or
a new NameNode is started.
2. DataNode sends heartbeat. The DataNode sends a heartbeat message every few seconds.
This includes statistics about capacity and current activity. The NameNode
returns a list of block-oriented commands for the DataNode to execute. These commands
primarily consist of instructions to transfer blocks to other DataNodes for replication
purposes, or instructions to delete blocks. The NameNode can also command an
immediate Block Report from the DataNode, but this is only done to recover from severe
problems.
3. DataNode sends block report. The DataNode periodically reports the blocks contained in its
storage. The period is typically configured to be hourly.
4. DataNode notifies Block Received. The DataNode reports that it has received a new block,
either from a client (during a file write) or from another DataNode (during replication). It
reports each block immediately upon receipt.
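
The four interactions above can be sketched as a toy model in which the NameNode never calls out and only returns commands as the reply to a heartbeat. This is an illustration under simplifying assumptions, not the real DatanodeProtocol: the class, its method names, the "TRANSFER" command tuple, and the target replication of 2 are all hypothetical.

```python
class NameNode:
    """Toy NameNode: it only answers calls that DataNodes make to it."""

    def __init__(self, target_replication=2):
        self.target_replication = target_replication
        self.datanodes = []   # registration ids, in registration order
        self.block_map = {}   # block id -> set of registration ids holding it

    def register(self, address):
        # 1. Registration: hand back an id used in every later call.
        # The address is accepted but not used in this sketch.
        reg_id = f"DN-{len(self.datanodes) + 1}"
        self.datanodes.append(reg_id)
        return reg_id

    def block_report(self, reg_id, blocks):
        # 3. Periodic report of every block the DataNode stores.
        for block in blocks:
            self.block_map.setdefault(block, set()).add(reg_id)

    def block_received(self, reg_id, block):
        # 4. Immediate notice that one new block has arrived.
        self.block_map.setdefault(block, set()).add(reg_id)

    def heartbeat(self, reg_id, used_bytes):
        # 2. Heartbeat: statistics come in (ignored here), commands go out in the reply.
        commands = []
        for block, holders in self.block_map.items():
            if reg_id in holders and len(holders) < self.target_replication:
                targets = [dn for dn in self.datanodes if dn not in holders]
                if targets:
                    commands.append(("TRANSFER", block, targets[0]))
        return commands


nn = NameNode()
dn1 = nn.register("10.0.0.1")
dn2 = nn.register("10.0.0.2")
nn.block_report(dn1, ["blk_1"])
first_reply = nn.heartbeat(dn1, used_bytes=128)   # blk_1 is under-replicated
nn.block_received(dn2, "blk_1")                   # the copy has arrived on dn2
second_reply = nn.heartbeat(dn1, used_bytes=128)  # nothing left to do
print(first_reply, second_reply)  # [('TRANSFER', 'blk_1', 'DN-2')] []
```

Every method here is invoked by a DataNode, so the replication instruction can only reach dn1 as part of a heartbeat reply, which mirrors the rule that the NameNode never initiates communication.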

Read Operation In HDFS

1. A client initiates a read request by calling the 'open()' method of the FileSystem object; it is an
object of type DistributedFileSystem.
2. This object connects to the NameNode using RPC and gets metadata information such as the
locations of the blocks of the file. Note that these addresses are for the first few blocks
of the file.
3. In response to this metadata request, the addresses of the DataNodes holding a copy of each
block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is
returned to the client. FSDataInputStream contains a DFSInputStream, which takes care
of interactions with the DataNodes and the NameNode. In step 4 of the diagram, the
client invokes the 'read()' method, which causes DFSInputStream to establish a connection
with the first DataNode holding the first block of the file.
5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This
read() process continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves on
to locate the next DataNode for the next block.
7. Once the client is done reading, it calls the close() method.
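
The read steps above can be sketched with in-memory stand-ins. The classes ToyNameNode and ToyDataNode and the function read_file are hypothetical names for illustration, not HDFS classes; the sketch assumes the client always picks the first listed replica and ignores failures, streaming, and the batching of block locations.

```python
class ToyNameNode:
    """Holds only metadata: which DataNodes store each block of a file."""

    def __init__(self, block_locations):
        # file path -> ordered list of (block id, [names of DataNodes with a replica])
        self.block_locations = block_locations

    def get_block_locations(self, path):
        return self.block_locations[path]


class ToyDataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks):
        self.blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks are, then read each block from a DataNode."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        datanode = datanodes[holders[0]]        # first listed replica
        data += datanode.read_block(block_id)   # read to the end of the block, then move on
    return data


datanodes = {
    "dn1": ToyDataNode({"blk_1": b"hello "}),
    "dn2": ToyDataNode({"blk_1": b"hello ", "blk_2": b"hdfs"}),
}
namenode = ToyNameNode({"/demo.txt": [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]})
content = read_file(namenode, datanodes, "/demo.txt")
print(content)  # b'hello hdfs'
```

The NameNode object never touches file contents here; it only returns locations, and the bytes come from the DataNodes.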

Write Operation In HDFS

1. A client initiates a write operation by calling the 'create()' method of the DistributedFileSystem
object, which creates a new file. The DistributedFileSystem object connects to the NameNode
using an RPC call and initiates new file creation. However, this file create operation does
not associate any blocks with the file. It is the responsibility of the NameNode to verify that
the file (which is being created) does not already exist and that the client has the correct
permissions to create a new file. If the file already exists or the client does not have sufficient
permission to create a new file, an IOException is thrown to the client. Otherwise, the
operation succeeds and a new record for the file is created by the NameNode.
2. Once a new record in the NameNode is created, an object of type FSDataOutputStream is
returned to the client. The client uses it to write data into HDFS. The data write method is
invoked (step 3 in the diagram).
3. FSDataOutputStream contains a DFSOutputStream object, which looks after
communication with the DataNodes and the NameNode. While the client continues writing
data, DFSOutputStream continues creating packets with this data. These packets are
enqueued into a queue called the DataQueue.
4. There is one more component, called DataStreamer, which consumes this DataQueue.
DataStreamer also asks the NameNode to allocate new blocks, thereby picking desirable
DataNodes to be used for replication.
5. Now, the process of replication starts by creating a pipeline of DataNodes. In our
case, we have chosen a replication level of 3, and hence there are 3 DataNodes in the
pipeline.
6. The DataStreamer pours packets into the first DataNode in the pipeline.
7. Every DataNode in the pipeline stores the packet it receives and forwards it to the
next DataNode in the pipeline.
8. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that
are waiting for acknowledgment from the DataNodes.
9. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure,
packets from this queue are used to reinitiate the operation.
10. After the client is done writing data, it calls the close() method (step 9 in the
diagram). The call to close() flushes the remaining data packets to the pipeline
and then waits for acknowledgment.
11. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file
write operation is complete.
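
The interplay of the data queue, the ack queue, and the DataNode pipeline can be sketched as follows. This is a synchronous, failure-free simplification: real HDFS streams packets asynchronously, and the function name, the 4-byte packet size, and the use of plain lists as DataNode stores are assumptions made for illustration.

```python
from collections import deque


def write_through_pipeline(data, pipeline, packet_size=4):
    """Split data into packets and push each one through every DataNode store."""
    data_queue = deque(
        (seq, data[start:start + packet_size])
        for seq, start in enumerate(range(0, len(data), packet_size))
    )
    ack_queue = deque()

    while data_queue:
        packet = data_queue.popleft()
        ack_queue.append(packet)      # held until every DataNode has acknowledged it
        acks = 0
        for store in pipeline:        # each node stores the packet, then it moves to the next
            store.append(packet)
            acks += 1
        if acks == len(pipeline):
            ack_queue.popleft()       # fully acknowledged, safe to drop
    return len(ack_queue)             # packets still waiting for acknowledgment


pipeline = [[], [], []]               # replication level 3
unacked = write_through_pipeline(b"hello hdfs!", pipeline)
replicas = [b"".join(payload for _, payload in store) for store in pipeline]
print(unacked, replicas)  # 0 [b'hello hdfs!', b'hello hdfs!', b'hello hdfs!']
```

If a DataNode failed partway through, the packets still sitting in ack_queue would be the ones to resend, which is the role the text gives the 'Ack Queue'.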

Q2: Explain the steps required to configure Eclipse with Apache Hadoop.
Support your answer with screenshots.

Eclipse is one of the most popular Integrated Development Environments (IDEs) for developing Java
applications. It is a robust, feature-rich, easy-to-use, and powerful IDE that is a common choice
among Java programmers, and it is free to use.

1. Download and Install Eclipse IDE

https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020

After downloading the file, we need to move this file to the home directory.

2. Extracting the file

There are two ways to extract the file: either right-click the file and choose "Extract Here", or
unpack it with the following command:

$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz

4. Choose a Workspace Directory:

Eclipse organizes projects by workspaces. A workspace is a group of related projects, and it is
actually a directory on your computer. That is why, when you start Eclipse, it asks you to choose a
workspace location.

5. Create a Java Project in Package Explorer:
In the Package Explorer on the left side of the interface, choose:

File > New > Java Project > (Name it – WordCountProject1) > Finish

The new Java project has been created.

6. After that, create the package by clicking:

Right Click > New > Package (Name it – WordCountPackage1) > Finish

7. After creating the package, the next step is to create the class:

Right Click on Package > New > Class (Name it – WordCount)

8. Download Hadoop Libraries

Add the following reference libraries:

 hadoop-core-1.2.1.jar
 commons-cli-1.2.jar

9. To add these libraries:

Right Click on Project > Build Path > Add External Archives

 Downloads/hadoop-core-1.2.1.jar
 Downloads/commons-cli-1.2.jar

10. The Hadoop libraries have been added successfully.

The next step is to write the mapper and reducer logic for the word count project.

Q3: Explain the YARN model and give its comparison with that of Hadoop.

Hadoop YARN knits the storage unit of Hadoop i.e. HDFS (Hadoop Distributed File System) with
the various processing tools. YARN stands for “Yet Another Resource Negotiator”.

Why YARN?

In Hadoop version 1.0, MapReduce performed both processing and resource management
functions. It consisted of a Job Tracker which was the single master. The Job Tracker allocated the
resources, performed scheduling and monitored the processing jobs. It assigned map and reduce
tasks on a number of subordinate processes called the Task Trackers. The Task Trackers
periodically reported their progress to the Job Tracker.

This design resulted in a scalability bottleneck due to the single Job Tracker. IBM mentioned in its
article that, according to Yahoo!, the practical limits of such a design are reached with a cluster of
5000 nodes and 40,000 tasks running concurrently. Apart from this limitation, the utilization of
computational resources is inefficient in MRv1. Also, the Hadoop framework was limited
to the MapReduce processing paradigm.

To overcome all these issues, YARN was introduced in Hadoop version 2.0 in the year 2012 by
Yahoo and Hortonworks. The basic idea behind YARN is to relieve MapReduce by taking over
the responsibility of Resource Management and Job Scheduling. YARN gave Hadoop
the ability to run non-MapReduce jobs within the Hadoop framework.

Introduction to Hadoop YARN

YARN was introduced in Hadoop version 2.0. It allows different data processing methods, such as graph processing,
interactive processing, stream processing, and batch processing, to run on and process data stored
in HDFS. YARN therefore opens up Hadoop to other types of distributed applications beyond
MapReduce.

YARN enabled the users to perform operations as per requirement by using a variety of tools
like Spark for real-time processing, Hive for SQL, HBase for NoSQL and others.

Apart from Resource Management, YARN also performs Job Scheduling. YARN performs all
your processing activities by allocating resources and scheduling tasks. Apache Hadoop YARN
Architecture consists of the following main components:

1. Resource Manager: Runs as a master daemon and manages resource allocation in the
cluster.
2. Node Manager: Runs as a slave daemon on each node and is responsible for executing
tasks on that DataNode.
3. Application Master: Manages the user job lifecycle and the resource needs of individual
applications. It works along with the Node Manager and monitors the execution of tasks.
4. Container: A package of resources, including RAM, CPU, network, and disk, on a single
node.

Components of YARN

The first component of the YARN architecture is:

Resource Manager

 It is the ultimate authority in resource allocation.
 On receiving processing requests, it passes parts of the requests to the corresponding Node
Managers, where the actual processing takes place.
 It is the arbitrator of the cluster resources and decides the allocation of the available
resources among competing applications.
 It optimizes cluster utilization (keeping all resources in use all the time) against
various constraints such as capacity guarantees, fairness, and SLAs.
 It has two major components: a) Scheduler b) Application Manager

a) Scheduler

 The Scheduler is responsible for allocating resources to the various running applications,
subject to constraints of capacities, queues, etc.
 It is a pure scheduler, which means that it does not perform
any monitoring or tracking of status for the applications.
 If there is an application failure or hardware failure, the Scheduler does not guarantee
restarting the failed tasks.
 It performs scheduling based on the resource requirements of the applications.
 It has a pluggable policy that is responsible for partitioning the cluster
resources among the various applications. Two such plug-ins, the Capacity
Scheduler and the Fair Scheduler, are currently used as schedulers in the
ResourceManager.
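
To make the idea of a policy that partitions cluster resources among applications concrete, here is a sketch of one possible policy, max-min fairness over a single resource. It is not the algorithm of YARN's Fair Scheduler or Capacity Scheduler; the function name, the application names, and the choice of memory in GB as the only resource are assumptions for illustration.

```python
def max_min_fair_share(capacity, demands):
    """Split capacity so that no application gets more than it asked for and
    unused share is passed on to applications that still want more."""
    allocation = {}
    remaining = capacity
    # Serve the smallest demand first so its unused share can be redistributed.
    pending = sorted(demands.items(), key=lambda item: item[1])
    for index, (app, demand) in enumerate(pending):
        equal_share = remaining / (len(pending) - index)
        allocation[app] = min(demand, equal_share)
        remaining -= allocation[app]
    return allocation


# 12 GB of cluster memory, three applications with different requirements.
allocation = max_min_fair_share(12, {"app_a": 2, "app_b": 8, "app_c": 10})
print(allocation)  # {'app_a': 2, 'app_b': 5.0, 'app_c': 5.0}
```

app_a needs less than an equal third, so the 2 GB it leaves unused is split between the other two. The function only decides who gets how much; it does no monitoring or restarting, in line with the "pure scheduler" description.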

b) Application Manager

 It is responsible for accepting job submissions.


 Negotiates the first container from the Resource Manager for executing the
application-specific Application Master.
 Manages running the Application Masters in a cluster and provides service for restarting
the Application Master container on failure.

The second component is:

Node Manager

 It takes care of an individual node in a Hadoop cluster and manages user jobs and workflow
on that node.
 It registers with the Resource Manager and sends heartbeats with the health status of the
node.
 Its primary goal is to manage the application containers assigned to it by the Resource
Manager.
 It keeps the Resource Manager up to date with its status.
 The Application Master requests the assigned container from the Node Manager by sending it
a Container Launch Context (CLC), which includes everything the application needs in
order to run. The Node Manager creates the requested container process and starts it.
 It monitors the resource usage (memory, CPU) of individual containers.
 It performs log management.
 It also kills containers as directed by the Resource Manager.

The third component of Apache Hadoop YARN is:

Application Master

 An application is a single job submitted to the framework. Each application has a
unique Application Master associated with it, which is a framework-specific entity.
 It is the process that coordinates an application’s execution in the cluster and also
manages faults.
 Its task is to negotiate resources from the Resource Manager and work with the Node
Manager to execute and monitor the component tasks.
 It is responsible for negotiating appropriate resource containers from the
ResourceManager, tracking their status and monitoring progress.
 Once started, it periodically sends heartbeats to the Resource Manager to affirm its health
and to update the record of its resource demands.

The fourth component is:

Container

 It is a collection of physical resources such as RAM, CPU cores, and disks on a single
node.
 YARN containers are managed through a Container Launch Context (CLC), which describes the container
life cycle. This record contains a map of environment variables, dependencies stored in
remotely accessible storage, security tokens, the payload for Node Manager services, and
the command necessary to create the process.
 It grants rights to an application to use a specific amount of resources (memory, CPU,
etc.) on a specific host.
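
The way the four components fit together can be sketched with toy classes: the Resource Manager grants a container on a specific host, and the Node Manager on that host launches it from a launch context. None of these are the real YARN classes; the names, the fields, the memory-only resource model, and the first-fit host choice are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ContainerLaunchContext:
    """Hypothetical stand-in for the CLC record: what to run and with which environment."""
    command: str
    environment: dict = field(default_factory=dict)


class NodeManager:
    """Manages the containers on one node."""

    def __init__(self, memory_mb):
        self.free_mb = memory_mb
        self.running = []

    def launch(self, container_mb, clc):
        if container_mb > self.free_mb:
            raise RuntimeError("not enough free memory on this node")
        self.free_mb -= container_mb
        self.running.append(clc.command)


class ResourceManager:
    """Decides which host a requested container may run on."""

    def __init__(self, node_managers):
        self.node_managers = node_managers  # host name -> NodeManager

    def allocate(self, container_mb):
        for host, node_manager in self.node_managers.items():
            if node_manager.free_mb >= container_mb:
                return host, container_mb   # a grant for a specific amount on a specific host
        return None


# The Application Master's part: ask for a container, then hand the CLC to that node.
nodes = {"node1": NodeManager(1024), "node2": NodeManager(4096)}
rm = ResourceManager(nodes)
host, size = rm.allocate(2048)
clc = ContainerLaunchContext("run-task", {"JAVA_HOME": "/opt/java"})
nodes[host].launch(size, clc)
print(host, nodes[host].free_mb, nodes[host].running)  # node2 2048 ['run-task']
```

node1 is skipped because it has only 1024 MB free, so the grant names node2, and only that Node Manager's free memory changes.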
