Hadoop Configuration Guide

Unit 2

Configure Hadoop

Prerequisites

• Install Java.
• Download a stable version of Hadoop from Apache mirrors.

Configure

• Site-specific configuration - etc/hadoop/core-site.xml, etc/hadoop/hdfs-site.xml, etc/hadoop/yarn-site.xml and etc/hadoop/mapred-site.xml.

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons
execute as well as the configuration parameters for the Hadoop daemons.

HDFS daemons are NameNode, SecondaryNameNode, and DataNode. YARN daemons are ResourceManager,
NodeManager, and WebAppProxy. If MapReduce is to be used, then the MapReduce Job History Server will
also be running. For large installations, these are generally running on separate hosts.
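As a minimal sketch of what goes into these files (the hostname, port, and values below are illustrative placeholders), a basic setup often starts with entries like these:

core-site.xml:

<configuration>
  <property>
    <!-- URI of the default file system; namenode.example.com is a placeholder host -->
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <!-- Number of replicas kept for each block; 3 is the usual default -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>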

Storing data with HDFS

HDFS (Hadoop Distributed File System) is a vital component of the Apache Hadoop project. Hadoop is an
ecosystem of software components that work together to help you manage big data. The two main elements of Hadoop are:

• MapReduce – responsible for executing tasks


• HDFS – responsible for maintaining data

Hadoop Distributed File System is a fault-tolerant data storage file system that runs on commodity hardware. It
was designed to overcome challenges that traditional databases couldn't handle, so its full potential is realized
only when it is used for big data.

What are the Benefits of HDFS?

The benefits of HDFS are, in fact, solutions that the file system provides for the previously mentioned
challenges:

1. It is fast. It can deliver more than 2 GB of data per second thanks to its cluster architecture.
2. It is free. HDFS is an open-source software that comes with no licensing or support cost.
3. It is reliable. The file system stores multiple copies of data in separate systems to ensure it is always
accessible.

These advantages are especially significant when dealing with big data and were made possible with the
particular way HDFS handles data.

How Does HDFS Store Data?

HDFS divides files into blocks and stores each block on a DataNode. Multiple DataNodes are linked to the
master node in the cluster, the NameNode. The master node distributes replicas of these data blocks across the
cluster. It also tells the client where to find the requested data.

However, before the NameNode can help you store and manage the data, it first needs to partition the file into
smaller, manageable data blocks. This process is called data block splitting.
Data Block Splitting

By default, a block can be no more than 128 MB in size. The number of blocks depends on the initial size of the
file. All but the last block are the same size (128 MB), while the last one is what remains of the file.

For example, an 800 MB file is broken up into seven data blocks. Six of the seven blocks are 128 MB, while the
seventh data block is the remaining 32 MB.
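As a quick illustration (the file path below is just a placeholder), you can check the configured block size and see how an existing file was split into blocks:

hdfs getconf -confKey dfs.blocksize
hdfs fsck /data/sample_800mb.dat -files -blocks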

Then, each block is replicated into several copies.

Data Replication

Based on the cluster’s configuration, the NameNode creates a number of copies of each data block using
the replication method.

It is recommended to have at least three replicas, which is also the default setting. The master node stores them
onto separate DataNodes of the cluster. The state of nodes is closely monitored to ensure the data is always
available.

To ensure high accessibility, reliability, and fault-tolerance, developers advise setting up the three replicas using
the following topology:

• Store the first replica on the node where the client is located.
• Then, store the second replica on a different rack.
• Finally, store the third replica on the same rack as the second replica, but on a different node.
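For reference (the path is illustrative), the cluster-wide default comes from dfs.replication in hdfs-site.xml, and the replication factor of an existing file can be changed from the command line:

hadoop fs -setrep -w 3 /data/sample.dat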
The active NameNode keeps track of the metadata of each data block and its replicas. This includes the file
name, permissions, ID, location, and number of replicas. It persists this information in an fsimage, a checkpoint of the
namespace stored on the NameNode's local disk (the working copy of the namespace is held in memory). Additionally, it
maintains transaction logs called EditLogs, which record all changes made to the file system.

The main purpose of the Standby NameNode is to remove the single point of failure. It reads any
changes made to the EditLogs and applies them to its own namespace (the files and the directories in the data). If the
active NameNode fails, the ZooKeeper service carries out the failover, allowing the standby to take over and maintain an
active session.

DataNodes

DataNodes are slave daemons that store the data blocks assigned by the NameNode. As mentioned above, the
default settings ensure each data block has three replicas. You can change the number of replicas; however, it is
not advisable to go below three.

The replicas should be distributed in accordance with Hadoop's Rack Awareness policy, which notes that:

• The number of replicas has to be larger than the number of racks.
• One DataNode can store only one replica of a data block.
• One rack cannot store more than two replicas of a data block.

By following these guidelines, you can:

• Maximize network bandwidth.
• Protect against data loss.
• Improve performance and reliability.

Key Features of HDFS

These are the main characteristics of the Hadoop Distributed File System:

1. Manages big data. HDFS is excellent in handling large datasets and provides a solution that traditional file
systems could not. It does this by segregating the data into manageable blocks which allow fast processing
times.
2. Rack-aware. It follows the guidelines of rack awareness which ensures a system is highly available and
efficient.

3. Fault tolerant. As data is stored across multiple racks and nodes, it is replicated. This means that if any of the
machines within a cluster fails, a replica of that data will be available from a different node.

4. Scalable. You can scale resources according to the size of your file system. HDFS includes vertical and
horizontal scalability mechanisms.

Let's say we need to move a 1 GB text file to HDFS.

1. HDFS will split the file into blocks (64 MB each in this example; newer Hadoop releases default to 128 MB).

   a. The size of the blocks can be configured.
   b. An entire block of data will be used in the computation.
   c. Think of it as a sector on a hard disk.

2. Each block will be sent to 3 machines (DataNodes) for storage.

   a. This provides reliability and efficient data processing.
   b. The replication factor of 3 is configurable.
   c. A RAID configuration to store the data is not required.
   d. Since data is replicated 3 times, the usable storage capacity is reduced to a third of the raw capacity.

3. The accounting of each block is stored in a central server, called the NameNode.

   a. The NameNode is a master node that keeps track of each file, its corresponding blocks,
      and the DataNode locations.
   b. MapReduce will talk to the NameNode and send the computation to the
      corresponding DataNodes.
   c. The NameNode is the key to all the data, hence the Secondary NameNode is
      used to improve the reliability of the cluster.
HDFS Commands

Let us now start with the HDFS commands.

1. version

Hadoop HDFS version Command Usage:

hadoop version

Hadoop HDFS version Command Example:

Before working with HDFS you need to deploy Hadoop, i.e., install and configure a Hadoop cluster first.

hadoop version

Hadoop HDFS version Command Description:

The hadoop version command prints the Hadoop version.
2. mkdir

Hadoop HDFS mkdir Command Usage:

hadoop fs -mkdir /path/directory_name
Hadoop HDFS mkdir Command Example 1:
In this example, we are trying to create a newDataFlair named directory in HDFS using the mkdir command.

Using the ls command, we can check for the directories in HDFS.
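A plausible pair of commands for this example (the directory name newDataFlair comes from the description above; adjust paths as needed):

hadoop fs -mkdir /newDataFlair
hadoop fs -ls /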



Hadoop HDFS mkdir Command Description:

This command creates a directory in HDFS if it does not already exist.

Note: If the directory already exists in HDFS, you will get a "file already exists" error.
Use hadoop fs -mkdir -p /path/directory_name so that the command does not fail even if the directory exists.
3. ls

Hadoop HDFS ls Command Usage:

hadoop fs -ls /path
Hadoop HDFS ls Command Example 1:
In the example below, we use the ls command to list the files and directories present in HDFS.
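For instance (the path is illustrative):

hadoop fs -ls /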

Hadoop HDFS ls Command Description:

The Hadoop fs shell command ls displays a list of the contents of the directory specified in the path provided by
the user. It shows the name, permissions, owner, size, and modification date for each file or directory in the
specified directory.
Hadoop HDFS ls Command Example 2:

The recursive form, hadoop fs -ls -R /path, behaves like -ls but recursively displays entries in all subdirectories of the path.
4. put

Hadoop HDFS put Command Usage:

hadoop fs -put <localsrc> <dest>
Hadoop HDFS put Command Example:
Here in this example, we are trying to copy localfile1 of the local file system to the Hadoop filesystem.
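A plausible invocation for this example (the destination path is illustrative):

hadoop fs -put localfile1 /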

Hadoop HDFS put Command Description:


The Hadoop fs shell command put is similar to copyFromLocal; it copies files or directories from the
local filesystem to the destination in the Hadoop filesystem.
5. copyFromLocal

Hadoop HDFS copyFromLocal Command Usage:

hadoop fs -copyFromLocal <localsrc> <hdfs destination>
Hadoop HDFS copyFromLocal Command Example:
Here in the below example, we are trying to copy the ‘test1’ file present in the local file system to the
newDataFlair directory of Hadoop.
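For example (file name and directory follow the description above):

hadoop fs -copyFromLocal test1 /newDataFlair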

Hadoop HDFS copyFromLocal Command Description:


This command copies the file from the local file system to HDFS.

6. get

Hadoop HDFS get Command Usage:

hadoop fs -get <src> <localdest>
Hadoop HDFS get Command Example:
In this example, we are trying to copy the ‘testfile’ of the hadoop filesystem to the local file system.
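For example (both paths are illustrative):

hadoop fs -get /testfile ./testfile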

Hadoop HDFS get Command Description:


The Hadoop fs shell command get copies the file or directory from the Hadoop file system to the local file
system.
7. copyToLocal

Hadoop HDFS copyToLocal Command Usage:

hadoop fs -copyToLocal <hdfs source> <localdst>
Hadoop HDFS copyToLocal Command Example:
Here in this example, we are trying to copy the ‘sample’ file present in the newDataFlair directory of HDFS to
the local file system.

We can cross-check whether the file is copied or not using the ls command.
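A plausible pair of commands for this example (local paths are illustrative):

hadoop fs -copyToLocal /newDataFlair/sample .
ls -l sample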

Hadoop HDFS copyToLocal Description:


copyToLocal command copies the file from HDFS to the local file system.
8. cat

Hadoop HDFS cat Command Usage:

hadoop fs -cat /path_to_file_in_hdfs
Hadoop HDFS cat Command Example:
Here in this example, we are using the cat command to display the content of the ‘sample’ file present in
newDataFlair directory of HDFS.
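For example (path follows the description above):

hadoop fs -cat /newDataFlair/sample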

Hadoop HDFS cat Command Description:


The cat command reads the file in HDFS and displays the content of the file on console or stdout.
9. mv

Hadoop HDFS mv Command Usage:

hadoop fs -mv <src> <dest>
Hadoop HDFS mv Command Example:
In this example, we have a directory ‘DR1’ in HDFS. We are using mv command to move the DR1 directory to
the DataFlair directory in HDFS.
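A plausible invocation for this example (paths follow the description above):

hadoop fs -mv /DR1 /DataFlair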
Hadoop HDFS mv Command Description:
The HDFS mv command moves the files or directories from the source to a destination within HDFS.
10. cp

Hadoop HDFS cp Command Usage:

hadoop fs -cp <src> <dest>
Hadoop HDFS cp Command Example:
In the below example we are copying the ‘file1’ present in newDataFlair directory in HDFS to the dataflair
directory of HDFS.
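For example (paths follow the description above):

hadoop fs -cp /newDataFlair/file1 /dataflair/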

Hadoop HDFS cp Command Description:


The cp command copies a file from one directory to another directory within the HDFS.

Hadoop Distributions

Table of Contents

11.1. The Case for Distributions

11.2. Overview of Hadoop Distributions

11.3. Hadoop in the Cloud

11.1. The Case for Distributions

Hadoop is Apache software, so it is freely available for download and use. So why do we need distributions at all?
This is very akin to Linux a few years back and Linux distributions like RedHat, SUSE and Ubuntu. The software
is free to download and use, but distributions offer an easier-to-use bundle.

So what do Hadoop distros offer?

Distributions provide easy-to-install media like RPMs

The Apache version of Hadoop is just TAR balls. Distros package it nicely into easy-to-install
packages, which makes it easy for system administrators to manage effectively.

Distros package multiple components that work well together

The Hadoop ecosystem contains a lot of components (HBase, Pig, Hive, Zookeeper, etc.) which are being
developed independently and have their own release schedules. Also, there are version dependencies
among the components. For example version 0.92 of HBase needs a particular version of HDFS.

Distros bundle versions of components that work well together. This provides a working Hadoop
installation right out of the box.

Tested

Distro makers strive to ensure good quality components.

Performance patches

Sometimes, distros lead the way by including performance patches to the 'vanilla' versions.

Predictable upgrade path

Distros have predictable product release road maps. This ensures they keep up with developments and
bug fixes.

And most importantly . . . SUPPORT

A lot of distros come with support, which can be very valuable for a production-critical cluster.

11.2. Overview of Hadoop Distributions

Table 11.1. Hadoop Distributions

Apache (hadoop.apache.org)
  o The Hadoop source
  o Completely free and open source
  o No packaging except TAR balls
  o No extra tools
  o Oldest distro

Cloudera (www.cloudera.com)
  o Very polished
  o Comes with good tools to install and manage a Hadoop cluster
  o Free / Premium model (depending on cluster size)

HortonWorks (www.hortonworks.com)
  o Newer distro
  o Completely open source
  o Tracks Apache Hadoop closely
  o Comes with tools to manage and administer a cluster

MapR (www.mapr.com)
  o MapR has their own file system (an alternative to HDFS)
  o Boasts higher performance
  o Nice set of tools to manage and administer a cluster
  o Does not suffer from a Single Point of Failure
  o Offers some cool features like mirroring, snapshots, etc.
  o Free / Premium model

Intel (hadoop.intel.com)
  o Encryption support
  o Hardware acceleration added to some layers of the stack to boost performance
  o Admin tools to deploy and manage Hadoop
  o Premium

Pivotal HD (gopivotal.com)
  o Fast SQL on Hadoop
  o Software only or appliance
  o Premium

11.3. Hadoop in the Cloud

Elephants can really fly in the clouds! Most cloud providers offer Hadoop.

Hadoop clusters in the Cloud

Hadoop clusters can be set up in any cloud service that offers suitable
machines.

However, in line with the cloud mantra 'only pay for what you use', Hadoop
can be run 'on demand' in the cloud.

Amazon Elastic Map Reduce

Amazon offers 'On Demand Hadoop', which means there is no permanent Hadoop cluster. A cluster is spun up to
do a job and after that it is shut down - 'pay for usage'.

Amazon offers a slightly customized version of Apache Hadoop and also offers MapR's distribution.

Google's Compute Engine

Google offers MapR's Hadoop distribution in their Compute Engine Cloud.

SkyTap Cloud

SkyTap offers deployable Hadoop templates.

What is a rack?

A rack is a collection of around 40-50 DataNodes connected to the same network switch. If the switch
goes down, the whole rack becomes unavailable. A large Hadoop cluster is deployed across multiple racks.
What is Rack Awareness in Hadoop HDFS?

In a large Hadoop cluster, there are multiple racks. Each rack consists of DataNodes. Communication between
the DataNodes on the same rack is more efficient as compared to the communication between DataNodes
residing on different racks.

To reduce the network traffic during file read/write, the NameNode chooses the closest DataNode for serving the
client read/write request. The NameNode maintains the rack ID of each DataNode to obtain this rack information.
This concept of choosing the closest DataNode based on rack information is known as Rack Awareness.
Therefore, on a multi-rack cluster the NameNode maintains block replication using the built-in Rack Awareness
policies, which say:
• Not more than one replica be placed on one node.
• Not more than two replicas are placed on the same rack.
• Also, the number of racks used for block replication should always be smaller than the number of
replicas.
For the common case where the replication factor is three, the block replication policy puts the first replica on the
writer's local node, the second replica on a node in a different (remote) rack, and the third replica on a different node in that same remote rack.

Also, while re-replicating a block, if the existing replica is one, place the second replica on a different rack. If
the existing replicas are two and are on the same rack, then place the third replica on a different rack.
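As a minimal sketch of how rack information is usually supplied (the script path and rack names are placeholders), the NameNode is pointed at a topology script in core-site.xml:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology.sh</value>
</property>

The script receives one or more DataNode IP addresses or hostnames as arguments and prints one rack ID per node, for example:

#!/bin/bash
# Illustrative only: map every node to a single default rack
for node in "$@"; do
  echo "/default-rack"
done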
What is MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored
in the Hadoop File System (HDFS). It is a core component, integral to the functioning of the Hadoop framework.

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing
them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to
return a consolidated output back to the application.

For example, a Hadoop cluster with 20,000 inexpensive commodity servers, each processing a 256 MB block of data,
can work through around 5 TB of data at the same time. This reduces the processing time as compared to sequential
processing of such a large data set.

With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the
server where the data already resides, to expedite processing.

MapReduce was once the only method through which the data stored in the HDFS could be retrieved, but that is
no longer the case. Today, there are other query-based systems such as Hive and Pig that are used to retrieve data
from the HDFS using SQL-like statements.
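As a quick, hedged illustration of launching a MapReduce job (the jar path and HDFS paths below are placeholders and will differ per installation), Hadoop Streaming lets ordinary shell commands act as the mapper and reducer:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -input /logs/input \
  -output /logs/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc

Here the mapper simply passes records through and the reducer counts lines, words, and bytes; real jobs plug in their own map and reduce logic the same way.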

MapReduce Architecture:

The MapReduce task is mainly divided into two phases, the Map phase and the Reduce phase, which run one
after the other.

The Map function takes input from the disk as <key,value> pairs, processes them, and produces another set of
intermediate <key,value> pairs as output.

The Reduce function also takes inputs as <key,value> pairs, and produces <key,value> pairs as output.

Phase      Input                Output

Map        <k1, v1>             list(<k2, v2>)

Reduce     <k2, list(v2)>       list(<k3, v3>)

Map

The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

For example, if a file has 100 records to be processed, 100 mappers can run together to process one record each.
Or maybe 50 mappers can run together to process two records each. The Hadoop framework decides how many
mappers to use, based on the size of the data to be processed and the memory block available on each mapper
server.

Reduce

After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to
the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the
same key are assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition


There are two intermediate steps between Map and Reduce.

Combine is an optional process. The combiner is a reducer that runs individually on each mapper server. It reduces
the data on each mapper further to a simplified form before passing it downstream.

This makes shuffling and sorting easier as there is less data to work with. Often, the combiner class is set to the
reducer class itself when the reduce function is commutative and associative. However, if needed,
the combiner can be a separate class as well.

Partition is the process that translates the <key, value> pairs resulting from mappers to another set of <key, value>
pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a
particular reducer.

The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a partition
based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete,
the data from each partition is sent to a specific reducer.
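In other words, with the default hash partitioner the target reducer is effectively chosen as partition = hash(key) mod (number of reducers), so all values for a given key land on the same reducer.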

A MapReduce Example

Consider an ecommerce system that receives a million requests every day to process payments. There may be
several exceptions thrown during these requests such as "payment declined by a payment gateway," "out of
inventory," and "invalid address." A developer wants to analyze the last four days' logs to understand which exception
is thrown how many times.

Example Use Case

The objective is to isolate use cases that are most prone to errors, and to take appropriate action. For example, if
the same payment gateway is frequently throwing an exception, is it because of an unreliable service or a badly
written interface? If the "out of inventory" exception is thrown often, does it mean the inventory calculation service
has to be improved, or does the inventory stocks need to be increased for certain products?

The developer can ask relevant questions and determine the right course of action. To perform this analysis on
logs that are bulky, with millions of records, MapReduce is an apt programming model. Multiple mappers can
process these logs simultaneously: one mapper could process a day's log or a subset of it based on the log size and
the memory block available for processing in the mapper server.

Map

For simplification, let's assume that the Hadoop framework runs just four mappers: Mapper 1, Mapper 2, Mapper
3, and Mapper 4.

The value input to the mapper is one record of the log file. The key could be a text string such as "file name + line
number." The mapper then processes each record of the log file to produce key-value pairs. Here, we will just
use '1' as a filler for the value. The output from the mappers looks like this:
Mapper 1 -> <Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>

Mapper 2 -> <Exception B, 1>, <Exception B, 1>, <Exception A, 1>, <Exception A, 1>

Mapper 3 -> <Exception A, 1>, <Exception C, 1>, <Exception A, 1>, <Exception B, 1>, <Exception A, 1>

Mapper 4 -> <Exception B, 1>, <Exception C, 1>, <Exception C, 1>, <Exception A, 1>

Assuming that there is a combiner running on each mapper—Combiner 1 … Combiner 4—that calculates the
count of each exception (which is the same function as the reducer), the input to Combiner 1 will be:

<Exception A, 1>, <Exception B, 1>, <Exception A, 1>, <Exception C, 1>, <Exception A, 1>

Combine

The output of Combiner 1 will be:

<Exception A, 3>, <Exception B, 1>, <Exception C, 1>

The output from the other combiners will be:

Combiner 2: <Exception A, 2> <Exception B, 2>

Combiner 3: <Exception A, 3> <Exception B, 1> <Exception C, 1>

Combiner 4: <Exception A, 1> <Exception B, 1> <Exception C, 2>

Partition

After this, the partitioner allocates the data from the combiners to the reducers. The data is also sorted for the
reducer.

The input to the reducers will be as below:

Reducer 1: <Exception A> {3,2,3,1}

Reducer 2: <Exception B> {1,2,1,1}

Reducer 3: <Exception C> {1,1,2}

If there were no combiners involved, the input to the reducers will be as below:

Reducer 1: <Exception A> {1,1,1,1,1,1,1,1,1}

Reducer 2: <Exception B> {1,1,1,1,1}

Reducer 3: <Exception C> {1,1,1,1}

Here, the example is a simple one, but when there are terabytes of data involved, the combiner's reduction in the
data transferred across the network (and hence the bandwidth saved) is significant.

Reduce

Now, each reducer just calculates the total count of the exceptions as:

Reducer 1: <Exception A, 9>

Reducer 2: <Exception B, 5>

Reducer 3: <Exception C, 4>

The data shows that Exception A is thrown more often than others and requires more attention. When there are
more than a few weeks' or months' of data to be processed together, the potential of the MapReduce program can
be truly exploited.
Introduction

Let's assume that you have 100 TB of data to store and process with Hadoop. The configuration of each available
DataNode is as follows:

• 8 GB RAM

• 10 TB HDD

• 100 MB/s read-write speed

You have a Hadoop Cluster with replication factor = 3 and block size = 64 MB.

In this case, the number of DataNodes required to store would be:

• Total amount of Data * Replication Factor / Disk Space available on each DataNode

• 100 * 3 / 10

• 30 DataNodes

Now, let's assume you need to process this 100 TB of data using MapReduce.

And, reading 100 TB data at a speed of 100 MB/s using only 1 node would take:

• Total data / Read-write speed

• 100 * 1024 * 1024 / 100

• 1048576 seconds

• 291.27 hours

So, with 30 DataNodes you would be able to finish this MapReduce job in:

• 291.27 / 30

• 9.70 hours

1. Problem Statement

How many such DataNodes would you need to read 100 TB of data in 5 minutes in your Hadoop cluster?
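A sketch of the calculation, using the same approach as above and assuming each DataNode reads at 100 MB/s in parallel for the full 5 minutes (300 seconds):

• Total data / (read speed per node * time)

• 100 * 1024 * 1024 / (100 * 300)

• ≈ 3495.25, i.e., roughly 3,500 DataNodes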
Scheduling and Managing Task

Introduction

Apache YARN ("Yet Another Resource Negotiator") is the resource management layer of Hadoop. YARN was
introduced in Hadoop 2.x.
Apart from resource management, YARN also does job scheduling.

The Apache YARN framework consists of a master daemon known as the ResourceManager, a slave daemon called the
NodeManager (one per slave node), and an ApplicationMaster (one per application).

1. Resource Manager (RM)

It is the master daemon of YARN. The RM manages the global assignment of resources (CPU and memory) among
all the applications. It arbitrates system resources between competing applications.

Resource Manager has two Main components:

• Scheduler
• Application manager
a) Scheduler

The Scheduler is responsible for allocating resources to the running applications. It is a pure
scheduler: it performs no monitoring or tracking of application status, and it offers no guarantees
about restarting tasks that fail due to application failure or hardware failure.

b) Application Manager

It manages running Application Masters in the cluster, i.e., it is responsible for starting application masters and
for monitoring and restarting them on different nodes in case of failures.

2. Node Manager (NM)

It is the slave daemon of YARN. The NM is responsible for containers, monitoring their resource usage, and reporting
the same to the ResourceManager. It also manages the user processes on that machine. The YARN NodeManager additionally
tracks the health of the node on which it is running. The design also allows plugging long-running auxiliary services into the
NM; these are application-specific services, specified as part of the configuration and loaded by the NM during
startup. Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.

3. Application Master (AM)

One ApplicationMaster runs per application. It negotiates resources from the ResourceManager and works with
the NodeManagers. It manages the application life cycle.
The AM acquires containers from the RM's Scheduler before contacting the corresponding NMs to start the
application's individual tasks.
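As a small illustration (no cluster-specific values assumed), once YARN is running you can watch these daemons at work with the standard yarn CLI:

yarn node -list
yarn application -list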

Resource Manager Restart

Resource Manager is the central authority that manages resources and schedules applications running on YARN.
Hence, it is potentially an SPOF in an Apache YARN cluster.
There are two types of restart for Resource Manager:

• Non-work-preserving RM restart – This restart enhances the RM to persist application/attempt
state in a pluggable state-store. The ResourceManager reloads the same information from the state-store on
restart and re-kicks the previously running applications. Users do not need to re-submit the
applications. NodeManagers and clients keep polling the RM during its downtime; when the RM
comes back up, it sends a re-sync command via heartbeats to all the NMs and AMs it was
talking to. The NMs then kill all their managed containers and re-register with the RM.
• Work-preserving RM restart – This focuses on reconstructing the running state of the RM by
combining the container status reported by NodeManagers and the container requests from Application
Masters on restart. The key difference from the non-work-preserving RM restart is that
already-running applications will not be stopped after the master restarts, so applications will not lose their
processed data because of an RM/master outage. The RM recovers its running state by taking advantage
of the container status sent from all the NodeManagers. The NM will not kill the containers when
it re-syncs with the restarted RM; it continues managing the containers and sends the container
status across to the RM when it re-registers.
Yarn Resource Manager High availability

The ResourceManager (master) is responsible for handling the resources in a cluster and scheduling multiple
applications (e.g., Spark apps or MapReduce). Before Hadoop v2.4, the master (RM) was the SPOF (single
point of failure).
The High Availability feature adds redundancy in the form of an Active/Standby ResourceManager pair to
remove this otherwise single point of failure.

ResourceManager HA is realized through an Active/Standby architecture – at any point in time, one of the
masters is Active, and the other ResourceManagers are in Standby mode, waiting to take over should
anything happen to the Active.

The trigger to transition-to-active comes either from the admin (through the CLI) or through the integrated
failover controller when automatic failover is enabled.

1. Manual transitions and failover

When automatic failover is not configured, admins have to manually transition one of the ResourceManagers to the
active state.

To fail over from the active master to another, the admin is expected to first transition the active RM to standby and
then transition a standby RM to active. This is done using the yarn rmadmin CLI.

2. Automatic failover

In this case, there is no need for any manual intervention. The masters have an option to embed a ZooKeeper-based
(a coordination engine) ActiveStandbyElector to decide which ResourceManager should be the Active.

When the active fails, another ResourceManager is automatically selected to be active. Note that there is no
need to run a separate ZKFC daemon as with HDFS, because the ActiveStandbyElector embedded in the
ResourceManagers acts as the failure detector and the leader elector.
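A minimal sketch of the corresponding yarn-site.xml entries (the RM IDs, hostnames, and ZooKeeper quorum below are placeholders, assuming two ResourceManagers rm1 and rm2):

<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

The current state of each RM can then be checked with yarn rmadmin -getServiceState rm1.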
Introduction to Hadoop Scheduler

Prior to Hadoop 2, Hadoop MapReduce was a software framework for writing applications that process huge
amounts of data (terabytes to petabytes) in parallel on a large Hadoop cluster, and this framework was also responsible
for scheduling tasks, monitoring them, and re-executing failed tasks.
In Hadoop 2, YARN (Yet Another Resource Negotiator) was introduced. The basic idea behind the
YARN introduction is to split the functionalities of resource management and job scheduling/monitoring into
separate daemons: the ResourceManager, the ApplicationMaster, and the NodeManager.
The ResourceManager is the master daemon that arbitrates resources among all the applications in the system.
The NodeManager is the slave daemon responsible for containers, monitoring their resource usage, and reporting the
same to the ResourceManager or Schedulers. The ApplicationMaster negotiates resources from the ResourceManager
and works with the NodeManager in order to execute and monitor the task.

The ResourceManager has two main components: the Scheduler and the ApplicationsManager.

The Scheduler in the YARN ResourceManager is a pure scheduler, responsible for allocating resources to the
various running applications.
It is not responsible for monitoring or tracking the status of an application. Also, the Scheduler does not
guarantee restarting tasks that fail due to hardware failure or application failure.

The scheduler performs scheduling based on the resource requirements of the applications.

It has some pluggable policies that are responsible for partitioning the cluster resources among the various
queues, applications, etc.

The FIFO Scheduler, CapacityScheduler, and FairScheduler are such pluggable policies that are responsible for
allocating resources to the applications.
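Which policy is used is configured in yarn-site.xml; for example, a sketch selecting the CapacityScheduler (the class names are the standard ones shipped with Hadoop, and the FairScheduler or FifoScheduler classes can be substituted):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>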

1. FIFO Scheduler

First In First Out was Hadoop's original default scheduling policy. The FIFO Scheduler gives preference to
applications submitted earlier over those submitted later. It places the applications in a queue and executes them in
the order of their submission (first in, first out).
Here, irrespective of size and priority, the requests of the first application in the queue are satisfied first.
Only once the first application's request is satisfied is the next application in the queue served.

Advantage:
• It is simple to understand and doesn’t need any configuration.
• Jobs are executed in the order of their submission.
Disadvantage:
• It is not suitable for shared clusters. If the large application comes before the shorter one, then the
large application will use all the resources in the cluster, and the shorter application has to wait
for its turn. This leads to starvation.
• It does not take into account the balance of resource allocation between the long applications and
short applications.
2. Capacity Scheduler

The CapacityScheduler allows multiple tenants to securely share a large Hadoop cluster. It is designed to
run Hadoop applications in a shared, multi-tenant cluster while maximizing the throughput and the utilization of
the cluster.
It supports hierarchical queues to reflect the structure of the organizations or groups that utilize the cluster
resources. A queue hierarchy contains three types of queues: root, parent, and leaf.

The root queue represents the cluster itself, a parent queue represents an organization/group or
sub-organization/sub-group, and a leaf queue accepts application submissions.

The Capacity Scheduler allows the sharing of the large cluster while giving capacity guarantees to each
organization by allocating a fraction of cluster resources to each queue.

Also, when a queue has finished its work and has free resources available, those free resources can be assigned to
applications in queues that are running below capacity. This provides elasticity for the organization in a
cost-effective manner.

Apart from it, the CapacityScheduler provides a comprehensive set of limits to ensure that a single
application/user/queue cannot use a disproportionate amount of resources in the cluster.

To ensure fairness and stability, it also provides limits on initialized and pending apps from a single user and
queue.

Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• Provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organization utilizing cluster.
Disadvantage:
• It is the most complex of the schedulers.
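A minimal sketch of a two-queue hierarchy in capacity-scheduler.xml (queue names and percentages are illustrative; the capacities under a parent must add up to 100):

<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,dev</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>30</value>
</property>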

3. Fair Scheduler

FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters. With FairScheduler,
there is no need for reserving a set amount of capacity because it will dynamically balance resources between all
running applications.

It assigns resources to applications in such a way that all applications get, on average, an equal amount of
resources over time.

The FairScheduler, by default, takes scheduling fairness decisions only on the basis of memory. We can
configure it to schedule with both memory and CPU.

When a single application is running, that app uses the entire cluster's resources. When other applications
are submitted, freed-up resources are assigned to the new apps so that every app eventually gets roughly the
same amount of resources. FairScheduler enables short apps to finish in a reasonable time without starving the
long-lived apps.

Similar to the CapacityScheduler, the FairScheduler supports hierarchical queues to reflect the structure of the
shared cluster.

Apart from fair scheduling, the FairScheduler allows for assigning minimum shares to queues for ensuring that
certain users, production, or group applications always get sufficient resources. When an app is present in the
queue, then the app gets its minimum share, but when the queue doesn’t need its full guaranteed share, then the
excess share is split between other running applications.

Advantages:
• It provides a reasonable way to share the Hadoop Cluster between the number of users.
• Also, the FairScheduler can work with app priorities where the priorities are used as weights in
determining the fraction of the total resources that each application should get.
Disadvantage:
• It requires configuration.

Hadoop 1 vs Hadoop 2

1. Components: In Hadoop 1 we have MapReduce, but Hadoop 2 has YARN (Yet Another Resource
Negotiator) and MapReduce version 2.

   Hadoop 1            Hadoop 2
   HDFS                HDFS
   MapReduce           YARN / MRv2

2. Daemons:

   Hadoop 1               Hadoop 2
   Namenode               Namenode
   Datanode               Datanode
   Secondary Namenode     Secondary Namenode
   Job Tracker            Resource Manager
   Task Tracker           Node Manager

3. Working:
• In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both the resource
management and the data processing layer. This double workload on MapReduce affects performance.
• In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which handles
resource management. It basically allocates the resources and keeps everything running.
4. Limitations: Hadoop 1 is a master-slave architecture. It consists of a single master and multiple slaves.
If the master node crashes, then no matter how good your slave nodes are, your cluster is lost.
Recreating that cluster, i.e., copying system files, image files, etc. onto another system, is too
time-consuming, which organizations will not tolerate today. Hadoop 2 is also a master-slave
architecture, but it consists of multiple masters (i.e., active NameNodes and standby NameNodes) and
multiple slaves. If the master node crashes here, the standby master node will take over. You can make
multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of
failure.
5. Ecosystem:

• Oozie is basically a workflow scheduler. It decides when particular jobs execute
according to their dependencies.
• Pig, Hive and Mahout are data processing tools that work on top of Hadoop.
• Sqoop is used to import and export structured data; you can move data directly
between HDFS and SQL databases.
• Flume is used to import and export unstructured and streaming data.
6. Windows Support:
In Hadoop 1 there is no support for Microsoft Windows provided by Apache, whereas Hadoop 2 adds
support for Microsoft Windows.
