
UNIT-II

Introduction to Hadoop: Introducing hadoop, Why hadoop, Why not RDBMS, RDBMS Vs Hadoop,
History of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File
System),Processing data with Hadoop, Managing resources and applications with Hadoop YARN(Yet
Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.

INTRODUCTION:
Hadoop is an open-source software framework that is used for storing and processing large
amounts of data in a distributed computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows for the parallel processing of large
datasets.
What is Hadoop?
Hadoop is an open-source programming framework for storing large amounts of data and performing distributed computation on it. The framework is written mainly in Java, with some native code in C and shell scripts.
Hadoop has two main components:
 HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
 Hadoop also includes several additional modules that provide extra functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining. It enables the distributed processing of large data sets across clusters of computers
using a simple programming model.
History of Hadoop
Hadoop was developed at the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy elephant. In October 2003, Google released its first relevant paper, on the Google File System. In January 2006, MapReduce development started on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000 lines for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was created at the Apache Software Foundation in 2006, based on Google's papers describing the Google File System (GFS, 2003) and the MapReduce programming model (2004). The Hadoop
framework allows for the distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. It is used by many organizations,
including Yahoo, Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has become a key
technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It offers huge, flexible storage.
5. It is low cost.

Hadoop has several key features that make it well-suited for big data processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the
storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
 Data locality: Hadoop schedules processing on the nodes where the data is stored. This reduces network traffic and improves performance.
 High Availability: Hadoop provides a high-availability feature, which helps to make sure that the data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of
data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the stored data is consistent and correct.
 Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression, which reduces storage space and improves performance (a configuration sketch follows this list).
 YARN: A resource management platform that allows multiple data processing engines like
real-time streaming, batch processing, and interactive SQL, to run and process data stored in
HDFS.
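As a minimal, hedged sketch of how compression can be switched on with the standard Hadoop MapReduce Java API (the class name and the choice of gzip are illustrative assumptions, not part of the original notes):

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: compress both the intermediate map output and the final job output.
public class CompressionSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Compress the intermediate data shuffled from mappers to reducers.
        job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
        // Compress the final files written by the reducers, using the gzip codec.
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}
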
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS, which splits files into blocks and distributes them across the nodes of large clusters. In case of a node failure, the system keeps operating, and the data transfer between nodes is handled by HDFS.

Advantages of HDFS: It is inexpensive, immutable in nature, stores data reliably, tolerates faults, is scalable and block structured, can process a large amount of data simultaneously, and more.
Disadvantages of HDFS: Its biggest disadvantage is that it is not a good fit for small quantities of data. It also has potential stability issues and is restrictive and low level in nature. Hadoop also supports a wide range of software packages such as Apache Flume, Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Why hadoop, Why not RDBMS, RDBMS Vs Hadoop
RDBMS and Hadoop are both used for handling, storing, and processing data, but they differ in design, implementation, and use cases. An RDBMS stores primarily structured data and processes it with SQL, while Hadoop stores and handles both structured and unstructured data and processes it using MapReduce or Spark. This section looks at RDBMS and Hadoop in detail and then at the differences between them.
What is an RDBMS?
An RDBMS (Relational Database Management System) is an information management system based on the relational data model. In an RDBMS, tables are used for information storage: each row of a table represents a record and each column represents an attribute of the data. The organization of data and its manipulation are different in an RDBMS from other databases. An RDBMS ensures the ACID (atomicity, consistency, isolation, durability) properties required for designing a database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly and reliably as possible.
Advantages
 Supports high data integrity.
 Provides multiple levels of security.
 Supports replication of data so that it can be recovered in case of disasters.
 Normalization is present.
Disadvantages
 It is less scalable than Hadoop.
 Licensing and hardware costs can be high.
 Complex design.
 The structure (schema) of the database is fixed.
What is Hadoop?
It is an open-source software framework used for storing data and running applications on a
group of commodity hardware. It has a large storage capacity and high processing power. It can
manage multiple concurrent processes at the same time. It is used in predictive analysis, data
mining, and machine learning. It can handle both structured and unstructured forms of data. It is
more flexible in storing, processing, and managing data than traditional RDBMS. Unlike
traditional systems, Hadoop enables multiple analytical processes on the same data at the same
time. It supports scalability very flexibly.
Advantages
 Highly scalable.
 Cost effective because it is an open source software.
 Store any kind of data (structured, semi-structured and un-structured ).
 High throughput.
Disadvantages
 Not effective for small files.
 Limited built-in security features.
 Higher processing overhead and latency.
 Core MapReduce supports only batch processing.

The Hadoop architecture mainly consists of 4 components.


 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

Let’s understand the role of each of these components in detail.


1. MapReduce: MapReduce is a programming model and processing engine that runs on top of the YARN framework. Its major feature is performing distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is no longer of any use. MapReduce has two main tasks, divided phase-wise: in the first phase Map is used, and in the next phase Reduce is used.
The input is provided to the Map() function, its output is used as the input to the Reduce() function, and after that we receive the final output. Let's understand what Map() and Reduce() do.
The input to Map() is a set of data blocks. The Map() function breaks these blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the tuples based on their key, performs an operation such as sorting or summation, and forms a smaller set of tuples that is sent to the final output node. Finally, the output is obtained.
The data processing done in the Reducer depends on the business requirement of the industry. This is how Map() is used first and then Reduce(), one after the other.
Let's understand the Map task and the Reduce task in detail.
Map Task:
 RecordReader: The purpose of the RecordReader is to read the records of an input split and provide key-value pairs to the Map() function. The key is the record's positional information (its byte offset) and the value is the data associated with it.
 Map: A map is simply a user-defined function whose job is to process the tuples obtained from the RecordReader. For each input, the Map() function may generate zero, one, or many key-value pairs.
 Combiner: The combiner is used for grouping data in the map workflow. It is similar to a local reducer: the intermediate key-value pairs generated by the map are combined with its help. Using a combiner is optional.
 Partitioner: The partitioner is responsible for distributing the key-value pairs generated in the mapper phase. It creates the shards (partitions) corresponding to each reducer by taking the hash code of each key and computing its modulus with the number of reducers (key.hashCode() % numberOfReducers). A minimal sketch of such a partitioner follows this list.
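The sketch below assumes the standard Hadoop MapReduce Java API and (Text, IntWritable) key-value types; the class name is hypothetical. It simply mirrors the hashCode-modulus rule described above, which is also what Hadoop's default HashPartitioner does:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: each key goes to the reducer given by hashCode % numberOfReducers.
public class HashModPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // Mask off the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
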
Reduce Task

 Shuffle and Sort: The reducer's task starts with this step. The process in which the mapper generates the intermediate key-value pairs and transfers them to the reducer task is known as shuffling. Using the shuffling process, the system can sort the data by key.
Shuffling begins as soon as some of the map tasks are done, which is why it is a fast process: it does not wait for all mapper tasks to complete.
 Reduce: The main task of the reduce step is to gather the tuples generated by the map step and then perform sorting and aggregation on those key-value pairs, grouped by key.
 OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of a record writer, each record on a new line, with the key and value separated by a delimiter (a tab by default for text output).
2. HDFS
HDFS (Hadoop Distributed File System) is used for storage. It is mainly designed for working on commodity hardware (inexpensive devices) with a distributed file system design. HDFS is designed in such a way that it prefers storing data in large blocks rather than in many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
 NameNode(Master)
 DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data about the data. The metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
Metadata also includes the name of the file, its size, and information about the location (block number, block IDs) on the DataNodes, which the NameNode stores in order to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that DataNodes have high storage capacity so they can hold a large number of file blocks.
High Level Architecture Of Hadoop

File Blocks in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB, which is the default size and can also be changed manually.
Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS. The file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it treats the final, smaller block as a partial record. In the Linux file system, the size of a file block is about 4 KB, which is much less than the default block size in the Hadoop file system. Hadoop is mainly configured for storing very large data, measured in petabytes; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays block sizes of 128 MB to 256 MB are typical in Hadoop.
Replication in HDFS: Replication ensures the availability of the data. Replication is making a copy of something, and the number of times you make a copy of that particular thing is its replication factor. As we saw with file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those file blocks.
By default, the replication factor in Hadoop is 3, and it can be configured manually as per your requirement. In the example above we made 4 file blocks, which means 3 replicas (copies) of each file block are made, for a total of 4 × 3 = 12 blocks kept for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware), which can crash at any time; we are not using supercomputers for the Hadoop setup. That is why HDFS needs a feature that makes copies of the file blocks for backup purposes; this is known as fault tolerance.
Note that making so many replicas of the file blocks consumes extra storage, but for large organizations the data is far more important than the storage, so nobody worries about this extra space. You can configure the replication factor in your hdfs-site.xml file; a small sketch follows.
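As a hedged sketch (hypothetical class and file path; dfs.replication and dfs.blocksize are the standard property names), the same settings can also be applied from Java client code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical sketch of configuring replication and block size from a client.
// Cluster-wide defaults live in hdfs-site.xml under dfs.replication and dfs.blocksize.
public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");                   // 3 copies of every new block
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);  // 128 MB blocks for new files
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of a file that is already stored (assumed path).
        fs.setReplication(new Path("/data/input/file.txt"), (short) 3);
    }
}
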
Rack Awareness: A rack is simply the physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing read/write operations, which reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator): YARN is the framework on which MapReduce works. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which job is important, which job has higher priority, the dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.

Features of YARN
 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility
4. Hadoop Common (Common Utilities): Hadoop Common is the set of Java libraries, files, and scripts that all the other components present in a Hadoop cluster need. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.

Differences Between RDBMS and Hadoop


 RDBMS is a traditional row-column based database, basically used for data storage, manipulation, and retrieval; Hadoop is open-source software used for storing data and running applications or processes concurrently.
 In an RDBMS, mostly structured data is processed; in Hadoop, both structured and unstructured data are processed.
 An RDBMS is best suited for an OLTP environment; Hadoop is best suited for Big Data.
 An RDBMS is less scalable than Hadoop; Hadoop is highly scalable.
 Data normalization is required in an RDBMS; it is not required in Hadoop.
 An RDBMS stores transformed and aggregated data; Hadoop stores huge volumes of data.
 An RDBMS has very low latency in response; Hadoop has some latency in response.
 The data schema of an RDBMS is static; the data schema of Hadoop is dynamic.
 An RDBMS offers high data integrity; Hadoop offers lower data integrity than an RDBMS.
 An RDBMS has licensing costs; Hadoop is free of cost, as it is open-source software.
 In an RDBMS, data is processed using SQL; in Hadoop, data is processed using MapReduce.
 An RDBMS follows the ACID properties; Hadoop does not follow the ACID properties.

Conclusion
RDBMS and Hadoop are both used to handle and store data. If the requirement is to store structured data with real-time processing, then an RDBMS is the ideal choice. If you need to process large volumes of data (unstructured, structured, etc.), then prefer Hadoop, because it is suitable for big data applications.

Data Processing in Hadoop

Data processing in Hadoop follows a structured flow that ensures large datasets are efficiently
processed across distributed systems. The process starts with raw data being divided into smaller
chunks, processed in parallel, and finally aggregated to generate meaningful output.

Understanding this step-by-step workflow is essential to optimizing performance and managing large-scale data effectively.

InputSplit and RecordReader

Before processing begins, Hadoop logically divides the dataset into manageable parts. This
ensures that data is efficiently read and distributed across the cluster, optimizing resource
utilization and parallel execution. Logical splitting prevents unnecessary data fragmentation,
reducing processing overhead and enhancing cluster efficiency.

Below is how it works:

InputSplit divides data into logical chunks: These chunks do not physically split files but create
partitions for parallel execution. The size of each split depends on the HDFS block size (default
128MB) and can be customized based on cluster configuration. For example, a 10GB file might
be divided into five 2GB splits, allowing multiple nodes to process different parts simultaneously
without excessive disk seeks.

RecordReader converts InputSplit data into key-value pairs: Hadoop processes data in key-value
pairs. The RecordReader reads raw data from an InputSplit and structures it for the mapper. For
instance, in a text processing job, it may convert each line into a key-value pair where the key is
the byte offset and the value is the line content. With data split into smaller logical units and structured into key-value pairs, the next step involves processing this data to extract meaningful information. This is where the mapper and combiner come into play.
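Before moving on to the mapper, here is a minimal sketch of the input side of a job (hypothetical class name and input path; standard Hadoop MapReduce Java API). TextInputFormat's RecordReader hands each mapper (byte offset, line) pairs, and the maximum split size caps how much data a single mapper reads:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input configuration sketch for a MapReduce job.
public class InputConfigSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(TextInputFormat.class);                 // line-oriented records
        FileInputFormat.addInputPath(job, new Path("/data/input"));     // assumed HDFS directory
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // at most 128 MB per split
        // ... mapper, reducer, and output settings would follow in a real driver ...
    }
}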

Mapper and Combiner

Once data is split and formatted, it enters the mapper phase. The mapper plays a critical role in
processing and transforming input data before passing it to the next stage. A combiner, an optional step, optimizes performance by reducing data locally, minimizing the volume of intermediate data that needs to be transferred to reducers.

Below is how this stage functions:

The mapper processes each key-value pair and transforms it: It extracts meaningful information
by applying logic on input data. For example, in a word count program, the mapper receives
lines of text, breaks them into words, and assigns each word a count of one. If the input is
"Hadoop is powerful. Hadoop is scalable.", the mapper emits key-value pairs like ("Hadoop",
1), ("is", 1), ("powerful", 1), and so on.

The combiner performs local aggregation to reduce intermediate data: Since mappers generate
large amounts of intermediate data, the combiner helps by merging values locally before sending
them to reducers. For example, in the same word count program, if the mapper processes 1,000
occurrences of "Hadoop" across different InputSplits, the combiner sums up word counts locally,
reducing multiple ("Hadoop", 1) pairs to a single ("Hadoop", 1000) before the shuffle phase.
This minimizes data transfer and speeds up processing.
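A minimal word count mapper along these lines, assuming the standard Hadoop MapReduce Java API (the class name is hypothetical). The matching reducer sketched later in this section can also be registered as the combiner with job.setCombinerClass(...):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical word count mapper: input key = byte offset, input value = one line of text,
// output = (word, 1) for every token in the line, e.g. ("Hadoop", 1), ("is", 1), ...
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}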

Now that the data has been processed and locally aggregated, it needs to be efficiently distributed
to ensure balanced workload distribution. The partitioner and shuffle step handle this crucial
process. Let’s take a close look in the next section.

Partitioner and Shuffle

Once the mapper and optional combiner complete processing, data must be organized efficiently
before reaching the reducer. The partitioner and shuffle phase ensures a smooth and evenly
distributed data flow in Hadoop.

The partitioner assigns key-value pairs to reducers based on keys – It ensures that related data
reaches the same reducer. For example, in a word count job, words starting with 'A' may go to
Reducer 1, while words starting with 'B' go to Reducer 2.

Shuffling transfers and sorts intermediate data before reduction – After partitioning, data is
shuffled across nodes, sorting it by key. This ensures that all values associated with the same
key are sent to the same reducer. It prevents duplicate processing by grouping related data before
reduction. For instance, if "Hadoop" appears in multiple splits, all occurrences are grouped
together before reaching the reducer.
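A rough sketch of a custom partitioner matching the routing described above (hypothetical class; keys are assumed to be Text words and values IntWritable counts). It sends words starting with A-M to one reducer and all other words to a second reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical alphabetical partitioner for a two-reducer word count job,
// registered in the driver with job.setPartitionerClass(...) and job.setNumReduceTasks(2).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numReducers) {
        String w = word.toString();
        if (numReducers < 2 || w.isEmpty()) {
            return 0;                                        // nothing to split with one reducer
        }
        char first = Character.toUpperCase(w.charAt(0));
        return (first >= 'A' && first <= 'M') ? 0 : 1;       // A-M -> reducer 0, rest -> reducer 1
    }
}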

With data properly partitioned and transferred to reducers, the final stage focuses on aggregation
and output formatting. This ensures that the results are structured and stored appropriately for
further analysis.

Reducer and OutputFormat

The reducer aggregates and finalizes data, producing the final output. The OutputFormat ensures
that processed data is stored in the required format for further use, offering flexibility for
integration with various systems.

Below is how this stage works:

The reducer processes grouped key-value pairs and applies aggregation logic: It takes data from
multiple mappers, processes it, and generates the final output. For example, in a word count
program, if the word "Hadoop" appears 1,500 times across different splits, the reducer sums up
all occurrences and outputs "Hadoop: 1500".

OutputFormat determines how final data is stored and structured: Hadoop provides built-in
options like TextOutputFormat, SequenceFileOutputFormat, and AvroOutputFormat, allowing
data to be stored in various formats. Additionally, custom OutputFormats can be defined for
specific needs, such as structured storage in databases or exporting results in JSON for log
processing jobs. This flexibility allows seamless integration with data lakes, BI tools, and
analytics platforms.
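A minimal reducer for the word count example above, assuming the standard Hadoop MapReduce Java API (the class name is hypothetical):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word count reducer: sums every count received for a word,
// e.g. ("Hadoop", [1000, 500]) becomes ("Hadoop", 1500).
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable count : counts) {
            total += count.get();
        }
        context.write(word, new IntWritable(total));  // written out by the OutputFormat's RecordWriter
    }
}

In the driver, the output format is chosen with job.setOutputFormatClass(TextOutputFormat.class) (or SequenceFileOutputFormat, a custom format, etc.), together with job.setOutputKeyClass(Text.class) and job.setOutputValueClass(IntWritable.class).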
With the entire data flow in Hadoop completed, it’s important to understand the essential components
that power this ecosystem. These building blocks ensure efficient data storage, processing, and
retrieval.

What are the Building Blocks of Hadoop?

Hadoop’s ecosystem is built on several essential components that work together to enable
efficient data storage, processing, and management. These building blocks ensure that large
datasets are processed in a distributed manner, allowing organizations to handle massive
volumes of structured and unstructured data.

1. HDFS (The Storage Layer)

As the name suggests, Hadoop Distributed File System is the storage layer of Hadoop and is
responsible for storing the data in a distributed environment (master and slave configuration). It
splits the data into several blocks of data and stores them across different data nodes. These data
blocks are also replicated across different data nodes to prevent loss of data when one of the
nodes goes down.
It has two main processes running for the processing of the data:

a. NameNode

It is running on the master machine. It saves the locations of all the files stored in the file system
and tracks where the data resides across the cluster i.e. it stores the metadata of the files. When
client applications want to perform certain operations on the data, they interact with the NameNode. When the NameNode receives a request, it responds by returning a list of DataNode servers where the required data resides.

b. DataNode

This process runs on every slave machine. One of its functionalities is to store each HDFS data
block in a separate file in its local file system. In other words, it contains the actual data in form
of blocks. It sends heartbeat signals periodically and waits for the request from the NameNode
to access the data.

2. MapReduce (The processing layer)

It is a programming technique based on Java that is used on top of the Hadoop framework for
faster processing of huge quantities of data. It processes this huge data in a distributed
environment using many Data Nodes which enables parallel processing and faster execution of
operations in a fault-tolerant way. A MapReduce job splits the data set into multiple chunks of
data which are further converted into key-value pairs in order to be processed by the mappers.
The raw format of the data may not be suitable for processing. Thus, the input data compatible
with the map phase is generated using the InputSplit function and RecordReader. InputSplit is
the logical representation of the data which is to be processed by an individual mapper.
RecordReader converts these splits into records which take the form of key-value pairs. It
basically converts the byte-oriented representation of the input into a record-oriented
representation. These records are then fed to the mappers for further processing the data.
MapReduce jobs primarily consist of three phases namely the Map phase, the Shuffle phase, and
the Reduce phase.

a. Map Phase

It is the first phase in the processing of the data. The main task in the map phase is to process
each input from the RecordReader and convert it into intermediate tuples (key-value pairs). This
intermediate output is stored in the local disk by the mappers. The values of these key-value
pairs can differ from the ones received as input from the RecordReader. The map phase can also contain combiners, which are also called local reducers. They perform aggregations on the data, but only within the scope of one mapper.

As the computations are performed across different data nodes, it is essential that all the values associated with the same key are brought together on one reducer. This task is performed by the partitioner. It performs a hash function over the key-value pairs to group them together. It also ensures that the work is partitioned evenly across the reducers. Partitioners generally come into the picture when we are working with more than one reducer.

b. Shuffle and Sort Phase

This phase transfers the intermediate output obtained from the mappers to the reducers. This
process is called shuffling. The output from the mappers is also sorted before transferring it to
the reducers. The sorting is done on the basis of the keys in the key-value pairs. It helps the
reducers to perform the computations on the data even before the entire data is received and
eventually helps in reducing the time required for computations. As the keys are sorted, whenever the reducer gets a different key as input, it starts to perform the reduce task on the previously received data.

c. Reduce Phase

The output of the map phase serves as an input to the reduce phase. It takes these key-value pairs
and applies the reduce function on them to produce the desired result. The keys and the values
associated with the key are passed on to the reduce function to perform certain operations. We can filter the data or combine it to obtain the aggregated output. After the execution of the reduce function, it can create zero or more key-value pairs. This result is written back to the Hadoop
Distributed File System.

3. YARN (The Management Layer):

Yet Another Resource Negotiator is the resource managing component of Hadoop. There are
background processes running at each node (Node Manager on the slave machines and Resource
Manager on the master node) that communicate with each other for the allocation of resources.
The Resource Manager is the centrepiece of the YARN layer which manages resources among
all the applications and passes on the requests to the Node Manager. The Node Manager monitors
the resource utilization like memory, CPU, and disk of the machine and conveys the same to the
Resource Manager. It is installed on every Data Node and is responsible for executing the tasks
on the Data Nodes. From distributed storage to parallel data processing, each component plays a
key role in maintaining a smooth data flow in Hadoop. The next section explores the key benefits
of this data flow and its real-world applications.

Key Benefits and Use Cases of Data Flow in Hadoop

Hadoop enables scalable, cost-effective, and fault-tolerant data processing across distributed
environments. Below are some of the major benefits and practical use cases of data processing
in Hadoop.

Scalability for large-scale data processing – Hadoop’s distributed architecture allows you to
scale storage and processing power horizontally. For example, companies handling petabytes of
data, such as Facebook and Twitter, use Hadoop clusters to process user interactions, ad
targeting, and recommendation algorithms.
Cost-effective storage and processing – Traditional relational databases can be expensive for
storing and processing big data. Hadoop provides a low-cost alternative by using commodity
hardware, making it ideal for businesses dealing with vast data volumes. A retail giant like
Walmart, for instance, uses Hadoop to analyze consumer purchasing behavior at scale.

Fault tolerance and high availability – Hadoop replicates data across multiple nodes, ensuring
no single point of failure. If one node fails, another automatically takes over, maintaining
uninterrupted processing. This feature is critical for financial institutions that cannot afford data
loss or downtime.

Efficient handling of structured and unstructured data – Unlike traditional databases that struggle
with unstructured data, Hadoop seamlessly processes images, videos, social media posts, and
sensor data. Companies like Netflix rely on Hadoop to analyze streaming preferences and
enhance user recommendations.

Real-time and batch data processing – Hadoop supports both real-time and batch processing
through its ecosystem tools like Apache Spark and MapReduce. This flexibility is essential for
cybersecurity firms detecting fraudulent transactions in real-time while also analyzing historical
trends.

Optimized data flow for analytics and machine learning – With frameworks like Apache Mahout
and Spark MLlib, Hadoop simplifies large-scale machine learning applications. Organizations
use this capability to build predictive models for healthcare, finance, and e-commerce.

Real-time fraud detection in financial services – Hadoop enables banks and financial institutions
to detect fraudulent transactions in real time by analyzing large volumes of customer transaction
data. PayPal and Mastercard use Hadoop to identify anomalies and trigger fraud alerts within
milliseconds.

IoT data processing in smart cities – Hadoop helps smart cities process vast IoT sensor data from
traffic systems, surveillance cameras, and environmental monitors. Barcelona and Singapore use
Hadoop-based analytics to optimize urban planning, reduce congestion, and improve public
safety.

Managing resources and applications with Hadoop YARN (Yet Another Resource Negotiator)
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop 2.0 to
remove the bottleneck on Job Tracker which was present in Hadoop 1.0. YARN was described
as a “Redesigned Resource Manager” at the time of its launch, but it has now evolved to be known as a large-scale distributed operating system used for Big Data processing.
YARN architecture basically separates the resource management layer from the processing layer. The responsibility of the Hadoop 1.0 JobTracker is now split between the resource manager and the per-application application master.

YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system much more efficient. Through its
various components, it can dynamically allocate various resources and schedule the application
processing. For large volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-

 Scalability: The scheduler in the Resource Manager of the YARN architecture allows Hadoop to extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports the existing map-reduce applications without disruptions
thus making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of multi-
tenancy.
Hadoop YARN Architecture

The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a processing
request, it forwards it to the corresponding node manager and allocates resources for the
completion of the request accordingly. It has two major components:
o Scheduler: It performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
o Application Manager: It is responsible for accepting the application and negotiating the first container from the resource manager. It also restarts the Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages applications and workflow on that particular node. Its primary job is to keep up with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and can kill a container based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to the framework. The application master is responsible for negotiating resources with the resource manager and for tracking the status and monitoring the progress of a single application. The application master asks the node manager to launch a container by sending it a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends a health report to the resource manager from time to time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a
single node. The containers are invoked by Container Launch Context(CLC) which is a
record that contains information such as environment variables, security tokens,
dependencies etc.
Application workflow in Hadoop YARN (a small client-side sketch follows this list):
1. The client submits an application.
2. The Resource Manager allocates a container to start the Application Master.
3. The Application Master registers itself with the Resource Manager.
4. The Application Master negotiates containers from the Resource Manager.
5. The Application Master notifies the Node Manager to launch containers.
6. Application code is executed in the containers.
7. The client contacts the Resource Manager / Application Master to monitor the application's status.
8. Once the processing is complete, the Application Master un-registers with the Resource Manager.
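As a hedged illustration of step 7 (hypothetical class name; the YarnClient API from org.apache.hadoop.yarn.client.api), a client might list running applications and their states like this:

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Hypothetical sketch: ask the Resource Manager for the status of applications.
public class YarnStatusSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " : " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}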
Advantages :
 Flexibility: YARN offers flexibility to run various types of distributed processing systems
such as Apache Spark, Apache Flink, Apache Storm, and others. It allows multiple
processing engines to run simultaneously on a single Hadoop cluster.
 Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
 Scalability: YARN is designed to be highly scalable and can handle thousands of nodes in
a cluster. It can scale up or down based on the requirements of the applications running on
the cluster.
 Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
 Security: YARN provides robust security features such as Kerberos authentication, Secure
Shell (SSH) access, and secure data transmission. It ensures that the data stored and
processed on the Hadoop cluster is secure.

Disadvantages:
 Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional
configurations and settings, which can be difficult for users who are not familiar with YARN.
 Overhead: YARN introduces additional overhead, which can slow down the performance
of the Hadoop cluster. This overhead is required for managing resources and scheduling
applications.
 Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can
be caused by resource allocation, application scheduling, and communication between
components.
 Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If
YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need to
set up a backup YARN instance for high availability.
 Limited Support: YARN has limited support for non-Java programming languages.
Although it supports multiple processing engines, some engines have limited language
support, which can limit the usability of YARN in certain environments.

Introduction to Map Reduce Programming:


MapReduce is a programming model for writing applications that can process Big Data in parallel
on multiple nodes. MapReduce provides analytical capabilities for analyzing huge volumes of
complex data.
Why MapReduce?
Traditional Enterprise Systems normally have a centralized server to store and process data. The
following illustration depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, which cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected at one
place and integrated to form the result dataset.

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The Map task takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key-value pairs).
 The Reduce task takes the output from the Map as an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.
 Input Phase − Here we have a Record Reader that translates each record in an input file and sends
the parsed data to the mapper in the form of key-value pairs.
 Map − Map is a user-defined function, which takes a series of key-value pairs and processes each
one of them to generate zero or more key-value pairs.
 Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
 Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
 Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-
value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together
so that their values can be iterated easily in the Reducer task.
 Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them. Here, the data can be aggregated, filtered, and combined in a number
of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or
more key-value pairs to the final step.
 Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.

MapReduce-Example

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −

 Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
 Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
 Count − Generates a token counter per word.
 Aggregate Counters − Prepares an aggregate of similar counter values into small manageable
units.

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

 The map task is done by means of Mapper Class


 The reduce task is done by means of Reducer Class.

Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used
as input by Reducer class, which in turn searches matching pairs and reduces them.

MapReduce implements various mathematical algorithms to divide a task into small parts and
assign them to multiple systems. In technical terms, MapReduce algorithm helps in sending the
Map & Reduce tasks to appropriate servers in a cluster.

These mathematical algorithms may include the following −

 Sorting
 Searching
 Indexing
 TF-IDF

Sorting
Sorting is one of the basic MapReduce algorithms to process and analyze data. MapReduce
implements sorting algorithm to automatically sort the output key-value pairs from the mapper
by their keys.

 Sorting methods are implemented in the mapper class itself.


 In the Shuffle and Sort phase, after tokenizing the values in the mapper class, the Context class collects the matching keys and their values as a collection.
 To collect similar key-value pairs (intermediate keys), the Mapper class takes the help of the RawComparator class to sort the key-value pairs (a sketch of plugging in a custom comparator follows this list).
 The set of intermediate key-value pairs for a given Reducer is automatically sorted by Hadoop to
form key-values (K2, {V2, V2, …}) before they are presented to the Reducer.
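As a hedged illustration (hypothetical class name; standard Hadoop API), a custom key comparator can change this sort order; it would be registered in the driver with job.setSortComparatorClass(DescendingTextComparator.class):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator that inverts the natural ordering of Text keys,
// so reducers see their keys in descending order after the sort phase.
public class DescendingTextComparator extends WritableComparator {
    protected DescendingTextComparator() {
        super(Text.class, true);  // true = create key instances for object comparison
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);
    }
}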

Searching

Searching plays an important role in MapReduce algorithm. It helps in the combiner phase
(optional) and in the Reducer phase. Let us try to understand how Searching works with the help
of an example.

Example

The following example shows how MapReduce employs Searching algorithm to find out the
details of the employee who draws the highest salary in a given employee dataset.

 Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume
there are duplicate employee records in all four files because of importing the employee data from
all database tables repeatedly. See the following illustration.

 The Map phase processes each input file and provides the employee data in key-value pairs (<k,
v> : <emp name, salary>). See the following illustration.

 The combiner phase (searching technique) will accept the input from the Map phase as key-value pairs of employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
// Track the highest salary seen so far in this file.
Max = salary of the first employee
for each remaining (employee, salary) pair:
    if (salary > Max) {
        Max = salary;      // remember the new highest salary
    } else {
        continue;          // keep checking the remaining records
    }

The expected result is as follows −

<satish, 26000> <gopal, 50000> <kiran, 45000> <manisha, 45000>

 Reducer phase − From each file, you will get the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows −
<gopal, 50000>
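A rough Java sketch of that combiner/reducer logic (hypothetical class; it assumes every mapper emits a constant key such as "max" with "name:salary" text values, which is not spelled out in the original example):

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical max-salary reducer. It can serve both as the combiner (per file)
// and as the final reducer, keeping only the highest-paid employee it sees.
public class MaxSalaryReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> employees, Context context)
            throws IOException, InterruptedException {
        String bestName = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text employee : employees) {
            String[] parts = employee.toString().split(":");  // e.g. "gopal:50000"
            long salary = Long.parseLong(parts[1]);
            if (salary > bestSalary) {
                bestSalary = salary;
                bestName = parts[0];
            }
        }
        context.write(new Text(bestName), new Text(String.valueOf(bestSalary)));  // <gopal, 50000>
    }
}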

Indexing: Normally, indexing is used to point to particular data and its address. MapReduce performs batch indexing over the input files for a particular mapper.

The indexing technique that is normally used in MapReduce is known as inverted index. Search
engines like Google and Bing use inverted indexing technique. Let us try to understand how
Indexing works with the help of a simple example.

Example: The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their contents are in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

After applying the Indexing algorithm, we get the following output −


"a": {2}
"banana": {2}
"is": {0, 1, 2}
"it": {0, 1, 2}
"what": {0, 1}
Here "a": {2} implies the term "a" appears in the T[2] file. Similarly, "is": {0, 1, 2} implies the
term "is" appears in the files T[0], T[1], and T[2].
TF-IDF: TF-IDF is a text processing algorithm which is short for Term Frequency − Inverse
Document Frequency. It is one of the common web analysis algorithms. Here, the term
'frequency' refers to the number of times a term appears in a document.

Term Frequency (TF): It measures how frequently a particular term occurs in a document. It is
calculated by the number of times a word appears in a document divided by the total number of
words in that document.
TF(the) = (Number of times the term ‘the’ appears in a document) / (Total number of terms in the document)

Inverse Document Frequency (IDF): It measures the importance of a term. It is calculated as the logarithm of the number of documents in the text database divided by the number of documents where the specific term appears.

While computing TF, all the terms are considered equally important; TF counts the frequency even of common words like “is”, “a”, “what”, etc. Thus we need to scale down the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with term ‘the’ in it).

The algorithm is explained below with the help of a small example.

Example

Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.

Now, assume we have 10 million documents and the word hive appears in 1,000 of these. Then, the IDF is calculated as log10(10,000,000 / 1,000) = 4 (using a base-10 logarithm).

The TF-IDF weight is the product of these quantities: 0.05 × 4 = 0.20. A small check of this arithmetic follows.
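A tiny Java sketch (hypothetical class name) that reproduces this worked example:

// Hypothetical check of the worked TF-IDF example above.
public class TfIdfSketch {
    public static void main(String[] args) {
        double tf = 50.0 / 1000.0;                         // "hive" appears 50 times in 1,000 words
        double idf = Math.log10(10_000_000.0 / 1_000.0);   // 10 million documents, term in 1,000
        double tfIdf = tf * idf;
        System.out.printf("TF=%.2f IDF=%.2f TF-IDF=%.2f%n", tf, idf, tfIdf);  // 0.05, 4.00, 0.20
    }
}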
