BDA Unit2 Notes
Introduction to Hadoop: Introducing hadoop, Why hadoop, Why not RDBMS, RDBMS Vs Hadoop,
History of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File
System),Processing data with Hadoop, Managing resources and applications with Hadoop YARN(Yet
Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.
INTRODUCTION:
Hadoop is an open-source software framework that is used for storing and processing large
amounts of data in a distributed computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows for the parallel processing of large
datasets.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data and performing computation on it. The framework is written mainly in Java, with some native code in C and shell scripts.
Hadoop has two main components:
HDFS (Hadoop Distributed File System): This is the storage component of Hadoop, which
allows for the storage of large amounts of data across multiple machines. It is designed to
work with commodity hardware, which makes it cost-effective.
YARN (Yet Another Resource Negotiator): This is the resource management component of
Hadoop, which manages the allocation of resources (such as CPU and memory) for
processing the data stored in HDFS.
Hadoop also includes several modules that provide further functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It’s also used for data processing, data analysis, and data
mining. It enables the distributed processing of large data sets across clusters of computers
using a simple programming model.
History of Hadoop
Hadoop was developed at the Apache Software Foundation; its co-founders are Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy elephant. In October 2003, Google released its paper on the Google File System. In January 2006, MapReduce development started on Apache Nutch, with around 6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April 2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It was created
by Apache Software Foundation in 2006, based on a white paper written by Google in 2003 that
described the Google File System (GFS) and the MapReduce programming model. The Hadoop
framework allows for the distributed processing of large data sets across clusters of computers
using simple programming models. It is designed to scale up from single servers to thousands of
machines, each offering local computation and storage. It is used by many organizations,
including Yahoo, Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has become a key
technology for big data processing.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. It is easy to program.
4. It provides huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the
storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines, making it easy
to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it can continue to
operate even in the presence of hardware failures.
Data Locality: Hadoop moves computation to the node where the data is stored, which reduces network traffic and improves performance.
High Availability: Hadoop's high-availability features help ensure that data is always accessible and is not lost.
Flexible Data Processing: Hadoop’s MapReduce programming model allows for the
processing of data in a distributed fashion, making it easy to implement a wide variety of
data processing tasks.
Data Integrity: Hadoop uses built-in checksums to help ensure that stored data is consistent and correct.
Data Replication: Hadoop replicates data blocks across the cluster for fault tolerance.
Data Compression: Hadoop supports built-in data compression, which reduces storage space and can improve performance.
YARN: A resource management platform that allows multiple data processing engines like
real-time streaming, batch processing, and interactive SQL, to run and process data stored in
HDFS.
Hadoop Distributed File System
Hadoop has a distributed file system known as HDFS, which splits files into blocks and distributes them across the nodes of large clusters. In case of a node failure, the system keeps operating, and HDFS handles the data transfer between the remaining nodes.
1. MapReduce: MapReduce is the data-processing layer of Hadoop. It processes data through a sequence of steps (map, shuffle and sort, reduce, and output); the later steps of this flow are described below.
Shuffle and Sort: The reducer's work starts with this step. The process in which the intermediate key-value pairs generated by the mappers are transferred to the reducer tasks is known as shuffling, and during shuffling the system sorts the data by key. Shuffling begins as soon as some of the map tasks finish, which makes it faster; it does not wait for all mappers to complete.
Reduce: The main task of the reduce step is to gather the tuples generated by the map step and then perform sorting and aggregation on those key-value pairs, grouped by key.
OutputFormat: Once all the operations are performed, the key-value pairs are written to a file by a record writer, with each record on a new line and the key and value separated by a space.
2. HDFS
HDFS (Hadoop Distributed File System) is the storage layer of Hadoop. It is designed to run on commodity hardware (inexpensive devices) as a distributed file system, and it is built around storing data in large blocks rather than many small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices present in the Hadoop cluster. The data storage nodes in HDFS are:
NameNode(Master)
DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the DataNodes (slaves). The NameNode mainly stores metadata, i.e. data about the data, such as the transaction logs that keep track of user activity in the Hadoop cluster. Metadata also includes the file name, file size, and the location information (block number, block IDs) of the blocks on the DataNodes, which the NameNode uses to find the closest DataNode for faster communication. The NameNode instructs the DataNodes to perform operations such as create, delete, and replicate.
DataNode: DataNodes work as slaves. They are mainly used to store the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes a cluster has, the more data it can store, so DataNodes should have a high storage capacity to hold a large number of file blocks.
High Level Architecture Of Hadoop
File Block in HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided into multiple blocks of 128 MB each, which is the default size and can also be changed manually.
Let's understand this concept of breaking a file into blocks with an example. Suppose you upload a 400 MB file to HDFS. The file is divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, which means 4 blocks are created, each of 128 MB except the last one. Hadoop does not know or care what data is stored in these blocks, so it simply keeps the final, smaller block as a partial block. In the Linux file system, a file block is about 4 KB, which is much smaller than the default block size in the Hadoop file system. Hadoop is mainly configured for storing very large data sets, on the order of petabytes; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays, block sizes of 128 MB to 256 MB are commonly used in Hadoop.
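As a quick check of the arithmetic above, here is a minimal Java sketch (illustrative only, not part of Hadoop's API) that computes how the 400 MB example maps to blocks:

public class BlockMath {
    public static void main(String[] args) {
        long fileSize  = 400L * 1024 * 1024;   // 400 MB file
        long blockSize = 128L * 1024 * 1024;   // 128 MB default HDFS block size

        long fullBlocks  = fileSize / blockSize;               // 3 full blocks
        long lastBlock   = fileSize % blockSize;               // 16 MB left over
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        System.out.println("Total blocks: " + totalBlocks);                       // 4
        System.out.println("Last block size (MB): " + lastBlock / (1024 * 1024)); // 16
    }
}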
Replication in HDFS: Replication ensures the availability of the data. Replication means making a copy of something, and the number of copies made of that particular thing is its replication factor. As we saw with file blocks, HDFS stores data in the form of blocks, and Hadoop is also configured to make copies of those file blocks.
By default, the replication factor in Hadoop is set to 3, and it can be configured (changed manually) as per your requirement. In the example above we made 4 file blocks, so with 3 replicas (copies) of each file block, a total of 4 × 3 = 12 blocks are stored for backup purposes.
This is because Hadoop runs on commodity hardware (inexpensive system hardware) that can crash at any time; we are not using supercomputers for the Hadoop setup. That is why HDFS needs a feature that makes copies of the file blocks for backup purposes; this is known as fault tolerance.
Note that making so many replicas of the file blocks uses extra storage, but for large organizations the data is far more important than the storage cost, so nobody minds this extra storage. You can configure the replication factor in your hdfs-site.xml file.
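For reference, a minimal hdfs-site.xml sketch that sets the replication factor and, optionally, the block size (the values shown are just the defaults used as examples):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>            <!-- number of copies kept of each block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>    <!-- 128 MB, the default block size -->
  </property>
</configuration>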
Rack Awareness: A rack is simply a physical collection of nodes in the Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode, which achieves the maximum performance while performing read/write operations and reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator): YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into small jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized. The job scheduler also keeps track of which jobs are important, which jobs have higher priority, the dependencies between jobs, and other information such as job timing. The resource manager manages all the resources that are made available for running the Hadoop cluster.
Features of YARN
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Hadoop Common or Common Utilities: Hadoop Common consists of the Java libraries and Java files that all the other components in a Hadoop cluster need. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically, in software, by the Hadoop framework.
RDBMS vs Hadoop: an RDBMS is best suited for OLTP environments, whereas Hadoop is best suited for big data workloads.
Conclusion
RDBMS and Hadoop are both used to store and handle data. If the requirement is to store structured data with real-time processing, then an RDBMS is the ideal choice. If the requirement is to process large volumes of data (unstructured, structured, etc.), then Hadoop is preferred, because it is suitable for big data applications.
Processing Data with Hadoop
Data processing in Hadoop follows a structured flow that ensures large datasets are efficiently processed across distributed systems. The process starts with raw data being divided into smaller chunks, processed in parallel, and finally aggregated to generate meaningful output.
Before processing begins, Hadoop logically divides the dataset into manageable parts. This
ensures that data is efficiently read and distributed across the cluster, optimizing resource
utilization and parallel execution. Logical splitting prevents unnecessary data fragmentation,
reducing processing overhead and enhancing cluster efficiency.
InputSplit divides data into logical chunks: These chunks do not physically split files but create
partitions for parallel execution. The size of each split depends on the HDFS block size (default
128MB) and can be customized based on cluster configuration. For example, a 10GB file might
be divided into five 2GB splits, allowing multiple nodes to process different parts simultaneously
without excessive disk seeks.
RecordReader converts InputSplit data into key-value pairs: Hadoop processes data in key-value
pairs. The RecordReader reads raw data from an InputSplit and structures it for the mapper. For
instance, in a text processing job, it may convert each line into a key-value pair where the key is
the byte offset and the value is the line content.
With data split into smaller logical units and structured into key-value pairs, the next step involves processing this data to extract meaningful information. This is where the mapper and combiner come into play.
Once data is split and formatted, it enters the mapper phase. The mapper plays a critical role in
processing and transforming input data before passing it to the next stage. A combiner, an optional step, optimizes performance by reducing data locally, minimizing the volume of intermediate data that needs to be transferred to reducers.
The mapper processes each key-value pair and transforms it: It extracts meaningful information
by applying logic on input data. For example, in a word count program, the mapper receives
lines of text, breaks them into words, and assigns each word a count of one. If the input is
"Hadoop is powerful. Hadoop is scalable.", the mapper emits key-value pairs like ("Hadoop",
1), ("is", 1), ("powerful", 1), and so on.
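To make the word count example concrete, here is a minimal mapper sketch in Java against the standard org.apache.hadoop.mapreduce API; the class and variable names are illustrative, not taken from these notes:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The RecordReader hands us (byte offset, line); we emit (word, 1) pairs.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}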
The combiner performs local aggregation to reduce intermediate data: Since mappers generate
large amounts of intermediate data, the combiner helps by merging values locally before sending
them to reducers. For example, in the same word count program, if the mapper processes 1,000
occurrences of "Hadoop" across different InputSplits, the combiner sums up word counts locally,
reducing multiple ("Hadoop", 1) pairs to a single ("Hadoop", 1000) before the shuffle phase.
This minimizes data transfer and speeds up processing.
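A word count combiner can simply sum the counts produced by one mapper. The sketch below uses illustrative names; because the logic is the same summing operation, the same class is often registered as both the combiner and the reducer for this job:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();                          // local aggregation of one mapper's output
        }
        context.write(word, new IntWritable(sum));   // e.g. ("Hadoop", 1000)
    }
}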
Now that the data has been processed and locally aggregated, it needs to be efficiently distributed
to ensure balanced workload distribution. The partitioner and shuffle step handle this crucial
process. Let’s take a close look in the next section.
Once the mapper and optional combiner complete processing, data must be organized efficiently
before reaching the reducer. The partitioner and shuffle phase ensures a smooth and evenly
distributed data flow in Hadoop.
The partitioner assigns key-value pairs to reducers based on keys – It ensures that related data
reaches the same reducer. For example, in a word count job, words starting with 'A' may go to
Reducer 1, while words starting with 'B' go to Reducer 2.
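Hadoop's default partitioner hashes the key; the sketch below shows a custom partitioner in the spirit of the first-letter example above (the class name and routing rule are illustrative, not a standard Hadoop class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text word, IntWritable count, int numReducers) {
        // Route each word by its first character so related keys land on the same reducer.
        char first = Character.toUpperCase(word.toString().charAt(0));
        return first % numReducers;
    }
}
// Registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class);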
Shuffling transfers and sorts intermediate data before reduction – After partitioning, data is
shuffled across nodes, sorting it by key. This ensures that all values associated with the same
key are sent to the same reducer. It prevents duplicate processing by grouping related data before
reduction. For instance, if "Hadoop" appears in multiple splits, all occurrences are grouped
together before reaching the reducer.
With data properly partitioned and transferred to reducers, the final stage focuses on aggregation
and output formatting. This ensures that the results are structured and stored appropriately for
further analysis.
The reducer aggregates and finalizes data, producing the final output. The OutputFormat ensures
that processed data is stored in the required format for further use, offering flexibility for
integration with various systems.
The reducer processes grouped key-value pairs and applies aggregation logic: It takes data from
multiple mappers, processes it, and generates the final output. For example, in a word count
program, if the word "Hadoop" appears 1,500 times across different splits, the reducer sums up
all occurrences and outputs "Hadoop: 1500".
OutputFormat determines how final data is stored and structured: Hadoop provides built-in
options like TextOutputFormat, SequenceFileOutputFormat, and AvroOutputFormat, allowing
data to be stored in various formats. Additionally, custom OutputFormats can be defined for
specific needs, such as structured storage in databases or exporting results in JSON for log
processing jobs. This flexibility allows seamless integration with data lakes, BI tools, and
analytics platforms.
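Tying the stages together, a minimal driver sketch for the word count job might look as follows; the paths are placeholders, and the mapper, combiner, and reducer classes refer to the illustrative sketches above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountCombiner.class);     // optional local aggregation
        job.setReducerClass(WordCountCombiner.class);      // same summing logic works as the reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class);  // plain text "key <TAB> value" lines

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}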
With the entire data flow in Hadoop completed, it’s important to understand the essential components
that power this ecosystem. These building blocks ensure efficient data storage, processing, and
retrieval.
Hadoop’s ecosystem is built on several essential components that work together to enable
efficient data storage, processing, and management. These building blocks ensure that large
datasets are processed in a distributed manner, allowing organizations to handle massive
volumes of structured and unstructured data.
As the name suggests, Hadoop Distributed File System is the storage layer of Hadoop and is
responsible for storing the data in a distributed environment (master and slave configuration). It
splits the data into several blocks of data and stores them across different data nodes. These data
blocks are also replicated across different data nodes to prevent loss of data when one of the
nodes goes down.
HDFS has two main processes running for managing the data:
a. NameNode
It is running on the master machine. It saves the locations of all the files stored in the file system
and tracks where the data resides across the cluster i.e. it stores the metadata of the files. When
the client applications want to make certain operations on the data, it interacts with the
NameNode. When the NameNode receives the request, it responds by returning a list of Data
Node servers where the required data resides.
b. DataNode
This process runs on every slave machine. One of its functionalities is to store each HDFS data
block in a separate file in its local file system. In other words, it contains the actual data in form
of blocks. It sends heartbeat signals periodically and waits for the request from the NameNode
to access the data.
MapReduce is a programming technique, based on Java, that is used on top of the Hadoop framework for faster processing of huge quantities of data. It processes this huge data in a distributed environment using many DataNodes, which enables parallel processing and faster, fault-tolerant execution of operations. A MapReduce job splits the data set into multiple chunks of data, which are further converted into key-value pairs in order to be processed by the mappers.
The raw format of the data may not be suitable for processing. Thus, the input data compatible
with the map phase is generated using the InputSplit function and RecordReader. InputSplit is
the logical representation of the data which is to be processed by an individual mapper.
RecordReader converts these splits into records which take the form of key-value pairs. It
basically converts the byte-oriented representation of the input into a record-oriented
representation. These records are then fed to the mappers for further processing the data.
MapReduce jobs primarily consist of three phases namely the Map phase, the Shuffle phase, and
the Reduce phase.
a. Map Phase
It is the first phase in the processing of the data. The main task in the map phase is to process
each input from the RecordReader and convert it into intermediate tuples (key-value pairs). This
intermediate output is stored in the local disk by the mappers. The values of these key-value
pairs can differ from the ones received as input from the RecordReader. The map phase can also
contain combiners which are also called as local reducers. They perform aggregations on the
data but only within the scope of one mapper.
As the computations are performed across different data nodes, it is essential that all the values associated with the same key are merged together at one reducer. This task is performed by the partitioner. It applies a hash function to these key-value pairs to group them together, and it also ensures that the work is partitioned evenly across the reducers. Partitioners generally come into the picture when we are working with more than one reducer.
b. Shuffle Phase
This phase transfers the intermediate output obtained from the mappers to the reducers. This process is called shuffling. The output from the mappers is also sorted before being transferred to the reducers. The sorting is done on the basis of the keys in the key-value pairs. It helps the reducers start computing on the data even before the entire data set is received, and eventually helps in reducing the time required for the computations. As the keys are sorted, whenever the reducer gets a different key as input, it starts performing the reduce task on the previously received data.
c. Reduce Phase
The output of the map phase serves as an input to the reduce phase. It takes these key-value pairs
and applies the reduce function on them to produce the desired result. The keys and the values
associated with the key are passed on to the reduce function to perform certain operations. We can filter the data or combine it to obtain the aggregated output. After the execution of the reduce function, zero or more key-value pairs are created. This result is written back to the Hadoop Distributed File System.
YARN (Yet Another Resource Negotiator) is the resource-managing component of Hadoop. There are background processes running at each node (the Node Manager on the slave machines and the Resource Manager on the master node) that communicate with each other for the allocation of resources. The Resource Manager is the centrepiece of the YARN layer; it manages resources among all the applications and passes on requests to the Node Managers. The Node Manager monitors the resource utilization (memory, CPU, and disk) of its machine and conveys it to the Resource Manager. It is installed on every DataNode and is responsible for executing the tasks on the DataNodes.
From distributed storage to parallel data processing, each component plays a key role in maintaining a smooth data flow in Hadoop. The next section explores the key benefits of this data flow and its real-world applications.
Hadoop enables scalable, cost-effective, and fault-tolerant data processing across distributed
environments. Below are some of the major benefits and practical use cases of data processing
in Hadoop.
Scalability for large-scale data processing – Hadoop’s distributed architecture allows you to
scale storage and processing power horizontally. For example, companies handling petabytes of
data, such as Facebook and Twitter, use Hadoop clusters to process user interactions, ad
targeting, and recommendation algorithms.
Cost-effective storage and processing – Traditional relational databases can be expensive for
storing and processing big data. Hadoop provides a low-cost alternative by using commodity
hardware, making it ideal for businesses dealing with vast data volumes. A retail giant like
Walmart, for instance, uses Hadoop to analyze consumer purchasing behavior at scale.
Fault tolerance and high availability – Hadoop replicates data across multiple nodes, ensuring
no single point of failure. If one node fails, another automatically takes over, maintaining
uninterrupted processing. This feature is critical for financial institutions that cannot afford data
loss or downtime.
Efficient handling of structured and unstructured data – Unlike traditional databases that struggle
with unstructured data, Hadoop seamlessly processes images, videos, social media posts, and
sensor data. Companies like Netflix rely on Hadoop to analyze streaming preferences and
enhance user recommendations.
Real-time and batch data processing – Hadoop supports both real-time and batch processing
through its ecosystem tools like Apache Spark and MapReduce. This flexibility is essential for
cybersecurity firms detecting fraudulent transactions in real-time while also analyzing historical
trends.
Optimized data flow for analytics and machine learning – With frameworks like Apache Mahout
and Spark MLlib, Hadoop simplifies large-scale machine learning applications. Organizations
use this capability to build predictive models for healthcare, finance, and e-commerce.
Real-time fraud detection in financial services – Hadoop enables banks and financial institutions
to detect fraudulent transactions in real time by analyzing large volumes of customer transaction
data. PayPal and Mastercard use Hadoop to identify anomalies and trigger fraud alerts within
milliseconds.
IoT data processing in smart cities – Hadoop helps smart cities process vast IoT sensor data from
traffic systems, surveillance cameras, and environmental monitors. Barcelona and Singapore use
Hadoop-based analytics to optimize urban planning, reduce congestion, and improve public
safety.
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System) thus making the system much more efficient. Through its
various components, it can dynamically allocate various resources and schedule the application
processing. For large volume data processing, it is quite necessary to manage the available
resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the features listed earlier: multi-tenancy, scalability, cluster utilization, and compatibility.
Disadvantages:
Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional
configurations and settings, which can be difficult for users who are not familiar with YARN.
Overhead: YARN introduces additional overhead, which can slow down the performance
of the Hadoop cluster. This overhead is required for managing resources and scheduling
applications.
Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can
be caused by resource allocation, application scheduling, and communication between
components.
Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If
YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need to
set up a backup YARN instance for high availability.
Limited Support: YARN has limited support for non-Java programming languages.
Although it supports multiple processing engines, some engines have limited language
support, which can limit the usability of YARN in certain environments.
Introduction to MapReduce Programming
Processing very large datasets on a single system creates a bottleneck. Google solved this bottleneck using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The Map task takes a set of data and converts it into another set of data, where individual elements
are broken down into tuples (key-value pairs).
The Reduce task takes the output from the Map as an input and combines those data tuples (key-
value pairs) into a smaller set of tuples.
Let us now take a close look at each of the phases and try to understand their significance.
Input Phase − Here we have a Record Reader that translates each record in an input file and sends
the parsed data to the mapper in the form of key-value pairs.
Map − Map is a user-defined function, which takes a series of key-value pairs and processes each
one of them to generate zero or more key-value pairs.
Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
Combiner − A combiner is a type of local Reducer that groups similar data from the map phase
into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-
defined code to aggregate the values in a small scope of one mapper. It is not a part of the main
MapReduce algorithm; it is optional.
Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the
grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-
value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together
so that their values can be iterated easily in the Reducer task.
Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer
function on each one of them. Here, the data can be aggregated, filtered, and combined in a number
of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or
more key-value pairs to the final step.
Output Phase − In the output phase, we have an output formatter that translates the final key-
value pairs from the Reducer function and writes them onto a file using a record writer.
MapReduce-Example
Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 6,000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.
As shown in the illustration, the MapReduce algorithm performs the following actions −
Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.
Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value
pairs.
Count − Generates a token counter per word.
Aggregate Counters − Prepares an aggregate of similar counter values into small manageable
units.
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Mapper class takes the input, tokenizes it, maps and sorts it. The output of Mapper class is used
as input by Reducer class, which in turn searches matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into small parts and assign them to multiple systems. In technical terms, the MapReduce algorithm helps in sending the Map and Reduce tasks to appropriate servers in a cluster. These algorithms include the following:
Sorting
Searching
Indexing
TF-IDF
Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data. MapReduce implements a sorting algorithm to automatically sort the output key-value pairs from the mapper by their keys.
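MapReduce sorts the mapper output by key automatically; if a different ordering is needed, a custom comparator can be plugged in. The following is a hedged sketch (the class name is illustrative) that inverts the natural ordering of Text keys using Hadoop's WritableComparator:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingTextComparator extends WritableComparator {
    protected DescendingTextComparator() {
        super(Text.class, true);          // compare deserialized Text keys
    }

    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
        return -super.compare(a, b);      // invert the natural (ascending) key order
    }
}
// In the driver: job.setSortComparatorClass(DescendingTextComparator.class);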
Searching
Searching plays an important role in the MapReduce algorithm. It helps in the combiner phase (optional) and in the reducer phase. Let us try to understand how searching works with the help of an example.
Example
The following example shows how MapReduce employs the searching algorithm to find the details of the employee who draws the highest salary in a given employee dataset.
Let us assume we have employee data in four different files − A, B, C, and D. Let us also assume
there are duplicate employee records in all four files because of importing the employee data from
all database tables repeatedly. See the following illustration.
The Map phase processes each input file and provides the employee data in key-value pairs (<k,
v> : <emp name, salary>). See the following illustration.
The combiner phase (searching technique) will accept the input from the Map phase as key-value pairs with employee name and salary. Using the searching technique, the combiner checks all the employee salaries to find the highest-salaried employee in each file. See the following snippet.
<k: employee name, v: salary>
Max = salary of the first employee   // treated as the max salary so far
for each subsequent <k, v> pair:
    if (v > Max) {
        Max = v;          // new highest salary found
    } else {
        continue;         // keep checking the remaining records
    }
Reducer phase − From each file, you will find the highest-salaried employee. To avoid redundancy, check all the <k, v> pairs and eliminate duplicate entries, if any. The same algorithm is used between the four <k, v> pairs coming from the four input files. The final output should be as follows −
<gopal, 50000>
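A runnable sketch of this searching pattern is given below. It is an illustration under stated assumptions, not the notes' own code: the mapper is assumed to emit every record under a single constant key (for example "max") with the value "employee name,salary", and the same class can then serve both as the combiner (per-file maximum) and as the reducer (overall maximum):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxSalaryFinder extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        String bestName = null;
        long bestSalary = Long.MIN_VALUE;
        for (Text record : records) {                       // each value is "name,salary"
            String[] parts = record.toString().split(",");
            long salary = Long.parseLong(parts[1].trim());
            if (salary > bestSalary) {                      // keep the highest salary seen so far
                bestSalary = salary;
                bestName = parts[0].trim();
            }
        }
        // Emitting the same "name,salary" shape lets this class double as the combiner.
        context.write(key, new Text(bestName + "," + bestSalary));
    }
}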
Indexing: Normally, indexing is used to point to particular data and its address. In MapReduce, indexing is performed in batches on the input files for a particular mapper.
The indexing technique that is normally used in MapReduce is known as an inverted index. Search engines like Google and Bing use the inverted indexing technique. Let us try to understand how indexing works with the help of a simple example.
Example: The following text is the input for inverted indexing. Here T[0], T[1], and T[2] are the file names and their content is given in double quotes.
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"
The resulting inverted index maps each term to the documents that contain it: "a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}. For example, the term "is" appears in documents T[0], T[1], and T[2].
TF-IDF
Term Frequency (TF): It measures how frequently a particular term occurs in a document. It is calculated as the number of times a word appears in a document divided by the total number of words in that document.
TF(the) = (Number of times the term 'the' appears in a document) / (Total number of terms in the document)
Inverse Document Frequency (IDF): While computing TF, all the terms are considered equally important; TF counts the frequency even for common words like "is", "a", and "what". Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing the following −
IDF(the) = log(Total number of documents / Number of documents with the term 'the' in it)
(In the worked example below, the logarithm is taken to base 10.)
Example
Consider a document containing 1000 words, wherein the word hive appears 50 times. The TF
for hive is then (50 / 1000) = 0.05.
Now, assume we have 10 million documents and the word hive appears in 1000 of these. Then,
the IDF is calculated as log(10,000,000 / 1,000) = 4.
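The arithmetic of this example can be verified with a small Java sketch (the values are hard-coded from the example above; this is illustrative, not Hadoop API code):

public class TfIdfExample {
    public static void main(String[] args) {
        double termCount    = 50;          // occurrences of "hive" in the document
        double totalTerms   = 1000;        // total words in the document
        double totalDocs    = 10_000_000;  // documents in the collection
        double docsWithTerm = 1000;        // documents containing "hive"

        double tf    = termCount / totalTerms;               // 50 / 1000 = 0.05
        double idf   = Math.log10(totalDocs / docsWithTerm); // log10(10,000) = 4
        double tfIdf = tf * idf;                             // 0.05 * 4 = 0.2

        System.out.println("TF = " + tf + ", IDF = " + idf + ", TF-IDF = " + tfIdf);
    }
}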