MapReduce Unit3

MapReduce is a programming model developed by Google for processing large datasets in parallel across distributed clusters, forming a core component of the Hadoop ecosystem. It consists of two primary functions, Map and Reduce, which transform input data into key-value pairs and aggregate them, respectively, while offering features like scalability, fault tolerance, and data locality. Common use cases include log analysis, data transformation, and machine learning, with alternatives like Apache Spark and Apache Flink available for specific needs.

Uploaded by

FANZ3R

MapReduce

Introduction
• MapReduce is a popular programming model for processing large
datasets in parallel across a distributed cluster of computers.
Developed by Google, it has become an essential component of the
Hadoop ecosystem, enabling efficient data processing and analysis.

• MapReduce is designed to address the challenges associated with
processing massive amounts of data by breaking the problem into
smaller, more manageable tasks. It consists of two primary functions,
Map and Reduce, which work together to process and analyze data.
The Map Function
The Map function takes input data and
processes it into intermediate key-value pairs.
It applies a user-defined function to each input
record, generating output pairs that are then
sorted and grouped by key.

The Reduce Function


The Reduce function processes the
intermediate key-value pairs generated by the
Map function. It aggregates, filters, or combines
the data based on a user-defined function,
generating the final output.
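
As a rough illustration of how the two functions fit together, here is a minimal in-memory sketch in Python. It is not Hadoop code: the driver `map_reduce`, the functions `status_mapper` and `total_reducer`, and the sample request data are all assumptions made for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Keys are visited in sorted order, mirroring the sort-and-group step.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def status_mapper(record):
    """Map: turn one (status code, response size) record into a key-value pair."""
    status, size = record
    yield status, size


def total_reducer(key, values):
    """Reduce: aggregate all values that share a key."""
    return sum(values)


# Hypothetical input: (HTTP status code, response size in bytes).
requests = [(200, 512), (404, 128), (200, 1024), (500, 64), (404, 32)]
totals = map_reduce(requests, status_mapper, total_reducer)
print(totals)  # {200: 1536, 404: 160, 500: 64}
```

Only the mapper and reducer are specific to the problem; the grouping logic in between stays the same, which is the part a framework such as Hadoop takes over.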
Key Features of MapReduce
Scalability
MapReduce can scale to process vast amounts of data by distributing tasks across a large
number of nodes in a cluster. This allows it to handle massive datasets, making it suitable for Big
Data applications.
Fault Tolerance
MapReduce incorporates built-in fault tolerance to ensure the reliable processing of data. It
automatically detects and handles node failures, rerunning tasks on available nodes as needed.
Data Locality
MapReduce takes advantage of data locality by processing data on the same node where it is
stored, minimizing data movement across the network and improving overall performance.
Simplicity
The MapReduce programming model abstracts away many complexities associated with
distributed computing, allowing developers to focus on their data processing logic rather than
low-level details.
Cost-Effective Solution
Hadoop's scalable architecture and MapReduce programming framework make storing and
processing extensive data sets very economical.
Parallel Programming
A job is divided into independent tasks that can execute simultaneously. Because these
tasks are distributed across multiple processors and run in parallel, the job as a whole
completes faster than it would if the tasks ran one after another.
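
The idea of running independent tasks side by side can be sketched with Python's standard `concurrent.futures` module. This is only an analogy: the threads stand in for cluster nodes, the function names and sample chunks are assumptions, and in CPython threads do not speed up CPU-bound work the way separate machines would.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """One independent task: count the words in a single chunk of text."""
    return len(chunk.split())


def run_in_parallel(chunks, max_workers=4):
    """Run one task per chunk on a pool of workers and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input chunks.
        return list(executor.map(count_words, chunks))


chunks = ["a b c", "d e", "f g h i"]  # hypothetical pieces of a larger input
partial_counts = run_in_parallel(chunks)
total = sum(partial_counts)
print(partial_counts, total)  # [3, 2, 4] 9
```

Because no task depends on another task's result, the tasks can be scheduled in any order and on any worker.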
MapReduce in the Hadoop Ecosystem
Hadoop, an open-source Big Data processing framework, utilizes MapReduce as its
core component for data processing. Hadoop's implementation of MapReduce
provides additional features and capabilities, such as:
Hadoop Distributed File System (HDFS)
HDFS is a distributed file system designed to store and manage large datasets
across multiple nodes in a cluster. It provides high throughput access to data,
facilitating efficient MapReduce processing.

JobTracker and TaskTracker
In classic (Hadoop 1) MapReduce, the JobTracker and TaskTracker components manage the
scheduling, distribution, and monitoring of MapReduce jobs. The JobTracker assigns
tasks to available TaskTrackers, which in turn execute the Map and Reduce functions
on their respective nodes.
YARN (Yet Another Resource Negotiator)
YARN is a resource management layer in Hadoop that separates the concerns of
resource management and job scheduling, allowing for better scalability and
flexibility in the cluster.
Benefits of Using MapReduce
Enhanced Data Processing Efficiency
MapReduce's ability to process data in parallel across a distributed
cluster enables it to handle large datasets quickly and efficiently.
Cost-Effective Solution
By utilizing commodity hardware, MapReduce offers a cost-effective
solution for organizations looking to process and analyze Big Data
without investing in expensive infrastructure.
Adaptability
MapReduce can be applied to a wide range of data processing tasks,
making it an adaptable solution for various industries and use
cases.
Robustness
With built-in fault tolerance and data locality features, MapReduce
ensures that data processing remains robust and reliable, even in
the face of hardware failures or network issues.
Common Use Cases for MapReduce
•Log Analysis: MapReduce is widely used for log analysis, processing and
aggregating data from server logs to identify trends, patterns, or anomalies in
user behavior or system performance.

•Data Transformation: MapReduce can perform large-scale data
transformations, such as converting raw data into a more structured format,
filtering out irrelevant data, or extracting specific features for further analysis.

•Machine Learning: MapReduce can be employed in machine learning tasks,
including training models, feature extraction, and model evaluation. Its ability to
parallelize tasks makes it well-suited for processing massive datasets commonly
encountered in machine learning applications.

•Text Analysis: Text analysis tasks, such as sentiment analysis, topic modeling,
or keyword extraction, can be efficiently performed using MapReduce, enabling
organizations to derive insights from unstructured textual data.
MapReduce Alternatives and Complementary Technologies
While MapReduce has proven effective for many data processing tasks, other
technologies have emerged to address specific needs or offer improved performance
in certain scenarios:
Apache Spark
Apache Spark is a fast, in-memory data processing engine that provides an
alternative to MapReduce for certain use cases. Spark's Resilient Distributed Datasets
(RDDs) enable more efficient iterative processing, making it particularly suitable for
machine learning and graph processing tasks.

Apache Flink
Apache Flink is a stream-processing framework that offers low-latency, high-
throughput data processing. While MapReduce focuses on batch processing, Flink's
ability to process data in real-time makes it an attractive option for time-sensitive
applications.
Apache Hive
Apache Hive is a data warehousing solution built on top of Hadoop
that provides an SQL-like interface for querying large datasets.
While not a direct replacement for MapReduce, Hive can simplify
data processing tasks for users familiar with SQL.
Difference Between MapReduce, Apache Spark, Apache Flink, and Apache Hive

Feature | MapReduce | Apache Spark | Apache Flink | Apache Hive
Primary Focus | Batch Processing | In-Memory Processing | Stream Processing | Data Warehousing
Data Processing Model | Map and Reduce | Resilient Distributed Datasets (RDDs) | Data Streaming | SQL-like Querying
Fault Tolerance | Yes | Yes | Yes | Yes
Scalability | High | High | High | High
Use Cases | Log Analysis, Data Transformation, Machine Learning, Text Analysis | Iterative Machine Learning, Graph Processing | Real-time Data Processing, Time-sensitive Applications | Large-scale Data Querying, Data Analytics
Language Support | Java, Python, Ruby, and others | Java, Scala, Python, R | Java, Scala, Python | HiveQL (SQL-like)

•Job – A Job in the context of Hadoop MapReduce is the unit of
work to be performed as requested by the client / user. The
information associated with the Job includes the data to be
processed (input data), MapReduce logic / program / algorithm,
and any other relevant configuration information necessary to
execute the Job.

•Task – Hadoop MapReduce divides a Job into multiple sub-jobs
known as Tasks. These Tasks can run independently of each
other on various nodes across the cluster. There are primarily
two types of Tasks – Map Tasks and Reduce Tasks.
JobTracker
• JobTracker – Just like the storage layer (HDFS), the computation layer
(MapReduce) works in a master-slave / master-worker fashion. A
JobTracker node acts as the master and is responsible for scheduling
Tasks on appropriate nodes, coordinating their execution, sending the
information needed to execute them, collecting the results after each
Task finishes, re-executing failed Tasks, and monitoring the overall
progress of the Job. Since a Job consists of multiple Tasks, a Job’s
progress depends on the status of the Tasks associated with it. There is
only one JobTracker node per Hadoop Cluster.
• TaskTracker – A TaskTracker node acts as the slave / worker and is
responsible for executing the Tasks assigned to it by the JobTracker. There is no
restriction on the number of TaskTracker nodes that can exist in
a Hadoop Cluster. A TaskTracker receives the information
necessary to execute a Task from the JobTracker, executes the
Task, and sends the results back to the JobTracker.
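
The master-worker relationship, including re-execution of failed Tasks, can be mimicked with a toy simulation. The classes below only borrow the names JobTracker and TaskTracker; the `fail_on` parameter, the round-robin assignment policy, and the sample splits are assumptions for illustration and do not reflect Hadoop's actual scheduler.

```python
class TaskTracker:
    """Worker that executes a task; `fail_on` is a hypothetical set of task ids it fails on."""

    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)

    def execute(self, task_id, func, data):
        if task_id in self.fail_on:
            raise RuntimeError(f"{self.name} failed task {task_id}")
        return func(data)


class JobTracker:
    """Master that assigns tasks to trackers and re-executes failed tasks elsewhere."""

    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, func, splits):
        results = {}
        for task_id, data in enumerate(splits):
            # Assumed policy: start at a different tracker per task, then move on after a failure.
            for attempt in range(len(self.trackers)):
                tracker = self.trackers[(task_id + attempt) % len(self.trackers)]
                try:
                    results[task_id] = (tracker.name, tracker.execute(task_id, func, data))
                    break
                except RuntimeError:
                    continue  # re-execute the task on the next tracker
            else:
                raise RuntimeError(f"task {task_id} failed on every tracker")
        return results


trackers = [TaskTracker("tt-0", fail_on={0}), TaskTracker("tt-1")]
results = JobTracker(trackers).run_job(sum, [[1, 2], [3, 4], [5, 6]])
print(results)  # {0: ('tt-1', 3), 1: ('tt-1', 7), 2: ('tt-0', 11)}
```

Task 0 fails on `tt-0` and is transparently rerun on `tt-1`, so the Job still completes with a result for every Task.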
Map()
• The Map Task in MapReduce is performed using the Map()
function. This part of MapReduce is responsible for
processing one or more chunks of data and producing the
output results.
Reduce()
The next part / component / stage of the MapReduce programming model is the
Reduce() function. This part of the MapReduce is responsible for consolidating
the results produced by each of the Map() functions/tasks.

Data Locality

MapReduce tries to place the data and the compute as close together as possible. First, it
tries to run the compute on the same node where the data resides. If that cannot be
done (for example, because the compute on that node is down or is busy with another
computation), it tries to run the compute on the node nearest to the data node(s) that
contain the data to be processed. This feature of MapReduce is called “Data Locality”.
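
The placement preference described above can be written as a small selection function. This is a sketch under stated assumptions: the `rack/node` naming scheme, the `rack_distance` values, and the function names are hypothetical and are not taken from Hadoop's scheduler.

```python
def rack_distance(a, b):
    """Assumed distance metric: 0 for the same node, 2 for the same rack, 4 otherwise."""
    if a == b:
        return 0
    return 2 if a.split("/")[0] == b.split("/")[0] else 4


def choose_node(block_nodes, free_nodes, distance=rack_distance):
    """Prefer a free node that holds the data; otherwise pick the free node nearest to it."""
    local = [node for node in block_nodes if node in free_nodes]
    if local:
        return local[0]
    if not free_nodes:
        return None  # nothing available; the task has to wait
    return min(free_nodes, key=lambda n: min(distance(n, b) for b in block_nodes))


block_nodes = ["r1/n1", "r2/n3"]  # nodes that store the data to be processed
print(choose_node(block_nodes, ["r1/n1", "r1/n2"]))  # r1/n1 (data is local)
print(choose_node(block_nodes, ["r3/n5", "r1/n2"]))  # r1/n2 (same rack as r1/n1)
```

Moving the computation to the data in this way is what keeps network transfer low when the input is large.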
The following diagram shows the logical flow of a MapReduce programming
model.
[Diagram: Input → Split → Map → Combine → Shuffle & Sort → Reduce → Output]

The stages depicted above are:

Input:
This is the input data / file to be processed.

Split:
Hadoop splits the incoming data into smaller pieces called “splits”.

Map:
In this step, MapReduce processes each split according to the logic defined in the map() function.
Each mapper works on one split at a time. Each mapper is treated as a task, and multiple
tasks are executed across different TaskTrackers and coordinated by the JobTracker.

Combine:
This is an optional step used to improve performance by reducing the
amount of data transferred across the network. The combiner applies the same kind of
aggregation as the reduce step, but locally to the output of each map() function, before it
is passed to the subsequent steps.

Shuffle & Sort:
In this step, the outputs from all the mappers are shuffled, sorted to put them in order,
and grouped by key before being sent to the next step.

Reduce:
This step is used to aggregate the outputs of mappers using the reduce()
function. Output of reducer is sent to the next and final step. Each reducer is
treated as a task and multiple tasks are executed across different TaskTrackers
and coordinated by the JobTracker.

Output:
Finally, the output of the reduce step is written to a file in HDFS.
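
The Combine and Shuffle & Sort stages are the least visible ones, so a short plain-Python sketch may help. The function names `combine` and `shuffle_and_sort` and the sample mapper outputs are assumptions made for illustration; in Hadoop these stages are carried out by the framework across the network.

```python
from collections import defaultdict


def combine(pairs):
    """Optional local aggregation of one mapper's output (here: summing counts per key)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_and_sort(mapper_outputs):
    """Merge the pairs from all mappers, group the values by key, and order the groups by key."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items())


# Hypothetical output of two mappers.
mapper_outputs = [
    [("SQL", 1), ("DW", 1), ("SQL", 1)],
    [("DW", 1), ("BI", 1)],
]
combined = [combine(pairs) for pairs in mapper_outputs]
sent_without_combiner = sum(len(pairs) for pairs in mapper_outputs)
sent_with_combiner = sum(len(pairs) for pairs in combined)
grouped = shuffle_and_sort(combined)

print(sent_without_combiner, sent_with_combiner)  # 5 4
print(grouped)  # [('BI', [1]), ('DW', [1, 1]), ('SQL', [2])]
```

The combiner lowers the number of pairs that have to cross the network, and each reducer then receives one key together with the full list of its values.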
Word Count Example
For the purpose of understanding MapReduce, let us consider a simple
example. Let us assume that we have a file which contains the following four
lines of text.

In this file, we need to count the number of occurrences of each word. For
instance, DW appears twice, BI appears once, SSRS appears twice, and so
on. Let us see how this counting operation is performed when this file is
input to MapReduce.
Below is a simplified representation of the data flow for Word Count Example.
•Input: In this step, the sample file is input to MapReduce.
•Split: In this step, Hadoop splits / divides our sample input file into four parts, each
part made up of one line from the input file. Note that, for the purpose of this
example, we are treating each line as one split. However, this is not necessarily
true in a real-world scenario.
•Map: In this step, each split is fed to a mapper, which is the map() function
containing the logic on how to process the input data, which in our case is the line
of text present in the split. For our scenario, the map() function emits each
occurrence of a word as a (key, value) pair with a count of 1, which in our case gives
(SQL, 1), (DW, 1), (SQL, 1), and so on.
•Combine: This is an optional step and is often used to improve the performance by
reducing the amount of data transferred across the network. This is essentially the
same as the reducer (reduce() function) and acts on output from each mapper. In
our example, the key value pairs from first mapper “(SQL, 1), (DW, 1), (SQL, 1)” are
combined and the output of the corresponding combiner becomes “(SQL, 2), (DW,
1)”.
•Shuffle and Sort: In this step, the output of all the mappers is collected,
shuffled, sorted, and arranged for delivery to the reducers.
•Reduce: In this step, the collective data from various mappers, after being
shuffled and sorted, is combined / aggregated and the word counts are
produced as (key, value) pairs like (BI, 1), (DW, 2), (SQL, 5), and so on.
•Output: In this step, the output of the reducer is written to a file on HDFS.
The following image is the output of our word count example.
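
For readers who want to try the flow themselves, here is a hedged sketch of word count written in the style of Hadoop Streaming scripts, where the mapper emits tab-separated `word<TAB>1` lines and the reducer receives those lines sorted by key. The sample text is a hypothetical stand-in, not the four-line file from the slide, so its counts differ from the ones quoted above. The `sorted()` call stands in for the shuffle and sort that Hadoop would perform.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" line per word occurrence."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; relies on the input lines being sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


# Hypothetical input, one split per line.
sample = ["SQL DW SQL", "SSRS BI SQL", "DW SSRS"]
intermediate = sorted(mapper(sample))  # stands in for shuffle and sort
counts = dict(reducer(intermediate))
print(counts)  # {'BI': 1, 'DW': 2, 'SQL': 3, 'SSRS': 2}
```

The reducer never holds more than one word's group at a time, which works only because all lines for a given key arrive next to each other.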
Types of InputFormat in MapReduce
There are different types of MapReduce InputFormat in Hadoop, each used for a
different purpose. Let’s discuss the Hadoop InputFormat types below:

1. FileInputFormat
It is the base class for all file-based InputFormats. FileInputFormat also specifies the
input directory where the data files are located. When we start a MapReduce job
execution, FileInputFormat provides a path containing the files to read.
This InputFormat reads all the files and divides them into one or more
InputSplits.

2. TextInputFormat
It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record. It performs no parsing. TextInputFormat is useful for unformatted
data or line-based records like log files. For each record:
•Key – It is the byte offset of the beginning of the line within the file. It is therefore
unique only when combined with the file name.
•Value – It is the contents of the line. It excludes line terminators.
3. KeyValueTextInputFormat
It is similar to TextInputFormat. This InputFormat also treats each line of input
as a separate record. The difference is that TextInputFormat treats the entire
line as the value, whereas KeyValueTextInputFormat breaks the line itself into a
key and a value at a tab character ('\t'). For each record:
•Key – Everything up to the tab character.
•Value – It is the remaining part of the line after the tab character.
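
The difference between the records produced by TextInputFormat and KeyValueTextInputFormat can be imitated in a few lines of Python. This mimics only the key and value rules described above; the function names and sample bytes are assumptions, and the real classes are Java classes in Hadoop.

```python
def text_records(data: bytes):
    """Yield (byte offset, line) pairs, following the TextInputFormat description."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The value excludes the line terminator; the key is where the line starts.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, separator="\t"):
    """Yield (key, value) pairs split at the first separator, following KeyValueTextInputFormat."""
    for _, line in text_records(data):
        key, _, value = line.partition(separator)
        yield key, value


data = b"alice\t30\nbob\t25\n"  # hypothetical file contents
print(list(text_records(data)))       # [(0, 'alice\t30'), (9, 'bob\t25')]
print(list(key_value_records(data)))  # [('alice', '30'), ('bob', '25')]
```

The same bytes yield offset-keyed records in one case and tab-split records in the other, which is the whole difference between the two formats.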

4. SequenceFileInputFormat
It is an InputFormat which reads sequence files. Sequence files are binary files
that store sequences of binary key-value pairs. They can be block-compressed
and provide direct serialization and deserialization of
arbitrary data types. For each record,
Key & Value both are user-defined.
5. SequenceFileAsTextInputFormat
It is a variant of SequenceFileInputFormat. This format converts the
sequence file's keys and values to Text objects by
calling toString() on them. This makes sequence files suitable input for
Hadoop Streaming.

6. SequenceFileAsBinaryInputFormat
By using SequenceFileAsBinaryInputFormat we can extract the sequence file’s
keys and values as opaque binary objects.
7. NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a
variable number of lines of input; the number depends on the size of the split
and the length of the lines. If we want each mapper to receive a fixed number of
lines of input, we use NLineInputFormat. As with TextInputFormat, the keys are
the byte offsets of the lines and the values are the contents of the lines.
N – It is the number of lines of input that each mapper receives.
By default (N=1), each mapper receives exactly one line of input.
Suppose N=2; then each split contains two lines. One mapper
receives the first two key-value pairs, and another mapper receives the
next two key-value pairs.
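
The effect of N on how lines are handed to mappers can be shown with a short sketch. The function `n_line_splits` and the sample lines are hypothetical; the sketch illustrates only the grouping rule, not Hadoop's implementation.

```python
def n_line_splits(lines, n=1):
    """Group lines into splits of at most n lines each; each split goes to one mapper."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [lines[i:i + n] for i in range(0, len(lines), n)]


lines = ["line 1", "line 2", "line 3", "line 4", "line 5"]
print(n_line_splits(lines, n=2))
# [['line 1', 'line 2'], ['line 3', 'line 4'], ['line 5']]
print(len(n_line_splits(lines)))  # 5 (default N=1: one line per mapper)
```

When the line count is not a multiple of N, the last split simply holds the remaining lines.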
8. DBInputFormat
This InputFormat reads data from a relational database, using JDBC.
It is best suited to loading small datasets, perhaps for joining with large datasets
from HDFS using MultipleInputs. For each record:
•Key – LongWritable
•Value – DBWritable
What is Interactive Analytics?

• Businesses are collecting more data than ever, but if you don’t know how to effectively
interpret it, data is just facts and statistics. The value in data doesn’t come from
collecting it, but from how you make it actionable to drive business strategy. In order to
make better-informed decisions, your business needs to be able to effectively analyze
data, and that data needs to be comprehensible to as many decision-makers as
possible.
• Interactive analytics is a way to make real-time data more intelligible for non-technical
users through the use of tools that visualize and crunch the data, enabling users to
quickly and easily run complex queries and interpret the results to gain the valuable insights
that factor into critical business decisions.