Introduction: MapReduce
• The concept of MapReduce was pioneered by Google.
• The original paper titled "MapReduce: Simplified Data Processing on Large
Clusters" was written by Jeffrey Dean and Sanjay Ghemawat, and it was
published in 2004.
• In the paper, they introduced the MapReduce programming model and
described its implementation at Google for processing large-scale data across
distributed clusters.
• MapReduce became a fundamental framework for distributed computing and
played a significant role in the development of big data technologies.
• While Google introduced the concept, the open-source Apache Hadoop project
later implemented its own version of MapReduce, making it accessible to a
broader community of developers and organizations.
Prerequisites that can help you grasp MapReduce more effectively
1. Programming Languages:
• Proficiency in a programming language is crucial.
• Java is commonly used in the Hadoop ecosystem, and many MapReduce
examples are written in Java.
• Knowledge of Python can also be useful.
2. Distributed Systems:
• Understanding the basics of distributed computing is essential.
• Familiarize yourself with concepts like nodes, clusters, parallel processing, and
the challenges associated with distributed systems.
3. Hadoop Ecosystem:
• MapReduce is often associated with the Hadoop framework.
• Therefore, it's helpful to have a basic understanding of Hadoop and its
ecosystem components, such as HDFS (Hadoop Distributed File System) and
YARN (Yet Another Resource Negotiator).
4. Basic Understanding of Big Data:
• MapReduce is commonly used in the context of big data processing.
• It's beneficial to have a foundational understanding of what constitutes "big
data," the challenges associated with large datasets, and the motivation behind
distributed computing for big data.
5. Linux/Unix Commands:
• Many big data platforms, including Hadoop, are typically deployed on Unix-
like systems.
• Familiarity with basic command-line operations in a Unix environment can be
helpful for interacting with Hadoop clusters.
6. SQL (Structured Query Language):
• If you are planning to use tools like Apache Hive, which provides a SQL-like
interface for querying data in Hadoop, a basic understanding of SQL can be
beneficial.
7. Concepts of Data Storage and Retrieval:
• Understanding how data is stored and retrieved in a distributed environment
is crucial.
• Concepts like sharding, replication, and indexing are relevant.
8. Algorithmic and Problem-Solving Skills:
• MapReduce involves breaking down problems into smaller tasks that can be
executed in parallel.
• Strong algorithmic and problem-solving skills are valuable for designing
efficient MapReduce jobs.
Explanation
Q: Describe MapReduce execution steps with a neat diagram. (12 M)
• MapReduce is a programming model and processing technique designed for
processing and generating large datasets that can be parallelized across a
distributed cluster of computers.
• A job means a MapReduce program.
• Each job consists of several smaller units, called MapReduce tasks.
• The basic idea behind MapReduce is to divide a large computation into smaller
tasks that can be performed in parallel across multiple nodes in a cluster.
In a MapReduce job:
1. The data is split into smaller chunks, and a "map" function is applied to each
chunk independently.
2. The results are then shuffled and sorted, and a "reduce" function is applied to
combine the intermediate results into the final output.
The MapReduce programming approach allows for efficient processing of large
datasets in a distributed computing environment.
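To make this flow concrete, here is a minimal single-machine sketch in plain Java (class and variable names are my own; this models the idea and is not Hadoop code):

    import java.util.*;

    // A minimal single-machine model of the MapReduce flow (illustrative only).
    public class MapReduceFlowSketch {
        public static void main(String[] args) {
            // 1. The input is split into smaller chunks (here, three lines of text).
            List<String> chunks = Arrays.asList("hello Hadoop", "hi Hadoop", "hi Cassandra");

            // "Map": emit an intermediate (word, 1) pair for every word, chunk by chunk.
            List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
            for (String chunk : chunks)
                for (String word : chunk.split("\\s+"))
                    intermediate.add(Map.entry(word, 1));

            // 2. Shuffle/sort groups pairs by key; "reduce" then combines each group.
            Map<String, Integer> output = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : intermediate)
                output.merge(kv.getKey(), kv.getValue(), Integer::sum);

            System.out.println(output); // {Cassandra=1, Hadoop=2, hello=1, hi=2}
        }
    }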
JobTracker and TaskTracker
• MapReduce consists of a single master JobTracker and one slave TaskTracker
per cluster node.
• The master is responsible for scheduling the component tasks in a job onto the
slaves, monitoring them and re-executing the failed tasks.
• The slaves execute the tasks as directed by the master.
• The MapReduce framework operates entirely on key-value pairs.
• The framework views the input to the task as a set of (key, value) pairs and
produces a set of (key, value) pairs as the output of the task, conceivably of
different types.
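In the Hadoop Java API, these types appear as the generic parameters of the Mapper and Reducer base classes, and the Mapper's output types must line up with the Reducer's input types. A minimal skeleton, with illustrative class names and types chosen for a word-count-style job:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: consumes (k1, v1), emits (k2, v2).
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // map() body goes here
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: consumes (k2, list(v2)), emits (k3, v3).
    // Its input types must match the Mapper's output types.
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce() body goes here
    }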
Map Tasks
• A map task is a task that implements the map() function, which runs user
application code for each key-value pair (k1, v1).
• Each key k1 maps to a group of data values.
• The values v1 are typically large strings read from the input file(s).
• The output of map() is zero pairs (when no values are found) or one or more
intermediate key-value pairs (k2, v2).
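As an illustration, a word-count mapper using the Hadoop Java API: here (k1, v1) is (byte offset, line of text) and each emitted intermediate pair (k2, v2) is (word, 1). The class name is an assumption for this sketch:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // v1 is one line of input text; emit an intermediate (word, 1) pair per token.
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue; // skip blanks on empty lines
                word.set(token);
                context.write(word, ONE);      // intermediate pair (k2, v2)
            }
        }
    }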
Reduce Task
• Refers to a task that takes the output (k2, v2) from the map as its input and
combines those data pieces into a smaller set of data, optionally pre-aggregated
by a combiner.
• The reduce task is always performed after the map task.
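A matching word-count reducer sketch: it receives (k2, list(v2)) and folds the list into a sum. The class name is again illustrative; such a class is often also registered as the combiner to pre-aggregate map output locally:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get(); // fold list(v2) into one value
            context.write(key, new IntWritable(sum));    // final pair (k3, v3)
        }
    }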
Key-Value Pair
Each phase (Map phase and Reduce phase) of MapReduce has key-value pairs as
input and output.
Data must first be converted into key-value pairs before it is passed to the Mapper,
as the Mapper only understands data in key-value form.
Key-value pairs in Hadoop MapReduce are generated as follows:
• InputSplit - defines a logical representation of the data and presents the split's
data for processing by an individual map() task.
• RecordReader - communicates with the InputSplit and converts the split into
records, which are key-value pairs in a format suitable for reading by the
Mapper.
• The RecordReader uses TextInputFormat by default for converting data into
key-value pairs.
• The RecordReader communicates with the InputSplit until the entire file has
been read.
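For illustration, the fragment below sets TextInputFormat on a job explicitly (it is already the default), with comments describing the key-value pairs its RecordReader produces; the class and job names here are assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class InputFormatSketch {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "input-format-sketch");
            // TextInputFormat is already the default; set explicitly for illustration.
            // Its RecordReader hands each line of a split to map() as:
            //   key   = LongWritable (byte offset of the line within the file)
            //   value = Text        (the contents of the line)
            job.setInputFormatClass(TextInputFormat.class);
        }
    }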
Grouping by Key
• When a map task completes, the shuffle process aggregates (combines) all the
Mapper outputs by grouping them on the keys of the Mapper output, appending
each value v2 to a list of values.
• A "group by" operation on the intermediate keys thus creates list(v2).
Shuffle and Sorting Phase
• All pairs with the same group key k2 are collected and grouped together,
creating one group for each key.
• The shuffle output format is a list of (k2, list(v2)) pairs. Thus, a different subset
of the intermediate key space is assigned to each reduce node.
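A small plain-Java model of this grouping (it mimics the effect of the shuffle, not Hadoop's internal implementation; names are my own):

    import java.util.*;

    // Models the effect of shuffle/sort on intermediate pairs.
    public class ShuffleSketch {
        public static void main(String[] args) {
            // Intermediate (k2, v2) pairs as mappers might emit them.
            List<Map.Entry<String, Integer>> mapOutput = List.of(
                    Map.entry("hi", 1), Map.entry("Hadoop", 1),
                    Map.entry("hi", 1), Map.entry("Hadoop", 1));

            // Group values by key into (k2, list(v2)), sorted by k2.
            SortedMap<String, List<Integer>> grouped = new TreeMap<>();
            for (Map.Entry<String, Integer> kv : mapOutput)
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());

            System.out.println(grouped); // {Hadoop=[1, 1], hi=[1, 1]}
        }
    }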
Reduce Tasks
• A reduce task implements reduce(), which takes the Mapper output (after it has
been shuffled and sorted), grouped by key as (k2, list(v2)), and is applied in
parallel to each group.
• The reduce function iterates over the list of values associated with a key and
produces outputs such as aggregations and statistics.
• The reduce function emits zero or more key-value pairs (k3, v3) to the final
output file: Reduce: (k2, list(v2)) → list(k3, v3)
MapReduce Implementation
• MapReduce is a programming model and processing technique for handling
large datasets in a parallel and distributed fashion.
• The word count problem is a classic example of a task that can be solved using
MapReduce.
• A mathematical representation of the MapReduce algorithm for the word count
problem is given below.
Example:
Step 1: Input Document:
D="hello Hadoop, hi Hadoop, Hello MongoDB, hi Cassandra Hadoop"
Step 2: Map Function:
The Map function processes each word in the document and emits key-value pairs
where the key is the word, and the value is 1 (indicating the count).
Map("hello”) →{("hello",1)},
Map("Hadoop”) →{("Hadoop",1), ("Hadoop",1), ("Hadoop",1)},
Map("hi”) →{("hi",1), ("hi",1)}, …
Step 3: Shuffle and Sort (Grouping by Key):
Group and sort the intermediate key-value pairs by key.
("hello", [1]), ("Hadoop", [1,1,1]), ("hi", [1,1]), …
Step 4: Reduce Function:
The Reduce function takes each unique key and the list of values and calculates the
sum.
Reduce ("hello", [1]) →{("hello",1)},
Reduce ("Hadoop", [1,1,1]) →{("Hadoop",3)},
Reduce ("hi", [1,1]) →{("hi",2)}, …
Step 5: Final Output:
{("hello",1), ("Hadoop",3), ("hi",2), ("Hello",1), ("MongoDB",1), ("Cassandra",1)}