Big Data
Map Reduce
Table of Contents
Key approach to work with Big Data
Mapping
The Map Step
Applying the Reduce Step
Reduce Step
Map Reduce Data Flow
A Closer Look at the Map and Partition Steps
Key approach to work with Big Data
MapReduce is a programming model for processing large data sets, and the name of an
implementation of the model by Google.
MapReduce is typically used for distributed computing on large data sets across clusters of
computers.
[Figure: Map. A master node passes problem data to Worker Node 1, Worker Node 2, and Worker Node 3.]
Mapping
The Map Step
The master node takes the input, divides it into smaller sub-problems, and distributes them to
worker nodes.
This process can repeat: a worker node may divide its sub-problem again, which leads to a
multi-level tree structure.
Each worker node processes its sub-problem and hands the result back to its parent node.
[Figure: an input list passes through a mapping function to produce an output list.]
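The map step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `split_input`, `worker_map`, and `run_map_step` are hypothetical, a thread pool stands in for the worker nodes, and squaring numbers stands in for whatever work a real job would do.

```python
from concurrent.futures import ThreadPoolExecutor


def split_input(data, num_workers):
    """Divide the input into roughly equal sub-problems, one per worker."""
    chunk = max(1, -(-len(data) // num_workers))  # ceiling division, at least 1
    return [data[i:i + chunk] for i in range(0, len(data), chunk)]


def worker_map(sub_problem):
    """Hypothetical worker task: square each number in its sub-problem."""
    return [x * x for x in sub_problem]


def run_map_step(data, num_workers=3):
    """Play the master node: split the input, then hand each piece to a worker."""
    sub_problems = split_input(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as the sub-problems
        return list(pool.map(worker_map, sub_problems))


partial_results = run_map_step([1, 2, 3, 4, 5, 6, 7])
```

Each inner list in `partial_results` is the answer one worker hands back to its parent.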
Applying the Reduce Step
Reduce Step
The master node then collects the answers from all the child nodes and combines them in a meaningful
way to form the primary output, which is the answer to the problem that was put to the system.
[Figure: input list, mapping function, output list.]
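As a rough sketch of the reduce step, the master can fold each worker's partial result and then fold those values into one answer. The function name `reduce_step`, the sample `worker_results`, and the choice of addition as the combining operation are assumptions made for illustration.

```python
from functools import reduce
import operator


def reduce_step(worker_results, combine=operator.add, initial=0):
    """Combine the partial results from all child nodes into one answer."""
    # First collapse each worker's list, then collapse the per-worker values.
    per_worker = [reduce(combine, result, initial) for result in worker_results]
    return reduce(combine, per_worker, initial)


# Hypothetical partial results handed back by three worker nodes
worker_results = [[1, 4, 9], [16, 25, 36], [49]]
total = reduce_step(worker_results)
```

What counts as a "meaningful" combination depends on the problem; a sum is only the simplest case.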
Map Reduce Data Flow
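[Figure: MapReduce data flow. The Input Format divides the input files into splits; each split is read by an RR and passed to a Map; the map output goes through the Partitioner and the sort to Reduce, and then to the Output Format.]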
If we zoom in on each part of the MapReduce framework, we see this is a large distributed sort.
The most important steps are defined as follows.
An input function
A map function
A Partition function
A compare/sort function
A reduce function
An output writer
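To show how the six pieces listed above fit together, here is a minimal single-process sketch of a word count. Every function name is hypothetical, the word-count task is an assumption chosen for illustration, and plain `sorted` stands in for the compare/sort function; a real framework would run these stages on separate machines.

```python
from collections import defaultdict
import zlib


def input_reader(text):
    """Input function: yield (line_number, line) pairs."""
    for number, line in enumerate(text.splitlines()):
        yield number, line


def map_fn(key, line):
    """Map function: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def partition_fn(key, num_reducers):
    """Partition function: choose a reducer index for the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reduce_fn(key, values):
    """Reduce function: sum the counts for one word."""
    yield key, sum(values)


def output_writer(pairs):
    """Output writer: here it just collects the results into a dict."""
    return dict(pairs)


def run_job(text, num_reducers=2):
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for in_key, in_value in input_reader(text):
        for out_key, out_value in map_fn(in_key, in_value):
            buckets[partition_fn(out_key, num_reducers)][out_key].append(out_value)
    results = []
    for bucket in buckets:
        for key in sorted(bucket):  # compare/sort step
            results.extend(reduce_fn(key, bucket[key]))
    return output_writer(results)


counts = run_job("the quick fox\nthe lazy dog\nthe end")
```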
A Closer Look at the Map and Partition Steps
The map function takes a series of key/value pairs and subdivides them further, emitting the
intermediate key/value pairs that the later steps work on.
Each map node's output is assigned to a particular reducer by the application’s partition function for
sharding purposes.
The partition function is given the key and the number of reducers, and returns the index of the
target reducer.
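A partition function of this shape can be sketched as a hash of the key taken modulo the number of reducers. This is one common choice, not necessarily what any particular framework does; `partition` and `shard` are hypothetical names, and `zlib.crc32` is used because Python's built-in `hash` for strings varies between runs.

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reducer responsible for `key`."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shard(map_output, num_reducers):
    """Route each (key, value) pair from a map node to its reducer's list."""
    shards = [[] for _ in range(num_reducers)]
    for key, value in map_output:
        shards[partition(key, num_reducers)].append((key, value))
    return shards


# Hypothetical output of one map node
map_output = [("apple", 1), ("pear", 1), ("apple", 1), ("plum", 1)]
shards = shard(map_output, 3)
```

The property that matters is that every pair with the same key lands on the same reducer, so that reducer sees all the values for that key.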
The input for each reducer is pulled from the machine where the map ran and is sorted using the
application’s comparison function.
The framework calls the application’s reduce function once for each unique key, in sorted
order. The reduce function can iterate through the values associated with that key and produce
zero or more outputs.
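The sort-then-reduce behaviour can be illustrated on a single reducer's input. In this sketch `compare_keys`, `reduce_fn`, and `run_reducer` are hypothetical names; the reducer deliberately drops keys seen fewer than two times to show that a call may produce zero outputs.

```python
from functools import cmp_to_key
from itertools import groupby


def compare_keys(a, b):
    """Hypothetical comparison function: plain ascending order."""
    return (a > b) - (a < b)


def reduce_fn(key, values):
    """Emit a total only for keys seen at least twice (zero or one output)."""
    total = sum(values)
    if total >= 2:
        yield key, total


def run_reducer(pairs):
    """Sort the pulled pairs by key, then call reduce_fn once per unique key."""
    to_sort_key = cmp_to_key(compare_keys)
    ordered = sorted(pairs, key=lambda kv: to_sort_key(kv[0]))
    output = []
    for key, group in groupby(ordered, key=lambda kv: kv[0]):
        output.extend(reduce_fn(key, (value for _, value in group)))
    return output


pairs = [("pear", 1), ("apple", 1), ("plum", 1), ("apple", 1), ("pear", 1)]
reduced = run_reducer(pairs)
```

Sorting first is what lets `groupby` see all values for a key as one contiguous run.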
The output writer writes the output of the reduce step to stable storage, usually a distributed file
system.
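A minimal sketch of an output writer follows, assuming a local temporary directory in place of a distributed file system. The function name `write_output`, the `part-00000` style file name, and the tab-separated layout are illustrative assumptions.

```python
import os
import tempfile


def write_output(pairs, out_dir, reducer_index):
    """Write one reducer's (key, value) pairs as tab-separated lines."""
    path = os.path.join(out_dir, f"part-{reducer_index:05d}")
    with open(path, "w", encoding="utf-8") as handle:
        for key, value in pairs:
            handle.write(f"{key}\t{value}\n")
    return path


# A temporary local directory stands in for stable storage here.
with tempfile.TemporaryDirectory() as out_dir:
    path = write_output([("apple", 2), ("pear", 2)], out_dir, 0)
    file_name = os.path.basename(path)
    with open(path, encoding="utf-8") as handle:
        written = handle.read()
```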