KEMBAR78
Unit-2 MapReduce2024 | PDF | Map Reduce | Apache Hadoop
0% found this document useful (0 votes)
32 views41 pages

Unit-2 MapReduce2024

Heyy

Uploaded by

patelgaming0901
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views41 pages

Unit-2 MapReduce2024

Heyy

Uploaded by

patelgaming0901
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 41

2CEIT702: Big Data Analytics

MapReduce
(Hadoop Processing Component)
Hadoop

Challenge: Data is too big store in one computer


Hadoop Solution: Data is stored in multiple computer.
Challenge: Very high end machines are expensive
Hadoop solution: Run on commodity hardware
Challenge: commodity hardware will fail.
Hadoop Solution: Software is intelligent enough to deal with hardware failure.
Challenge: hardware failure may lead to data loss
Hadoop Solution: Replicate data
Challenge: how will the distributed nodes co-ordinate among themselves
Hadoop solution: Master node that co-ordinates all the worker nodes
Distributed processing (Traditional way)
• When the MapReduce framework was not there, how parallel and distributed processing
used to happen in a traditional way.
• Example-1: Weather dataset (Daily average temperature of the years from 2000 to
2018). Query: want to calculate the day having the highest temperature in each year.
• In traditional way:
Example-2: Counting election votes
Example
Challenges-Distributed processing (Traditional way)

• If, any of the machines delay the job, the whole work gets delayed (Critical path
problem).
• What if, any of the machines which are working with a part of data fails? The
management of this failover becomes a challenge (Reliability problem).
• how to equally divide the data such that no individual machine is overloaded or
underutilized (Equal split issue).
• There should be a mechanism to aggregate the result generated by each of the
machines to produce the final output (Aggregation of the result).

Above are the issues which we will have to take care individually while performing
parallel processing of huge data sets when using traditional approaches.
To overcome these issues, we have the MapReduce framework which allows us to
perform such parallel computations without bothering about the issues like reliability,
fault tolerance etc.
Main Hadoop Components
MapReduce
MapReduce: What It Is and Why It Is Important
• MapReduce is a Distributed Data Processing Algorithm, introduced by Google in it’s
MapReduce Tech Paper (2004).

• MapReduce is a programming framework that allows us to perform distributed and


parallel processing on large data sets in a distributed environment.

• MapReduce is a programming model that allows us to perform parallel and distributed


processing on huge data sets.
Example: Counting the votes
How does MapReduce work?
• Three main steps of MapReduce: Map Function, Shuffle Function, Reduce Function
• Map Function
• Map Function is the first step in MapReduce Algorithm.

• Map Function performs the following two sub-steps:

• Splitting (takes input DataSet from Source and divide into smaller Sub-DataSets)
• Mapping (On Sub-DataSets, perform required computation on each Sub-DataSet.)

• The output of this Map Function is a set of key and value pairs as <Key, Value>.
How does MapReduce work?

MapReduce First Step Output:


How does MapReduce work?

• Shuffle Function
• It is the second step in MapReduce Algorithm. Shuffle Function is also know as
“Combine Function”.
• It takes a list of outputs coming from “Map Function” and perform following two sub-
steps on each and every key-value pair.

• Merging: this step combines all key-value pairs which have same keys (that is
grouping key-value pairs by comparing “Key”). This step returns <Key,
List<Value>>.

• Sorting: this step takes input from Merging step and sort all key-value pairs by
using Keys. This step also returns <Key, List<Value>> output but with sorted key-
value pairs.

• Finally, Shuffle Function returns a list of <Key, List<Value>> sorted pairs to next step.
How does MapReduce work?

MapReduce Second Step Output:


How does MapReduce work?
• Reduce Function
• It is the final step in MapReduce Algorithm. It performs only one step : Reduce step.
• It takes list of <Key, List<Value>> sorted pairs from Shuffle Function and perform
reduce operation as shown below.

MapReduce Final Step Output:

• Final step output looks like first step output.


• However final step <Key, Value> pairs are different
than first step <Key, Value> pairs.
• Final step <Key, Value> pairs are computed and
sorted pairs.
MapReduce Example – Word Count
• Problem Statement:
• Count the number of occurrences of each word available in a DataSet.
• Input DataSet:

• Client Required Final Result:


• MapReduce – Map Function (Split Step)
• MapReduce – Map Function (Mapping Step)
• MapReduce – Shuffle Function (Merge Step)
• MapReduce – Shuffle Function (Merge Step)
• MapReduce – Shuffle Function (Sorting Step)
• MapReduce – Reduce Function (Reduce Step)
Example-2
Example-3
Example-4
How does MapReduce Work ?
From start to finish, there are four fundamental transformations.
Data is :
• 1. Transformed from the input files and fed into the mappers.
• 2. Transformed by the mappers.
• 3. Sorted, merged, and presented to the reducer.
• 4. Transform by reducers and written to output files.
• Most of the computing takes place on nodes with data on local disks that reduces
the network traffic. After completion of the given tasks, the cluster collects and
reduces the data to form an appropriate result, and sends it back to the Hadoop
server.
Input Output
Map <k1, v1> list (<k2, v2>)
Reduce <k2, list(v2)> list (<k3, v3>)
Map and Reduce Phases:
• A MapReduce program in Hadoop is called a Hadoop job.

• Map task is broken into Phases like:


• RecordReader,
• Mapper,
• Combiner,
• Partitioner

• Reducer task is broken into phases like:


• Shuffle,
• Sort,
• Reducer,
• O/P format
Technical details of Map and Reduce Phases
• Input Files: The data for a Map Reduce task is stored in input files and these input files
are generally stored in HDFS.
• InputFormat:
• Take the file from HDFS and divides the data into the number of splits (typically
64/128mb).
• InputFormat defines how the input files are split and read.
• InputFormat create the InputSplit.
• InputSplit:
• InputSplit is the logical representation of data.
• It represents the data which is processed by an individual Mapper.
• One map task is created for each Input Split. (Note: Number of InputSplits = Number
of map tasks).
• The split is divided into records and each record will be processed by the mapper.
Technical details of Map and Reduce Phases
• RecordReader:
• It communicates with InputSplit in and converts the data into key-value pairs
suitable for reading by the mapper.
• By default, it uses TextInputFormat for converting data into a key-value pair.
• It assigns byte offset (unique number) to each line present in the file. Then, these
key-value pairs are sent to the mapper for further processing.
Map and Reduce Phases:
• Mapper
• Map function works on the key-value pair produced by RecordReader and generate
zero or more intermediate key-value pairs.
• Mappers output is passed to the combiner for further process.

• Combiner
• It is optional function but provide high performance in terms of n/w bandwidth and
disk space.
• Like map output in some stage is <1,10>, <1,15>, <1,20>, <2,5>, <2,60> and the
purpose of map-reduce job is to find the maximum value corresponding to each
key.
• In combiner you can reduce this data to <1,20> , <2,60> as 20 and 60 are maximum
value for key 1 and key 2 respectively.
• It is an optimization technique for MapReduce Job.
• Output generated by combiner is intermediate data and it is passed to the reducer.
Map and Reduce Phases:
• Partitioner
• It happens after map phase and before reduce phase.
• Map task returns output in <key,value> form.
• Partitioner is club the data which should go to the same reducer based on keys.

• Example:
• If map output is <1,10> , <1,15> , <1,20> , <2, 13> , <2,6> , <4,8> , <4,20> etc.
• We can see that there are 3 different keys which are 1, 2 and 4.
• In map-reduce number of reduce tasks are fixed and each reduce task
should handle all the data related to one key.
• That means map output like <1,10>, <1,15>, <1,20> should be handled by same
reduce tasks.
• It is not possible that <1,10> is handled by one reduce task and <1,15> is handled
by another reduce task as key which is 1 is same.
Entire workflow of a job in Hadoop.
How MapReduce Organizes Work?
 Hadoop divides the job into two tasks. 1) Map-tasks(Splits & Mapping) 2)
Reduce-tasks(Shuffling, Reducing). This two tasks are controlled by two types of
entities:
 JobTracker: Act as single-master (responsible
for complete execution of submitted job). Each
job, there is one Jobtracker that resides on
Namenode.

 TaskTrackers: Act as one slaves/Cluster


Node, each of them performing the job as
directed by the master. Each job, there are
multiple Tasktrackers which reside on
Datanode.

Job submission process is as follows:


 Client (i.e., driver program) creates a job, configures it, and submits it to the
JobTracker
 JobClient computes input splits (on client end)
 Job data (jar, configuration XML) are sent to JobTracker
 JobTracker puts job data in a shared location, enqueues tasks
 TaskTrackers poll for tasks
Hadoop Map Reduce Architecture
How MapReduce Organizes Work?
• A job is divided into multiple tasks which are then run onto multiple data nodes in a cluster.

• It is the responsibility of job tracker to coordinate the activity by scheduling tasks to run on
different data nodes.

• Execution of individual task is then to look after by task tracker, which resides on every data
node executing part of the job.

• Task tracker's responsibility is to send the progress report to the job tracker.

• In addition, task tracker periodically sends 'heartbeat' signal to the Jobtracker so as to notify
him of the current state of the system.

• Thus job tracker keeps track of the overall progress of each job. In the event of task failure, the
job tracker can reschedule it on a different task tracker.
MapReduce Architecture explained in detail
• One map task is created for each split which then executes map function for each record
in the split.

• Execution of map tasks results into writing output to a local disk on the respective node
and not to HDFS.

• Map output is intermediate output which is processed by reduce tasks to produce the
final output.

• Once the job is complete, the map output can be thrown away. So, storing it in HDFS with
replication becomes overkill.
MapReduce Architecture explained in detail
• In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.

• An output of every map task is fed to the reduce task. Map output is transferred to the
machine where reduce task is running.

• On this machine, the output is merged and then passed to the user-defined reduce
function.

• Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the
local node and other replicas are stored on off-rack nodes). So, writing the reduce
output
Terminology used in Map Reduce
Applications implement the Map and the Reduce functions, and form the core of
Pay Load
the job.
Mapper Mapper maps the input key/value pairs to a set of intermediate key/value pair.

Named Node Node that manages the Hadoop Distributed File System (HDFS).

DataNode Node where data is presented in advance before any processing takes place.

Master Node Node where JobTracker runs and which accepts job requests from clients.

Slave Node Node where Map and Reduce program runs.

Job Tracker Schedules jobs and tracks the assign jobs to Task tracker.

Task Tracker Tracks the task and reports status to JobTracker.

Job A program is an execution of a Mapper and Reducer across a dataset.

Task An execution of a Mapper or a Reducer on a slice of data.

Task Attempt A particular instance of an attempt to execute a task on a SlaveNode.


MapReduce key benefits

Benefit Description

Simplicity Developers have freedom to write applications in their language of choice, such as Java, C++
or Python, and MapReduce jobs are easy to run.
Scalability MapReduce can process huge amount of data (in terms of petabytes and more) , stored in
HDFS on one cluster
Speed It uses parallel processing it means that MapReduce can take problems that used to take
days to solve which can be solved in hours or minutes

MapReduce also takes care of failures. If a machine with one copy of the data is unavailable
Recovery than another machine has a copy of the same key/value pair, which can be used to solve the
same sub-task. The JobTracker keeps track of these all processes.

MapReduce moves are able to compute processes to the data on HDFS and not the other way
Minimal around. Processing tasks can occur on the physical node where the data resides. This
data motion significantly reduces the network I/O patterns and contributes to Hadoop’s processing
speed.
• Business and other organizations run calculations to:

• Determine the price for their products that yields the highest profits.

• Know precisely how effective their advertising is and where they should spend their
ad dollars.

• Make long-range weather predictions.

• Web clicks, sales records purchased from retailers, and Twitter trending topics to
determine what new products the company should produce in the upcoming season.
MapReduce Use Case: Global Warming
Query: we want to know how much global warming has raised the ocean’s temperature.
• Input temperature readings from thousands of OceanSignals all over the globe.
(OceanSignals, DateTime, longitude, latitude, lowTemp, highTemp)

• Run a map over every OceanSignals -dateTime reading and add the average temperature as a
field:
(OceanSignals, DateTime, longitude, latitude, lowTemp, highTemp, Average)

• Drop DateTime column and produce one average temperature for each OceanSignals
(OceanSignal N, Average) //like, (OceanSignal 1, Average) , (OceanSignal 2, Average)

• Then the reduce operation runs.


Ocean Average Temperature = Average (OceanSignal n) + Average (OceanSignal n-1) + … +
Average (OceanSignal 2) + Average (OceanSignal 1) / number of OceanSignals

You might also like