MapReduce: Simplified Data Processing on Large Clusters
By: Jeffrey Dean & Sanjay Ghemawat
In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI '04)
Also appears in Communications of the ACM (2008)
Presented by: Warunika Ranaweera
Supervised by: Dr. Nalin Ranasinghe
Jeffrey Dean
Ph.D. in Computer Science, University of Washington
Google Fellow in the Systems and Infrastructure Group
ACM Fellow
Research areas: distributed systems and parallel computing
Sanjay Ghemawat
Ph.D. in Computer Science, Massachusetts Institute of Technology
Google Fellow
Research areas: distributed systems and parallel computing
Calculate 30*50
Easy?
30*50 + 31*51 + 32*52 + 33*53 + .... + 40*60
A little bit harder?
Simple computation, but a huge data set
A real-world example of a large computation:
20+ billion web pages × 20 KB per page ≈ 400+ TB
One computer can read 30-35 MB/s from disk
Nearly four months to read the web on one machine
Parallelize the tasks in a distributed computing environment
The web-page problem is then solved in about 3 hours with 1,000 machines
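The arithmetic behind those figures, as a quick Python check (using the slide's 20 KB page size and a 35 MB/s read rate):

pages = 20e9          # 20+ billion web pages
page_bytes = 20e3     # 20 KB per page
disk_rate = 35e6      # ~35 MB/s sequential read from disk

total = pages * page_bytes                       # 4e14 bytes = 400 TB
days_one_machine = total / disk_rate / 86400     # ~132 days, i.e. ~4 months
hours_1000_machines = days_one_machine * 24 / 1000
print(f"{total / 1e12:.0f} TB, {days_one_machine:.0f} days on one machine, "
      f"{hours_1000_machines:.1f} h on 1,000 machines")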
Complexities in Distributed Computing
o How to parallelize the computation?
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing
A platform to hide the messy details of distributed computing, namely:
Parallelization
Fault tolerance
Data distribution
Load balancing
MapReduce is both:
A programming model
An implementation
Example: Word count

Document:
  the quick brown fox
  the fox ate the mouse

Mapped:
  (the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduced:
  (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
Eg: Word count using MapReduce

Input (one split per Map worker):
  "the quick brown fox"    "the fox ate the mouse"

Map:
  Map 1 emits (the, 1) (quick, 1) (brown, 1) (fox, 1)
  Map 2 emits (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)

Reduce (after grouping by word):
  (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
Map
Input: a text file (key: document name, value: document contents)
Output: intermediate key/value pairs, e.g. (fox, 1)

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
Reduce
Input: a word and the list of counts emitted by Map, e.g. (fox, {1, 1})
Output: the accumulated count, e.g. (fox, 2)

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
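To make the pseudocode concrete, here is a minimal runnable Python sketch of the same word count, with an in-memory dictionary standing in for MapReduce's shuffle/grouping step (the function names are illustrative, not Google's API):

from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: the list of counts for that word
    return (key, sum(values))

def map_reduce(documents):
    groups = defaultdict(list)          # the "shuffle": group values by key
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            groups[word].append(count)
    return [reduce_fn(word, counts) for word, counts in groups.items()]

docs = {"doc1": "the quick brown fox", "doc2": "the fox ate the mouse"}
print(map_reduce(docs))   # [('the', 3), ('quick', 1), ..., ('fox', 2), ...]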
Reverse Web-Link Graph
[Figure: source web pages 1-5, each containing a link to a target page (my web page)]
Reverse Web-Link Graph
Map: for each link in a source page, emit (target, source)
  (My Web, Source 1)
  (Not My Web, Source 2)
  (My Web, Source 3)
  (My Web, Source 4)
  (My Web, Source 5)
Reduce: collect all sources pointing to the same target
  (My Web, {Source 1, Source 3, .....})
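A similar runnable Python sketch for the reverse web-link graph (the pages and links here are invented for illustration):

from collections import defaultdict

def map_fn(source, targets):
    # Emit (target, source) for every outgoing link in the source page.
    for target in targets:
        yield (target, source)

def reduce_fn(target, sources):
    # Concatenate all sources that link to the same target.
    return (target, sources)

links = {                                  # page -> pages it links to
    "source1": ["my_web"],
    "source2": ["not_my_web"],
    "source3": ["my_web"],
}
groups = defaultdict(list)
for source, targets in links.items():
    for target, src in map_fn(source, targets):
        groups[target].append(src)
print([reduce_fn(t, s) for t, s in groups.items()])
# [('my_web', ['source1', 'source3']), ('not_my_web', ['source2'])]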
Execution Overview
(1) Fork: the user program forks the master and the workers.
(2) Assign: the master assigns map tasks and reduce tasks to idle workers.
(3) Read: each map worker reads its input split (Split 0 ... Split 4).
(4) Local write: map output is written to intermediate files on local disk.
(5) Remote read: reduce workers read the intermediate files across the network.
(6) Write: each reduce worker writes its output file (O/P File 0, O/P File 1).
Layers: input → map → intermediate files → reduce → output
Complexities in Distributed Computing, to be solved
o Automatic parallelization using Map & Reduce
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing
Restricted programming model
User-specified Map & Reduce functions
1000s of workers, different data sets
[Figure: user-defined Map/Reduce instructions dispatched to Worker 1, Worker 2, and Worker 3, each over a different part of the data]
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Handling failures
o Preserve bandwidth
o Load balancing
Master data structure
The master pushes information (meta-data, such as the locations of intermediate files) from Map workers to Reduce workers
[Figure: Map worker → information → Master → information → Reduce worker]
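A minimal sketch of the kind of per-task state the paper describes the master keeping; the class and field names here are illustrative, not from Google's implementation:

from dataclasses import dataclass, field

@dataclass
class Task:
    state: str = "idle"                 # idle, in-progress, or completed
    worker: str = ""                    # identity of the assigned worker machine
    intermediate_files: list = field(default_factory=list)

class Master:
    def __init__(self, n_map, n_reduce):
        self.map_tasks = [Task() for _ in range(n_map)]
        self.reduce_tasks = [Task() for _ in range(n_reduce)]

    def on_map_completed(self, task_id, file_locations):
        # Record where the map task left its intermediate files, so the
        # locations can be pushed incrementally to the reduce workers.
        task = self.map_tasks[task_id]
        task.state = "completed"
        task.intermediate_files = file_locations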
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth
o Load balancing
No response from a worker task?
If an ongoing Map or Reduce task: re-execute it
If a completed Map task: re-execute it (its output sits on the failed machine's local disk)
If a completed Reduce task: leave it untouched (its output is already in the global file system)
Master failure (unlikely)
Restart the computation
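These rules are simple enough to encode directly; a hypothetical sketch (the state and return labels are mine):

def on_worker_timeout(task_type, task_state):
    # Decide what to do when a worker stops responding.
    if task_state == "in-progress":
        return "re-execute"        # ongoing map or reduce work is lost
    if task_type == "map" and task_state == "completed":
        return "re-execute"        # output was on the dead machine's local disk
    return "keep"                  # completed reduce output is in the global file system

assert on_worker_timeout("map", "completed") == "re-execute"
assert on_worker_timeout("reduce", "completed") == "keep"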
Straggler: a machine that takes an unusually long time to complete one of the last tasks in the computation
Solution: redundant execution
Near the end of a phase, spawn backup copies of the remaining in-progress tasks
The copy that finishes first "wins"
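"First copy wins" can be sketched as follows; the task representation is invented for illustration:

def finish(task, worker):
    # Primary and backup copies race; the first to finish "wins",
    # and later completions of the same task are simply discarded.
    if task["state"] == "completed":
        return False               # another copy already won
    task["state"] = "completed"
    task["winner"] = worker
    return True

task = {"state": "in-progress"}
print(finish(task, "backup-worker"))   # True: the backup won
print(finish(task, "primary-worker"))  # False: result discarded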
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance (re-execution) & backup tasks
o Saves bandwidth through locality
o Load balancing
The same data set is stored (replicated) on several machines
If a task runs on a machine that holds its input data locally, it need not fetch the data from other nodes
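A toy sketch of locality-aware task assignment under that scheme (replica placement and worker names are invented):

def assign_map_task(split_replicas, idle_workers):
    # Prefer an idle worker that already stores a replica of the input split.
    for worker in idle_workers:
        if worker in split_replicas:
            return worker          # local read, no network transfer
    return idle_workers[0]         # otherwise any idle worker will do

replicas = {"workerA", "workerC"}  # machines holding this input split
print(assign_map_task(replicas, ["workerB", "workerC"]))   # -> workerC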
Complexities in Distributed Computing, solved
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance & backup tasks
o Saves bandwidth through locality
o Load balancing through granularity
Fine-grained tasks: many more map tasks than machines
One worker runs several tasks over the course of a job
Idle workers are quickly assigned new work
Refinements
Partitioning
Combining
Skipping bad records
Local execution for debugging
Counters
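Partitioning and combining are easy to illustrate; a sketch assuming the paper's default hash(key) mod R partitioning function and a word-count combiner (the Python code itself is mine):

import zlib
from collections import Counter

R = 4  # number of reduce tasks

def partition(key):
    # Default-style partitioning: hash(key) mod R.  CRC32 is used here
    # because Python's built-in hash() is salted per process.
    return zlib.crc32(key.encode()) % R

def combine(pairs):
    # Combiner: pre-aggregate map output locally before it crosses the
    # network, e.g. many ("the", 1) pairs collapse into one ("the", n).
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

print(partition("fox"))                               # reduce task index for "fox"
print(combine([("the", 1), ("fox", 1), ("the", 1)]))  # [('the', 2), ('fox', 1)]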
Performance (sort benchmark from the paper)
o Normal execution: 891 s
o No backup tasks: 1283 s (44% increase in time; very long tail, stragglers take >300 s to finish)
o 200 processes killed: 933 s (5% increase over normal execution; quick failure recovery)
MapReduce applications at Google
Clustering for Google News and Google Product Search
Google Maps
Locating addresses
Rendering map tiles
Google PageRank
Localized Search
Apache Hadoop MapReduce
Hadoop Distributed File System (HDFS)
Used in:
Yahoo! Search
Facebook
Amazon
Twitter
Google
Amazon Elastic MapReduce
Available to the general public
Process data in the cloud
Higher-level languages/systems based on Hadoop: Pig and Hive
A large variety of problems can be expressed as Map & Reduce computations
The restricted programming model makes it easy to hide the details of distributed computing
MapReduce achieves both scalability and programming efficiency