MapReduce
Programming Model
Adarsha Dhakal
Jaydeep Shah
Prakash Upadhyaya
Ritu Ratnam
Introduction
• MapReduce is a programming model introduced by Google for processing
and generating large data sets on clusters of computers.
• Google first formulated the framework to serve its Web page indexing,
where it replaced the earlier indexing algorithms.
• Beginner developers find the MapReduce framework beneficial because
library routines let them write parallel programs without worrying
about intra-cluster communication, task monitoring or failure handling.
• MapReduce runs on a large cluster of commodity machines and is highly
scalable.
• Implementations are available for several programming languages,
such as Java, C# and C++.
• MapReduce is a general-purpose programming model for data-
intensive computing.
• It was introduced by Google in 2004 to construct its web index.
• It is also used at Yahoo, Facebook and other companies. It uses a
parallel computing model that distributes computational tasks across a
large number of nodes (approximately 1,000 to 10,000).
• It is fault-tolerant. A job can keep working even when 1600 of 1800
nodes fail.
• The Hadoop framework from the Apache Software Foundation is an
implementation of the MapReduce programming model.
Phases for MapReduce
1. Input Splits
2. Mapping
3. Shuffling
4. Sorting
5. Reducing
Steps for MapReduce
• Step 1: Transform raw data into key/value pairs in parallel.
• The mapper reads the data file and uses the rating as the key and
the review as the value. Each review is counted as the number 1.
• Step 2: Shuffle and sort, performed by the MapReduce framework.
• Shuffling is the process of transferring the mappers’ intermediate
output to the reducers. It collects all the reviews (the number 1s)
under their individual key, and the result is sorted by key.
• Step 3: Process the data using Reduce.
• Reduce adds up the values (the number 1s) for each key.
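The three steps above can be sketched in plain Python. This is a minimal single-process illustration, not a Hadoop job: the function names (map_phase, shuffle_and_sort, reduce_phase) and the sample rows are hypothetical and are not taken from the hotel review data set.

```python
from collections import defaultdict


def map_phase(rows):
    """Step 1: emit (rating, 1) for every (review, rating) row."""
    for _review, rating in rows:
        yield rating, 1


def shuffle_and_sort(pairs):
    """Step 2: collect the 1s under their rating, then sort by rating."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(grouped):
    """Step 3: add up the 1s collected for each rating."""
    return {rating: sum(ones) for rating, ones in grouped}


# Hypothetical sample rows (review text, rating), made up for illustration.
sample_rows = [
    ("nice hotel", 4),
    ("terrible stay", 1),
    ("great location", 5),
    ("good value", 4),
    ("loved it", 5),
    ("clean rooms", 4),
]

rating_counts = reduce_phase(shuffle_and_sort(map_phase(sample_rows)))
print(rating_counts)  # {1: 1, 4: 3, 5: 2}
```

On a real cluster the map and reduce calls would run on many machines in parallel and the framework would perform the shuffle; here everything runs in one process to keep the data flow visible.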
• The map and reduce functions in the MapReduce model are not exactly
the same as their namesakes in functional programming.
• Map and Reduce functions in the MapReduce model:
• Map: processes a (key, value) pair and returns a list of
(intermediate key, value) pairs
map(k1, v1) → list(k2, v2)
• Reduce: merges all intermediate values that share the same
intermediate key
reduce(k2, list(v2)) → list(v3)
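The two signatures above can be written down as Python type hints. The sketch below is an assumption-laden illustration: run_mapreduce is a hypothetical in-memory driver standing in for the framework, and word counting is used only as a familiar example of a map/reduce pair.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],  # map(k1, v1) -> list(k2, v2)
    reduce_fn: Callable[[K2, List[V2]], List[V3]],  # reduce(k2, list(v2)) -> list(v3)
) -> Dict[K2, List[V3]]:
    """Hypothetical driver: apply map_fn, group by k2, then apply reduce_fn."""
    intermediate: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Keys are visited in sorted order, mirroring the sort step.
    return {k2: reduce_fn(k2, v2s) for k2, v2s in sorted(intermediate.items())}


def word_count_map(doc_name: str, text: str) -> List[Tuple[str, int]]:
    """k1 = document name, v1 = text; emits (word, 1) pairs."""
    return [(word, 1) for word in text.split()]


def word_count_reduce(word: str, counts: List[int]) -> List[int]:
    """Merges all counts for one word into a single total."""
    return [sum(counts)]


documents = [("doc1", "map reduce map"), ("doc2", "reduce")]
word_counts = run_mapreduce(documents, word_count_map, word_count_reduce)
print(word_counts)  # {'map': [2], 'reduce': [2]}
```

Unlike the functional-programming map, map_fn here may emit zero, one or many pairs per input, and reduce_fn is called once per distinct intermediate key rather than folding the whole collection.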
Basic Concept
• In the MapReduce model, the user has to write only two functions: map
and reduce.
• A few examples that can be easily expressed as MapReduce
computations:
• Distributed grep (an efficient way to use a Hadoop cluster to
find log messages hidden within terabytes of log data)
• Count of URL access frequency
• Inverted index
• Data mining
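Two of the computations listed above, distributed grep and the inverted index, can be sketched as map/reduce pairs. This is a single-process illustration under stated assumptions: run_mapreduce is a hypothetical stand-in for the framework, and the log lines and documents are invented sample data.

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, group by key, sort, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}


def make_grep_map(pattern):
    """Distributed grep: the map emits a line only if it matches the pattern."""
    regex = re.compile(pattern)

    def grep_map(file_name, line):
        return [(file_name, line)] if regex.search(line) else []

    return grep_map


def identity_reduce(key, values):
    """Grep needs no aggregation, so the reduce passes values through."""
    return values


def index_map(doc_id, text):
    """Inverted index: the map emits (word, document id) pairs."""
    return [(word, doc_id) for word in set(text.lower().split())]


def index_reduce(word, doc_ids):
    """The reduce collects the sorted, de-duplicated document ids per word."""
    return sorted(set(doc_ids))


# Invented sample inputs for illustration.
log_lines = [
    ("app.log", "INFO started"),
    ("app.log", "ERROR disk full"),
    ("db.log", "ERROR timeout"),
]
matches = run_mapreduce(log_lines, make_grep_map(r"ERROR"), identity_reduce)
print(matches)  # {'app.log': ['ERROR disk full'], 'db.log': ['ERROR timeout']}

docs = [("d1", "hadoop runs mapreduce"), ("d2", "spark and hadoop")]
index = run_mapreduce(docs, index_map, index_reduce)
print(index["hadoop"])  # ['d1', 'd2']
```

Counting URL access frequency follows the same shape as the rating count: the map emits (URL, 1) and the reduce sums the 1s for each URL.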
Advantages
• MapReduce parallelizes and distributes work automatically,
reducing the time required to run programs
• MapReduce provides fault tolerance by restarting and re-executing
failed map or reduce tasks and by writing output to a distributed
file system
• MapReduce is a cost-effective solution for processing data
• MapReduce processes large volumes of unprocessed data very quickly
• MapReduce uses a simple programming model that is easy to learn and
handles tasks efficiently
• MapReduce is flexible and can be used from several programming
languages supported by Hadoop to handle and store data
Limitations
• MapReduce is a low-level programming model that requires writing a
lot of code
• The batch-based processing nature of MapReduce makes it unsuitable for
real-time processing
• It does not support data pipelining or overlapping of Map and Reduce
functions
• Task initialization, coordination, monitoring, and scheduling take up a large
chunk of MapReduce's execution time and reduce its performance
• MapReduce cannot cache the intermediate data in memory, thereby
diminishing Hadoop’s performance
The data set has 20491 rows and 2 columns, and
the task is to count how many reviews each rating received.
MAPPING emits each rating with a counter of 1.
The pairs are then shuffled and sorted by rating.
REDUCING collapses the pairs into a smaller number of records.
Each rating ends up with its total count across the hotel review data.
Implementing MapReduce Programming Model
• Hadoop, developed by Apache
• Spark, developed by AMPLab at UC Berkeley
• Phoenix++, developed at Stanford University
• MARISSA (MApReduce Implementation for Streaming Science Applications),
developed at SUNY Binghamton
• DRYAD and DRYADLINQ, developed by Microsoft
• MapReduce-MPI, developed by Steve Plimpton (Sandia)
• Disco, developed by Nokia
• Themis, developed by Rasmussen et al.
• MR4C, developed by Skybox Imaging
Bibliography
• MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat, Google, Inc.
• MapReduce Tutorial, https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• Hadoop – MapReduce, https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
• MapReduce-Implementation-in-Python, https://github.com/rshah204/MapReduce-Implementation-in-Python/blob/master/MapReduce.ipynb
• Hotel Reviews, https://www.kaggle.com/datasets/yash10kundu/hotel-reviews?resource=download
• Map-Reduce Implementations: Survey and Performance Comparison, Zeba Khanam
and Shafali Agarwal, Department of Computer Application, JSSATE, Noida, IJCSIT Vol 7, No 4, August
2015
