Map Reduce Examples
Here are a few simple examples of interesting programs that can be easily
expressed as MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern.
The reduce function is an identity function that just copies the supplied intermediate
data to the output.
Count of URL Access Frequency: The map function processes logs of web page
requests and outputs (URL, 1). The reduce function adds together all values for the
same URL and emits a (URL, total count) pair.
ReverseWeb-Link Graph: The map function outputs (target, source) pairs for each
link to a target URL found in a page named source. The reduce function
concatenates the list of all source URLs associated with a given target URL and
emits the pair: (target, list(source))
Term-Vector per Host: A term vector summarizes the most important words that
occur in a document or a set of documents as a list of (word, frequency) pairs. The
map function emits a (hostname, term vector) pair for each input document (where
the hostname is extracted from the URL of the document). The reduce function is
passed all per-document term vectors for a given host. It adds these term vectors
together, throwing away infrequent terms, and then emits a final (hostname, term
vector) pair.
Same but copied with formatted on next page.
Here are a few simple examples of interesting programs that can be easily expressed as
MapReduce computations.
Distributed Grep: The map function emits a line if it matches a supplied pattern. The
reduce function is an identity function that just copies the supplied intermediate data to the
output.
Count of URL Access Frequency: The map function processes logs of web page
requests and outputs (URL, 1). The reduce function adds together all values for the same
URL and emits a (URL, total count) pair.
Reverse Web-Link Graph: The map function outputs (target, source) pairs for
each link to a target URL found in a page named source. The reduce function
concatenates the list of all source URLs associated with a given target URL and emits
the pair: (target, list(source))
Term-Vector per Host: A term vector summarizes the most important words that occur
in a document or a set of documents as a list of (word, frequency) pairs. The map
function emits a (hostname, term vector) pair for each input document (where the
hostname is extracted from the URL of the document). The reduce function is passed
all per-document term vectors for a given host. It adds these term vectors together,
throwing away infrequent terms, and then emits a final (hostname, term vector)
pair.
Inverted Index: The map function parses each document, and emits a sequence of (word,
document ID) pairs. The reduce function accepts all pairs for a given word, sorts the
corresponding document IDs and emits a (word, list(document ID)) pair. The set of all
output pairs forms a simple inverted index. It is easy to augment this computation to keep
track of word positions.
Distributed Sort: The map function extracts the key from each record, and emits a (key,
record) pair. The reduce function emits all pairs unchanged. This computation depends on
the partitioning facilities described in Section 4.1 and the ordering properties described in
Section 4.2.
Strategy to solve MapReduce Problem
After grouping all the intermediate data, the values of all occurrences of the same key are
sorted and grouped together. As a result, after grouping each key, each key becomes unique in
all intermediate data. Therefore finding unique keys is the starting point to solving a typical
MapReduce problem. Then the intermediate (key, value) pairs as the output of the Map
function will be automatically found.
The following examples explain how to define keys and values in such problems:
Problem 1: Counting the number of occurrences of each word in a collection of documents
Solution: unique key: each word, intermediate value: number of occurrences
Problem 2: Counting the number of occurrences of words having the same size, or the same
number of letters, in a collection of documents
Solution: unique key: each word, intermediate value: size of the word
Problem 3: Counting the number of occurrences of anagrams in a collection of documents.
(Anagrams are words with the same set of letters but in a different order. (e.g. the words
listen and silent)
Solution: unique key: alphabetically sorted sequence of letters for each word (e.g. eilnst),
intermediate value: number of occurrences
6.2.2.4 Strategy to Solve MapReduce Problems
As mentioned earlier, aer grouping all the intermediate data, the values
of all occurrences of the same key are sorted and grouped together. As a
result, aer grouping, each key becomes unique in all intermediate data.
Therefore, finding unique keys is the starting point to solving a typical
MapReduce problem. Then the intermediate (key, value) pairs as the
output of the Map function will be automatically found. The following three
examples explain how to define keys and values in such problems:
Problem 1: Counting the number of occurrences of each word in a
collection of documents Solution: unique key: each word, intermediate
value: number of occurrences
Problem 2: Counting the number of occurrences of words having the
same size, or the same number of letters, in a collection of documents
Solution: unique key: each word, intermediate value: size of the word
Problem 3: Counting the number of occurrences of anagrams in a
collection of documents. Anagrams are words with the same set of letters
but in a different order (e.g., the words listen and silent). Solution:
unique key: alphabetically sorted sequence of letters for each word
(e.g., eilnst), intermediate value: number of occurrences
Transparent Programming Model
Programs written for cloud implementation need to be automatically parallelized and
executed on a large cluster of commodity machines.
The run-time system should take care of the details of partitioning the input data, scheduling
the program's execution across a set of machines, handling machine failures, and managing
the required inter-machine communication.
The programming model should allow programmers without many experiences with parallel
and distributed systems to easily utilize the resources of a large distributed system.
Scalable Data Processing on Large Clusters
A web programming model implemented for fast processing and generating large datasets
Applied mainly in web-scale search and cloud computing applications
Users specify a map function to generate a set of intermediate key/value pairs
Users use a reduce function to merge all intermediate values with the same intermediate
key.
Google MapReduce
Map, written by the user, takes an input pair and produces a set of intermediate key/value
pairs. The MapReduce library groups together all intermediate values associated with the
same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of
values for that key. It merges together these values to form a possibly smaller set of values.
Typically just zero or one output value is produced per Reduce invocation.
Hadoop : A software platform originally developed by Yahoo to enable users write and run
applications over vast distributed data.
Attractive Features in Hadoop :
Scalable : can easily scale to store and process petabytes of data in the Web space
Economical : An open-source MapReduce minimizes the overheads in task spawning and
massive data communication.
Efficient: Processing data with high-degree of parallelism across a large number of
commodity nodes
Reliable : Automatically maintains multiple copies of data to facilitate redeployment of
computing tasks on failures
Explain MapReduce with an example.
The computation takes a set of input key/value pairs, and produces a set of output key/value
pairs. The user of the MapReduce library expresses the computation as two functions: Map
and Reduce.
Map, written by the user, takes an input pair and produces a set of intermediate key/value
pairs. The MapReduce library groups together all intermediate values associated
with the same intermediate key I and passes them to the Reduce function.
The Reduce function, also written by the user, accepts an intermediate key I and a set of
values for that key. It merges together these values to form a possibly smaller set of values.
Typically just zero or one output value is produced per Reduce invocation. The intermediate
values are supplied to the user's reduce function via an iterator.
This allows us to handle lists of values that are too large to fit in memory.
Consider the problem of counting the n umber of occurrences of each word in a large
collection of documents. The user would write code similar to the following pseudo-code:
map(String key, String value):
// key: document name
// value: document contents
for each word w in value:
EmitIntermediate(w, "1");
reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(AsString(result));
The map function emits each word plus an associated count of occurrences (just 1 in this
simple example). The reduce function sums together all counts emitted for a particular
word.
In addition, the user writes code to fill in a mapreduce specification object with the names of
the input and out- put files, and optional tuning parameters. The user then invokes the
MapReduce function, passing it the specification object. The users code is linked together
with the MapReduce library (implemented in C++). Appendix A contains the full program
text for this example.
Even though the previous pseudo-code is written in terms of string inputs and outputs,
conceptually the map and reduce functions supplied by the user have associated types:
map
(k1,v1)
list(k2,v2)
reduce
(k2,list(v2))
list(v2)
I.e., the input keys and values are drawn from a different domain than the output keys and
values. Furthermore, the intermediate keys and values are from the same domain
as the output keys and values.
We have a large collection of text documents in a folder.
Count the frequency of distinct words in the documents.
Map function
Map function operates on every key/value pair of input data and
transforms the data based on the transformation logic provided in the
map function.
Map function always emits an intermediate key/value pair as output
Map( Key1, Value1) -> List ( Key2, Value2 )
For each file
Read each line from the input file
Locate each word
Emit the (word,1) for every word found
//The emitted (word, 1) will form the list that is output from the
Map function
Reduce function takes the list of every key and transforms the data based
on the (aggregation) logic provided in the reduce function. It is similar to
the Aggregate functions in Standard SQL.
For the List(key, value) output from the mapper . Shuffle and Sort the data
by key.
Group by Key and create the list of values for a key.
Reduce function
Reduce ( Key2, List(Value2) ) -> List (Key3, Value3 )
Read each key (word) and list of values (1, 1, 1,..) associated with it.
For each key add the list of values to calculate sum
Emit the word, sum for every word found