Cloud Application
Development
(UNIT-4: PROGRAMMING MODEL)
Parallel and Distributed Programming Paradigms
• Distributed Computing: A distributed computing system is a set of computational engines
connected by a network to achieve a common goal of running a job or an application. A
computer cluster or network of workstations is an example of a distributed computing
system.
• Parallel Computing: Parallel computing is the simultaneous use of more than one
computational engine (not necessarily connected via a network) to run a job or an
application. For instance, parallel computing may use either a distributed or a non-distributed
computing system, such as a multiprocessor platform.
• Running a parallel program on a distributed computing system (parallel and distributed
programming) has several advantages for both users and distributed computing systems.
From the users’ perspective, it decreases application response time; from the distributed
computing systems’ standpoint, it increases throughput and resource utilization.
• Running a parallel program on a distributed computing system, however, could be a very
complicated process.
Parallel Computing and Programming Paradigms
The system issues for running a typical parallel program in either a parallel or a distributed
manner would include the following:
Partitioning: This is applicable to both computation and data as follows:
• Computation partitioning: This splits a given job or a program into smaller tasks.
Partitioning greatly depends on correctly identifying portions of the job or program that
can be performed concurrently. Different parts may process different data or a copy of
the same data.
• Data partitioning: This splits the input or intermediate data into smaller pieces. Data
pieces may be processed by different parts of a program or a copy of the same program.
Mapping: This assigns either the smaller parts of a program or the smaller pieces of data to
underlying resources. This process aims to appropriately assign such parts or pieces to be
run simultaneously on different workers and is usually handled by resource allocators in the
system.
Parallel Computing and Programming Paradigms
• Synchronization: Because different workers may perform different tasks,
synchronization and coordination among workers is necessary so that race conditions
are prevented and data dependency among different workers is properly managed.
Multiple accesses to a shared resource by different workers may raise race
conditions, whereas data dependency happens when a worker needs the processed
data of other workers.
• Communication: Because workers may depend on data produced by other workers,
communication is triggered whenever intermediate data needs to be transferred
among them.
• Scheduling: For a job or program, when the number of computation parts (tasks) or
data pieces is more than the number of available workers, a scheduler selects a
sequence of tasks or data pieces to be assigned to the workers. The resource
allocator performs the actual mapping of the computation or data pieces to workers,
while the scheduler only picks the next part from the queue of unassigned tasks
based on a set of rules called the scheduling policy. For multiple jobs or programs, a
scheduler selects a sequence of jobs or programs to be run on the distributed
computing system. Scheduling is also necessary when system resources are not
sufficient to simultaneously run multiple jobs or programs.
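The interplay of partitioning, mapping, and scheduling can be illustrated with a minimal, self-contained Java sketch. This is not taken from any particular framework; the data set, chunk size, and pool size are arbitrary choices for illustration. It splits an input array into pieces (data partitioning), defines a per-piece task (computation partitioning), and lets a fixed pool of worker threads act as resource allocator and scheduler, picking the next unassigned piece from the queue.

    import java.util.*;
    import java.util.concurrent.*;

    public class PartitionAndSchedule {
        public static void main(String[] args) throws Exception {
            // Input data: 0..99 (placeholder for a real data set).
            int[] data = new int[100];
            for (int i = 0; i < data.length; i++) data[i] = i;

            // Data partitioning: split the input into pieces of 25 elements each.
            int pieceSize = 25;
            List<int[]> pieces = new ArrayList<>();
            for (int start = 0; start < data.length; start += pieceSize) {
                pieces.add(Arrays.copyOfRange(data, start,
                        Math.min(start + pieceSize, data.length)));
            }

            // Mapping + scheduling: 2 workers process 4 pieces; the executor's queue
            // plays the role of the scheduler picking the next unassigned piece.
            ExecutorService workers = Executors.newFixedThreadPool(2);
            List<Future<Long>> partialSums = new ArrayList<>();
            for (int[] piece : pieces) {
                // Computation partitioning: each task sums one piece independently.
                partialSums.add(workers.submit(() -> Arrays.stream(piece).asLongStream().sum()));
            }

            // Synchronization/communication: collect the partial results.
            long total = 0;
            for (Future<Long> f : partialSums) total += f.get();
            workers.shutdown();
            System.out.println("Total = " + total);   // prints 4950
        }
    }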
Map Reduce
• MapReduce is a software framework which
supports parallel and distributed computing
on large data sets.
• This software framework abstracts the data
flow of running a parallel program on a
distributed computing system by providing
users with two interfaces in the form of two
functions: Map and Reduce.
• Users can override these two functions to
interact with and manipulate the data flow of
running their programs. Figure 4.1 illustrates
the logical data flow from the Map to the
Reduce function in MapReduce frameworks.
• In this framework, the “value” part of the data, (key, value), is the actual data, and the “key” part is only used by the MapReduce controller to control the data flow.
Fig 4.1: MapReduce framework: Input data flows through the Map and Reduce functions to generate the output result under the control flow using the MapReduce software library.
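In informal notation (following the original MapReduce formulation), the two user-defined functions have the following shapes; exact type signatures vary by framework:

    map:    (key1, value1)        ->  list of (key2, value2)   // emits intermediate pairs
    reduce: (key2, list<value2>)  ->  list of (value2)         // aggregates values sharing a key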
Map Reduce
Fig 4.2: MapReduce Architecture
Map Reduce
• A MapReduce job is mainly divided into two phases: the Map phase and the Reduce phase.
• Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the Map phase is itself a set of key-value pairs, where the key may be an identifier
(such as an address or record ID) and the value is the actual data it holds.
The Map() function is executed on each of these input key-value pairs and generates
intermediate key-value pairs, which serve as input for the Reducer (the Reduce() function).
• Reduce: The intermediate key-value pairs produced by the mappers are shuffled, sorted,
and sent to the Reduce() function. The Reducer aggregates or groups the data by key
according to the reducer logic written by the developer.
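To make the two phases concrete, below is a minimal word-count Mapper and Reducer written against the Hadoop Java API (a standard illustrative example, not taken from these slides). The Mapper emits (word, 1) pairs; after shuffle and sort, the Reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for each input line, emit (word, 1) for every word in the line.
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // intermediate (key, value) pair
            }
        }
    }

    // Reduce phase: receives (word, [1, 1, ...]) after shuffle/sort and emits (word, count).
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);            // final output pair
        }
    }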
Map Reduce
How the Job Tracker and the Task Tracker deal with MapReduce:
• Job Tracker: The Job Tracker manages all the resources and all the jobs across the cluster,
and schedules each Map task on a Task Tracker running on the same DataNode, since there
can be hundreds of DataNodes available in the cluster.
• Task Tracker: Task Trackers are the slaves that work on the instructions given by the Job
Tracker. A Task Tracker is deployed on each node available in the cluster and executes the
Map and Reduce tasks as instructed by the Job Tracker.
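From the developer's side, this machinery is driven through a job configuration submitted to the cluster. A minimal driver for the word-count classes sketched above might look like the following (classic Hadoop style; the Job Tracker then schedules the submitted job, and the input/output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local combiner
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(2);                    // R = number of reduce tasks

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

            // Submit the job; the cluster's Job Tracker schedules the map and reduce
            // tasks on Task Trackers, preferring nodes that hold the input blocks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }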
Map Reduce
MapReduce Actual Data and Control Flow
The main responsibility of the MapReduce framework is to efficiently run a user’s program on a
distributed computing system. Therefore, the MapReduce framework meticulously handles all
partitioning, mapping, synchronization, communication, and scheduling details of such data flows. It is
summarized in the following steps:
1. Data partitioning: The MapReduce library splits the input data (files), already stored in GFS, into
M pieces, where M also corresponds to the number of map tasks.
2. Computation partitioning: This is implicitly handled by obliging users to write their programs in
the form of the Map and Reduce functions. Therefore, the MapReduce library only generates
copies of a user program containing the Map and the Reduce functions, distributes them, and
starts them up on a number of available computation engines.
3. Determining the master and workers: The MapReduce architecture is based on a master-worker
model. Therefore, one of the copies of the user program becomes the master and the rest
become workers.
Map Reduce
MapReduce Actual Data and Control Flow
4. Reading the input data (data distribution): Each map worker reads its corresponding portion
of the input data, namely the input data split, and sends it to its Map function.
5. Map function: Each Map function receives the input data split as a set of (key, value) pairs to
process and produces the intermediate (key, value) pairs.
6. Combiner function: The Combiner is an optional local function in the map worker that
pre-processes intermediate (key, value) pairs using the same logic as the Reduce function.
Invoked by the user, it merges local data before network transfer, reducing communication
cost. Like the Reduce phase, MapReduce sorts and groups data before applying the
Combiner.
7. Partitioning function: In MapReduce, intermediate (key, value) pairs with the same key must
be processed by the same Reduce task. Since multiple map tasks may generate such pairs, a
Partitioning function is used to divide the output of each map task into R regions (R = number
of reduce tasks), ensuring all pairs with the same key go to the same region. Each reduce task
then collects data from its corresponding region across all map tasks. The master node keeps
track of these partitions to route data correctly to the reduce workers (a minimal sketch of such
a partitioning function appears after step 11).
Map Reduce
MapReduce Actual Data and Control Flow
8. Synchronization: MapReduce applies a simple synchronization policy to coordinate map workers
with reduce workers, in which the communication between them starts when all map tasks
finish.
9. Communication: Reduce worker i, already notified of the location of region i of all map workers,
uses remote procedure calls to read the data from the respective region of each map worker.
Since every reduce worker reads data from every map worker, this results in all-to-all
communication among the map and reduce workers, which can cause network congestion and is
one of the major bottlenecks limiting the performance of such systems.
10. Sorting and Grouping: Once a reduce worker finishes reading its input data, it buffers the data
locally, then sorts and groups the intermediate (key, value) pairs by key. Sorting is needed because
the map workers typically generate many more unique keys than there are regions (R), so each
region contains pairs with several different keys.
11. Reduce function: The reduce worker iterates over the grouped (key, value) pairs and, for each
unique key, sends the key and its corresponding values to the Reduce function. This function then
processes its input and stores the output results in predetermined files of the user's program.
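As a concrete illustration of step 7: Hadoop's default partitioning simply hashes the key and takes the remainder modulo R, so all pairs with the same key land in the same region. A minimal custom Partitioner with the same behavior might look like this (illustrative only):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partitioning function (step 7): maps an intermediate (key, value) pair to one of
    // R regions, where R is the number of reduce tasks (numPartitions).
    public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the result is a non-negative region index in [0, R).
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }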
Twister and Iterative Map Reduce
Why Traditional MapReduce Falls Short?
• Designed for batch processing.
• Each MapReduce job is stateless: after each iteration, data is written to disk and reloaded in
the next iteration.
• This causes high I/O overhead and inefficiency for algorithms that need multiple passes
over data (e.g., K-Means, PageRank, Gradient Descent).
Twister: Iterative MapReduce for Efficient Computation
Twister is a lightweight MapReduce runtime designed to efficiently support iterative
computations.
Key Features:
• Static Data Support: Input data is loaded once and reused across iterations (in-memory).
• Publish/Subscribe Communication: For faster data exchange between tasks.
• Long-running Map/Reduce Tasks: Map/reduce tasks can persist across iterations, reducing
startup costs.
• Intermediate Results In-Memory: Avoids the overhead of writing intermediate results to
disk.
Twister and Iterative Map Reduce
Advantages:
• Great for Iterative Algorithms: Twister significantly improves performance on algorithms
like K-Means, PageRank, and SVM.
• Better than Hadoop for Iterations: Because Hadoop writes to disk between iterations,
Twister can be 10x–100x faster in some cases.
Example Use Case:
• K-Means Clustering:
• Data is loaded once into mappers.
• In each iteration, new cluster centers are computed in reducers.
• Iterations continue until convergence — all done without reloading data every time.
Twister and Iterative Map Reduce
Iterative MapReduce (General Concept)
• While Twister is a specific implementation, Iterative MapReduce is a broader concept that
extends MapReduce with native support for iterations.
Basic Workflow:
• Initial MapReduce job is run with input data.
• Loop control is added (e.g., based on convergence criteria).
• Intermediate state (e.g., model parameters, centroids) is passed between iterations.
• Execution continues until the loop terminates.
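A hedged sketch of this workflow as a driver loop is shown below. The helper methods (runClusteringIteration, hasConverged) and the file-naming scheme are hypothetical placeholders, not the API of Twister or Hadoop; the point is only that the loop re-runs a MapReduce-style job, feeding each iteration's output (e.g., new centroids) into the next until convergence.

    // Illustrative iterative-MapReduce driver loop (K-Means-style); all helper
    // methods and paths are hypothetical placeholders.
    public class IterativeDriver {

        public static void main(String[] args) throws Exception {
            String inputData  = "hdfs:///data/points";        // static input data
            String centroids  = "hdfs:///kmeans/centroids-0"; // initial model parameters
            int maxIterations = 20;

            for (int i = 0; i < maxIterations; i++) {
                String newCentroids = "hdfs:///kmeans/centroids-" + (i + 1);

                // One MapReduce pass: map assigns points to the nearest centroid,
                // reduce recomputes the centroid of each cluster.
                runClusteringIteration(inputData, centroids, newCentroids);

                // Loop control: stop when the centroids no longer move significantly.
                if (hasConverged(centroids, newCentroids)) break;
                centroids = newCentroids;   // intermediate state passed to the next iteration
            }
        }

        // Hypothetical helpers; in Hadoop these would configure and run a Job,
        // in Twister they would reuse long-running, in-memory map/reduce tasks.
        static void runClusteringIteration(String input, String oldCentroids, String newCentroids) { /* ... */ }
        static boolean hasConverged(String oldCentroids, String newCentroids) { return false; }
    }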
Hadoop Library from Apache
• Hadoop is an open source implementation of MapReduce coded and released
in Java by Apache. The Hadoop implementation of MapReduce uses the
Hadoop Distributed File System (HDFS) as its underlying layer.
• The Hadoop core is divided into two fundamental layers: the MapReduce
engine and HDFS.
• The MapReduce engine is the computation engine running on top of HDFS as
its data storage manager.
• HDFS: HDFS is a distributed file system inspired by GFS that organizes files and
stores their data on a distributed computing system.
Hadoop Library from Apache
HDFS Architecture
• HDFS has a master/slave architecture containing a single NameNode as the master and a
number of DataNodes as workers (slaves).
• To store a file in this architecture, HDFS splits the file into fixed-size blocks (e.g., 64 MB) and
stores them on workers (DataNodes). The mapping of blocks to DataNodes is determined
by the NameNode.
• The NameNode (master) also manages the file system’s metadata and namespace. In such
systems, the namespace is the area maintaining the metadata.
• Metadata refers to all the information stored by a file system that is needed for overall
management of all files.
• For example, NameNode in the metadata stores all information regarding the location of
input splits/blocks in all DataNodes.
• Each DataNode, usually one per node in a cluster, manages the storage attached to the
node. Each DataNode is responsible for storing and retrieving its file blocks.
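From a client's point of view, block splitting, placement, and replication are transparent: a program simply writes to and reads from a path, and the NameNode and DataNodes handle the rest. Below is a minimal sketch using the Hadoop FileSystem Java API; the file path and replication value are arbitrary examples.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt");

            // Write: HDFS splits the stream into blocks, and the NameNode decides
            // which DataNodes store each block and its replicas.
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Optionally change the replication factor for this file.
            fs.setReplication(file, (short) 2);

            // Read: the client asks the NameNode for block locations, then streams
            // the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }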
Hadoop Library from Apache
HDFS Features
• Fault Tolerance: Data is replicated; if a DataNode fails, data is read from other replicas.
• High Throughput : Optimized for batch processing and large streaming reads.
• Scalability: Easily scalable to thousands of nodes and petabytes of data.
• Write Once, Read Many: Optimized for datasets where data is written once and read
multiple times.
• Data Locality: Computation is moved to the location of the data to minimize data
transfer.
Hadoop Library from Apache
HDFS Fault Tolerance
• One of the main aspects of HDFS is its fault tolerance. Since Hadoop is designed to be
deployed on low-cost commodity hardware, hardware failures are considered common
rather than exceptional. Therefore, Hadoop addresses the following issues to fulfill the
reliability requirements of the file system:
1. Block replication: To reliably store data in HDFS, file blocks are replicated in
this system. In other words, HDFS stores a file as a set of blocks and each
block is replicated and distributed across the whole cluster. The replication
factor is set by the user and is three by default.
Hadoop Library from Apache
2. Replica Placement: To ensure fault tolerance, HDFS places replicas on different nodes,
preferably across racks. However, cross-rack communication is costly, so HDFS balances
reliability and efficiency. With the default replication factor of three, one replica is stored
on the local node, another on a different node within the same rack, and the third on a
node in a different rack—offering fault tolerance with reduced communication overhead.
3. Heartbeat and Blockreport messages: Heartbeats and Blockreports are periodic
messages sent to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly, while each Blockreport contains a list
of all blocks on a DataNode. The NameNode receives such messages because it is the sole
decision maker of all replicas in the system.
4. Large blocks and high throughput: HDFS provides high-throughput access to large data sets
by favoring batch processing over low latency. Files are split into large blocks (e.g., 64 MB) to
reduce metadata and improve performance. Fewer, larger blocks lower metadata overhead
and enable fast, sequential streaming reads.
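As a small worked example covering points 1 and 4: with 64 MB blocks and the default replication factor of three, a 200 MB file is split into four blocks (64 + 64 + 64 + 8 MB) and stored as twelve block replicas across the cluster; likewise, a 1 GB file occupies only 16 blocks, so the NameNode tracks just 16 block entries (times the replication factor), whereas 4 MB blocks would require 256 entries and many more seeks during streaming reads.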
Hadoop Library from Apache
2. Replica Placement: To ensure fault tolerance, HDFS places replicas on different nodes,
preferably across racks. However, cross-rack communication is costly, so HDFS balances
reliability and efficiency. With the default replication factor of three, one replica is stored
on the local node, another on a different node within the same rack, and the third on a
node in a different rack—offering fault tolerance with reduced communication overhead.
3. Heartbeat and Blockreport messages: Heartbeats and Blockreports are periodic
messages sent to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly, while each Blockreport contains a list
of all blocks on a DataNode. The NameNode receives such messages because it is the sole
decision maker of all replicas in the system.
4. HDFS provides high-throughput access to large data sets by focusing on batch processing
over low latency. Files are split into large blocks (e.g., 64MB) to reduce metadata and
improve performance. Fewer, larger blocks lower metadata overhead and enable fast,
sequential streaming reads.
21