Cloud Application
Development
(UNIT-4: PROGRAMMING MODEL)
Parallel and Distributed Programming Paradigms
• Distributed Computing: A distributed computing system is a set of computational engines
connected by a network to achieve a common goal of running a job or an application. A
computer cluster or network of workstations is an example of a distributed computing
system.
• Parallel Computing: Parallel computing is the simultaneous use of more than one
computational engine (not necessarily connected via a network) to run a job or an
application. For instance, parallel computing may use either a distributed or a non-distributed
computing system, such as a multiprocessor platform.
• Running a parallel program on a distributed computing system (parallel and distributed
programming) has several advantages for both users and distributed computing systems.
From the users’ perspective, it decreases application response time; from the distributed
computing systems’ standpoint, it increases throughput and resource utilization.
• Running a parallel program on a distributed computing system, however, could be a very
complicated process.
Parallel Computing and Programming Paradigms
The system issues for running a typical parallel program in either a parallel or a distributed
manner would include the following:
Partitioning: This is applicable to both computation and data as follows:
• Computation partitioning: This splits a given job or a program into smaller tasks.
Partitioning greatly depends on correctly identifying portions of the job or program that
can be performed concurrently. Different parts may process different data or a copy of
the same data.
• Data partitioning: This splits the input or intermediate data into smaller pieces. Data
pieces may be processed by different parts of a program or a copy of the same program.
Mapping: This assigns either the smaller parts of a program or the smaller pieces of data to
underlying resources. This process aims to appropriately assign such parts or pieces to be
run simultaneously on different workers and is usually handled by resource allocators in the
system.
Parallel Computing and Programming Paradigms
• Synchronization: Because different workers may perform different tasks,
synchronization and coordination among workers is necessary so that race conditions
are prevented and data dependency among different workers is properly managed.
Multiple accesses to a shared resource by different workers may raise race
conditions, whereas data dependency happens when a worker needs the processed
data of other workers.
• Communication: Because workers may depend on data produced by other workers,
communication is triggered whenever intermediate data needs to be transferred
among them.
• Scheduling: For a job or program, when the number of computation parts (tasks) or
data pieces is more than the number of available workers, a scheduler selects a
sequence of tasks or data pieces to be assigned to the workers. The resource
allocator performs the actual mapping of the computation or data pieces to workers,
while the scheduler only picks the next part from the queue of unassigned tasks
based on a set of rules called the scheduling policy. For multiple jobs or programs, a
scheduler selects a sequence of jobs or programs to be run on the distributed
computing system. Scheduling is also necessary when system resources are not
sufficient to simultaneously run multiple jobs or programs.
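The interplay of partitioning, mapping, and scheduling can be illustrated with a minimal, self-contained Java sketch. This is not taken from any particular framework; the data set, chunk size, and pool size are arbitrary choices for illustration. It splits an input array into pieces (data partitioning), defines a per-piece task (computation partitioning), and lets a fixed pool of worker threads act as resource allocator and scheduler, picking the next unassigned piece from the queue.

    import java.util.*;
    import java.util.concurrent.*;

    public class PartitionAndSchedule {
        public static void main(String[] args) throws Exception {
            // Input data: 0..99 (placeholder for a real data set).
            int[] data = new int[100];
            for (int i = 0; i < data.length; i++) data[i] = i;

            // Data partitioning: split the input into pieces of 25 elements each.
            int pieceSize = 25;
            List<int[]> pieces = new ArrayList<>();
            for (int start = 0; start < data.length; start += pieceSize) {
                pieces.add(Arrays.copyOfRange(data, start,
                        Math.min(start + pieceSize, data.length)));
            }

            // Mapping + scheduling: 2 workers process 4 pieces; the executor's queue
            // plays the role of the scheduler picking the next unassigned piece.
            ExecutorService workers = Executors.newFixedThreadPool(2);
            List<Future<Long>> partialSums = new ArrayList<>();
            for (int[] piece : pieces) {
                // Computation partitioning: each task sums one piece independently.
                partialSums.add(workers.submit(() -> Arrays.stream(piece).asLongStream().sum()));
            }

            // Synchronization/communication: collect the partial results.
            long total = 0;
            for (Future<Long> f : partialSums) total += f.get();
            workers.shutdown();
            System.out.println("Total = " + total);   // prints 4950
        }
    }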
Map Reduce
• MapReduce is a software framework which
supports parallel and distributed computing
on large data sets.
• This software framework abstracts the data
flow of running a parallel program on a
distributed computing system by providing
users with two interfaces in the form of two
functions: Map and Reduce.
• Users can override these two functions to
interact with and manipulate the data flow of
running their programs. Figure 4.1 illustrates
the logical data flow from the Map to the
Reduce function in MapReduce frameworks.
• In this framework, the “value” part of the data, (key, value), is the actual data, and the “key” part is only used by the MapReduce controller to control the data flow.
Fig 4.1: MapReduce framework: Input data flows through the Map and Reduce functions to generate the output result under the control flow using the MapReduce software library.
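In informal notation (following the original MapReduce formulation), the two user-defined functions have the following shapes; exact type signatures vary by framework:

    map:    (key1, value1)        ->  list of (key2, value2)   // emits intermediate pairs
    reduce: (key2, list<value2>)  ->  list of (value2)         // aggregates values sharing a key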
Map Reduce
Fig 4.2: MapReduce Architecture
Map Reduce
• A MapReduce job is mainly divided into two phases: the Map phase and the Reduce phase.
• Map: As the name suggests, its main use is to map the input data into key-value pairs. The
input to the Map phase is itself a set of key-value pairs, where the key may be an identifier
(such as an address or record ID) and the value is the actual data it holds.
The Map() function is executed on each of these input key-value pairs and generates
intermediate key-value pairs, which serve as input for the Reducer (the Reduce() function).
• Reduce: The intermediate key-value pairs produced by the mappers are shuffled, sorted,
and sent to the Reduce() function. The Reducer aggregates or groups the data by key
according to the reducer logic written by the developer.
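To make the two phases concrete, below is a minimal word-count Mapper and Reducer written against the Hadoop Java API (a standard illustrative example, not taken from these slides). The Mapper emits (word, 1) pairs; after shuffle and sort, the Reducer sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: for each input line, emit (word, 1) for every word in the line.
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // intermediate (key, value) pair
            }
        }
    }

    // Reduce phase: receives (word, [1, 1, ...]) after shuffle/sort and emits (word, count).
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);            // final output pair
        }
    }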
Map Reduce
How the Job Tracker and the Task Tracker deal with MapReduce:
• Job Tracker: The Job Tracker manages all the resources and all the jobs across the cluster,
and schedules each Map task on a Task Tracker running on the same DataNode, since there
can be hundreds of DataNodes available in the cluster.
• Task Tracker: Task Trackers are the slaves that work on the instructions given by the Job
Tracker. A Task Tracker is deployed on each node available in the cluster and executes the
Map and Reduce tasks as instructed by the Job Tracker.
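From the developer's side, this machinery is driven through a job configuration submitted to the cluster. A minimal driver for the word-count classes sketched above might look like the following (classic Hadoop style; the Job Tracker then schedules the submitted job, and the input/output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional local combiner
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(2);                    // R = number of reduce tasks

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory

            // Submit the job; the cluster's Job Tracker schedules the map and reduce
            // tasks on Task Trackers, preferring nodes that hold the input blocks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }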
Map Reduce
MapReduce Actual Data and Control Flow
The main responsibility of the MapReduce framework is to efficiently run a user’s program on a
distributed computing system. Therefore, the MapReduce framework meticulously handles all
partitioning, mapping, synchronization, communication, and scheduling details of such data flows. It is
summarized in the following steps:
1. Data partitioning: The MapReduce library splits the input data (files), already stored in GFS, into
M pieces, where M also corresponds to the number of map tasks.
2. Computation partitioning: This is implicitly handled by obliging users to write their programs in
the form of the Map and Reduce functions. Therefore, the MapReduce library only generates
copies of a user program containing the Map and the Reduce functions, distributes them, and
starts them up on a number of available computation engines.
3. Determining the master and workers: The MapReduce architecture is based on a master-worker
model. Therefore, one of the copies of the user program becomes the master and the rest
become workers.
Map Reduce
MapReduce Actual Data and Control Flow
4. Reading the input data (data distribution): Each map worker reads its corresponding portion
of the input data, namely the input data split, and sends it to its Map function.
5. Map function: Each Map function receives the input data split as a set of (key, value) pairs to
process and produces the intermediate (key, value) pairs.
6. Combiner function: The Combiner is an optional local function in the map worker that
pre-processes intermediate (key, value) pairs using the same logic as the Reduce function.
Invoked by the user, it merges local data before network transfer, reducing communication
cost. Like the Reduce phase, MapReduce sorts and groups data before applying the
Combiner.
7. Partitioning function: In MapReduce, intermediate (key, value) pairs with the same key must
be processed by the same Reduce task. Since multiple map tasks may generate such pairs, a
Partitioning function is used to divide the output of each map task into R regions (R = number
of reduce tasks), ensuring all pairs with the same key go to the same region. Each reduce task
then collects data from its corresponding region across all map tasks. The master node keeps
track of these partitions to route data correctly to the reduce workers (a minimal sketch of such
a partitioning function appears after step 11).
Map Reduce
MapReduce Actual Data and Control Flow
8. Synchronization: MapReduce applies a simple synchronization policy to coordinate map workers
with reduce workers, in which the communication between them starts when all map tasks
finish.
9. Communication: Reduce worker i, already notified of the location of region i of all map workers,
uses remote procedure calls to read the data from the respective region of each map worker.
Since every reduce worker reads data from every map worker, this results in all-to-all
communication among the map and reduce workers, which can cause network congestion and is
one of the major bottlenecks limiting the performance of such systems.
10. Sorting and Grouping: Once a reduce worker finishes reading its input data, it buffers the data
locally, then sorts and groups the intermediate (key, value) pairs by key. Sorting is needed because
the map workers typically generate many more unique keys than there are regions (R), so each
region contains pairs with several different keys.
11. Reduce function: The reduce worker iterates over the grouped (key, value) pairs and, for each
unique key, sends the key and its corresponding values to the Reduce function. This function then
processes its input and stores the output results in predetermined files of the user's program.
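As a concrete illustration of step 7: Hadoop's default partitioning simply hashes the key and takes the remainder modulo R, so all pairs with the same key land in the same region. A minimal custom Partitioner with the same behavior might look like this (illustrative only):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partitioning function (step 7): maps an intermediate (key, value) pair to one of
    // R regions, where R is the number of reduce tasks (numPartitions).
    public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            // Mask the sign bit so the result is a non-negative region index in [0, R).
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }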
Twister and Iterative Map Reduce
Why Traditional MapReduce Falls Short?
• Designed for batch processing.
• Each MapReduce job is stateless: after each iteration, data is written to disk and reloaded in
the next iteration.
• This causes high I/O overhead and inefficiency for algorithms that need multiple passes
over data (e.g., K-Means, PageRank, Gradient Descent).
Twister: Iterative MapReduce for Efficient Computation
Twister is a lightweight MapReduce runtime designed to efficiently support iterative
computations.
Key Features:
• Static Data Support: Input data is loaded once and reused across iterations (in-memory).
• Publish/Subscribe Communication: For faster data exchange between tasks.
• Long-running Map/Reduce Tasks: Map/reduce tasks can persist across iterations, reducing
startup costs.
• Intermediate Results In-Memory: Avoids the overhead of writing intermediate results to
disk.
Twister and Iterative Map Reduce
Advantages:
• Great for Iterative Algorithms: Twister significantly improves performance on algorithms
like K-Means, PageRank, and SVM.
• Better than Hadoop for Iterations: Because Hadoop writes to disk between iterations,
Twister can be 10x–100x faster in some cases.
Example Use Case:
• K-Means Clustering:
• Data is loaded once into mappers.
• In each iteration, new cluster centers are computed in reducers.
• Iterations continue until convergence — all done without reloading data every time.
Twister and Iterative Map Reduce
Iterative MapReduce (General Concept)
• While Twister is a specific implementation, Iterative MapReduce is a broader concept that
extends MapReduce with native support for iterations.
Basic Workflow:
• Initial MapReduce job is run with input data.
• Loop control is added (e.g., based on convergence criteria).
• Intermediate state (e.g., model parameters, centroids) is passed between iterations.
• Execution continues until the loop terminates.
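A hedged sketch of this workflow as a driver loop is shown below. The helper methods (runClusteringIteration, hasConverged) and the file-naming scheme are hypothetical placeholders, not the API of Twister or Hadoop; the point is only that the loop re-runs a MapReduce-style job, feeding each iteration's output (e.g., new centroids) into the next until convergence.

    // Illustrative iterative-MapReduce driver loop (K-Means-style); all helper
    // methods and paths are hypothetical placeholders.
    public class IterativeDriver {

        public static void main(String[] args) throws Exception {
            String inputData  = "hdfs:///data/points";        // static input data
            String centroids  = "hdfs:///kmeans/centroids-0"; // initial model parameters
            int maxIterations = 20;

            for (int i = 0; i < maxIterations; i++) {
                String newCentroids = "hdfs:///kmeans/centroids-" + (i + 1);

                // One MapReduce pass: map assigns points to the nearest centroid,
                // reduce recomputes the centroid of each cluster.
                runClusteringIteration(inputData, centroids, newCentroids);

                // Loop control: stop when the centroids no longer move significantly.
                if (hasConverged(centroids, newCentroids)) break;
                centroids = newCentroids;   // intermediate state passed to the next iteration
            }
        }

        // Hypothetical helpers; in Hadoop these would configure and run a Job,
        // in Twister they would reuse long-running, in-memory map/reduce tasks.
        static void runClusteringIteration(String input, String oldCentroids, String newCentroids) { /* ... */ }
        static boolean hasConverged(String oldCentroids, String newCentroids) { return false; }
    }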
Hadoop Library from Apache
• Hadoop is an open source implementation of MapReduce coded and released
in Java by Apache. The Hadoop implementation of MapReduce uses the
Hadoop Distributed File System (HDFS) as its underlying layer.
• The Hadoop core is divided into two fundamental layers: the MapReduce
engine and HDFS.
• The MapReduce engine is the computation engine running on top of HDFS as
its data storage manager.
• HDFS: HDFS is a distributed file system inspired by GFS that organizes files and
stores their data on a distributed computing system.
Hadoop Library from Apache
HDFS Architecture
• HDFS has a master/slave architecture containing a single NameNode as the master and a
number of DataNodes as workers (slaves).
• To store a file in this architecture, HDFS splits the file into fixed-size blocks (e.g., 64 MB) and
stores them on workers (DataNodes). The mapping of blocks to DataNodes is determined
by the NameNode.
• The NameNode (master) also manages the file system’s metadata and namespace. In such
systems, the namespace is the area maintaining the metadata.
• Metadata refers to all the information stored by a file system that is needed for overall
management of all files.
• For example, NameNode in the metadata stores all information regarding the location of
input splits/blocks in all DataNodes.
• Each DataNode, usually one per node in a cluster, manages the storage attached to the
node. Each DataNode is responsible for storing and retrieving its file blocks.
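From a client's point of view, block splitting, placement, and replication are transparent: a program simply writes to and reads from a path, and the NameNode and DataNodes handle the rest. Below is a minimal sketch using the Hadoop FileSystem Java API; the file path and replication value are arbitrary examples.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/demo/sample.txt");

            // Write: HDFS splits the stream into blocks, and the NameNode decides
            // which DataNodes store each block and its replicas.
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Optionally change the replication factor for this file.
            fs.setReplication(file, (short) 2);

            // Read: the client asks the NameNode for block locations, then streams
            // the data directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                byte[] buf = new byte[32];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }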
Hadoop Library from Apache
HDFS Features
• Fault Tolerance: Data is replicated; if a DataNode fails, data is read from other replicas.
• High Throughput : Optimized for batch processing and large streaming reads.
• Scalability: Easily scalable to thousands of nodes and petabytes of data.
• Write Once, Read Many: Optimized for datasets where data is written once and read
multiple times.
• Data Locality: Computation is moved to the location of the data to minimize data
transfer.
Hadoop Library from Apache
HDFS Fault Tolerance
• One of the main aspects of HDFS is its fault tolerance. Since Hadoop is designed to be
deployed on low-cost commodity hardware, hardware failures are considered common
rather than exceptional. Therefore, Hadoop addresses the following issues to fulfill the
reliability requirements of the file system:
1. Block replication: To reliably store data in HDFS, file blocks are replicated in
this system. In other words, HDFS stores a file as a set of blocks and each
block is replicated and distributed across the whole cluster. The replication
factor is set by the user and is three by default.
Hadoop Library from Apache
2. Replica Placement: To ensure fault tolerance, HDFS places replicas on different nodes,
preferably across racks. However, cross-rack communication is costly, so HDFS balances
reliability and efficiency. With the default replication factor of three, one replica is stored
on the local node, another on a different node within the same rack, and the third on a
node in a different rack—offering fault tolerance with reduced communication overhead.
3. Heartbeat and Blockreport messages: Heartbeats and Blockreports are periodic
messages sent to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly, while each Blockreport contains a list
of all blocks on a DataNode. The NameNode receives such messages because it is the sole
decision maker of all replicas in the system.
4. Large blocks and high throughput: HDFS provides high-throughput access to large data sets
by favoring batch processing over low latency. Files are split into large blocks (e.g., 64 MB) to
reduce metadata and improve performance. Fewer, larger blocks lower metadata overhead
and enable fast, sequential streaming reads.
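As a small worked example covering points 1 and 4: with 64 MB blocks and the default replication factor of three, a 200 MB file is split into four blocks (64 + 64 + 64 + 8 MB) and stored as twelve block replicas across the cluster; likewise, a 1 GB file occupies only 16 blocks, so the NameNode tracks just 16 block entries (times the replication factor), whereas 4 MB blocks would require 256 entries and many more seeks during streaming reads.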
Hadoop Library from Apache
2. Replica Placement: To ensure fault tolerance, HDFS places replicas on different nodes,
preferably across racks. However, cross-rack communication is costly, so HDFS balances
reliability and efficiency. With the default replication factor of three, one replica is stored
on the local node, another on a different node within the same rack, and the third on a
node in a different rack—offering fault tolerance with reduced communication overhead.
3. Heartbeat and Blockreport messages: Heartbeats and Blockreports are periodic
messages sent to the NameNode by each DataNode in a cluster. Receipt of a Heartbeat
implies that the DataNode is functioning properly, while each Blockreport contains a list
of all blocks on a DataNode. The NameNode receives such messages because it is the sole
decision maker of all replicas in the system.
4. HDFS provides high-throughput access to large data sets by focusing on batch processing
over low latency. Files are split into large blocks (e.g., 64MB) to reduce metadata and
improve performance. Fewer, larger blocks lower metadata overhead and enable fast,
sequential streaming reads.
21