E - Notes - HPC-Unit 3-1
VANIYAMBADI
PG and Department of Computer Applications
II M.C.A – Semester - IV
Overview:
Basics of parallelization
- Introduction to Parallelism
- Parallel scalability
Shared memory parallel programming with OpenMP
- Short introduction to OpenMP
- OpenMP-parallel Jacobi algorithm
Basics of parallelization
1. Introduction to Parallelism
What is Parallelism?
Parallelism is the process of executing several sets of instructions simultaneously. It reduces the total
computational time. Parallelism can be implemented by using parallel computers, i.e. computers with many
processors. Parallel computers require parallel algorithms, programming languages, compilers and operating
systems that support multitasking.
Why parallelize?
Parallelization is a technique that breaks down a large task into smaller tasks that can be processed at the
same time. This is done to reduce the time it takes to complete the task.
Saves time
Parallel computing is faster than serial computing, which uses a single processor to solve problems
one at a time.
Saves money
Parallel computing can save money by using resources more efficiently.
Solves complex problems
Parallel computing can solve more complex problems by using more resources.
Models real-world processes
Parallel computing models how the real world works, where things don't happen one at a time.
Where is parallelization used?
Scientific simulations, Data analysis, Machine learning, Computer graphics, Web servers, Database
servers, and Distributed computing systems.
Parallelism
Writing a parallel program must always start by identifying the parallelism inherent in the algorithm at hand.
Different variants of parallelism induce different methods of parallelization.
1.1 Data parallelism
Data parallelism is a parallel computing technique that involves distributing data across multiple
processors to perform the same operation simultaneously. The goal is to improve speed and
computational efficiency.
Example: Medium-grained loop parallelism
Processing of array data by loops or loop nests is a central component in most scientific
codes.
Often the computations performed on individual array elements are independent of each other
and are hence typical candidates for parallel execution by several processors in shared
memory
The reason why this variant of parallel computing is often called “medium-grained” is that
the distribution of work across processors is flexible and easily changeable down to the single
data element.
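As a simple sketch of such medium-grained loop parallelism (the array name, the scaling operation and the OpenMP directive are illustrative assumptions; OpenMP itself is introduced later in these notes):
/* Sketch: the same operation is applied to every array element, and the
   loop iterations are distributed across the threads of a shared-memory
   machine. Because the iterations are independent, this is data parallelism. */
#include <omp.h>

void scale_array(double *a, double s, int n)
{
    #pragma omp parallel for        /* split the n iterations among the threads */
    for (int i = 0; i < n; i++)
        a[i] = s * a[i];            /* each element is updated independently */
}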
A common way to distribute such work is domain decomposition: the data set (e.g., the computational grid) is split into subdomains, each of which is assigned to one worker. First and foremost, the computational effort should be equal for all domains to prevent some workers from idling while others still update their own domains. This is called load balancing.
Domain decomposition has the attractive property that the domain boundary area (which determines communication) grows more slowly than the domain volume (which determines computation) if the problem size increases with N held constant.
Therefore, one can sometimes alleviate communication bottlenecks just by choosing a larger problem size.
[Figure in the original text: expected effects of scaling the problem size and/or the number of workers with optimal domain decomposition in three dimensions.]
1.2 Functional parallelism
In functional parallelism, tasks are distributed across different processors so that they can be
performed concurrently. This is different from data parallelism, which involves running the
same task on different data components.
Sometimes the solution of a “big” numerical problem can be split into more or
less disparate subtasks, which work together by data exchange and synchronization.
In this case, the subtasks execute completely different code on different data items,
which is why functional parallelism is also called MPMD (Multiple Program Multiple
Data).
Example: Master-worker scheme
Reserving one compute element for administrative tasks while all others solve the actual problem is
called the master-worker scheme. The master distributes work and collects results.
A typical example is rendering an image of a scene (e.g., by ray tracing): if all compute elements have a copy of the scene, all pixels of the image are independent and can be computed in parallel. For efficiency reasons, the picture is usually divided into “work packages” (rows or tiles) that the master hands out to the workers.
In case of a distributed-memory system, the finished tile must also be communicated over the
network.
A drawback of the master-worker scheme is the potential communication and performance bottleneck
that may appear with a single master when the number of workers is large.
Functional decomposition is a method of breaking down a complex process into smaller, more
manageable parts. It can be used to introduce parallelism, which is the process of executing multiple
parts of a program simultaneously.
2. Parallel scalability
Parallel scalability is a measure of how well a parallel system can increase speed as more processors are
added. It can also refer to how well a system can maintain efficiency as the number of processors increases.
Finding parallelism is not only a common problem in computing but also in many other areas like
manufacturing, traffic flow and even business processes. In a very simplistic view, all execution units
(workers, assembly lines, waiting queues, CPUs, ...) execute their assigned work in exactly the same amount
of time. Under such conditions, using N workers, a problem that takes a time T to be solved sequentially
will now ideally take only T/N. We call this a speedup of N.
Whatever parallelization scheme is chosen, this perfect picture will most probably not hold in reality. Some
of the reasons for this have already been mentioned above: Not all workers might execute their tasks in the
same amount of time, because the problem was not (or could not be) partitioned into pieces of equal
complexity. Hence, there are times when all but a few workers have nothing to do but wait for the latecomers
to arrive.
This load imbalance hampers performance because some resources are underutilized. Moreover there might
be shared resources like, e.g., tools that only exist once but are needed by all workers. This will effectively
serialize part of the concurrent execution.
And finally, the parallel workflow may require some communication between workers, adding overhead that
would not be present in the serial case. All these effects can impose limits on speedup.
How well a task can be parallelized is usually quantified by some scalability metric. Using such metrics, one
can answer questions like:
• How much faster can a given problem be solved with N workers instead of one?
• How much more work can be done with N workers instead of one?
• What impact do the communication requirements of the parallel application have on performance and
scalability?
• What fraction of the resources is actually used productively for solving the problem?
In order to be able to define scalability, we first have to identify the basic measurements on which derived
performance metrics are built. In a simple model, the overall problem size (“amount of work”)
shall be s + p = 1, where s is the serial (nonparallelizable) part and p is the perfectly parallelizable fraction.
Several factors can limit parallelism and scalability:
• Algorithmic limitations. Operations that cannot be done in parallel because of, e.g., mutual
dependencies can only be performed one after another, or even in a certain order.
• Bottlenecks. Shared resources are common in computer systems: Execution units in the core, shared
paths to memory in multicore chips, I/O devices. Access to a shared resource serializes execution.
Even if the algorithm itself could be performed completely in parallel, concurrency may be limited
by bottlenecks.
• Startup overhead. Starting a parallel program, regardless of the technical details, takes time. Of
course, system designs try to minimize startup time, especially in massively parallel systems, but
there is always a nonvanishing serial part. If a parallel application’s overall runtime is too short,
startup will have a strong impact.
• Communication. Fully concurrent communication between different parts of a parallel system cannot
be taken for granted. If solving a problem in parallel requires communication, some serialization
is usually unavoidable.
First we assume a fixed problem, which is to be solved by N workers. We normalize the single-worker (serial) runtime
T_f^s = s + p
to one. Solving the same problem on N workers will require a runtime of
T_f^p = s + p/N.
This is called strong scaling because the amount of work stays constant no matter how many
workers are used. Here the goal of parallelization is minimization of time to solution for a given
problem.
If time to solution is not the primary objective, because larger problem sizes (for which available
memory is the limiting factor) are of interest, it is appropriate to scale the problem size with some
power of N so that the total amount of work is s + p·N^α, where α is a positive but otherwise free
parameter. Here we use the implicit assumption that the serial fraction s is a constant. We define the
serial and parallel runtimes for the scaled (variably-sized) problem as
T_v^s = s + p·N^α
T_v^p = s + p·N^(α−1).
The term weak scaling has been coined for this approach, although it is commonly
used only for the special case α = 1.
Application speedup can be defined as the quotient of parallel and serial performance for fixed problem size.
In the following we define “performance” as “work over time,” unless otherwise noted. Serial performance
for fixed problem size (work s + p) is thus
P_f^s = (s + p) / T_f^s = 1,
and parallel performance is
P_f^p = (s + p) / T_f^p = 1 / (s + (1 − s)/N).
The application speedup (“scalability”) is then
S_f = P_f^p / P_f^s = 1 / (s + (1 − s)/N),
which is exactly Amdahl's Law.
Amdahl's law and Gustafson's law are both used to analyze how much speedup can be achieved when
solving problems using multiple processors. They are primarily used in the field of Parallel Distributed
Computing (PDC).
Amdahl's law
• Assumes that the computing requirements remain the same when processing power increases
• Applies when the problem size is fixed
• States that the potential speedup is limited by the fraction of the algorithm that cannot be
parallelized (the serial fraction)
Amdahl's Law is a principle used in parallel computing to predict the theoretical maximum speedup of a task
when parts of it are parallelized. It highlights the limits of parallelization and shows how the non-
parallelizable portion of a task constrains overall performance improvement.
Formula
Amdahl's Law is expressed as:
S = 1 / ((1 − P) + P/N)
where P is the fraction of the program that can be parallelized, (1 − P) is the serial fraction, and N is the
number of processors. In the notation used above (s = 1 − P), this is the same speedup 1/(s + (1 − s)/N)
derived in the previous section.
Practical Implications
Even a small serial fraction caps the achievable speedup: no matter how many processors are used, S can
never exceed 1/(1 − P).
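As a short worked example (the numbers are assumed purely for illustration, with P = 0.9 and N = 16):
\[
  S(16) = \frac{1}{(1-P) + P/N} = \frac{1}{0.1 + 0.9/16} = 6.4,
  \qquad
  S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1-P} = 10 .
\]
So sixteen processors give at most a 6.4-fold speedup, and no number of processors can push it beyond 10.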
Gustafson’s Law
Gustafson’s Law is a principle in parallel computing that complements Amdahl’s Law by providing a
different perspective on the scalability of parallel systems. It argues that as the number of processors
increases, the problem size can also grow, allowing the parallelizable portion of the workload to dominate,
thereby achieving greater speedup.
Formula
Gustafson’s Law is expressed as:
S = N − (1 − P)(N − 1)
where P is the parallelizable fraction and N is the number of processors (equivalently, S = (1 − P) + P·N).
Key Insights
• Problem Size Scaling: Unlike Amdahl’s Law, which assumes a fixed workload, Gustafson’s Law
assumes that the problem size increases with the number of processors. This makes it particularly
useful for large-scale computations.
• Practical Speedup: It highlights that with a sufficiently large problem size, the parallelizable portion
can outweigh the sequential part, leading to near-linear speedup.
• Focus on Scalability: Gustafson’s Law suggests that rather than being limited by the sequential
portion, systems can achieve significant speedup by scaling the workload.
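A short worked example with the same assumed values (P = 0.9, N = 16):
\[
  S = N - (1-P)(N-1) = 16 - 0.1 \times 15 = 14.5,
\]
much closer to the ideal speedup of 16 than the Amdahl prediction of 6.4, because the problem size is assumed to grow with the number of processors.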
Factors Affecting Parallel Efficiency
Communication Overhead: Processors need to communicate and synchronize, which can reduce
efficiency.
Load Balancing: Uneven distribution of work among processors leads to idle time and lower
efficiency.
Sequential Portion: The non-parallelizable portion of the workload (as per Amdahl's Law) limits
efficiency.
Granularity: Finer tasks may result in higher overhead, while coarser tasks may better utilize
processors.
Scalability: Increasing problem size relative to the number of processors can improve efficiency.
2.5 Serial performance versus strong scalability
"Serial performance" refers to the execution speed of a single-threaded program on a single processor,
while "strong scalability" describes how well a program's performance improves when the problem size
remains constant but the number of processors used increases.
Example:
Scenario: Imagine an image processing algorithm where a large portion of the image can be
processed independently on different cores (parallelizable), but a small part requires sequential
processing (serial bottleneck).
o Serial Performance: This would be the time it takes to process the entire image using only
one core, including the time spent on the serial part.
o Strong Scalability: If you add more cores, the parallelizable parts of the image processing
will get significantly faster, but the overall speedup will be limited by the time spent on the
serial section, which cannot be parallelized.
1. Serial Performance
Definition:
Serial performance refers to the efficiency and speed of executing a task on a single processor (or core)
without any parallelization. It measures the baseline performance of the algorithm or system in a non-
parallel environment.
Key Characteristics:
Dependent on Hardware: Affected by processor speed, memory access, and other hardware factors.
Algorithm Design: Influenced by the efficiency of the algorithm and the implementation.
Impact on Parallel Performance: The faster the serial performance, the less time is spent on non-
parallelizable portions, improving parallel speedup as per Amdahl’s Law.
Optimization Focus: Efficient algorithms and implementations, together with good use of the hardware
(processor and memory), improve the serial baseline.
2. Strong Scalability
Definition:
Strong scalability refers to the ability of a parallel system to achieve improved performance (speedup) as the
number of processors increases, while keeping the total problem size fixed.
Key Characteristics:
Fixed Problem Size: The workload does not change as processors are added.
Diminishing Returns: Speedup is limited by the sequential portion of the task (per Amdahl’s Law).
Efficiency Decreases with Processors: Communication and synchronization overheads grow as more
processors are used.
Challenges:
Speedup saturates once the serial fraction and the parallel overheads (communication, synchronization)
dominate the runtime.
Refined Performance Models
Refined performance models in parallel computing aim to provide a more detailed and accurate
understanding of system performance by accounting for real-world factors beyond simple speedup
equations. These models extend basic metrics like Amdahl’s Law and Gustafson’s Law to incorporate
communication overhead, load imbalance, memory access patterns, and architectural constraints.
Communication Overhead
Parallel systems require processors to exchange data. The communication cost increases with the number of
processors involved, the amount of data exchanged, and the latency and bandwidth of the network.
Choosing the Scaling Baseline
The scaling baseline determines the starting point for evaluating how a system performs as resources (e.g.,
processors, memory, or problem size) are scaled. Selecting the appropriate baseline is crucial for accurately
analyzing and interpreting performance metrics like speedup, efficiency, and scalability.
Types of Scaling Baselines
Serial Baseline
Definition: Uses the execution time of the task on a single processor as the baseline.
Use Case: Common in theoretical analyses (e.g., Amdahl’s Law) and when emphasizing the benefits
of parallelization compared to a serial implementation.
Advantages:
Straightforward to compute.
Highlights the overall performance improvement due to parallelization.
Challenges:
May overestimate the benefits if the serial implementation is inefficient or unrealistic.
Does not always reflect real-world scenarios where tasks start with multiple processors.
Parallel Baseline
Definition: Uses the execution time on a small number of processors (rather than a single-processor run)
as the baseline.
Use Case: Relevant for applications already designed for parallel execution.
Advantages:
Reflects realistic starting conditions for parallel systems.
Allows for relative comparisons across configurations.
Challenges:
Requires careful selection of the baseline processor count to ensure consistency.
Algorithm-Dependent Baseline
Definition: Uses the best-known implementation or a reference algorithm as the baseline, regardless
of the processor count.
Use Case: Useful for benchmarking and algorithm comparisons.
Advantages:
Focuses on improving state-of-the-art solutions.
Challenges:
Requires a well-established reference algorithm.
Strong Scalability Baseline
Definition: Keeps the problem size fixed and evaluates performance as the number of processors
increases.
Use Case: Highlights how well a system can utilize additional resources for a fixed workload.
Advantages:
Provides insights into the limits of parallelization.
Challenges:
Limited by Amdahl’s Law (sequential bottlenecks) and overhead.
Weak Scalability Baseline
Definition: Increases the problem size proportionally with the number of processors while keeping
the workload per processor constant.
Use Case: Evaluates how systems handle larger workloads with increased resources.
Advantages:
Reflects real-world scenarios for large-scale computations.
Aligned with Gustafson’s Law, showing scalability with workload growth.
Challenges:
Assumes problem sizes can grow indefinitely.
Scalability is a critical concept in parallel computing, describing how well a system can handle increased
resources. Different scaling approaches—strong scaling, strictly weak scaling, and modified weak scaling—
evaluate performance under varying conditions. Below, we explore these concepts with definitions,
formulas, and examples.
1. Strong Scaling
Definition:
Strong scaling examines performance improvement as the number of processors increases while keeping the
problem size fixed.
Formula:
Speedup S(N) = T(1) / T(N), where T(1) is the runtime on one processor and T(N) is the runtime on N
processors for the same problem; parallel efficiency is E(N) = S(N) / N.
Key Characteristics:
Emphasizes parallel efficiency for a fixed workload.
Limited by Amdahl’s Law, as the non-parallelizable portion (sequential bottleneck) becomes
dominant with increasing processors.
Communication and synchronization overheads grow with more processors.
Example:
If a fixed-size computation takes 100 s on 1 processor and 30 s on 8 processors, the strong-scaling speedup
is 100/30 ≈ 3.3 (efficiency ≈ 42%), well below the ideal value of 8 because of the serial fraction and parallel
overheads.
2. Strictly Weak Scaling
Definition:
Strictly weak scaling measures performance when the problem size increases linearly with the number of
processors, ensuring that the workload per processor remains constant.
Formula:
Weak-scaling efficiency E(N) = T(1) / T(N), where T(N) is the runtime of a problem N times larger on N
processors; ideally E(N) stays close to 1.
3. Modified Weak Scaling
Definition:
Modified weak scaling evaluates performance when the problem size increases sub-linearly or super-linearly
with the number of processors, meaning the workload per processor changes.
Formula:
The total work grows as s + p·N^α with α ≠ 1 (compare the scaled-problem model above), so the workload
per processor is no longer constant.
Key Characteristics:
Allows flexibility in problem size growth, accommodating real-world scenarios where scaling may
not be strictly linear.
Useful for applications with irregular workloads or those dependent on external constraints.
Example:
If the problem size grows sub-linearly with the number of processors (α < 1), processors may become
underutilized, highlighting inefficiencies.
2.9 Load imbalance
Load imbalance occurs when work is unevenly distributed across processors or computing units in a
parallel system. This results in some processors being underutilized (idle) while others are still performing
computations, reducing overall efficiency and performance.
OS jitter
OS jitter refers to performance variability in parallel computing systems caused by operating system (OS)
activities that disrupt or interfere with the execution of user applications. These interruptions can degrade
performance and scalability, especially in tightly coupled, high-performance computing environments.
Sources of OS Jitter
1. Kernel Interrupts
Examples: Timer ticks, context switching, I/O handling.
2. Daemon Processes
Background processes or services (e.g., logging, monitoring, or security scans) that
consume CPU cycles.
3. Resource Contention
Competition for shared system resources like memory, cache, or I/O bandwidth among user
applications and OS tasks.
4. Hardware Management
System-level tasks like power management, thermal throttling, or hardware interrupts.
5. Network Stack Activity
Handling network communication, especially in distributed computing.
Impact of OS Jitter
1. Performance Degradation
Jitter causes delays in computations, increasing runtime variability and reducing efficiency.
2. Scalability Challenges
In large-scale HPC systems, even small jitter effects on individual nodes can amplify, leading to
significant slowdowns.
3. Load Imbalance
Some processors may experience more OS interruptions than others, causing uneven work
distribution.
4. Increased Latency in Synchronization
Barrier synchronization and other collective operations are delayed if even one processor is
interrupted.
Example in Practice
Scenario:
In an MPI-based HPC application running on 1,024 nodes, a periodic OS daemon triggers disk I/O on
random nodes. This causes the affected nodes to fall behind, delaying barrier synchronizations and collective
operations and slowing down the entire application.
Shared memory parallel programming with OpenMP
1. Short introduction to OpenMP
In a shared-memory parallel computer, all processors can directly access all data in the shared memory;
there is no need for explicit communication between the processors.
OpenMP is a parallel programming standard for shared-memory parallel computers. It consists of:
- A set of compiler directives (with additional clauses)
- A small number of library functions
- A few environment variables
Threads in OpenMP
The central execution entities in an OpenMP program are threads—lightweight processes.
The OpenMP threads share a common address space and mutually access data.
Spawning a thread is much less costly than forking a new process, because threads share everything
except the instruction pointer, stack pointer and register state.
If needed, each thread can have a few “private variables” (implemented by means of the local stack
pointer).
Parallel execution: the “fork-join” model
In any OpenMP program, a single thread, called master thread, runs immediately after startup.
The master thread can spawn (also called fork) a number of additional threads when entering a so-
called parallel region.
Inside a parallel region, the master thread and the spawned threads execute instruction streams
concurrently.
Each thread has a unique ID.
Different threads work on different parts of the shared data or carry out different tasks.
OpenMP has compiler directives for dividing the work among the threads.
At the end of a parallel region, the threads are joined: all threads except the master thread are terminated,
and only the master thread continues execution.
There can be multiple parallel regions in an OpenMP program.
We will now learn how to create a parallel Hello World program using OpenMP.
1. Include the header file: We have to include the OpenMP header for our program along with the
standard header files.
//OpenMP header
#include <omp.h>
2. Specify the parallel region: In OpenMP, we mark the region that is to be executed in parallel with the
directive #pragma omp parallel. This directive forks additional threads to carry out the work enclosed in
the parallel region. The original thread is called the master thread and has thread ID 0. Code for creating
a parallel region would be:
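A minimal sketch of such a program (the exact output format is an assumption):
// OpenMP "Hello World": every thread in the parallel region prints its ID
#include <stdio.h>
#include <omp.h>   // OpenMP header

int main(void)
{
    #pragma omp parallel           // fork a team of threads
    {
        printf("Hello World from thread %d of %d\n",
               omp_get_thread_num(),     // unique ID of this thread (master = 0)
               omp_get_num_threads());   // total number of threads in the team
    }                              // implicit join: only the master thread continues
    return 0;
}
Compiling typically requires an OpenMP flag (e.g., gcc -fopenmp), and the number of threads can be set with the OMP_NUM_THREADS environment variable.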
The OpenMP API has three main components:
Compiler directives. These are preprocessor directives that programmers can use to define and
control parallel regions of code.
Runtime library routines. These are functions from the OpenMP library that can be called by the
program and are linked into it.
Environment variables. These can be used to control the behavior of OpenMP programs.
“Manual” loop parallelization
Now, let’s try to “manually” parallelize a for-loop, that is, divide the iterations evenly among the threads.
Example:
for (i=0; i<N; i++)
a[i] = b[i] + c[i];
Important observation: Assuming that the arrays a, b and c do not overlap (that is, no aliasing), then the
iterations of this for-loop are independent of each other, thus safe to be executed by multiple threads
concurrently.
Given num_threads as the total number of threads, one way to divide N iterations for thread with ID
thread_id is as follows:
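One possible way to code this (a sketch; the block-distribution formula and the variable names bstart, bend and blen are chosen to match the data-scoping example below):
/* Manually dividing the N iterations of a[i] = b[i] + c[i] among the threads:
   each thread derives its own block of iterations from its thread ID. */
#include <omp.h>

void add_arrays(double *a, const double *b, const double *c, int N)
{
    #pragma omp parallel
    {
        int num_threads = omp_get_num_threads();
        int thread_id   = omp_get_thread_num();

        int blen = N / num_threads;        /* base block length            */
        int rest = N % num_threads;        /* leftover iterations          */
        int bstart, bend;

        if (thread_id < rest) {            /* the first 'rest' threads get */
            blen  += 1;                    /* one extra iteration each     */
            bstart = thread_id * blen;
        } else {
            bstart = thread_id * blen + rest;
        }
        bend = bstart + blen;              /* this thread works on [bstart, bend) */

        for (int i = bstart; i < bend; i++)
            a[i] = b[i] + c[i];
    }
}
With OpenMP's work-sharing directives (shown below), this bookkeeping is normally not necessary.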
OpenMP coding
Data Scoping
Any variables that existed before a parallel region still exist inside the parallel region, and are by
default shared between all threads.
Often it will be necessary for the threads to have some private variables.
Each thread can either declare new local variables inside the parallel region, these variables
are private “by birth”;
Or, each thread can “privatize” some of the shared variables that already existed before a
parallel region (using the private clause):
int blen, bstart, bend;
#pragma omp parallel private(blen, bstart, bend)
{
// ...
}
Each “privatized” variable has one (uninitialized) instance per thread;
The scope of these private variables extends to the end of the parallel region.
Parallelizing for-loops is OpenMP’s main work-sharing mechanism.
OpenMP has several built-in strategies for dividing the iterations among the threads.
No need to manually calculate each thread’s loop bounds
Being the omnipresent programming structure in scientific codes, loops are natural candidates for
parallelization if individual iterations are independent. As an example, consider a parallel version of a simple
program for the calculation of π by integration:
A simple program for numerical integration of a function in OpenMP follows.
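A possible version of this program (a sketch; the slice count and variable names such as approx_pi are assumptions, chosen to match the discussion of synchronization below):
/* Approximate pi = integral of 4/(1+x^2) over [0,1] with the midpoint rule.
   Each thread sums its share of the slices into a private variable and then
   adds it to the shared result inside a critical region. */
#include <stdio.h>
#include <omp.h>

#define SLICES 1000000

int main(void)
{
    const double delta_x = 1.0 / SLICES;
    double approx_pi = 0.0;              /* shared result variable */

    #pragma omp parallel
    {
        double local_sum = 0.0;          /* private to each thread */

        #pragma omp for
        for (int i = 0; i < SLICES; i++) {
            double x = (i + 0.5) * delta_x;       /* midpoint of slice i */
            local_sum += 4.0 / (1.0 + x * x);
        }

        #pragma omp critical             /* one thread at a time updates approx_pi */
        approx_pi += local_sum * delta_x;
    }

    printf("approx_pi = %.12f\n", approx_pi);
    return 0;
}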
Synchronization
Critical regions
Concurrent write accesses to a shared variable must be avoided by all means to circumvent
race conditions.
An OpenMP critical code block is executed by one thread at a time. This is one way to avoid
race conditions. (There are other ways.)
The variable approx_pi in the above example is a shared variable, to which all the threads will
write. Thus, “protection” is provided by a critical code block.
Use of the critical directive will incur overhead.
Improper use of the critical directive may lead to deadlock.
Barriers
If, at a certain point in the parallel execution, it is necessary to synchronize all threads, a BARRIER can be
used:
!$OMP BARRIER (Fortran syntax; the C/C++ equivalent is #pragma omp barrier)
The barrier is a synchronization point, which guarantees that all threads have reached it before any thread
goes on executing the code below it. Certainly it must be ensured that every thread hits the barrier, or a
deadlock may occur.
Barriers should be used with caution in OpenMP programs, partly because of their potential to cause
deadlocks, but also due to their performance impact.
Reductions
Actually, using critical to prevent concurrent writes to the variable approx_pi is overkill. The reduction
clause of OpenMP is designed for exactly this purpose:
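With a reduction clause, the integration sketch above shrinks to a single parallel loop (again an illustration, not a verbatim textbook listing):
/* Reduction version of the pi sketch: each thread accumulates into a private
   copy of approx_pi, and OpenMP combines the copies at the end of the loop. */
#include <stdio.h>
#include <omp.h>

#define SLICES 1000000

int main(void)
{
    const double delta_x = 1.0 / SLICES;
    double approx_pi = 0.0;

    #pragma omp parallel for reduction(+:approx_pi)
    for (int i = 0; i < SLICES; i++) {
        double x = (i + 0.5) * delta_x;
        approx_pi += 4.0 / (1.0 + x * x);
    }
    approx_pi *= delta_x;                /* scale the sum once at the end */

    printf("approx_pi = %.12f\n", approx_pi);
    return 0;
}
The reduction avoids both the race condition and the serialization overhead of a critical region.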
Loop scheduling
How are the loop iterations exactly divided among the threads? This is controlled by OpenMP's schedule
clause: schedule(static) assigns equally sized, contiguous chunks in advance; schedule(dynamic) lets idle
threads fetch chunks on demand; schedule(guided) uses chunks whose size shrinks as the loop progresses.
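A sketch of how the schedule clause is written (work() and N are placeholders; the chunk size 4 is an arbitrary illustration):
#include <omp.h>

void work(int i);   /* placeholder for the per-iteration workload */

void scheduling_examples(int N)
{
    #pragma omp parallel for schedule(static)      /* equal contiguous blocks, fixed in advance */
    for (int i = 0; i < N; i++) work(i);

    #pragma omp parallel for schedule(dynamic, 4)  /* idle threads fetch chunks of 4 iterations */
    for (int i = 0; i < N; i++) work(i);

    #pragma omp parallel for schedule(guided)      /* chunk sizes shrink as the loop progresses */
    for (int i = 0; i < N; i++) work(i);
}
Static scheduling has the lowest overhead; dynamic and guided scheduling help when the cost per iteration varies (load balancing).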
Tasking
A task can be defined by OpenMP’s task directive, containing the code to be executed.
When a thread encounters a task construct, it may execute it right away or set up the appropriate data
environment and defer its execution. The task is then ready to be executed later by any thread of the
team.
A “single” code block in OpenMP is entered by one thread only, namely the thread that reaches the
single directive first. All others skip the code and wait at the end of the single block due to an implicit
barrier.
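A small sketch combining task and single (process_item and n_items are assumed names):
/* One thread (inside the "single" block) creates the tasks; any thread of the
   team may pick them up and execute them. */
#include <stdio.h>
#include <omp.h>

static void process_item(int i)          /* placeholder for the real work */
{
    printf("item %d processed by thread %d\n", i, omp_get_thread_num());
}

void process_all(int n_items)
{
    #pragma omp parallel
    {
        #pragma omp single               /* entered by the first thread to arrive */
        for (int i = 0; i < n_items; i++) {
            #pragma omp task firstprivate(i)
            process_item(i);             /* deferred; executed by any idle thread */
        }
    }   /* implicit barriers ensure all tasks are finished here */
}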
A “master” code block is only entered by the master thread, all the other threads skip over without
waiting for the master thread to finish.
OpenMP has a separate “barrier” directive for explicit synchronization among the threads. (Use with
care!!!)
2. OpenMP-parallel Jacobi algorithm
The Jacobi iterative method solves a system of equations on an N×N lattice in two dimensions. It can be
used, for example, to solve partial differential equations (PDEs) discretized on a grid: in each sweep, every
grid point is updated from the neighbouring values of the previous iteration, so all updates within one sweep
are independent of each other.
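A possible OpenMP parallelization of one Jacobi sweep (a sketch under assumed storage: the grid is an (N+2)×(N+2) array stored row by row, with a fixed boundary layer; the names phi and phi_new are illustrative):
/* One Jacobi sweep in 2D: every interior point becomes the average of its
   four neighbours, read from the old array phi and written to phi_new.
   Because only phi is read and only phi_new is written, the (i,j) updates
   are independent and the loop nest can safely be parallelized. */
#include <omp.h>

void jacobi_sweep(int N, const double *phi, double *phi_new)
{
    const int stride = N + 2;                 /* row length incl. boundary points */

    #pragma omp parallel for
    for (int i = 1; i <= N; i++) {            /* interior rows */
        for (int j = 1; j <= N; j++) {        /* interior columns */
            phi_new[i * stride + j] = 0.25 * ( phi[(i - 1) * stride + j]
                                             + phi[(i + 1) * stride + j]
                                             + phi[i * stride + (j - 1)]
                                             + phi[i * stride + (j + 1)] );
        }
    }
    /* The caller swaps phi and phi_new after each sweep and repeats until a
       convergence criterion (e.g., the maximum change) is small enough. */
}
Parallelizing the outer loop keeps whole rows on one thread, which is also cache-friendly.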
Additional Resources
1. https://cse.cet.ac.in/wp-content/uploads/2019/10/ModernProcessors.pdf
2. https://www.geeksforgeeks.org/high-performance-computing/
3. https://www.studocu.com/in/document/vignan-institute-of-technology-and-science/computer-science/hpc-unit-3/88659975
4. https://www.igpm.rwth-aachen.de/Download/ss17/EWP/parallel.pdf
Practice Questions
1. What is parallelization?
2. State Amdahl's Law.
3. Explain the key idea behind Gustafson's Law.
4. Why is load balancing important in parallel computing?
5. What are the uses of parallelization?
6. Differentiate between Amdahl's Law and Gustafson's Law.
7. Explain the factors affecting parallel efficiency.
8. Describe the concept of strong scalability with an example.
9. Discuss the importance of communication overhead in parallel systems.
10. What is parallelization? Discuss its types and applications.
11. Explain Amdahl's Law and Gustafson's Law with their formulas and implications.
12. Discuss the factors limiting scalability in parallel computing.
13. Explain data parallelism and functional parallelism with examples.
14. What are scalability metrics? Discuss their significance in parallel performance analysis.
15. What is OpenMP and what programming languages does it support?
16. What is the role of the master thread in OpenMP?
17. Define shared memory in the context of OpenMP.
18. What are the primary components of the OpenMP API?
19. What is a "critical region" in OpenMP?
20. Explain the fork-join model in OpenMP.
21. How does OpenMP handle parallel regions in a program?
22. What is the significance of the private clause in OpenMP?
23. Explain the #pragma omp parallel for directive with an example.
24. What is the difference between the static, dynamic, and guided scheduling options in OpenMP?
25. Explain the process of parallelizing a for loop in OpenMP with an example.
26. Illustrate the parallelization of the Jacobi algorithm using OpenMP for solving partial differential
equations.