
MARUDHAR KESARI JAIN COLLEGE FOR WOMEN (AUTONOMOUS)

VANIYAMBADI
PG and Department of Computer Applications

II M.C.A – Semester - IV

E-Notes (Study Material)

Core Course -1: High Performance Computing Code:23PCA41

Unit: 3 – Basics of parallelization: Introduction to Parallelism -Parallel scalability.


Shared memory parallel programming with OpenMP: Short introduction to OpenMP-
OpenMP-parallel Jacobi algorithm.
Learning Objectives: To gain an understanding of the techniques used in HPC models.
Course Outcomes: Apply parallel computing techniques.

Overview:
Basics of parallelization
- Introduction to Parallelism
- Parallel scalability
Shared memory parallel programming with OpenMP
- Short introduction to OpenMP
- OpenMP-parallel Jacobi algorithm
Basics of parallelization
1. Introduction to Parallelism
What is Parallelism?
Parallelism is the process of executing several sets of instructions simultaneously. It reduces the total
computational time. Parallelism can be implemented by using parallel computers, i.e. computers with many
processors. Parallel computers require parallel algorithms, programming languages, compilers and operating
systems that support multitasking.
Why parallelize?
Parallelization is a technique that breaks down a large task into smaller tasks that can be processed at the
same time. This is done to reduce the time it takes to complete the task.
 Saves time
Parallel computing is faster than serial computing, which uses a single processor to solve problems
one at a time.
 Saves money
Parallel computing can save money by using resources more efficiently.
 Solves complex problems
Parallel computing can solve more complex problems by using more resources.
 Models real-world processes
Parallel computing models how the real world works, where things don't happen one at a time.
Where is parallelization used?
Scientific simulations, Data analysis, Machine learning, Computer graphics, Web servers, Database
servers, and Distributed computing systems.
Parallelism
Writing a parallel program must always start by identifying the parallelism inherent in the algorithm at hand.
Different variants of parallelism induce different methods of parallelization.
1.1 Data parallelism
Data parallelism is a parallel computing technique that involves distributing data across multiple
processors to perform the same operation simultaneously. The goal is to improve speed and
computational efficiency.
Example: Medium-grained loop parallelism
 Processing of array data by loops or loop nests is a central component in most scientific
codes.

 Often the computations performed on individual array elements are independent of each other
and are hence typical candidates for parallel execution by several processors in shared
memory
 The reason why this variant of parallel computing is often called “medium-grained” is that
the distribution of work across processors is flexible and easily changeable down to the single
data element.
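For illustration, here is a minimal sketch of such a loop (the array names a, b, c and the size N are
purely illustrative): every iteration touches only its own element i, so any subset of the iterations can
be handed to a different processor.

void vector_add(double *a, const double *b, const double *c, int N)
{
    /* Independent iterations: element i depends on no other element. */
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}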

Example: Coarse-grained parallelism by domain decomposition


Coarse-Grained SIMD architecture concentrates on the processing of large chunks of data
with one command. The tasks are performed in larger batches and although the level of
control that is achievable is lower than in the fine-grained approach, the throughput may be
higher.
A straightforward way to distribute the work involved across workers, i.e., processors, is to
assign a part of the grid to each worker. This is called domain decomposition.
Domain decomposition for N workers subdivides the computational domain into N
subdomains.
 On a shared-memory parallel computer, all grid sites in all domains can be updated
before the processors have to synchronize at the end of the sweep.
 On a distributed-memory system, updating the boundary sites of one domain requires
data from one or more adjacent domains.
 Therefore, before a domain update, all boundary values needed for the upcoming
sweep must be communicated to the relevant neighboring domains. In order to store
this data, each domain must be equipped with some extra grid points, the so-called
halo or ghost layers.
After the exchange, each domain is ready for the next sweep. The whole parallel algorithm is
completely equivalent to purely serial execution.

First and foremost, the computational effort should be equal for all domains to prevent some
workers from idling while others still update their own domains. This is called load
balancing.

Domain decomposition has the attractive property that the domain boundary area grows more
slowly than the domain volume if the problem size increases with N constant.

Therefore, one can sometimes alleviate communication bottlenecks just by choosing a
larger problem size. This also governs the expected effects of scaling the problem size and/or
the number of workers under optimal domain decomposition in three dimensions.
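As a sketch of how such a decomposition could be set up, the following hypothetical helper (all names
are illustrative) assigns a contiguous block of rows of an N-row grid to each of W workers; the
halo/ghost rows of a block would be the rows just outside it.

void domain_bounds(int N, int W, int worker, int *row_start, int *row_end)
{
    /* Block distribution: the first N%W workers get one extra row. */
    int base = N / W;
    int rem  = N % W;
    *row_start = worker * base + (worker < rem ? worker : rem);
    *row_end   = *row_start + base + (worker < rem ? 1 : 0);  /* exclusive */
    /* The halo/ghost rows (if they exist) are *row_start - 1 and *row_end. */
}

On a distributed-memory system each worker would additionally allocate these halo rows and exchange
them with its neighbors before every sweep.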

1.2 Functional parallelism


Functional parallelism is a method of parallelizing computer code across multiple processors.
It's also known as task parallelism or control parallelism.

In functional parallelism, tasks are distributed across different processors so that they can be
performed concurrently. This is different from data parallelism, which involves running the
same task on different data components.

Functional parallelism can be used to:


 Improve performance: By using more resources and reducing task wait time
 Improve scalability: By distributing the workload across multiple machines and
networks
 Improve modularity: By separating computation and coordination concerns

Sometimes the solution of a “big” numerical problem can be split into more or
less disparate subtasks, which work together by data exchange and synchronization.
In this case, the subtasks execute completely different code on different data items,
which is why functional parallelism is also called MPMD (Multiple Program Multiple
Data).
Example: Master-worker scheme

Reserving one compute element for administrative tasks while all others solve the actual problem is
called the master-worker scheme. The master distributes work and collects results.

A typical example is parallel ray tracing: if all compute elements have a copy of the scene, all pixels are
independent and can be computed in parallel. Due to efficiency concerns, the picture is usually divided into
"work packages" (rows or tiles).

In case of a distributed-memory system, the finished tile must also be communicated over the
network.

A drawback of the master-worker scheme is the potential communication and performance bottleneck
that may appear with a single master when the number of workers is large.

Example: Functional decomposition

Functional decomposition is a method of breaking down a complex process into smaller, more
manageable parts. It can be used to introduce parallelism, which is the process of executing multiple
parts of a program simultaneously.

How functional decomposition works

 Identify the distinct computations that make up a process


 Break down the process into smaller units that can be executed independently
 Assign each unit a pure function
 Execute each unit on a different processor, as sketched below
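The steps above can be sketched, for example, with the OpenMP sections construct (introduced later in
this unit); the two functions are hypothetical subtasks chosen only for illustration.

#include <stdio.h>
#include <omp.h>

/* Hypothetical, independent subtasks of a larger problem. */
static void simulate_flow(void)        { printf("flow step\n"); }
static void update_visualization(void) { printf("visualization step\n"); }

int main(void)
{
    /* Each section is a distinct task, executed by a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        simulate_flow();

        #pragma omp section
        update_visualization();
    }
    return 0;
}

With at least two threads in the team, the two subtasks run concurrently; with a single thread they are
simply executed one after the other.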

2. Parallel scalability

Parallel scalability is a measure of how well a parallel system can increase speed as more processors are
added. It can also refer to how well a system can maintain efficiency as the number of processors increases.

2.1 Factors that limit parallel execution

Finding parallelism is a common problem not only in computing but also in many other areas like
manufacturing, traffic flow and even business processes. In a very simplistic view, all execution units
(workers, assembly lines, waiting queues, CPUs, ...) execute their assigned work in exactly the same amount
of time. Under such conditions, using N workers, a problem that takes a time T to be solved sequentially
will now ideally take only T/N. We call this a speedup of N.
Whatever parallelization scheme is chosen, this perfect picture will most probably not hold in reality. Some
of the reasons for this have already been mentioned above: Not all workers might execute their tasks in the
same amount of time because the problem was not (or could not) be partitioned into pieces of equal
complexity. Hence, there are times when all but a few have nothing to do but wait for the latecomers to
arrive.

This load imbalance hampers performance because some resources are underutilized. Moreover there might
be shared resources like, e.g., tools that only exist once but are needed by all workers. This will effectively
serialize part of the concurrent execution.

And finally, the parallel workflow may require some communication between workers, adding overhead that
would not be present in the serial case. All these effects can impose limits on speedup.

How well a task can be parallelized is usually quantified by some scalability metric. Using such metrics,
one can answer questions like:

• How much faster can a given problem be solved with N workers instead of one?
• How much more work can be done with N workers instead of one?
• What impact do the communication requirements of the parallel application have on performance and
scalability?
• What fraction of the resources is actually used productively for solving the problem?

2.2 Scalability metrics

In order to be able to define scalability we first have to identify the basic measurements on which derived
performance metrics are built. In a simple model, the overall problem size ("amount of work")
shall be s + p = 1, where s is the serial (nonparallelizable) part and p is the perfectly parallelizable fraction.

There can be many reasons for a nonvanishing serial part:

• Algorithmic limitations. Operations that cannot be done in parallel because of, e.g., mutual
dependencies, can only be performed one after another, or even in a certain order.

• Bottlenecks. Shared resources are common in computer systems: Execution units in the core, shared
paths to memory in multicore chips, I/O devices. Access to a shared resource serializes execution.
Even if the algorithm itself could be performed completely in parallel, concurrency may be limited
by bottlenecks.

• Startup overhead. Starting a parallel program, regardless of the technical details, takes time. Of
course, system designs try to minimize startup time, especially in massively parallel systems, but
there is always a nonvanishing serial part. If a parallel application’s overall runtime is too short,
startup will have a strong impact.

• Communication. Fully concurrent communication between different parts of a parallel system cannot
be taken for granted. If solving a problem in parallel requires communication, some serialization
is usually unavoidable.

First we assume a fixed problem, which is to be solved by N workers. We normalize the single-
worker (serial) runtime to one:
T_s^f = s + p = 1.
Solving the same problem on N workers will require a runtime of
T_p^f = s + p/N.

This is called strong scaling because the amount of work stays constant no matter how many
workers are used. Here the goal of parallelization is minimization of time to solution for a given
problem.

If time to solution is not the primary objective because larger problem sizes (for which available
memory is the limiting factor) are of interest, it is appropriate to scale the problem size with some
power of N so that the total amount of work is s + p*N^a, where a is a positive but otherwise free
parameter. Here we use the implicit assumption that the serial fraction s is a constant. We define the
serial runtime for the scaled (variably-sized) problem as
T_s^v = s + p*N^a.

Consequently, the parallel runtime is
T_p^v = s + p*N^(a-1).

The term weak scaling has been coined for this approach, although it is commonly
used only for the special case a = 1.

2.3 Simple scalability laws

Application speedup can be defined as the quotient of parallel and serial performance for fixed problem size.
In the following we define "performance" as "work over time," unless otherwise noted. Serial performance
for fixed problem size (work) s + p is thus P_s^f = (s + p)/T_s^f = 1, and parallel performance is
P_p^f = (s + p)/T_p^f = 1/(s + (1 - s)/N). The speedup is therefore
S^f = P_p^f / P_s^f = 1/(s + (1 - s)/N),
which is exactly Amdahl's Law.
Amdahl's law and Gustafson's law are both used to analyze how much speedup can be achieved when
solving problems using multiple processors. They are primarily used in the field of Parallel and Distributed
Computing (PDC).

Amdahl's law
• Assumes that the computing requirements remain the same when processing power increases
• Applies when the problem size is fixed
• States that the potential speedup is limited by the fraction of the algorithm that cannot be accelerated
(the serial fraction)

Amdahl's Law is a principle used in parallel computing to predict the theoretical maximum speedup of a task
when parts of it are parallelized. It highlights the limits of parallelization and shows how the non-
parallelizable portion of a task constrains overall performance improvement.
Formula
Amdahl's Law is expressed as:

S(N) = 1 / ((1 - P) + P/N)

where P = 1 - s is the parallelizable fraction of the work and N is the number of processors.

Practical Implications

Amdahl's Law helps developers and architects:

• Determine the potential benefits of parallelizing a task.


• Evaluate the trade-offs when designing parallel systems.
• Realize the importance of minimizing the sequential portion of computations to maximize
performance.

Gustafson’s Law
Gustafson’s Law is a principle in parallel computing that complements Amdahl’s Law by providing a
different perspective on the scalability of parallel systems. It argues that as the number of processors
increases, the problem size can also grow, allowing the parallelizable portion of the workload to dominate,
thereby achieving greater speedup.

Formula
Gustafson's Law is expressed as:

S(N) = N - (1 - P)(N - 1)

where P is the parallelizable fraction and N is the number of processors.

Key Insights

• Problem Size Scaling: Unlike Amdahl’s Law, which assumes a fixed workload, Gustafson’s Law
assumes that the problem size increases with the number of processors. This makes it particularly
useful for large-scale computations.

• Practical Speedup: It highlights that with a sufficiently large problem size, the parallelizable portion
can outweigh the sequential part, leading to near-linear speedup.

• Focus on Scalability: Gustafson’s Law suggests that rather than being limited by the sequential
portion, systems can achieve significant speedup by scaling the workload.

Practical Implications

Gustafson’s Law is particularly relevant in:

• High-performance computing, where workloads can be scaled to maximize resource utilization.


• Applications where the problem size (e.g., simulations, data processing) grows with the system's
computational power.
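To make the two laws concrete, the following small C program (a sketch; the serial fraction s = 0.05 and
the processor counts are arbitrary example values) evaluates both formulas given above:

#include <stdio.h>

/* Predicted speedups for serial fraction s (= 1 - P) and N workers. */
static double amdahl(double s, int N)    { return 1.0 / (s + (1.0 - s) / N); }
static double gustafson(double s, int N) { return N - s * (N - 1); }

int main(void)
{
    double s = 0.05;                       /* 5% serial fraction (example) */
    for (int N = 1; N <= 1024; N *= 4)
        printf("N=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               N, amdahl(s, N), gustafson(s, N));
    return 0;
}

With a 5% serial fraction, Amdahl's prediction saturates near 1/s = 20, while Gustafson's prediction grows
almost linearly with N, reflecting the different assumptions behind the two laws.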

2.4 Parallel efficiency


Parallel efficiency is a metric used in parallel computing to measure how effectively a parallel system
utilizes its resources. It quantifies the ratio of achieved speedup to the ideal speedup, E = S/N
(equivalently, E = T_1 / (N * T_N)), providing insight into how well processors are performing relative to
their potential.

Key Factors Affecting Parallel Efficiency

 Communication Overhead: Processors need to communicate and synchronize, which can reduce
efficiency.
 Load Balancing: Uneven distribution of work among processors leads to idle time and lower
efficiency.
 Sequential Portion: The non-parallelizable portion of the workload (as per Amdahl's Law) limits
efficiency.
 Granularity: Finer tasks may result in higher overhead, while coarser tasks may better utilize
processors.
 Scalability: Increasing problem size relative to the number of processors can improve efficiency.
2.5 Serial performance versus strong scalability

"Serial performance" refers to the execution speed of a single-threaded program on a single processor,
while "strong scalability" describes how well a program's performance improves when the problem size
remains constant but the number of processors used increases,

Example:

 Scenario: Imagine an image processing algorithm where a large portion of the image can be
processed independently on different cores (parallelizable), but a small part requires sequential
processing (serial bottleneck).
o Serial Performance: This would be the time it takes to process the entire image using only
one core, including the time spent on the serial part.
o Strong Scalability: If you add more cores, the parallelizable parts of the image processing
will get significantly faster, but the overall speedup will be limited by the time spent on the
serial section, which cannot be parallelized.

1. Serial Performance

Definition:

Serial performance refers to the efficiency and speed of executing a task on a single processor (or core)
without any parallelization. It measures the baseline performance of the algorithm or system in a non-
parallel environment.

Key Characteristics:

 Dependent on Hardware: Affected by processor speed, memory access, and other hardware factors.
 Algorithm Design: Influenced by the efficiency of the algorithm and the implementation.
 Impact on Parallel Performance: The faster the serial performance, the less time is spent on non-
parallelizable portions, improving parallel speedup as per Amdahl’s Law.
Optimization Focus:

 Reduce computational complexity.


 Optimize memory access patterns.
 Improve compiler optimizations.
2. Strong Scalability

Definition:
Strong scalability refers to the ability of a parallel system to achieve improved performance (speedup) as the
number of processors increases, while keeping the total problem size fixed.

Key Characteristics:

 Fixed Problem Size: The workload does not change as processors are added.
 Diminishing Returns: Speedup is limited by the sequential portion of the task (per Amdahl’s Law).
 Efficiency Decreases with Processors: Communication and synchronization overheads grow as more
processors are used.
Challenges:

 Load balancing becomes harder with more processors.


 Non-parallelizable portions and communication overheads limit speedup.

2.6 Refined performance models

Refined performance models in parallel computing aim to provide a more detailed and accurate
understanding of system performance by accounting for real-world factors beyond simple speedup
equations. These models extend basic metrics like Amdahl’s Law and Gustafson’s Law to incorporate
communication overhead, load imbalance, memory access patterns, and architectural constraints.
Communication Overhead

Parallel systems require processors to exchange data. The communication cost increases with:

 The number of processors.


 The size of the data being transferred.
 Network latency and bandwidth.

A refined model might, for example, add a communication cost c(N) to the parallel runtime, giving a
strong-scaling runtime of the form T_p(N) = s + p/N + c(N), where c(N) grows with the number of
processors and the amount of data exchanged.

Refined Performance Model Example
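The original worked example is not reproduced in these notes; the following is only a sketch, under the
assumption of a communication cost that grows linearly with the number of processors (the coefficients
c0 and c1 are hypothetical, not measured values):

#include <stdio.h>

/* Refined strong-scaling model: serial part s, parallel part (1-s)/N,
 * plus an assumed communication term c(N) = c0 + c1*N.             */
static double runtime(double s, int N, double c0, double c1)
{
    return s + (1.0 - s) / N + c0 + c1 * N;
}

int main(void)
{
    double s = 0.02, c0 = 0.001, c1 = 0.0005;   /* hypothetical values */
    for (int N = 1; N <= 256; N *= 2)
        /* Speedup relative to the normalized serial runtime s + p = 1. */
        printf("N=%3d  T=%.4f  speedup=%.2f\n",
               N, runtime(s, N, c0, c1), 1.0 / runtime(s, N, c0, c1));
    return 0;
}

In such a model the speedup reaches a maximum at a finite N and then degrades again, because the
communication term eventually dominates the runtime.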


Advantages of Refined Models
 Real-World Accuracy: Accounts for practical constraints like communication delays, load imbalance,
and memory access inefficiencies.
 Insightful Design: Helps identify bottlenecks (e.g., communication, synchronization) for
optimization.
 Applicability to Complex Systems: Useful for modern architectures such as GPUs, clusters, and
NUMA systems.

2.7 Choosing the right scaling baseline

The scaling baseline determines the starting point for evaluating how a system performs as resources (e.g.,
processors, memory, or problem size) are scaled. Selecting the appropriate baseline is crucial for accurately
analyzing and interpreting performance metrics like speedup, efficiency, and scalability.
Types of Scaling Baselines

Serial Baseline

 Definition: Uses the execution time of the task on a single processor as the baseline.
 Use Case: Common in theoretical analyses (e.g., Amdahl’s Law) and when emphasizing the benefits
of parallelization compared to a serial implementation.
Advantages:
 Straightforward to compute.
 Highlights the overall performance improvement due to parallelization.
Challenges:
 May overestimate the benefits if the serial implementation is inefficient or unrealistic.
 Does not always reflect real-world scenarios where tasks start with multiple processors.

Parallel Baseline

 Definition: Uses the performance of the task on a specific number of processors as the baseline (e.g.,
N=1, N=2, or N=8).
 Use Case: Relevant for applications already designed for parallel execution.
Advantages:
 Reflects realistic starting conditions for parallel systems.
 Allows for relative comparisons across configurations.
Challenges:
 Requires careful selection of the baseline processor count to ensure consistency.

Algorithm-Dependent Baseline

 Definition: Uses the best-known implementation or a reference algorithm as the baseline, regardless
of the processor count.
 Use Case: Useful for benchmarking and algorithm comparisons.
Advantages:
 Focuses on improving state-of-the-art solutions.
Challenges:
 Requires a well-established reference algorithm.
Strong Scalability Baseline

 Definition: Keeps the problem size fixed and evaluates performance as the number of processors
increases.
 Use Case: Highlights how well a system can utilize additional resources for a fixed workload.
Advantages:
 Provides insights into the limits of parallelization.
Challenges:
 Limited by Amdahl’s Law (sequential bottlenecks) and overhead.

Weak Scalability Baseline

 Definition: Increases the problem size proportionally with the number of processors while keeping
the workload per processor constant.
 Use Case: Evaluates how systems handle larger workloads with increased resources.
Advantages:
 Reflects real-world scenarios for large-scale computations.
 Aligned with Gustafson’s Law, showing scalability with workload growth.
Challenges:
 Assumes problem sizes can grow indefinitely.

2.8 Case study: Can slower processors compute faster?

Scalability is a critical concept in parallel computing, describing how well a system can handle increased
resources. Different scaling approaches—strong scaling, strictly weak scaling, and modified weak scaling—
evaluate performance under varying conditions. Below, we explore these concepts with definitions,
formulas, and examples.
1. Strong Scaling

Definition:

Strong scaling examines performance improvement as the number of processors increases while keeping the
problem size fixed.

Formula:

Speedup S(N) = T(1) / T(N), where T(1) is the runtime on one processor and T(N) the runtime on N
processors for the same (fixed) problem size.

Key Characteristics:
 Emphasizes parallel efficiency for a fixed workload.
 Limited by Amdahl’s Law, as the non-parallelizable portion (sequential bottleneck) becomes
dominant with increasing processors.
 Communication and synchronization overheads grow with more processors.
Example:

A task takes 100 seconds on 1 processor and 20 seconds on 5 processors:
Speedup S(5) = 100 / 20 = 5, and parallel efficiency E = S/N = 5/5 = 100%, i.e., ideal strong scaling.

2. Strictly Weak Scaling

Definition:

Strictly weak scaling measures performance when the problem size increases linearly with the number of
processors, ensuring that the workload per processor remains constant.

Formula:

With the total problem size scaled as W(N) = N * W(1) (constant work per processor), the weak-scaling
efficiency is E(N) = T(1) / T(N); in the ideal case T(N) stays equal to T(1), so E(N) = 1.

3. Modified Weak Scaling

Definition:

Modified weak scaling evaluates performance when the problem size increases sub-linearly or super-linearly
with the number of processors, meaning the workload per processor changes.

Formula:

The total problem size is scaled by a factor k that differs from N (for example W(N) = k * W(1) with
k < N for sub-linear growth), so the work per processor, W(N)/N, changes as processors are added.

Key Characteristics:
 Allows flexibility in problem size growth, accommodating real-world scenarios where scaling may
not be strictly linear.
 Useful for applications with irregular workloads or those dependent on external constraints.
Example:

If the problem size grows sub-linearly (k < N), processors may become underutilized, highlighting
inefficiencies.
2.9 Load imbalance

A load imbalance is the uneven distribution of work across workers.

Load imbalance occurs when work is unevenly distributed across processors or computing units in a
parallel system. This results in some processors being underutilized (idle) while others are still performing
computations, reducing overall efficiency and performance.

OS jitter
OS jitter refers to performance variability in parallel computing systems caused by operating system (OS)
activities that disrupt or interfere with the execution of user applications. These interruptions can degrade
performance and scalability, especially in tightly coupled, high-performance computing environments.
Sources of OS Jitter
1. Kernel Interrupts
Examples: Timer ticks, context switching, I/O handling.
2. Daemon Processes
Background processes or services (e.g., logging, monitoring, or security scans) that
consume CPU cycles.
3. Resource Contention
Competition for shared system resources like memory, cache, or I/O bandwidth among user
applications and OS tasks.
4. Hardware Management
System-level tasks like power management, thermal throttling, or hardware interrupts.
5. Network Stack Activity
Handling network communication, especially in distributed computing.
Impact of OS Jitter
1. Performance Degradation
Jitter causes delays in computations, increasing runtime variability and reducing efficiency.
2. Scalability Challenges
In large-scale HPC systems, even small jitter effects on individual nodes can amplify, leading to
significant slowdowns.
3. Load Imbalance
Some processors may experience more OS interruptions than others, causing uneven work
distribution.
4. Increased Latency in Synchronization
Barrier synchronization and other collective operations are delayed if even one processor is
interrupted.
Example in Practice

Scenario:

In an MPI-based HPC application running on 1,024 nodes, a periodic OS daemon triggers disk I/O on
random nodes. This causes:

 Some nodes to take longer at synchronization points.


 The whole system's performance to degrade due to waiting at barriers.
Solution:

 Isolate OS daemons to specific cores or nodes.


 Use a lightweight OS kernel to minimize background noise.
 Employ MPI task mapping to avoid noisy nodes.

Shared memory parallel programming with OpenMP


Define OpenMP:
OpenMP, or Open Multi-Processing, is a programming interface (API) that helps programmers create
parallel programs. It's used to develop multi-threaded programs in C, C++, and Fortran.
How it works
 OpenMP uses a fork-join model to execute parallel code.
 A single master thread executes sequentially until it reaches a parallel region.
 The master thread then creates a team of parallel threads to execute the parallel region.
 Once the parallel region is complete, the parallel threads synchronize and terminate.
 The master thread then resumes execution sequentially.
What it's used for
 OpenMP can be used to create code that runs on multiple cores or threads.
 It can be used on many platforms, including Linux, macOS, Windows, and Solaris.
Shared memory

All processors can directly access all data in a shared memory; there is no need for explicit communication
between the processors.
OpenMP: A parallel programming standard for shared-memory parallel computers
 A set of compiler directives (with additional clauses)
 A small number of library functions
 A few environment variables
Threads in OpenMP
 The central execution entities in an OpenMP program are threads—lightweight processes.
 The OpenMP threads share a common address space and mutually access data.
 Spawning a thread is much less costly than forking a new process, because threads share everything
except the instruction pointer, stack pointer and register state.
 If desired, each thread can have a few "private variables" (by means of the local stack
pointer).

Parallel execution : “Fork-join” model

 In any OpenMP program, a single thread, called master thread, runs immediately after startup.
 The master thread can spawn (also called fork) a number of additional threads when entering a so-
called parallel region.
 Inside a parallel region, the master thread and the spawned threads execute instruction streams
concurrently.
 Each thread has a unique ID.
 Different threads work on different parts of the shared data or carry out different tasks.
 OpenMP has compiler directives for dividing the work among the threads.
 At the end of a parallel region, the threads are joined: all the threads are terminated except the master
thread.
 There can be multiple parallel regions in an OpenMP program.
We will now learn how to create a parallel Hello World program using OpenMP.

1. Include the header file: We have to include the OpenMP header for our program along with the
standard header files.

//OpenMP header
#include <omp.h>

2. Specify the parallel region: In OpenMP, we mark the region that is to be executed in parallel using the
directive #pragma omp parallel. This directive forks additional threads to carry out the work enclosed
in the parallel region. The original thread is denoted as the master thread, with thread ID 0. Code for
creating a parallel region would be,

#pragma omp parallel


{
//Parallel region code
}
So, here we include

#pragma omp parallel


{
printf("Hello World... from thread = %d\n",
omp_get_thread_num());
}
Program to print Hello World using OpenMP
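The complete listing is not reproduced in the source notes; a minimal complete version would look like
this:

// OpenMP "Hello World": every thread of the team prints its own ID.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("Hello World... from thread = %d\n",
               omp_get_thread_num());
    }
    return 0;
}

Compiling with OpenMP support (e.g., gcc -fopenmp hello.c -o hello) and running the program prints
one line per thread; the number of threads can be controlled with the OMP_NUM_THREADS
environment variable.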

The Components of the OpenMP API

The OpenMP API has three primary components:

 Compiler directives. These are preprocessor directives that can be used by programmers to define and
control parallel regions of code.
 Runtime library routines. These are functions from the OpenMP library that can be called by the
program and are then linked into the program;
 Environment variables. These can be used to control the behavior of OpenMP programs.
“Manual” loop parallelization
Now, let’s try to “manually” parallelize a for-loop, that is, divide the iterations evenly among the threads.
Example:
for (i=0; i<N; i++)
a[i] = b[i] + c[i];
Important observation: Assuming that the arrays a, b and c do not overlap (that is, no aliasing), then the
iterations of this for-loop are independent of each other, thus safe to be executed by multiple threads
concurrently.

How to divide the iterations evenly among the threads?

Given num_threads as the total number of threads, one way to divide N iterations for thread with ID
thread_id is as follows:

int blen, bstart;

/* Every thread gets either N/num_threads or N/num_threads + 1 iterations;
   the first N%num_threads threads take the larger share. */
blen = N / num_threads;
if (thread_id < (N % num_threads)) {
    blen = blen + 1;
    bstart = blen * thread_id;
} else {
    bstart = blen * thread_id + (N % num_threads);
}

OpenMP coding

#pragma omp parallel
{
    int num_threads, thread_id;
    int blen, bstart, bend, i;

    /* Query the team size and this thread's ID. */
    num_threads = omp_get_num_threads();
    thread_id = omp_get_thread_num();

    /* Compute this thread's share of the N iterations. */
    blen = N / num_threads;
    if (thread_id < (N % num_threads)) {
        blen = blen + 1;
        bstart = blen * thread_id;
    } else {
        bstart = blen * thread_id + (N % num_threads);
    }
    bend = bstart + blen;

    /* Each thread works on its own chunk of the arrays. */
    for (i = bstart; i < bend; i++)
        a[i] = b[i] + c[i];
}

Data Scoping

 Any variables that existed before a parallel region still exist inside the parallel region, and are by
default shared between all threads.
 Often it will be necessary for the threads to have some private variables.
 Each thread can either declare new local variables inside the parallel region, these variables
are private “by birth”;
 Or, each thread can “privatize” some of the shared variables that already existed before a
parallel region (using the private clause):
int blen, bstart, bend;
#pragma omp parallel private(blen, bstart, bend)
{
// ...
}
 Each “privatized” variable has one (uninitialized) instance per thread;
 The private variables’ scope is until the end of the parallel region.
 Parallelizing for-loops is OpenMP’s main work-sharing mechanism.
 OpenMP has several built-in strategies for dividing the iterations among the threads.
 No need to manually calculate each thread’s loop bounds

OpenMP worksharing for loops

Being the omnipresent programming structure in scientific codes, loops are natural candidates for
parallelization if individual iterations are independent. As an example, consider a parallel version of a simple
program for the calculation of π by integration:
A simple program for numerical integration of a function in OpenMP.
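The listing itself is not reproduced in the source notes. The following is a sketch of what such a program
might look like, integrating 4/(1+x^2) over [0,1] with the midpoint rule and accumulating the result in a
shared variable approx_pi; the names and the critical-section update are chosen to match the discussion
below.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 1000000;          /* number of sub-intervals   */
    const double dx = 1.0 / n;       /* width of one sub-interval */
    double approx_pi = 0.0;          /* shared result variable    */

    #pragma omp parallel
    {
        double local_sum = 0.0;      /* private partial sum */

        /* Midpoint rule for the integral of 4/(1+x^2) on [0,1]. */
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (i + 0.5) * dx;
            local_sum += 4.0 / (1.0 + x * x);
        }

        /* Only one thread at a time may update the shared variable. */
        #pragma omp critical
        approx_pi += local_sum * dx;
    }

    printf("approx_pi = %.12f\n", approx_pi);
    return 0;
}

Each thread accumulates into a private local_sum and enters the critical region only once, so the
serialization overhead stays small.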

Synchronization

Critical regions

 Concurrent write accesses to a shared variable must be avoided by all means to circumvent
race conditions.
 An OpenMP critical code block is executed by one thread at a time. This is one way to avoid
race conditions. (There are other ways.)
 The variable approx_pi in the above example is a shared variable, to which all the threads will
write. Thus, “protection” is provided by a critical code block.
 Use of the critical directive will incur overhead.
 Improper use of the critical directive may lead to deadlock.

Barriers

If, at a certain point in the parallel execution, it is necessary to synchronize all threads, a BARRIER can be
used:

!$OMP BARRIER        (Fortran syntax; the C/C++ equivalent is #pragma omp barrier)
The barrier is a synchronization point, which guarantees that all threads have reached it before any thread
goes on executing the code below it. Certainly it must be ensured that every thread hits the barrier, or a
deadlock may occur.

Barriers should be used with caution in OpenMP programs, partly because of their potential to cause
deadlocks, but also due to their performance impact.

Reductions

Actually, using critical to prevent concurrent writes to the variable approx_pi is overkill. The reduction
clause of OpenMP is designed for exactly this purpose:
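A sketch of the same hypothetical pi integration, rewritten with a reduction clause instead of the critical
region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 1000000;
    const double dx = 1.0 / n;
    double approx_pi = 0.0;

    /* The reduction clause gives every thread a private copy of
     * approx_pi and adds all copies together at the end of the loop. */
    #pragma omp parallel for reduction(+:approx_pi)
    for (long i = 0; i < n; i++) {
        double x = (i + 0.5) * dx;
        approx_pi += 4.0 * dx / (1.0 + x * x);
    }

    printf("approx_pi = %.12f\n", approx_pi);
    return 0;
}

The private copies are initialized to zero for the + operator and combined automatically, so no explicit
critical region is needed.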
Loop scheduling

#pragma omp parallel for


for (i=0; i<N; i++)
a[i] = b[i] + c[i];

How are the loop iterations exactly divided among the threads?

 Mapping of loop iterations to threads is configurable in OpenMP.


 The “secret” is the schedule clause:
#pragma omp parallel for schedule(static|dynamic|guided [,chunk])
 Default scheduling is static (no need to specify), which divides the iterations into contiguous chunks
of (roughly) equal size.
 Other alternatives of scheduling: dynamic and guided
 It is possible to prescribe a chunksize in the schedule clause.
#pragma omp parallel for schedule(dynamic,3)

Tasking
 A task can be defined by OpenMP’s task directive, containing the code to be executed.
 When a thread encounters a task construct, it may execute it right away or set up the appropriate data
environment and defer its execution. The task is then ready to be executed later by any thread of the
team.

 A "single" code block in OpenMP will be entered by one thread only, namely the thread that reaches
the single directive first. All others skip the code and wait at the end of the single block due to an
implicit barrier.
 A "master" code block is only entered by the master thread; all the other threads skip over it without
waiting for the master thread to finish.
 OpenMP has a separate "barrier" directive for explicit synchronization among the threads. (Use with
care!) A typical combination of task and single is sketched below.
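In the following sketch, one thread creates the tasks inside a single block while any thread of the team
may execute them; process is a hypothetical work routine used only for illustration.

#include <stdio.h>
#include <omp.h>

/* Hypothetical unit of work. */
static void process(int item)
{
    printf("item %d done by thread %d\n", item, omp_get_thread_num());
}

int main(void)
{
    #pragma omp parallel
    {
        /* Only one thread generates the tasks ... */
        #pragma omp single
        {
            for (int i = 0; i < 8; i++) {
                /* ... but any thread of the team may execute them. */
                #pragma omp task firstprivate(i)
                process(i);
            }
        }   /* implicit barrier at the end of the single block */
    }
    return 0;
}

All tasks generated in the parallel region are guaranteed to be complete at the implicit barrier that ends
the single block, where the rest of the team waits and may help execute pending tasks.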

Case study: OpenMP-parallel Jacobi algorithm

The 2D Jacobi algorithm on an N×N lattice solves a system of equations using the Jacobi iterative method
in two dimensions. This can be used, for example, to solve partial differential equations (PDEs) on a grid.
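The full worked example is given in the reference linked below; the following is only a sketch, under
simplified assumptions (a fixed lattice size, boundary values already stored in the arrays, and a source
term f), of how one Jacobi sweep can be shared among OpenMP threads:

#include <math.h>
#include <omp.h>

#define N 512   /* interior lattice size, illustrative */

/* One Jacobi sweep: every interior point of 'unew' is the average of its
 * four neighbors in 'u' plus the source term f.  Returns the maximum
 * change, which a caller could use for a convergence test.          */
double jacobi_sweep(double u[N + 2][N + 2], double unew[N + 2][N + 2],
                    double f[N + 2][N + 2], double h)
{
    double maxdiff = 0.0;

    /* Iterations are independent because unew is written while only u
     * is read, so the loop nest can be shared among the threads.    */
    #pragma omp parallel for reduction(max:maxdiff)
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            unew[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j] +
                                 u[i][j - 1] + u[i][j + 1] +
                                 h * h * f[i][j]);
            double d = fabs(unew[i][j] - u[i][j]);
            if (d > maxdiff)
                maxdiff = d;
        }
    }
    return maxdiff;
}

In a complete solver, the caller would repeat such sweeps, swapping the roles of u and unew (or copying
unew back to u) until maxdiff falls below a prescribed tolerance.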

https://www.igpm.rwth-aachen.de/Download/ss17/EWP/parallel.pdf

Additional Resources

1. https://cse.cet.ac.in/wp-content/uploads/2019/10/ModernProcessors.pdf
2. https://www.geeksforgeeks.org/high-performance-computing/
3. https://www.studocu.com/in/document/vignan-institute-of-technology-and-science/computer-
science/hpc-unit-3/88659975
4. https://www.igpm.rwth-aachen.de/Download/ss17/EWP/parallel.pdf
Practice Questions

1. What is parallelization?
2. State Amdahl's Law.
3. Explain the key idea behind Gustafson's Law.
4. Why is load balancing important in parallel computing?
5. What are the uses of parallelization?
6. Differentiate between Amdahl's Law and Gustafson's Law.
7. Explain the factors affecting parallel efficiency.
8. Describe the concept of strong scalability with an example.
9. Discuss the importance of communication overhead in parallel systems.
10. What is parallelization? Discuss its types and applications.
11. Explain Amdahl's Law and Gustafson's Law with their formulas and implications.
12. Discuss the factors limiting scalability in parallel computing.
13. Explain data parallelism and functional parallelism with examples.
14. What are scalability metrics? Discuss their significance in parallel performance analysis.
15. What is OpenMP and what programming languages does it support?
16. What is the role of the master thread in OpenMP?
17. Define shared memory in the context of OpenMP.
18. What are the primary components of the OpenMP API?
19. What is a "critical region" in OpenMP?
20. Explain the fork-join model in OpenMP.
21. How does OpenMP handle parallel regions in a program?
22. What is the significance of the private clause in OpenMP?
23. Explain the #pragma omp parallel for directive with an example.
24. What is the difference between the static, dynamic, and guided scheduling options in OpenMP?
25. Explain the process of parallelizing a for loop in OpenMP with an example.
26. Illustrate the parallelization of the Jacobi algorithm using OpenMP for solving partial differential
equations.
