What is Parallel Computing?
Traditionally, software has been written for serial computation:
o To be run on a single computer having a single Central Processing Unit (CPU);
o A problem is broken into a discrete series of instructions.
o Instructions are executed one after another.
o Only one instruction may execute at any moment in time.
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem:
o To be run using multiple CPUs
o A problem is broken into discrete parts that can be solved concurrently
o Each part is further broken down to a series of instructions
o Instructions from each part execute simultaneously on different CPUs
The compute resources can include:
o A single computer with multiple processors;
o An arbitrary number of computers connected by a network;
o A combination of both.
The computational problem usually demonstrates characteristics such as the ability to be:
o Broken apart into discrete pieces of work that can be solved simultaneously;
o Executed as multiple program instructions at any moment in time;
o Solved in less time with multiple compute resources than with a single compute
resource.
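The decomposition described above can be sketched in a few lines of Python. This is only an illustration: the function names partial_sum and parallel_sum, and the choice of four workers, are assumptions made for the example. Threads are used to keep the sketch short; in standard CPython they show the structure of the decomposition but may not run CPU-bound parts on several CPUs at once, for which a process pool is the usual substitute.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Solve one discrete part: a series of instructions executed serially."""
    total = 0
    for value in chunk:
        total += value
    return total


def parallel_sum(data, workers=4):
    """Break the problem into parts, solve them concurrently, combine the results."""
    size = max(1, (len(data) + workers - 1) // workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, chunks))
    return sum(partials)


result = parallel_sum(list(range(1, 101)))
print(result)
```

Each chunk is a discrete piece of work, and the final sum combines the partial results.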
The Universe is Parallel:
Parallel computing is an evolution of serial computing that attempts to emulate what has always
been the state of affairs in the natural world: many complex, interrelated events happening at the
same time, yet within a sequence. For example:
o Galaxy formation
o Planetary movement
o Weather and ocean patterns
o Tectonic plate drift
o Rush hour traffic
o Automobile assembly line
o Building a space shuttle
o Ordering a hamburger at the drive through.
The Real World is Massively Parallel
Uses for Parallel Computing:
Historically, parallel computing has been considered to be "the high end of computing",
and has been used to model difficult scientific and engineering problems found in the real
world. Some examples:
o Atmosphere, Earth, Environment
o Physics - applied, nuclear, particle, condensed matter, high pressure, fusion,
photonics
o Bioscience, Biotechnology, Genetics
o Chemistry, Molecular Sciences
o Geology, Seismology
o Mechanical Engineering - from prosthetics to spacecraft
o Electrical Engineering, Circuit Design, Microelectronics
o Computer Science, Mathematics
Today, commercial applications provide an equal or greater driving force in the
development of faster computers. These applications require the processing of large
amounts of data in sophisticated ways. For example:
o Databases, data mining
o Oil exploration
o Web search engines, web based business services
o Medical imaging and diagnosis
o Pharmaceutical design
o Management of national and multi-national corporations
o Financial and economic modeling
o Advanced graphics and virtual reality, particularly in the entertainment industry
o Networked video and multi-media technologies
o Collaborative work environments
Why Use Parallel Computing?
Main Reasons:
Save time and/or money: In theory, throwing more resources at a task will shorten its time
to completion, with potential cost savings. Parallel clusters can be built from cheap,
commodity components.
Solve larger problems: Many problems are so large and/or complex that it is impractical
or impossible to solve them on a single computer, especially given limited computer
memory. For example:
o "Grand Challenge" (en.wikipedia.org/wiki/Grand_Challenge) problems requiring
PetaFLOPS and PetaBytes of computing resources.
o Web search engines/databases processing millions of transactions per second
Provide concurrency: A single compute resource can only do one thing at a time.
Multiple computing resources can be doing many things simultaneously. For example, the
Access Grid (www.accessgrid.org) provides a global collaboration network where people
from around the world can meet and conduct work "virtually".
Use of non-local resources: Using compute resources on a wide area network, or even the
Internet when local compute resources are scarce. For example:
o SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute
power of over 528 TeraFLOPS (as of August 4, 2008)
o Folding@home (folding.stanford.edu) uses over 340,000 computers for a compute
power of 4.2 PetaFLOPS (as of November 4, 2008)
Limits to serial computing: Both physical and practical reasons pose significant
constraints to simply building ever faster serial computers:
o Transmission speeds - the speed of a serial computer is directly dependent upon
how fast data can move through hardware. Absolute limits are the speed of light (30
cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond).
Increasing speeds necessitate increasing proximity of processing elements.
o Limits to miniaturization - processor technology is allowing an increasing number
of transistors to be placed on a chip. However, even with molecular or atomic-level
components, a limit will be reached on how small components can be.
o Economic limitations - it is increasingly expensive to make a single processor
faster. Using a larger number of moderately fast commodity processors to achieve
the same (or better) performance is less expensive.
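The transmission-speed limit can be made concrete with a little arithmetic. The sketch below uses the two speeds quoted above; the clock rates (1, 3 and 10 GHz) and the function name distance_per_cycle_cm are assumptions chosen only for illustration.

```python
# Signal speeds quoted in the text, in cm per nanosecond.
SPEED_OF_LIGHT_CM_PER_NS = 30.0
COPPER_LIMIT_CM_PER_NS = 9.0


def distance_per_cycle_cm(speed_cm_per_ns, clock_ghz):
    """Farthest a signal can travel during one clock cycle."""
    cycle_ns = 1.0 / clock_ghz  # a 1 GHz clock has a 1 ns cycle
    return speed_cm_per_ns * cycle_ns


for ghz in (1.0, 3.0, 10.0):  # illustrative clock rates, not taken from the text
    light = distance_per_cycle_cm(SPEED_OF_LIGHT_CM_PER_NS, ghz)
    copper = distance_per_cycle_cm(COPPER_LIMIT_CM_PER_NS, ghz)
    print(f"{ghz:4.0f} GHz: light {light:5.1f} cm, copper {copper:4.1f} cm per cycle")
```

At 3 GHz a signal in copper covers at most 3 cm per cycle, which is why faster clocks force processing elements closer together.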
Current computer architectures are increasingly relying upon hardware level parallelism to
improve performance:
o Multiple execution units
o Pipelined instructions
o Multi-core
RAM model
The Random Access Machine (RAM) is a favorite model of a sequential computer. Its main features are:
1. Computation unit with a user defined program.
2. Read-only input tape and write-only output tape.
3. Unbounded number of local memory cells.
4. Each memory cell is capable of holding an integer of unbounded size.
5. Instruction set includes operations for moving data between memory cells, comparisons and
conditional branches, and simple arithmetic operations.
6. Execution starts with the first instruction and ends when a HALT instruction is executed.
7. All operations take unit time regardless of the lengths of operands.
8. Time complexity = the number of instructions executed.
9. Space complexity = the number of memory cells accessed.
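The features listed above can be sketched as a tiny interpreter. The instruction names (READ, WRITE, SET, ADD, SUB, JZERO, JUMP, HALT) and the function run_ram are assumptions invented for this illustration; the model itself does not fix a concrete instruction set.

```python
from collections import defaultdict


def run_ram(program, input_tape):
    """Run a RAM program; return (output tape, time, space)."""
    memory = defaultdict(int)   # unbounded cells holding unbounded integers
    touched = set()             # cells accessed, for space complexity
    tape = iter(input_tape)     # read-only input tape
    output = []                 # write-only output tape
    pc = steps = 0              # execution starts with the first instruction

    def cell(address):
        touched.add(address)
        return address

    while True:
        op, *args = program[pc]
        steps += 1              # every operation takes unit time
        pc += 1
        if op == "HALT":
            return output, steps, len(touched)
        if op == "READ":        # input tape -> memory cell
            memory[cell(args[0])] = next(tape)
        elif op == "WRITE":     # memory cell -> output tape
            output.append(memory[cell(args[0])])
        elif op == "SET":       # store a constant in a cell
            memory[cell(args[0])] = args[1]
        elif op == "ADD":       # dst = a + b
            memory[cell(args[0])] = memory[cell(args[1])] + memory[cell(args[2])]
        elif op == "SUB":       # dst = a - b
            memory[cell(args[0])] = memory[cell(args[1])] - memory[cell(args[2])]
        elif op == "JZERO":     # conditional branch: jump if the cell holds 0
            if memory[cell(args[0])] == 0:
                pc = args[1]
        elif op == "JUMP":      # unconditional branch
            pc = args[0]
        else:
            raise ValueError(f"unknown instruction: {op}")


# Sum n numbers: the input tape holds n followed by the n values.
SUM_PROGRAM = [
    ("READ", 0),        # 0: cell 0 = n
    ("SET", 1, 0),      # 1: cell 1 = running sum
    ("SET", 3, 1),      # 2: cell 3 = constant 1
    ("JZERO", 0, 8),    # 3: if n == 0, go to the output step
    ("READ", 2),        # 4: cell 2 = next value
    ("ADD", 1, 1, 2),   # 5: sum += value
    ("SUB", 0, 0, 3),   # 6: n -= 1
    ("JUMP", 3),        # 7: repeat
    ("WRITE", 1),       # 8: output the sum
    ("HALT",),          # 9: stop
]

output, time_steps, space_cells = run_ram(SUM_PROGRAM, [3, 10, 20, 30])
print(output, time_steps, space_cells)
```

For the input 3, 10, 20, 30 the program writes 60, executes 21 instructions (its time complexity) and accesses 4 memory cells (its space complexity).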
PRAM model
The Parallel Random Access Machine (PRAM) is a straightforward and natural generalization of RAM.
It is an idealized model of a shared memory SIMD machine. Its main features are:
1. Unbounded collection of numbered RAM processors P0, P1, P2,... (without tapes).
2. Unbounded collection of shared memory cells M[0], M[1], M[2],....
3. Each Pi has its own (unbounded) local memory (registers) and knows its index i.
4. Each processor can access any shared memory cell (unless there is an access conflict, see
further) in unit time.
5. Input of a PRAM algorithm consists of n items stored in (usually the first) n shared memory
cells.
6. Output of a PRAM algorithm consists of n' items stored in n' shared memory cells.
7. PRAM instructions execute in 3-phase cycles.
1. Read (if any) from a shared memory cell.
2. Local computation (if any).
3. Write (if any) to a shared memory cell.
8. Processors execute these 3-phase PRAM instructions synchronously.
9. Special assumptions have to be made about R-R and W-W shared memory access conflicts.
10. The only way processors can exchange data is by writing into and reading from memory cells.
11. P0 has a special activation register specifying the maximum index of an active processor.
Initially, only P0 is active; it computes the number of required active processors and loads this
register, and then the other corresponding processors start executing their programs.
12. Computation proceeds until P0 halts, at which time all other active processors are halted.
13. Parallel time complexity = the time elapsed for P0's computation.
14. Space complexity = the number of shared memory cells accessed.
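The 3-phase synchronous cycle can be sketched as follows. The function pram_cycle and its parameters are hypothetical names for this illustration, and the sketch assumes exclusive writes (a W-W conflict raises an error); other conflict rules are possible.

```python
def pram_cycle(shared, num_procs, read_addr, compute, write_addr):
    """Run one synchronous PRAM instruction on processors P0..P(num_procs-1).

    read_addr(i) and write_addr(i) give the shared cell used by Pi (or None);
    compute(i, value) is Pi's local computation.
    """
    # Phase 1: every processor reads before any processor writes.
    reads = [read_addr(i) for i in range(num_procs)]
    local = [None if a is None else shared[a] for a in reads]
    # Phase 2: local computation in each processor's private registers.
    results = [compute(i, local[i]) for i in range(num_procs)]
    # Phase 3: all writes; this sketch assumes exclusive writes.
    writes = [write_addr(i) for i in range(num_procs)]
    targets = [a for a in writes if a is not None]
    if len(targets) != len(set(targets)):
        raise ValueError("write conflict: two processors target one cell")
    for i, a in enumerate(writes):
        if a is not None:
            shared[a] = results[i]
    return shared


# Cyclic left shift in a single cycle: Pi reads M[i+1] and writes it to M[i].
memory = [10, 20, 30, 40]
n = len(memory)
pram_cycle(memory, n,
           read_addr=lambda i: (i + 1) % n,
           compute=lambda i, value: value,
           write_addr=lambda i: i)
print(memory)
```

The shift works in place only because all reads of a cycle complete before any write, which is exactly what synchronous execution guarantees.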
PRAM is an attractive and important model for designers of parallel algorithms. Why?
1. It is natural: the number of operations executed per one cycle on p processors is at most p.
2. It is strong: any processor can read or write any shared memory cell in unit time.
3. It is simple: it abstracts from any communication or synchronization overhead, which makes the
complexity and correctness analysis of PRAM algorithms easier.
4. It can therefore be used as a benchmark: if a problem has no feasible/efficient solution on
PRAM, it has no feasible/efficient solution on any parallel machine.
5. It is useful: it is an idealization of existing (and nowadays more and more abundant) shared
memory parallel machines.
Simulating One PRAM Model on Another
PREFIX SUM
The sequential for loop executes ⌈log n⌉ times, and each iteration takes constant time on the PRAM.
Hence, the overall execution time is O(log n).
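The algorithm itself is not reproduced in these notes, so the following is a sketch of one common scheme (recursive doubling) that has the stated loop count; it is an assumption that this matches the intended algorithm. The inner loop stands in for processors that would all run in the same PRAM cycle, and the name pram_prefix_sum is hypothetical.

```python
def pram_prefix_sum(values):
    """Inclusive prefix sums by recursive doubling; returns (sums, steps)."""
    a = list(values)
    n = len(a)
    steps = 0
    stride = 1
    while stride < n:               # sequential loop: ceil(log2 n) iterations
        previous = a[:]             # all processors read before any writes
        for i in range(stride, n):  # conceptually one processor per i, in parallel
            a[i] = previous[i] + previous[i - stride]
        stride *= 2
        steps += 1
    return a, steps


sums, steps = pram_prefix_sum([3, 1, 4, 1, 5, 9, 2, 6])
print(sums, steps)
```

For the 8 inputs above the loop runs 3 times, and each pass would be a single parallel step given one processor per element.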
List Ranking Algorithm
Merging Two Sorted Lists
Cost Optimal Parallel Algorithms