Introduction to Parallel
Processing
Mr.Venkatesh Shankar
B.E., M.Tech., (PhD)
• Parallel computing is a form of computation in which many calculations are carried
out simultaneously, operating on the principle that large problems can often be
divided into smaller ones, which are then solved concurrently ("in parallel").
• There are several different forms of parallel computing: bit-level, instruction level,
data, and task parallelism.
• Parallel computer structures will be characterized as pipelined computers, array
processors, and multiprocessor systems. Several new computing concepts,
including data flow and VLSI approaches,
EVOLUTION OF COMPUTER SYSTEMS
• Over the past four decades the computer industry has experienced four
generations of development, physically marked by the rapid changing of
building blocks from
• relays and vacuum tubes (1940-1950s) to discrete diodes and transistors
(1950-19605), to small- and medium-scale integrated (SSI/MSI) circuits (196O-
1970s),
• To large- and very-large-scale integrated (LSl/VLSI) devices (1970s and
beyond). Increases in device speed and reliability and reductions in hardware
cost and physical size have greatly enhanced computer performance.
• Ever since the stored-program concept of von Neumann, the computer
has been recognized as more than just a hardware organization problem.
• A modern computer system is really a composite of such items as
processors, memories, functional units, interconnection networks,
compilers, operating systems. peripheral devices, communication
channels, and database banks.
• To design a powerful and cost-effective computer system and to devise
efficient programs to solve a computational problem, one must
understand the underlying hardware and software system structures and
the computing algorithms to be implemented on the machine with some
user-oriented programming languages.
• These disciplines constitute the technical scope of computer architecture.
Computer architecture is really a system concept integrating hardware,
software, algorithms, and languages to perform large computations. A
good computer architect should master all these disciplines.
Generations of Computer Systems
• The division of computer systems into generations is determined by the
device technology, system architecture, processing mode, and languages
used.
• The first generation (1938-1953) The introduction of the first electronic
analog computer in 1938 and the first electronic digital computer, ENIAC
(Electronic Numerical Integrator and Computer), in 1946 marked the
beginning of the first generation of computers.
• The second generation (1952-1963) Transistors were invented in 1948.
The first transistorized digital computer, TRADIC, was built by Bell
Laboratories in 1954. Discrete transistors and diodes were the building
blocks: 800 transistors 'were used in TRADIC.
• The third generation (1962-1975) This generation was marked by the use
of small-scale integrated (SSI) and medium-scale integrated (MSI) circuits
as the basic building blocks. Multilayered printed circuits were used High-
level languages were greatly enhanced with intelligent compilers during
this period.
• The fourth generation (1972-present) The present generation computers
emphasize the use of large-scale integrated (LSI) circuits for both logic and
memory sections. High-density packaging has appeared.
• A high degree of pipelining and multiprocessing is greatly emphasized in
commercial supercomputers. A massively parallel processor (MPP) was
custom-designed in 1982.
• The future Computers to be used in the I990s may be the next generation.
Very large- scale integrated (VLSI) chips will be used along with high-
density modular design.
Trends Towards Parallel Processing
• According to Sidney Fernbach:
Today's large computers (mainframes) would have been considered
‘supercomputers’ 10 to 20 years ago. By the same token, today's
supercomputers will be considered 'state-of the-art' standard equipment 10 to
20 years from now.
From an application point of view, the mainstream usage of computers is
experiencing a trend of four ascending levels of sophistication
• Data processing
• Information processing
• Knowledge processing
• Intelligence processing
• The relationships between data, information, knowledge, and intelligence are
demonstrated in Figure 1.2. The data space is the largest, including numeric
numbers in various formats, character symbols, and multidimensional measures.
Data objects are considered mutually unrelated in the space.
• An information item is a collection of data objects that are related by some
syntactic structure or relation . Therefore, information items form a subspace of the
data space.
• Knowledge consists of information items plus some semantic meanings. Thus
knowledge items form a subspace of the information space
• Finally, intelligence is derived from a collection of knowledge items. The intelligence
space is represented by the innermost and highest triangle Venn diagram.
• Most of today's computing is still confined within these two processing levels. A high
degree of parallelism has been found at these levels.
• As the accumulated knowledge bases expanded rapidly in recent years, there grew a
strong demand to use computers for knowledge processing.
• For example, the various expert computer systems listed in Table 1.1 are used for
problem solving in the specific areas where they can reach a level of performance
comparable to that of human experts.
Definition
• Definition Parallel processing is an efficient form of information processing which
emphasizes the exploitation of concurrent events in the com put in process.
• Concurrency implies parallelism, simultaneity, and pipelining. Parallel events may
occur in multiple resources during the same time interval simultaneous events may
occur at the same time instant; and pipelined events may occur in overlapped time
spans.
• Parallel processing demands concurrent execution of many programs in the
computer. It is in contrast to sequential processing. It is a cost-effective means to
improve system performance through concurrent activities in the computer.
Parallel processing can be challenged in four programmatic levels:
• Job or program level
• Task or procedure level
• Inter instruction level
• lntra instruction level
• Parallel processing and distributed processing are closely related. In some
cases, use certain distributed techniques to achieve parallelism. As data
communication technology advances progressively, the distinction
between parallel and distributed processing becomes smaller and smaller.
• Pipelining is an implementation technique where multiple instructions are
overlapped in execution. The computer pipeline is divided in stages. Each
stage completes a part of an instruction in parallel. The stages are
connected one to the next to form a pipe - instructions enter at one end,
progress through the stages, and exit at the other end.
PARALLELISM IN UNIPROCESSOR
SYSTEMS
A typical uniprocessor computer consists of three major components, the main
memory, the central processing unit (CPU), and the input-output (I/O) subsystem.
The architectures of two commerciality available uniprocessor computers are given
below to show the possible interconnection of structures among the three
subsystems. We will examine major components in the CPU and in the I/O
subsystem.
• Figure 1.3 shows the architectural components of the super minicomputer(VAX-
instruction set architecture). The CPU contains the master controller of system.
There are sixteen 32-bit general-purpose registers, one of which serves as the
program counter (PC). There is also a special CPU status register containing
information about the current state of the processor and of the program being
executed.
• The CPU contains an arithmetic and logic unit (A.LU) with an optional floating-
point accelerator, and some local cache memory With an optional diagnostic
memory. The CPU can be intervened by the operator through the console
connected to a floppy disk.
• The CPU, the main memory (232 words of 32 bits each), 'and the I/O subsystems
are all connected to a common bus, the synchronous backplane interconnect (SBI).
Through this bus, all I/O devices can communicate with each other With CPU, or
with the memory. Peripheral storage or I/O devices can be connected directly to
the SBI through the unibus and its controller.
Parallel Processing Mechanisms
• A Number of parallel processing mechanisms have been developed in uniprocessor
computers We identify them into six categories
Multiplicity of functional units
Parallelism and pipelining within the CPU
overlapped CPU and I/O operations
Use of a hierarchical memory system
Balancing of subsystem bandwidths
Multiprogramming and time sharing
• Multiplicity of functional units The early computer had only one arithmetic and
logic unit in its CPU. Furthermore, the ALU could only perform one function at a
time, a rather slow process for executing a long sequence of arithmetic logic
instructions. In practice, many of the functions of the ALU can be distributed to
multiple and specialized functional units which can operate in parallel.
• The CDC-6600(designed in 1964) has 10 functional units built into its CPU (Figure
1.5). These 10 units are independent of each other and may operate simultaneously.
A scoreboard is used to keep track of the availability of the functional units and
Registers being demanded. With 10 functional units and -24-registers available, the
instruction issue rate can be significantly increased.
Parallelism and pipelining within the
CPU
• Parallel adders, using such techniques as carry-look ahead and carry-save, are now
built into almost all ALUs. This contrast to the bit-serial adders used in the first-
generation machines.
• High-speed multiplier recoding and convergence division are techniques for
exploring parallelism and the sharing of hardware resources for the functions of
multiply and divide. The use of multiple functional unit is a form of parallelism
with the CPU.
• Various phases of Instruction executions are now pipelined, including instruction
fetch. decode, operand fetch, arithmetic logic execution, and store-result.
Overlapped CPU and I/O operations
• Overlapped CPU and I/O operations can be performed simultaneously with the
CPU computations by using separate I/O controllers. channels or I/O processor
The direct-memory-access (DMA) channel can be used
• provide direct information transfer between the I/O devices and main M/M The
DMA is conducted on a cycle stealing basis. which is apparent to the CPU
Use of Hierarchical memory system
• Use of Hierarchical memory system Usually, the CPU is about 1000 times faster
than memory access. A hierarchical memory system can be used to close up speed
gap. Computer memory hierarchy is conceptually illustrated in Figure
Balancing of Subsystem Bandwidth
• In general, the CPU is the fastest unit in a computer, with a processor cycle tp of
tens of nanoseconds; the main memory has a cycle time tm of hundreds of
nanoseconds; and the I/O devices are the slowest with an average access. time td
Of a few milliseconds. It is thus observed that
td>tm>tp
• The bandwidth of a system is defined as the number of operations performed per
unit time. Let W be the number of words delivered per memory cycle tm Then the
maximum memory bandwidth Bm is equal to
Multiprogramming and Time Sharing
Multiprogramming With in the same time interval there may be multiple process
active in a computer, competing for memory, I/O, and CPU resources, we are
aware of the fact that some computer programs are CPU bound and some are
I/O-bound (input-output intensive).
We can mix the execution of various types of programs in the computer to balance
bandwidth among the various functional units.
Time sharing Multi programming a uniprocessor is centered around the sharing of
the CPU by programs. Some times high priority programs may occupy the CPU
for too long to allow others to share This -problem can be overcome using a time
sharing operating system.
PARALLEL COMPUTER STRUCTURES
• Parallel computers are those systems that emphasize parallel processing. The
architectural features of parallel computers are introduced below. We divided
parallel computers into three architectural configurations:
• A pipeline computer performs overlapped computations to exploit temporaral
parallelism.
• An array processor uses multiple synchronized arithmetic logic units to achieve
spatial parallelism.
• A multiprocessor system achieves asynchronous parallelism through a set of
interactive processors with shared resources (memories database, etc.).
A pipeline computer
• Normally, the process of executing an instruction in a digital computer involves
four major steps: instruction fetch (IF) from the main memory; instruction decoding
(ID), identifying the operation to be performed; operand fetch (OF). If needed in
the execution; and then execution (EX) of the decoded arithmetic logic operation.
In a nonpipelined computer,
• these four steps must be completed before the next instruction can be issued. In a
pipelined computer, successive instructions are executed in an overlapped fashion,
as illustrated in Figure 1.10
A typical pipeline computer is conceptually depicted Figure 1.11architecture is
very similar to several commercial machines like Cray-1 to be described in Chapter
4. Both scalar arithmetic pipelines and vector arithmetic pipelines provided.
The instructions Pre processing unit is it self pipelined with three stages shown.
The OF stage consist of two independent stages one for fetching scalar operands -
and the other for vector operands .
Array Computers
• IAn array processor is a synchronous parallel computer with multiple arithmetic
logic units, called processing elements (PE), that can operate in parallel in a
lockstep fashion. By replication of ALUs, one can achieve the spatial parallelism.
The PEs are synchronized to perform the same function at the same time
• array processor is depicted in Figure 1.12. Scalar and control-type instructions are
directly executed in the control unit (CU). Each PE consists of an ALU with
registers and a local memory. The PEs are interconnected by a data-routing
network. The interconnection pattern to be established for specific computation is
under program control from the CU.
• vector processor or array processor is a central processing unit (CPU) that
implements an containing instructions that operate on one-dimensional arrays of
data called vectors.
• scalar processor. A CPU that performs computations on one number or set of data
at a time. Most computers have scalar CPUs. A scalar processor is known as a
"single instruction stream single data stream" (SISD) CPU. Contrast with
vector processor.
A multiprocessor system
• Research and development of multiprocessor systems are aimed at improving
throughput, reliability, flexibility, and availability. A basic multiprocessor
• organization is conceptually depicted in Figure 1.13. The system contains two or
more processors of approximately comparable capabilities. All processors share
access to common sets of memory modules, I/O channels, and peripheral devices.
• Most importantly, the entire system must be controlled by a single integrated
operating system providing interactions between processors and their programs at
various levels. Besides the shared memories and I/O devices, each processor has its
own local memory and private devices. Inter processor communications can be
done through the shared memories or through an interrupt network.
• Three different interconnections have been practiced in the past:
• Time-shared common bus
• Crossbar switch network
• Multiport memories
ARCHITECTURAL CLASSIFICATION
SCHEMES
• Three computer architectural classification schemes are presented in this section
Flynn's classification (1966) is based on the multiplicity of instruction streams data
streams in a computer system.
• In general, digital computers may be classified into four categories, accord to the
multiplicity of instruction and data streams. This scheme for classify computer
organizations was introduced by Michael J. Flynn.
• Instructions or data -are define with respect to a referenced machine. An
instruction stream is a sequence of instructions as executed by the machine; a data
stream is a sequence of data including input, partial, or temporary results, called
for by the instruction stream.
• Computer organizations are characterized by the multiplicity of the hardware
provided to service the instruction and data streams. Listed below are Flyn four
machine organizations:
• Single instruction stream-single data stream (SISD)
• Single instruction stream-multiple data stream (SIMD)
• Multiple instruction stream-single data stream (MISD)
• Multiple instruction stream-multiple data stream (MIMD)
SISD computer organization
• SISD computer organization This organization, shown in Figure 1.16a, represents
most serial computers available today. Instructions are executed sequentially but
may be overlapped in their execution stages (pipelining). Most SISD uniprocessor
systems are pipelined. An SISD computer may have more than one functional unit
in it.
SIMD computer organization
• SIMD computer organization This class corresponds to array processors,
introduced in Section 1.3.2. As illustrated in Figure I.I6b, there are multiple
processing elements supervised by the same control unit. All PEs receive the same
instruction broadcast from the control unit but operate on different data sets from
distinct data streams. The shared memory subsystem may contain multiple
modules.
MISD computer organization
• MISD computer organization This organization Figure I.I6c. There are n processor
units, each receiving distinct instructions operating over the same data stream and
its derivatives. The results (output) of one processor become the input (operands)
of the next processor in the macro pipe This structure has received much less
attention and has been challenged as impractical by some computer architects.
MIMD
• MIMD computer organization Most multiprocessor systems and multiple computer
systems can be classified in this category (Figure 1.16d). An intrinsic MIMD
computer implies interactions among the n processors because all memory stream
are derived from the same data space shared by all processor.
• An Intrinsic tightly coupled if the degree of interactions among the processors is
high other wise we consider them loosely coupled. Most commercial MIMD
Computers are loosely coupled.
Parallelism Versus Pipelining
• Wolfgang Handfer has proposed a classification scheme for identifying the
parallelism degree and pipelining degree built into the hardware structures of a
computer system. He considers parallel-pipeline processing at three subsystem
levels:
• Processor control unit (PCU)
• Arithmetic logic unit (ALU)
• Bit-level circuit (BLC)
• The functions of PCU and ALU should be clear to us. Each PCU correspond, one
processor or one CPU. The ALU is equivalent to the processing element we
specified for SIMD array processors. The BLC corresponds to the combinational
logic circuitry needed to perform I-bit operations in the ALU.
• A computer system C can be characterized by a triple containing six independent
entities, as defined below:
T(C) = (K x K', D x D', W x W')
• where K = the number of processors (PCUs) within the computer
• D = the number of ALUs (or PEs) under the control of one PCU
• W = the word length of an ALU or of a PE
• W' = the number of pipeline stages in all ALUs or in a PE
• D' = the number of ALUs that can be pipelined
• K' = the number of PCUs that can be pipe lined
Example
• Several real computer examples are used to clarify the above parametric
descriptions. The Texas Instrument's Advanced Scientific Computer (TI-ASC) one
controller controlling four arithmetic pipelines, each has 64-bit Word length length
and eight stages. Thus, we have
T(ASC) = (I x 1,4 x 1,64 x 8) = (1,4,64 x 8)
Parallel Processing Application
• Fast and efficient computers are in high demand in many scientific, engineering,
Energy resources, medical military AI and Basic Research area .
• Computer simulations are far cheaper and faster than physical experiments.
• Computer can solve a much wider range of problems than specific laboratory
equipment can.
• Computational approaches are only limited by computer speed memory capacity,
while physical experiments have many practical constraints.