High Performance Computing
Jimmy Mathew
jimmym@rajagiritech.ac.in
Department of Information Technology
Rajagiri School of Engineering and Technology, Cochin
High Performance Computing
Module 1 – Parallel Processing
References
[1] Kai Hwang and Faye A. Briggs, “Computer Architecture and Parallel Processing”
[2] V. Rajaraman and C. Siva Ram Murthy, “Parallel Computers”
[3] V. Rajaraman, “Elements of Parallel Computing”
[4] V. Rajaraman, “Supercomputers”
Seminar topics
• Applications of supercomputers
• 10 applications
• Contribution of India to supercomputers
• Various architectures
• IBM 360/91, Illiac IV, Univac 1100/80
Seminars:
Seminars will run in parallel with the regular classes. All seminars shall be presented in groups of two or three.
The concept of parallelism
• Evolution of computer science
• Generations of computers
• Need for more computation in lesser time
• Invention of networking
• Independent activities
• Multiple resources
• Parallelism
Introduction – The concept of parallelism
Parallelism was not a sudden idea. The concept grew with advances in technology, and researchers around the globe are still finding more and more parallelism to exploit. In this module, we shall see how this concept evolved and eventually led to the invention of supercomputers.
The computer era began around 1938 with early electromechanical machines; ENIAC, generally regarded as the first general-purpose electronic digital computer, followed in 1946. The technology used in first-generation computers was electromechanical relays and vacuum tubes. Then, by 1952, came the era of second-generation computers, where transistors and diodes were the building blocks; TRADIC from Bell Laboratories is one of them. Third-generation (1962) computers saw the advance of electronics to silicon-based integrated circuits (small-scale and medium-scale integration) and multilayer printed circuit boards; Illiac IV and IBM 360/91 are a few to mention. The fourth generation began around 1972, when large-scale integration was used in processor fabrication together with high-density packaging. The concepts of intellectual property (IP) and virtual components (VC) were used extensively from then onwards. Cray-1 and Univac 1100/80 were early machines of this era. Refer to [1], pp. 2-4, for more details.
The computational capability of individual processors has been increasing ever since, and networks made it possible to interconnect these processors. Applications that are distributed in nature and have high computational requirements became more and more dependent on the architecture and network configuration. High-performance applications are divided into independent tasks that are distributed among several processors, which may themselves be distributed or centralised. We aim to study the architectures of such centralised processor systems.
Trends in parallelism
• Computers quickly become outdated
• Importance of Moore's law
• Processing stages
Gordon Moore is a co-founder of Intel Corporation. His prediction about processor capacity later came to be called Moore's law. In 1965 Gordon Moore published a paper highlighting the exponential growth in transistor densities, and this became known as Moore's law. Today it is usually taken to mean that densities double every two years, though in fact they increase by a factor of 5 every 5 years.
The implications of Moore's law are increased performance, increased complexity, decreased die area, faster technology obsolescence, and so on. In the early days, computers were used only for data processing. Later came information processing, then knowledge processing, and from knowledge processing today's intelligence processing is derived. Computer systems are becoming more and more capable of working with their own intelligence, and such systems have a huge range of applications.
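As a rough arithmetic check (my own, not part of the original statements): doubling every two years over a five-year span gives a growth factor of 2^(5/2) ≈ 5.7, so the two figures quoted above describe essentially the same trend.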
Trends in parallelism
• OS advancement stages
• Levels at which parallelism is exploited
OS advancements:
The advancements in operating system technology have helped a lot to improve the effective use of processor capacity. The milestones from the batch-processing era to today's multi-processing capability are significant in OS history. Even though multi-processing capability is built into the OS, many applications running on such an OS are still not able to exploit the full parallelism provided by multiple processors. One reason for this is the lack of programming languages that support parallel programming.
Program level
• Execute programs in parallel
• Multi-programming
• Multi-processing
Multi-programming:
Concurrently executing multiple programs on one processor. True parallel processing is not possible with a single processor; we may need co-processors, such as a DMA controller, to simulate parallel processing.
Multi-processing:
Parallel execution of multiple programs on multiple processors. The central node (main processor) acts as a distributor of load to each processor.
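As a minimal illustration of multi-processing (a sketch only, assuming a POSIX system with pthreads; the worker function, thread count and work size are illustrative and not taken from the text), the main thread below distributes independent pieces of work to several worker threads, which the OS can schedule on multiple processors, and then collects their results:

/* Multi-processing sketch: several threads, scheduled by the OS onto
 * multiple processors, each handle an independent piece of work.
 * Assumed POSIX system; compile with e.g.  cc demo.c -lpthread        */
#include <pthread.h>
#include <stdio.h>

#define N   4                 /* number of worker threads (illustrative) */
#define LEN 1000000           /* work items per worker (illustrative)    */

static long partial[N];                /* one private result slot per worker */

static void *worker(void *arg)         /* runs in parallel with the others   */
{
    long id = (long)arg, sum = 0;
    for (long i = 0; i < LEN; i++)     /* independent work, nothing shared   */
        sum += i + id;
    partial[id] = sum;
    return NULL;
}

int main(void)
{
    pthread_t t[N];
    for (long i = 0; i < N; i++)       /* main thread distributes the load   */
        pthread_create(&t[i], NULL, worker, (void *)i);

    long total = 0;
    for (long i = 0; i < N; i++) {     /* wait for every worker, then merge  */
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("total = %ld\n", total);
    return 0;
}

Because each worker writes only its own slot in partial, no locking is needed; the main processor's role here is only to distribute the load and combine the results.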
Procedure level
• Parallelism within a processor
• Execute functions or tasks in parallel
• Task scheduling
• Significance of stack?
Procedure level parallelism:
Various task-scheduling algorithms are used to exploit parallelism at the procedure level in a uniprocessor. When one procedure waits for an I/O operation to complete, another procedure executes. This is a very common phenomenon in multi-tasking systems such as RTOSs. Round robin, first come first served, largest job first, smallest job first, etc. are a few algorithms to mention. In many cases, two or more algorithms are used together. The events that drive this scheduling are priority preemption, interrupts, I/O completion, programmed delays, etc. A stack is used to save the context of each procedure.
How will you schedule the following tasks? (A sketch of one possible schedule follows the table.)

Task   Priority       Execution time
T1     1 (highest)    500 ms
T2     2              250 ms
T3     2              100 ms
T4     2              50 ms
T5     3              25 ms
T6     3              25 ms
T7     3              25 ms
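One possible answer, sketched as a small simulation (illustrative C; it assumes strict priority between levels and round-robin with an assumed 50 ms time slice among tasks of equal priority – other policies from the list above are just as valid):

/* Scheduling sketch for tasks T1..T7 from the table above:
 * strict priority between levels, round-robin inside a level.
 * The 50 ms time slice is an assumption, not given in the text.      */
#include <stdio.h>

struct task { const char *name; int prio; int remaining; };

int main(void)
{
    struct task t[] = {
        {"T1", 1, 500}, {"T2", 2, 250}, {"T3", 2, 100}, {"T4", 2, 50},
        {"T5", 3, 25},  {"T6", 3, 25},  {"T7", 3, 25},
    };
    const int n = sizeof t / sizeof t[0];
    const int slice = 50;                    /* assumed time slice (ms)   */
    int clock = 0;

    for (int prio = 1; prio <= 3; prio++) {  /* highest priority first    */
        int pending;
        do {                                 /* one round-robin pass      */
            pending = 0;
            for (int i = 0; i < n; i++) {
                if (t[i].prio != prio || t[i].remaining == 0) continue;
                int run = t[i].remaining < slice ? t[i].remaining : slice;
                clock += run;
                t[i].remaining -= run;
                printf("%4d ms: ran %s for %2d ms, %3d ms left\n",
                       clock, t[i].name, run, t[i].remaining);
                if (t[i].remaining > 0) pending = 1;
            }
        } while (pending);
    }
    return 0;
}

T1 runs to completion first (it is alone at the highest priority), then T2, T3 and T4 share the processor in 50 ms slices, and finally T5, T6 and T7 each finish within a single slice.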
Inter-instruction level
• Parallelism between instructions
• Independent instructions can be executed in any order
• Fill the pipeline with independent instructions
Inter-instruction level parallelism:
If the current instruction depends on the result of a previous instruction, a dependency is said to exist. We cannot reorder the execution of instructions when such a dependency is present. But if a group of instructions has no dependencies among its members, various forms of instruction-level parallelism can be employed.
ADD R0 R1 R2 // R0 = R1 + R2
SUB R3 R2 R1 // R3 = R2 - R1
ST @R4 R2 // Store value in R2 to address in R4
LD R5 @R2 // Load R5 with value at the address in R2
These instructions have no register dependencies, so they can be executed in any order (assuming the addresses in R4 and R2 differ).
More about such dependencies will be covered in the coming modules; a source-level sketch of the same idea follows.
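The sketch below (illustrative C, not from the text) shows the idea at the source level: the statements in the first group have no dependencies among them, so a compiler or a superscalar processor may reorder or overlap them, while the second group forms a read-after-write chain whose order must be preserved.

/* Independent statements vs a RAW-dependent chain (names illustrative). */
#include <stdio.h>

int main(void)
{
    int r1 = 3, r2 = 4, mem = 7;
    int r4[1] = {0};                /* stands in for the address in R4     */

    /* Independent group: no statement reads a value another one writes,
     * so these four may execute in any order, or at the same time.       */
    int r0 = r1 + r2;               /* ADD R0 R1 R2  */
    int r3 = r2 - r1;               /* SUB R3 R2 R1  */
    r4[0] = r2;                     /* ST  @R4 R2    */
    int r5 = mem;                   /* LD  R5 @R2    */

    /* Dependent group (RAW chain): each line needs the previous result,
     * so this order must be kept.                                        */
    int a = r0 + r3;
    int b = a * 2;
    int c = b - r5;

    printf("%d %d %d %d %d\n", r0, r3, r4[0], r5, c);
    return 0;
}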
Intra-instruction level
• Instruction dependency causes delays
• Identify independent instructions
• Reorder instructions
• Keep dependency order unchanged
Intra-instruction level parallelism:
Instruction reordering can help avoid unwanted delays or stalls in a pipeline. Remember, the speed of a pipeline is the speed of its slowest stage. If a single instruction is delayed in the pipeline, all the instructions in the various stages of the pipeline will be delayed.
Parallelism in uniprocessor
• Known architecture
• Processor
• ALU, CU, Registers
• Memory
• Segments, hierarchy
• Input / Output
• Where is cache memory?
Parallelism in uniprocessor
1. Multiple functional units
– Coprocessors
– Multi-port RAM
– Instruction cache (I$)
– Data cache (D$)
• CDC 6600
Role of multiple functional units:
With a single copy of a resource, we can achieve only partial parallelism (by means of scheduling and instruction-level techniques). True parallelism can be achieved only by means of multiple resources. The important resources required to achieve true parallelism are processors or co-processors, multi-port RAM, cache memory and an efficient network.
Co-processors can execute in parallel with the main processor through a process called cycle stealing. An example is a DMA controller. Such a controller is invoked and controlled by the main processor. The DMA controller performs memory load and store operations without interrupting or delaying the main processor, but the initialisation, such as the starting and ending addresses, is issued by the main processor. Without the main processor, the DMA controller cannot do anything. After the data transfer, the DMA controller informs the main processor that its operation is complete.
A multi-port RAM is a RAM with multiple ports, so more than one processor can access it at the same time. Multi-port RAM usually acts as a tightly coupled memory for sharing information among processors.
The cache is where the most recently used, most frequently used and most private data of a processor are stored. The cache is smaller than RAM and is built into the processor, so only the most privileged data are kept there. Cache is considered loosely coupled memory, because its contents are shared little or not at all with other processors.
Parallelism in uniprocessor
2. Pipelining
– No fixed number of stages
– Pipelines with 32 stages exist
– Branch prediction
– Speculation
– Hazards: RAW, WAR, WAW
– Is RAR a hazard?
(Figure courtesy: Prof. Nigel Topham, UoE)
Parallelism in uniprocessor
3. Overlapped CPU and I/O instructions
– CPU bound
– I/O bound
4. Use of memory hierarchy
– Update only the nearest memory
– Periodic updates to lower memory
– What is data contention?
Parallelism in uniprocessor
5. Multi-programming and time sharing
– Concept of concurrency
– Scheduling
– Mix I/O bound and CPU bound instructions
– Time slice – equal opportunities to all programs
– Virtual processors for multiple programs
– Time sharing suitable for interactive multiple terminals
– Performance depends heavily on OS
Parallelism in uniprocessor
6. By balancing bandwidth of subsystems
– Performance varies from device to device
– Bottlenecks
– Bandwidth varies
– Memory has highest BW
– Processor has lesser BW
– Utilised BW is less than actual BW
– Concept of interleaved memory
Bandwidth continues ..
– But, processor is faster than memory
– BW = (output) / (time)
– BW is the total number of output produced in unit time
– What is BW for a processor?
– What is BW for a memory?
– Can we use BW fully?
– BWm > BWp
– Need to balance BW for better performance
Bandwidth:
Bandwidth of a processor is the number of instructions executed by that processor in unit time, say one second.
Bandwidth of a memory is the number of words read from or written to the memory in unit time.
Bandwidth of a network is the number of bytes transferred through it in unit time.
Bandwidth cannot be used in full because of delays and conflicts in the system. The effective bandwidth, in most cases, is less than the actual bandwidth.
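A worked example with assumed numbers (the figures are purely illustrative): take a 4-way interleaved memory that delivers 4 words every 400 ns and a processor that completes one instruction every 200 ns. Then, using BW = (output) / (time), BWm = 4 words / 400 ns = 10^7 words per second and BWp = 1 instruction / 200 ns = 5 x 10^6 instructions per second, so BWm > BWp. The utilised bandwidth will be lower than these peak figures because of access conflicts and other delays.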
Lecture 1 summary
– Concept of parallelism
– History
– Trends towards parallelism
– Technology advancement in OS
– Parallelism in uniprocessor
Parallel computer structures
• Pipeline computers
• Array processors
• Multi-processor systems
• Data flow computers
• Will be discussed in detail in coming modules
Pipeline computers
• Pipeline stages, in general
• Instruction Fetch (IF)
• Instruction Decode (ID)
• Operand Fetch (OF)
• Execution (EX)
• Memory Write (MW)
• Write Back (WB)
• What is the speed of the pipeline?
Pipeline speed:
The speed of a pipeline is the speed of its slowest stage. This means that when a single instruction is delayed, all the instructions in the pipeline suffer the same delay.
Pipeline computers
• Non-pipelined Vs pipelined throughput (figure: non-pipelined BW = 2, pipelined BW = 6)
Comparison:
If the instruction cycle is taken as 5 stages, the unit of time as 10 clock cycles, and no delays are assumed, the non-pipelined processor's throughput works out to 2 instructions while the pipelined throughput is 6. Clearly, pipelined computers are more efficient.
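To see where these numbers come from, assume k = 5 stages, one clock per stage, and a window of T = 10 clock cycles: the non-pipelined processor completes T / k = 10 / 5 = 2 instructions, while the pipeline delivers its first result after k = 5 cycles and one result per cycle thereafter, completing T - k + 1 = 10 - 5 + 1 = 6 instructions.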
Pipeline computers
• Ideal for repeated operations
• Different instructions have different delays
• A stall in the pipeline delays every instruction behind it
• Good for vector processing – the same instruction is executed repeatedly on similar sets of data
• Floating point operations
• Matrix multiplication
• Weather processing
Array processors
• Synchronous parallel computer
• Multiple ALUs, called processing elements (PE)
• Central scalar control unit
• Each processor has associated memory
• Data routing network
• Broadcasts vector instructions to the PEs for distributed execution
• PEs are passive devices – no instruction decoding capability
Array processing
• Associative memory
• Associative processors
• Matrix multiplication
• Merge sort
• Fast Fourier Transform
• Example – Illiac IV
Multi-processor systems
• A system with two or more processors
• Shares common main memory, I/O channels, devices
• Each processor has private memory
• System controlled by single integrated OS (custom built)
• Interconnections
• Time shared common bus
• Crossbar switch network
• Multi-port memories
• Significance of OS capability
Dataflow computers
• Von Neumann machines
• Use program counter
• Sequential execution
• Slow processing
• Known as control flow machines
• New concept - Data flow machines
• To exploit maximum parallelism
• Execute an instruction as soon as all its operands are available, irrespective of program order, while respecting data dependencies
Data flow representation
• Z = ((a+b) – c) * 10;
• Instruction as template: add [DST] [SRC1] [SRC2]
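A minimal sketch of dataflow-style firing for this expression (illustrative C; the operand cells, template encoding and scan loop are assumptions, not from the text). The templates are listed out of program order, and each one fires as soon as both of its operands are available:

/* Dataflow sketch for Z = ((a+b) - c) * 10.
 * A template fires when both source cells hold values, regardless of
 * the order in which the templates are written.                        */
#include <stdio.h>
#include <stdbool.h>

enum op { ADD, SUB, MUL };

struct tmpl { enum op op; int dst, src1, src2; };  /* op [DST] [SRC1] [SRC2] */

int main(void)
{
    /* value cells: 0:a 1:b 2:c 3:constant 10 4:a+b 5:(a+b)-c 6:Z */
    double cell[7]  = {2, 3, 1, 10, 0, 0, 0};
    bool   ready[7] = {true, true, true, true, false, false, false};

    struct tmpl prog[] = {          /* deliberately not in program order */
        {MUL, 6, 5, 3},             /* Z = ((a+b)-c) * 10 */
        {SUB, 5, 4, 2},             /* (a+b) - c          */
        {ADD, 4, 0, 1},             /* a + b              */
    };
    const int n = sizeof prog / sizeof prog[0];
    bool fired[3] = {false, false, false};

    bool progress = true;
    while (progress) {              /* keep scanning for firable templates */
        progress = false;
        for (int i = 0; i < n; i++) {
            struct tmpl *p = &prog[i];
            if (fired[i] || !ready[p->src1] || !ready[p->src2]) continue;
            double x = cell[p->src1], y = cell[p->src2];
            cell[p->dst] = (p->op == ADD) ? x + y
                         : (p->op == SUB) ? x - y : x * y;
            ready[p->dst] = true;   /* the result is now available          */
            fired[i] = true;
            progress = true;
        }
    }
    printf("Z = %g\n", cell[6]);    /* ((2 + 3) - 1) * 10 = 40              */
    return 0;
}

A real dataflow machine does this operand matching in hardware; here the ready flags play the role of operand tokens.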
Lecture 2 summary
Parallel computer structures
– Pipeline computers
– Array processors
– Multi-processor systems
– Data flow computers
Parallel computer performance
• Performance is not linear
– A processor takes 10 seconds to complete a task
– In a multi-processor environment, there are 10 identical processors
– One might assume that the 10 identical processors will take at most 1 second to solve the same task
– That assumption may be wrong!
• 'n' processors cannot give 'n' times faster processing
Parallel computer performance
• Lower bound : log2 n (Minsky's conjecture)
• Upper bound : n / ln n
• Research showed that, for n = 4 to n = 16, the actual speed-up was only 2 to 4 times (this study was performed on the C.mmp and S-1 systems)
• Use a small number of fast processors rather than a large number of slow processors
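As a quick worked check of these bounds: for n = 4 they give log2 4 = 2 and 4 / ln 4 ≈ 2.9, and for n = 16 they give log2 16 = 4 and 16 / ln 16 ≈ 5.8, which is consistent with the 2 to 4 times speed-up observed in the study above.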
Parallel computer performance
• Pipelined uniprocessor systems still dominate; their cost is lower and their OSs are well developed
• Array processors are custom designed and good for specific applications, but programming them is complex
• Multi-processor systems are general purpose and flexible in parallel processing; OS capability is important
Architecture classifications
• Three classifications
• Multiplicity of instruction and data streams (Flynn's scheme)
• SISD, SIMD, MISD and MIMD
• Serial Vs parallel processing (Feng's scheme)
• WSBS, WPBS, WSBP and WPBP
• Parallelism Vs pipelining (Handler's scheme)
• PCU, ALU, BLC
SISD
• Michael J Flynn – American professor emeritus, Stanford University; proposed Flynn's taxonomy
• Instruction stream – sequence of instructions
• Data stream – input, output, temporary results
• Multiplicity of hardware
• SISD – overlapped execution or pipelining
SIMD
• Single instruction stream, multiple data stream
• Instruction broadcast
• Each PE operates on different data sets from the same stream
• Shared memory subsystem
MISD
• Multiple instruction stream, single data stream
• Output of one PE is input to next PE
• No real implementation, yet
MIMD
• Multiple instruction stream, multiple data stream
• Common shared memory
• Multi-processor systems are listed here
Lecture 3 summary
Parallel computer performance analysis
Classifications of computer architecture
Flynn’s taxonomy
– SISD
– SIMD
– MISD
– MIMD
Serial Vs Parallel processing
• Tse-yun Feng suggested four types of processing methods
• Maximum parallelism degree, P(C) = (n, m)
– n is the word length
– m is the bit-slice length (number of bits processed together across all pipeline stages)
• Example: the TI ASC has a 64-bit word length and 4 arithmetic pipelines, each with 8 stages
• P(C) = (64, 4x8) = (64, 32)
Serial Vs Parallel processing
1. WSBS – Word serial and bit serial
Bit serial processing (word length n = 1, bit slice m = 1)
2. WPBS – Word parallel and bit serial
Bit slice processing (n = 1, m > 1); m bit slice together
3. WSBP – Word serial and bit parallel
Word slice processing (n > 1, m = 1); n bit word
together
4. WPBP – Word parallel and bit parallel
Fully parallel processing (n > 1, m > 1)
Parallelism Vs Pipelining
• Proposed by Wolfgang Handler
• Combines the degree of parallelism and the degree of pipelining
1. Processor control unit (PCU) – single processor
2. Arithmetic logic unit (ALU) – processing element
3. Bit-level circuit (BLC) – combinational logic circuit
• Triplet, T(C) = <K x K', D x D', W x W'>
• K – No. of PCUs; K' – No. of PCUs that can be pipelined
• D – No. of ALUs; D' – No. of ALUs that can be pipelined
• W – Word length; W' – No. of pipeline stages in an ALU
Parallelism Vs Pipelining
• Example: Texas Instruments' Advanced Scientific Computer (TI ASC)
• Has one controller
• Four arithmetic pipelines
• 64-bit word length
• 8 stages per pipeline
• T(ASC) = < 1x1, 4x1, 64x8 > = < 1, 4, 64x8 >
Lecture 4 summary
Classifications of computer architecture
Serial Vs Parallel processing
– WSBS
– WPBS
– WSBP
– WPBP
Parallelism Vs Pipelining
Module summary
• The concept
• Trends
• Parallelism in uniprocessor
• Parallel computer structures
• Performance
• Architectural classifications
• Applications
• Contribution of India