HIGH PERFORMANCE COMPUTING Module I (15 hours) Introduction to parallel processing - Trends towards parallel processing Parallelism in uniprocessor
Parallel computer structures-Architecture classification schemes ,Amdahls law,Indian contribution to parallel processing
Introduction to Parallel Processing Basic concepts of parallel processing on high-performance computers are introduced in this unit. Parallel computer structures will be characterized as Pipelined computers, array processors and multiprocessor systems. Evolution of Computer Systems Over the past four decades the computer industry has experienced four generations of development. The first generation used Vacuum Tubes (1940 1950s) to discrete diodes to transistors (1950 1960s), to small and medium scale integrated circuits (1960 1970s) and to very large scale integrated devices (1970s and beyond). Increases in device speed and reliability and reduction in hardware cost and physical size have greatly enhanced computer performance. The relationships between data, information, knowledge and intelligence are demonstrated. Parallel processing demands concurrent execution of many programs in a computer. The highest level of parallel processing is conducted among multiple jobs through multiprogramming, time sharing and multiprocessing. Over the past four decades the computer industry has experienced four generations of development. Generations of Computer Systems First Generation (1938-1953) - Vacuum Tube
1938 - John W. Mauchly and J. Presper Eckert built the first digital electronic computer, ENIAC (Electronic Numerical Integrator & Computer) Electro mechanical relays were used as switching devices in the 1940s, and vacuumtubes were used in 1950s. CPU structure to be bit srial: arithmetic is done on a bit by bit fixed-point basis In 1950, the first stored program computer, EDVAC (Electronic Discrete Variable Automatic Computer), was developed. In 1952, IBM had announced its 701 electronic calculator Second Generation Computers (1952 -1963) Transistor Transistors were invented in 1948. TRADIC (Transistorized digital Computer), was built by Bell Laboratories in 1954. o Discrete transistors and diodes are used as building blocks o Printed circuits appeared o Assembly Languages, Fortan used in 1956 and Algol used in 1960 The first IBM scientific, transistorized computer, IBM 1620, became available in 1960. Third Generation Computers (1962 -1975) IC SSI & MSI circuits are the basic building blocks Multilayered Printed circuits were used
1968 - DEC introduced the first "mini-computer", the PDP-8, named after the miniskirt. 1969 - Development began on ARPAnet 1971 - Intel produced large scale integrated (LSI) circuits that were used in the digital
delay line, the first digital audio device.
Fourth Generation (1971-1991) microprocessor 1971 - Gilbert Hyatt at Micro Computer Co. patented the microprocessor 1972 - Intel made the 8-bit 8008 and 8080 microprocessors 1974 - Xerox developed the Alto workstation at PARC, with a monitor, a graphical user interface, a mouse, and an Ethernet card for networking 1984 - Apple Computer introduced the Macintosh personal computer January 24. Fifth Generation (1991 and Beyond) 1991 - World-Wide Web (WWW) was developed by Tim Berners-Lee and released by CERN. 1993 - The first Web browser called Mosaic was created by student Marc Andreesen and programmer Eric Bina at NCSA in the first 3 months of 1993. 1994 - Netscape Navigator 1.0 was released Dec. 1994 1996 - Microsoft failed to recognize the importance of the Web, but finally released the much improved browser Explorer 3.0 in the summer Trends towards Parallel Processing From an application point of view, the mainstream of usage of computer is experiencing a trend of four ascending levels of sophistication: Data processing Information processing Knowledge processing
Intelligence processing
Computer usage started with data processing, while is still a major task of todays computers. With more and more data structures developed, many users are shifting to computer roles from pure data processing to information processing. A high degree of parallelism has been found at these levels. As the accumulated knowledge bases expanded rapidly in recent years, there grew a strong demand to use computers for knowledge processing. Intelligence is very difficult to create; its processing even more so.
From an operating point of view, computer systems have improved chronologically in four phases: batch processing multiprogramming time sharing multiprocessing In these four operating modes, the degree of parallelism increase sharply from phase to phase. We define parallel processing as Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process. Concurrency implies parallelism, simultaneity, and pipelining. Parallel processing demands concurrent execution of many programs in the computer. The highest level of parallel processing is conducted among multiple jobs or programs through multiprogramming, time sharing, and multiprocessing. Parallel processing can be challenged in four programmatic levels: Job or program level Task or procedure level Inter instruction level Intra instruction level The highest job level is often conducted algorithmically. The lowest intrainstruction level is often implemented directly by hardware means. Hardware roles increase from high to low levels. On the other hand, software implementations increase from low to high levels. PARALLELISM IN UNIPROCESSOR SYSTEMS
a) Basic Uniprocessor Architecture The typical Uniprocessor system consists of three major components: 1. Main memory, 2. Central processing unit (CPU) 3. Input-output (I/O) sub-system. The architectures of two available Uniprocessor computers are described below: 1. Fig below shows the architectural components of the super minicomputer VAX-11/780.
The CPU contains the Master Controller of the VAX system. There are sixteen 32-bit general purpose registers, one which serve as Program Counter (PC).There is also a special CPU status register containing information about the current state of the processor and of the program being executed. The CPU
contains an arithmetic and logic unit (ALU) with an optional floating-point accelerator, and some local cache memory with an optional diagnostic memory. The CPU, the main memory and the I/O subsystems are all connected to a common bus, the Synchronous Backplane Interconnect (SBI) through this bus, all I/O device scan communicate with each other, with the CPU, or with the memory. Peripheral storage or I/O devices can be connected directly to the SBI through the unibus and its controller or through a mass bus and its Controller. 2. Fig below shows the architectural components of the main frame computer IBM System 370/model 168 Uniprocessor.
The CPU contains the instruction decoding and execution units as well as a cache. Main memoryis divided into four units, referred to as logical storage units that are four-way interleaved. The storage controller provides multiport
connections between the CPU and the four LSUs.Peripherals are connected to the system via high speed I/O channels which operate asynchronously with the CPU. b) Parallel Processing Mechanism A number of parallel processing mechanisms have been developed in uniprocessor computers. We identify them in the following six categories: Multiplicity of functional units Parallelism and pipelining within the CPU Overlapped CPU and I/O operations Use of a hierarchical memory system Multiprogramming and time sharing Multiplicity of functional units Multiplicity of Functional Units Use of multiple processing elements under one controller Many of the ALU functions can be distributed to multiple specialized units These multiple Functional Units are independent of each other Example: IBM 360/91 2 parallel execution units Fixed point arithmetic Floating point arithmetic(2 Functional units) Floating point add-sub Floating point multiply-div The early computer has only one ALU in its CPU and hence performing a long sequence of ALU instructions takes more amount of time. The CDC-6600 has 10 functional units built into itsCPU.These 10 units are independent of each other and
may operate simultaneously. A score board is used to keep track of the availability of the functional units and registers being demanded. With 10 functional units and 24 registers available, the instruction issue rate can be significantly increased. System Architecture of the CDC-6600 computer Another good example of a multifunction uniprocessor is the IBM 360/91 which has 2 parallel execution units. One for fixed point arithmetic and the other for floating point arithmetic. Within the floating point E-unit are two functional units: one for floating point add- subtract and other for floating point multiply divide. IBM 360/91 is a highly pipelined, multifunction scientific uniprocessor Parallelism and Pipelining Within the CPU Parallelism & pipelining within theCPU Parallelism is provided by building parallel adders in almost all ALUs Pipelining Each task is divided into subtasks which canbe executed in parallel Overlapped CPU and I/O Operations I/O operations can be performed simultaneously with the CPUcomputations by using separate I/O controllers I/O channels I/O processors Use of Hierarchical Memory System Speed of CPU = 1000 times speed of Main memory hierarchical memory structure is used to close up the speed gap Cache memory Virtual memory Parallel memories for array processors The CPU is 1000 times faster than memory access. A hierarchical memory system can be used to close up the speed gap. The hierarchical order listed is registers Cache Main Memory
Magnetic Disk Magnetic Tape The inner most level is the register files directly addressable by ALU.Cache memory can be used to serve as a buffer between the CPU and the main memory. Virtual memory space can be established with the use of disks and tapes at the outer levels. Balancing Of Subsystem Bandwidth Balancing bandwidth between mainmemory and CPU Balancing bandwidth between mainmemory and I/O CPU is the fastest unit in computer. The bandwidth of a system is defined as the number of operations performed per unit time. In case of main memory the memory bandwidth is measured by the number of words that can be accessed per unit time. Bandwidth Balancing Between CPU and Memory The speed gap between the CPU and the main memory can be closed up by using fast cache memory between them. A block of memory words is moved from the main memory into the cache so that immediate instructions can be available most of the time from the cache. Bandwidth Balancing Between Memory and I/O Devices Input-output channels with different speeds can be used between the slow I/O devices and the main memory. The I/O channels perform buffering and multiplexing functions to transfer the data from multiple disks into the main memory by stealing cycles from the CPU. Multiprogramming &Time-sharing Multiprogramming Mix the execution of various types of programs (I/o bound, CPU bound)
The interleaving of CPU and I/O operationsacross several programs Time-sharing OS is used to avoid high-priority programs occupying the CPU for long Fixed or variable time-slices are used Creates a concept of virtual processors PARALLEL COMPUTER STRUCTURES Parallel computers are those systems that emphasize parallel processing. We divide parallel computers into three architectural configurations: Pipeline computers Array processors Multiprocessors In a pipelined computer successive instructions are executed in an overlapped fashion. In a non pipelined computer these four steps must be completed before the next instructions can be issued. An array processor is a synchronous parallel computer with multiple arithmetic logic units called processing elements (PE) that can operate in parallel in lock step fashion.By replication one can achieve spatial parallelism. The PEs is synchronized to perform the same function at the same time. A basic multiprocessor contains two or more processors of comparable capabilities. All processors share access to common sets of memory modules, I/O channels and peripheral devices.
Pipeline Computers The process of executing an instruction in a digital computer involves 4 major steps
Instruction fetch Instruction decoding Operand fetch Execution In a pipelined computer successive instructions are executed in an overlapped fashion. In a non pipelined computer these four steps must be completed before the next instructions can be issued. Instruction fetch : Instruction is fetched from the main memory Instruction decoding: Identifying the operation to be performed. Operand Fetch: If any operands is needed is fetched. Execution : Execution of the Arithmetic and logical operation An instruction cycle consists of multiple pipeline cycles. The flow of data (input operands, intermediate results and output results) from stage to stage is triggered by a common clock of the pipeline. The operations of all stages are triggered by a common clock of the pipeline. For non pipelined computer, it takes four pipeline cycles to complete one instruction. Once a pipeline is filled up, an output result is produced from the pipeline on each cycle. The instruction cycle has been effectively reduced to 1/4th of the original cycle time by such overlapped execution.
Array Processors An array processor is a synchronous parallel computer with multiple arithmetic logic units called processing elements (PE) that can operate in parallel in lock step. The PEs is synchronized to perform the same function at the same time. Scalar and control type of instructions are directly executed in the control unit (CU). Each PE consists of an ALU registers and a local memory. The PEs is interconnected by a data routing network. Vector instructions are broadcasted to the PEs for distributed execution over different component operands fetched directly from local memories. Array processors designed with associative memories are called as associative processors. Multiprocessor Systems
A basic multiprocessor contains two or more processors of comparable capabilities. All processors share access to common sets of memory modules, I/O channels and peripheral devices. The entire system must be controlled by a single integrated operating system providing interactions between processors and their programs at various levels. Multiprocessor hardware system organization is determined by the interconnection structure to be used between the memories and processors. Three different interconnection are Time shared Common bus Cross Bar switch network Multiport memories ARCHITECTURAL CLASSIFICATION SCHEMES Introduction The Flynns classification scheme is based on the multiplicity of instruction streams and data streams in a computer system. Flynns scheme is based on serial versus parallel processing. Handlers classification is determined by the degree of parallelism and pipelining in various subsystem levels. Architectural Classification Schemes 1. Flynns Classification The most popular taxonomy of computer architecture was defined by Flynn in 1966.Flynn's classification scheme is based on the notion of a stream of information. Two types of information flow into a processor: instructions and data. The instruction stream is defined as the sequence of instructions performed by the processing unit. The data stream is defined as the data traffic exchanged between the memory and the processing unit. According to Flynn's classification, either of the instruction or data streams can be single or multiple. Computer architecture can be classified into the following four distinct categories:
single-instruction single-data streams (SISD); single-instruction multiple-data streams (SIMD); multiple-instruction single-data streams (MISD); and Multiple-instruction multiple-data streams (MIMD). General Notes: Conventional single-processor von Neumann computers are classified as SISD systems. Parallel computers are either SIMD or MIMD. When there is only one control unit and all processors execute the same instruction in a synchronized fashion, the parallel machine is classified as SIMD. In a MIMD machine, each processor has its own control unit and can execute different instructions on different data. In the MISD category, the same stream of data flows through a linear array of processors executing different instruction streams. In practice, there is no viable MISD machine; however, some authors have considered pipelined machines (and perhaps systolic-array computers) as examples for MISD. An extension of Flynn's taxonomy was introduced by D. J. Kuck in 1978. In his classification, Kuck extended the instruction stream further to single (scalar and array) and multiple (scalar and array) streams. The data stream in Kuck's classification is called the execution stream and is also extended to include single (scalar and array) and multiple (scalar and array) streams. The combination of these streams results in a total of 16 categories of architectures. SISD Architecture A serial (non-parallel) computer Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle Single data: only one data stream is being used as input during any one clock cycle Deterministic execution
This is the oldest and until recently, the most prevalent form of computer Examples: most PCs, single CPU workstations and mainframes
SIMD Architecture A type of parallel computer Single instruction: All processing units execute the same instruction at any given clock cycle Multiple data: Each processing unit can operate on a different data element This type of machine typically has an instruction dispatcher, a very highbandwidth internal network, and a very large array of very small-capacity instruction units. Best suited for specialized problems characterized by a high degree of regularity, such as image processing. Synchronous (lockstep) and deterministic execution
Two varieties: Processor Arrays and Vector Pipelines Examples: _ Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2 _ Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi
CU-control unit PU-processor unit MM-memory module SM-Shared memory IS-instruction stream DS-data stream MISD Architecture There are n processor units, each receiving distinct instructions operating over the same data streams and its derivatives. The output of one processor becomes input of the other in the macro pipe. No real embodiment of this class exists. A single data stream is fed into multiple processing units. Each processing unit operates on the data independently via independent instruction streams.
Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971). Some conceivable uses might be: multiple frequency filters operating on a single signal stream Multiple cryptography algorithms attempting to crack a single coded message.
MIMD Architecture Multiple-instruction multiple-data streams (MIMD) parallel architectures are made of multiple processors and multiple memory modules connected together via some interconnection network. They fall into two broad categories: shared memory or message passing. Processors exchange information through their central shared memory in shared memory systems, and exchange information through their interconnection network in message passing systems. Currently, the most common type of parallel computer. Most modern computers fall into this category. Multiple Instruction: every processor may be executing a different instruction stream
Multiple Data: every processor may be working with a different data stream Execution can be synchronous or asynchronous, deterministic or nondeterministic Examples: most current supercomputers, networked parallel computer "grids" and multiprocessor SMP computers - including some types of PCs. A shared memory system typically accomplishes interprocessor coordination through a global memory shared by all processors. These are typically server systems that communicate through a bus and cache memory controller. A message passing system (also referred to as distributed memory) typically combines the local memory and processor at each node of the interconnection network. There is no global memory, so it is necessary to move data from one local memory to another by means of message passing.
Fengs Classification Tse-yun Feng suggested the use of degree of parallelism to classify various computer architectures. Serial versus Parallel Processing The maximum number of binary digits that can be processed within a unit time by a computer system is called the maximum parallelism degree P.A bit slice is a string of bits one from each of the words at the same vertical position. There are 4 types of methods under above classification Word Serial and Bit Serial (WSBS) Word Parallel and Bit Serial (WPBS) Word Serial and Bit Parallel(WSBP) Word Parallel and Bit Parallel (WPBP) WSBS has been called bit parallel processing because one bit is processed at a time. WPBS has been called bit slice processing because m-bit slice is processes at a time. WSBP is found in most existing computers and has been called as Word Slice processing because one word of n bit processed at a time. WPBP is known as fully parallel processing in which an array on n x m bits is processes at one time. Handlers Classification Parallelism versus Pipelining Wolfgang Handler has proposed a classification scheme for identifying the parallelism degree and pipelining degree built into the hardware structure of a computer system. He considers at three subsystem levels: Processor Control Unit (PCU) Arithmetic Logic Unit (ALU)
Bit Level Circuit (BLC) Each PCU corresponds to one processor or one CPU. The ALU is equivalent to Processor Element (PE). The BLC corresponds to combinational logic circuitry needed to perform 1 bit operations in the ALU. A computer system C can be characterized by a triple containing six independent entities T(C) = <K x K', D x D', W x W' > Where K = the number of processors (PCUs) within the computer D = the number of ALUs under the control of one CPU W = the word length of an ALU or of an PE W' = the number of pipeline stages in all ALUs or in a PE D' = the number of ALUs that can be pipelined K' = the number of PCUs that can be pipelined AMDAHLS LAW The performance gain that can be obtained by improving some portion of a computer can be calculated using Amdahls Law. Amdahls Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. Amdahls Law defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that we can make an enhancement to a computer that will improve performance when it is used. Speedup is the ratio Speedup = Performance for entire task using the enhancement when possible Performance for entire task without using the enhancement Alternatively, Speedup =Execution time for entire task without using the enhancement Execution time for entire task using the enhancement when possible Speedup tells us how much faster a task will run using the computer with the
enhancement as opposed to the original computer. Amdahls Law gives us a quick way to find the speedup from some enhancement, which depends on two factors: 1. The fraction of the computation time in the original computer that can be converted to take advantage of the enhancementFor example, if 20 seconds of the execution time of a program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This value, which we will call Fraction enhanced, is always less than or equal to 1. 2. The improvement gained by the enhanced execution mode; that is, how much faster the task would run if the enhanced mode were used for the entire program This value is the time of the original mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this value, which is always greater than 1, Speedupenhanced. The execution time using the original computer with the enhanced mode will be the time spent using the unenhanced portion of the computer plus the time spent using the enhancement: Execution timenew = Execution timeold (1-fraction enhanced) + farction enhanced Speedup enhanced The overall speedup is the ratio of the execution times: Speedupoverall = Execution timeold = Execution timenew 1 ( 1-fraction enhanced) + farction enhanced Speedup enhanced
Amdahls Law expresses the law of diminishing returns: The incremental improvement in speedup gained by an improvement of just a portion of the computation diminishes as improvements are added. An important corollary of Amdahls Law is that if an enhancement is only usable for a fraction of a task, we cant speed up the task by more than the reciprocal of 1 minus that fraction.
Amdahls Law can serve as a guide to how much an enhancement will improve performance and how to distribute resources to improve costperformance. The goal, clearly, is to spend resources proportional to where time is spent. Amdahls Law is particularly useful for comparing the overall system performance of two alternatives, but it can also be applied to compare two processor design alternatives. INDIAN CONTRIBUTION TO PARALLEL PROCESSING India launched a major initiative in parallel computing in 1988. There arefive or six independent projects to construct parallel processing systems. This wasmotivated by the need for advanced computing, a vision of developing its owntechnology, and difficulties (political and economic) obtaining commercial products. The creation of th e Center for Development of Advanced Computing (C-DAC) andconcurrently other efforts at National Aerospace Laboratory (NAL), Bangalore,Advanced Numerical Research & Analysis Group (ANURAG), Hyderabad, BhabhaAtomic Research Center (BARC), Bombay, Center for Development of Telematics(C-DOT), Bangalore, marked the beginning of high performance computing inIndia. Today, India has designed its own high performance computers in the form of PARAM by C-DAC ANUPAM by BARC PACE by ANURAG FLOSOLVER by NAL CHIPPS by C-DOT MTPPS by BARC