Computer Architecture Unit 1
Unit 1 Fundamentals of Computer Architecture
Structure:
1.1 Introduction
Objectives
1.2 Computational Model
The basic items of computations
The problem description model
The execution model
1.3 Evolution of Computer Architecture
1.4 Process and Thread
Concept of process
Concept of thread
1.5 Concepts of Concurrent and Parallel Execution
1.6 Classification of Parallel Processing
Single instruction single data (SISD)
Single instruction multiple data (SIMD)
Multiple instruction single data (MISD)
Multiple instruction multiple data (MIMD)
1.7 Parallelism and Types of Parallelism
1.8 Levels of Parallelism
1.9 Summary
1.10 Glossary
1.11 Terminal Questions
1.12 Answers
1.1 Introduction
As you all know computers vary greatly in terms of physical size, speed of
operation, storage capacity, application, cost, ease of maintenance and
various other parameters. The hardware of a computer consists of physical
parts that are connected in some way so that the overall structure achieves
the pre-assigned functions. Each hardware unit can be viewed at different
levels of abstraction. You will find that simplification can go on to still deeper
levels. You will be surprised to know that many technologies exist for
manufacturing microchips.
The complexity of integration is likely to go on increasing with time. As a
Manipal University Jaipur B1648 Page No. 1
consequence, smaller and more powerful computers will go on appearing.
Evidently, which components are used and how they are interconnected
dictate what the resulting computer will be good at doing. Thus, in a faster
computer, you will find special components connected in a special way that
enhances the speed of operation of the designed computer.
Different computer designs can have different components. Moreover, the
same components can be interconnected in a variety of ways. Each design will
provide a different performance to the users. Exactly which components,
interconnected in which ways, will produce what performance is the subject of
Computer Architecture. In this unit, we will study the basics of Computer
Architecture.
Objectives:
After studying this unit, you should be able to:
• explain computational model and its types
• state the different levels of evolution of computer architecture
• differentiate between process and thread
• describe the concepts of concurrent and parallel execution
• identify the various classification of parallel processing
• list the types of parallelism
• list the levels of parallelism
1.2 Computational Model
Computer architecture may be defined as “The Structure and behavior of a
Conceptual model of a Computer System to perform the required
functionalities”.
Computer Architecture deals with the issue of selection of hardware
components and interconnecting them to create computers that achieve
specified functional, performance and cost goals.
Progressing in the earlier mentioned way, the hardware (at least the electronic
part) breaks down to the following simple digital components.
• Registers
• Counters
• Adders
• Multiplexers
• De-multiplexers
• Coders
• Decoders
• I/O Controllers
A common foundation or paradigm that links the computer architecture and
language groups is called a Computational Model. The concept or idea of
computational model expresses a higher level of abstraction than can be
achieved by either the computer architecture or the programming language
alone, and includes both.
The computational model consists of the subsequent three abstractions:
1. The basic items of computations
2. The problem description model
3. The execution model
Contrary to what one might first expect, the set of abstractions that must be
selected to specify computational models is far from obvious. A small set of
criteria will define fewer but relatively basic computational models, while a
wide variety of criteria will result in a fairly large number of different models.
1.2.1 The basic items of computations
The first abstraction identifies the basic items of computation. It specifies
the items to which the computation refers and the kind of computations
(operations) that can be performed on them. For example, in the von Neumann
computational model, the fundamental items of computation are data.
These data are normally represented by named entities so that several
different data items can be distinguished in the course of a computation.
These named entities are commonly called variables in programming languages
and are implemented by memory or register addresses in architectures.
The well-known computational models, such as the Turing model, the von
Neumann model and the dataflow model, are based on the concept of data. These
models are briefly explained below:
The Turing machine architecture operates by manipulating symbols on a
tape. In other words, a tape with an unbounded number of slots exists, and at
any one point in time, the Turing machine is at a specific slot. The machine can
change the symbol and shift to a different slot based on the symbol read at that
slot. All of this is deterministic.
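The tape-and-slot behaviour of a Turing machine can be sketched in a few lines of Python. This is only a minimal illustration, not a formal definition: the function name, the rule table format, the state names and the bit-inverting example are all assumptions made for this sketch.

```python
def run_turing_machine(tape, rules, state="start", halt_state="halt",
                       blank="_", max_steps=1000):
    """Run a one-tape machine and return the final tape contents as a string."""
    cells = dict(enumerate(tape))   # sparse tape: slot index -> symbol
    head = 0                        # the slot the machine is currently at
    for _ in range(max_steps):
        if state == halt_state:
            break
        symbol = cells.get(head, blank)
        # Each rule maps (state, symbol read) to
        # (symbol to write, direction to move, next state).
        write, move, state = rules[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    if not cells:
        return ""
    lo, hi = min(cells), max(cells)
    return "".join(cells.get(i, blank) for i in range(lo, hi + 1)).strip(blank)


# Hypothetical rule table: invert every bit, halt at the first blank slot.
INVERT_RULES = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}

result = run_turing_machine("1011", INVERT_RULES)
print(result)  # 0100
```

The same input and rule table always produce the same result, because every step is fully determined by the current state and the symbol under the head.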
The von Neumann architecture describes the stored-program computer,
where data and instructions are stored in memory and the machine works by
changing its internal state. In other words, an instruction operates on some data
and changes the data. So naturally, there is a state maintained in the system.
Dataflow architecture directly contrasts with the conventional von
Neumann architecture or control flow architecture. Dataflow architectures have
no program counter. Whether and when an instruction executes in a dataflow
system is determined solely by the availability of the input arguments to
that instruction. Even though dataflow architecture has not been
used in any commercially successful computer hardware, it is highly
relevant to many software architectures, such as database engine designs
and parallel computing frameworks.
On the other hand, there are various models that are not based on data. In these
models, the basic items of computation are:
• Objects, or messages sent to them, requiring an associated manipulation (as
in the object-based model)
• Arguments and the functions applied to them (applicative model)
• Elements of sets and the predicates declared on them (predicate-logic-based
model).
1.2.2 The problem description model
The problem description model refers to both the style and the method of
problem description. The problem description style specifies how problems
in a specific computational model are described. The style is either procedural
or declarative. In a procedural style, the algorithm for solving the problem is
stated. A particular solution is then declared in the form of an algorithm. In a
declarative style, all the facts and relationships relevant to the given problem
have to be stated.
There are two modes for conveying these relationships and facts. The first
employs functions, as in the applicative model of computation, while the
second declares the relationships and facts in the form of predicates, as in the
predicate-logic-based computational model. Now, we will study the second
component of the problem description model, that is, the problem description
method. It is interpreted differently for the procedural and the
declarative style. In the procedural style, the problem description method states
the way in which the solution of the given problem has to be described.
In contrast, in the declarative style, it states the way in which
the problem itself has to be described.
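The difference between the two styles can be illustrated with a small Python sketch. The task (summing the squares of the even numbers in a list) and both function names are assumptions chosen for illustration. Python is not a declarative language, so the second function only approximates the applicative flavour of the declarative style.

```python
def sum_even_squares_procedural(numbers):
    """Procedural style: state the algorithm step by step."""
    total = 0
    for n in numbers:
        if n % 2 == 0:          # keep only the even numbers
            total += n * n      # accumulate their squares
    return total


def sum_even_squares_declarative(numbers):
    """Declarative flavour: state what the result is, as functions applied to data."""
    return sum(n * n for n in numbers if n % 2 == 0)


data = [1, 2, 3, 4, 6]
print(sum_even_squares_procedural(data))   # 56
print(sum_even_squares_declarative(data))  # 56
```

The first function describes how the solution is reached, while the second mainly describes the relationship between the input and the result.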
1.2.3 The execution model
This is the third and the final constituent of computational model. It can be
divided into three stages.
• Interpretation of how to perform the computation
• Execution semantics
• Control of the execution sequences
The first stage states the interpretation of the computation, which is strongly
linked to the problem description method. The choice of problem description
method and the interpretation of the computation are mutually dependent on one
another.
The next stage of the execution model specifies the execution semantics.
This can be taken as a rule that prescribes how a single execution step is to
be performed. This rule is, of course, linked with the chosen problem
description method and the way the execution of the computation is
interpreted. The final stage of the model specifies the control of the execution
sequences. In the basic models, execution is either control driven, data
driven or demand driven.
• In control-driven execution, it is assumed that there is a program
consisting of a sequence of instructions. The execution sequence is then
implicitly given by the order of the instructions.
Nevertheless, explicit control instructions can also be used to specify a
departure from the implied execution sequence.
• Data-driven execution is characterised by the rule that an operation is
activated as soon as all the needed input data is available. Data-driven
execution control is characteristic of the dataflow model of computation.
• In demand-driven execution, the operations are activated only when
their execution is needed to achieve the final result. Demand-driven
execution control is typically used in the applicative computational
model.
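The contrast between data-driven and demand-driven control can be sketched as follows. The small operation graph, its names and both functions are hypothetical and serve only as an illustration, under the assumption that every operation takes named inputs and produces one named result.

```python
import operator

# Hypothetical operation graph: result name -> (operation, names of its inputs).
GRAPH = {
    "sum": (operator.add, ("a", "b")),
    "product": (operator.mul, ("sum", "c")),
    "unused": (operator.sub, ("a", "c")),
}


def data_driven(graph, inputs):
    """Activate every operation as soon as all of its input data is available."""
    values = dict(inputs)
    pending = dict(graph)
    fired = []
    while pending:
        ready = [name for name, (_, args) in pending.items()
                 if all(arg in values for arg in args)]
        if not ready:
            break  # the remaining operations can never receive their inputs
        for name in ready:
            op, args = pending.pop(name)
            values[name] = op(*(values[arg] for arg in args))
            fired.append(name)
    return values, fired


def demand_driven(graph, inputs, wanted):
    """Activate only the operations needed to obtain the wanted result."""
    values = dict(inputs)
    fired = []

    def need(name):
        if name not in values:
            op, args = graph[name]
            values[name] = op(*(need(arg) for arg in args))
            fired.append(name)
        return values[name]

    return need(wanted), fired


inputs = {"a": 2, "b": 3, "c": 4}
all_values, data_order = data_driven(GRAPH, inputs)
product, demand_order = demand_driven(GRAPH, inputs, "product")
print(data_order)    # ['sum', 'unused', 'product']
print(demand_order)  # ['sum', 'product']
```

In the data-driven run the operation named "unused" is activated because its inputs are available, whereas the demand-driven run never activates it because the final result does not depend on it.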
Self Assessment Questions
1. The _________ model refers to both the style and method of problem
description.
2. In a _________ , the algorithm for solving the problem is stated.
3. _________ execution is characterised by the rule that an operation is
activated as soon as all the needed input data is available.
1.3 Evolution of Computer Architecture
With revolutionary developments in the area of semiconductor
technology, computer architecture has gradually evolved in stages over
the years. The main goal of this evolution is to enhance the performance of
processors. The history of computers begins with the invention of the abacus
in 3000 BC, followed by the invention of mechanical calculators in 1617. The
years from 1642 to 1980 are marked by the invention of zeroth, first, second
and third generation computers. The years from 1980 until today are marked
by fourth generation computers. Fifth generation computers are still under
research and development.
Zeroth Generation Computers: The zeroth generation of computers (1642-1946)
was marked by the invention of largely mechanical
computers. In 1642, a French mathematician named Blaise Pascal invented
the first mechanical device, which was called the Pascaline. In 1822, Charles
Babbage, an English mathematician, invented a machine called the Difference
Engine to compute tables of numbers for naval navigation. Later, in
1834, Babbage attempted to build a digital computer called the Analytical
Engine. The Analytical Engine had all the parts of a modern computer, i.e. the
store (memory unit), the mill (computation unit), the punched card reader (input
unit) and the punched/printed output (output unit). As all the basic parts of
modern computers were thought out by Charles Babbage, he is known as the
Father of Computers.
First Generation Computers: The first generation of computers (1946-1954)
was marked by the use of vacuum tubes or valves as their basic electronic
component. Although these computers were faster than earlier mechanical
devices, they had many disadvantages. First of all, they were very large in
size. They consumed too much power and generated too much heat, even when
used for a short duration of time. They were very unreliable and broke
down frequently. They required regular maintenance, and their components
also had to be assembled manually.
Some examples of first generation computers are ENIAC (Electronic
Numerical Integrator and Computer), EDVAC (Electronic Discrete Variable
Automatic Computer), EDSAC (Electronic Delay Storage Automatic
Calculator), UNIVAC I (Universal Automatic Computer) and the IAS machine
(built by Princeton’s Institute for Advanced Study). The basic design of a
first generation computer is shown in figure 1.1.
Figure 1.1: Basic Design of a First Generation Computer
IAS machine was a new version of the EDVAC, which was built by von
Neumann. The basic design of IAS machine is now known as von Neumann
machine, which had five basic parts - the memory, the arithmetic logic unit, the
program control unit, the input and output unit as shown in figure 1.2.
Second Generation Computers: The first generation of computers became
outdated when, in 1954, the Philco Corporation developed transistors that could
be used in place of vacuum tubes. The second generation of computers (1953-64)
was marked by the use of transistors in place of vacuum tubes. Transistors
had a number of advantages over vacuum tubes. As transistors were made
from pieces of silicon, they were more compact than vacuum tubes.
The second generation computers were smaller in size and generated less
heat than first generation computers. Although they were slightly faster and
more reliable than earlier computers, they also had many disadvantages.
They had limited storage capacity, consumed more power and were also
relatively slow in performance. Some examples of second generation
computers are IBM 7090, IBM 1401 and PDP-1. The basic design of a second
generation computer is shown in figure 1.3.
Figure 1.3: Basic Design of Second Generation Computer
Third Generation Computers: Second generation computers became outdated
after the invention of ICs. The third generation of computers (1964-1978)
was marked by the use of Integrated Circuits (ICs) in place of transistors.
As hundreds of transistors could be put on a single small circuit, ICs were
more compact than discrete transistors. The third generation computers removed
many drawbacks of second generation computers. They were even smaller in
size, generated much less heat and required much less power than the
earlier two generations of computers. These
computers required less human labour at the assembly stage.
Some examples of third generation computers are IBM 360, PDP-8, Cray-1
and VAX. The basic design of a third generation computer is shown in figure
1.4.
Figure 1.4: Basic Design of a Third Generation Computer
Fourth Generation Computers: The third generation computers became outdated
when it was found, in around 1978, that thousands of ICs could be
integrated onto a single chip, called LSI (Large Scale Integration).
The fourth generation of computers (1978 to date) is marked by the use of
Large Scale Integrated (LSI) circuits in place of ICs. As thousands of ICs can
be put onto a single circuit, LSI circuits are still more compact than ICs. In
1978, it was found that millions of components could be packed onto a single
circuit, known as Very Large Scale Integration (VLSI). VLSI is the
technology that led to the development of the popular Personal
Computers (PCs), also called Microcomputers.
Some examples of fourth generation computers are IBM PC, IBM PC/AT, 386,
486, Pentium and CRAY-2. The basic design of a fourth generation computer
is shown in figure 1.5.
Fifth Generation Computers: Although fourth generation computers offer
many advantages to users, they still have one main disadvantage. The major
drawback of these computers is that they have no intelligence of their own.
Scientists are now trying to remove this drawback by making computers that
would have artificial intelligence. The fifth generation computers (tomorrow's
computers) are still at the research and development stage.
They will use ULSI (Ultra Large-Scale Integration) chips in place of VLSI chips.
One ULSI chip contains millions of components on a single IC. Robots have
some features of fifth generation computers.
Self Assessment Questions
4. ______ was the first mechanical device, invented by Blaise Pascal.
5. ________ was a new version of the EDVAC, which was built by von
Neumann.
6. The fourth generation of computers was marked by use of Integrated
Circuits (ICs) in place of transistors. (True/ False)
7. Personal Computers (PCs) are also called Microcomputers.
(True/ False)
Activity 1:
Using the Internet, find out about Fifth Generation Computer Systems
project (FGCS), idea behind it, implementation, timeline and outcome
1.4 Process and Thread
Every process provides the resources required to execute a program. A
process has executable code, a virtual address space, open handles to
system objects, a unique process identifier, a security context, minimum and
maximum working set sizes, environment variables, a priority class, and at
least one thread of execution. Each process is started with a single thread,
often called the primary thread, but can create additional threads from any of
its threads.
A thread is the entity within a process that can be scheduled for execution. All
threads of a process share its system resources and virtual address space.
Additionally, each thread maintains exception handlers, thread local storage,
a scheduling priority, a unique thread identifier, and a set of structures the
system will utilise to save the thread context until it is scheduled. The thread
context includes the thread's set of machine registers, a thread environment
block, the kernel stack and a user stack in the address space of the thread's
process. Threads can also have their own security context, which is valuable
in impersonating clients.
The basic difference between a process and a thread is that every process has its
own data memory location, but all related threads can share the same data
memory while having their own individual stacks. A process is a collection of virtual
memory space, code, data and system resources, whereas a thread is code
that is executed serially within a process.
Let’s study these concepts in detail.
1.4.1 Concept of process
In operating system terminology, instead of the term ‘program’, the notion of
process is used in connection with execution. It designates a task or
job, or a quantum of work dealt with as an entity. Consequently, the resources
required, such as address space, are typically allocated on a process basis.
Each process has a life cycle, which consists of creation, an execution phase
and termination.
Process creation involves the following four main actions:
• Setting up the process description: Usually, operating systems describe
a process by means of a description table which is called the Process
Control Block or PCB. A PCB contains all the information relevant to the
whole life cycle of a process. It holds basic data such as process
identification, owner, process status, description of the allocated address
space and so on.
• Allocating address space: Allocation of address space to a process for
execution is the second major component of process creation. There are
two approaches: sharing the address space among the created
processes (shared memory) or allocating distinct address spaces to each
process (per-process address spaces).
• Loading the program into the allocated address space: Subsequently,
the executable program file will usually be loaded into the allocated
memory space.
• Passing the process description to the scheduler: Finally, the process
thus created is passed to the process scheduler which allocates the
processor to the competing processes. The process scheduler manages
processes typically by setting up and manipulating queues of PCBs. Thus,
after creating a process the scheduler puts the PCB into a queue of
ready-to-run processes.
Process scheduling involves three key concepts: the declaration of distinct
process states, the specification of the state transition diagram and the
statement of a scheduling policy. As far as process states are concerned, there
are three basic states connected with scheduling:
• The ready-to-run state
• The running state and
• The wait (or blocked) state.
In the wait state, processes are suspended or blocked, waiting for the occurrence of
some event before becoming ready to run again. When the scheduler selects a
process for execution, its state is changed from ready-to-run to running.
Finally, a process in the wait state can go into the ready-to-run state if the event it
is waiting for has occurred. You can see the various process states in figure 1.6.
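The three basic states and the transitions between them can be modelled as a small state machine. The sketch below is a heavily simplified illustration: the class name PCB, its fields and the move_to method are hypothetical, and the transitions from running to wait and from running back to ready-to-run (pre-emption) are assumptions added to complete the picture.

```python
READY, RUNNING, WAITING = "ready-to-run", "running", "wait"

# Allowed transitions between the three basic scheduling states.
TRANSITIONS = {
    (READY, RUNNING),    # the scheduler selects the process for execution
    (RUNNING, WAITING),  # assumed: the process blocks, waiting for an event
    (RUNNING, READY),    # assumed: the process is pre-empted
    (WAITING, READY),    # the event the process was waiting for has occurred
}


class PCB:
    """Hypothetical, heavily simplified Process Control Block."""

    def __init__(self, pid, owner):
        self.pid = pid
        self.owner = owner
        self.state = READY   # a newly created process starts as ready-to-run

    def move_to(self, new_state):
        """Change state, rejecting transitions that the model does not allow."""
        if (self.state, new_state) not in TRANSITIONS:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state


process = PCB(pid=1, owner="student")
process.move_to(RUNNING)   # selected by the scheduler
process.move_to(WAITING)   # blocks on some event
process.move_to(READY)     # the event has occurred
print(process.state)       # ready-to-run
```

A waiting process cannot move directly to the running state in this model; it first has to become ready-to-run and then be selected by the scheduler.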
1.4.2 Concept of thread
A thread is a basic unit of CPU utilisation, which consists of a
program counter, a stack, a set of registers and a thread ID. Conventional
heavyweight processes have a single thread of control. In other words,
there is one program counter, and one sequence of instructions that can be
carried out at any given time.
At present, multi-threaded applications have largely taken the place of
single-threaded applications. These have multiple threads within a single
process, each having its own program counter, stack and set of registers, but
sharing common code, data, and certain structures such as open files. See
figure 1.7 to find out the differences between single-threaded and
multithreaded processes.
Figure 1.7: Single and Multithreaded Processes
Threads are very useful in modern programming, particularly when a process
has multiple tasks to perform independently of the others. This is especially helpful
when one of the tasks may block, and it is desirable to allow the other tasks to
continue without blocking. For example, in a word processor, a background
thread may check spelling and grammar while a foreground thread processes
user input (keystrokes), a third thread loads images from the hard
drive, and a fourth does periodic automatic backups of the file being edited.
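The way threads share the data of their process while keeping their own stacks can be sketched with Python's standard threading module. The worker function, the shared counter and the number of threads are assumptions chosen for illustration.

```python
import threading

counter = {"value": 0}       # data of the process, shared by all its threads
lock = threading.Lock()      # protects the shared data from simultaneous updates


def worker(increments):
    local_total = 0          # local variable: lives on this thread's own stack
    for _ in range(increments):
        local_total += 1
    with lock:               # only one thread at a time updates the shared data
        counter["value"] += local_total


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()            # wait until every thread has finished

print(counter["value"])      # 4000
```

Each thread works on its private local_total without interference, and the lock is needed only at the point where the threads touch the data they share.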
Self Assessment Questions
8. All threads of a process share its virtual address space and system
resources. (True/ False)
9. When the scheduler selects a process for execution, its state is changed
from ready-to-run to the wait state. (True/ False)
1.5 Concepts of Concurrent and Parallel Execution
Concurrent execution is the temporal behaviour of the N-client 1-server model
where one client is served at any given moment. This model has a dual nature;
it is sequential in a small time scale, but simultaneous in a rather large time
scale. In this situation, the key problem is how the competing clients, let us
say processes or threads, should be scheduled for service (execution) by the
single server (processor). The scheduling policy may be viewed as covering
the following two aspects:
Pre-emption rule: It deals with whether servicing a client can be interrupted
or not and, if so, on what occasions. The pre-emption rule may either specify
time-sharing, which restricts continuous service for each client to the duration
of a time slice, or can be priority based, interrupting the servicing of a client
whenever a higher priority client requests service.
Selection rule: It states how one of the competing clients is selected for
service. The selection rule is typically based on certain parameters, such as
priority, time of arrival, and so on. This rule specifies an algorithm to determine
a numeric value, which we will call the rank, from the given parameters. During
selection, the ranks of all competing clients are computed and the client with
the highest rank is scheduled for service.
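A selection rule of this kind can be sketched as a function that computes a rank for each client and picks the client with the highest rank. The particular rank formula below (priority first, waiting time as a tie-breaker) and the client records are assumptions for illustration, not a rule prescribed by any particular operating system.

```python
def rank(client, now):
    """Hypothetical rank: priority dominates, waiting time breaks ties."""
    waiting_time = now - client["arrival"]
    return client["priority"] * 1000 + waiting_time


def select(clients, now):
    """Compute the ranks of all competing clients and return the highest-ranked one."""
    return max(clients, key=lambda client: rank(client, now))


clients = [
    {"name": "A", "priority": 1, "arrival": 0},
    {"name": "B", "priority": 3, "arrival": 5},
    {"name": "C", "priority": 3, "arrival": 2},
]

chosen = select(clients, now=10)
print(chosen["name"])  # C
```

Clients B and C have the same priority, so C is selected because it arrived earlier and has therefore waited longer.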
Parallel execution: Parallel execution is associated with the N-client N-server
model. Having more than one server allows the servicing of more than one
client at the same time; this is called parallel execution. Parallel computing
is the simultaneous use of multiple compute resources to solve a
computational problem. It may involve the use of multiple CPUs. A problem is
broken into discrete parts that can be solved concurrently. Each part is further
broken down into a series of instructions, and instructions from each part execute
simultaneously on different CPUs, as shown in figure 1.8.
Figure 1.8: Parallel Computing Systems
Thus, a computer system is said to be a Parallel Processing
System or Parallel Computer if it provides facilities for simultaneous
processing of various sets of data or simultaneous execution of multiple
instructions. On a computer with more than one processor, each of several
processes can be assigned to its own processor, to allow the processes to
progress simultaneously. If only one processor is available, the effect of parallel
processing can be simulated by having the processor run each process in turn
for a short time.
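Running each process in turn for a short time on a single processor can be sketched with Python generators. The function names, the process names and the time slice of one step are assumptions made for this illustration.

```python
from collections import deque


def process(name, steps):
    """A hypothetical process that needs `steps` units of processor time."""
    for i in range(1, steps + 1):
        yield f"{name}{i}"      # one unit of work


def round_robin(processes, time_slice=1):
    """Run each process in turn for at most `time_slice` units of work."""
    queue = deque(processes)
    trace = []
    while queue:
        current = queue.popleft()
        for _ in range(time_slice):
            try:
                trace.append(next(current))
            except StopIteration:
                break               # the process has finished; drop it
        else:
            queue.append(current)   # slice used up: back to the end of the queue
    return trace


trace = round_robin([process("A", 3), process("B", 1), process("C", 2)])
print(trace)  # ['A1', 'B1', 'C1', 'A2', 'C2', 'A3']
```

The trace shows the work of the three processes interleaved, so that over a longer time scale they all appear to progress together even though only one runs at any given moment.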
Self Assessment Questions
10. Concurrent execution is the temporal behaviour of the _____ Model.
11. During selection, the ranks of all competing clients are computed and the
client with the highest rank is scheduled for service. (True/ False)
1.6 Classification of Parallel Processing
The core elements of parallel processing are Central Processing Units (CPUs).
The essential computing process is the execution of a sequence of instructions
on a set of data. The term stream is used here to denote a sequence of items
as executed by a single processor or multiprocessor. Based on the number of
instruction and data streams that can be processed simultaneously,
Flynn classifies computer systems into four categories. They are:
(a) Single Instruction Single Data (SISD)
(b) Single Instruction Multiple Data (SIMD)
(c) Multiple Instruction Single Data (MISD)
(d) Multiple Instruction Multiple Data (MIMD)
Let’s learn more about them.
1.6.1 Single instruction single data (SISD)
Computers with a single processor that is capable of executing scalar
arithmetic operations using a single instruction stream and a single data
stream are called SISD (Single Instruction Single Data) computers. They are
characterised by:
Single instruction: Only one instruction stream is being
acted on by the CPU during any one clock cycle.
Single data: Only one data stream is being used as input during any
one clock cycle.
This is the oldest and, until recently, the most widespread type of computer.
Examples: Most PCs, single CPU workstations and mainframes.
Figure 1.9 shows an example of SISD.
1.6.2 Single instruction multiple data (SIMD)
Computers with a single processor that is capable of executing vector
arithmetic operations using a single instruction stream but multiple data
streams are called SIMD (Single Instruction Multiple Data) computers. They
are characterised by:
Single instruction: All processing units execute the same instruction at
any given clock cycle.
Multiple data: Each processing unit can operate on a different data element.
This category of machine typically has an instruction dispatcher, a
very high-bandwidth internal network, and a very large array of very
small-capacity instruction units. It is best suited to specialised problems
characterised by a high degree of regularity, such as image processing. Figure
1.10 shows an example of SIMD processing.
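The idea of one instruction applied to many data elements can be modelled in Python as follows. This is only a conceptual model: the list comprehension runs sequentially, whereas real SIMD hardware applies the instruction to all elements in the same clock cycle. The function names and the pixel values are assumptions for illustration.

```python
def simd_step(instruction, data_elements):
    """Model one SIMD step: the same instruction is applied to every data element."""
    return [instruction(element) for element in data_elements]


def brighten(pixel, amount=40):
    """Hypothetical image-processing instruction: brighten one 8-bit pixel."""
    return min(pixel + amount, 255)   # saturate at the maximum 8-bit value


pixels = [0, 100, 200, 250]
brightened = simd_step(brighten, pixels)
print(brightened)  # [40, 140, 240, 255]
```

Image processing suits this model well because every pixel is treated in exactly the same way, which is the regularity mentioned above.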
1.6.3 Multiple instruction single data (MISD)
Computers with multiple processors that are capable of executing different
operations using multiple instruction streams but single data stream are called
MISD (Multiple instruction Single Data) computers. They are characterised by:
Multiple Instructions: Each processing unit operates on the data independently via
independent instruction streams.
Single data: A single data stream is fed into multiple processing units.
Some conceivable uses of this architecture are multiple frequency filters
operating on a single signal stream and multiple cryptography algorithms
trying to crack a single coded message. Figure 1.11 shows an example of
MISD processing.
Figure 1.11: MISD Process
1.6.4 Multiple Instruction Multiple Data (MIMD)
Computers with multiple processors that are capable of executing vector
arithmetic operations using multiple instruction streams and multiple data
streams are called MIMD (Multiple Instruction Multiple Data) computers. They
are characterised by:
Multiple Instructions: Each processor may be executing a different
instruction stream.
Multiple Data: Each processor may be working with a different data stream.
It is the most common type of parallel computer. Most modern computers fall
into this category. Execution can be synchronous or asynchronous,
deterministic or non-deterministic.
Examples: most current supercomputers, networked parallel computer “grids”
and multi-processor computers. Figure 1.12 shows an example of MIMD
processing.
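Flynn's four categories follow directly from two numbers: how many instruction streams and how many data streams are processed simultaneously. The small function below is a sketch of that rule; its name and parameters are assumptions for illustration.

```python
def flynn_category(instruction_streams, data_streams):
    """Return the Flynn category for the given numbers of simultaneous streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("there must be at least one stream of each kind")
    instruction_part = "S" if instruction_streams == 1 else "M"
    data_part = "S" if data_streams == 1 else "M"
    return f"{instruction_part}I{data_part}D"


print(flynn_category(1, 1))  # SISD
print(flynn_category(1, 8))  # SIMD
print(flynn_category(4, 1))  # MISD
print(flynn_category(4, 4))  # MIMD
```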
Self Assessment Questions
12. In _________ all processing units execute the same instruction at any
given clock cycle.
13. In which system is a single data stream fed into multiple processing units?
14. ________ is the most common type of parallel computer.
1.7 Parallelism and Types of Parallelism
A parallel computer is a set of processors that are able to work cooperatively
to solve a computational problem. This definition is broad enough to include
parallel supercomputers that have hundreds or thousands of processors,
networks of workstations, embedded systems and multiple-processor
workstations. Parallel computers have the potential to
concentrate computational resources, such as processors, memory or I/O
bandwidth, on important computational problems. The following are the various
types of parallelism:
Bit-level parallelism: Bit-level parallelism is a form of parallel computing
based on increasing processor word size. From the advent of very-large-scale
integration (VLSI) computer chip fabrication technology in the 1970s until
about 1986, advances in computer architecture were driven by
increasing bit-level parallelism.
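The benefit of a larger word size can be illustrated by adding two 16-bit numbers. A processor with an 8-bit word needs two addition steps (low bytes, then high bytes plus carry), while a processor with a 16-bit word needs one. The functions below are a sketch of this idea; their names and the step counts they return are assumptions made for illustration.

```python
def add16_on_8bit(a, b):
    """Add two 16-bit values using only 8-bit additions; returns (result, steps)."""
    low = (a & 0xFF) + (b & 0xFF)             # step 1: add the low-order bytes
    carry = low >> 8                          # carry out of the low byte
    high = (a >> 8) + (b >> 8) + carry        # step 2: add the high bytes plus carry
    result = ((high & 0xFF) << 8) | (low & 0xFF)
    return result, 2


def add16_on_16bit(a, b):
    """Add two 16-bit values with one 16-bit addition; returns (result, steps)."""
    return (a + b) & 0xFFFF, 1


print(add16_on_8bit(0x12FF, 0x0001))   # (4864, 2), i.e. 0x1300 in two steps
print(add16_on_16bit(0x12FF, 0x0001))  # (4864, 1), i.e. 0x1300 in one step
```

Both functions give the same result, but the wider word halves the number of addition steps.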
Instruction-level parallelism: A computer program is a stream of
instructions carried out by a processor. These instructions can be reordered
and combined into groups which are then executed in parallel without altering
the outcome of the program. This is known as instruction-level parallelism.
Advances in instruction-level parallelism dominated computer architecture
from the mid-1980s until the mid-1990s.
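Grouping instructions that do not depend on each other can be sketched as follows. The three-instruction program and the function name are hypothetical, and the sketch assumes that each destination is written only once, so only true data dependencies (a result being read by a later instruction) are considered.

```python
# Hypothetical three-address instructions: (destination, source1, source2).
PROGRAM = [
    ("e", "a", "b"),   # e = a + b
    ("f", "c", "d"),   # f = c + d
    ("m", "e", "f"),   # m = e * f  (needs both e and f)
]


def issue_groups(program):
    """Group instructions so that each group depends only on earlier groups."""
    produced_in = {}   # destination -> index of the group that produces it
    groups = []
    for dest, *sources in program:
        # An instruction must come one group after the latest result it reads.
        level = max((produced_in[s] + 1 for s in sources if s in produced_in),
                    default=0)
        if level == len(groups):
            groups.append([])
        groups[level].append(dest)
        produced_in[dest] = level
    return groups


print(issue_groups(PROGRAM))  # [['e', 'f'], ['m']]
```

The first two instructions are independent and can be executed in parallel, while the third must wait for both of their results, so the program needs two steps instead of three.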
Data parallelism: Data parallelism is parallelism inherent in program loops. It
focuses on distributing the data across various computing nodes to be processed
in parallel. Parallelising loops often leads to similar (not necessarily
identical) operation sequences or functions being performed on elements of a
large data structure. Many scientific and engineering applications exhibit data
parallelism.
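Distributing the iterations of a loop over several workers can be sketched with Python's standard concurrent.futures module. The function names, the doubling operation and the number of workers are assumptions for illustration, and worker threads stand in here for the computing nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_chunk(chunk, factor=2):
    """The loop body, applied to one chunk of the data."""
    return [x * factor for x in chunk]


def parallel_scale(data, workers=4):
    """Split the data into chunks and process the chunks with a pool of workers."""
    size = max(1, -(-len(data) // workers))          # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(scale_chunk, chunks))  # map keeps the chunk order
    return [x for chunk in results for x in chunk]


scaled = parallel_scale(list(range(10)))
print(scaled)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```

Every worker performs the same operation sequence on its own part of the data, which is the regular structure that makes loops a natural source of data parallelism.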
Self Assessment Questions
15. Parallel computers offer the potential to concentrate computational
resources on important computational problems. (True/ False)
16. Advances in instruction-level parallelism dominated computer architecture
from the mid-1990s until the mid-2000s. (True/False)
1.8 Levels of Parallelism
Parallelism is one of the most popular ideas in computing. Architectures,
compilers and operating systems have been striving for more than two decades
to extract and utilise as much parallelism as possible in order to speed up
computation. The notion of parallelism is used in two different contexts. Either
it designates available parallelism in programs or it refers to parallelism
occurring during execution, called utilised parallelism.
Types of available parallelism: Problem solutions may contain two different
kinds of available parallelism, called functional parallelism and data
parallelism.
Functional parallelism is the kind of parallelism that arises from the logic of
a problem solution. In contrast, data parallelism comes from using data
structures that allow parallel operations on their elements, such as vectors or
matrices, in problem solutions. From another point of view, parallelism can be
considered as being either regular or irregular. Data parallelism is regular,
whereas functional parallelism, with the exception of loop-level parallelism,
is usually irregular.
Levels of available functional parallelism: Programs written in imperative
languages may represent functional parallelism at different levels, that is, at
different sizes of granularity. In this respect, we can identify the following four
levels and corresponding granularity sizes:
• Parallelism at the instruction level (fine-grained parallelism): Available
instruction-level parallelism means that particular instructions of a program
may be executed in parallel. In this context, instructions can be either
assembly (machine-level) or high-level language instructions. Usually,
instruction-level parallelism is understood at the machine-language
(assembly-language) level.
• Parallelism at the loop level (middle-grained parallelism): Parallelism
may also be available at the loop level. Here, consecutive loop iterations
are candidates for parallel execution. However, data dependencies
between subsequent loop iterations, called recurrences, may restrict their
parallel execution.
• Parallelism at the procedure level (middle-grained parallelism): Next,
there is parallelism available at the procedure level in the form of parallel
executable procedures. The extent of parallelism exposed at this level is
subject mainly to the kind of the problem solution considered.
• Parallelism at the program level (coarse-grained parallelism): Lastly,
different programs (users) are obviously independent of each other. Thus,
parallelism is also available at the user level (which we consider to be
coarse-grained parallelism). Multiple, independent users are a key source
of parallelism occurring in computing scenarios.
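The effect of recurrences on loop-level parallelism can be illustrated with two small Python loops. The function names are hypothetical and the code is only a sketch of the dependence pattern, not a parallel implementation.

```python
def independent_loop(a, b):
    # Iteration i reads only a[i] and b[i]. No iteration depends on another,
    # so all iterations are candidates for parallel execution.
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def recurrence_loop(a):
    # Iteration i reads s[i - 1], which is written by iteration i - 1.
    # This recurrence forces the iterations to run one after another.
    s = [0] * len(a)
    for i in range(len(a)):
        s[i] = a[i] + (s[i - 1] if i > 0 else 0)
    return s


print(independent_loop([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
print(recurrence_loop([1, 2, 3, 4]))              # [1, 3, 6, 10]
```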
Utilisation of functional parallelism: Available parallelism can be utilised by
architectures, compilers and operating systems conjointly for speeding up
computation. Let us first discuss the utilisation of functional parallelism.
In general, functional parallelism can be utilised at four different levels of
granularity, that is,
• Instruction
• Thread
• Process
• User level
It is quite natural to utilise available functional parallelism, which is inherent in
a conventional sequential program, at the instruction level by executing
instructions in parallel. This can be achieved by means of architectures
capable of parallel instruction execution. Such architectures are referred to as
instruction-level function-parallel architectures or simply instruction-level
parallel architectures, commonly abbreviated as ILP-architectures.
Available functional parallelism in a program can also be utilised at the thread
and/or at the process level. Threads and processes are self-contained
execution entities embodying an executable chunk of code. Threads and
processes can be created either by the programmer using parallel languages
or by operating systems that support multi-threading or multitasking. They can
also be automatically generated by parallel compilers during compilation of
high-level language programs. Available loop and procedure-level parallelism
will often be exposed in the form of threads and processes.
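As an illustration of procedure-level functional parallelism exposed as threads, the Python sketch below runs two different, independent procedures on the same input at the same time. The function names total, largest and summarise are hypothetical; the threads are created by the programmer through the standard concurrent.futures module.

```python
from concurrent.futures import ThreadPoolExecutor


def total(values):
    return sum(values)


def largest(values):
    return max(values)


def summarise(values):
    # The two procedures do not depend on each other, so each one is
    # submitted as a separate thread of execution.
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_total = pool.submit(total, values)
        f_largest = pool.submit(largest, values)
        # result() waits until the corresponding thread has finished.
        return f_total.result(), f_largest.result()


print(summarise([4, 8, 15, 16, 23, 42]))  # (108, 42)
```

Unlike data parallelism, the two threads here execute different code, which is why functional parallelism is usually irregular.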
Self Assessment Questions
17. Parallelism occurring during execution is called --------------------- .
18. Parallelism at the instruction level is also called middle-grained
parallelism. (True/ False)
19. Data parallelism is regular, whereas functional parallelism, with the
exception of loop-level parallelism, is usually irregular. (True/ False)
Activity 2:
Decide which architecture is most appropriate for a given application. First
determine the form of parallelisation which would best suit the application, then
decide on both the hardware and the software for running your parallelised
application.
1.9 Summary
Let us recapitulate the important concepts discussed in this unit:
• Computer Architecture deals with the issue of selection of hardware
components and interconnecting them to create computers that achieve
specified functional, performance and cost goals.
• The concept of a computational model represents a higher level of
abstraction than can be achieved by either the computer architecture or
the programming language alone, and covers both.
• History of computers begins with the invention of the abacus in 3000 BC,
followed by the invention of mechanical calculators in 1617. Fifth
generation computers are still under research and development.
• Each process provides the resources needed to execute a program.
• A thread is the entity within a process that can be scheduled for execution.
• Concurrent execution is the temporal behaviour of the N-client 1-server
model where one client is served at any given moment.
• Parallel execution is associated with N-client N-server model.
• Based on the number of instruction and data streams that can be processed
simultaneously, Flynn classifies computer systems into four
categories.
• The notion of parallelism is used in two different contexts and three
different types. Either it designates available parallelism in programs or it
refers to parallelism occurring during execution, called utilised parallelism.
1.10 Glossary
• EDSAC: Electronic Delay Storage Automatic Calculator
• EDVAC: Electronic Discrete Variable Automatic Computer
• ENIAC: Electronic Numerical Integrator and Computer
• IC: Integrated Circuit where hundreds of transistors could be put on a
single small circuit.
• LSI: Large Scale Integration, it can pack thousands of transistors
• MSI: Medium Scale Integration, it packs as many as 100 transistors
• PCB: Process Control Block, it is a description table which contains all the
information relevant to the whole life cycle of a process.
• SSI: Small Scale Integration, it can pack 10 to 20 transistors in a single
chip.
• UNIVAC I: Universal Automatic Computer I
• ULSI: Ultra Large-Scale Integration, it contains millions of components on
a single IC
• VLSI: Very Large Scale Integration, it can pack hundreds of thousands of
transistors or more
1.11 Terminal Questions
1. Explain the concept of Computational Model. Describe its various types.
2. What are the different stages of evolution of Computer Architecture?
Explain in detail.
3. What is the difference between process and thread?
4. Explain the concepts of concurrent and parallel execution.
5. State Flynn’s classification of Parallel Processing.
6. Explain the types of parallelism.
7. What are the various levels of parallelism?
1.12 Answers
Self Assessment Questions
1. Problem description
2. Procedural style
3. Data-driven
4. Pascaline
5. IAS machine
6. False
7. True
8. True
9. False
10. N-client 1-server
11. True
12. Single Instruction Multiple Data
13. Multiple Instruction Single Data
14. Multiple Instruction Multiple Data
15. True
16. False
17. Utilised parallelism
18. False
19. True
Terminal Questions
1. A common foundation or paradigm that links the computer architecture
and language classes is called a Computational Model. Refer Section 1.2.
2. History of computers begins with the invention of the abacus in 3000 BC,
followed by the invention of mechanical calculators in 1617. The years
beyond 1642 till 1980 are marked by inventions of zeroth, first, second and
third generation computers. Refer Section 1.3.
3. A thread is the entity within a process that can be scheduled for execution.
Refer Section 1.4.
4. Concurrent execution is the temporal behaviour of the N-client 1-server
model where one client is served at any given moment. Parallel execution
is associated with N-client N-server model. Refer Section 1.5.
5. Flynn classifies the computer system into four categories. Refer Section
1.6.
6. There are three types of parallelism. Refer Section 1.7.
7. The notion of parallelism is used in two different contexts. Either it
designates available parallelism in programs or it refers to parallelism
occurring during execution, called utilised parallelism. Refer Section 1.8.
Unit 2 Fundamentals of Computer Design
Structure:
2.1 Introduction
Objectives
2.2 Changing Face of Computing
Desktop computing
Servers
Embedded computers
2.3 Computer Designer
2.4 Technology Trends
2.5 Quantitative Principles in Computer Design
Advantages of parallelism
Principle of locality
Focus on the common case
2.6 Power Consumption
2.7 Summary
2.8 Glossary
2.9 Terminal Questions
2.10 Answers
2.1 Introduction
In the previous unit, you studied the computational model and the
evolution of computer architecture. You also studied the concepts of process
and thread. We covered the two types of execution, concurrent and parallel,
as well as the types and levels of parallelism. In this unit, we will look at
the changing face of computing, the tasks of the computer designer and the
quantitative principles of computer design. We will also examine the technology
trends and understand the concept of power consumption and efficiency of the
matrix.
You can define computer design as the activity that converts the architectural
design of a computer into a programming structure implementation of a
particular organisation. Thus, computer design is also referred to as computer
implementation. The computer designer is responsible for the hardware
architecture of the computer.
Objectives:
After studying this unit, you should be able to:
• identify the changing face of computing
• explain the tasks of the computer designer
• describe the technology trends
• discuss the quantitative principles of the computer design
• describe power consumption and efficiency of the matrix
2.2 Changing Face of Computing
Computer technology has changed drastically in the 60 or so years since the
first general-purpose computer was invented. The microprocessor made its
entrance in the late 1970s. It integrated the functions of a
computer’s Central Processing Unit (CPU) on a single integrated circuit. This
raised the growth in computer performance to about 35% per year. The cost
advantage of mass-produced microprocessors, combined with this 35% growth
rate, led to an increasing share of the computer business being based on the
microprocessor.
In the 1960s, main-frame computers were the most prevalent. These computers
required huge investments as well as operators to monitor and support them.
Main-frame computers supported typical applications like
business data support and large-scale scientific computing. Then in the 1970s
came the minicomputers, smaller-sized computers that supported
applications in scientific laboratories. Minicomputers soon spread further
with the popularity of time-sharing, i.e., multiple users sharing the
computers.
In the late 1970s, we observed the emergence of supercomputers, which were
high-performance computers for scientific computing. This class of computers
led the way to innovations that later reduced the investment required for a
computer.
In 1980s came the desktop computers - based on microprocessors - in the
form of personal computers (also known as PCs) and workstations. The
personal computers facilitated the rise of servers - computers that were highly
reliable, supported long-term data storage and access, and improved
computing power.
Individually owned computers, in the 1990s, gave rise to more personalised
services and to better communication with other computers all over
the world. This resulted in the rise of the Internet and the World Wide Web
(WWW), and of personal digital assistants (PDAs). By the 2000s, the face of
computing started changing with the coming of cell phones and their
extraordinary popularity. This raised the need for embedded computers. An
embedded computer system is designed for particular control functions within
a larger system such as a cell phone; it is embedded as part of a complete
device. As a matter of fact, 98% of computing devices are embedded in all
kinds of electronic equipment. Computers are moving away from desktops and
laptops and are finding use in everyday devices like mobile phones, credit
cards, planes and cars, and even in home appliances such as
stoves, refrigerators, microwaves, dishwashers, and driers. The trends are
shown in Figure 2.1.
These changes have dramatically altered the face of computing and of
computing applications. They have led to three different computer markets,
each with different requirements, specifications and applications. These
are explained as follows:
2.2.1 Desktop computing
Desktop computing is the largest market in monetary terms. It ranges from
low-end systems to very high-end, heavily configured computer systems.
Throughout this range, both price and performance vary. This combination of
performance and price matters most to customers in this market and thus to
computer designers. Consequently, both the newest, highest-performance
microprocessors and cost-reduced microprocessors are sold largely in desktop
systems.
Characteristics of desktop computing
The important characteristics of desktop computing are:
1. Ease-of-use: In desktop computers, all the computer parts come as
separate detachable components, which makes the computer easy and
comfortable to use.
2. Extensive graphic capabilities: It provides extensive graphics
capabilities for data visualisation and manipulation. It supports two- and
three-dimensional visuals.
2.2.2 Servers
The existence and popularity of servers emerged with the evolution of the
desktop computers. The role of servers expanded to provide more reliable
usage, storage and access of data, and to provide users with large-scale
computing services. Web-based services accelerated this trend
tremendously. Servers have successfully replaced the traditional main-frame
computers and have become the backbone of large enterprises, providing users
with large storage capacity.
Characteristics of a server
For servers, different characteristics are important. They are
explained below:
1. Dependability: The server’s dependability is critical. Breakdown of a
server is far more disastrous than the breakdown of a
single independent system.
2. Scalability: Servers are highly scalable in response to increasing demand
or requirements. They can be scaled up in computing capacity,
memory, storage capacity, etc.
3. Efficiency: Servers are highly efficient and cost-effective. The
responsiveness to individual requests remains high.
2.2.3 Embedded computers
An embedded computer is a computer system designed to perform a particular
function or task. It is embedded as a component of a bigger complete device.
Embedded systems contain microcontrollers dedicated to handling a specific
task. An embedded system can also be defined as a single-purpose computer
embedded in a device to control some particular function of that bigger device.
Nowadays, embedded computers are the fastest-growing segment of the
computer market. These devices cover a wide range of electronics: from
microwave ovens, washing machines, air conditioners and printers, all of
which contain simple embedded computers, to digital cell phones, set-top
boxes, play stations, etc. Embedded computers are based on different
microprocessors (8-bit, 16-bit or 32-bit) that execute millions of
instructions per second. Even though the variety of embedded computers in the
market is large, price is a chief factor in the design of these computers.
Performance is also crucial, but the chief objective is to meet the
performance need at the minimum cost.
Characteristics of embedded computers
1. Real-time performance: The performance requirement in an embedded
application is real-time execution. Speed, though in varying degrees, is an
important factor in all architectures, but here the need to assure real-time
performance acts as a constraint on the speed required of the system.
Real-time performance means that the system is guaranteed to complete its
work within certain time limits as specified by the task and the environment.
2. Soft real-time: In a number of applications, a more nuanced requirement
exists: the average time for a particular job is constrained, as is the number
of occurrences in which a maximum time is exceeded. Such approaches are
sometimes called soft real-time; they apply when it is acceptable to
occasionally miss the time limit on an event, provided that not too many
are missed.
3. Need to minimise memory size: Memory can be a considerable element
of the system cost. Thus, it is vital to limit the memory size according to
the requirement.
4. Need to minimise power: A larger memory also means a higher
power need. The emphasis on low power is driven by the use of batteries.
Unnecessary usage of power needs to be avoided to keep the power need
low.
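The difference between a strict real-time requirement and a soft real-time requirement can be sketched in Python as follows. The function names, the sample timings and the limits are all hypothetical values chosen for illustration.

```python
def meets_hard_deadline(times_ms, deadline_ms):
    # Strict real-time: no execution may ever exceed the deadline.
    return max(times_ms) <= deadline_ms


def meets_soft_deadline(times_ms, deadline_ms, avg_limit_ms, max_misses):
    # Soft real-time: the average time is constrained, and only a limited
    # number of executions may exceed the deadline.
    misses = sum(1 for t in times_ms if t > deadline_ms)
    average = sum(times_ms) / len(times_ms)
    return average <= avg_limit_ms and misses <= max_misses


samples = [8, 9, 12, 7, 9]  # measured execution times in milliseconds
print(meets_hard_deadline(samples, deadline_ms=10))  # False
print(meets_soft_deadline(samples, deadline_ms=10,
                          avg_limit_ms=9.5, max_misses=1))  # True
```

The single 12 ms run violates the strict requirement, but it is tolerated by the soft requirement because the average (9 ms) stays within its limit and only one deadline is missed.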
Self Assessment Questions
1. The __________ had the ability to integrate the functions of a
computer’s Central Processing Unit (CPU) on a single-integrated circuit.
2. _____________ computers used to support typical applications like
business data support and large-scale scientific computing.
3. The performance requirement in an embedded application is real-time
execution. (True/False)
4. ______________ is the chief objective of embedded computers.
2.3 Computer Designer
A computer designer is a person who designs CPUs or computers that are
actually built for considerable use. The designer also plays an important role
in the further development of computer designs. Computer designers are also
known as computer architects. The world’s first computer designer was Charles
Babbage (1791 - 1871) (see Figure 2.2). He is considered the father of
computers and holds the credit for inventing the first mechanical computer,
which eventually led to more complex designs.
Figure 2.2: Charles Babbage
Babbage did not complete his machines in his lifetime. Parts of his
uncompleted mechanisms are on display in the London Science Museum. In 1991,
the Science Museum completed a perfectly functioning difference engine built
from his original plans. Nine years later, it completed the printer Babbage
had designed for the difference engine.
Tasks of a computer designer: The tasks of a computer designer are
complex and challenging. The designer needs to determine the attributes that
are necessary for a new computer, then design a computer that maximises
performance while staying within the design constraints - cost, time,
power, size, memory, etc. Computer designers often create and customise
computer systems necessary for performing a company's daily tasks. See
Table 2.1 for functional requirements and features to be met.
Table 2.1: Functional Requirements and features
Computer design involves a variety of technologies, from
compilers and operating systems to logic design and packaging. Initially, the
term computer design referred only to instruction set design; the other
stages were known as implementation. In reality, however, the
job of a computer designer is much more than just instruction set design,
and the technical obstacles in the other stages are much more challenging
than those faced in instruction set design. Now let us quickly review the
instruction set architecture.
Instruction Set Architecture (ISA)
The Instruction Set Architecture (ISA) is the part of the processor that is visible
to the programmer or compiler writer. The ISA acts as the boundary between
software and hardware. It includes the native data types, instructions,
registers, addressing modes, memory architecture, interrupt and exception
handling, and external I/O. ISA can be classified into the following categories:
1. Complex instruction set computer (CISC) - It provides a variety of
specialised instructions, many of which may not be frequently used in
practical programs.
2. Reduced instruction set computer (RISC) - This implements only the
instructions that are commonly used in programs and thus makes the
processor simpler. The less common operations are executed as
subroutines, where the extra processor execution time is offset by their
infrequent use.
3. Very long instruction word (VLIW) - In this, the processor receives many
instructions encoded and retrieved in one instruction word.
Figure 2.3 shows the Instruction Set Architecture.
Figure 2.3: Instruction Set Architecture
Now, we will discuss the low-level implementation of the 80x86 instruction set.
Computers cannot directly execute high-level language constructs like the ones
found in C. Rather, they execute a relatively small set of machine
instructions, such as addition, subtraction, Boolean operations, and data
transfers. Engineers therefore encode these instructions in a
numeric format (suitable for storage in memory). The structure of the ISA is
given below:
1. Class of ISA: The operands are registers or memory locations, and
nearly all ISAs are now categorised as general-purpose register
architectures. The 80x86 has 16 general-purpose registers and 16 registers
for floating-point data. The two popular versions of this class are
register-memory ISAs, such as the 80x86, which can access memory as part of
many instructions, and load-store ISAs, which can access memory only with
load or store instructions.
Figure 2.4 shows the structure of a programming model consisting of General
Purpose Registers and Memory.
Figure 2.4: Programming Model: General-Purpose Registers (GPRs) and
Memory
2. Memory Addressing: Virtually all desktop and server computers, including
the 80x86, use byte addressing to access memory operands. Some designs
require the objects to be aligned. The 80x86 does not require alignment, but
accesses are generally faster if operands are aligned.
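Alignment can be checked with simple arithmetic. An access to an object of s bytes at byte address A is commonly said to be aligned if A mod s = 0. The Python sketch below applies this rule; the function name is_aligned and the sample addresses are hypothetical.

```python
def is_aligned(address, size):
    # An access of `size` bytes is aligned when the byte address is an
    # exact multiple of the access size.
    return address % size == 0


print(is_aligned(0x1004, 4))  # True:  0x1004 is a multiple of 4
print(is_aligned(0x1006, 4))  # False: 0x1006 is not a multiple of 4
print(is_aligned(0x1006, 2))  # True:  a 2-byte access only needs an even address
```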
3. Addressing Mode: Every instruction of a computer specifies an operation on
certain data. There are numerous ways of specifying the address of the data
to be operated on. These different ways of specifying data are called
addressing modes. In addition to specifying registers and constant operands,
addressing modes specify how to calculate the effective memory
address of an operand by using information held in registers.
Manipal University Jaipur B1648 Page No. 36
Computer Architecture Unit 1
The 80x86 supports some addressing modes for code or data:
i) Absolute/direct address
+------+-----+---------+
| load | reg | address |
+------+-----+---------+
(Effective address = address as given in instruction)
It needs large space in an instruction for long address. It is generally
accessible on CISC machines that have variable-length instructions.
ii) Indexed absolute address
+------+-----+-------+---------+
| load | reg | index | address |
+------+-----+-------+---------+
(Effective address = address + contents of specified index register)
Even this needs large space in an instruction for large address. The address
is the beginning of an array and the particular array element needed could be
selected by the index.
iii) Base plus index plus offset
+------+-----+------+-------+--------+
| load | reg | base | index | offset |
+------+-----+------+-------+--------+
(Effective address = offset + contents of specified base register
+ contents of specified index register)
The beginning address of the array could be stored in the base register, the
index will choose the particular record needed and the offset can choose the
field inside that record.
iv) Scaled
+------+-----+------+-------+
| load | reg | base | index |
+------+-----+------+-------+
(Effective address = contents of specified base register
+ scaled contents of specified index register)
The beginning of an array or vector is stored in the base register and the
index contains the number of the particular array element needed; the index
is scaled by the size of an element.
v) Register Indirect
+------+-----+------+
| load | reg | base |
+------+-----+------+
(Effective address = contents of base register)
This is a distinctive addressing mode. Many computers just use base plus
offset with an offset value of 0.
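The effective-address rules of the five modes can be collected into one small Python function. This is a sketch for illustration only: the function effective_address, the mode names, the register names r1 and r2 and their contents are all hypothetical, and the code does not model the actual 80x86 encoding.

```python
def effective_address(mode, regs, address=0, offset=0,
                      base=None, index=None, scale=1):
    """Compute the effective address for one of the modes described above."""
    if mode == "absolute":
        return address
    if mode == "indexed_absolute":
        return address + regs[index]
    if mode == "base_index_offset":
        return offset + regs[base] + regs[index]
    if mode == "scaled":
        # The index is multiplied by the element size before it is added.
        return regs[base] + scale * regs[index]
    if mode == "register_indirect":
        return regs[base]
    raise ValueError("unknown addressing mode: " + mode)


regs = {"r1": 0x1000, "r2": 3}  # r1 holds a base address, r2 an index

print(hex(effective_address("absolute", regs, address=0x2000)))
print(hex(effective_address("indexed_absolute", regs,
                            address=0x2000, index="r2")))
print(hex(effective_address("base_index_offset", regs,
                            offset=8, base="r1", index="r2")))
print(hex(effective_address("scaled", regs,
                            base="r1", index="r2", scale=4)))
print(hex(effective_address("register_indirect", regs, base="r1")))
```

With a scale of 4, the scaled mode selects element number 3 of an array of 4-byte elements that starts at the address held in r1.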
4. Types and sizes of operands: Machine instructions operate on
operands of several types. Some types supported by ISAs include
character (e.g., 8-bit ASCII or 16-bit Unicode), signed and unsigned
integers, and single- and double-precision floating-point numbers. ISAs
typically support various sizes for integer numbers.
For example, a 32-bit architecture includes arithmetic instructions which
operate on 8-bit integers, 16-bit integers (short integers), and 32-bit
integers. Signed integers are represented using two’s complement
binary representation.
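The two's complement representation can be illustrated with a short Python sketch. The function names to_twos_complement and from_twos_complement are hypothetical helpers written for this example.

```python
def to_twos_complement(value, bits):
    """Return the bit pattern (as a non-negative integer) of a signed value."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError("value does not fit in the given number of bits")
    # Masking keeps the low-order `bits` bits of the value.
    return value & ((1 << bits) - 1)


def from_twos_complement(pattern, bits):
    """Interpret a bit pattern as a signed two's complement integer."""
    sign_bit = 1 << (bits - 1)
    # If the sign bit is set, the pattern stands for pattern - 2**bits.
    return pattern - (1 << bits) if pattern & sign_bit else pattern


print(hex(to_twos_complement(-1, 8)))     # 0xff
print(hex(to_twos_complement(-128, 8)))   # 0x80
print(from_twos_complement(0xFF, 8))      # -1
print(from_twos_complement(0x7F, 8))      # 127
```

An 8-bit signed integer therefore covers the range -128 to 127, a 16-bit integer -32768 to 32767, and so on.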
5. Instructions: Machine instructions are of two types: data processing
instructions and control flow instructions. Data processing instructions
manipulate operands in memory locations and registers. These support
arithmetic operations, logic operations, shift operations and data transfer
operations. Control flow instructions allow the flow of execution to change
to an instruction other than the next one in the sequence.
6. Encoding an ISA - Several factors, such as the architecture type,
the number of general-purpose registers, the number and type of
instructions, the number of operands, etc., affect the encoding. The
variable-length technique allows every operation to work with almost
all the addressing modes supported by the ISA. In fixed-length instruction
encoding, the addressing mode is combined with the opcode.
A third technique, known as hybrid, is a combination of both. It
reduces the variability in instruction encoding, but permits multiple
instruction lengths.
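A fixed-length encoding can be sketched in Python as follows. The 32-bit layout used here (an 8-bit opcode, two 4-bit register fields and a 16-bit immediate) is a hypothetical format invented for this example and does not correspond to the 80x86 or to any other real ISA.

```python
def encode(opcode, rd, rs, imm):
    """Pack the four fields into one 32-bit instruction word."""
    if not (0 <= opcode < 256 and 0 <= rd < 16 and 0 <= rs < 16
            and 0 <= imm < 65536):
        raise ValueError("field value out of range")
    # Layout: opcode in bits 31-24, rd in 23-20, rs in 19-16, imm in 15-0.
    return (opcode << 24) | (rd << 20) | (rs << 16) | imm


def decode(word):
    """Extract (opcode, rd, rs, imm) from a 32-bit instruction word."""
    return ((word >> 24) & 0xFF, (word >> 20) & 0xF,
            (word >> 16) & 0xF, word & 0xFFFF)


word = encode(opcode=0x01, rd=2, rs=3, imm=100)
print(hex(word))     # 0x1230064
print(decode(word))  # (1, 2, 3, 100)
```

Because every field sits at a fixed position, decoding is simple and fast. A variable-length encoding gives up this simplicity in exchange for smaller programs.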
Implementation: Implementation of the instruction set architecture
comprises two components: organisation and hardware. The high-level
attributes of a computer’s architecture, such as the memory system, the
memory interconnect, and the design of the internal processor or CPU, make up
the organisation. The CPU, i.e., the central processing unit, is
where arithmetic, logic, branching, and data transfer are implemented.
Hardware refers to the specifics of a computer, including the detailed
logic design and the packaging technology of the computer. Frequently, a line
of computers contains computers with different detailed hardware
implementations, but with identical instruction set architectures
and nearly identical organisations. Figure 2.5 shows the components of
architecture.
Figure 2.5: Components of Architecture
In this unit, the word architecture covers all three aspects of computer
design - instruction set architecture, organisation, and hardware. Thus,
computer designers must design a computer keeping in mind the functional
requirements as well as price, power and performance goals. The functional
requirements also have to be determined by the computer architect, which can
be a tedious job. The requirements are determined after reviewing
market-specific features. Also, computer designers must be aware of the
technology trends in the market and of how computers are used, to avoid
unnecessary costs and failure of the architecture. Thus, we will study
some important technology trends in the following section.
Self Assessment Questions
5. The world’s first designer was __________________
6. _________________ acts as the boundary between software and
hardware.
7. The 80x86 has __________________ general-purpose registers.
8. CISC stands for __________________ .
Activity 1:
Visit any two organisations. Now make a list of the different types of
computers they are using - desktop, servers and embedded computers - and
compare them with one another. What proportion of each type of computing are
they using?
2.4 Technology Trends
Technology trends need to be studied on a regular basis in order to cope with
the dynamic and rapidly changing market. The instruction set should be
designed so that it can adapt to rapid changes in technology. The designer
should plan for the technology changes that would lead to the success of the
computer.
Four rapidly changing technologies are essential to modern
implementations. These are as follows:
1. Integrated circuit logic technology: Integrated circuits or microchips are
electronic circuits manufactured by forming interconnections between
semiconductor devices. Changes in this technology occur very rapidly.
Some examples of its effect are the evolution of mobile phones, digital
microwave ovens, etc.
2. Semiconductor DRAM (dynamic random-access memory): DRAM
uses a capacitor to store each bit of data, and the level of charge on each
capacitor determines whether that bit is a logical 1 or 0. However, these
capacitors do not hold their charge indefinitely, and therefore the data
needs to be refreshed periodically. It is the semiconductor memory fitted
in personal computers and workstations. DRAM capacity increases by about
40% every year.
3. Magnetic disk technology: Magnetic disks include floppy disks, hard disks,
etc. The disk surface facing the drive head is coated with magnetic
particles grouped into microscopic areas called domains. Each domain acts
like a tiny magnet with north and south poles. Disk density is currently
increasing by about 30% every year.
4. Network technology: A network is a collection of computers and their
hardware components connected together through communication
channels. Communication protocols govern the communication in the
network and provide the basis for network programming. Network
performance depends both on the switches and on the transmission
systems.
These rapidly changing technologies shape the design of a computer that may
have a lifetime of five or more years. By studying these technology trends,
computer designers have been able to reduce costs at roughly the rate at
which the technology changes.
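To get a feel for what such yearly growth rates mean over the lifetime of a
design, the following is a minimal sketch of compound growth. It assumes the
40% and 30% figures quoted above stay constant from year to year, which is a
simplification; the function names are chosen here only for illustration.

```python
import math

def growth_factor(annual_rate, years):
    """Relative size after `years` of compound growth (0.40 means 40% a year)."""
    return (1.0 + annual_rate) ** years

def doubling_time(annual_rate):
    """Years needed to double at a constant compound annual rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)

# Rates quoted in the list above; the five-year horizon matches the
# product lifetime mentioned in the text. Constant rates are an assumption.
for name, rate in [("DRAM", 0.40), ("Magnetic disk", 0.30)]:
    print(f"{name}: grows {growth_factor(rate, 5):.2f} times in 5 years, "
          f"doubles about every {doubling_time(rate):.1f} years")
```

Under these assumptions, a technology growing at 40% a year is more than five
times larger by the end of a five-year product life, which is why the designer
has to plan for the technology of the future and not only for that of today.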
Self Assessment Questions
9. The designer should never plan for the technology changes that would
lead to the success of the computer. (True/False)
10. ______________ are electronic circuits manufactured by forming
interconnections between semiconductor devices.
2.5 Quantitative Principles in Computer Design
Now that we have understood the changing face of computing and the tasks
of a computer designer, we can explore principles that are useful in the design
and analysis of computers. Let us study some important observations and
equations. Figure 2.6 depicts the quantitative principles of computer design.
Figure 2.6: Quantitative Principles of Computer Design
2.5.1 Advantages of parallelism
Performance of the computer is improved by taking advantage of parallelism,
which can be exploited at several levels.
Firstly, parallelism can be used at the system level. Using multiple
processors and multiple disks improves performance on a typical server
benchmark, because the workload can be spread across them. The number of
processors and disks can also be expanded. This is an important feature of
servers and is known as scalability. Secondly, parallelism among instructions
can be exploited through pipelining, at the level of an individual processor.
The main objective of pipelining is to overlap the execution of instructions
so as to cut the total time taken to complete an instruction sequence.
Instructions can be executed in parallel, completely or partially, because
not every instruction depends on its immediate predecessor. This is the key
factor that permits pipelining to work. Thirdly, parallelism can also be
exploited at the level of detailed digital design. Set-associative caches use
multiple memory banks that are usually searched in parallel to find a desired
item. Modern ALUs use carry-lookahead, which uses parallelism to speed up
the computation of sums from linear to logarithmic in the number of bits per
operand.
2.5.2 Principle of locality
The principle of locality is an important program property: programs tend to
reuse the data and instructions they have used recently. It allows us to
predict, with reasonable accuracy, the data and instructions that a program
will require in the near future. This prediction is based on the program's
accesses in the recent past.
There are two kinds of locality: temporal locality, which states that
recently accessed items are likely to be accessed again in the near future,
and spatial locality, which states that items whose addresses are near one
another tend to be referenced close together in time. Both kinds of locality
are exploited by a component called cache memory, which is located between
the CPU (or processor) and the main memory as shown in figure 2.7.
Figure 2.7: Cache Memory Position
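The effect of the two kinds of locality on a cache can be illustrated with a
small simulation. The sketch below models a direct-mapped cache; the cache
geometry (4 lines of 8 bytes) and the three access patterns are assumptions
chosen only to keep the example small, and real caches are considerably more
elaborate.

```python
def count_hits(addresses, num_lines=4, block_size=8):
    """Count hits in a direct-mapped cache for a sequence of byte addresses."""
    lines = [None] * num_lines          # block number currently held by each line
    hits = 0
    for addr in addresses:
        block = addr // block_size      # neighbouring addresses share a block
        index = block % num_lines       # direct-mapped placement of the block
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block        # miss: bring the whole block in
    return hits

sequential = list(range(64))             # spatial locality: adjacent addresses
repeated = [100] * 10                    # temporal locality: one address reused
scattered = [i * 64 for i in range(64)]  # no locality: a new block every time

print(count_hits(sequential))  # 56 hits: only one miss per 8-byte block
print(count_hits(repeated))    # 9 hits: only the first access misses
print(count_hits(scattered))   # 0 hits
```

The sequential and repeated patterns hit almost every time, whereas the
scattered pattern never benefits from the cache. This is the behaviour that
the principle of locality predicts.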
2.5.3 Focus on the common case
This is the most important and widely used principle of computer design. It
states that, when making a design trade-off, the frequent case should be
favoured over the infrequent case. The principle guides how resources are
spent: because the common case occurs so often, improving it has a greater
impact. Focussing on the common case will work positively both for power and
resource allocation, thus leading to advancement. For example, the
instruction fetch and decode unit of a processor may be used much more often
than a multiplier, so it should be optimised first. The principle applies to
dependability as well.
The common case is often simpler and can be made faster than the rare case.
For example, overflow rarely occurs when adding two numbers in the
processor. Performance can therefore be improved by optimising the more
common case of no overflow. To apply this principle, all we need to do is
analyse what the common case is and how much performance can be gained by
making it faster. To quantify this, we will study Amdahl’s Law below.
Amdahl’s Law
This law helps compute the performance gain that can be obtained by
improving some portion of the computer. Amdahl’s law states that “the
performance improvement to be gained from using some faster mode of
execution is limited by the fraction of the time the faster mode can be used.”
(Hennessy and Patterson)
Figure 2.8 shows the predicted speed using Amdahl’s law in a graphic form.
Figure 2.8: Predicted Speed using Amdahl’s Law
The law defines the Speedup ratio that can be achieved by improving any
element of the computer. Speedup is:
Performance for entire task using the enhancement when possible
Speedup = ------------------------------------------------------------------------------------
Performance for entire task without using the enhancement
Or,
Execution time for entire task without using the enhancement
Speedup = -------------------------------------------------------------------------------------
Execution time for entire task using the enhancement when possible
Amdahl’s law helps us to find the speedup from some enhancement. This
depends on the following two factors:
1. The fraction of the computation time in the original computer that can be
converted to take advantage of the enhancement - For example, if 20
seconds of the execution time of a program that takes 60 seconds in total
can use an enhancement, the fraction is 20/60. This value, which we will
call Fraction enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how
much faster the task would run if the enhanced mode were used for the
entire program - This value is the time of the original mode over the time
of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a
portion of the program, while it is 5 seconds in the original mode, the
improvement is 5/2. We will call this value, which is always greater than 1,
Speedup enhanced.
The execution time using the original computer with the enhanced mode will
be the time spent using the unenhanced portion of the computer plus the time
spent using the enhancement:
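\[
\text{Execution time}_{\text{new}} = \text{Execution time}_{\text{old}} \times
\left[ (1 - \text{Fraction}_{\text{enhanced}}) +
\frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]
\]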
The overall speedup is the ratio of the execution times:
\[
\text{Speedup}_{\text{overall}} =
\frac{\text{Execution time}_{\text{old}}}{\text{Execution time}_{\text{new}}} =
\frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) +
\dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}
\]
In practice, it is often difficult to measure the new and the old execution
times directly.
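As a worked illustration of Amdahl's law, the sketch below combines the two
sample figures used in the list of factors above: a fraction of 20/60 and an
enhanced-mode speedup of 5/2. Treating these two figures as belonging to the
same program is an assumption made here purely for the sake of the example.

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law: overall speedup when only a fraction of the time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Figures borrowed from the two factors above (combining them is an assumption).
old_time = 60.0            # seconds for the whole program
fraction = 20.0 / 60.0     # 20 of the 60 seconds can use the enhancement
speedup_enh = 5.0 / 2.0    # the enhanced mode is 2.5 times faster

speedup = overall_speedup(fraction, speedup_enh)
new_time = old_time / speedup
print(f"overall speedup = {speedup:.2f}, new time = {new_time:.0f} s")
# overall speedup = 1.25, new time = 48 s
```

Although the enhanced mode is 2.5 times faster, the program as a whole becomes
only 1.25 times faster, because two thirds of the execution time is not
affected by the enhancement. Even an infinitely fast enhancement could not
push the overall speedup beyond 1/(1 - 1/3) = 1.5 in this example.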
Self Assessment Questions
11. Performance of the computer is improved by __________________ .
12. The ability of the servers to expand its processors and disks is known as
____________________ .
13. The main objective of _____________________ is to overlap the
execution of instructions to cut the total time taken to complete the
instruction series.
14. __________________ declares that the item referred in the recent
times has potential to be accessed in the near future.
15. __________________ states that the items nearby the location of
the recently used items may also be referred close together in time.
2.6 Power Consumption
Power consumption is another important design criterion that affects the
design of modern computers. Power efficiency can normally be traded off
against performance or cost. Recent processor designs put more emphasis on
power efficiency. Also, in the emerging world of embedded computers, power
efficiency has become a major concern of computer designers.
It is now widely accepted that power is a primary concern for modern
microprocessors; indeed, it has become a constraint in most cases. Total
power is the sum of static and dynamic power. Static power is proportional to
the number of transistors, whereas dynamic power is generally proportional to
the product of the number of switching transistors and the switching rate.
Although static power is a concern at the design stage, dynamic power is
usually the dominant energy consumer during operation.
Technologists have estimated the rough percentage of power used by each
component of the computer. This is represented in figure 2.9, which shows
CPUs drawing only about 5 percent of a PC's total power.
Figure 2.9: Pie Chart Showing Power Consumption Distribution
Most techniques used for improved performance, viz. multiprogramming and
multithreading, will of course increase the energy consumption. But the
question here is: does a technique increase power consumption at a higher
rate than it increases performance?
Unfortunately, the techniques currently used to improve performance are
inefficient from the point of view of power consumption. This is due to the
following two characteristics:
1. Issuing multiple instructions incurs some overhead in logic that grows
faster than the issue rate grows. Thus, without voltage reductions to
decrease power, it is likely to lead to lower performance per watt.
2. There is a growing gap between high (peak) issue rates and sustained
performance. The number of switching transistors is proportional to the
peak issue rate, whereas the performance is proportional to the sustained
rate. The growing gap between the two therefore leads to increased
energy consumption per unit of performance. This gap arises from many
issues.
For example, if we want to sustain four instructions per clock, we must
fetch more, issue more, and initiate execution on more than four
instructions. This widens the same gap, and such techniques cannot
improve long-term power efficiency.
Self Assessment Questions
16. _________________ can normally be traded off against performance or
cost.
17. _________________ is proportional to the product of the number of
switching transistors and the switching rate.
18. The number of switching transistors is proportional to __________
and the performance is proportional to _____________ .
2.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• There are two types of execution - concurrent and parallel.
• Computer design is an activity that converts the architecture design of the
computer into a programming structure implementation of a particular
organisation.
• Computer technology has changed drastically in the 60 years since
the first general-purpose computer was invented.
• Desktop computers have the largest market in terms of costs. They range
from low-end systems to very high-end heavily configured computer
systems.
• The world’s first designer was Charles Babbage, who is considered the
father of computers.
• Computer designer needs to determine the attributes that are necessary
for a new computer, then design a computer to maximise the performance.
• The Instruction Set Architecture (ISA) is the part of the processor that is
visible to the programmer or compiler writer.
• Performance of the computer is improved by taking advantage of
parallelism.
• Focussing on the common case will work positively both for power and
resource allocation, thus, leading to advancement.
2.8 Glossary
• CISC: Complex instruction set computer
• Computer designer: A person who designs CPUs or computers that are
actually built, come into considerable use and influence the further
development of computer designs.
• Desktop computers: These are in the form of personal computers (also
known as PCs) and workstations.
• Embedded computer: A computer system designed to perform a
particular function or task.
• Instruction Set Architecture (ISA): A part of the processor that is
visible to the programmer or compiler writer.
• Integrated circuits: An electronic circuit manufactured by the patterned
diffusion of trace elements into the surface of a thin substrate of
semiconductor material.
• RISC: Reduced instruction set computer
• Supercomputers: These are high-performance computers for scientific
computing.
• VLIW: Very long instruction word
2.9 Terminal Questions
1. Describe the three types of computer markets.
2. Explain the characteristics of embedded computers.
3. Who is a computer designer? Explain the job of a computer designer.
4. What are the components of Instruction Set architecture? Discuss in
brief.
5. Explain the technology trends in computer design.
6. Discuss briefly the quantitative principles in computer design.
7. Elucidate Amdahl’s Law.
2.10 Answers
Self Assessment Questions
1. Microprocessor
2. Main-frame
3. True
4. Minimum cost
5. Charles Babbage
6. ISA
7. 16
8. Complex instruction set computer
9. False
10. Integrated circuits or microchips
11. Adopting parallelism
12. Scalability
13. Pipelining
14. Temporal Locality
15. Spatial Locality
16. Power efficiency
17. Dynamic Power
18. High issue rates, sustained performance
Terminal Questions
1. Desktop computers have the largest market in terms of costs. They range
from low-end systems to very high-end heavily configured computer
systems. Refer Section 2.2.
2. An embedded system is a single-purpose computer embedded in a device
to control some particular function of that bigger device. The performance
requirement of an embedded application is real-time execution. Refer
Section 2.2.
3. Computer Designer is a person who has designed CPUs or computers that
were actually built and came into considerable use and influenced the
further development of computer designs. Refer Section 2.3.
4. Architecture covers all three aspects of computer design - instruction set
architecture, organisation, and hardware. Refer Section 2.3.
5. Technology trends need to be studied on a regular basis in order to cope
with the dynamic and rapidly changing market. The instruction set should
be designed so that it can adapt to rapid changes in technology. Refer
Section 2.4.
6. Quantitative principles in computer design are: Take Advantage of
Parallelism, Principle of Locality, Focus on the Common Case and
Amdahl’s Law. Refer Section 2.5.
7. Amdahl’s law states that the performance improvement to be gained from
using some faster mode of execution is limited by the fraction of the time
the faster mode can be used. Refer Section 2.5.
References:
• David Salomon (2008), Computer Organisation, NCC Blackwell.
• John L. Hennessy and David A. Patterson, Computer Architecture: A
Quantitative Approach (4th Ed.), Morgan Kaufmann Publishers.
• Joseph D. Dumas II, Computer Architecture, CRC Press.
• Nicholas P. Carter, Schaum’s Outline of Computer Architecture,
McGraw-Hill Professional.
E-references:
• http://publib.boulder.ibm.com/infocenter/zos/basics/topic/com.ibm.zos.zconcepts/zconcepts_75.html/
Retrieved on 30-03-2012
• http://www.ibm.com/search/csass/search?sn=mh&q=multiprocessing%20system&lang=en&cc=us&/
Retrieved on 31-03-2012
Unit 3 Instruction Set Principles
Structure:
3.1 Introduction
Objectives
3.2 Classifying instruction set architecture
Zero-address instructions
One-address instructions
Two-address instructions
Three-address instructions
3.3 Memory Addressing
3.4 Address Modes for Signal Processing
3.5 Operations in the instruction sets
Fetch & decode
Execution cycle (instruction execution)
3.6 Instructions for Control Flow
3.7 MIPS Architecture
3.8 Summary
3.9 Glossary
3.10 Terminal Questions
3.11 Answers
3.1 Introduction
In the previous unit, you studied the fundamentals of computer
architecture and design. Now we will study the instruction set and its
principles in detail.
The instruction set, or instruction set architecture (ISA), is the set of
basic instructions that a processor understands. In other words, it is the
part of the computer architecture related to programming, including the
native data types, instructions, registers, addressing modes, memory
architecture, interrupt and exception handling, and external I/O. A program
consists of a number of instructions that have to be executed in a particular
sequence, so this unit describes instructions and their sequencing. In this
unit, you will study the fundamentals involved in instruction set
architecture and design. We first classify instruction set architectures and
then cover memory locations and addresses, memory addressing, the abstract
model of the main memory, the operations in the instruction sets, and
instructions for control flow. We will also discuss the MIPS (Microprocessor
without Interlocked Pipeline Stages) architecture.
Objectives:
After studying this unit, you should be able to:
• classify instruction set architecture
• identify memory addressing
• explain address modes for signal processing
• list the various operations in the instruction sets
• recognise instructions for control flow
• describe MIPS architecture along with its characteristics
3.2 Classifying Instruction Set Architecture
The reference manuals provided with a computer system contain a description
of its physical and logical structure. They describe the internal
construction of the CPU, including the processor registers available and
their logical capabilities. The manuals list all the hardware-implemented
instructions, specify their binary code format, and give a precise definition
of each instruction. The control unit of the CPU decodes each instruction and
generates the control functions required to process it.
The instruction format is generally represented in a rectangular box denoting
the bits of the instruction as they appear in memory words or in a control
register. The bits of the instruction are separated into groups called fields.
The most common fields found in instruction formats are:
1. An operation code field that specifies the operation to be performed.
2. An address field that designates a memory address or a processor
register.
3. A mode field that specifies the way the operand or the effective address is
determined.
Apart from these fields some other special fields can also be employed, for
example a field that gives the number of shifts in a shift-type instruction.
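To make the idea of fields concrete, the following sketch packs and unpacks
the three common fields of an instruction word. The 16-bit word and the field
widths (4-bit operation code, 2-bit mode, 10-bit address) are hypothetical
values chosen only for illustration; they do not describe any particular
processor.

```python
# Hypothetical 16-bit layout (an assumption for illustration only):
# | opcode: 4 bits | mode: 2 bits | address: 10 bits |
OPCODE_BITS, MODE_BITS, ADDRESS_BITS = 4, 2, 10

def encode(opcode, mode, address):
    """Pack the three fields into one instruction word."""
    return (opcode << (MODE_BITS + ADDRESS_BITS)) | (mode << ADDRESS_BITS) | address

def decode(word):
    """Split an instruction word into its (opcode, mode, address) fields."""
    address = word & ((1 << ADDRESS_BITS) - 1)
    mode = (word >> ADDRESS_BITS) & ((1 << MODE_BITS) - 1)
    opcode = (word >> (MODE_BITS + ADDRESS_BITS)) & ((1 << OPCODE_BITS) - 1)
    return opcode, mode, address

word = encode(opcode=0b0011, mode=0b01, address=0b0000000101)
print(f"{word:016b}")   # 0011010000000101
print(decode(word))     # (3, 1, 5)
```

The shifts and masks used here are roughly what a decoder does in hardware:
each field occupies a fixed group of bits, so it can be extracted without
looking at the other fields.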
The operation code field of an instruction is a group of bits that defines
one of various processor operations, such as add, subtract, complement, and
shift. The bits that define the mode field of the instruction code specify a
variety of alternatives for choosing the operands from the given address.
Computer instructions specify operations that are executed on data stored in
memory or in processor registers. Operands residing in memory are specified
by their memory address, while operands residing in processor registers are
specified by a register address.
A register address is a binary number of k bits that defines one of 2^k
registers in the CPU. Thus, a CPU with 16 processor registers R0 through R15
will have a register address field of four bits. The binary number 0101, for
example, will designate register R5. Instructions in computers can be of
different lengths and can contain a varying number of addresses. The
following are the different types of instruction formats:
3.2.1 Zero-address instructions
In zero-address machines, both operands are assumed to be stored at a
default location. These machines use the stack as the source of the input
operands, and the result goes back onto the stack. A stack is a LIFO
(last-in-first-out) data structure which is supported by all processors,
whether or not they are zero-address machines. LIFO implies that the last
item placed on the stack is the first item to be taken out of the stack.
All operations on this type of machine assume that the required input operands
are the top two values on the stack. The result of the operation is placed on
top of the stack. Table 3.1 gives some sample instructions for stack
machines. Notice that the first two instructions are not zero-address
instructions. These two are special instructions that use a single address and
are used to move data between memory and the stack.
Table 3.1: Sample Stack Machine Instructions
Instruction   Semantics
push addr     Places the value at address addr on top of the stack
              push([addr])
pop addr      Stores the top value on the stack at memory address addr
              M(addr) = pop
add           Adds the top two values on the stack and pushes the result
              onto the stack
              push(pop + pop)
sub           Subtracts the second top value from the top value of the stack
              and pushes the result onto the stack
              push(pop - pop)
mult          Multiplies the top two values in the stack and pushes the
              result onto the stack
              push(pop * pop)
All other instructions use the zero-address format. Now, we will see how a
stack machine evaluates an arithmetic expression. In these machines, the
statement:
A=B+C*D-E+F+A
is translated to the following code:
push E ; <E>
push C ; <C, E>
push D ; <D, C, E>
mult   ; <C*D, E>
push B ; <B, C*D, E>
add    ; <B+C*D, E>
sub    ; <B+C*D-E>
push F ; <F, B+D*C-E>
add    ; <F+B+D*C-E>
push A ; <A, F+B+D*C-E>
add    ; <A+F+B+D*C-E>
pop A  ; <>
On the right, we show the state of the stack after executing each instruction.
The top element of the stack is shown on the left. Notice that we pushed E
early because we need to subtract it from (B+C*D).
To implement stack machines efficiently, the top portion of the stack is kept
inside the processor. The number of entries held there is known as the stack
depth. The remaining stack is kept in memory. Thus, values that are within
the stack depth can be used without accessing the memory.
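The stack program for A=B+C*D-E+F+A can be traced with a small interpreter.
The sketch below follows the semantics listed in Table 3.1; the Python names
and the sample values placed in memory are assumptions made only for this
illustration.

```python
def run_stack_program(program, memory):
    """Execute a list of (instruction, address) pairs on a simple stack machine."""
    stack = []
    for op, addr in program:
        if op == "push":
            stack.append(memory[addr])       # push([addr])
        elif op == "pop":
            memory[addr] = stack.pop()       # M(addr) = pop
        elif op in ("add", "sub", "mult"):
            top = stack.pop()                # the first pop yields the top value
            second = stack.pop()
            if op == "add":
                stack.append(top + second)
            elif op == "sub":
                stack.append(top - second)   # top minus second, as in Table 3.1
            else:
                stack.append(top * second)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

program = [("push", "E"), ("push", "C"), ("push", "D"), ("mult", None),
           ("push", "B"), ("add", None), ("sub", None), ("push", "F"),
           ("add", None), ("push", "A"), ("add", None), ("pop", "A")]

memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}  # sample values (assumed)
run_stack_program(program, memory)
print(memory["A"])   # 16, i.e. 2 + 3*4 - 5 + 6 + 1
```

Notice that only push and pop name a memory address; the arithmetic
instructions carry no address at all, which is what makes them zero-address
instructions.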
3.2.2 One-address instructions
In early machines, memory was expensive and slow, so a special set of
registers was used to provide an input operand and to receive the result from
the ALU. For this reason, these registers are known as accumulators. In most
machines there is only one accumulator register. This type of design, called
an accumulator machine, makes sense when memory is expensive.
In accumulator machines, most operations are performed on the contents of
the accumulator and the operand supplied by the instruction. Therefore,
instructions for these machines need to specify the address of only a single
operand. A few sample accumulator machine instructions are shown in table 3.2.
Table 3.2: Sample Accumulator Machine Instructions
Instruction   Semantics
load addr     Copies the value at address addr into the accumulator
              accumulator = [addr]
store addr    Stores the value in the accumulator at the memory address addr
              M(addr) = accumulator
add addr      Adds the contents of the accumulator and value at address addr
              accumulator = accumulator + [addr]
sub addr      Subtracts the value at memory address addr from the contents
              of the accumulator
              accumulator = accumulator - [addr]
mult addr     Multiplies the contents of the accumulator and value at
              address addr
              accumulator = accumulator * [addr]
In these machines, the C statement:
A=B+C*D-E+F+A
is converted to the following code:
load C ; load C into the accumulator
mult D ; accumulator = C*D
add B ; accumulator = C*D+B
sub E ; accumulator = C*D+B-E
add F ; accumulator = C*D+B-E+F
add A ; accumulator = C*D+B-E+F+A
store A ; store the accumulator contents
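The accumulator program above can be checked in the same spirit with a short
sketch that follows the semantics of Table 3.2. The function name and the
sample memory contents are assumptions made for illustration only.

```python
def run_accumulator_program(program, memory):
    """Execute (instruction, address) pairs on a machine with one accumulator."""
    accumulator = 0
    for op, addr in program:
        if op == "load":
            accumulator = memory[addr]
        elif op == "store":
            memory[addr] = accumulator
        elif op == "add":
            accumulator += memory[addr]
        elif op == "sub":
            accumulator -= memory[addr]
        elif op == "mult":
            accumulator *= memory[addr]
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory

program = [("load", "C"), ("mult", "D"), ("add", "B"), ("sub", "E"),
           ("add", "F"), ("add", "A"), ("store", "A")]

memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}  # sample values (assumed)
run_accumulator_program(program, memory)
print(memory["A"])   # 16, i.e. 3*4 + 2 - 5 + 6 + 1
```

Every instruction names exactly one memory address; the second operand and
the destination are always the accumulator.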
3.2.3 Two-address instructions
Here each instruction has two address fields, each of which can specify
either a memory word or a processor register. Usually, we use dest (as in
table 3.3) to indicate that the address is used for the destination. This
address also supplies one of the source operands. The Pentium is an example
of a processor that uses two addresses. Table 3.3 gives some sample
instructions of a two-address machine. On these machines, the C statement
A=B+C*D-E+F+A
is converted to the following code:
load T,C ; T = C
mult T,D ; T = C*D
add T,B ; T= B+ C*D
sub T,E ; T= B+ C*D - E
add T,F ; T= B+ C*D - E+ F
add A,T ; A= B + C*D - E + F + A
Table 3.3: Sample Two-Address Machine Instructions
3.2.4 Three-address instructions
Here each instruction has three address fields. The general format of an
instruction is:
operation dest, op1, op2
where:
• operation - operation to be carried out
• dest - address to store the result
• op1, op2 - operands on which the instruction is to be executed
Three-address machines specify all three addresses explicitly. RISC
processors use three addresses. Table 3.4 gives some sample instructions of
a three-address machine.
In these machines, the C statement:
A=B+C*D-E+F+A
is converted to the following code:
mult T,C,D ; T = C*D
add T,T,B ; T = B + C*D
sub T,T,E ; T = B + C*D - E
add T,T,F ; T = B +C*D - E + F
add A,T,A ; A = B +C*D - E + F + A
Table 3.4: Sample Three-Address Machine Instructions
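The three-address program can be traced with an even shorter sketch, because
every instruction names its destination and both of its operands. Here the
temporary T is treated as an ordinary memory location; this, together with
the sample values, is an assumption made for illustration.

```python
import operator

# Map each mnemonic used in the listing to the corresponding arithmetic operation.
OPERATIONS = {"add": operator.add, "sub": operator.sub, "mult": operator.mul}

def run_three_address_program(program, memory):
    """Execute (operation, dest, op1, op2) tuples: dest = op1 <operation> op2."""
    for operation, dest, op1, op2 in program:
        memory[dest] = OPERATIONS[operation](memory[op1], memory[op2])
    return memory

program = [("mult", "T", "C", "D"), ("add", "T", "T", "B"),
           ("sub", "T", "T", "E"), ("add", "T", "T", "F"),
           ("add", "A", "T", "A")]

memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}  # sample values (assumed)
run_three_address_program(program, memory)
print(memory["A"])   # 16, i.e. 2 + 3*4 - 5 + 6 + 1
```

Five instructions are enough here, compared with seven on the accumulator
machine and twelve on the stack machine.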
The three-address format results in short programs when evaluating arithmetic
expressions. This is the biggest benefit of the three-address format. The
shortcoming is that the binary-coded instructions need too many bits to
specify three addresses. For example, the Cyber 170 is a commercial
computer which uses three-address instructions (see figure 3.1). The
instruction formats in the Cyber computer are restricted to either three
register-address fields or two register-address fields and one
memory-address field.
Figure 3.1: Cyber 170 CPU Architecture
A comparison
Each of the four types of instruction format examined above has its
advantages. The number of instructions that need to be executed increases as
the number of addresses is reduced. Now, let us assume that the number of
memory accesses is our performance metric; the lower this number, the better.
In the three-address machine, every instruction takes four memory accesses:
one access to read the instruction, two for getting the two input operands, and
a final one to write the result back in memory. As there are a total of five
instructions, this machine generates a total of 20 memory accesses.
As in the three-address machine, each arithmetic instruction in the
two-address machine takes four accesses. Remember that one address doubles
as both a source and the destination address. Thus, the five arithmetic
instructions require 20 memory accesses. Additionally, we have the load
instruction, which needs three accesses. This gives a total of 23 memory
accesses.
Reading or writing the accumulator does not require a memory access because
the accumulator is a register; thus, the count for the accumulator machine
is better. In this machine, each instruction requires only two accesses. As
there are seven instructions, this machine generates 14 memory accesses.
Finally, if we assume that the stack depth is large enough that all our push
and pop operations stay within it, the stack machine takes 19 accesses. This
number is obtained by noting that each push or pop instruction takes two
memory accesses, whereas the five arithmetic instructions take one memory
access each.
This comparison shows that the accumulator machine is the fastest. However,
the comparison is not entirely fair, because both the accumulator and the
stack machines assume the existence of registers, whereas the other two
machines were assumed to have none. Although all the addresses in a
three-address instruction could be register addresses, here we assumed that
there are no registers on the three- and two-address machines. If we assume
that these two machines have a single register to hold the temporary T, the
count for the three-address machine falls to 12 memory accesses. The
corresponding number for the two-address machine is 13 memory accesses.
This simple example shows that as we reduce the number of addresses, we
tend to increase the number of memory accesses.
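The counts quoted in this comparison can be reproduced mechanically. The
sketch below applies the same accounting rules as the text: one access to
fetch each instruction, plus one access for every operand or result that
lives in memory rather than in a register. The function names and the way
the programs are written down are assumptions made for this illustration.

```python
def three_address_accesses(program, registers=frozenset()):
    """Fetch (1) plus one access per operand or result that lives in memory."""
    total = 0
    for _operation, dest, op1, op2 in program:
        total += 1 + sum(1 for name in (dest, op1, op2) if name not in registers)
    return total

def two_address_accesses(program, registers=frozenset()):
    """As above, but dest doubles as a source for every instruction except load."""
    total = 0
    for operation, dest, src in program:
        total += 1                                      # instruction fetch
        total += 0 if src in registers else 1           # read the source operand
        if dest not in registers:
            total += 1 if operation == "load" else 2    # write dest (and read it first)
    return total

def accumulator_accesses(program):
    """Fetch plus one memory operand; the accumulator itself is a register."""
    return 2 * len(program)

def stack_accesses(program):
    """push/pop: fetch plus one memory access; arithmetic: fetch only."""
    return sum(2 if op in ("push", "pop") else 1 for op in program)

three_addr = [("mult", "T", "C", "D"), ("add", "T", "T", "B"),
              ("sub", "T", "T", "E"), ("add", "T", "T", "F"),
              ("add", "A", "T", "A")]
two_addr = [("load", "T", "C"), ("mult", "T", "D"), ("add", "T", "B"),
            ("sub", "T", "E"), ("add", "T", "F"), ("add", "A", "T")]
accumulator = ["load", "mult", "add", "sub", "add", "add", "store"]
stack = ["push", "push", "push", "mult", "push", "add", "sub",
         "push", "add", "push", "add", "pop"]

print(three_address_accesses(three_addr), three_address_accesses(three_addr, {"T"}))  # 20 12
print(two_address_accesses(two_addr), two_address_accesses(two_addr, {"T"}))          # 23 13
print(accumulator_accesses(accumulator), stack_accesses(stack))                        # 14 19
```

Passing {"T"} as the set of registers models the case in which the temporary
T is held in a single register, which is what brings the counts down to 12
and 13.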
Self Assessment Questions
1. The bits of the instruction are divided into groups called __________ .
2. _____________ use an implied accumulator (AC) register for all data
manipulation.
Activity 1:
Now that you know about the different instruction formats, find out which
instruction format your computer is based on and compare it with the other
formats.
3.3 Memory Addressing
Memory addressing refers to the logical structure of a computer’s
random-access memory (RAM). A cell is the general term used for the
smallest unit of memory that the CPU can read or write. The size of a cell in
most modern computers is 8 bits, i.e. 1 byte. Hardware-accessible units of
memory larger than one cell are called words.
At present, 32 bits (4 bytes) and 64 bits (8 bytes) are the most common word
sizes. Each memory cell has a unique integer address, and the CPU accesses
a cell by using its address. Addresses of logically adjacent cells differ
by 1. The address space of a processor is the range of possible integer
addresses, typically 0 to 2^n - 1 for n-bit addresses.
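The size of an address space follows directly from the number of address
bits. The following sketch computes the address range for two address widths
and lists the cells occupied by one word; the widths, the starting address
1000 and the function names are arbitrary choices made for illustration, and
byte-sized cells are assumed.

```python
def address_range(n_bits):
    """Lowest and highest cell address reachable with an n-bit address."""
    return 0, (1 << n_bits) - 1

def cells_of_word(start_address, word_bytes):
    """Addresses of the adjacent cells that make up one word (they differ by 1)."""
    return list(range(start_address, start_address + word_bytes))

for n in (16, 32):
    low, high = address_range(n)
    print(f"{n}-bit addresses: {low} to {high}")
# 16-bit addresses: 0 to 65535
# 32-bit addresses: 0 to 4294967295

print(cells_of_word(1000, 4))   # [1000, 1001, 1002, 1003]
```

Each additional address bit doubles the number of cells that the processor
can address.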
The operation to be performed is specified by the operation field of the
instruction. The operation is executed on data stored in computer registers
or memory words. How the operands are selected during program execution
depends on the addressing mode of the instruction. There are various ways of
specifying the address of the data to be operated on, and these are called
addressing modes. In other words, an addressing mode is the method used to
determine which part of memory a machine instruction refers to. RAM is
divided into a number of sections which are referenced individually through
the addressing modes. The CPU accesses that portion of memory and performs
the action specified by the machine instruction. The addressing modes
available depend on the type of computer architecture. Computers use
addressing mode techniques to accommodate one or both of the following
provisions:
1. To give programming versatility to the user by providing such facilities as
pointers to memory, counters for loop control, indexing of data, and
program relocation.
2. To reduce the number of bits in the addressing field of the instruction.
Self Assessment Questions
3. Selection of operands during program execution does not depend on the
addressing mode of the instruction. (True/ False)
4. Hardware-accessible units of memory larger than one cell are called
words. (True/ False)
3.4 Address Modes for Signal Processing
A distinct addressing mode field is required in instruction format for signal
processing as shown in figure 3.2. The operation code (opcode) specifies the
operation to be performed. The mode field is responsible for locating the
operands needed for the operation.
An address field in an instruction may or may not be present. When it is
present, it may designate either a memory address or a processor register.
Note that each address field may be associated with its own specific
addressing mode.
Opcode Mode Address
Figure 3.2: Instruction Format with Mode Field
The following are the different types of address modes:
Implied Mode: In this mode, the operands are specified implicitly in the
definition of the instruction. For example, the instruction "complement
accumulator" is an implied mode instruction because the operand in the
accumulator register is implied in the definition of the instruction. In
fact, all register reference instructions that use an accumulator are implied
mode instructions. Zero-address instructions are also implied mode
instructions.
For example, the operation:
<a: = b + c;>
can be done using the sequence
<load b; add c; store a;>
The destination (the accumulator) is implied in every "load" and "add"
instruction; the source (the accumulator) is implied in every "store" instruction.
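The load/add/store sequence above can be traced with a short Python sketch.
The function run_accumulator_program and the dictionary used as memory are
assumptions made for this illustration; they simply mimic a one-address
machine in which the accumulator is always the implied operand.

```python
def run_accumulator_program(program, memory):
    """Execute (opcode, variable) pairs on a machine with one accumulator."""
    acc = 0
    for opcode, name in program:
        if opcode == "load":       # accumulator is the implied destination
            acc = memory[name]
        elif opcode == "add":      # accumulator is implied source and destination
            acc += memory[name]
        elif opcode == "store":    # accumulator is the implied source
            memory[name] = acc
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


memory = {"a": 0, "b": 7, "c": 5}
program = [("load", "b"), ("add", "c"), ("store", "a")]
run_accumulator_program(program, memory)
print(memory["a"])   # a now holds b + c
```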
Immediate Mode: The operand in this mode is stated in the instruction itself,
i.e. an immediate mode instruction has an operand field rather than an address
field. The operand field contains the actual operand to be used in
conjunction with the operation specified in the instruction. For example:
MVI B, #20h
Means the value 20h (hexadecimal) is moved into register B
ADD R0, #50h; (Add 50h to the contents of R0)
Register Mode: In this mode, the operands are in registers that reside within
the CPU. The register required is chosen from a register field in the instruction.
For example:
Add R4, R3
Means add the values of R4 and R3 and store the result in R4
MOV AL, BL
Means move the content of register BL to register AL
Register Indirect Mode: In this mode, the instruction specifies a register in
the CPU that contains the address of the operand and not the operand itself.
Before a register indirect mode instruction is used, a previous instruction
must place the memory address of the operand in the processor register.
For example:
ADD R4, (R1)
MOV CX, [BX]
Means the content of the memory location whose address is held in register BX
will be moved to register CX
Auto-increment or Auto-decrement Mode: When the address held in a register is
used to step through data in memory, the register must be incremented or
decremented after every access. This can be done with a separate increment or
decrement instruction. Because the need arises so often, some computers
provide a special mode that increments or decrements the content of the
register automatically. For example:
Auto-increment:
Add R1, (R2)+
Auto-decrement:
Add R1,-(R2)
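The following Python sketch illustrates the two examples. It assumes the
common convention that (R2)+ uses the address first and increments afterwards,
while -(R2) decrements first and then uses the address; the text above does
not fix this order, so treat it as an assumption. The function names and the
register and memory dictionaries are invented for the illustration.

```python
def add_autoincrement(regs, memory, dst, ptr, step=1):
    """Add R1, (R2)+ : use the address in ptr, then increment ptr."""
    regs[dst] += memory[regs[ptr]]
    regs[ptr] += step


def add_autodecrement(regs, memory, dst, ptr, step=1):
    """Add R1, -(R2) : decrement ptr, then use the address in ptr."""
    regs[ptr] -= step
    regs[dst] += memory[regs[ptr]]


memory = {100: 10, 101: 20, 102: 30}
regs = {"R1": 0, "R2": 100}
for _ in range(3):                 # sum three consecutive memory words
    add_autoincrement(regs, memory, "R1", "R2")
print(regs)                        # R1 holds the sum, R2 points past the data
```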
Direct Addressing Mode: In this mode, the operand resides in memory and
its address is given directly by the address field of the instruction, so that
the effective address is equal to the address part of the instruction. For
example:
LD Acc, [5]
(Load the value in memory location 5 into the accumulator)
MOV A, 30h
This instruction will read the data out of the internal RAM address 30
(hexadecimal) and store it in the Accumulator.
Indirect Addressing Mode: Unlike the direct address mode, in this mode the
address field gives the address where the effective address is stored in
memory. Control fetches the instruction from memory and uses its address part
to access memory again to read the effective address. For example:
LD Acc, [5]
(Load into the accumulator the value stored in the memory location pointed to
by the operand, i.e. location 5 holds the address of the value rather than the
value itself)
A few addressing modes require that the address field of the instruction be
added to the content of a specific register in the CPU. The effective address
in these modes is obtained from the following equation:
Effective address = Address part of instruction + Content of CPU register
The CPU register used in the computation may be the program counter, an index
register or a base register.
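The difference between direct and indirect addressing, and the effective
address equation, can be sketched in Python. The helper names load_direct,
load_indirect and effective_address, and the sample memory contents, are
assumptions made for this illustration only.

```python
def load_direct(memory, address):
    """Direct mode: the address field is the effective address."""
    return memory[address]


def load_indirect(memory, address):
    """Indirect mode: the address field points to the effective address."""
    effective = memory[address]      # first memory access reads the address
    return memory[effective]         # second memory access reads the operand


def effective_address(address_part, register_content):
    """Modes that add a CPU register (PC, index or base) to the address part."""
    return address_part + register_content


memory = {5: 40, 40: 99}
print(load_direct(memory, 5))        # the value stored in location 5
print(load_indirect(memory, 5))      # the value stored in location 40
print(effective_address(1000, 24))   # address part 1000 plus register 24
```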
Relative Address Mode: This mode is often used with branch type instructions,
where the branch address is relative to the address of the instruction word.
In this mode, the content of the program counter is added to the address part
of the instruction to obtain the effective address, whose location in memory
is relative to the address of the following instruction. Since a relative
address can be specified with fewer bits than are required to designate the
entire memory address, it results in a shorter address field in the
instruction format. For example:
JMP +2
(Tells the processor to jump 2 bytes ahead)
MOV CL, [BX+4]
Moves into register CL the value stored 4 bytes beyond the address held in BX
(here the offset is relative to the BX register rather than to the program
counter)
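As a small worked illustration of the relative address calculation, the
Python sketch below adds a signed offset to the program counter and checks
whether an offset fits in a short address field. The function names, the
sample addresses and the 8-bit field width are assumptions chosen for the
example.

```python
def branch_target(pc_after_fetch: int, offset: int) -> int:
    """Effective address = program counter content + address part (offset)."""
    return pc_after_fetch + offset


def fits_in_signed_field(offset: int, bits: int) -> bool:
    """Check whether a signed offset can be encoded in a field of this width."""
    return -(1 << (bits - 1)) <= offset <= (1 << (bits - 1)) - 1


print(branch_target(826, 24))         # forward branch
print(branch_target(826, -6))         # backward branch
print(fits_in_signed_field(24, 8))    # a short field is enough for nearby targets
print(fits_in_signed_field(200, 8))   # too far for an 8-bit signed offset
```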
Indexed Addressing Mode: In this mode, the effective address is obtained by
adding the content of an index register to the address part of the
instruction. The index register is a special CPU register that contains an
index value, and it can be incremented after its value is used to access
memory. For example:
Add R3, (R1 + R2)
My_array DB '1', '2', '3', '4', '5';
MOV AL, My_array[3];
So AL holds the value '4', because the index counts from 0.
Base Register Addressing Mode: In this mode, the effective address is
obtained by adding the content of a base register to the address part of the
instruction. This is like the indexed addressing mode, except that the
register here is a base register and not an index register. For example:
MOV AX, [1000+BX]
This instruction adds the contents of BX with 1000 to produce the address of
the memory value to fetch. This instruction is useful for accessing elements of
arrays, records, and other data structures.
The difference between the base register and indexed addressing modes is
based on their usage rather than their computation. An index register is
assumed to hold an index number that is relative to the address part of the
instruction.
A base register is assumed to hold a base address, and the address field of
the instruction gives a displacement relative to this base address. The base
register addressing mode is handy for relocating programs from one segment of
memory to another, as required in multiprogramming systems.
When a program is moved, the address values of its instructions must reflect
the change of position. With a base register, the displacement values of the
instructions do not have to change. Only the value of the base register
requires updating to reflect the beginning of the new memory segment.
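The relocation argument can be illustrated with a short Python sketch. The
functions load_segment and fetch, the base addresses 1000 and 5000, and the
sample data are all invented for the illustration; the point is only that the
displacement stays the same while the base register value changes.

```python
def load_segment(memory, base, data):
    """Copy a block of data into memory starting at the given base address."""
    for offset, value in enumerate(data):
        memory[base + offset] = value


def fetch(memory, base_register, displacement):
    """Effective address = content of base register + displacement."""
    return memory[base_register + displacement]


program_data = [11, 22, 33]
memory = {}

load_segment(memory, 1000, program_data)
print(fetch(memory, 1000, 2))    # displacement 2 from base 1000

# Relocate the same data: only the base register value changes.
load_segment(memory, 5000, program_data)
print(fetch(memory, 5000, 2))    # same displacement, new base
```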
Self Assessment Questions
5. ____________ instructions are implied mode instructions.
6. Relative Address Mode is applied often with __________ instruction.
7. The ______________ is a special CPU register that contains an index
value.
3.5 Operations in the Instruction Sets
A program is a sequence of instructions located in the computer’s memory
unit. The program is executed by going through a cycle for each instruction.
Every instruction cycle is subdivided into a series of sub cycles or phases.
The parts of an instruction cycle are as follows:
1. Fetch an instruction from memory.
2. Decode the instruction.
3. Read the effective address from memory if the instruction has an indirect
address.
4. Execute the instruction.
After the completion of step 4, the control goes back to step 1 to fetch, decode
and execute the next instruction. This process continues indefinitely unless a
HALT instruction is encountered. In an improved instruction execution cycle,
we can introduce a third cycle known as the interrupt cycle. Figure 3.3
illustrates how the interrupt cycle fits into the overall cycle.
Figure 3.3: Instruction Cycle with Interrupts
3.5.1 Fetch & decode
To bring an instruction from main memory into the instruction register, the
CPU first places the value of the PC into the memory address register. The PC
always points to the next instruction to be executed. A memory read is
initiated, and the instruction at that location is copied into the Instruction
Register (IR). At the same time, the PC is incremented by one so that it
points to the next instruction to be executed. This completes the fetch cycle
for an instruction, as shown in figure 3.4.
Figure 3.4: Instructions Cycle
Decoding means interpretation of the instruction. Every instruction initiates
a sequence of steps to be executed by the CPU. Decoding means deciding which
course of action is to be taken for execution of the instruction and what
sequence of control signals must be generated for it. Before execution, the
operands, i.e. the necessary data, are fetched from memory.
3.5.2 Execution cycle (instruction execution)
As studied in the previous sections, the fundamental task performed by a
computer is the execution of a program. The program to be executed is a set
of instructions stored in memory. The task is completed when the instructions
of the program have been executed by the central processing unit (CPU), which
is mainly responsible for instruction execution. Now, let us examine several
typical registers, some of which are generally available in most machines.
These registers are:
Memory Address Register (MAR): It identifies the address of memory
location from where the data or instruction is to be accessed (for read
operation) or where the data is to be stored (for write operations).
Memory Buffer Register (MBR): It is a register that temporarily stores the
data that is to be written in the memory (for write operations) or the data
received from the memory (for read operation).
Program Counter (PC): The program counter holds the address of the
instruction that is to be executed after the instruction in progress.
Instruction Register (IR): Instructions are loaded into this register before
they are executed.
The model of instruction processing can be stated simply as a two-step
process. First, the CPU reads (fetches) instructions (codes) from memory one
at a time; second, it executes the operation specified by each instruction.
The instruction fetch is performed for every instruction. Instruction fetch
involves reading an instruction from the position where it is stored in
memory into the CPU. The execution of this instruction may entail various
operations, depending on the nature of the instruction.
An instruction cycle refers to the processing needed for a single instruction
(fetch and execution). The instruction cycle consists of the fetch cycle and
the execute cycle.
Program execution comes to an end if:
• The electric power supply is stopped or
• Any irrecoverable error occurs, or
• The program itself ends, for example when a HALT instruction is reached in
its sequence.
The fetched instruction is in the form of binary code and is loaded into an
instruction register (IR), in the CPU. The CPU interprets the instruction and
takes the required action.
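The fetch and execute cycles described in this section can be sketched as a
small Python simulation. The machine below is hypothetical: its four opcodes
(LOAD, ADD, STORE, HALT), the tuple encoding of instructions and the memory
layout are assumptions made for illustration, but the use of the PC, MAR, MBR
and IR follows the register descriptions above.

```python
def run(memory, start=0):
    """Run a tiny accumulator machine until HALT and return the accumulator."""
    pc, acc = start, 0
    while True:
        # Fetch: PC -> MAR, memory[MAR] -> MBR -> IR, then PC is incremented.
        mar = pc
        mbr = memory[mar]
        ir = mbr
        pc += 1
        # Decode: split the instruction into its opcode and address fields.
        opcode, address = ir
        # Execute: perform the operation named by the opcode.
        if opcode == "HALT":
            return acc
        mar = address
        if opcode == "LOAD":
            mbr = memory[mar]
            acc = mbr
        elif opcode == "ADD":
            mbr = memory[mar]
            acc += mbr
        elif opcode == "STORE":
            mbr = acc
            memory[mar] = mbr
        else:
            raise ValueError(f"unknown opcode: {opcode}")


memory = {
    0: ("LOAD", 10), 1: ("ADD", 11), 2: ("STORE", 12), 3: ("HALT", None),
    10: 7, 11: 5, 12: 0,
}
print(run(memory))   # accumulator when HALT is reached
print(memory[12])    # value written by the STORE instruction
```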
Self Assessment Questions
8. In an improved instruction execution cycle, we can introduce a third cycle
known as the ____________________________ .
9. Write the full form of:
a. MAR b. MBR
3.6 Instructions for Control Flow
Instructions are stored in memory locations. When a program is processed in
the CPU, its instructions are fetched from consecutive memory locations and
executed. Each time an instruction is fetched from memory, the program counter
is incremented so that it contains the address of the next instruction in
sequence. Once a data transfer or data manipulation instruction is executed,
control returns to the fetch cycle with the program counter containing the
address of the instruction next in sequence.
In case of a program control type of instruction, execution of instruction may
change the address value in the program counter and cause the flow of control
to be altered. The conditions for altering the content of the program counter
are specified by program control instruction, and the conditions for data-
processing operations are specified by data transfer and manipulation
instructions.
As a result of execution of a program control instruction, a change in value of
program counter occurs, which causes a break in the sequence of instruction
execution. This is an important feature in digital computers, as it provides
control over the flow of program execution and a capability for branching to
different program segments. Some typical program control instructions are
listed in table 3.5.
Table 3.5: Typical Program Control Instructions
Name                        Mnemonic
Branch                      BR
Jump                        JMP
Skip                        SKP
Call                        CALL
Return                      RET
Compare (by subtraction)    CMP
Test (by ANDing)            TST
The branch and jump instructions are identical in their use but sometimes they
are used to denote different addressing modes. The branch is usually a one-
address instruction. Branch and jump instructions may be conditional or
unconditional.
An unconditional branch instruction, as the name denotes, causes a branch to
the specified address without any conditions. In contrast, a conditional
branch instruction specifies a condition such as branch if positive or branch if
zero. If the condition is met, the program counter is loaded with the branch
address and the next instruction is taken from this address. If the condition is
not met, the program counter remains unaltered and the next instruction is
taken from the next location in sequence.
The skip instruction does not require an address field and is, therefore, a zero-
address instruction. A conditional skip instruction will skip the next instruction,
if the condition is met. This is achieved by incrementing the program counter
during the execute phase in addition to its being incremented during the fetch
phase. If the condition is not met, control proceeds with the next instruction in
sequence where the programmer inserts an unconditional branch instruction.
Thus, a skip-branch pair of instructions causes a branch if the condition is not
met, while a single conditional branch instruction causes a branch if the
condition is met.
The call and return instructions are used in conjunction with subroutines. The
compare instruction performs a subtraction between two operands, but the
result of the operation is not retained. However, certain status bit conditions
are set as a result of the operation. In a similar fashion, the test instruction
performs the logical AND of two operands and updates certain status bits
without retaining the result or changing the operands. The status bits of
interest are the carry bit, the sign bit, a zero indication, and an overflow
condition.
The four status bits are symbolised by C, S, Z, and V. The bits are set or
cleared as a result of an operation performed in the ALU.
1. Bit C (carry) is set to 1 if the end carry C8 is 1. It is cleared to 0 if
the carry is 0.
2. Bit S (sign) is set to 1 if the highest-order bit F7 is 1. It is set to 0 if
the bit is 0. S = 0 indicates a positive number and S = 1 indicates a
negative number.
3. Bit Z (zero) is set to 1 if the result of the ALU contains all 0’s. It is cleared
to 0 otherwise. In other words, Z = 1 if the result is zero and Z = 0 if the
result is not zero.
4. Bit V (overflow) is set to 1 if the exclusive-OR of the last two carries is equal
to 1, and cleared to 0 otherwise. This is the condition for an overflow when
negative numbers are in 2’s complement. For the 8-bit ALU, V = 1 if the
result is greater than +127 or less than -128.
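The four status bits can be computed for an 8-bit addition with the following
Python sketch. The function name add8_with_flags is invented for this
example; the rules it applies are the ones listed above, with the overflow
bit taken as the exclusive-OR of the carry into and the carry out of the
highest-order bit.

```python
def add8_with_flags(a: int, b: int):
    """Add two 8-bit values and return (result, status bits C, S, Z, V)."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    carry_out = (total >> 8) & 1                         # end carry C8
    carry_into_msb = (((a & 0x7F) + (b & 0x7F)) >> 7) & 1  # carry C7
    flags = {
        "C": carry_out,
        "S": (result >> 7) & 1,                # highest-order bit F7
        "Z": int(result == 0),                 # all bits of the result are 0
        "V": carry_out ^ carry_into_msb,       # XOR of the last two carries
    }
    return result, flags


print(add8_with_flags(100, 100))     # 200 exceeds +127, so V is set
print(add8_with_flags(0x80, 0x80))   # -128 + -128 is below -128, so V is set
print(add8_with_flags(1, 1))         # no status bit is set
```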
As you can see in figure 3.5, the status bits can be checked after an ALU
operation to determine certain relationships that exist between the values of A
and B. If bit V is set after the addition of two signed numbers, it indicates an
overflow condition.
Figure 3.5: Status Register Bits
If Z is set after an exclusive-OR operation, it indicates that A = B. This is
so because x ⊕ x = 0, so the exclusive-OR of two equal operands gives an
all-0’s result which sets the Z bit. A single bit in A can be checked to determine if it is
0 or 1 by masking all bits except the bit in question and then checking the Z
status bit.
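Both checks can be written as a short Python sketch. The helper names
z_after, equal_by_xor and bit_is_set are assumptions made for this
illustration, and 8-bit operands are assumed as in the preceding discussion.

```python
def z_after(result: int) -> int:
    """Return the Z status bit for an 8-bit ALU result."""
    return int((result & 0xFF) == 0)


def equal_by_xor(a: int, b: int) -> bool:
    """A = B exactly when the exclusive-OR of A and B sets the Z bit."""
    return z_after(a ^ b) == 1


def bit_is_set(a: int, position: int) -> bool:
    """Mask all bits except one, then inspect Z: Z = 0 means the bit is 1."""
    return z_after(a & (1 << position)) == 0


print(equal_by_xor(0x5A, 0x5A))    # equal operands
print(equal_by_xor(0x5A, 0x5B))    # operands differ in one bit
print(bit_is_set(0b00001000, 3))   # bit 3 is 1
print(bit_is_set(0b00001000, 2))   # bit 2 is 0
```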
Self Assessment Questions
10. When processed in the CPU, the instructions are fetched from
____________ locations and implemented.
11. The ______________ and _____________ are identical in their use
but sometimes they are used to denote different addressing modes.
3.7 MIPS Architecture
After considerable research on efficient processor organisation and VLSI
integration at Stanford University, the MIPS (Microprocessor without
Interlocked Pipeline Stages) architecture evolved. At about the same time, a
research group at Berkeley designed the RISC-I chip based on almost the
same ideas. The acronym RISC stands for "reduced instruction set computer",
although today it is sometimes interpreted as "regular instruction set
computer", and the RISC ideas are used in most current microprocessor
designs. To get a better idea about MIPS Architecture, look at figure 3.6.
Figure 3.6: MIPS Architecture
The principal features of the MIPS architecture are as follows:
• It has a five-stage execution pipeline: fetch, decode, execute, memory-
access, write-result.
• It has a regular instruction set where all instructions are 32-bit.
• There are three-operand arithmetical and logical instructions.
• It consists of 32 general-purpose registers of 32-bits each.
• There is no status register and there are no instruction side-effects.
• There are no complex instructions (like stack management, string
operations, etc.)
• It has optional coprocessors for system management and floating-point.
• Only the load and store instructions access memory.
• It has a flat address space of 4 Gbytes of main memory (2^32 bytes).
• The Memory-management unit (MMU) maps virtual to actual physical
addresses.
• An optimising C compiler replaces hand-written assembly code.
• The hardware does not check dependencies.
• Its software tool chain knows about the hardware and generates correct
code.
MIPS Corporation was founded in 1984. Its first product was the R2000
microprocessor, followed by the R2010 floating-point coprocessor. Early
computers used both chips together. The R3000 was the next MIPS processor. It
was a variant of the R2000 with an identical instruction set, but optimised for
low-cost embedded systems. This processor and its system-on-a-chip
implementations are still widely used today. Since then, several improved
variants of the original instruction set have been introduced:
• MIPS-I: This is the original 32-bit instruction set; and is still common.
• MIPS-II: It is an improved instruction set with dozens of new instructions.
• MIPS-III: It has a 64-bit instruction set used by the R4000 series.
• MIPS-IV: It is an upgrade of the MIPS III.
The most significant characteristic of the MIPS architecture is the regular
register set. It consists of the 32-bit wide program counter (PC), and a bank of
32 general-purpose registers called r0, .......... , r31, each of which is 32-bit
wide. All general-purpose registers can be used as the target registers and
data sources for all logical, arithmetical, memory access, and control-flow
instructions. Only r0 is special, because it is internally hardwired to zero.
Reading r0 always returns the value 0x00000000, and a value written to r0 is
ignored and lost.
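The behaviour of the register bank can be modelled in a few lines of Python.
The class RegisterFile below is a hypothetical model written for this
illustration, not part of any MIPS tool; it only captures the two properties
stated above: 32 registers of 32 bits each, and r0 hardwired to zero.

```python
class RegisterFile:
    """A model of 32 general-purpose registers, each 32 bits wide."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, index: int) -> int:
        # r0 is never written, so reading it always yields 0.
        return self._regs[index]

    def write(self, index: int, value: int) -> None:
        if index == 0:
            return                      # writes to r0 are ignored
        self._regs[index] = value & 0xFFFFFFFF   # keep only 32 bits


rf = RegisterFile()
rf.write(0, 123)
rf.write(5, 0x1_0000_0001)
print(hex(rf.read(0)))   # r0 still reads as zero
print(hex(rf.read(5)))   # value truncated to 32 bits
```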
Self Assessment Questions
12. One of the key features of the MIPS architecture is the ___________ .
13. Two separate 32-bit registers called ___________ and __________
are provided for the integer multiplication and division instructions.
Activity 2:
Visit a computer hardware store and try to collect as much information as
possible about the MIPS processor. Compare its features with other
processors.
3.8 Summary
• Each computer has its own particular instruction code format called its
Instruction Set.
• The different types of instruction formats are three-address instructions,
two-address instructions, one-address instructions and zero-address
instructions.
• A distinct addressing mode field is required in instruction format for signal
processing.
• The program is executed by going through a cycle for each instruction.
• The prototype chip of MIPS architecture demonstrated that it is possible to
integrate a microprocessor with five-stage execution pipeline and cache
controller into a single silicon chip.
3.9 Glossary
• Cell: The smallest unit of memory that the CPU can read or write.
• Decoding: It means interpretation of the instruction.
• Fields: Groups containing bits of instruction.
• Instruction set: Each computer has its own particular instruction code
format called its Instruction Set.
• MIPS: Microprocessor without Interlocked Pipeline Stages.
• Operation: It is a binary code that instructs the computer to perform a
specific operation.
• RISC: Reduced Instruction Set Computer
• Words: Hardware-accessible units of memory larger than one cell are
called words.
3.10 Terminal Questions
1. What are instruction sets? Explain the fields found in instruction formats.
2. Give the classification of the various instruction sets.
3. Define memory addressing.
4. Explain the different types of addressing modes.
5. Describe the instruction cycle and its various phases.
6. Explain the instructions required for control flow.
7. Write a short note on MIPS architecture.
3.11 Answers
Self Assessment Questions
1. Fields
2. One-address instructions
3. False
4. True
5. Zero-address
6. Branch type
7. Index register
8. Interrupt cycle
9. a. Memory Address Register
b. Memory Buffer Register
10. Consecutive memory
11. Branch, jump instructions
12. Regular register set
13. HI, LO
Terminal Questions
1. Each computer has its own particular instruction code format called its
Instruction Set. Refer Section 3.2.
2. The different types of instruction formats are three-address instructions,
two-address instructions, one-address instructions and zero-address
instructions. Refer Section 3.2.
3. Memory addressing is the logical structure of a computer’s random-access
memory (RAM). Refer Section 3.3.
4. A distinct addressing mode field is required in instruction format for signal
processing. Refer Section 3.4.
5. The program is executed by going through a cycle for each instruction.
Each instruction cycle is now subdivided into a sequence of sub cycles or
phases. Refer Section 3.5.
6. The conditions for altering the content of the program counter are specified
by program control instruction, and the conditions for data-processing
operations are specified by data transfer and manipulation instructions.
Refer Section 3.6.
7. After considerable research on efficient processor organisation and VLSI
integration at Stanford University, the MIPS architecture evolved. Refer
Section 3.7.
Unit 4 Pipelined Processor
Structure:
4.1 Introduction
Objectives
4.2 Pipelining
4.3 Types of Pipelining
4.4 Pipelining Hazards
4.5 Data Hazards
4.6 Control Hazards
4.7 Techniques to Handle Hazards
Minimising data hazard stalls by forwarding
Reducing pipeline branch penalties
4.8 Performance Improvement Pipeline
4.9 Effects of Hazards on Performance
4.10 Summary
4.11 Glossary
4.12 Terminal Questions
4.13 Answers
4.1 Introduction
In the previous unit, you studied the changing face of computing. You also
studied the meaning and tasks of a computer designer. We also covered
the technology trends and the quantitative principles in computer design. In
this unit, we will introduce you to pipeline processing, the pipeline hazards,
structural hazards, control hazards and techniques to handle them. We will
also examine the performance improvement with pipelines and understand the
effect of hazards on performance.
A parallel processing system can carry out concurrent data processing to
attain quicker execution time. For example, while an instruction is being
executed in the ALU, the subsequent instruction can be read from memory. The
system may have more than one ALU and be able to execute two or more
instructions simultaneously. Additionally, the system may have two or more
processors operating at the same time. The purpose of parallel processing is
to speed up the computer’s processing capability and increase its throughput.
Parallel processing can be viewed from various levels of complexity. A
multifunctional organisation is usually associated with a complex control unit
to coordinate all the activities among the various components.
There are a variety of ways in which parallel processing can be done. We
consider parallel processing under the following main topics:
1. Pipeline processing
2. Vector processing
3. Array processing
Out of these, we will study the pipeline processing in this unit.
Objectives:
After studying this unit, you should be able to:
• explain the concept of pipelining
• list the types of pipelining
• identify various pipeline hazards
• describe data hazards
• discuss control hazards
• analyse the techniques to handle hazards
• describe the performance improvement with pipelines
• explain the effect of hazards on performance
4.2 Pipelining
An implementation technique by which the execution of multiple instructions
can be overlapped is called pipelining. The pipeline technique splits the
sequential process of an instruction cycle into sub-processes that operate
concurrently in separate segments. As you know, computer processors can
execute millions of instructions per second. While one instruction is being
processed in one segment, the instruction that follows it is processed in the
preceding segment at the same time, and so on. A pipeline thus permits
multiple instructions to be in execution at the same time. Without a pipeline,
every instruction has to wait for the previous one to complete. The main
advantage of pipelining is that it increases the instruction throughput, which
is defined as the number of instructions completed per unit time. Thus, a
program runs faster.
In pipelining, several computations can run in distinct segments at the same
time. A register is associated with each segment in the pipeline to provide
isolation between segments. Thus, each segment can operate on distinct
data simultaneously. Pipelining is also called virtual parallelism, as it
provides a form of parallelism only at the instruction level. In pipelining,
the CPU executes each instruction in the following series of stages:
1. Instruction Fetching (IF)
2. Instruction Decoding (ID)
3. Instruction Execution (EX)
4. Memory access (MEM)
5. Register Write back (WB)
The CPU while executing a sequence of instructions can pipeline these
common steps. However, in a non-pipelined CPU, instructions are executed
in strict sequence following the steps mentioned above. In pipelined
processors, it is desirable to determine the outcome of a conditional branch as
early as possible in the execution sequence. To understand pipelining, let us
discuss how an instruction flows through the data path in a five-segment
pipeline.
Consider a pipeline with five processing units, where each unit is assumed to
take 1 cycle to finish its execution as described in the following steps:
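a) Instruction fetch cycle (IF): In the first step, the instruction whose
address is held in the PC register is fetched from memory into the
Instruction Register (IR), and the PC is updated to hold the address of
the next instruction.
b) Instruction decode/register fetch cycle (ID): The fetched instruction is
decoded and its register operands are read into two temporary registers.
Decoding and reading of registers are done in parallel.
c) Instruction execution cycle (EX): In this cycle, the ALU operates on the
operands prepared in the previous cycle, either performing the arithmetic
or logic operation or calculating the address of a memory operand or
branch destination. The result is held in the ALU output register.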
d) Memory access completion cycle (MEM): In this cycle, the address of
the operand calculated during the prior cycle is used to access memory.
In case of load and store instructions, either data returns from memory and
is placed in the Load Memory Data (LMD) register or is written into
memory. In case of branch instruction, the PC is replaced with the branch
destination address in the ALU output register.
e) Register write back cycle (WB): During this stage, both single cycle and
two cycle instructions write their results into the register file.
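The overlap of the five stages can be tabulated with a short Python sketch.
It assumes an ideal pipeline in which every stage takes exactly one cycle and
no hazards occur; the names STAGES, pipeline_schedule and total_cycles are
invented for this illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions: int) -> dict:
    """Return {cycle: {instruction number: stage}} for an ideal pipeline."""
    schedule = {}
    for i in range(num_instructions):
        for s, stage in enumerate(STAGES):
            # Instruction i+1 enters the pipeline in cycle i+1 and then
            # advances by one stage per cycle.
            schedule.setdefault(i + s + 1, {})[i + 1] = stage
    return schedule


def total_cycles(num_instructions: int) -> int:
    """Cycles needed to complete all instructions with no stalls."""
    return len(STAGES) + num_instructions - 1


for cycle, active in sorted(pipeline_schedule(3).items()):
    print(cycle, active)
print(total_cycles(3))   # compare with 3 * 5 cycles without pipelining
```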
The steps of a five-segment pipelined processor are shown in figure 4.1.
Figure 4.1: A Five-Segment Pipelined Processor
The segments are isolated by registers. The simplest way to visualise a
segment is to think of it as consisting of an input register and a
combinational circuit that processes the data stored in the register. See
table 4.1 for examples of the sub operations performed in each segment of a
pipeline.
Table 4.1: Sub operations Performed in Each Segment of Pipeline
R1 ← An, R2 ← Bn, R3 ← Cn, R4 ← Dn     Input An, Bn, Cn, Dn
R5 ← An * Bn, R6 ← Cn * Dn             Multiply
R7 ← R5 + R6                           Add and store in register R7
Now we will study an example of a pipeline in figure 4.2, which computes
An * Bn + Cn * Dn for n = 1, 2, 3, ...
Figure 4.2: Example of Pipeline Processing
In the figure, each segment has one or more registers with combinational
circuits. Each register is loaded with new data at the start of every new
clock period. Refer to table 4.2 for an example of the contents of the
registers in the pipeline.
On the 1st clock pulse, data is loaded into registers R1, R2, R3, and R4.
On the 2nd clock pulse, the products are stored in registers R5 and R6.
On the 3rd clock pulse, the data in R5 and R6 are added and stored in R7.
So a total of only 3 clock periods is required to compute An * Bn + Cn * Dn
for the first set of inputs.
Table 4.2: Contents of Registers in Pipeline Example
Clock    Segment 1             Segment 2        Segment 3
Pulse    R1   R2   R3   R4     R5      R6       R7
1        A1   B1   C1   D1     -       -        -
2        A2   B2   C2   D2     A1*B1   C1*D1    -
3        A3   B3   C3   D3     A2*B2   C2*D2    A1*B1+C1*D1
4        -    -    -    -      A3*B3   C3*D3    A2*B2+C2*D2
5        -    -    -    -      -       -        A3*B3+C3*D3
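The register contents in the table can be reproduced with a small Python
simulation of the three segments. The function simulate and the sample input
values are assumptions made for this illustration; None stands for an empty
register (shown as "-" in the table).

```python
def simulate(inputs):
    """Clock (A, B, C, D) tuples through the pipeline; return register history."""
    r1 = r2 = r3 = r4 = r5 = r6 = r7 = None
    history = []
    # Two extra clock pulses with no input drain the last results.
    for item in list(inputs) + [None, None]:
        # All registers are clocked together, so each new value is computed
        # from the old contents, starting with the last segment.
        r7 = r5 + r6 if r5 is not None else None               # segment 3: add
        r5, r6 = (r1 * r2, r3 * r4) if r1 is not None else (None, None)  # segment 2
        r1, r2, r3, r4 = item if item is not None else (None,) * 4       # segment 1
        history.append((r1, r2, r3, r4, r5, r6, r7))
    return history


inputs = [(1, 2, 3, 4), (5, 6, 7, 8), (2, 2, 2, 2)]
for pulse, registers in enumerate(simulate(inputs), start=1):
    print(pulse, registers)
```

With three sets of inputs, the first result appears in R7 on the 3rd clock
pulse and one further result appears on each pulse after that.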
An instruction pipeline operates on a stream of instructions by overlapping
the fetch, decode and execute phases of the instruction cycle. High speed
computers usually contain pipeline arithmetic units. The execution of floating
point operations, multiplication of fixed point numbers and similar
computations encountered in scientific problems are done through these
pipeline arithmetic units.
A pipeline multiplier is essentially an array multiplier with special address
designed to minimise the carry propagation time through the partial products.
Floating point operations are easily decomposed into sub operations.
While the previous instructions are being executed in other segments, an
instruction pipeline reads successive instructions from memory. As a result,
the instruction fetch and execute phases overlap and perform simultaneous
operations. One possible complication with such a scheme is that an
instruction may cause a branch out of sequence. In that case, all the
instructions that have been read from memory after the branch instruction
must be discarded and the pipeline must be emptied.
The instruction fetch segment can be implemented by means of a first-in
first-out (FIFO) buffer. This forms a queue rather than a stack.
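The behaviour of such a fetch buffer, including the discarding of prefetched instructions after a branch, can be sketched as follows. This is an illustrative Python model only; the class name InstructionQueue, its methods and the capacity parameter are assumptions and do not describe any particular processor.

```python
from collections import deque


class InstructionQueue:
    """Hypothetical FIFO fetch buffer between memory and the decode segment."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()

    def fetch(self, instruction):
        """Prefetch one instruction; return False if the buffer is full."""
        if len(self.buffer) >= self.capacity:
            return False
        self.buffer.append(instruction)
        return True

    def issue(self):
        """Hand the oldest instruction to the next segment (first in, first out)."""
        return self.buffer.popleft() if self.buffer else None

    def flush(self):
        """Discard everything prefetched after a branch out of sequence."""
        discarded = len(self.buffer)
        self.buffer.clear()
        return discarded


if __name__ == "__main__":
    queue = InstructionQueue(capacity=4)
    for name in ["BRANCH", "ADD", "SUB"]:
        queue.fetch(name)
    print(queue.issue())   # the branch leaves the queue first
    print(queue.flush())   # number of prefetched instructions thrown away
```

A queue is used rather than a stack because instructions must leave the buffer in the same order in which they were fetched.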
The instruction pipeline design will be most efficient if the instruction cycle
is divided into segments of equal duration. The time that each step takes to
accomplish its job depends on the instruction and the manner in which it is
executed.
Self Assessment Questions
1. An implementation technique by which the execution of multiple
instructions can be overlapped is called _______ .
2. Pipelining is also called ______________ .
3. LMD is the short for _________________ .
4. The instruction fetch segment can be implemented by means of a
_______ .
4.3 Types of Pipelining
Pipelines are of two types - Linear and Non-linear.
a) Linear pipelines: These pipelines perform only one pre-defined fixed
function at specific times in a forward direction from one stage to the next
stage. A linear pipeline can be visualised as a collection of processing
segments, where each segment completes a part of an instruction. The
result obtained from the processing in each segment is transferred to the
next segment in the pipeline. As these pipelines repeatedly evaluate the
same function with different data for some specified period of time, they
are also called static pipelines.
b) Non-linear pipelines: These pipelines can perform more than one
operation at a time as they have the provision to be reconfigured to
execute variable functions at different times. As these pipelines can
execute different functions at different times, they are called dynamic
pipelines.
An example of a non-linear pipeline is a three-stage pipeline that performs
subtraction and multiplication on different data at the same time as
illustrated in figure 4.3.
In this three-stage pipeline, the input data must go through stages 1, 2 and 3
to perform multiplication and through stages 1 and 3 only to perform
subtraction. Therefore, dynamic pipelines require feed forward and feedback
connections in addition to the streamline connections between the stages.
Self Assessment Questions
5. ___________ pipelines perform only one pre-defined fixed function
at specific times in a forward direction from one stage to the next stage.
6. _______________ pipelines can perform more than one operation at
a time as they have the provision to be reconfigured to execute variable
functions at different times.
7. Non-Linear pipelines are also called ___________________ .
4.4 Pipelining Hazards
Hazards are the situations that stop the next instruction in the instruction
stream from being executed during its designated clock cycle. Hazards reduce
the performance from the ideal speedup gained by pipelining. In general, there
are three major categories of hazards that can affect normal operation of a
pipeline.
1. Structural hazards (also called resource conflicts): They occur from
resource conflicts when the hardware cannot support all possible
combinations of instructions in simultaneous overlapped execution. These
are caused by multiple accesses to memory performed by segments. In
most cases this problem can be resolved by using separate instruction and
data memories.
2. Data hazards (also called data dependency): They occur when an
instruction depends on the result of a previous instruction in a way that is
exposed by the overlapping of instructions in the pipeline. This happens
when an instruction requires the output of a previous instruction and that
output is not yet available. This is explained in detail in section 4.5.
3. Control hazards (also called branch difficulties): Branch difficulties arise
from branch and other instructions that change the content of PC (Program
Counter). This is explained in detail in the section 4.6.
Stalling can become necessary because of the hazards present in a pipeline.
The processor can stall on different events:
1. A cache miss: A cache miss stalls all the instructions in the pipeline,
both before and after the instruction that caused the miss.
2. A hazard in the pipeline: Removing a hazard often requires that some
instructions in the pipeline are allowed to proceed while others are delayed.
Once an instruction is stalled, all the instructions following it are also
stalled. Instructions ahead of the stalled instruction must keep going, or
else the hazard will never clear.
Self Assessment Questions
8. ______________ are the situations that stop the next instruction in
the instruction stream from being executed during its designated clock
cycle.
9. Structural Hazards are also called _______________ .
10. Data Hazards are also called ___________________ .
11. Control Hazards are also called _____________ .
4.5 Data Hazards
Pipelining changes the relative timing of instructions by overlapping their
execution. This leads to data and control hazards. Data hazards arise when
the pipeline changes the order of read/write accesses to operands so that it
differs from the order seen in sequential execution on an unpipelined
machine. In simple terms, a data hazard occurs when an attempt is made to
use data before it is ready. Consider the pipelined execution of an ADD
instruction followed by SUB, AND, OR and XOR instructions that all use its
result.
All the instructions following the ADD use the result of the ADD
instruction (in R1). The ADD instruction writes the value of R1 in the write
back (WB) pipe stage, but the SUB instruction reads the value during its
instruction decode (ID) stage. This problem is referred to as a data hazard
because the SUB instruction reads a wrong value and attempts to use it.
If an interrupt occurs between the ADD and SUB instructions, the WB stage
of the ADD will complete, and the value of R1 at that point will be the
result of the ADD.
As we can see in figures 4.4 and 4.5, the AND instruction is also affected by
the data hazard: the write of R1 does not finish till the end of clock cycle 5.
Therefore, the AND instruction, which reads the registers during cycle 4, will
not receive the correct result.
Figure 4.4: Pipelined Execution of the Instruction
The SUB instruction reads the wrong value as it reads the data (cycle 3) before
the ADD instruction writes the value (cycle 5). The register read of the XOR
instruction occurs in clock cycle 6. This is performed correctly as it is done
after the register write by ADD. The OR instruction can also operate without
any problem. To achieve this, register file writes are performed in the first
half of the cycle and reads in the second half. In cycle 5, the first half of
the cycle performs the write to the register file by ADD and the second half
performs the read of the registers by OR.
Figure 4.5: Clock Cycles and Execution Order of Instructions
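The cycle-by-cycle reasoning above can be checked with a small script. This is a sketch under stated assumptions, not part of the original text: the register operands other than R1 are hypothetical (the text only says that every instruction after the ADD uses R1), instruction i is assumed to be fetched in cycle i + 1 of a five-stage pipeline, and the register file is assumed to write in the first half of a cycle and read in the second half.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def raw_hazards(program):
    """List (producer, consumer, register) triples where the consumer reads
    a register in ID before the producer has written it in WB.

    Instruction i (counting from 0) is in ID during cycle i + 2 and in WB
    during cycle i + 5. A read in the same cycle as the write is safe
    because the write uses the first half of the cycle. The sketch ignores
    later instructions that overwrite the same register.
    """
    hazards = []
    for i, producer in enumerate(program):
        wb_cycle = i + 5
        for j in range(i + 1, len(program)):
            consumer = program[j]
            id_cycle = j + 2
            if producer.dest in consumer.srcs and id_cycle < wb_cycle:
                hazards.append((producer.op, consumer.op, producer.dest))
    return hazards


# Hypothetical operands; only the use of R1 comes from the text.
PROGRAM = [
    Instr("ADD", "R1", ("R2", "R3")),
    Instr("SUB", "R4", ("R1", "R5")),
    Instr("AND", "R6", ("R1", "R7")),
    Instr("OR", "R8", ("R1", "R9")),
    Instr("XOR", "R10", ("R1", "R11")),
]

if __name__ == "__main__":
    for producer, consumer, reg in raw_hazards(PROGRAM):
        print(f"{consumer} reads {reg} before {producer} writes it")
```

Under these assumptions only SUB (register read in cycle 3) and AND (cycle 4) are reported, while OR (cycle 5) and XOR (cycle 6) read R1 safely, which agrees with the discussion of figures 4.4 and 4.5.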
Self Assessment Questions
12. Pipelining has a major effect on changing the relative timing of instructions
by overlapping their execution. (True/False)
13. The register read of XOR instruction occurs in clock cycle _______ .
4.6 Control Hazards
Control hazards cause a greater performance loss for a pipeline than data
hazards. When a branch is executed, it may or may not change the PC to
something other than its current value plus 4. If the branch changes the PC
to its target address, it is known as a taken branch; else it is known as not
taken or untaken. Control hazards are also known as branching hazards and
occur with branches. In this case, the processor does not know the outcome
of the branch when it needs to insert a new instruction into the pipeline.
The simplest method of dealing with branches is to stall the pipeline as soon
as the branch is detected. There is no need to stall the pipeline until the
instruction is confirmed to be a branch. Thus, the stall does not occur until
after the ID stage. The pipelining behaviour looks as in figure 4.6.
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              IF    stall stall IF    ID    EX    MEM   WB
Branch successor + 1                                  IF    ID    EX    MEM   WB
Branch successor + 2                                        IF    ID    EX    MEM
Branch successor + 3                                              IF    ID    EX
Branch successor + 4                                                    IF    ID
Branch successor + 5                                                          IF
Figure 4.6: Three-Cycle Stall in the Pipeline
The control hazard stall is not implemented in the same way as the data
hazard stall, since the instruction fetch (IF) cycle has to be repeated as
soon as the branch outcome is known. Thus, the first IF cycle is essentially a
stall, as it never performs useful work. The stall can be implemented by
setting the IF/ID register to zero for the three cycles. The repetition of the
IF stage is not required if the branch is untaken, since the correct
instruction has already been fetched.
Self Assessment Questions
14. _________ cause a greater performance failure for a pipeline than
_________ .
15. If the PC is changed by the branch to its target address, then it is known
as __________________ branch; else it is known as __________ .
4.7 Techniques to Handle Hazards
In this section, we will discuss the techniques to handle data and control
hazards. Now, let us start with the concept of forwarding technique to handle
data hazard.
4.7.1 Minimising data hazard stalls by forwarding
The problem posed by data hazards can be solved with a simple hardware
technique called forwarding (also called bypassing and sometimes
short-circuiting). The key insight in forwarding is that the result is not
really needed by the SUB instruction until after the ADD actually produces it.
The only problem is to make it available to the SUB when it needs it. If the
result can be moved from where the ADD produces it (the execute/memory
access (EX/MEM) register) to where the SUB requires it (the ALU input latch),
then the need for a stall can be avoided.
Using this observation, forwarding works as follows:
1. The ALU result from the EX/MEM register is always fed back to the ALU
input latches.
2. If the forwarding hardware detects that the previous ALU operation has
written the register corresponding to a source for the current ALU
operation, the control logic selects the forwarded result as the ALU input
rather than the value read from the register file.
With forwarding, if the SUB instruction is stalled, the ADD instruction will
complete and the bypass will not be activated, so the value from the register
file is used. This is also true in the case of an interrupt between the two
instructions.
Figure 4.5 shows that results need to be forwarded not only from the
immediately previous instruction but also from an instruction that started two
cycles earlier. The bypass paths and the timing of the register reads and
writes are highlighted in figure 4.7. With these paths in place, the code
sequence can be executed without stalls.
Figure 4.7: Example with the Bypass Paths in Place
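The selection made by the control logic in step 2 can be sketched in a few lines of Python. This is an illustrative model only; the function name select_alu_operand and the representation of a pipeline register as a (destination register, value) pair are assumptions, not the DLX hardware description.

```python
def select_alu_operand(src_reg, regfile, ex_mem=None, mem_wb=None):
    """Choose the value for one ALU source operand.

    ex_mem and mem_wb are hypothetical (dest_reg, value) pairs describing the
    results held in the EX/MEM and MEM/WB pipeline registers, or None if the
    register holds no result. The most recent producer (EX/MEM) has priority.
    Returns (value, where_it_came_from).
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1], "EX/MEM"
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1], "MEM/WB"
    return regfile[src_reg], "register file"


if __name__ == "__main__":
    regfile = {"R1": 0, "R5": 7}          # R1 still holds its stale value
    # SUB directly after ADD: the ADD result sits in EX/MEM.
    print(select_alu_operand("R1", regfile, ex_mem=("R1", 30)))
    # AND two instructions after ADD: the result has moved on to MEM/WB.
    print(select_alu_operand("R1", regfile, ex_mem=("R4", 23), mem_wb=("R1", 30)))
    # R5 has no pending writer, so the register file value is used.
    print(select_alu_operand("R5", regfile, ex_mem=("R1", 30)))
```

Checking EX/MEM before MEM/WB matters when two earlier instructions write the same register: the later of the two holds the value that sequential execution would have produced.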
We can generalise forwarding to include passing a result directly to the
functional unit that needs it: a result is forwarded from the output of one
unit to the input of another, rather than just from the output of a unit to the
input of the same unit. For example, consider a sequence in which an ADD
produces R1, a load (LW) uses R1 to form its address and loads R4, and a
store (SW) uses R1 to form its address and stores R4.
By forwarding the results in R1 and R4 from the pipeline registers to the ALU
and data memory inputs, we can prevent a stall.
The store requires an operand during MEM, and the forwarding of that operand
is shown in figure 4.8(a).
Figure 4.8(a): Forwarding Example
• The first forwarding is for value R1 from EX_add to EX_lw.
• The second forwarding is also for value R1, from MEM_add to EX_sw.
• The third forwarding is for value R4 from MEM_lw to MEM_sw.
Figure 4.8 (b) shows all the forwarding paths for this example.
In DLX, a forwarding path may be required to the input of any functional unit
from any pipeline register. As both the ALU and the data memory accept
operands, forwarding paths are required to their inputs from both the
EX/MEM and MEM/WB registers. Additionally, DLX (a RISC processor
architecture) uses a zero detection unit. This unit operates during the EX
cycle, and requires forwarding as well. Later in this section, we will explore
all the necessary forwarding paths and the control of those paths.
Figure 4.8(b): Forwarding Paths of the Above Example
The result of the load is forwarded from the memory output in the memory
access/write back (MEM/WB) register to the memory input, where it is stored.
Also, the ALU output is forwarded to the ALU input for the address calculation
of both the load and the store. This is no different from forwarding to another
ALU operation. If the store were dependent on an immediately preceding ALU
operation, the result would need to be forwarded to prevent a stall.
4.7.2 Reducing pipeline branch penalties
In this section, we discuss four simple compile-time schemes for dealing with
the pipeline stalls that are caused by branch delay. In these four schemes the
action taken for a branch is static - it is fixed for every branch throughout
the whole execution.
The branch penalty can be minimised by the software using knowledge of the
hardware scheme and branch behaviour. The branch optimisations rely on
compile-time branch prediction technology. Hence, we will discuss this
technology after these schemes. The techniques for handling branches are
given below.
1. Freeze or Flush the pipeline: This is the simplest scheme to handle
branches. In this scheme, any instruction after the branch is held or
deleted until the branch destination is known. This solution is attractive
mainly because of its simplicity for both hardware and software. It is the
solution shown earlier in the pipeline of figure 4.6. Here, the branch
penalty is fixed and cannot be reduced by software.
2. Assume each branch as not-taken: In this scheme, every branch is
treated as not taken; the hardware simply carries on as if the branch were
not executed. Care must be taken that the machine state is not changed
until the branch outcome is definitely known. The complication of having
to know when the state might be changed by an instruction, and how to
“back out” a change, may persuade us to prefer the simpler solution of
flushing the pipeline in machines with complex pipeline structures.
3. Predict-not-taken or predict-untaken scheme: This scheme continues to
fetch instructions as if the branch were a normal instruction. The pipeline
looks as if nothing unusual is happening. If the branch is taken, however,
the fetched instruction needs to be turned into a no-op (simply by clearing
the IF/ID register) and the fetch needs to be restarted at the target
address. This is shown in figure 4.9.
Untaken branch instruction  IF    ID    EX    MEM   WB
Instruction i + 1                 IF    ID    EX    MEM   WB
Instruction i + 2                       IF    ID    EX    MEM   WB
Instruction i + 3                             IF    ID    EX    MEM   WB
Instruction i + 4                                   IF    ID    EX    MEM   WB

Taken branch instruction    IF    ID    EX    MEM   WB
Instruction i + 1                 IF    idle  idle  idle  idle
Branch target                           IF    ID    EX    MEM   WB
Branch target + 1                             IF    ID    EX    MEM   WB
Branch target + 2                                   IF    ID    EX    MEM   WB
Figure 4.9: Predict-Not-Taken Scheme
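The cost of the schemes discussed so far can be compared with a small calculation. The sketch below is an illustration under stated assumptions: it uses the relation that the pipelined CPI is 1 plus the stall cycles per instruction, takes the three-cycle penalty from figure 4.6 and the single idle slot of a taken branch from figure 4.9, and uses a branch frequency and a taken fraction that are hypothetical values, not measurements.

```python
def cpi_with_branches(branch_freq, penalty_cycles, penalised_fraction=1.0):
    """Pipelined CPI = 1 + branch stall cycles per instruction.

    branch_freq        : fraction of instructions that are branches
    penalty_cycles     : cycles lost each time a branch is penalised
    penalised_fraction : fraction of branches that pay the penalty
    """
    return 1.0 + branch_freq * penalised_fraction * penalty_cycles


if __name__ == "__main__":
    BRANCH_FREQ = 0.20      # assumed: 20% of instructions are branches
    TAKEN_FRACTION = 0.60   # assumed: 60% of branches are taken

    # Freeze/flush: every branch pays the three-cycle stall of figure 4.6.
    freeze = cpi_with_branches(BRANCH_FREQ, penalty_cycles=3)
    # Predict-not-taken: only taken branches lose one slot (figure 4.9).
    predict_not_taken = cpi_with_branches(
        BRANCH_FREQ, penalty_cycles=1, penalised_fraction=TAKEN_FRACTION)
    print(f"freeze or flush   CPI = {freeze:.2f}")
    print(f"predict-not-taken CPI = {predict_not_taken:.2f}")
```

The comparison only shows the direction of the effect; real penalties depend on the pipeline stage in which the branch outcome and target address become known.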
An alternative scheme is to treat every branch as taken. As soon as the
branch is decoded and the target address is computed, the branch is assumed
to be taken and fetching and execution begin at the target. In the DLX
pipeline, the target address is not known any earlier than the branch
outcome, so this approach offers no advantage for DLX. In some machines,
where the target address is known before the branch outcome, a predict-taken
scheme might make sense. In both the predict-taken and predict-not-taken
schemes, the compiler can improve performance by organising the code so that
the most common path matches the hardware's choice. Additional
opportunities for the compiler to improve performance are provided by our
fourth scheme.
The fourth scheme, which is used in some machines, is known as delayed
branch. Many microprogrammed control units also use this technique. In a
delayed branch, the execution cycle with a branch delay of length n is:
Branch instruction
Sequential successor 1
Sequential successor 2
...
Sequential successor n
Branch target if taken
The sequential successors occupy the branch-delay slots. These instructions
are executed whether or not the branch is taken. Figure 4.10 shows the
pipeline behaviour with one branch-delay slot.
Figure 4.10: Behaviour of a Delayed Branch
In practice, almost all machines with delayed branch have a single instruction
delay, and we focus on that case.
Self Assessment Questions
16. The problem posed due to data hazards can be solved with a simple
hardware technique called __________________ .
17. Forwarding is also called _________ or _________________ .
18. ____________ is the method of holding or deleting any instructions
after the branch until the branch destination is known.
19. ________________ technique simply allows the hardware to
continue as if the branch were not executed.
4.8 Performance Improvement with Pipeline
Performance is a function of CPI (cycles per instruction), clock cycle time
and instruction count. Reducing any of the three factors will lead to improved
performance.
Firstly, it is necessary to relate the concept of pipelining to the instruction
execution process, i.e. to overlap the computations of different tasks by
operating on them simultaneously in different stages. This decreases the
clock cycle, and hence the effective time taken by the CPU, in comparison to
the original clock cycle. The instruction execution process lends itself
naturally to pipelining, with the subtasks of instruction fetch, decode and
execute overlapped.
Figure 4.11: Pipeline Clock and Timing
In figure 4.11:
Clock cycle of the pipeline: $t$
Delay of stage m: $t_m$
Latch delay: $d$
$t = \max\{t_m\} + d$
Pipeline frequency: $f = 1/t$
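As a worked example of these two relations, the following Python sketch computes the clock cycle and frequency of a pipeline. The function name pipeline_clock and the stage and latch delays are assumed values for illustration only.

```python
def pipeline_clock(stage_delays, latch_delay):
    """Return (t, f): clock cycle t = max{t_m} + d and frequency f = 1 / t."""
    t = max(stage_delays) + latch_delay
    return t, 1.0 / t


if __name__ == "__main__":
    # Assumed delays in nanoseconds for a five-stage pipeline.
    stage_delays_ns = [10, 8, 12, 9, 7]
    latch_delay_ns = 1
    t, f = pipeline_clock(stage_delays_ns, latch_delay_ns)
    print(f"clock cycle t = {t} ns")
    print(f"frequency   f = {f * 1000:.1f} MHz")   # 1/ns = 1000 MHz
```

The slowest stage sets the clock for every stage, which is why a pipeline works best when the stages are balanced.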
Performance counters: Performance counters are components of real
processors that track a variety of events in the processor so that its
performance can be understood. Here, we will study four memory-mapped
performance counters:
• Cycle count - 0xFF00: This is the number of cycles since the processor
was last reset.
• Instruction count - 0xFF01: This is the number of instructions actually
executed since the processor was last reset.
• Load-stall count - 0xFF02: This is the number of cycles lost to load-use
stalls, i.e. cycles in which no instruction is executed because of a
load-use stall.
• Branch stall count - 0xFF03: This is the number of cycles lost to branch
mis-predictions and/or stalls, i.e. cycles in which no instruction is
executed because of a branch mis-prediction.
• In the single-issue pipeline, one (and only one) of the instruction count,
load-stall and branch stall counters is incremented in every cycle. As
such, the cycle count should be equal to the sum of these three registers.
• In the dual-issue processor, only one of the instruction count, load-stall
and branch stall counters is incremented in each cycle, but the instruction
count register may sometimes be incremented by two (for cycles in which
two instructions execute). As such, the sum of these three registers will
be greater than or equal to the cycle count.
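The two relations between the counters can be turned into a simple consistency check. The sketch below is written in Python for illustration; the function names and the sample counter values are assumptions, and on a real system the values would be read from the memory-mapped addresses listed above.

```python
def counters_consistent(cycles, instructions, load_stalls, branch_stalls,
                        dual_issue=False):
    """Check the counter relation stated above.

    Single issue: cycles == instructions + load_stalls + branch_stalls.
    Dual issue  : the sum may exceed the cycle count, because a cycle in
                  which two instructions execute adds 2 to the instruction
                  count.
    """
    total = instructions + load_stalls + branch_stalls
    return total >= cycles if dual_issue else total == cycles


def measured_cpi(cycles, instructions):
    """Cycles per instruction derived from the two basic counters."""
    return cycles / instructions


if __name__ == "__main__":
    # Hypothetical readings of the four counters.
    print(counters_consistent(100, 80, 12, 8))                  # single issue
    print(counters_consistent(60, 80, 12, 8, dual_issue=True))  # dual issue
    print(measured_cpi(100, 80))
```

A failed check on the single-issue pipeline would indicate a cycle that was counted twice or not at all.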
The processor should update the performance counters during the write back
stage of the pipeline. To be precise, a cycle in which an instruction
completes write back is counted as an instruction; it is neither a branch stall
nor a load stall cycle. The current value of these counters can be read by
using an LD or LDR instruction to access them. The LD instruction takes a
source label and loads the value stored at that address into the destination
register. The LDR instruction adds an immediate offset to the value of a
source register and loads the value stored at the resulting address into the
destination register.
To avoid complexity, stores to these locations do not change the value of the
counters, although the stores may still update the contents of memory. This
makes little difference because, whenever these locations are read, the value
in the counter is used rather than the value in memory. Basically, these
counters can be reset to zero only when the entire system is reset.
Self Assessment Questions
20. ____________ states the number of cycles lost to load-use stalls.
21. ____________ instruction takes a source label and loads the value
stored at that address into the destination register.
22. ____________ instruction adds an immediate offset to the source
register's value and loads the value stored at the resulting address into
the destination register.
4.9 Effect of Hazards on the Performance
Hazards are of various types, and they all reduce the speedup gained from
pipelining. A stall caused by a hazard makes the pipeline performance degrade
from the ideal performance.
$$\text{Speedup from pipelining} = \frac{\text{Average instruction time unpipelined}}{\text{Average instruction time pipelined}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock cycle time}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock cycle time}_{\text{pipelined}}}$$
CPI is cycles per instruction, i.e. the number of clock cycles that each
instruction takes on average. The ideal CPI on a pipelined machine is almost
always 1.
Therefore, the pipelined CPI is:
CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instruction
= 1 + Pipeline stall clock cycles per instruction
If the cycle time overhead of pipelining is ignored and the stages are all
assumed to be perfectly balanced, then the two machines have an equal cycle
time and:
$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{1 + \text{Pipeline stall cycles per instruction}}$$
If all instructions take the same number of cycles, which must also equal the
number of pipeline stages (the depth of the pipeline) then unpipelined CPI is
equal to the depth of the pipeline, leading to
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$
If there are no pipeline stalls, this leads to the intuitive result that pipelining
can improve performance by the depth of pipeline.
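The last formula can be evaluated directly. The following Python sketch is only an illustration; the function name pipeline_speedup and the stall figures are assumed values, and the formula holds under the assumptions stated above (balanced stages, no cycle time overhead, unpipelined CPI equal to the pipeline depth).

```python
def pipeline_speedup(pipeline_depth, stall_cycles_per_instruction):
    """Speedup = pipeline depth / (1 + pipeline stall cycles per instruction)."""
    return pipeline_depth / (1.0 + stall_cycles_per_instruction)


if __name__ == "__main__":
    DEPTH = 5  # five-stage pipeline: IF, ID, EX, MEM, WB
    for stalls in (0.0, 0.25, 1.0):   # assumed stall cycles per instruction
        print(f"stalls = {stalls:4.2f}  speedup = {pipeline_speedup(DEPTH, stalls):.2f}")
```

With no stalls the speedup equals the depth of the pipeline, and every additional stall cycle per instruction reduces it.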
Self Assessment Questions
23. A __________ hazard causes the pipeline performance to degrade
from the ideal performance.
24. CPI is the abbreviation for ___________ .
Activity 1:
Pick any two hazards from the organisation you previously visited. Now
implement the handling techniques to these hazards.
4.10 Summary
Let us recapitulate the important concepts discussed in this unit:
• A parallel processing system is able to perform concurrent data
processing to achieve faster execution time.
• An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. In pipelining, the CPU
executes each instruction in the following series of small common steps:
• Instruction Fetching (IF)
• Instruction Decoding (ID)
• Instruction Execution(EX)
• Memory Access (MEM)
• Write back (WB)
• The segments are isolated by registers. The simplest way to visualise a
segment is to think of a segment consisting of an input register and a
combinational circuit that processes the data stored in register.
• Linear pipelines perform only one pre-defined fixed function at specific
times in a forward direction from one stage to the next stage. Non-Linear
pipelines can perform more than one operation at a time as they have the
provision to be reconfigured to execute variable functions at different
times.
• Hazards are the situations that stop the next instruction in the instruction
stream from being executed during its designated clock cycle. Structural
Hazards occur from resource conflicts when the hardware cannot support
all possible combinations of instructions in simultaneous overlapped
execution.
• Data Hazards occur when an instruction depends on the result of a
previous instruction in a way that is exposed by the overlapping of
instructions in the pipeline.
• Pipelining has a major effect on changing the relative timing of instructions
by overlapping their execution. This leads to data and control hazards.
Control hazards cause a greater performance loss for a pipeline than
data hazards.
• The problem posed due to data hazards can be solved with a simple
hardware technique called forwarding.
4.11 Glossary
• CPI: Cycles per Instruction
• EX: Instruction Execution
• FIFO: First-in first-out
• Freeze or Flush the pipeline: Holding or deleting any instructions after
the branch until the branch destination is known.
• Forwarding: A simple hardware technique that can solve the problem
posed due to data hazards.
• Hazards: Situations that stop the next instruction in the instruction stream
from being executed during its designated clock cycle.
• ID: Instruction Decoding
• IF: Instruction Fetching
• Pipelining: An implementation technique by which the execution of
multiple instructions can be overlapped
• Pipeline multiplier: An array multiplier with special adders designed to
minimise the carry propagation time through the partial products.
• WB: Write back
4.12 Terminal Questions
1. What do you understand by Parallel Processing? What are the different
types of Parallel Processing?
2. Describe Pipelining Processing. Explain the sequence of instructions in
Pipelining.
3. Explain briefly the types of Pipelining.
4. What do you mean by Hazards? Explain the types of Hazards.
5. Explain in detail the techniques to handle Hazards.
4.13 Answers
Self Assessment Questions
1. Pipelining
2. Virtual parallelism
3. Load Memory Data
4. First-in first-out (FIFO) buffer
5. Linear
6. Non-Linear
7. Dynamic pipelines
8. Hazards
9. Resource conflicts
10. Data dependency
11. Branch difficulties
12. True
13. 6
14. Control Hazards, data hazards
15. Taken, not taken or untaken
16. Forwarding
17. Bypassing or short-circuiting
18. Freeze or flush the pipeline
19. Assume each branch as not-taken
20. Load-stall count
21. LD
22. LDR
23. Stall
24. Cycles per Instruction
Terminal Questions
1. The concurrent use of two or more CPUs or processors to execute a
program is called parallel processing. Refer Section 4.1 for details.
2. An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. Refer Section 4.2 for
more details.
3. There are two types of pipelining-Linear and non-linear. Refer Section 4.3
for more details.
4. Hazards are the situations that stop the next instruction in the instruction
stream from being executed during its designated clock cycle. Refer
Section 4.4.
5. There are two techniques to handle hazards namely minimising data
hazard stalls by forwarding and reducing pipeline branch penalties. Refer
Section 4.7.
References:
• David Salomon, Computer Organisation, 2008, NCC Blackwell
• John L. Hennessy and David A. Patterson, Computer Architecture: A
Quantitative Approach, Fourth Edition, Morgan Kaufmann Publishers
• Joseph D. Dumas II; Computer Architecture; CRC Press
• Nicholas P. Carter; Schaum’s outline of computer Architecture;
McGraw-Hill Professional
E-references:
• http://www.lc3help.com/tutorials/Basic_LC-3_Instructions/ Retrieved on
03-04-2012
• http://www.scribd.com/doc/4596293/LC3-Instruction-Details Retrieved
on 02-04-2012
• http://xavier.perseguers.ch/programmation/mips-assembler/references/5-stage-pipeline.html
Unit 5 Design Space of Pipelines
Structure:
5.1 Introduction
Objectives
5.2 Design Space of Pipelines
Basic layout of a pipeline
Dependency resolution
5.3 Pipeline Instruction Processing
5.4 Pipelined Execution of Integer and Boolean Instructions
The design space
Logical layout of FX pipelines
Implementation of FX pipelines
5.5 Pipelined Processing of Loads and Stores
Subtasks of load and store processing
The design space
Sequential consistency of instruction execution
Instruction issuing and parallel execution
5.6 Summary
5.7 Glossary
5.8 Terminal Questions
5.9 Answers
5.1 Introduction
In the previous unit, you studied pipelined processors in great detail with a
short review of pipelining and examples of some pipeline in modern
processors. You also studied various kinds of pipeline hazards and the
techniques available to handle them.
In this unit, we will introduce you to the design space of pipelines. The
ever-increasing complexity of chips has led to higher operating speeds. These
speeds are achieved by overlapping instruction latencies, that is, by
implementing pipelining. In the early models, discrete pipeline was used. A
discrete pipeline performs the task in stages such as fetch, decode, execute,
memory and write-back. Here every pipeline stage requires one cycle, and as
there are five stages the instruction latency is five cycles. Longer pipelines
spread over more cycles can hide instruction latencies.
This allows processors to attain higher clock speeds. Instruction pipelining
has significantly improved the performance of today’s processors. In this unit,
you will study the design space of pipelines, which is further divided into the
basic layout of a pipeline and dependency resolution. We focus primarily on
the pipelined execution of Integer and Boolean instructions and the pipelined
processing of loads and stores.
Objectives:
After studying this unit, you should be able to:
• explain design space of pipelines
• describe pipeline instruction processing
• identify pipelined execution of Integer and Boolean instructions
• discuss pipelined processing of loads and stores
5.2 Design Space of Pipelines
In this section, we will learn about the design space of pipelines. The design
space of pipelines can be sub divided into two aspects as shown in figure 5.1.
Figure 5.1: Principal Aspects of Design Space of Pipelines
Let’s discuss each one of them in detail.
5.2.1 Basic Layout of a pipeline
To understand a pipeline in depth, it is necessary to know about the
decisions which are fundamental to the layout of a pipeline. These are:
1. The number of pipeline stages used to perform a given task,
2. Specification of the subtasks to be performed in each of the pipeline
stages,
3. Layout of the stage sequence, that is, whether the stages are used in a
strict sequential manner or some stages are recycled,
4. Use of bypassing, and
5. Timing of the pipeline operations, that is, whether pipeline operations are
controlled synchronously or asynchronously.
Figure 5.2 depicts these decisions diagrammatically.
[Figure 5.2 panels under "Basic layout of a pipeline": Number of stages;
Specification of the subtasks to be performed in each of the stages; Layout of
the stage sequence; Use of bypassing; Timing of the pipeline operations]
Figure 5.2: Overall Stage Layout of a pipeline
5.2.2 Dependency resolution
Pipeline design has another aspect called the dependency resolution. Earlier,
some pipelined computers used the Microprocessor without Interlocked
Pipeline Stages (MIPS approach) and used a static dependency resolution
which is also called static scheduling or software interlock resolution.
Here the detection and proper resolution of dependencies is done by the
compiler. Examples of static dependency resolution are:
• Original MIPS designs (like the MIPS and the MIPS-X)
• Some less famous RISC processors (like RCA, Spectrum)
• Intel processor (i860) which has both VLIW and scalar operation modes.
A more advanced resolution scheme is the combined static/dynamic
dependency resolution. This has been employed by MIPS R-series processors
such as the R2000, R3000, R4000, R4200 and R6000. In the first MIPS processors
(R2000, R3000) hardware interlocks were used for the long latency
operations, such as multiplication, division and conversion, while the
resolution of short latency operations relied entirely on the compiler. Newer
R-series implementations have extended the range of hardware interlocks
further and further, first to the load/store hazards (R6000) and then to other
short latency operations as well (R4000). In the R4000, the only instructions
which rely on a static dependency resolution are the coprocessor control
instructions.
In recent processors dependencies are resolved dynamically, by extra
hardware. Nevertheless, compilers for these processors are assumed to
perform a parallel optimisation by code reordering, in order to increase
performance. Figure 5.3 shows the various possibilities of resolving the
pipeline hazards.
Figure 5.3: Possibilities for Resolving Pipeline Hazards
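Static dependency resolution can be illustrated with a minimal sketch of what a compiler might do when the hardware has no interlocks. The code below is an assumption-laden toy: instructions are modelled as (destination, sources) tuples, `NOP` is a hypothetical no-operation encoding, and `gap` is a made-up number of instruction slots that must separate a producer from its consumer. A real compiler would first try to fill such slots with independent instructions rather than NOPs.

```python
NOP = (None, ())  # hypothetical encoding of a no-operation instruction


def insert_nops(program, gap=2):
    """Pad the program with NOPs so that every consumer issues at least
    `gap` slots after the instruction that produces its source register."""
    scheduled = []
    last_writer = {}  # register name -> slot index of its latest producer
    for dest, sources in program:
        earliest = len(scheduled)
        for src in sources:
            if src in last_writer:
                earliest = max(earliest, last_writer[src] + gap + 1)
        # Fill the distance to the earliest legal slot with NOPs.
        scheduled.extend([NOP] * (earliest - len(scheduled)))
        last_writer[dest] = len(scheduled)
        scheduled.append((dest, tuple(sources)))
    return scheduled


program = [
    ("r1", ("r2", "r3")),  # r1 <- r2 op r3
    ("r4", ("r1", "r5")),  # depends on r1, so it must be delayed
    ("r6", ("r2",)),       # independent of the instructions above
]
for slot, instruction in enumerate(insert_nops(program)):
    print(slot, instruction)
```

With dynamic resolution the same delay would instead be enforced at run time by interlock hardware, and the program would contain no padding.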
Self Assessment Questions
1. The full form of MIPS is ____________________ .
2. In recent processors dependencies are resolved ________________ , by
extra hardware.
Activity 1:
Visit a library and find out the features of the R2000, R3000, R4000, R4200 and
R6000. Compare them in a chart.
5.3 Pipeline Instruction Processing
An instruction pipeline operates on a stream of instructions by decomposing
the three phases (fetch, decode and execute) of the instruction cycle and
overlapping them. It has been extensively used in RISC machines and many
high-end mainframes as one of the major contributors to high performance. A
typical instruction execution in a pipelined architecture consists of the
following sequence of operations:
1. Fetch instruction: In this operation, the next expected instruction is read
into a buffer from cache memory.
2. Decode instruction/register fetch: The instruction decoder takes the
next instruction from the buffer and decodes it: the opcode and the operand
specifiers are determined and the register file is accessed to read the
registers. The decoder may also optimise the order of execution before
sending the instruction on to its destination.
3. Calculate operand address: Now, the effective address of each source
operand is calculated.
4. Fetch operand/memory access: Then, the memory is accessed to fetch
each operand. For a load instruction, data returns from memory and is
placed in the Load Memory Data (LMD) register. If it is a store, then data
from register is written into memory. In both cases, the operand address
as computed in the prior cycle is used.
5. Execute instruction: In this operation, the ALU performs the indicated
operation on the operands prepared in the prior cycle and stores the result
in the specified destination operand location.
6. Write back operand: Finally, the result is written into the register file or
stored into the memory.
These six stages of instruction pipeline are shown in a flowchart in figure 5.4.
Figure 5.4: Flowchart of an Instruction Pipeline
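How the six operations overlap in time can be shown with a small space-time table. The sketch below is illustrative only: the two-letter stage abbreviations are assumptions made for this example (they do not appear in the text), and every stage is assumed to take one cycle with no stalls.

```python
# Assumed abbreviations for the six operations listed above:
# fetch instruction, decode instruction, calculate operand address,
# fetch operand, execute instruction, write back operand.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]


def space_time(num_instructions, stages=STAGES):
    """Return one row per instruction; column c holds the stage that the
    instruction occupies in cycle c, or '--' if it is not in the pipeline."""
    total_cycles = len(stages) + num_instructions - 1
    table = []
    for i in range(num_instructions):
        row = ["--"] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i enters stage s in cycle i + s
        table.append(row)
    return table


for number, row in enumerate(space_time(3), start=1):
    print(f"I{number}: " + " ".join(row))
```

Reading a column from top to bottom shows which instructions are active in the same cycle, each in a different stage.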
Self Assessment Questions
3. In the _______________ operation, the result is written into the register
file or stored into the memory.
4. In Decode Instruction/Register Fetch operation, the ______________
and the _______________ are determined and the register file is
accessed to read the registers.
5.4 Pipelined Execution of Integer and Boolean Instructions
Now let us discuss pipelined execution of integer and Boolean instructions
with respect to the design space.
5.4.1 The design space
In this section, first we will overview the salient aspects of pipelined execution
of FX instructions. (In this section, the abbreviation FX will be used to denote
integer and Boolean.) With reference to figure 5.5 we emphasise two basic
aspects of the design space: how FX pipelines are laid out logically and how
they are implemented.
Figure 5.5: Design Space of the Pipelined Execution of FX instructions
A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in these
stages. These issues will be discussed in Section 5.4.2 for RISC and CISC
pipelines. The other key aspect of the design space is how FX pipelines are
implemented. In this respect we note that the term FX pipeline can be
interpreted in both a broader and a narrower sense. In the broader sense, it
covers the full task of instruction fetch, decode, execute and, if required, write
back. In this case, it is usually also employed for the execution of load/store
(L/S) and branch instructions and is termed a master pipeline. By contrast, in the
narrower sense, an FX pipeline is understood to deal only with the execution
and writeback phases of the processing of FX instructions. Then, the
preceding tasks of instruction fetch, decode and, in the case of superscalar
execution, instruction issue are performed by a separate part of the processor.
5.4.2 Logical layout of FX pipelines
Integer and Boolean instructions account for a considerable proportion of
programs. Together, they amount to 30-40% of all executed instructions.
Therefore, the layout of FX pipelines is fundamental to obtaining a high-
performance processor.
In the following topic, we discuss how FX pipelines are laid out. However, we
describe the FX pipelines for RISC and CISC processors separately, since
each type has a slightly different scope. While processing operate
instructions, RISC pipelines have to cope only with register operands. By
contrast, CISC pipelines must be able to deal with both register and memory
operands as well as destinations.
Pipeline in RISC architecture: Before discussing pipelines in RISC
machines, let us first discuss what a RISC machine is. The term RISC stands
for Reduced Instruction Set Computing. RISC computers reduce chip
complexity by using simpler instructions. As a result, RISC compilers have to
generate software routines to perform complex instructions that would have
been done in hardware by CISC (Complex Instruction Set Computing)
computers. The salient features of RISC architecture are as follows:
• RISC architecture has instructions of uniform length.
• Instruction sets are streamlined to carry efficient and important
instructions.
• Memory addressing method is simplified. The complex references are split
up into various reduced instructions.
• The number of registers is increased. RISC processors can have a
minimum of 16 and a maximum of 64 registers. These registers hold
variables that are frequently used.
Pipelining is a standard feature in RISC processors. A typical RISC processor
pipeline operates in the following steps:
1. Fetch instructions from the memory
2. Read the registers and then decode instruction
3. Either execute instruction or compute the address
4. Access the operand stored at that memory location
5. Write the calculated result into the register
RISC instructions are simpler than the instructions used in CISC
processors, which makes them well suited to pipelining. CISC instructions are
of variable length, while RISC instructions are all of the same length, so a
RISC instruction can be fetched in a single operation. Theoretically, each
stage in a RISC processor should take one clock cycle, so that the processor
completes one instruction every clock cycle. In practice, however, RISC
processors take more than one cycle per instruction on average, because the
processor may sometimes stall due to branch instructions and data
dependencies. A data dependency arises if an instruction has to wait for the
output of a previous instruction. A delay can also occur because an instruction
is waiting for some data which is not currently available in a register. In such
cases the processor cannot finish an instruction in every clock cycle.
Branch instructions are those that tell the processor to decide which
instruction is to be executed next. The decision is generally based on the
result of another instruction. Branches can therefore create problems in a
pipeline if a branch is conditional on the result of an instruction which has
not yet finished its path through the pipeline. In that case also, the processor
takes more than one clock cycle to finish one instruction.
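The gap between the theoretical one cycle per instruction and what a pipeline achieves in practice can be estimated with a simple average. The sketch below assumes that each kind of stall is described by how often it occurs per instruction and how many cycles it costs; the frequencies and penalties used are invented purely for illustration and are not measurements of any processor.

```python
def effective_cpi(stall_events, ideal_cpi=1.0):
    """Average cycles per instruction: the ideal value plus the average
    number of stall cycles contributed by each kind of stall."""
    return ideal_cpi + sum(freq * penalty for freq, penalty in stall_events)


# Hypothetical workload: (stalls per instruction, penalty in cycles)
stalls = [
    (0.20, 1),  # data dependency stalls
    (0.15, 2),  # branch stalls
]
print(f"effective CPI = {effective_cpi(stalls):.2f}")
```

The fewer stalls a pipeline suffers, the closer the result comes to the ideal value of one.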
Pipeline in CISC architecture: CISC is an acronym for Complex Instruction
Set Computer. CISC machines are easy to program and make efficient
use of memory. Since the earliest machines were programmed in assembly
language and memory was slow and expensive, the CISC philosophy was
commonly implemented in computers such as the PDP-11. Most common
microprocessor designs, such as the Intel 80x86 and Motorola 68K series, have
followed the CISC philosophy. CISC instruction sets have the following
main features:
• Two-operand format; here instructions have both source & destination.
• Register to register, memory to register and register to memory
commands.
• Multiple addressing modes for memory, having specialised modes for
indexing through arrays
• Depending upon the addressing mode, the instruction length varies
• Multiple clock cycles required by instructions to execute.
The Intel 80486, a CISC machine, uses a 5-stage pipeline. Here the CPU tries to
maintain one instruction execution per clock cycle. However, this architecture
does not provide the maximum potential performance improvement, for the
following reasons:
• Occurrence of sub-cycles between the initial fetch and the instruction
execution.
• Execution of an instruction waiting for previous instruction output.
• Occurrence of the branch instruction.
5.4.3 Implementation of FX pipelines
Most of today's arithmetic pipelines are designed to perform fixed functions.
These arithmetic/logic units (ALUs) perform fixed-point and floating-point
operations separately. The fixed-point unit is also called the integer unit. The
floating-point unit can be built either as part of the central processor or on a
separate coprocessor. These arithmetic units perform scalar operations
involving one pair of operands at a time. The pipelining in scalar arithmetic
pipelines is controlled by software loops. Vector arithmetic units can be
designed with pipeline hardware directly under firmware or hardwired control.
Scalar and vector arithmetic pipelines differ mainly in the areas of register files
and control mechanisms involved. Vector hardware pipelines are often built as
add-on options to a scalar processor or as an attached processor driven by a
control processor. Both scalar and vector processors are used in modern
supercomputers.
Arithmetic pipeline stages: Depending on the function to be implemented,
different pipeline stages in an arithmetic unit require different hardware logic.
Since all arithmetic operations (such as add, subtract, multiply, divide,
squaring, square rooting, logarithm, etc.) can be implemented with the basic
add and shifting operations, the core arithmetic stages require some form of
hardware to add and to shift. For example, a typical three-stage floating-point
adder includes a first stage for exponent comparison and equalisation, which
is implemented with an integer adder and some shifting logic; a second stage
for fraction addition using a high-speed carry-lookahead adder; and a third
stage for fraction normalisation and exponent readjustment using a shifter and
another addition logic.
Arithmetic or logical shifts can be easily implemented with shift registers.
High-speed addition requires either the use of a carry-propagation adder (CPA),
which adds two numbers and produces an arithmetic sum as shown in
figure 5.6a, or the use of a carry-save adder (CSA) to "add" three input
numbers and produce one sum output and a carry output as exemplified in
figure 5.6b.
(a) An n-bit carry-propagate adder (CPA) which allows either carry
propagation or applies the carry-lookahead technique
e.g. n = 4
      A =   1 0 1 1
  +)  B =   0 1 1 1
      S = 1 0 0 1 0 = A + B (sum)
(b) An n-bit carry-save adder (CSA), where Sb is the bitwise sum of X, Y and Z, and
C is a carry vector generated without carry propagation between digits
e.g. (values of the inputs X, Y and Z not reproduced)
      Sb = 0 1 0 0 0 1 1 (bitwise sum)
  +)  C  = 0 1 1 1 0 1 0 (carry vector)
      S  = 1 0 1 1 1 0 1 = Sb + C = X + Y + Z
Figure 5.6: Distinction between a Carry-propagate Adder (CPA) and a
Carry-save Adder (CSA)
In a CPA, the carries generated in successive digits are allowed to propagate
from the low end to the high end, using either ripple carry propagation or some
carry-lookahead technique. In a CSA, the carries are not allowed to propagate
but instead are saved in a carry vector. In general, an n-bit CSA is specified
as follows: Let X, Y, and Z be three n-bit input numbers, expressed as
$X = (x_{n-1}, x_{n-2}, \ldots, x_1, x_0)$ and so on. The CSA performs bitwise
operations simultaneously on all columns of digits to produce two n-bit output
numbers, denoted as $S^b = (0, S_{n-1}, S_{n-2}, \ldots, S_1, S_0)$ (written Sb
in figure 5.6) and $C = (C_n, C_{n-1}, \ldots, C_1, 0)$. Note that the leading
bit of the bitwise sum $S^b$ is always a 0, and the tail bit of the carry vector
C is always a 0. The input-output relationships are expressed below:
$$S_i = x_i \oplus y_i \oplus z_i$$
$$C_{i+1} = x_i y_i \lor y_i z_i \lor z_i x_i \qquad (5.1)$$
for i = 0, 1, 2, ..., n - 1, where $\oplus$ is the exclusive
OR and $\lor$ is the logical OR operation. Note that the arithmetic sum of three
input numbers, i.e., S = X + Y + Z, is obtained by adding the two output
numbers, i.e., $S = S^b + C$, using a CPA. The CPA and CSAs are used to
implement the pipeline stages of the fixed-point multiply unit described next.
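The CSA relations in equation 5.1 map directly onto bitwise operations on integers. The following sketch is an illustration rather than a hardware description: Python integers stand in for bit vectors, and the three input values are assumptions chosen for this example (with these inputs the outputs coincide with the Sb and C values shown in figure 5.6b).

```python
def carry_save_add(x, y, z):
    """Carry-save addition of three numbers.

    Returns (bitwise_sum, carry_vector) such that
    bitwise_sum + carry_vector == x + y + z.
    """
    bitwise_sum = x ^ y ^ z                         # S_i = x_i XOR y_i XOR z_i
    carry_vector = ((x & y) | (y & z) | (z & x)) << 1  # C_{i+1} = majority(x_i, y_i, z_i)
    return bitwise_sum, carry_vector


# Illustrative inputs (assumed for this example).
x, y, z = 0b001011, 0b010101, 0b111101
sb, c = carry_save_add(x, y, z)
print(f"Sb = {sb:07b}")
print(f"C  = {c:07b}")
print(f"S  = {sb + c:07b}")  # the final '+' plays the role of the CPA
```

No carry travels from one bit position to the next inside `carry_save_add`; carry propagation happens only in the final addition, which is why a CSA level is faster than a CPA.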
Multiply Pipeline Design: Consider as an example the multiplication of two
8-bit integers A x B = P, where P is the 16-bit product. This fixed-point
multiplication can be written as the summation of eight partial products as
shown below: $P = A \times B = P_0 + P_1 + P_2 + \cdots + P_7$, where x and + are
arithmetic multiply and add operations, respectively.
              10110101 = A
  x)          10010011 = B
              10110101 = P0
             101101010 = P1
            0000000000 = P2
           00000000000 = P3
          101101010000 = P4
         0000000000000 = P5
        00000000000000 = P6
  +)   101101010000000 = P7
      0110011111101111 = P
Note that the partial product Pj is obtained by multiplying the multiplicand A by
the jth bit of B and then shifting the result j bits to the left for j = 0, 1, 2, ..., 7.
Thus Pj is (8 + j) bits long with j trailing zeros. The summation of the eight
partial products is done with a Wallace tree of CSAs plus a CPA at the final
stage, as shown in figure 5.7.
Figure 5.7: A Pipeline Unit for Fixed-point Multiplication of 8-bit Integers
The first stage (S1) generates all eight partial products, ranging from 8 bits to
15 bits, simultaneously. The second stage (S2) is made up of two levels of four
CSAs, and it essentially merges eight numbers into four numbers ranging from
13 to 15 bits. The third stage (S3) consists of two CSAs, and it merges four
numbers from S2 into two 16-bit numbers. The final stage (S4) is a CPA, which
adds up the last two numbers to produce the final product P.
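The four stages can be traced in software. The sketch below is a behavioural model under simplifying assumptions: integers stand in for bit vectors, the operands are grouped three at a time in list order (the wiring in figure 5.7 may group them differently), and the function names are illustrative.

```python
def carry_save_add(x, y, z):
    """One CSA: three inputs in, bitwise sum and carry vector out."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1


def multiply_8bit(a, b):
    """Multiply two 8-bit integers with partial products, CSA levels and a CPA.

    Returns (product, number_of_csa_levels).
    """
    # Stage S1: generate the eight partial products Pj = (A * b_j) << j.
    operands = [(a if (b >> j) & 1 else 0) << j for j in range(8)]

    # Stages S2 and S3: each CSA level turns every group of three operands
    # into two; operands that do not fill a group pass through unchanged.
    csa_levels = 0
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3
        next_level = []
        for i in range(0, full, 3):
            next_level.extend(carry_save_add(*operands[i:i + 3]))
        next_level.extend(operands[full:])
        operands = next_level
        csa_levels += 1

    # Stage S4: a carry-propagate addition of the last two numbers.
    return operands[0] + operands[1], csa_levels


product, levels = multiply_8bit(0b10110101, 0b10010011)
print(f"P = {product:016b}, CSA levels = {levels}")
```

With eight partial products the operand count shrinks 8, 6, 4, 3, 2, which corresponds to the two CSA levels (four CSAs) of S2 followed by the two CSAs of S3.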
For a maximum width of 16 bits, the CPA is estimated to need four gate levels
of delay. Each level of the CSA can be implemented with two-gate-level
logic. The delay of the first stage (S1) also involves two gate levels.
Thus all the pipeline stages have an approximately equal amount of delay.
The matching of stage delays is crucial to the determination of the number of
pipeline stages, as well as the clock period. If the delay of the CPA stage can
be further reduced to match that of a single CSA level, then the pipeline can
be divided into six stages with a clock rate twice as fast. The basic concepts
can be extended to operands with a larger number of bits.
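The claim about doubling the clock rate follows from the fact that the slowest stage sets the clock period. A minimal sketch of the arithmetic, using the gate-level delays quoted above and ignoring any latch overhead between stages (a simplifying assumption):

```python
def clock_period(stage_delays):
    """The clock period of a pipeline is set by its slowest stage."""
    return max(stage_delays)


# Delays in gate levels. S2 and S3 each contain two CSA levels of two gate
# levels each; the CPA in S4 needs four gate levels.
four_stage = [2, 4, 4, 4]          # S1, S2, S3, S4
# One CSA level per stage, with a CPA reduced to two gate levels.
six_stage = [2, 2, 2, 2, 2, 2]     # S1, four CSA levels, CPA

speedup = clock_period(four_stage) / clock_period(six_stage)
print("clock period, four stages:", clock_period(four_stage), "gate levels")
print("clock period, six stages: ", clock_period(six_stage), "gate levels")
print("clock rate ratio:", speedup)
```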
Self Assessment Questions
5. While processing operate instructions, RISC pipelines have to cope only
with __________________ .
6. In RISC architecture, instructions are of a uniform length (True/ False).
7. Name two microprocessors which follow the CISC philosophy.
8. ____________ adds two numbers and produces an arithmetic sum.
Activity 2:
Access the internet and find out more about the difference between fixed point
and floating point units.
5.5 Pipelined Processing of Loads and Stores
Now let us study pipelined processing of loads and stores in detail.
5.5.1 Subtasks of load and store processing
Loads and stores are frequent operations, especially in RISC code. While
executing RISC code we can expect to encounter about 25-35% load
instructions and about 10% store instructions. Thus, it is of great importance
to execute load and store instructions effectively. How this can be done is the
topic of this section.
To start with, we summarise the subtasks which have to be performed during
a load or store instruction.
Let us first consider a load instruction. Its execution begins with the
determination of the effective memory address (EA) from where data is to be
fetched. In straightforward cases, like RISC processors, this can be done in
two steps: fetching the referenced address register(s) and calculating the
effective address. However, for CISC processors address calculation may be
a difficult task, requiring multiple subsequent register fetches and address
calculations, as for instance in the case of indexed, post-incremented, relative
addresses. Once the effective address is available, the next step is usually to
forward the effective (virtual) address to the MMU for translation and to access
the data cache. Here, and in the subsequent discussion, we shall not go into
details of whether the referenced cache is physically or virtually addressed,
and thus we neglect the corresponding issues. Furthermore, we assume that
the referenced data is available in the cache and thus it is fetched in one or a
few cycles. Usually, fetched data is made directly available to the requesting
unit, such as the FX or FP unit, through bypassing. Finally, the last subtask to
be performed is writing the accessed data into the specified register.
For a store instruction, the address calculation phase is identical to that already
discussed for loads. However, subsequently both the virtual address and the
data to be stored can be sent out in parallel to the MMU and the cache,
respectively. This concludes the processing of the store instruction. Figure 5.8
shows the subtasks involved in executing load and store instructions.
Figure 5.8: Subtasks of Executing Load and Store Instructions
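The subtasks can be summarised in a few lines of code. The sketch below makes the same simplifications as the text plus a few of its own: the effective address is a base register plus a displacement (the simple RISC case), address translation by the MMU is left out, and the data cache is modelled as a dictionary that always hits. All names are hypothetical.

```python
def effective_address(registers, base, displacement):
    """Fetch the address register and calculate the effective address."""
    return registers[base] + displacement


def execute_load(registers, data_cache, dest, base, displacement):
    """Load: calculate the address, access the cache, write the register."""
    address = effective_address(registers, base, displacement)
    registers[dest] = data_cache[address]
    return address


def execute_store(registers, data_cache, source, base, displacement):
    """Store: calculate the address, then send address and data to the cache."""
    address = effective_address(registers, base, displacement)
    data_cache[address] = registers[source]
    return address


registers = {"r1": 100, "r2": 0, "r3": 42}
data_cache = {104: 7}
execute_load(registers, data_cache, "r2", "r1", 4)    # r2 <- cache[r1 + 4]
execute_store(registers, data_cache, "r3", "r1", 8)   # cache[r1 + 8] <- r3
print(registers, data_cache)
```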
5.5.2 The design space
While considering the design space of pipelined load/store processing we take
into account only one aspect, namely whether load/store operations are
executed sequentially or in parallel with FX instructions (Figure 5.9).
In traditional pipeline implementations, load and store instructions are
processed by the master pipeline. Thus, loads and stores are executed
sequentially with other instructions (Figure 5.9).
[Figure 5.9 panels: "L/S addresses are calculated by the FX pipeline (master
pipeline)" and "L/S is performed by a separate L/S unit (autonomous load/store
unit(s))", each listing example processors with their year of introduction;
annotation: "Performance trend". L/S: Load/Store]
Figure 5.9: Sequential vs. Parallel Execution of Load/Store Instructions
In this case, the required address calculation of a load/store instruction can be
performed by the adder of the execution stage. However, one instruction slot
is needed for each load or store instruction.
A more effective technique for load/store instruction processing is to do it in
parallel with data manipulations (see again Figure 5.9). Obviously, this
approach assumes the existence of an autonomous load/store unit which can
perform address calculations on its own.
Let’s discuss both these techniques in detail.
5.5.3 Sequential consistency of instruction execution
By operating multiple EUs (Execution Units) in parallel, a processor can
finish instruction execution much faster. However, instruction execution
must still maintain sequential consistency. Sequential consistency has
two aspects:
1. Processor consistency - the order in which instructions are completed;
2. Memory consistency - the order in which memory is accessed.
Processor consistency: The phrase processor consistency is used to
indicate the consistency of instruction completion with that of sequential
instruction execution. Superscalar processors exhibit one of two types of
processor consistency, namely weak or strong consistency.
Weak processor consistency means that instructions may complete out of
program order, provided that no data dependencies are violated. Data
dependencies must be detected and resolved during execution.
Strong processor consistency forces instructions to complete in strict program
order. This can be attained through a ROB (reorder buffer). The ROB is a
storage area through which instruction results are written back in program
order.
Memory consistency: Another aspect of superscalar instruction execution
is whether memory accesses are executed in the same order as in a sequential
processor.
Memory consistency is weak if memory accesses may be performed out of order
compared with strict sequential program execution, provided that data
dependencies are not violated. Put simply, weak consistency permits load and
store reordering, as long as memory data dependencies are detected and
resolved.
Memory consistency is strong if memory accesses occur strictly in program
order and load/store reordering is prohibited.
Load and Store reordering
Load and store instructions affect both the processor and the memory. First the
ALU or an address unit computes the addresses, and then the load and store
instructions are executed.
A load can then fetch the memory data from the data cache. A store instruction
can send out its operand once its generated address is available.
A processor that supports weak memory consistency permits memory access
reordering. This is advantageous for the
following three reasons:
1. Permitting load/store bypassing,
2. Making speculative loads or stores feasible
3. Allowing hiding of cache misses.
Load/Store bypassing
Load/store bypassing means that either type of access can bypass the other:
either loads can bypass stores or vice versa, without violating the
memory data dependencies. Allowing loads to bypass stores provides the
advantage of runtime overlapping of loops.
This is accomplished by permitting loads at the beginning of an iteration to
access memory without having to wait until the stores at the end of the previous
iteration have finished. In order to prevent fetching a false data value, a load
can bypass pending stores only if none of the previous stores has the same
target address as the load. However, the addresses of some pending stores may
not yet be available.
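The bypassing condition can be written as a short check. The sketch below is a conservative, non-speculative version and rests on assumptions made for illustration: pending stores are represented only by their target addresses, and an address that has not been computed yet is represented by `None`, in which case the load is simply not allowed to bypass.

```python
def load_may_bypass(load_address, pending_store_addresses):
    """Return True if a load may be performed ahead of all pending stores."""
    for store_address in pending_store_addresses:
        if store_address is None:
            # Address not yet computed: a conflict cannot be ruled out.
            return False
        if store_address == load_address:
            # Same target address: the load must wait for this store's data.
            return False
    return True


print(load_may_bypass(0x100, [0x200, 0x300]))   # no conflict
print(load_may_bypass(0x100, [0x200, 0x100]))   # address clash
print(load_may_bypass(0x100, [0x200, None]))    # unknown store address
```

A speculative implementation would let the load proceed in the last case as well and verify the addresses afterwards.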
Speculative loads
Speculative loads avoid memory access delay. This delay can be caused by
required addresses not yet having been computed or by clashes among the
addresses. Speculative loads must be checked for correctness and, if a
speculation turns out to be wrong, corrective measures must be taken.
Speculative loads are similar to speculative branches.
To check the addresses, the computed target addresses of loads and stores are
written into the ROB (reorder buffer). The address comparison is carried out in
the ROB.
Reorder buffer (ROB)
The ROB was introduced in 1988 as a solution to the precise interrupt problem.
Currently, the ROB is a tool for ensuring sequential consistency of execution
where multiple EUs operate in parallel.
The ROB is a circular buffer with head and tail pointers. Instructions
enter the ROB in program order only. An instruction can be retired only if it
has finished and all previous instructions have already retired.
Sequential consistency can be maintained by directing instructions to update
the program state by writing their results in proper program order into the
memory or referenced architectural register(s). The ROB can successfully support
both interrupt handling and speculative execution.
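The behaviour of the ROB described above can be sketched as a small class. This is a simplified model under stated assumptions: an entry records only an instruction name and a finished flag (a real ROB also holds results, destinations and exception status), and the class and method names are hypothetical.

```python
class ReorderBuffer:
    """Circular buffer: entries are allocated at the tail in program order
    and retired from the head, also in program order."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # index of the oldest instruction
        self.tail = 0    # index of the next free slot
        self.count = 0

    def allocate(self, name):
        if self.count == len(self.entries):
            raise RuntimeError("reorder buffer is full")
        index = self.tail
        self.entries[index] = {"name": name, "finished": False}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return index

    def finish(self, index):
        """Called when an execution unit completes, possibly out of order."""
        self.entries[index]["finished"] = True

    def retire(self):
        """Retire finished instructions from the head, stopping at the first
        unfinished one so that program order is preserved."""
        retired = []
        while self.count and self.entries[self.head]["finished"]:
            retired.append(self.entries[self.head]["name"])
            self.entries[self.head] = None
            self.head = (self.head + 1) % len(self.entries)
            self.count -= 1
        return retired


rob = ReorderBuffer(4)
i1, i2, i3 = rob.allocate("i1"), rob.allocate("i2"), rob.allocate("i3")
rob.finish(i3)            # i3 finishes first
print(rob.retire())       # nothing can retire: i1 is still in flight
rob.finish(i1)
print(rob.retire())       # only i1 retires; i2 blocks i3
rob.finish(i2)
print(rob.retire())       # i2 and i3 retire together, in program order
```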
5.5.4 Instruction Issuing and parallel execution
In this phase execution tuples are created. Once they have been created, it is
decided which execution tuple can be issued next. Checking the availability of
data and resources at run time is known as instruction issuing. Several
pipelines are served from the instruction issue area.
In figure 5.10 you can see a reorder buffer which follows FIFO order.
Figure 5.10: A Reorder Buffer.
In this buffer the entries are received and sent in FIFO order. An instruction
can be executed once its input operands are present. Other instructions may
still be waiting in the instruction issue buffer.
Other constraints are associated with the buffers carrying the execution tuples.
In figure 5.11 you can see the Parallel Execution Schedule (PES) of an iteration.
The hardware resources assumed for this PES are one path to the memory, two
integer units and one branch unit.
[Figure 5.11 layout: columns INTEGER UNIT 1, INTEGER UNIT 2, MEMORY UNIT and
BRANCH UNIT; rows are successive time steps, each listing the instructions
issued to the units in that step]
Figure 5.11: Example of PES
The rows show the time steps and the columns show the operations performed by
each unit in a time step. In this PES the "ble" in the branch unit is not taken,
and the processor is speculatively executing instructions from the
predicted path. In this example renaming is shown only for register r3,
but other registers can also be renamed. The various values assigned to
register r3 are bound to different physical registers (R1, R2, R3, R4).
There are several ways of arranging the instruction issue buffer, listed here
in order of increasing complexity.
Single queue method: Renaming is not needed in the single queue method
because this method has only one queue and no out-of-order issue. In this method
operand availability can be handled through simple reservation bits allotted
to every register. A register is reserved when an instruction that modifies it
is issued, and the reservation is cleared when the modification has finished.
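The reservation-bit scheme can be sketched as follows. The model is deliberately minimal and its details are assumptions: instructions are (destination, sources) tuples, the set `reserved` plays the role of the per-register reservation bits, and only the instruction at the head of the queue is ever considered for issue.

```python
from collections import deque


class SingleQueueIssue:
    """In-order issue from one queue, guarded by register reservation bits."""

    def __init__(self, program):
        self.queue = deque(program)
        self.reserved = set()   # registers whose reservation bit is set

    def try_issue(self):
        """Issue the head instruction if none of its registers is reserved."""
        if not self.queue:
            return None
        dest, sources = self.queue[0]
        if dest in self.reserved or any(s in self.reserved for s in sources):
            return None             # head must wait, so everything behind waits
        self.queue.popleft()
        self.reserved.add(dest)     # reserve the register being modified
        return dest, sources

    def complete(self, dest):
        """Clear the reservation once the register has been written."""
        self.reserved.discard(dest)


issue = SingleQueueIssue([("r1", ("r2",)), ("r3", ("r1",))])
first = issue.try_issue()     # r1 <- f(r2) issues and reserves r1
blocked = issue.try_issue()   # r3 <- f(r1) must wait for r1
issue.complete("r1")
second = issue.try_issue()    # now it can issue
print(first, blocked, second)
```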
Multiple queue method: In the multiple queue method, instructions issue in
order from each queue, but a queue may issue out of order with respect to the
other queues. The individual queues are organised by instruction type.
Reservation stations: With reservation stations, instruction issue does not
follow FIFO order. As a result, the reservation stations have to monitor the
availability of their source operands simultaneously. The conventional
way of doing this is to hold the operand data in the reservation station. When a
reservation station receives an instruction, the operand values that are already
available are read and placed in it.
After that, it compares the operand designators of the data not yet available
with the result designators of completing instructions. If there is
a match, the result value is copied into the matching reservation station.
An instruction is issued once all its operands are ready in the reservation
station. Reservation stations can be partitioned by instruction type to reduce
data paths, or they may be organised as a single block.
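The operand-capture mechanism can be sketched in a few lines. The class below is an illustrative model only: register names serve as operand and result designators, `pending` is an assumed set of registers whose values are still being computed, and the method names are hypothetical.

```python
class ReservationStation:
    """Holds one instruction together with the operand values it has captured."""

    def __init__(self, operation, sources, register_file, pending):
        self.operation = operation
        self.operands = {}      # designator -> captured value
        self.waiting = set()    # designators of operands not yet available
        for reg in sources:
            if reg in pending:
                self.waiting.add(reg)
            else:
                # Available operands are read and stored at dispatch time.
                self.operands[reg] = register_file[reg]

    def snoop_result(self, result_reg, value):
        """Compare a completing instruction's result designator with the
        designators this station is waiting for; capture the value on a match."""
        if result_reg in self.waiting:
            self.operands[result_reg] = value
            self.waiting.discard(result_reg)

    def ready(self):
        """The instruction can be issued once no operand is outstanding."""
        return not self.waiting


register_file = {"r1": 5, "r2": 0}
station = ReservationStation("add", ["r1", "r2"], register_file, pending={"r2"})
ready_before = station.ready()
station.snoop_result("r2", 9)      # the producer of r2 completes
ready_after = station.ready()
print(ready_before, ready_after, station.operands)
```

Because each station checks its own operands, a later instruction whose operands are ready can be issued ahead of an earlier one that is still waiting.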
Self Assessment Questions
9. In traditional pipeline implementations, load and store instructions are
processed by the ___________________ .
10. The consistency of instruction completion with that of sequential
instruction execution is specified by ______________ .
11. Reordering of memory accesses is not allowed by a processor which
endorses weak memory consistency. (True/False)
12. ____________ is not needed in single queue method.
13. In reservation stations, the instruction issue does not follow the FIFO
order. (True/ False).
5.6 Summary
• The design space of pipelines can be subdivided into two aspects:
basic layout of a pipeline and dependency resolution.
• An Instruction pipeline operates on a stream of instructions by
overlapping and decomposing the three phases (fetch, decode and
execute) of the instruction cycle.
• Two basic aspects of the design space are how FX pipelines are laid out
logically and how they are implemented.
• A logical layout of an FX pipeline consists, first, of the specification of how
many stages an FX pipeline has and what tasks are to be performed in
these stages.
• The other key aspect of the design space is how FX pipelines are
implemented.
• In logical layout of FX pipelines, the FX pipelines for RISC and CISC
processors have to be taken separately, since each type has a slightly
different scope.
• Pipelined processing of loads and stores consists of sequential consistency
of instruction execution and parallel execution.
5.7 Glossary
• CISC: It is an acronym for Complex Instruction Set Computer. The CISC
machines are easy to program and make efficient use of memory.
• CPA: It stands for carry-propagation adder which adds two numbers
and produces an arithmetic sum.
• CSA: It stands for carry-save adder which adds three input numbers
and produces one sum output and one carry output.
• LMD: Load Memory Data.
• Load/Store bypassing: It defines that either loads can bypass stores or
vice versa, without violating the memory data dependencies.
• Memory consistency: It is used to find out whether memory access is
performed in the same order as in a sequential processor.
• Processor consistency: It is used to indicate the consistency of
instruction completion with that of sequential instruction execution.
• RISC: It stands for Reduced Instruction Set Computing. RISC
computers reduce chip complexity by using simpler instructions.
• ROB: It stands for Reorder Buffer. ROB is an assurance tool for
sequential consistency execution where multiple EUs operate in parallel.
• Speculative loads: They avoid memory access delay. This delay can be
caused by required addresses that have not yet been computed or by
clashes among the addresses.
• Tomasulo’s algorithm: It allows the replacement of sequential order by
data-flow order.
5.8 Terminal Questions
1. Name the two sub divisions of design space of pipelines and write short
notes on them.
2. What do you mean by pipeline instruction processing?
3. Explain the concept of pipelined execution of Integer and Boolean
instructions.
4. Describe the logical layout of both RISC and CISC computers.
5. Write in brief the process of implementation of FX pipelines.
6. Explain the various subtasks involved in load and store processing
7. Write short notes on:
a. Sequential Consistency of Instruction Execution
b. Instruction Issuing and Parallel Execution
5.9 Answers
Self Assessment Questions
1. Microprocessor without Interlocked Pipeline Stages
2. Dynamically
3. Write Back Operand
4. Opcode, operand specifiers
5. Register operands
6. True
7. Intel 80x86 and Motorola 68K series
8. Carry-propagation adder (CPA)
9. Master pipeline
10. Processor Consistency
11. False
12. Renaming
13. True
Terminal Questions
1. The design space of pipelines can be sub divided into two aspects: basic
layout of a pipeline and dependency resolution. Refer Section 5.2.
2. Pipelined instruction processing is a technique used to increase
instruction throughput. It is used in the design of modern CPUs,
microcontrollers and microprocessors. Refer Section 5.3 for more details.
3. There are two basic aspects of the design space of pipelined execution of
Integer and Boolean instructions: how FX pipelines are laid out logically
and how they are implemented. Refer Section 5.4.
4. While processing operate instructions, RISC pipelines have to cope only
with register operands. By contrast, CISC pipelines must be able to deal
with both register and memory operands as well as destinations. Refer
Section 5.4.
5. Depending on the function to be implemented, different pipeline stages in
an arithmetic unit require different hardware logic. Refer Section 5.4.
6. The execution of load and store instructions begins with the
determination of the effective memory address (EA) from where data is to
be fetched. This can be broken down into subtasks. Refer
Section 5.5.
7. a. The overall instruction execution of a processor should mimic
sequential execution, i.e. it should preserve sequential consistency. Refer
Section 5.5.
b. The first step is to create and buffer execution tuples and then
determine which tuples can be issued for parallel execution. Refer
Section 5.5.
References:
• Hwang, K. (1993) Advanced Computer Architecture. McGraw-Hill.
• Godse D. A. & Godse A. P. (2010). Computer Organisation, Technical
Publications. pp. 3-9.
• Hennessy, John L., Patterson, David A. & Goldberg, David (2002)
Computer Architecture: A Quantitative Approach, (3rd edition), Morgan
Kaufmann.
• Sima, Dezso, Fountain, Terry J. & Kacsuk, Peter (1997) Advanced
computer architectures - a design space approach, Addison-Wesley-
Longman: I-XXIII, 1-766.
E-references:
• http://www.eecg.toronto.edu/~moshovos/ACA06/readings/ieee-
proc.superscalar.pdf
• http://webcache.googleusercontent.com/search?q=cache:yU5nCVnju9
cJ:www.ic.uff.br/~vefr/teaching/lectnotes/AP1-topico3.5.ps.gz+load+
store+sequential+instructions&cd=2&hl=en&ct=clnk&gl=in
Unit 6 Instruction-Level Parallelism and its Exploitation
Structure:
6.1 Introduction
Objectives
6.2 Dynamic Scheduling
Advantages of dynamic scheduling
Limitations of dynamic scheduling
6.3 Overcoming Data Hazards
6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
6.5 High performance Instruction Delivery
Branch target buffer
Advantages of branch target buffer
6.6 Hardware-based Speculation
6.7 Summary
6.8 Glossary
6.9 Terminal Questions
6.10 Answers
6.1 Introduction
In pipelining, two or more instructions that are independent of each other can
overlap. This potential overlap is known as ILP (instruction-level
parallelism), because the instructions can be evaluated in parallel. The
amount of parallelism is quite small in straight-line code, where there are no
branches except at the entry and exit. The simplest and most widely used way
to increase parallelism is to exploit parallelism among the iterations of a
loop. This is termed “loop-level parallelism”.
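As a minimal illustration of loop-level parallelism (the arrays and function
names below are hypothetical, not taken from this unit), consider a loop in
which each iteration touches only its own element. Because no iteration needs
the result of another, the iterations can be overlapped or run in any order:

```python
# Hypothetical data: iteration i reads and writes only element i,
# so no iteration depends on the result of any other iteration.
x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]

def add_in_order(x, y):
    """Run the loop iterations in program order."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + y[i]
    return out

def add_reversed(x, y):
    """Run the same iterations in the opposite order."""
    out = list(x)
    for i in reversed(range(len(out))):
        out[i] = out[i] + y[i]
    return out

forward = add_in_order(x, y)
backward = add_reversed(x, y)
```

Both orders give the same result, which is what makes the iterations
candidates for parallel execution.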
In the previous unit, you studied the design space of pipelines. You studied
various aspects such as pipelined execution of integer and Boolean
instructions and pipelined processing of loads and stores. In this unit, we
will look at how hazards are overcome with dynamic scheduling, together with
examples and the algorithm. We will also examine high performance instruction
delivery and hardware-based speculation.
Objectives:
After studying this unit, you should be able to:
• describe the process of overcoming the data hazards with dynamic
scheduling
• give examples of dynamic scheduling
• describe the Tomasulo approach of dynamic scheduling algorithm
• identify techniques of overcoming data hazards with dynamic scheduling
• analyse the concept of high performance instruction delivery
• explain hardware based speculation
6.2 Dynamic Scheduling
A simple pipeline fetches an instruction and issues it, unless there is a data
dependence between an instruction already in the pipeline and the fetched
instruction that cannot be hidden with bypassing or forwarding. When the data
dependence cannot be hidden, the hazard detection hardware stalls the
instruction pipeline. In this scenario, new instructions are neither fetched
nor issued until the dependence is resolved. To reduce such stalls, the
instructions can be scheduled in advance so that dependent instructions are
separated and fewer actual hazards and resultant stalls occur. This act of
scheduling is termed static scheduling.
There is another category of scheduling known as dynamic scheduling. Dynamic
scheduling is hardware-based scheduling. In this approach, the hardware
rearranges the instruction execution to reduce the stalls, while preserving
the data flow and exception behaviour of the program.
6.2.1 Advantages of dynamic scheduling
There are various advantages of dynamic scheduling. They are as follows:
1. Dynamic scheduling is helpful in situations where the data dependencies
between the instructions are not known during the time of compilation.
2. Dynamic scheduling also helps to simplify the task of the compiler.
3. It permits code compiled with one pipeline in mind to execute efficiently
on some other pipeline.
6.2.2 Limitations of dynamic scheduling
Dynamic scheduling has several limitations:
• The pipelining techniques we have used so far use in-order instruction
issue. This is a major limitation. In-order issue means that if an
instruction is stalled in the pipeline, no later instruction can proceed.
Therefore, when two closely spaced instructions are dependent on each
other, a stall occurs.
• With multiple functional units, these units could sit idle. If an
instruction j depends on a long-running instruction i that is currently
executing in the pipeline, then all instructions following j must be
stalled until i finishes and j can begin execution. For example,
consider this code sequence:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
Here F0, F1, F2, ..., F14 are floating point registers (FPRs) and DIVD, ADDD
and SUBD are floating point operations on double precision values (denoted by
D). The dependence of ADDD on DIVD causes a stall in the pipeline, and thus
the SUBD instruction cannot execute even though it does not depend on either
of them. If instructions did not have to execute in program order, this
limitation could be removed.
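The effect of this stall can be sketched with a small timing model. The
latencies, the one-instruction-per-cycle entry rate and the function name
start_cycles are assumptions made for illustration only; the model tracks
just RAW dependences and ignores structural, WAR and WAW hazards.

```python
# Hypothetical latencies in cycles; the text gives no figures.
LATENCY = {"DIVD": 40, "ADDD": 2, "SUBD": 2}

# (operation, destination register, source registers)
PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F12", ("F8", "F14")),
]

def start_cycles(program, in_order):
    """Return the cycle in which each instruction begins execution."""
    ready = {}       # register -> cycle at which its new value is available
    starts = []
    prev_start = 0
    for index, (op, dest, sources) in enumerate(program):
        earliest = index + 1  # one instruction enters the pipeline per cycle
        operands_ready = max([ready.get(r, 0) for r in sources] + [earliest])
        if in_order:
            # An instruction cannot begin before the one ahead of it has begun.
            start = max(operands_ready, prev_start + 1)
        else:
            start = operands_ready
        ready[dest] = start + LATENCY[op]
        starts.append((op, start))
        prev_start = start
    return starts

in_order_starts = start_cycles(PROGRAM, in_order=True)
out_of_order_starts = start_cycles(PROGRAM, in_order=False)
```

Under these assumptions SUBD waits behind ADDD until cycle 42 with in-order
execution, but can begin in cycle 3 when instructions may start out of order.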
In the DLX pipeline (DLX is a RISC processor architecture), both structural
and data hazards are checked during instruction decode (ID). If an
instruction can execute properly, it is issued from ID. To let the SUBD begin
execution, the issue process must be split into two parts:
• Firstly, we need to check for any structural hazards.
• Secondly, we need to wait for the absence of any data hazard.
Structural hazards can still be checked at issue time. Therefore, in-order
instruction issue is still used. However, an instruction should begin
execution as soon as its data operands are available. The pipeline therefore
executes out of order, which also implies out-of-order completion.
Out-of-order completion, however, complicates exception handling. The
exceptions generated in a dynamically scheduled processor may be imprecise,
because an instruction may have completed before an earlier-issued
instruction raises an exception. In such a scenario, it is difficult to
restart execution after the interrupt.
For carrying out out-of-order execution, we need to necessarily separate the
ID (Instruction Decode) pipe stage into two. These are as follows:
1. Issue - In this stage, the instructions are decoded and a check for
identifying structural hazards is performed.
2. Read operands - In this stage the operands are read after no data hazards
are detected.
The IF (instruction fetch) stage comes before the issue stage. It may fetch
into a single-entry latch or into a queue, from which instructions are then
issued. The EX (Execution) stage follows the read operands stage. Depending
on the complexity of the operation, execution may take several cycles.
Consequently, we must distinguish between the point at which an instruction
begins execution and the point at which it completes execution. Doing so
allows multiple instructions to be in execution at the same time.
Self Assessment Questions
1. The methodology which involves separation of dependent instructions and
minimizes data/structural hazards and consequential stalls is termed as
______________.
2. To commence with the execution of the SUBD, we need to separate the
issue method into 2 parts: firstly ______________ and secondly
______________.
3. ______________ stage precedes the issue phase.
4. The ______________ stage follows the read operands stage, similar
to the DLX pipeline.
6.3 Overcoming Data Hazards
Now let us discuss the methods of overcoming data hazards with dynamic
scheduling in this section.
Dynamic Scheduling with a Scoreboard
In a dynamically scheduled pipeline, all instructions pass through the issue
stage in order (in-order issue); however, they can be stalled or bypass each
other in the second stage (read operands) and thus enter execution out of
order. Scoreboarding is a method of permitting out-of-order instruction
execution when sufficient resources are available and there are no data
dependencies. It is named after the scoreboard of the CDC (Control Data
Corporation) 6600, which developed this capability. (The CDC 6600 was a
mainframe computer manufactured by Control Data Corporation.)
Out-of-order instruction execution may give rise to WAR (Write after Read)
hazards, a type of data hazard that does not occur in the DLX floating point
and integer pipelines.
Let us consider that the SUBD destination is F8 in the earlier example; the
code sequence then becomes:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14
In this example there is an antidependence between ADDD and SUBD: ADDD reads
F8 and SUBD writes it. If SUBD executes and writes F8 before ADDD has read
it, the dependence is violated and ADDD uses the wrong value. Similarly, to
avoid violating output dependencies, it is essential to detect WAW (Write
after Write) data hazards.
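The three kinds of data hazard can be identified mechanically from the
destination and source registers of two instructions. The sketch below is
illustrative only: the tuple layout and the function name hazards are
assumptions, and the SUBD source registers are taken to be unchanged from the
earlier example, with only its destination changed to F8.

```python
def hazards(first, second):
    """Return the data hazards between `first` and a later instruction `second`.

    Each instruction is (operation, destination register, source registers).
    """
    _, dest1, sources1 = first
    _, dest2, sources2 = second
    found = set()
    if dest1 in sources2:
        found.add("RAW")   # second reads what first writes
    if dest2 in sources1:
        found.add("WAR")   # second writes what first reads
    if dest1 == dest2:
        found.add("WAW")   # both write the same register
    return found

DIVD = ("DIVD", "F0", ("F2", "F4"))
ADDD = ("ADDD", "F10", ("F0", "F8"))
SUBD = ("SUBD", "F8", ("F8", "F14"))

divd_addd = hazards(DIVD, ADDD)
addd_subd = hazards(ADDD, SUBD)
divd_subd = hazards(DIVD, SUBD)
```

Here ADDD has a RAW hazard on DIVD (through F0), SUBD has a WAR hazard on
ADDD (through F8), and SUBD is independent of DIVD.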
The scoreboard technique handles both structural and data hazards by stalling
the later instruction involved in the dependence. The scoreboard’s goal is to
maintain an execution rate of one instruction per clock cycle (when no
structural hazards exist) by executing each instruction as early as possible.
Therefore, when an instruction is stalled, other independent instructions may
be issued and executed.
The scoreboard takes full responsibility for issuing and executing
instructions, including all hazard detection. Taking advantage of
out-of-order execution requires several instructions to be in execution
simultaneously. We can achieve this in either of two ways:
1. By utilizing pipelined functional units
2. By using multiple functional units
For the purposes of pipeline control, the two are essentially equivalent.
Here we will consider the use of multiple functional units.
The CDC 6600 had 16 distinct functional units of the following types:
• Four FPUs (floating-point units)
• Five units for memory references
• Seven units for integer operations.
In the DLX scoreboard, the FPUs matter most in comparison with the other FUs
(functional units).
For example, assume that we have 2 multipliers, 1 adder, 1 divide unit, and
1 integer unit for all integer operations, memory references and branches.
The methodology for DLX and the CDC 6600 is quite similar, as both are
load-store architectures. Figure 6.1 shows the basic structure of a DLX
processor with a scoreboard.
Figure 6.1: The Basic Structure of a DLX Processor with a Scoreboard
Every instruction goes through four execution steps (considering FP
operations only). Let us now analyse how the scoreboard keeps the information
it needs to decide when an instruction can move from one step to the next.
Figure 6.2 shows these steps.
Figure 6.2: Steps Replaced in the Standard DLX Pipeline
Now let us study the four steps in the scoreboard technique in detail.
1. Issue: The issue step replaces a part of the ID step of the DLX pipeline.
In this step the instruction is forwarded to the FU and the scoreboard’s
internal data structure is updated. This is done only if both of the
following hold:
• The FU for the instruction is free.
• No other active instruction has the same register as destination. This
ensures that the operation is free from WAW (Write after Write) data
hazards.
When a structural or WAW hazard is detected, a stall occurs and no further
instructions are issued until the hazard is cleared. When a stall occurs
in this stage, the buffer between instruction fetch and issue fills up. If
the buffer holds a single instruction, instruction fetch stalls at once;
if the buffer is a queue, fetch stalls only once the queue is full.
2. Read operands: The scoreboard checks whether the source operands are
available. A source operand is available when no earlier-issued active
instruction is going to write it. As soon as the source operands are
available, the scoreboard tells the FU to read the operands from the
registers and begin execution. Read after Write (RAW) hazards are resolved
dynamically in this step, and instructions may be sent into execution out
of order. The issue and read operands steps together complete the function
of the ID step of the DLX pipeline.
3. Execution: After receiving the operands, the FU starts execution. When the
result is ready, the FU notifies the scoreboard that execution is
complete. This step replaces the EX step of the DLX pipeline and may take
multiple cycles.
4. Write result: After the FU completes execution, the scoreboard checks for
WAR hazards and, if one is detected, stalls the completing instruction. A
WAR hazard occurs in a code sequence like our earlier example of ADDD and
SUBD, where both use F8. The code for that example is shown again below:
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F8, F8, F14
Here the source operand F8 of ADDD is the same register as the destination of
SUBD. ADDD, however, depends on the previous instruction DIVD and cannot read
its operands until DIVD completes. In this case, the scoreboard stalls SUBD
in its write result step until ADDD has read its operands.
In general, a completing instruction is not permitted to write its result
when both of the following hold:
• there is an instruction that precedes the completing instruction (in
issue order) and has not yet read its operands, and
• one of that instruction’s operands is the same register as the result of
the completing instruction.
Once any WAR hazard has cleared, the scoreboard tells the FU to store its
result in the destination register. This step replaces the WB step of the DLX
pipeline.
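The issue and write result checks described above can be sketched as two
small predicates. This is a simplified illustration, not the CDC 6600 or DLX
control logic: the class ActiveInstruction, the function names and the unit
names "Add1" and "Add2" (a second adder is assumed so that ADDD and SUBD can
both be active) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ActiveInstruction:
    op: str
    unit: str            # functional unit the instruction occupies
    dest: str            # destination register
    sources: tuple       # source registers
    operands_read: bool  # True once the read operands step is done

def can_issue(unit, dest, active):
    """Issue step: the unit must be free and no active instruction may
    already have `dest` as its destination (WAW check)."""
    unit_free = all(a.unit != unit for a in active)
    no_waw = all(a.dest != dest for a in active)
    return unit_free and no_waw

def can_write_result(completing, active):
    """Write result step: stall while an earlier-issued instruction still
    has to read a register equal to the completing destination (WAR check).
    `active` lists the active instructions in issue order."""
    for earlier in active:
        if earlier is completing:
            break  # only instructions issued before `completing` matter
        if not earlier.operands_read and completing.dest in earlier.sources:
            return False
    return True

divd = ActiveInstruction("DIVD", "Divide", "F0", ("F2", "F4"), True)
addd = ActiveInstruction("ADDD", "Add1", "F10", ("F0", "F8"), False)
subd = ActiveInstruction("SUBD", "Add2", "F8", ("F8", "F14"), True)
active = [divd, addd, subd]

subd_may_write_before = can_write_result(subd, active)
addd.operands_read = True   # ADDD reads F0 and F8 once DIVD has written F0
subd_may_write_after = can_write_result(subd, active)
```

With these assumptions SUBD is held in its write result step until ADDD has
read F8, and an instruction targeting F10 could not issue while ADDD is
active.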
The DLX scoreboard tracks these functional units. Figure 6.3 shows what the
scoreboard’s information looks like partway through the execution of this
simple sequence of instructions:
LD F6, 34(R2)
LD F2, 45(R3)
MULTD F0, F2, F4
SUBD F8, F6, F2
DIVD F10, F0, F6
ADDD F6, F8, F2
The scoreboard shows three types of status. These are:
1. Instruction status: This shows which of the four steps the instruction is
currently in.
2. Functional unit status: This shows the status of each FU (functional
unit). There are nine fields for every FU:
• Busy: whether the unit is busy or not
• Op: the operation to perform in the unit
• Fi: the destination register
• Fj, Fk: the source registers
• Qj, Qk: the FUs that will produce the source registers Fj and Fk
• Rj, Rk: flags that indicate whether Fj and Fk are ready
3. Register result status: This shows which FU will write each register, if
an active instruction has that register as its destination. The field is
left blank when no pending instruction will write the register.
Instruction status
Instruction         Issue   Read operands   Execution complete   Write result
LD    F6,34(R2)     √       √               √                    √
LD    F2,45(R3)     √       √               √
MULTD F0,F2,F4      √
SUBD  F8,F6,F2      √
DIVD  F10,F0,F6     √
ADDD  F6,F8,F2

Functional unit status
Name      Busy   Op     Fi    Fj    Fk    Qj        Qk        Rj    Rk
Integer   Yes    Load   F2    R3                              No
Mult1     Yes    Mult   F0    F2    F4    Integer             No    Yes
Mult2     No
Add       Yes    Sub    F8    F6    F2              Integer   Yes   No
Divide    Yes    Div    F10   F0    F6    Mult1               No    Yes

Register result status
      F0      F2        F4   F6   F8    F10      F12   ...   F30
FU    Mult1   Integer             Add   Divide
Figure 6.3: Components of the Scoreboard
Self Assessment Questions
5. When the pipeline executes ____________________ before ADDD, it
violates the interdependence between instructions leading to wrong
execution.
6. The objective of scoreboard is achieved with _________________ or
______________________ functional units or both.
7. The source operand for ADDD is ______________ , and is similar to
destination register of SUBD.
8. The FU status ___________________ shows whether it is busy or
idle.
6.4 Dynamic Scheduling Algorithm - The Tomasulo Approach
This dynamic scheduling algorithm was proposed by Robert Tomasulo.
Tomasulo’s scheme combines the key elements of the scoreboard methodology
with the introduction of register renaming. This scheme has many variants.
The basic idea behind this algorithm is “Avoiding WAR and WAW data hazards by
use of renaming registers”.
The Tomasulo algorithm
It was devised for the IBM 360/91 in 1967, approximately three years after
the CDC 6600. Here the algorithm is described with a focus on the
floating-point unit, in the context of a pipelined FPU for DLX. The key
distinction between DLX and the IBM 360 is that the IBM 360 has
register-memory instructions.
Because Tomasulo’s algorithm makes use of a load FU, no significant changes
are needed to add register-memory addressing modes; the main addition is
another bus. The IBM 360/91 also had pipelined FUs rather than multiple FUs.
The only difference is that a pipelined FU can start at most one operation
per clock cycle. As this is not a fundamental difference, the algorithm can
be described as if there were multiple FUs. The IBM 360/91 could hold 3
operations for the FP (floating-point) adder and 2 for the FP multiplier.
In addition, up to 6 FP loads, or memory references, and up to 3 FP stores
could be outstanding. Load data buffers and store data buffers are used for
this.
There are various differences between Tomasulo’s scheme and
scoreboarding. These are given below:
• In Tomasulo’s scheme, the control and buffers are distributed among the
FUs (Functional Units), whereas they are centralised in the scoreboard
technique.
• In Tomasulo’s scheme, register renaming is used to avoid WAR and WAW
hazards; no register renaming is done in the scoreboard technique.
• In Tomasulo’s scheme, the CDB (Common Data Bus) broadcasts the results to
all FUs, whereas the scoreboard technique writes the results into the
registers.
• The Tomasulo algorithm reads operands from the registers and the CDB and
writes results to the CDB only, whereas in the scoreboard technique
operands are read from and results written to the registers.
• In Tomasulo’s scheme, an instruction can issue only when an RS
(Reservation Station) is free, whereas in the scoreboard technique it can
issue when the FU is free.
Figure 6.4 shows the basic structure of a Tomasulo-based floating-point unit
for DLX.
Figure 6.4: Basic Structure of a DLX Floating-Point Unit using Tomasulo’s
Algorithm
The reservation station contains the following:
• Issued instructions which are waiting for execution by the FU,
• operands for the instructions which have already been worked out (else
the source of the operands),
• Information required to handle the instruction after it has started execution.
The load buffers and store buffers hold the data or addresses that come from
or go to memory. A pair of buses connects the FP registers to the FUs, and a
single bus connects the FP registers to the store buffers. The common data
bus carries all results from the FUs and from memory everywhere except to the
load buffers. The buffers and RSs (reservation stations) contain tag fields
that are used for hazard control.
Tomasulo’s scheme is appealing when designers must pipeline an architecture
for which code is hard to schedule or which has a shortage of registers.
Weighed against its cost, however, the advantage of the Tomasulo approach
over compiler scheduling for an efficient single-issue pipeline is small. But
with the increasing demand for issue capability and for better performance
on difficult-to-schedule code, dynamic scheduling and register renaming are
becoming more widespread.
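The effect of register renaming on WAR and WAW hazards can be sketched as
follows. This shows only the naming idea, not Tomasulo’s hardware: the names
T1, T2, ... are hypothetical stand-ins for reservation station tags, the
function rename is an assumption, and the code sequence is the earlier
DIVD/ADDD/SUBD example with SUBD writing F8.

```python
from itertools import count

def rename(program):
    """Give every result a fresh name and point later readers at it."""
    latest = {}                              # register -> name of latest value
    fresh = (f"T{n}" for n in count(1))      # hypothetical tag names
    renamed = []
    for op, dest, sources in program:
        # Sources use the newest name of each register, which keeps RAW links.
        new_sources = tuple(latest.get(r, r) for r in sources)
        new_dest = next(fresh)
        latest[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed

PROGRAM = [
    ("DIVD", "F0", ("F2", "F4")),
    ("ADDD", "F10", ("F0", "F8")),
    ("SUBD", "F8", ("F8", "F14")),
]
RENAMED = rename(PROGRAM)
```

After renaming, ADDD still waits for DIVD through T1 (the true dependence),
but SUBD writes T3 rather than F8, so it no longer conflicts with ADDD’s read
of F8.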
Self Assessment Questions
9. Tomasulo’s scheme was invented by _____________ .
10. The ________________ could hold 3 operations for the FP adder and
2 for the FP multiplier.
11. The ____________ and _____________ are used to store the data/
addresses that come from or go to memory.
Activity 1:
Imagine yourself as a computer architect. Explain the measures you will take
to overcome data hazards with dynamic scheduling.
6.5 High Performance Instruction Delivery
In the MIPS 5-stage pipeline, the address of the next instruction to fetch
must be known by the end of the current Instruction Fetch (IF) cycle.
Consequently, for a zero branch penalty, we must know whether the fetched (as
yet undecoded) instruction is a branch and, if it is, what the next PC
(Program Counter) should be. This is accomplished by introducing a cache
which stores the predicted address of the instruction that follows a branch.
This cache is known as the Branch-Target Cache or Branch-Target Buffer (BTB).
By contrast, the branch-prediction buffer is accessed during the ID stage, so
the branch-target address needed to fetch the next predicted instruction is
known only at the end of the ID stage. This is shown in figure 6.5.
Figure 6.5: Branch Prediction
6.5.1 Branch target buffer
Branch Target Buffer has three fields:
• Lookup: Addresses of the known branch instructions (predicted as
taken)
• Predicted PC: The PC of the instruction to fetch when the branch is
predicted taken
• Prediction State: Optional extra prediction state bits
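A minimal sketch of how such a buffer might be used during instruction fetch
is given below. The class and method names are hypothetical, the 4-byte
instruction size is an assumption, and the optional prediction state bits are
left out: a branch is simply kept in the buffer while it was last taken.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the address of a known taken branch to its predicted PC."""

    def __init__(self):
        self.entries = {}   # lookup address -> predicted PC

    def predict_next_pc(self, pc, instruction_size=4):
        # Hit: fetch from the predicted target.
        # Miss: assume not a taken branch and fall through.
        return self.entries.get(pc, pc + instruction_size)

    def update(self, pc, taken, target):
        # Called once the branch outcome is known.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
miss_prediction = btb.predict_next_pc(0x100)   # unknown branch: fall through
btb.update(0x100, taken=True, target=0x200)    # branch at 0x100 was taken
hit_prediction = btb.predict_next_pc(0x100)    # now predicted taken
```

The lookup uses only the fetch address, which is why the prediction is
available before the instruction has been decoded.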
The Branch Target Buffer has the following complications:
• A complication arises in using a 2-bit predictor because it needs
information for both taken and not-taken branches
• This complication is resolved in PowerPC processors by using both a
target buffer and a prediction buffer
The penalty can be calculated by looking at the probability of two events:
(i) Branch predicted taken but ends up not taken
= % buffer hit rate × % incorrect prediction
= 0.95 × 0.1 = 0.095
(ii) Branch is taken but is not found in buffer
= % incorrect prediction = 0.1
The penalty in both cases is 2 cycles, therefore,
Branch Penalty = (0.095 + 0.1) × 2 = 0.195 × 2 = 0.39 cycles
Example:
Consider a branch-target buffer implemented for conditional branches only, in
a pipelined processor.
Assuming that:
• Misprediction penalty = 4 cycles
• Buffer miss penalty = 3 cycles
• Hit rate and accuracy each = 90%
• Branch frequency = 15%
The speedup with a Branch Target Buffer versus no BTB is expressed as:
Speedup = CPI_noBTB / CPI_BTB
= (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
The stalls are determined as:
Stalls = Σ Frequency × Penalty
That is, the stalls are the sum over all stall cases of the frequency of the
case multiplied by its stall penalty.
i) Stalls_noBTB = 0.15 × 2 = 0.30
ii) To find Stalls_BTB, we have to consider each possible outcome from the
BTB. There exist three possibilities:
a) Branch misses the BTB:
Frequency = 15% × 0.1 = 1.5% = 0.015
Penalty = 3
Stalls = 0.045
b) Branch hits and is correctly predicted:
Frequency = 15% × 0.9 (hit) × 0.9 (correct prediction) = 12.15% = 0.1215
Penalty = 0
Stalls = 0
c) Branch hits but is incorrectly predicted:
Frequency = 15% × 0.9 (hit) × 0.1 (misprediction) = 1.35% = 0.0135
Penalty = 4
Stalls = 0.054
iii) Stalls_BTB = 0.045 + 0 + 0.054 = 0.099
Speedup = (CPI_base + Stalls_noBTB) / (CPI_base + Stalls_BTB)
= (1.0 + 0.3) / (1.0 + 0.099)
= 1.18, or approximately 1.2
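The calculation above can be checked with a few lines of code. This is only a
restatement of the worked example: the constant names are chosen here for
illustration, the 2-cycle penalty without a BTB and the base CPI of 1.0 are
the values used in the calculation, and intermediate frequencies are carried
without rounding.

```python
BRANCH_FREQUENCY = 0.15
HIT_RATE = 0.90
ACCURACY = 0.90
MISPREDICTION_PENALTY = 4   # cycles
BUFFER_MISS_PENALTY = 3     # cycles
NO_BTB_PENALTY = 2          # cycles, as used in the no-BTB stall line
CPI_BASE = 1.0

stalls_no_btb = BRANCH_FREQUENCY * NO_BTB_PENALTY

# (frequency, penalty) for each of the three BTB outcomes
cases = [
    (BRANCH_FREQUENCY * (1 - HIT_RATE), BUFFER_MISS_PENALTY),                   # miss
    (BRANCH_FREQUENCY * HIT_RATE * ACCURACY, 0),                                # hit, correct
    (BRANCH_FREQUENCY * HIT_RATE * (1 - ACCURACY), MISPREDICTION_PENALTY),      # hit, wrong
]
stalls_btb = sum(frequency * penalty for frequency, penalty in cases)

speedup = (CPI_BASE + stalls_no_btb) / (CPI_BASE + stalls_btb)
print(round(stalls_btb, 3), round(speedup, 2))
```

Changing the hit rate or the accuracy in this sketch shows how strongly the
speedup depends on the quality of the prediction.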
To deliver instructions at a higher rate, one possible variation on the
Branch Target Buffer is:
• To store one or more target instructions, instead of or in addition to
the predicted target address
6.5.2 Advantages of branch target buffer
This variation has several advantages. They are as follows:
• The BTB access may take longer than the time between consecutive
instruction fetches, which possibly allows a larger BTB
• Buffering the actual target instructions allows Branch Folding, i.e.,
zero-cycle unconditional branches and sometimes zero-cycle conditional
branches
Self Assessment Questions
12. The branch-prediction buffer is accessed during the _____ stage.
13. The _____ field helps check the addresses of the known branch
instructions.
14. Buffering the actual Target-Instructions allow ___________ .
6.6 Hardware-Based Speculation
Hardware-based speculation is a methodology for reducing the effects of
control dependencies in multiple-issue processors.
This methodology is based upon three basic ideas:
• Dynamic branch prediction, which chooses which instructions are to be
executed,
• Speculation, which permits instructions to execute before the control
dependencies are resolved, and
• Dynamic scheduling, which deals with the scheduling of different
combinations of basic blocks.
A processor that combines branch prediction with dynamic scheduling fetches
and issues instructions as if the branch predictions were always correct.
Hardware-based speculation makes use of dynamic data dependences to
select when to carry out instructions. This technique of executing programs is
basically a data-flow execution: operations carry out as soon as their operands
are accessible. This can be seen in figure 6.6.
Figure 6.6: Hardware-based Speculation
Advantages of hardware-based speculation
Some of the major advantages of hardware-based speculation (in comparison
to software-based speculation) are as follows:
1. Hardware-based speculation can disambiguate memory references at run
time, which matters most when pointers are involved. This allows loads to
be moved past stores at run time.
2. Hardware-based speculation is more beneficial whenever hardware-based
branch prediction is superior to software-based branch prediction
performed at compile time. This is true for many integer programs.
For instance, a profile-based static predictor has a misprediction rate of
about 16% for four of five integer SPEC programs, while a hardware
predictor has a misprediction rate of about 11%. Because speculated
instructions may slow down the computation when the prediction is wrong,
this difference is significant.
3. Hardware-based speculation maintains a completely precise exception
model, even for speculated instructions.
4. Hardware-based speculation requires neither compensation code nor
bookkeeping code.
5. Hardware-based speculation with dynamic scheduling does not need
different code sequences to achieve good performance on different
implementations of an architecture. Compiler-based speculation and
scheduling, by contrast, often need code sequences tailored to the
machine, and older or different code sequences may perform poorly.
6. Although hardware speculation and scheduling can benefit from scheduling
and tuning code for a specific processor, hardware-based approaches are
expected to work well even with older or different code sequences.
Although this benefit is the most difficult to measure, it may be the most
significant in the end.
Self Assessment Questions
15. __________ makes use of dynamic data dependences to select
when to carry out instructions.
16. Hardware-based speculation is more beneficial whenever the
hardware-based branch prediction is superior to
___________________ performed at time of compilation.
17. Hardware-based speculation helps to maintain an entirely accurate
exception model for __________ .
18. Hardware-based speculation demands neither ________________ nor
________________ code.
Activity 2:
In a computer designing situation, discuss why software-based speculation
is not superior.
6.7 Summary
Let us recapitulate the important concepts discussed in this unit:
• In pipelining, the execution of instructions that are independent of one
another can overlap. This possible overlap is known as instruction-level
parallelism (ILP).
• A simple pipeline fetches an instruction and issues it unless a data
dependence that cannot be hidden forces a stall.
• In DLX pipelining, all the structural and data hazards are checked during
instruction decode (ID).
• Dynamic scheduling is hardware-based scheduling. In this approach, the
hardware rearranges the instruction execution to reduce the stalls.
• Scoreboarding is a method of permitting out-of-order instruction execution
when sufficient resources are available and there are no data
dependencies. There are four steps in this technique.
• The Tomasulo algorithm is described here with a focus on the FP unit, in
the context of a pipelined FPU for DLX.
• The data or addresses that come from or go to memory are held in the load
buffers and store buffers.
• Hardware-based speculation makes use of dynamic data dependences to
select when to carry out instructions.
6.8 Glossary
• Dynamic scheduling: Hardware based scheduling that rearranges the
instruction execution to reduce the stalls.
• EX: Execution stage
• FP: Floating point
• ID: Instruction Decode
• ILP: Instruction-Level Parallelism
• Instruction-level parallelism: Overlapped execution of instructions that
are independent of one another
• Static scheduling: Separating dependent instructions and minimising the
number of actual hazards and resultant stalls.
6.9 Terminal Questions
1. What do you understand by instruction-level parallelism? Also, explain
loop-level parallelism.
2. Describe the concept of dynamic scheduling.
3. How does the execution of instructions take place under dynamic
scheduling with scoreboarding?
4. What is the goal of scoreboarding?
5. Explain the Tomasulo approach.
6.10 Answers
Self Assessment Questions
1. Static scheduling
2. Check the structural hazards, wait for the absence of data hazards
3. An instruction fetch
4. EX
5. SUBD, ADDD
6. Pipelined, multiple
7. F8
8. Busy
9. Robert Tomasulo
10. IBM 360/91
11. Load buffers, store buffers
12. ID
13. Lookup
14. Branch Folding
15. Hardware-based speculation
16. Software-based branch prediction
17. Speculated instructions
18. Compensation, bookkeeping code
Terminal Questions
1. In pipelining, the execution of instructions that are independent of one
another can overlap. This potential overlap is known as instruction-level
parallelism (ILP). Refer Section 6.1.
2. In dynamic scheduling, instructions can be executed out of program order
without hampering the result. Refer Section 6.2.
3. Dynamic scheduling is hardware-based scheduling. In this approach,
the hardware rearranges the instruction execution to reduce the stalls.
Refer Section 6.3.
4. The objective of a scoreboard is to maintain an execution rate of one
instruction per clock cycle (when there are no structural hazards) by
executing an instruction as early as possible. Refer Section 6.3.
5. Robert Tomasulo proposed this technique, and it is therefore named after
him. In this approach, the key elements of the scoreboarding
scheme are combined with the introduction of register renaming. Refer
Section 6.4.
Unit 7 Exploiting Instruction-Level Parallelism with
Software Approach
Structure:
7.1 Introduction
Objectives
7.2 Types of Branches
Unconditional branch
Conditional branch
7.3 Branch Handling
7.4 Delayed Branching
7.5 Branch Processing
7.6 Branch Prediction
Fixed branch prediction
Static branch prediction
Dynamic branch prediction
7.7 The Intel IA-64 Architecture and Itanium Processor
7.8 ILP in the Embedded and Mobile Markets
7.9 Summary
7.10 Glossary
7.11 Terminal Questions
7.12 Answers
7.1 Introduction
In the previous unit, you studied instruction-level parallelism and its dynamic
exploitation. You learnt how to overcome data hazards with dynamic
scheduling, and studied high-performance instruction delivery and
hardware-based speculation.
As mentioned in the previous unit, an inherent property of a sequence of
instructions is that some of them can be executed in parallel. This property
is known as instruction-level parallelism (ILP). There is an upper bound on
how much parallelism can be achieved. We can approach this upper bound
via a series of transformations that either expose ILP or allow more ILP to be
exposed by later transformations. The best way to exploit ILP is to have a
collection of transformations that operate on or across program blocks, either
producing “faster code” or exposing more ILP. In this unit, you will study the
software approach to exploiting instruction-level parallelism. You will also learn
about concepts such as types of branches, branch handling, delayed
branching, branch processing, and branch prediction. Besides these, we
will discuss the Intel IA-64 architecture and the Itanium processor. We will
conclude this unit by discussing ILP in the embedded and mobile markets.
Objectives:
After studying this unit, you should be able to:
• identify the various types of branches
• explain the concept of branch handling
• describe the role of delayed branching
• recognise branch processing
• discuss the process of branch prediction
• explain Intel IA-64 architecture and Itanium processor
• discuss the use of ILP in the embedded and mobile markets
7.2 Types of Branches
Branching is implemented by means of a branch instruction. The
address of the target instruction is included in the branch instruction. In
some processors (for example, the Pentium), this instruction is also called a
jump instruction.
The different types of branches are:
• unconditional
• conditional
In both unconditional and conditional branches, the mechanism for
transferring control is similar. This is shown in figure 7.1 below:
Figure 7.1: Control Flow in Branching
7.2.1 Unconditional Branch
This is the simplest type of branch. It transfers
control to a specified target. For example:
branch target
The target address can be specified in either of the following ways:
• absolute address
or
• PC-relative address
With an absolute address, the actual address of the target instruction is
specified. With a PC-relative address, the address of the target instruction is
specified relative to the contents of the PC. Many processors support
absolute addresses for unconditional branches. Others support both
formats. For instance, MIPS processors support an
absolute address-based branch by means of
j target
and a PC-relative unconditional branch by means of
b target
Strictly speaking, the latter is an assembly language instruction;
only the j instruction is supported by the processor. The PowerPC allows
every branch instruction to use either an absolute or a PC-relative address.
The instruction encoding includes a bit,
called the AA (absolute address) bit, which specifies the address type.
If AA is equal to 1, the address is treated as an absolute address; otherwise,
it is treated as a PC-relative address.
When an absolute address is used, the processor transfers control by simply
loading the specified target address into the PC register. When PC-relative
addressing is used, the specified target offset is added to the
contents of the PC, and the result is placed in the PC. In both cases, because
the PC holds the address of the next instruction, the processor fetches the
instruction at the intended target address. The major benefit of using a
PC-relative address is that the code can be moved from one block of memory to
another without changing the target addresses. Such code is called
relocatable code; it is not possible with absolute addresses.
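The difference between the two addressing methods, and the reason why only PC-relative code can be moved freely, can be illustrated with a minimal Python sketch. The function names and all the addresses used here are hypothetical and chosen only for illustration; they do not describe any particular processor.

```python
def next_pc_absolute(pc, target_address):
    """Absolute addressing: the target address is loaded straight into the PC."""
    return target_address


def next_pc_relative(pc, offset):
    """PC-relative addressing: the offset is added to the contents of the PC."""
    return pc + offset


# Hypothetical layout: the block is loaded at OLD_BASE, the PC points 4 bytes
# into it when the branch executes, and the target lies 0x20 bytes into it.
OLD_BASE, NEW_BASE = 0x1000, 0x8000
target = OLD_BASE + 0x20              # operand of an absolute branch
offset = target - (OLD_BASE + 0x04)   # operand of a PC-relative branch

# Move the block to NEW_BASE without changing the operands in the instructions.
moved_pc = NEW_BASE + 0x04
moved_target = NEW_BASE + 0x20

relative_ok = next_pc_relative(moved_pc, offset) == moved_target
absolute_ok = next_pc_absolute(moved_pc, target) == moved_target
print(relative_ok, absolute_ok)   # True False
```

After the move, the PC-relative branch still reaches its target, whereas the absolute branch still points into the old block and would have to be patched.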
7.2.2 Conditional Branch
Here the jump is performed only if a specified condition is satisfied.
For instance, a branch may be needed when two values are equal.
Conditional branches can be handled in either of the following
two basic ways:
• Set-then-Jump: This design separates testing for the condition from
branching. A condition code register is used to
communicate between the instruction that tests the condition and the
branch instruction. The Pentium follows this design and uses its flags
register to record the outcome of the test. To test the condition, the cmp
(compare) instruction is used. This instruction sets several flag bits
that indicate the relationship between the two compared values. For
instance, consider the zero bit. This bit is set when the two values are
the same. A conditional jump instruction can then be used to jump to
the target location if the zero bit is set. This sequence is illustrated by the
following code segment, which compares the values in registers AX and
BX:
cmp AX,BX ;compare the two values in AX and BX
je target ;if equal, transfer control to target
sub AX,BX ;if not, this instruction is executed
target:
add AX,BX ;control is transferred here if AX = BX
...
Here je is the jump-if-equal instruction, which transfers control to
target when the two values in registers AX and BX
are equal.
• Test-and-Jump: Many processors combine testing and
branching into a single instruction. We use the MIPS processor to
illustrate the principle behind this approach. MIPS provides several
branch instructions that both test and branch. The branch on equal
instruction is shown below:
beq Rsrc1,Rsrc2,target
This conditional branch instruction tests the
contents of the two registers Rsrc1 and Rsrc2 for
equality and transfers control to the target if they are
equal. Suppose that the numbers to be compared are
in registers $t0 and $t1. The branch instruction is then
written as:
beq $t1,$t0,target
This single instruction replaces the two-instruction cmp/je
sequence used by the Pentium.
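The two designs can be contrasted with a small Python model. This is only a sketch: the flags dictionary stands in for a condition code register, and the function names cmp, je and beq simply mirror the instruction names used above; none of this models a real processor in detail.

```python
flags = {"zero": False}   # stand-in for a condition code (flags) register


def cmp(a, b):
    """Set-then-jump, step 1: compare and record the result in the flags."""
    flags["zero"] = (a - b) == 0


def je():
    """Set-then-jump, step 2: branch if the zero flag was set by cmp."""
    return flags["zero"]


def beq(rsrc1, rsrc2):
    """Test-and-jump: compare and branch in a single instruction."""
    return rsrc1 == rsrc2


for ax, bx in [(5, 5), (5, 7)]:
    cmp(ax, bx)                       # two instructions: cmp, then je
    print(ax, bx, je(), beq(ax, bx))  # both designs take the same decision
```

Both designs reach the same branch decision; the difference is that set-then-jump needs a shared register to pass the result from one instruction to the next.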
Some processors maintain registers that record
the condition of arithmetic and logical operations. These registers are called
condition code registers.
They record the status of the last arithmetic or logical operation.
For instance, when two 32-bit integers are added, the sum might
need more than 32 bits. This is an overflow condition that the system should
record. Usually, the overflow condition is indicated by setting a bit in the
condition code register. The MIPS, for example, does not use
condition registers; instead, it uses exceptions to flag the overflow condition.
On the other hand, processors such as the Pentium, SPARC, and PowerPC
use condition registers. In the Pentium, this information is
recorded in the flags register. In the PowerPC, the XER register keeps
this information. SPARC uses a condition code register. Many instruction
sets provide branches based on comparisons to zero.
SPARC and MIPS are examples of processors that provide this
kind of branch instruction. Highly pipelined RISC processors
support what is called delayed branch execution. Refer to figure 7.1 to
see the difference between delayed and normal branch execution. In normal
branch execution, the branch instruction transfers control to the target
immediately. The Pentium, for instance, uses this kind of branching. In delayed
branch execution, control is transferred to the target only after the
instruction that follows the branch instruction has been executed.
In figure 7.1, for instance, instruction y is executed before
control is transferred. This instruction slot is called the delay slot. The
SPARC, for instance, uses delayed branch execution. In fact, it also uses
delayed execution for procedure calls. This helps because, by the time
the processor is decoding the branch instruction, the next instruction
has already been fetched. Executing it, rather than throwing it away,
therefore improves efficiency. This approach requires some instructions to
be reordered.
Self Assessment Questions
1. In processors like the Pentium, the branch instruction is also known as
___________ .
2. It is possible to have relocatable code in case of absolute addresses.
(True/False)
Activity 1:
Work on a MIPS processor to find out the difference between conditional
and unconditional branching.
7.3 Branch Handling
A branch is a flow-altering instruction that must be handled in a special
manner in pipelined processors. The impact of a branch instruction on the
pipeline is shown in figure 7.2(a) below:
Figure 7.2: Branch Instruction’s impact on Pipeline
As shown in figure 7.2, instruction Ib is a branch instruction.
If the branch is taken, control is transferred to instruction It.
If the branch is not taken, the instructions already in the pipeline are
useful.
If the branch is taken, however, every instruction in the pipeline, at
its various stages, is discarded. In the example above,
instructions I2, I3, and I4 must be discarded, and instruction fetching begins
at the target address. As a result, the pipeline does wasteful work for three
clock cycles. This is called the branch penalty.
Now let us see how this branch penalty can be reduced. In figure 7.2(a),
we wait until the execution (IE) stage before
starting instruction fetch at the target address. The delay can be
reduced if the branch outcome can be determined earlier.
For instance, if we can find out whether the
branch is taken, along with the target address, during the
decode (ID) stage, we pay a penalty of only one cycle, as shown in figure
7.2(b). In this case, only one instruction (I2) needs to be discarded.
But can the required information be obtained at the decode stage? For many
branch instructions, the target address is specified as part of the instruction,
so calculating the target address is relatively simple. Determining
whether the branch is taken during the decode stage may not be so
easy. For instance, it may be necessary to fetch operands and
compare their values to find out whether the branch is taken. This
means that we must wait until the execution stage.
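To get a feel for what the branch penalty costs, the following Python sketch estimates the average number of cycles per instruction for the three-cycle and the one-cycle penalty discussed above. The workload figures (20% of instructions are branches, 60% of them taken) and the ideal rate of one instruction per cycle are assumptions made purely for illustration; they are not taken from this unit.

```python
def effective_cpi(branch_fraction, taken_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when every taken branch costs a penalty."""
    return base_cpi + branch_fraction * taken_fraction * penalty_cycles


# Assumed workload: 20% of instructions are branches and 60% of them are taken.
resolved_in_execute = effective_cpi(0.20, 0.60, penalty_cycles=3)
resolved_in_decode = effective_cpi(0.20, 0.60, penalty_cycles=1)
print(f"{resolved_in_execute:.2f} {resolved_in_decode:.2f}")   # 1.36 1.12
```

Under these assumptions, resolving the branch in the decode stage rather than the execution stage cuts the overhead caused by taken branches from 0.36 to 0.12 cycles per instruction.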
Self Assessment Questions
3. ______ is a flow altering instruction that is required to be handled in
a special manner in pipelined processors.
4. Wasteful work done by pipeline for a considerable time is called the
___________ .
7.4 Delayed Branching
Figure 7.2(b) shows that the branch penalty can be reduced to one
cycle. Delayed branch execution effectively reduces the branch penalty
further. The idea is based on the observation that the instruction
following the branch is always fetched before we know whether the branch is
taken. So why not execute this instruction rather than
throw it away? This means that we must place a useful instruction in
this instruction slot, which is called the delay slot. In other words,
the branch is delayed until after the instruction in the delay slot has been
executed. A number of processors, for example, MIPS and SPARC, make use
of delayed execution for procedure calls as well as branching. When this
method is applied, the program must be modified so as
to place a useful instruction in the delay slot. For example, consider
the code segment given below.
add R2, R3, R4
branch target
sub R5, R6, R7
target:
mult R8, R9, R10
With delayed branching, the instructions can be reordered so as
to move the branch instruction up by one instruction, which places the add
instruction in the delay slot. This is shown below:
branch target
add R2, R3, R4 /* Branch delay slot */
sub R5, R6, R7
...
target:
mult R8, R9, R10
Programmers need not worry about moving instructions into delay slots.
This task is carried out by compilers and
assemblers. If no useful instruction can be moved into the delay slot, a NOP
(no operation) is placed there. Note that if the branch is not
taken, we may not want the delay slot instruction to be executed. That is,
we would like to nullify the instruction in the delay slot. A number of
processors, such as SPARC, offer this nullification option.
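The effect of the delay slot can be sketched with a tiny Python interpreter that runs the add/branch/sub/mult example in both forms. The run function, the tuple-based program format and the labels table are hypothetical constructs introduced only for this illustration; the sketch handles unconditional branches only and ignores nullification.

```python
def run(program, labels, delayed):
    """Return the names of the instructions in the order they are executed.

    program: list of (opcode, operand) tuples; ("branch", label) is an
    unconditional branch. With delayed=True, the instruction that follows
    the branch (the delay slot) executes before control is transferred.
    """
    executed = []
    pc = 0
    pending = None                       # branch target waiting for the delay slot
    while pc < len(program):
        op, arg = program[pc]
        executed.append(op)
        pc += 1
        if pending is not None:          # delay slot has just executed: jump now
            pc, pending = pending, None
        elif op == "branch":
            if delayed:
                pending = labels[arg]    # jump after the next instruction
            else:
                pc = labels[arg]         # jump immediately
    return executed


labels = {"target": 3}                   # index of the mult instruction
original = [("add", None), ("branch", "target"), ("sub", None), ("mult", None)]
reordered = [("branch", "target"), ("add", None), ("sub", None), ("mult", None)]

print(run(original, labels, delayed=False))   # ['add', 'branch', 'mult']
print(run(reordered, labels, delayed=True))   # ['branch', 'add', 'mult']
```

In both cases add and mult are executed and sub is skipped, so the reordered program with a delayed branch does the same useful work without wasting the slot after the branch.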
Self Assessment Questions
5. A number of processors such as __________ and _________ make
use of delayed execution for procedure calls as well as branching.
6. If any valuable instruction cannot be moved into delay slot, ________ is
placed.
7.5 Branch Processing
Branch processing supports instruction execution. It receives branch
instructions and resolves conditional branches as early as possible. To
resolve them, it uses static and dynamic branch prediction. Effective processing of
branches has become a cornerstone of increased performance in ILP-
processors. No wonder, therefore, that in the pursuit of more performance,
predominantly in the past few years, computer architects have developed a
confusing variety of branch processing schemes.
After the recent announcements of a significant number of new processors,
we are in a position to discern trends and to emphasise promising solutions.
Branch processing has two aspects, its layout and its micro-architectural
implementation, as shown in figure 7.3.
Figure 7.3: Design Space of Branch Processing
As far as its layout is concerned, branch processing involves three major
subtasks: detecting branches, handling of unresolved conditional branches
during instruction decoding and accessing the branch target path.
The earlier a processor detects branches, the earlier branch
processing can be started and the fewer penalties there are. Therefore, novel
schemes try to detect branches as early as possible.
The next aspect of the layout is the handling of unresolved conditional
branches. We note that we designate a conditional branch unresolved if the
specified condition is not yet available at the time when it is evaluated during
branch processing. The last aspect of the layout of branch processing is how
the branch target path is accessed.
Self Assessment Questions
7. Branch processing has two aspects _______ and _______ .
8. Name the major sub tasks of branch processing.
7.6 Branch Prediction
Branch prediction is a technique for handling the
problems caused by branches. Different strategies of branch prediction include:
• Fixed branch prediction
• Static branch prediction
• Dynamic branch prediction
These approaches are discussed as below:
7.6.1 Fixed branch prediction
In fixed branch prediction, the prediction is fixed. This approach
is easy to implement. It presumes either of
the following:
• branch is never taken
or
• branch is always taken
The VAX 11/780 and Motorola 68020 are examples of processors that use the
branch-never-taken strategy.
The benefit of the never-taken approach is that the processor
can keep fetching instructions sequentially to fill the pipeline. If the
prediction turns out to be wrong, the penalty is minimal.
The always-taken strategy, on the other hand, requires the processor to
pre-fetch the instruction at the branch target address.
In a paged environment, this may cause a page fault, and
a special mechanism is required to handle this situation.
For loop structures, the never-taken strategy is not appropriate. If a
loop is repeated 200 times, the branch is taken 199 times out of 200.
The always-taken strategy is better for loops. Likewise, the
always-taken strategy is preferred for procedure calls and returns.
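The loop example above can be checked with a few lines of Python. The helper fixed_accuracy is a hypothetical name introduced here; the list of outcomes simply models a loop-closing branch that is taken 199 times and falls through once.

```python
def fixed_accuracy(outcomes, predict_taken):
    """Fraction of branch outcomes that a fixed prediction gets right."""
    correct = sum(1 for taken in outcomes if taken == predict_taken)
    return correct / len(outcomes)


# Loop-closing branch of a loop repeated 200 times: taken 199 times, then not taken.
loop_branch = [True] * 199 + [False]

always_taken = fixed_accuracy(loop_branch, predict_taken=True)
never_taken = fixed_accuracy(loop_branch, predict_taken=False)
print(always_taken, never_taken)   # 0.995 0.005
```

For this branch the always-taken strategy is right 99.5% of the time, while the never-taken strategy is right only once.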
7.6.2 Static branch prediction
As the discussion above suggests, performance can be improved by using,
instead of a fixed approach, one that depends
on the type of branch. This approach is known as static branch
prediction. It uses the instruction opcode to predict
whether the branch is taken, and it provides high
prediction accuracy. To illustrate this, let us look at sample data for industrial
environments. In such environments, branch-type operations are
distributed as follows:
• branches are about 70%
• loops are about 10%
• the remaining operations are procedure calls/returns
Of the branches, about 40% are unconditional.
If a never-taken strategy is used for conditional branches and an always-taken
strategy for the remaining branch-type operations, the prediction
accuracy is about 82%. This is shown in table 7.1 below.
Table 7.1: Static Branch Prediction Accuracy
Instruction type        Instruction           Prediction:      Correct
                        distribution (%)      Branch taken?    prediction (%)
Unconditional branch    70 x 0.4 = 28         Yes              28
Conditional branch      70 x 0.6 = 42         No               42 x 0.6 = 25.2
Loop                    10                    Yes              10 x 0.9 = 9
Call/return             20                    Yes              20
Overall prediction accuracy = 82.2%
The data in the table above assume that conditional branches are not taken
approximately 60% of the time. The never-taken prediction for
conditional branches is therefore correct only sixty percent of the time. So we
get the following:
42 x 0.6 = 25.2%
This is the share of correct predictions contributed by conditional branches.
Likewise, loops jump back with 90% probability. As loops make up about 10%
of the operations, they contribute 9% correct predictions. Surprisingly, even this
static prediction approach provides an accuracy of about 82%.
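The arithmetic behind table 7.1 can be reproduced with a short Python sketch. The list name mix and its layout are introduced here only for illustration; the numbers are those given in the table.

```python
# Each entry: (instruction type, share of branch-type operations in %,
#              probability that the static prediction for this type is right).
mix = [
    ("unconditional branch", 70 * 0.4, 1.0),   # always taken, predicted taken
    ("conditional branch",   70 * 0.6, 0.6),   # not taken about 60% of the time
    ("loop",                 10,       0.9),   # jumps back about 90% of the time
    ("call/return",          20,       1.0),   # always taken, predicted taken
]

overall = sum(share * p_correct for _, share, p_correct in mix)
print(f"Overall prediction accuracy = {overall:.1f}%")   # 82.2%
```

The four contributions are 28, 25.2, 9 and 20, which add up to the 82.2% shown in the table.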
7.6.3 Dynamic branch prediction
To make more accurate predictions, this approach takes run-time
history into account. The outcomes of the last n executions of a branch are
recorded, and this information is used to predict the next one.
The empirical study done by Smith and Lee suggests that this approach
provides a major improvement in prediction accuracy. Table 7.2
summarises their findings.
Table 7.2: Effect of Using the Information of Past Branches on Prediction
Accuracy
The algorithm applied is simple: the prediction for the next branch is the
majority outcome of the past n branch executions. For instance, suppose n = 3.
If the branch was taken in two or more of the past three branch
executions, the prediction is that the branch will be taken.
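The majority rule described above can be sketched in Python as follows. The function name majority_predictor is hypothetical, and the behaviour before n outcomes have been recorded (predict "not taken" with an empty history, and use the majority of whatever has been seen so far) is an assumption of this sketch, since the text does not specify it.

```python
from collections import deque


def majority_predictor(outcomes, n=3):
    """Predict each branch from the majority of its last n recorded outcomes."""
    history = deque(maxlen=n)            # keeps only the n most recent outcomes
    predictions = []
    for taken in outcomes:
        # Predict "taken" when more than half of the recorded outcomes were taken.
        predictions.append(sum(history) * 2 > len(history))
        history.append(taken)
    return predictions


outcomes = [True, True, False, True, True]
print(majority_predictor(outcomes))   # [False, True, True, True, True]
```

Note that the single "not taken" outcome in the middle does not change the prediction, because two of the last three executions were still taken.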
The data in table 7.2 suggest that if we consider the last two branch
executions, we get about 90% prediction accuracy for most of the
mixes. Beyond that, only minor improvement is obtained. From the
implementation viewpoint, only two bits are required to keep the history of
the past two branch executions.
The procedure is simple: keep the current prediction unless the past two
predictions were incorrect. In particular, the prediction is not changed
just because the last prediction was incorrect. We can
express this scheme by means of a four-state finite state machine. This is shown
in figure 7.4.
Figure 7.4: State Diagram for Branch Prediction
In the figure given above, the left bit represents the prediction, whereas the
right bit represents the status of the branch (that is, whether the branch was
taken). If the left bit is “0”, the prediction is “not
taken”; otherwise, the prediction is “branch taken”. The right bit gives the
actual outcome of the branch instruction. A “0” means “branch not taken”,
that is, the branch instruction did not jump. A “1”, on the other
hand, means “branch taken”. For instance, state 00 means that
we predicted that the branch would not be taken (left bit 0) and that the
branch was indeed not taken (right bit 0). Thus, we stay in state 00 as long
as the branch is not taken. If the prediction is incorrect, we move to state 01,
but still predict “branch not taken”, since we were incorrect just once. If the
next prediction is right, we move back to state 00. If the prediction is
incorrect again, we change the prediction to “branch taken” and
move to state 10. Thus, two wrong predictions one after
the other make us change the prediction.
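The rule "change the prediction only after two wrong predictions in a row" can be sketched in Python as follows. The sketch tracks the current prediction and a single "missed once" marker instead of reproducing the state numbering of figure 7.4; this representation, the function name two_miss_predictor, and the choice of resetting the marker after the prediction flips are assumptions made for illustration.

```python
def two_miss_predictor(outcomes, predict_taken=False):
    """Keep the current prediction until it has been wrong twice in a row."""
    missed_once = False
    predictions = []
    for taken in outcomes:
        predictions.append(predict_taken)
        if taken == predict_taken:
            missed_once = False                  # correct: forget any earlier miss
        elif missed_once:
            predict_taken = not predict_taken    # second miss in a row: flip
            missed_once = False
        else:
            missed_once = True                   # first miss: keep the prediction
    return predictions


# Hypothetical loop branch: taken nine times, then not taken, executed twice.
outcomes = ([True] * 9 + [False]) * 2
predictions = two_miss_predictor(outcomes)
correct = sum(p == t for p, t in zip(predictions, outcomes))
print(correct, len(outcomes))   # 16 20
```

The single "not taken" outcome at each loop exit costs one misprediction but does not flip the prediction, so the branch is predicted correctly when the loop is entered again.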
Self Assessment Questions
9. In case of Fixed Branch Prediction, the branch is presumed to be either
_____ or _____ .
10. Static strategy makes use of _______________ for predicting whether
the branch is taken.
Activity 2:
Find out examples of processors which use the above mentioned three types
of branch predictions.
7.7 The Intel IA-64 Architecture and Itanium Processor
Due to the complexity of superscalar and related architectures,
a need for new technology was felt. Two
features of superscalar technology contributed to this quest: the linear
growth of functional unit area with the number of units, and the quadratic
growth of scheduler area with the number of units.
As a result, cost-performance reaches the point of diminishing
returns. Moreover, traditional architectures exhibited limited parallelism. To
overcome these limitations, the Intel IA-64 architecture and the Itanium
processor were developed. Let’s study them in detail.
The Intel IA-64
Intel is rapidly reaching the point where it has squeezed almost everything
out of the IA-32 ISA (Intel Architecture, 32-bit; Instruction Set Architecture)
and the Pentium II line of processors. New models can still benefit
from improvements in manufacturing technology. However, finding new
ways to speed up the implementation even further is becoming difficult,
because the constraints imposed by the IA-32 ISA loom larger
all the time. The real solution is to abandon IA-32 as the main line of
development and move to a new ISA. This is, in fact, what Intel proposes to
do. The new architecture, developed jointly by Hewlett-Packard and Intel,
is known as IA-64. It is a full 64-bit machine from start to end.
An entire series of processors implementing this architecture is expected
in the coming years.
Itanium is a family of 64-bit Intel microprocessors
that implement the Intel Itanium architecture, which
was initially known as IA-64. Intel sells the processors for enterprise
servers and high-performance computing. The architecture
originated at Hewlett-Packard (HP) and was later developed jointly
by HP and Intel. The Itanium architecture is based on
explicit instruction-level parallelism, in which the compiler chooses the
instructions to be executed in parallel. This is quite different from
superscalar architectures, which rely on the
CPU to manage instruction dependencies at run time.
Itanium cores up to and including Tukwila can execute up to
six instructions per clock cycle. The first Itanium processor, named Merced,
appeared in 2001. (It was a dual-mode processor, capable
of executing both IA-32 and IA-64 programs.) At present, HP is
not the sole manufacturer of Itanium-based systems; various other
manufacturers have also entered this field.
Itanium was regarded as the fourth most widely used microprocessor
architecture for enterprise-class systems, after x86-64, IBM POWER, and
SPARC. Initially planned for release in 2007, Tukwila is the most recent
processor of this family and was released on February 8, 2010. The
starting point for the IA-64 architecture was a high-end 64-bit RISC processor,
of which the UltraSPARC II is an example (SPARC stands for Scalable
Processor Architecture).
IA-64 is a load/store architecture with 64-bit
addresses and 64-bit wide registers. There are 128 general registers
available to IA-64 programs (and some additional registers
available to IA-32 programs).
Every instruction has the same fixed format: an opcode, two 7-bit
source register fields, a 7-bit destination register field, and a
6-bit predicate register field. Most instructions take two register operands,
perform some computation on them, and put the result in the destination
register. Multiple functional units are available to perform different
operations in parallel. So far, most RISC machines have a similar architecture.
What is unusual is the idea of a bundle of related instructions. Instructions
come in groups of three, called a bundle. This is shown in figure 7.5 below.
Figure 7.5: IA-64: Bundles of 3 Instructions
Every 128-bit bundle contains three fixed-format 41-bit instructions
and a 5-bit template. Bundles can be chained together by
means of an end-of-bundle bit, so that more than three instructions can
be grouped together. The template contains information about which
instructions can be executed in parallel. This scheme, together with the
large number of registers, allows the compiler to isolate blocks of
instructions and tell the processor that they can be executed in
parallel. The compiler, rather than the hardware, must therefore reorder
instructions, check for dependences, make sure that functional units are
available, and so on.
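The idea of packing three instructions and a template into one 128-bit bundle can be sketched in Python using an ordinary integer as the bundle. The field widths below are assumptions of this sketch: a 5-bit template in the low-order bits followed by three 41-bit instruction slots, as in the released Itanium encoding. The function names and the example values are hypothetical, and the instruction contents are treated as opaque bit patterns.

```python
SLOT_BITS = 41       # assumed width of one instruction slot
TEMPLATE_BITS = 5    # assumed width of the template field


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in its field")
    bundle = template                       # template occupies the low bits
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in a slot")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(3)
    ]
    return template, slots


bundle = pack_bundle(0x10, [0x1, 0x2, 0x3])   # arbitrary example values
print(TEMPLATE_BITS + 3 * SLOT_BITS, unpack_bundle(bundle))   # 128 (16, [1, 2, 3])
```

The point of the sketch is only that the template travels with the three instructions, so the hardware can read the compiler's scheduling decision from the bundle itself.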
By exposing the internal workings of the machine and telling compiler
writers to make sure that every bundle consists of compatible instructions, the
task of scheduling the RISC instructions is moved from the hardware (at run
time) to the compiler (at compile time). For this reason, the model is
known as Explicitly Parallel Instruction Computing (EPIC). Scheduling
instructions at compile time has several benefits, which are
discussed below:
• As all the work is now performed by the compiler, the hardware can be much
simpler, saving numerous transistors for other useful functions,
such as larger level 1 caches.
• For any given program, scheduling has to be performed just once,
at compile time.
• As all the work is done by the compiler, a software vendor can use a compiler
that takes a long time to optimise its program. Every user benefits
whenever the program is executed.
The idea of bundles of instructions can be used to create a whole family of
CPUs. On low-end processors, one bundle may be issued per clock cycle.
The CPU must wait until every
instruction has completed before issuing the next bundle. On high-end
processors, it may be possible to issue several
bundles during the same clock cycle, as in existing
superscalar designs.
Self Assessment Questions
11. The IA-64 architecture is a load/store architecture with 64-bit
_______________ and 64-bit wide __________ .
12. IA-64 model is also called ______________ .
7.8 ILP in the Embedded and Mobile Markets
The Trimedia and Crusoe chips represent interesting approaches to applying
the concepts of Very Long Instruction Word (VLIW) in the embedded space. The
Trimedia processor may be the closest existing processor to a "classic" VLIW
processor. It also supports a mechanism for compressing instructions while
they are in main memory and in the instruction cache, and for decompressing
them during instruction fetch. This approach addresses the code-size drawback
of VLIW processors. In contrast, the Crusoe processor uses software
translation from the x86 architecture to a VLIW processor, and thereby
achieves lower power consumption than typical x86 processors.
We will now look at the Trimedia TM32 architecture in detail.
The Trimedia TM32 Architecture
Media processor is the name given to a class of embedded processors that are
dedicated to multimedia processing. Like other embedded processors, they are
usually cost sensitive, but they follow the compiler orientation of desktop
and server computing. Like DSPs, they operate on narrower data types than
desktop processors, and they must often deal with endless, continuous streams
of data. Figure 7.6 lists media application areas together with benchmark
algorithms for media processors.
Application area      Benchmarks
--------------------  ------------------------------------------------
Data communication    Viterbi decoding
Audio coding          AC3 decode
Video coding          MPEG2 encode, DVD decode
Video processing      Layered natural motion, dynamic noise reduction,
                      peaking
Graphics              3D renderer library
Figure 7.6: Media Processor Application Areas and Example Benchmarks
The Trimedia TM32 CPU is a representative of this class. It is intended for
products such as set top boxes and advanced televisions. Since multimedia
applications contain significant parallelism in processing their data
streams, the instruction set architectures of media processors often look
different from those of desktop processors. First, there are many more
registers: 128 32-bit registers, each of which can hold either integer or
floating-point data. Second, partitioned ALU or SIMD instructions are
provided to allow computations on multiple instances of narrow data. Figure
7.7 lists the operations found in the Trimedia TM32 CPU.
Operation category    Examples of operations                   Number  Comment
--------------------  ---------------------------------------  ------  --------------------------
Load/store ops        ld8, ld16, ld32, limm, st8, st16, st32       33  signed, unsigned, register
                                                                       indirect, indexed, scaled
                                                                       addressing
Byte shuffles         shift right 1-, 2-, 3-bytes, select          11  SIMD type convert
                      byte, merge, pack
Bit shifts            asl, asr, lsl, lsr, rol                      10  shifts, SIMD
Multiplies and        mul, sum of products,                        23  round, saturate, 2's comp,
multimedia            sum-of-SIMD-elements, multimedia,                SIMD
                      e.g. sum of products (FIR)
Integer arithmetic    add, sub, min, max, abs, average,            62  saturate, 2's comp,
                      bitand, bitor, bitxor, bitinv,                   unsigned, immediate, SIMD
                      bitandinv, eql, neq, gtr, geq, les,
                      leq, sign extend, zero extend, sum of
                      absolute differences
Floating point        add, sub, neg, mul, div, sqrt, eql,          42  scalar
                      neq, gtr, geq, les, leq, IEEE flags
Special ops           alloc, prefetch, copy back, read tag,        20  cache, special regs
                      read cache status, read counter
Branch                jmpt, jmpf                                    6  (un)interruptible
Total                                                             207
Figure 7.7: Operations found in Trimedia TM32 CPU
An unusual characteristic from the desktop point of view is that the
programmer can specify five independent operations to be issued at the same
time. If five independent operations are not available (because the remaining
ones depend on earlier results), NOPs (no operations) are placed in the
unused slots. This method of instruction coding is called VLIW (Very Long
Instruction Word).
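The slot-filling rule described above can be sketched in a few lines of
Python. This is only an illustration, not the scheduling algorithm of the
TM32 compiler: the operation tuples, the register-based dependence check and
the function name pack_vliw are all assumptions made for this example.

```python
# Illustrative sketch only: greedy packing of operations into 5-slot VLIW words.
# Operation format (hypothetical): (name, destination register, source registers).
SLOTS = 5
NOP = ("nop", None, ())

def pack_vliw(ops, slots=SLOTS):
    """Pack ops in program order. An op that depends on a result produced in
    the current word starts a new word; unused slots are filled with NOPs."""
    words, current, written = [], [], set()
    for op in ops:
        name, dest, srcs = op
        depends = any(s in written for s in srcs) or dest in written
        if depends or len(current) == slots:
            words.append(current + [NOP] * (slots - len(current)))
            current, written = [], set()
        current.append(op)
        if dest is not None:
            written.add(dest)
    if current:
        words.append(current + [NOP] * (slots - len(current)))
    return words

program = [
    ("ld32", "r1", ("r10",)),
    ("ld32", "r2", ("r11",)),
    ("add",  "r3", ("r1", "r2")),   # needs r1 and r2, so it starts a new word
    ("asl",  "r4", ("r3",)),        # needs r3, so it starts another word
]
words = pack_vliw(program)
nop_count = sum(op is NOP for word in words for op in word)
```

For this small dependent sequence the four operations occupy three
instruction words, and most of the fifteen slots hold NOPs, which is why
long instruction words benefit from being compressed in memory.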
Because the Trimedia TM32 CPU has long instruction words that frequently
contain NOPs, Trimedia instructions are compressed in memory and decoded to
their full size when they are loaded into the cache. Figure 7.8 shows the
TM32 CPU instruction mix for the EEMBC benchmarks.
Figure 7.8: TM32 CPU Instruction Mix for EEMBC Customer Benchmark
With unmodified source code, the instruction mix is similar to that of other
processors, although there are more byte data transfers. Note the large
number of pack and merge instructions needed to align the data for the SIMD
instructions. The instruction mix for “out-of-the-box” C code is similar to
that of general-purpose computers, with a higher emphasis on byte data
transfers. The hand-optimised C code uses the Single Instruction, Multiple
Data (SIMD) instructions along with the pack and merge instructions that
align the data.
In Figure 7.8, the middle column shows the relative instruction mix for
unmodified kernels, while the right column allows modification at the C
level. The columns list all operations that account for at least 1% of the
total in either mix.
Self Assessment Questions
13. The Trimedia processor may be the closest existing processor to a
______________ .
14. State two uses of Trimedia TM32 CPU.
7.9 Summary
• Branching is implemented by means of a branch instruction, which contains
the address of the target instruction.
• The branch penalty can be reduced to one cycle. It can be effectively
reduced further by means of delayed branch execution.
• Effective processing of branches has become a cornerstone of increased
performance in ILP processors.
• Branch prediction is a method used to handle the problems caused by
branches. The different strategies of branch prediction are:
❖ Fixed branch prediction
❖ Static branch prediction
❖ Dynamic branch prediction
• The new architecture developed jointly by Hewlett-Packard and Intel is
known as IA-64.
• The IA-64 model is also known as Explicitly Parallel Instruction Computing
(EPIC).
• Itanium is a family of 64-bit Intel microprocessors that implement the
Intel Itanium architecture, which was initially known as IA-64.
• The Crusoe and Trimedia chips represent interesting approaches to applying
the concepts of Very Long Instruction Word (VLIW) in the embedded space.
The Trimedia processor may be the closest existing processor to a
"classic" VLIW processor.
7.10 Glossary
• Branch penalty: The pipeline cycles wasted on work that must be discarded
when a branch changes the flow of control.
• Condition code registers: Registers through which the instructions that
evaluate a condition communicate the result to the branch instructions.
• EPIC: Explicitly Parallel Instruction Computing.
• ILP: Instruction level parallelism.
• Merced: A dual mode processor, which is capable of executing the
programs of both IA-32 as well as IA-64.
• VLIW: Very Long Instruction Word.
7.11 Terminal Questions
1. Differentiate between unconditional and conditional branch.
2. Explain the concept of branch handling.
3. What do you understand by delayed branching?
4. Define branch processing.
5. What do you mean by branch prediction?
6. Write short notes on:
a) Fixed Branch Prediction
b) Intel IA-64 architecture
c) Static Branch Prediction
d) Itanium processor
e) Dynamic Branch Prediction
7. Explain the concept of Trimedia TM32 Architecture.
7.12 Answers
Self Assessment Questions
1. Jump instruction
2. False
3. Branch
4. Branch penalty
5. SPARC, MIPS
6. No operation (NOP)
7. Layout, micro-architectural implementation
8. a) Detecting branches
b) Handling of unresolved conditional branches during instruction
decoding.
c) Accessing the branch target path
9. Never taken, always taken
10. Instruction opcode
11. Addresses, registers
12. Explicitly Parallel Instruction Computing (EPIC)
13. "Classic" VLIW processor.
14. Set top boxes and advanced televisions.
Terminal Questions
1. An unconditional branch is the simplest type of branch. It always
transfers control to a particular target. In a conditional branch, the
jump is carried out only if a particular condition is satisfied. Refer
Section 7.2.
2. Branch handling is required whenever the flow of control is altered. For
example, a branch requires special handling in pipelined processors. Refer
Section 7.3.
3. Delayed branching is the reduction of branch penalty to one cycle. Refer
Section 7.4.
4. Branch processing receives branch instructions and resolves the
conditional branches as early as possible. Refer Section 7.5.
5. Branch prediction predicts the outcome of a branch. Refer Section 7.6.
6. a) In Fixed Branch Prediction, the prediction is fixed. Refer Section
7.6.1.
b) The new architecture developed jointly by Hewlett-Packard and Intel is
known as IA-64. Refer Section 7.7.1.
c) Static branch prediction uses the instruction opcode to predict
whether the branch is taken. Refer Section 7.6.2.
d) Itanium is a family of 64-bit Intel microprocessors that implement the
Intel Itanium architecture, which was initially known as IA-64. Refer
Section 7.7.2.
e) Dynamic branch prediction considers run-time history to make more
accurate predictions. The history of the last n executions of a branch
is used to predict the next one. Refer Section 7.6.3.
7. The TM32 CPU is an example of a media processor. Multimedia applications
contain significant parallelism in processing their data streams. Refer
Section 7.8.
Unit 8 Memory Hierarchy Technology
Structure:
8.1 Introduction
Objectives
8.2 Memory Hierarchy
Cache memory organisation
Basic operation of cache memory
Performance of cache memory
8.3 Cache Addressing Modes
Physical address mode
Virtual address mode
8.4 Mapping
Direct mapping
Associative mapping
8.5 Elements of Cache Design
8.6 Cache Performance
Improving cache performance
Techniques to reduce cache miss
Techniques to decrease cache miss penalty
Techniques to decrease cache hit time
8.7 Shared Memory organisation
8.8 Interleaved Memory Organisation
8.9 Bandwidth and Fault Tolerance
8.10 Consistency Models
Strong consistency models
Weak consistency models
8.11 Summary
8.12 Glossary
8.13 Terminal Questions
8.14 Answers
8.1 Introduction
The memory system is an important part of a computer system. The input data,
the instructions needed to manipulate the input data and the output data are
all stored in memory.
The memory unit is an essential part of any digital computer, because the
computer can process data only if it is stored somewhere in its memory. For
example, if the computer has to compute f(x) = sin x for a given value of x,
then x is first stored somewhere in memory, and then a routine is called that
contains the program for calculating the sine of a given x. Memory is thus an
indispensable component of a computer. We will cover all this in this unit.
In the previous unit, we explored the software approach to exploiting
instruction-level parallelism, in which you studied types of branches, branch
handling, delayed branching, branch processing and static branch prediction.
You also studied the Intel IA-64 architecture, the Itanium processor, and ILP
in the embedded and mobile markets.
In this unit, we will study memory hierarchy technology. We will cover cache
memory organisation, cache addressing modes, and direct mapping and
associative caches. We will also discuss the elements of cache design,
techniques to reduce cache misses via parallelism, techniques to reduce cache
miss penalties, and techniques to reduce cache hit time. We will also study
shared memory organisation and interleaved memory organisation.
Objectives:
After studying this unit, you should be able to:
• explain the concept of cache memory organisation
• label different cache addressing modes
• explain the concept of mapping
• identify the elements of cache design
• analyse the concept of cache performance
• describe various techniques to reduce cache misses
• explain the concept of shared and interleaved memory organisation
• discuss bandwidth and fault tolerance
• discuss strong and weak consistency models
8.2 Memory Hierarchy
Computer memory is used for storing and retrieving data and instructions. The
memory system comprises the storage devices together with the mechanisms and
algorithms that manage and control the information held in them. Just as
computers are built to speed up computation, the main aim of the memory
system is to give the CPU fast and uninterrupted access to memory. Small
computers with limited applications may not require additional storage beyond
main memory.
General-purpose computers perform better when additional storage is available
beyond the capacity of main memory. Main memory communicates directly with
the processor. Auxiliary memory is a high-capacity memory that provides
backup storage; it is slower than main memory and is not directly accessible
by the CPU, but it is connected to main memory. Early forms of auxiliary
memory were punched paper tape, punched cards and magnetic drums. Since the
1980s, the devices used as auxiliary memory have been magnetic tapes,
magnetic disks and optical disks. Cache memory is an extremely high-speed
memory used to speed up computation by supplying the required instructions
and data to the processor at high speed. Cache memory is introduced into the
system to bridge the speed gap between main memory and the CPU. It stores the
program segments currently being executed by the processor as well as the
temporary data required in current computations. Computer performance
increases because the cache supplies these segments and data at very high
speed. Just as the input/output processor handles data transfer between main
memory and auxiliary memory, the cache handles information transfer between
the processor and main memory. The objective of the memory system is to
obtain the maximum access speed while minimising the total cost of the
memory organisation.
Memories vary in their design, capacity and speed of operation, which is why
we have a hierarchical memory system. A typical computer can have all types
of memories. According to their nearness to the CPU, memories form a
hierarchy as shown in figure 8.1.
Now let us discuss cache memory and the cache memory organisation.
8.2.1 Cache memory organisation
A cache memory is an intermediate memory placed between two memories that
differ greatly in their speed of operation. It is located between the CPU and
main memory (RAM) and holds the most frequently used data and instructions.
Communicating through a cache in this way enhances the performance of a
system significantly. Locality of reference is the common observation that,
over any short interval of time, references to memory tend to be confined to
a few localised areas of memory. A control structure such as a loop
illustrates this. Cache memories exploit this property to enhance overall
performance.
Whenever a program contains a loop, the CPU executes the instructions of the
loop repeatedly. Hence, for instruction fetches, loops and subroutines
localise the references to memory. References to data also tend to be
localised: for example, a table look-up procedure repeatedly refers to the
portion of memory in which the table is stored. These are the properties of
locality of reference. The fundamental idea of cache organisation is that by
keeping the most frequently accessed instructions and data in the fast cache
memory, the average memory access time will approach the access time of the
cache.
8.2.2 Basic operation of cache memory
Whenever the CPU needs to access memory, the cache is examined first. If the
word is found in the cache, it is read from this fast memory. If the word is
not in the cache, main memory is accessed to read it. A block of words
containing the one just accessed is then transferred from main memory to
cache memory.
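The read operation just described can be sketched as follows. This is a
minimal sketch under simplifying assumptions: memory is word addressed, a
block holds 4 words, and the cache is large enough that nothing is ever
replaced. The names BLOCK_SIZE, main_memory and read are chosen only for this
example.

```python
# Illustrative sketch: cache examined first; on a miss a whole block is copied in.
BLOCK_SIZE = 4                        # words per block (assumed value)

main_memory = list(range(100, 164))   # 64 words of dummy data
cache = {}                            # block number -> list of words in the block
stats = {"hits": 0, "misses": 0}

def read(address):
    """Return the word at `address`, going to main memory only on a miss."""
    block, offset = divmod(address, BLOCK_SIZE)
    if block in cache:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        start = block * BLOCK_SIZE
        # Transfer the whole block that contains the requested word.
        cache[block] = main_memory[start:start + BLOCK_SIZE]
    return cache[block][offset]

values = [read(a) for a in (8, 9, 10, 11, 8, 40)]
```

The first access to address 8 misses and brings in words 8 to 11, so the
next four accesses hit; the access to address 40 touches a new block and
misses again.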
8.2.3 Performance of cache memory
Cache memory performance is measured in terms of the hit ratio. When the
processor refers to a word in main memory and finds that word in the cache,
it is said to produce a “hit”. If the word is not found in the cache, it
counts as a “miss”. The hit ratio is the ratio of the number of hits to the
total number of memory references (hits plus misses). A high hit ratio
confirms the validity of "locality of reference". When the hit ratio is high,
the processor accesses the cache most of the time rather than main memory.
The main feature of cache memory is its very short access time. Hence, very
little time must be spent in searching for words in the cache. The transfer
of data from main memory to cache memory is known as the mapping process.
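As a worked illustration of the hit ratio, the sketch below counts hits and
misses for the instruction fetches of a small loop. The loop length, the
number of iterations and the block size are made-up values, and the cache is
assumed to be large enough to hold the whole loop.

```python
# Illustrative sketch: hit ratio for the instruction fetches of a loop.
def hit_ratio(hits, misses):
    """Hits divided by the total number of references."""
    return hits / (hits + misses)

BLOCK_SIZE = 4            # instructions per cache block (assumed)
LOOP_LENGTH = 16          # instructions in the loop body (assumed)
ITERATIONS = 50           # times the loop is executed (assumed)

cached_blocks = set()
hits = misses = 0
for _ in range(ITERATIONS):
    for address in range(LOOP_LENGTH):
        block = address // BLOCK_SIZE
        if block in cached_blocks:
            hits += 1
        else:
            misses += 1               # first touch of the block
            cached_blocks.add(block)

ratio = hit_ratio(hits, misses)
```

Only the first pass through the loop misses (once per block), so 4 of the
800 references miss and the hit ratio is 796/800, that is 0.995. This is
locality of reference at work.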
Self Assessment Questions
1. ____________ directly deals with the processor.
2. ___________ is a high-capacity memory which provides backup
storage.
3. A __________ memory is an intermediate memory between two
memories having a large difference between their speeds of operation.
4. If the processor finds a word in the cache when referring to that word in
main memory, it is said to produce a ____________________ .
8.3 Cache Addressing Modes
The operation to be performed is specified by the operation field of the
instruction. The operation is executed on data stored in computer registers
or memory words. During program execution, the way the operands are selected
depends on the addressing mode of the instruction. The addressing mode
specifies a rule for interpreting or modifying the address field of the
instruction before the operand is actually referenced. When accessing a
cache, the CPU can address it in two ways:
• Physical Address Mode
• Virtual Address Mode
Now let’s go into the details of these addressing modes.
8.3.1 Physical address mode
In physical address mode, a physical memory address is used to access the
cache.
Implementation on unified cache: Generally, both instructions and data are
stored in the same cache. This design is called the Unified (or Integrated)
cache design. When physical address mode is implemented on a unified cache,
the cache is indexed and tagged with the physical address. When the processor
issues an address, the address is translated by the Translation Lookaside
Buffer (TLB) or the Memory Management Unit (MMU) before any cache lookup, as
illustrated in figure 8.2. The TLB is a cache that holds a set of recently
used address translations.
Cache Hit: When the addressed data or instruction is found in the cache
during operation, it is called a cache hit.
Cache Miss: When the addressed data or instruction is not found in the cache
during operation, it is called a cache miss. On a cache miss, a complete
cache block is loaded from the corresponding memory location at one time.
Implementation on Split Cache: When physical addresses are used with a split
cache, both the data cache and the instruction cache are accessed with a
physical address after translation by the MMU. In this design, the
first-level D-cache is small (64 KB) and uses a write-through policy, while
the second-level D-cache is larger (256 KB) and slower and uses a write-back
policy. The I-cache is a single-level cache of smaller size (64 KB). The
implementation of physical address mode on a split cache is illustrated in
figure 8.3.
Figure 8.3: Implementation of Physical Address Mode on Split Cache Design
Advantages of physical address mode: The main advantages of the physical
mode of cache addressing are that the design is simple, since it requires
little intervention from the operating system, and that no ambiguity arises
when accessing the cache, since it is indexed and tagged with the same
physical address.
Disadvantage of physical address mode: The main disadvantage of the physical
mode of cache addressing is that cache access is slowed down, because the
lookup has to wait until the MMU/TLB has completed the address translation.
8.3.2 Virtual address mode
In the virtual mode of cache addressing, the cache is indexed and tagged with
the virtual address. The virtual address is sent to the cache and to the MMU
simultaneously, so that the MMU translation proceeds in parallel with the
cache lookup. Virtual cache addressing for a unified cache is illustrated in
figure 8.4.
In figure 8.4 you can see that the unified cache is accessed directly with
the virtual address. Such a cache is known as a virtual address cache. In the
figure you can also see that the translation by the Memory Management Unit
(MMU) and the cache lookup are performed simultaneously. The cache lookup
does not use the physical address produced by the MMU, but that address can
be saved for later use. The motivation for the virtual address cache is
faster cache access, because the lookup is overlapped with the MMU
translation.
Advantages of virtual address mode: The virtual mode of cache addressing
offers the following advantages:
• It removes the address translation time from a cache hit, which matters
because hits are far more common than misses.
• Cache lookup is not delayed.
• Cache access is faster than in physical addressing mode.
• The MMU translation yields the physical main memory address, which is
saved for later use by the cache.
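The difference between the two addressing modes can be sketched by counting
how many address translations lie on the cache lookup path. This is a
simplified model, not a description of any real MMU: the page size, the page
table contents and all function names are assumptions, and the overlap of
lookup and translation in the virtual mode is modelled simply by needing the
physical address only on a miss.

```python
# Illustrative sketch: physically addressed vs virtually addressed cache lookup.
PAGE_SIZE = 4096
PAGE_TABLE = {0: 7, 1: 3}    # virtual page -> physical frame (made-up values)

def translate(vaddr):
    """Stand-in for the MMU/TLB: virtual address -> physical address."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    return PAGE_TABLE[page] * PAGE_SIZE + offset

class Cache:
    def __init__(self):
        self.lines = {}                  # address used as tag -> data
        self.translations_in_lookup = 0  # translations on the lookup path

def read_physical(cache, memory, vaddr):
    cache.translations_in_lookup += 1    # translation always precedes the lookup
    paddr = translate(vaddr)
    if paddr not in cache.lines:
        cache.lines[paddr] = memory[paddr]
    return cache.lines[paddr]

def read_virtual(cache, memory, vaddr):
    if vaddr not in cache.lines:         # lookup uses the virtual address itself
        cache.translations_in_lookup += 1    # physical address needed on a miss only
        cache.lines[vaddr] = memory[translate(vaddr)]
    return cache.lines[vaddr]

trace = [0, 4, 0, 4096, 0, 4]
memory = {translate(v): "word@%d" % v for v in set(trace)}
physical_cache, virtual_cache = Cache(), Cache()
physical_data = [read_physical(physical_cache, memory, v) for v in trace]
virtual_data = [read_virtual(virtual_cache, memory, v) for v in trace]
```

Both caches return the same data, but the physically addressed cache needs a
translation before each of the six lookups, whereas the virtually addressed
cache needs the physical address only for its three misses.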
Self Assessment Questions
5. When both instructions and data are stored into the same cache, the
design is called the _____________ cache design.
6. TLB stands for ___________________ .
8.4 Mapping
Mapping refers to the translation of a main memory address to a cache memory
address. Information is transferred from main memory to cache memory in units
of cache blocks or cache lines. Blocks in the cache are called block frames,
which are denoted as
$B_i$ for $i = 1, 2, \ldots, j$
where $j$ is the total number of block frames in the cache.
The corresponding memory blocks are denoted as
$B_n$ for $n = 1, 2, \ldots, k$
where $k$ is the total number of blocks in memory. It is assumed that
$k \gg j$, $k = 2^s$ and $j = 2^r$
where $s$ is the number of bits required to address a main memory block, and
$r$ is the number of bits required to address a cache block frame.
There are four types of mapping schemes: direct mapping, associative
mapping, set-associative mapping, and sector mapping. Here, we will discuss
the first two types of mapping.
8.4.1 Direct mapping
Associative memories are very costly compared to RAM because of the
additional logic associated with every cell. Suppose there are $2^j$ words in
main memory and $2^k$ words in cache memory (here $j$ and $k$ denote address
widths in bits, not the block counts of the previous section). The j-bit
memory address is divided into two fields: k bits form the index field and
the remaining j-k bits form the tag field. The direct mapping cache
organisation uses the k-bit index to access the cache memory and the full
j-bit address to access main memory. Each cache word holds the data together
with its associated tag. In direct mapping, every memory block is assigned to
one particular line of the cache. If that line already holds a memory block
when a new block is to be loaded, the old block is replaced. Figure 8.5
illustrates the mapping of multiple memory blocks to the same line in the
cache memory; these blocks can be placed in that line only. In figure 8.5,
the block identification portion of the memory address contains 8 bits.
Figure 8.5: Direct Mapping
When a new word is brought into the cache, the tag bits are stored alongside
the data bits. When the processor generates a memory request, the index field
of the main memory address is used to access the cache. The tag stored in the
selected cache word is compared with the tag field of the processor address.
If the two tags match, there is a hit and the required word is in the cache.
If they do not match, it is a miss and the required word is read from main
memory. The word is then stored in the cache together with the new tag,
replacing the previous value.
Demerits of direct mapping: The hit ratio can fall substantially if two or
more words whose addresses have the same index but different tags are
accessed repeatedly.
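The index and tag comparison described above, and the demerit of direct
mapping, can be sketched as follows. The cache size (8 words, that is a
3-bit index), the 64-word memory and the class name DirectMappedCache are
assumptions made only for this example.

```python
# Illustrative sketch: direct-mapped cache, one word per line.
INDEX_BITS = 3                       # cache of 2**3 = 8 words (assumed size)

class DirectMappedCache:
    def __init__(self, index_bits=INDEX_BITS):
        self.index_bits = index_bits
        self.lines = [None] * (1 << index_bits)   # each line holds (tag, data)
        self.hits = self.misses = 0

    def read(self, address, memory):
        index = address & ((1 << self.index_bits) - 1)   # low-order index bits
        tag = address >> self.index_bits                 # remaining tag bits
        line = self.lines[index]
        if line is not None and line[0] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.lines[index] = (tag, memory[address])   # old word is replaced
        return self.lines[index][1]

memory = [a * 10 for a in range(64)]     # 2**6 words of dummy data

friendly = DirectMappedCache()
for a in [0, 1, 2, 3] * 4:       # different indexes: only the first pass misses
    friendly.read(a, memory)

conflict = DirectMappedCache()
for a in [0, 8, 0, 8, 0, 8]:     # same index 0, different tags: every access misses
    conflict.read(a, memory)
```

Addresses 0 and 8 share index 0 but have different tags, so they keep
evicting each other and the second trace never hits, even though seven of
the eight cache lines remain empty.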
8.4.2 Associative mapping
Associative mapping gives the fastest and most flexible cache organisation.
Both the address and the content of each memory word are stored in the
associative memory. This means that any location in the cache can hold any
word from main memory.
For example, in figure 8.6, the CPU address is first placed in the argument
register and then the associative memory is searched for a matching address.
CPU address ---► Argument register ---► Associative memory (cache memory)
Figure 8.6: Flow of Search
If the address is found, the corresponding word is read and sent to the CPU.
If no match is found, main memory is accessed for the word, and the address
and the word are then placed in the cache memory. If the cache has no vacant
space for the new information, space is created by displacing an existing
entry according to the replacement policy.
In the associative mapping technique, the entire cache array is implemented
as a single associative memory. The associative memory is also called content
addressable memory (CAM). When a memory address produced by the processor is
sent to the CAM, the CAM simultaneously compares that address to all
addresses currently stored in the cache.
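A fully associative cache can be sketched with a dictionary standing in for
the CAM. The text above does not fix a replacement policy, so LRU is assumed
here; the two-entry capacity and the class name AssociativeCache are likewise
assumptions made for this example.

```python
# Illustrative sketch: fully associative cache with LRU replacement (assumed policy).
from collections import OrderedDict

class AssociativeCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()      # address -> data, least recently used first
        self.hits = self.misses = 0

    def read(self, address, memory):
        if address in self.store:       # models the CAM comparing all stored addresses
            self.hits += 1
            self.store.move_to_end(address)
        else:
            self.misses += 1
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)   # displace the least recently used pair
            self.store[address] = memory[address]
        return self.store[address]

memory = [a * 10 for a in range(64)]
cache = AssociativeCache(capacity=2)
results = [cache.read(a, memory) for a in [0, 8, 0, 8, 16, 0]]
```

Unlike in a direct-mapped cache, addresses 0 and 8 can reside in the cache
together, so the repeated accesses to them hit. Misses occur only on first
use and after address 16 has displaced an entry.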
Self Assessment Questions
7. The translation of main memory address to the cache memory address
is known as ________________ .
8. _______________ memories are expensive compared to RAMs.
Activity 1:
Visit an organisation and find out the cache memory size and the costs they
are incurring to hold it. Also try to retrieve the size of the data stored in the
cache memory.
8.5 Elements of Cache Design
Main elements of cache design are:
1. Cache Size: The cache should be small enough that the overall average
cost per bit is close to that of main memory alone. At the same time, it
must be large enough that the overall average access time is close to that
of the cache alone. Large caches also tend to be slightly slower than small
ones, because more gates are involved in addressing a larger cache.
2. Mapping Function: The two types of mapping are Direct Mapping and
Associative Mapping (explained earlier in this unit).
3. Replacement Algorithm: LRU (Least Recently Used), FIFO (First-In,
First-Out), LFU (Least Frequently Used) or Random, which is simple to
build in hardware.
4. Write Policy: Write Through, Write Back or Write Once.
5. Line Size: The optimum size depends on the workload.
6. Number of Caches: Single or two levels, and unified or split.
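To illustrate how the write policy in item 4 affects traffic to main memory,
the sketch below counts main memory writes for write-through and write-back.
It is deliberately minimal: the cache is assumed to hold a single line, and
the function name and the trace of written block numbers are made up for this
example.

```python
# Illustrative sketch: main memory writes under write-through vs write-back.
def count_memory_writes(writes, policy):
    """writes: sequence of block numbers written by the CPU.
    Models a cache with a single line; returns the number of main memory writes."""
    cached_block, dirty, memory_writes = None, False, 0
    for block in writes:
        if block != cached_block:
            if dirty:                 # write-back: copy the modified line out on eviction
                memory_writes += 1
            cached_block, dirty = block, False
        if policy == "write-through":
            memory_writes += 1        # every write also goes to main memory
        else:                         # "write-back"
            dirty = True              # only mark the line as modified
    if dirty:
        memory_writes += 1            # final write-back of the last modified line
    return memory_writes

writes = [5, 5, 5, 5, 9, 9, 5]
through = count_memory_writes(writes, "write-through")
back = count_memory_writes(writes, "write-back")
```

For this trace write-through performs one memory write per CPU write, while
write-back writes a line to memory only when it is evicted or finally
flushed, which is fewer writes whenever the same block is written repeatedly.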
Generally, both instructions and data are stored in the same cache. This
design is called the Unified (or Integrated) cache design. However, at times
separate caches are used to store and access instructions and data. A cache
used only to store instructions is called an Instruction Cache (I-Cache),
while a cache used only to store data is called a Data Cache (D-Cache).
The advantage of restricting a cache to instructions only is that
instructions normally do not change during execution. Therefore, the contents
of an instruction cache never need to be written back to main memory. The
contents of the D-Cache, however, change frequently and must be written back
to main memory to keep it up to date. A design in which instructions and data
are stored in different caches for convenience of execution is called the
Split (or Mixed) cache design.
Self Assessment Questions
9. A design where instructions and data are stored in different caches for
execution conveniences is called _____________ cache design.
10. I-Cache denotes ________________________.
8.6 Cache Performance
Because instruction count is independent of the hardware, it is tempting to
use it to evaluate processor performance. Similarly, a computer designer may
be tempted to concentrate on the miss rate when evaluating memory-hierarchy
performance, since it too is independent of the speed of the hardware.
However, the miss rate can be a misleading performance measure, and the
average memory access time should be used instead.
Average memory access time = Hit time + Miss rate × Miss penalty
This equation helps you to decide between a unified and a split cache.
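As a worked illustration of the formula, the sketch below compares a split
and a unified cache. All the numbers are hypothetical: 75% of references are
assumed to be instruction fetches, the miss penalty is taken as 50 cycles,
and data accesses to the unified cache are assumed to take one extra cycle
of hit time because the single cache is shared with instruction fetches.

```python
# Illustrative sketch: average memory access time (AMAT) with made-up numbers.
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

INSTRUCTION_SHARE = 0.75     # fraction of references that are instruction fetches
DATA_SHARE = 0.25
MISS_PENALTY = 50            # cycles

# Split cache: separate miss rates for the I-cache and the D-cache.
split = (INSTRUCTION_SHARE * amat(1, 0.005, MISS_PENALTY)
         + DATA_SHARE * amat(1, 0.06, MISS_PENALTY))

# Unified cache: one miss rate, but data accesses see a longer hit time.
unified = (INSTRUCTION_SHARE * amat(1, 0.02, MISS_PENALTY)
           + DATA_SHARE * amat(2, 0.02, MISS_PENALTY))
```

With these assumed values the split cache averages about 1.94 cycles per
access and the unified cache 2.25 cycles. Different miss rates or hit times
could reverse the result.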
8.6.1 Improving cache performance
The gap between CPU speed and main memory speed has been widening over the
past years, which has drawn the attention of many computer designers. The
average memory access time formula provides a framework for presenting the
techniques that improve cache performance. These techniques are:
• Reduce Miss Rate
• Reduce Cache Miss Penalty
• Reduce Cache Hit Time
Now let’s discuss these techniques in detail.
8.6.2 Techniques to reduce cache miss
For reducing the miss rate the following techniques are used:
• Hardware prefetching of instructions and data
• Victim caches
• Pseudo-associative Caches
Hardware prefetching of Instruction and Data
Hardware-based prefetching approaches can be classified into two categories:
spatial and temporal. In spatial schemes, the access to the current block is
the basis for the prefetch decision, and prefetches occur when there is a
miss on a cache block. Temporal schemes rely on look-ahead decoding of the
instruction stream and attempt to have the data in the cache “just in time”
to be used.
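A simple spatial scheme can be sketched as follows. The specific policy used
here, fetching the next sequential block whenever a block misses, is an
assumption chosen for illustration, as are the function name and the trace of
block numbers.

```python
# Illustrative sketch: prefetch of the next sequential block on a miss.
def count_misses(trace, prefetch):
    """trace: sequence of block numbers referenced; cache capacity is unlimited."""
    cache, misses = set(), 0
    for block in trace:
        if block not in cache:
            misses += 1
            cache.add(block)
            if prefetch:
                cache.add(block + 1)   # also fetch the following block
    return misses

trace = list(range(8))                 # blocks 0..7 referenced in order
without_prefetch = count_misses(trace, prefetch=False)
with_prefetch = count_misses(trace, prefetch=True)
```

For a purely sequential reference stream, every second block has already
been prefetched when it is needed, so the number of misses is halved.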
8.6.3 Techniques to reduce cache miss penalty
For reducing cache miss penalty the following techniques are used:
• Early restart and critical word first
• Giving Priority To Read Misses Over Writes
• Sub-block Placement
Early restart and critical word first
• Do not wait for the entire block to be loaded before restarting the
processor:
❖ Early restart: As soon as the requested word of the block arrives, send
it to the processor and let it continue execution.
❖ Critical word first: Request the missed word from memory first and send
it to the processor as soon as it arrives. The processor continues
execution while the remaining words of the block are filled in.
• These techniques are generally useful only for caches with large blocks.
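The benefit of the two techniques above can be sketched by counting how many
words must arrive from memory before the processor can resume. The policy
names, the wrap-around fill order used for critical word first and the
8-word block are assumptions made for this example.

```python
# Illustrative sketch: words transferred before the processor can restart.
def fill_order(block_size, missed_offset, critical_word_first):
    """Order in which the words of the block arrive from memory."""
    if critical_word_first:
        # Start at the missed word and wrap around the block (assumed order).
        return [(missed_offset + i) % block_size for i in range(block_size)]
    return list(range(block_size))

def words_until_restart(block_size, missed_offset, policy):
    if policy == "wait-for-block":
        return block_size                       # the whole block must be loaded
    order = fill_order(block_size, missed_offset,
                       critical_word_first=(policy == "critical-word-first"))
    return order.index(missed_offset) + 1       # resume when the missed word arrives

BLOCK_SIZE, MISSED = 8, 5
waits = {policy: words_until_restart(BLOCK_SIZE, MISSED, policy)
         for policy in ("wait-for-block", "early-restart", "critical-word-first")}
```

For a miss on word 5 of an 8-word block, the processor waits for 8 words
without either technique, 6 words with early restart and only 1 word with
critical word first. The saving grows with the block size, which is why
these techniques pay off mainly for large blocks.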
8.6.4 Techniques to reduce cache hit time
• Avoiding address translation during cache indexing
• Small and simple caches
• Pipelining writes for fast write hits
Avoiding address translation
A cache that is accessed with a virtual address is called a Virtually Addressed
Cache or just a Virtual Cache. It raises the following issues:
• Every time the process is switched, the cache must be flushed; otherwise
false hits can occur.
• Handling synonyms or aliases:
❖ Two different virtual addresses can map to the same physical address.
❖ The input/output system has to interact with the cache, so it needs
virtual addresses.
• Synonyms or aliases solution
❖ Hardware guarantee: The hardware ensures that every cache block has
a unique physical address.
❖ Software guarantee: The lower n bits of aliased addresses must be the
same; as long as these bits cover the index field and the cache is direct
mapped, aliases map to a unique block. This is called page colouring.
• Cache flush solution
❖ Add a process identifier tag that identifies the process as well as the
address within the process. An access from the wrong process cannot
get a hit.
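The effect of the process identifier tag can be illustrated with a small model of a direct-mapped virtual cache. This is a sketch under stated assumptions: addresses are block addresses (no block offset), and the class and method names are hypothetical rather than taken from any real design.

```python
class VirtualCache:
    """Direct-mapped, virtually addressed cache (illustrative model).

    Each line stores (pid, tag, data). With use_pid=False the pid is ignored
    on lookup, modelling a virtual cache that is not flushed on a switch.
    """

    def __init__(self, num_lines, use_pid):
        self.num_lines = num_lines
        self.use_pid = use_pid
        self.lines = [None] * num_lines

    def _split(self, vaddr):
        # Low-order part selects the line (index); the rest is the tag.
        return vaddr % self.num_lines, vaddr // self.num_lines

    def fill(self, pid, vaddr, data):
        index, tag = self._split(vaddr)
        self.lines[index] = (pid, tag, data)

    def lookup(self, pid, vaddr):
        """Return the cached data on a hit, or None on a miss."""
        index, tag = self._split(vaddr)
        line = self.lines[index]
        if line is None:
            return None
        line_pid, line_tag, data = line
        if line_tag != tag:
            return None
        if self.use_pid and line_pid != pid:
            return None  # wrong process: the access cannot hit
        return data


for use_pid in (False, True):
    cache = VirtualCache(num_lines=16, use_pid=use_pid)
    cache.fill(pid=1, vaddr=0x40, data="data of process 1")
    # Process 2 uses the same virtual address for its own, different data.
    print("PID tag used:", use_pid, "-> process 2 sees:", cache.lookup(2, 0x40))
```

Without the tag, process 2 gets a false hit on data that belongs to process 1; with the tag, the same access misses and the cache does not need to be flushed.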
Self Assessment Questions
11. ____________ affects the clock cycle rate of the processor.
12. Average memory access time = Hit time + Miss rate x ____________ .
8.7 Shared Memory Organisation
The common issues in architecture of shared memory organization are access
control, synchronization, security and protection.
• Access control determines which processes may access which resources.
• Synchronisation constraints limit the times at which sharing processes may
access shared resources.
• Protection prevents processes from making arbitrary accesses to
resources that belong to other processes.
It has become very difficult to improve the performance of superscalar
processors any further by exploiting more instruction-level parallelism (ILP).
A common alternative is to rely on thread-level parallelism (TLP) rather than
ILP. The various forms of TLP are as follows:
• Explicit multithreading
• Chip-level Multiprocessing (CMP)
• Symmetric Multiprocessing (SMP)
• Asymmetric Multiprocessing (ASMP)
• Uniform Memory Access multiprocessing (UMA)
• Non-Uniform Memory Access multiprocessing (NUMA)
• Clustered multiprocessing
• Cache Only Memory Architecture (COMA)
All of the above architectures except clustered multiprocessors provide all
cores in the system with access to a shared physical address space.
A simple architecture of the shared memory organisation is shown in figure
8.7.
Figure 8.7: Shared Memory Organisation (figure label: Shared Bus)
It basically has the following features:
• The bus is usually a simple physical connection.
• The bus bandwidth limits the number of CPUs.
• There could be multiple memory elements in the system
• A single 'on chip' cache is universal.
• A second level cache could be 'on chip', which could be shared as part of
memory system
Designs of shared memory processor
There are various approaches for designing a shared memory processor. The
available design alternatives for a shared memory processor are as follows:
1. No physical sharing: In this memory system organisation, every
processor or a node that consists of more than one processor has its own
private main memory. It can access remote memory connected to other
nodes through interconnection network. This architecture is known as the
non-uniform memory access architecture (NUMA) and is shown in figure
8.8. In NUMA design, the cost of access to local memory is much lower
than for remote memory access.
Figure 8.8: NUMA Architecture
2. Shared main memory: In this memory system organisation, every
processor or core has its own private L1 and L2 caches, but all processors
share the common main memory. Although this was the dominating
architecture for small-scale multiprocessors, some of the recent
architectures abandoned the shared memory organisation and switched to
the NUMA organisation.
3. Shared L1 cache: This design is only used in chips with explicit
multithreading, where all logical processors share a single pipeline.
4. Shared L2 cache: This design minimises the on-chip data replication and
makes more efficient use of cache capacity. Some Chip-level
Multiprocessing (CMP) systems are built with shared L2 caches.
Self Assessment Questions
13. ILP stands for ______________________ .
14. TLP is the abbreviation for ____________________ .
8.8 Interleaved Memory Organisation
Interleaved Memory Organisation (or Memory Interleaving) is a technique
aimed at enhancing the efficiency of memory usage in a system where more
than one data item or instruction must be fetched simultaneously by the CPU,
as in the case of pipeline processors and vector processors. To understand
the concept, let us consider a system with a CPU and a memory as shown in
figure 8.9.
Figure 8.9: Interleaved Memory Organisation
As long as the processor requires a single memory read at a time, the above
memory arrangement with a single MAR, a single MDR, a single address bus
and a single data bus is sufficient. However, if more than one read is required
simultaneously, the arrangement fails. This problem can be overcome by
adding as many address and data bus pairs, with their respective MARs and
MDRs, as there are simultaneous reads. But buses are expensive, since an
equal number of bus controllers is required to carry out the simultaneous reads.
An alternative technique to handle simultaneous reads with comparatively low
cost overheads is memory interleaving. Under this scheme, the memory is
divided into as many modules as the number of simultaneous reads required.
Each module has its own MAR and MDR, but all modules share common data
and address buses. For example, if an instruction pipeline processor requires
two simultaneous reads at a time, the memory is partitioned into two modules
having two MARs and two MDRs, as shown in figure 8.10.
Figure 8.10: MAR and MDR in Interleaved Memory Modules
The memory modules are assigned mutually exclusive memory address
spaces. Suppose, in this case, that memory module 1 is assigned the even
addresses and memory module 2 is assigned the odd addresses. Now, when
the CPU needs two instructions to be fetched from memory, say from
addresses 2 and 3, MAR1 of the first memory module is loaded with address 2.
While the first instruction is being read into MDR1, MAR2 is loaded with
address 3.
When both instructions are ready to be read into the CPU from the
respective MDRs, the CPU reads them one after the other from these two
registers. This is an example of two-way interleaved memory architecture. In
a similar way, an n-way interleaved memory may be designed. Without this
technique, two sets of address and data buses, MARs and MDRs would be
required to achieve the same objective.
This type of modular memory architecture is helpful for systems that use vector
or pipeline processing. By suitably arranging the memory accesses, the
effective memory cycle time can be reduced by a factor of up to the number of
memory modules. The same technique is also employed to enhance the speed
of read/write operations in various secondary storage devices such as hard
disks.
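The even/odd assignment described above is the two-module case of a general rule: the low-order part of the address selects the module and the remaining part selects the word within that module. A minimal sketch follows; the function name is hypothetical, and modules are numbered from 0, so module 0 here corresponds to "memory module 1" in the text.

```python
def interleave(address, num_modules):
    """Map a word address to (module number, word offset inside the module)."""
    return address % num_modules, address // num_modules


# Two-way interleaving: addresses 2 and 3 fall into different modules,
# so the two reads can overlap.
print(interleave(2, 2))
print(interleave(3, 2))

# Four-way interleaving: consecutive addresses rotate through all modules.
print([interleave(a, 4)[0] for a in range(8)])
```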
Self Assessment Questions
15. ________________ is a technique aimed at enhancing the
efficiency of memory usages in a system
16. ________________ share common data and address buses.
8.9 Bandwidth and Fault Tolerance
H. Hellerman (1967) has derived an equation to estimate the effective increase
in memory bandwidth through multiway interleaving. A single memory module
is assumed to deliver one word per memory cycle and thus, has a bandwidth
of 1.
Memory Bandwidth: The memory bandwidth B of an m-way interleaved
memory is upper-bounded by m and lower-bounded by 1. Hellerman estimated
B as:
$$B = m^{0.56} \approx \sqrt{m}$$
In this equation, m is the number of interleaved memory modules. This equation
implies that if 16 memory modules are used, then the effective memory
bandwidth is approximately four times that of a single module. This pessimistic
estimate is due to the fact that block accesses of various lengths and accesses
of single words are randomly mixed in user programs. Hellerman’s estimate was
based on a single-processor system. If memory-access conflicts from multiple
processors, such as the hot spot problem, are considered, the effective
memory bandwidth will be further reduced.
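Taking Hellerman's estimate in its usual form, B = m^0.56, which is approximately the square root of m, the bounds and the 16-module example can be checked numerically. The sketch below is illustrative only; the function name is hypothetical.

```python
import math


def hellerman_bandwidth(m):
    """Estimated effective bandwidth (words per memory cycle) of m modules."""
    return m ** 0.56


for m in (1, 2, 4, 8, 16, 32, 64):
    b = hellerman_bandwidth(m)
    # The estimate always lies between the lower bound 1 and the upper bound m.
    print(f"m = {m:2d}  B = {b:5.2f}  sqrt(m) = {math.sqrt(m):5.2f}")
```

For m = 16 the estimate falls between 4 and 5, in line with the "approximately four times" quoted in the text and well below the upper bound of 16.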
In vector processing, the access time of a long vector with n elements and
stride distance 1 has been estimated by Cragon (1992). It is assumed that
the n elements are stored in contiguous memory locations in an m-way
interleaved memory system. The average time $t_1$ required to access one
element in a vector is estimated by
$$t_1 = \frac{\theta}{m}\left(1 + \frac{m-1}{n}\right)$$
where $\theta$ is the memory cycle time. When $n \to \infty$ (very long vector),
$t_1 \to \theta/m = \tau$. As $n \to 1$ (scalar access), $t_1 \to \theta$.
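The two limiting cases of the average access time can be checked numerically with the formula t1 = (θ/m)(1 + (m − 1)/n). In this sketch, theta stands for the memory cycle time θ; the value theta = 8 and m = 8 are assumed purely for illustration.

```python
def average_access_time(theta, m, n):
    """Average time per element for an n-element, stride-1 vector
    stored in an m-way interleaved memory with cycle time theta."""
    return (theta / m) * (1 + (m - 1) / n)


theta, m = 8.0, 8  # assumed values, in arbitrary time units
for n in (1, 8, 64, 1024, 10**6):
    print(f"n = {n:7d}  t1 = {average_access_time(theta, m, n):.4f}")
```

With these values, t1 starts at theta for a scalar access (n = 1) and approaches theta/m as the vector becomes very long.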
Fault Tolerance: High- and low-order interleaving can be combined to generate
various interleaved memory organisations. In high-order interleaved memory,
sequential addresses are assigned within each memory module.
This makes it easier to isolate faulty memory modules in a memory bank of m
memory modules. When one module failure is detected, the remaining
modules can still be used by opening a window in the address space. This fault
isolation cannot be carried out in a low-order interleaved memory, in which a
module failure may paralyse the entire memory bank. Thus, low-order
interleaved memory is not fault-tolerant.
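Why a failure is easy to isolate in one scheme and not in the other can be seen by listing the addresses that become unusable when a module fails. The sketch below assumes 4 modules of 4 words each; the function names and sizes are hypothetical.

```python
def module_high_order(address, words_per_module):
    """High-order interleaving: the upper part of the address selects the module."""
    return address // words_per_module


def module_low_order(address, num_modules):
    """Low-order interleaving: the lower part of the address selects the module."""
    return address % num_modules


def lost_addresses(failed_module, num_modules, words_per_module, scheme):
    """Addresses that become unusable when one module fails."""
    total = num_modules * words_per_module
    if scheme == "high":
        return [a for a in range(total)
                if module_high_order(a, words_per_module) == failed_module]
    if scheme == "low":
        return [a for a in range(total)
                if module_low_order(a, num_modules) == failed_module]
    raise ValueError(f"unknown scheme: {scheme}")


print("high-order:", lost_addresses(1, 4, 4, "high"))
print("low-order :", lost_addresses(1, 4, 4, "low"))
```

Under high-order interleaving the lost addresses form one contiguous block that can be windowed out of the address space. Under low-order interleaving they are spread across the whole address range, so every block of consecutive addresses is affected.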
Self Assessment Questions
17. The memory bandwidth is upper-bounded by _______________ and
lower-bounded by _________________ .
18. ________________ are assigned in the high-order interleaved
memory in each memory module.
8.10 Consistency Models
Usually the logical data store is physically distributed and replicated across
several processes. A consistency model acts as a contract between the data
store and the processes: the store works correctly only if the processes
follow certain rules. The model specifies how simultaneous reads and writes
to shared memory behave. It is applicable to shared memory multiprocessors,
shared databases and cache coherence algorithms.
Consistency models are divided into two categories: Strong and Weak.
8.10.1 Strong consistency models
In these models, the operations on shared data are synchronised. The various
strong consistency models are:
i) Strict consistency: As the name suggests, it is very strict. In this type of
consistency, any read of a data item returns the value written by the most
recent write to that data item. The main drawback of this consistency
is that it depends on absolute global time.
ii) Sequential consistency: In this type of consistency, the result of any
execution is the same as if the read and write operations of all processes
were executed in some sequential order, and the operations of each
process appear in this sequence in the order specified by its program.
Figure 8.11 shows the Sequential Consistency Model.
Figure 8.11: Sequential Consistency Model
iii) Causal consistency: In causal consistency, writes that are potentially
causally related must be seen in the same order by all processes.
Concurrent writes may be seen in a different order on different machines.
iv) FIFO consistency: In FIFO consistency, writes issued by a single process
are seen by all processes in the order in which they were issued. Writes
from different processes, however, may be seen in a different order by
different processes.
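Sequential consistency can be made concrete by enumerating every interleaving of two small programs that keeps each program's own order, and collecting the results that can occur. The two programs below are a hypothetical example, not taken from the text: each process writes 1 to one shared variable and then reads the other (both variables start at 0).

```python
def interleavings(a, b):
    """Yield every merge of lists a and b that preserves the order within each."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one interleaving against a single shared memory."""
    memory = {"x": 0, "y": 0}
    registers = {}
    for op, var, reg in schedule:
        if op == "write":
            memory[var] = 1
        else:  # "read"
            registers[reg] = memory[var]
    return registers["r1"], registers["r2"]


P1 = [("write", "x", None), ("read", "y", "r1")]  # process 1: x = 1; r1 = y
P2 = [("write", "y", None), ("read", "x", "r2")]  # process 2: y = 1; r2 = x

outcomes = {run(schedule) for schedule in interleavings(P1, P2)}
print(sorted(outcomes))
```

In this enumeration the result (r1, r2) = (0, 0) never appears, because in any single sequential order at least one write precedes both reads. A memory system that can produce (0, 0) for these programs is therefore not sequentially consistent.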
8.10.2 Weak consistency models
In these models, synchronisation takes place only when shared data is locked
and unlocked. The weaker a consistency model is, the easier it is to build a
scalable solution. The various weak consistency models are:
i) General weak consistency: In this consistency, accesses to
synchronisation variables associated with a data store are sequentially
consistent. An operation on a synchronisation variable can be performed
only after all previous writes have completed.
ii) Release consistency: In release consistency, all previous acquire
operations of the process must have completed before read/write
operations can be performed on shared data.
iii) Entry consistency: In entry consistency, a process is not permitted to
acquire a synchronisation variable until the shared data guarded by that
variable has been updated with respect to that process.
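From the programmer's point of view, the weak models share one pattern: acquire a synchronisation variable, access the shared data, then release it. The sketch below shows only this pattern, using Python's standard threading.Lock. It does not model the memory system or any particular consistency protocol, and the class and function names are hypothetical.

```python
import threading


class SharedCounter:
    """Shared data item guarded by a lock (the synchronisation variable)."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:        # acquire before touching the shared data
            self.value += 1    # access to the shared data
        # leaving the with-block releases the lock


def run_workers(num_threads, iterations):
    shared = SharedCounter()

    def worker():
        for _ in range(iterations):
            shared.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared.value


print(run_workers(num_threads=4, iterations=1000))
```

Because every update happens between an acquire and a release, each thread sees the updates made by the previous holder of the lock, and the final count equals the total number of increments.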
Self Assessment Questions
19. _________ model is a contract between processes and a data store.
20. The two categories of consistency models are _____ and ________ .
Activity 2:
Visit an organisation. Find the number of m-interleaved memory modules.
Now, calculate the memory bandwidth using the formula of B.
8.11 Summary
Let us recapitulate the important concepts discussed in this unit:
• Small computers do not require additional storage because they have
limited applications that can be easily fulfilled.
• If the processor finds a referenced word in the cache, the reference is said
to produce a “hit”. If the word is not found in the cache, the reference is a
“miss”.
• The characteristic of cache memory is its fast access time. Hence, very
little time is spent searching for words in the cache.
• When physical address is used in split cache, both data cache and
instruction cache are accessed with a physical address after translation
from MMU.
• Mapping refers to the translation of main memory address to the cache
memory address.
• A computer designer often concentrates on the miss rate when evaluating
memory-hierarchy performance because it is independent of the speed of the
hardware.
• Interleaved Memory Organisation (or Memory Interleaving) is a technique
aimed at enhancing the efficiency of memory usage in a system where
more than one data item or instruction must be fetched simultaneously by
the CPU.
• The memory bandwidth B of an m-way interleaved memory is upper-
bounded by m and lower-bounded by 1.
8.12 Glossary
• Associative Mapping: The fastest and most flexible cache organisation.
• Auxiliary memory: Memory that provides backup storage. It is not directly
accessible by the CPU but is connected to main memory.
• Cache Memory Organisation: A small, fast and costly memory that is
placed between a processor and main memory.
• Main memory: Refers to physical memory that is internal to the computer.
• Memory interleaving: A category of techniques for increasing memory
speed.
• NUMA Multiprocessing: Non-Uniform Memory Access multiprocessing.
• RAM: Random-access memory.
• Split Cache Design: A design where instructions and data are stored in
different caches for execution convenience.
8.13 Terminal Questions
1. Explain memory hierarchy.
2. Explain the meaning of Cache Memory Organisation.
3. Describe the term addressing modes. List the different types of addressing
modes.
4. Define the following terms:
A. Cache Hit
B. Cache Miss
5. What is meant by Direct Mapping? Discuss the various types of Mapping.
6. Explain the concept of Shared Memory Organisation.
8.14 Answers
Self Assessment Questions
1. Main memory
2. Auxiliary memory
3. Cache
4. Hit
5. Unified
6. Translation Lookaside Buffer
7. Mapping
8. Associative
9. Split
10. Instruction Cache
11. Hit time
12. Miss Penalty
13. Instruction-Level Parallelism
14. Thread-Level Parallelism
15. Interleaved Memory Organisation
16. MARs and MDRs
17. m, 1
18. Sequential addresses
19. Consistency
20. Strong, Weak
Terminal Questions
1. Memory hierarchy contains the Cache Memory Organisation. Refer
Section 8.2.
2. A cache memory is an intermediate memory between two memories
having large difference between their speeds of operation. Refer Section
8.2.
3. An addressing mode is a rule that specifies how the address field in the
instruction is translated or changed before the operand is referenced. Refer
Section 8.3.
4. When the addressed data or instruction is found in cache during operation,
it is called a cache hit. When the addressed data or instruction is not found
in cache during operation, it is called a cache miss. Refer Section 8.3.
5. Mapping refers to the translation of main memory address to the cache
memory address. Refer Section 8.4.
6. Shared memory organization is a process by which program processes
can exchange data faster than by reading and writing using the regular
operating system functions. Refer Section 8.7.
Unit 9 Vector Processors
Structure:
9.1 Introduction
Objectives
9.2 Use and Effectiveness of Vector Processors
9.3 Types of Vector Processing
Memory-memory vector architecture
Vector register architecture
9.4 Vector Length and Stride Issues
Vector length
Vector stride
9.5 Compiler Effectiveness in Vector Processors
9.6 Summary
9.7 Glossary
9.8 Terminal Questions
9.9 Answers
9.1 Introduction
In the previous unit, you learnt about memory hierarchy technology and related
aspects such as cache addressing modes, mapping, elements of cache
design, cache performance, shared & interleaved memory organisation,
bandwidth & fault tolerance, and consistency models. In this unit, we will
introduce you to vector processors.
A processor design that can perform mathematical operations on multiple
data elements at the same time is called a vector processor. This is in
contrast to a scalar processor, which handles just a single element at a time.
A vector processor is also called an array processor. Vector processing was
first successfully implemented in the CDC STAR-100 and the Advanced
Scientific Computer (ASC) of Texas Instruments. The vector technique was
first fully exploited in the famous Cray-1. The Cray design had eight vector
registers which held sixty-four 64-bit words each. The Cray-1 usually had a
performance of about 80 MFLOPS (million floating-point operations per
second), but with up to three chains running, it could peak at 240 MFLOPS.
In this unit, you will study the types, uses and effectiveness of vector
processors. You will also study vector length and stride issues, and compiler
effectiveness in vector processors.
Objectives:
After studying this unit, you should be able to:
• state the use and effectiveness of vector processors
• identify the types of vector processing
• describe memory-memory vector architecture
• discuss the use of CDC Cyber 200 model 205 computer
• explain vector register architecture
• recognise the functional units of vector processor
• discuss vector instructions and vector processor implementation (CRAY-1)
• solve vector length and stride issues
• explain compiler effectiveness in vector processors
9.2 Use and Effectiveness of Vector Processors
There is a class of computational problems that is beyond the capabilities of a
conventional computer. These problems are characterised by the large
number of computations they need, which would take a conventional
computer days or even weeks to complete. In many science and engineering
applications, the problems can be formulated in terms of vectors and matrices
that lend themselves to vector processing. Computers with vector processing
capabilities are in demand in specialised applications. Vector processing is of
utmost importance in the following representative application areas:
• Image processing
• Seismic data analysis
• Aerodynamics and space flight simulations
• Long-range weather forecasting
• Medical diagnosis
• Petroleum explorations
• Mapping the human genome
• Artificial intelligence and expert systems
Advantages of Vector Processors
Vector processors provide the following benefits:
1. Vector processors take advantage of data parallelism in large scientific as
well as multimedia applications.
2. Once a vector instruction begins executing, only the functional unit and the
register buses feeding it need to be powered. Power can be turned off for
the fetch unit, decode unit, re-order buffer (ROB), etc. This reduces
power usage.
3. Vector processors can operate on one whole vector in a single
instruction. This reduces the fetch and decode bandwidth required,
because fewer instructions are fetched.
4. In vector processing, programs are small because fewer instructions
are needed.
5. Unlike cache access, vector memory access does not cause any
wastage. Every data item requested by the processor is actually
used.
6. Vector instructions also hide many branches by implementing an entire
loop in a single instruction.
Self Assessment Questions
1. __________ is able to function on one whole vector in a single
instruction.
2. They also take advantage of ____________ in huge scientific as well
as multimedia applications.
9.3 Types of Vector Processing
Depending on the way the operands are fetched, vector processors can be
segregated into following two groups:
Memory-memory vector architecture: In this architecture, operands are
streamed directly from memory to the functional units, and results are written
back to memory as the vector operation proceeds.
Vector-register architecture: In this architecture, operands are read into
vector registers, from which they are fed to the functional units, and the results
of operations are written back to vector registers.
In the next section we will learn more about these two types of processors.
9.3.1 Memory-memory vector architecture
A vector processor that allows vector operands to be fetched directly from
memory into the vector pipelines, and the results to be written directly to
memory, is known as a memory-memory vector processor. Because the
elements of the vector must be fetched from memory rather than from a
register, it takes longer to start a vector operation; this is partly due to the
cost of a memory access. An example of a memory-memory vector processor
is the CDC Cyber 205.
Because of their ability to overlap memory accesses and the possible reuse of
vector operands held in registers, ‘vector-register processors’ are normally
more efficient than ‘memory-memory vector processors’. However, as the
length of the vectors in a computation increases, this difference in efficiency
between the two kinds of architecture diminishes. In fact, memory-memory
vector processors can be more efficient when the vectors are sufficiently long.
However, experience shows that shorter vectors are more commonly used.
Based on the concepts pioneered in the CDC STAR-100, the first commercial
model of the CDC Cyber 205 was delivered in 1981. This supercomputer
is a memory-memory vector machine: it fetches vectors directly from
memory to load the pipelines and stores the pipeline results directly
to memory. It does not contain any vector registers. Consequently,
the vector pipelines have large start-up times. Instead of pipelines designed
for specific operations, the machine has up to four general-purpose
pipelines. It also provides gather and scatter functions. The ETA-10 is an
updated, shared-memory multiprocessor version of the CDC Cyber 205.
The next section provides more detail on the Model 205.
CDC Cyber 200 model 205 computer overview: The Model 205 computer is
a super-scale, high-speed, logical and arithmetic computing system. It utilises
LSI circuits in both the scalar and vector processors that improve performance
to complement the many advanced features that were implemented in the
STAR-100 and CYBER 203 (these are the two Control Data Corp. computers
with built-in vector processors), like hardware macroinstructions, virtual
addressing and stream processing. The Model 205 contains separate scalar
and vector processors particularly designed for sequential and parallel
operations on single bit, 8-bit bytes, and 32-bit or 64-bit floating-point operands
and vector elements.
The central memory of the Model 205 is a high-performance semiconductor
memory with single-error correction, double-error detection (SECDED) on
each 32-bit half word, providing extremely high storage integrity. Virtual
addressing uses a high-speed mapping technique to convert a logical to an
absolute storage address to allow programs to appear logically contiguous
while being physically discontinuous in the storage system.
The basic Model 205 computer consists of the central processor unit (CPU), 1
million 64-bit words of central memory with SECDED, 6 input/output ports, and
a maintenance control unit (MCU). The CPU consists of the scalar processor
and a vector processor with one vector pipeline. Central memory is field-
expandable from one million 64-bit words to two or four million words of
semiconductor memory. The vector pipelines can be expanded to two or four
and the input/output ports are expandable to 16. The Model 205 central
processor consists of all instruction and streaming control, scalar and vector
arithmetic processors, and control for communication with central memory by
the CPU and the input/output channels.
The basic functional areas of the Model 205 CPU are:
• Scalar Processor
• Vector Processor
• Memory Interface
• Maintenance Control Unit
The LSI scalar processor contains a scalar arithmetic unit with independent
high-speed scalar arithmetic functional units. The scalar processor also
contains a semiconductor register file of 256 64-bit words utilised for indexing
and storing constants, instruction and operand addressing and field length
counts. Additionally it also holds operands and results for scalar instructions.
The scalar processor performs instruction control and virtual address
comparison and translation. A feature is provided to select, via an operating
system software installation parameter, a small page size of 512, 2048, or
8192 words. A large page size of 65,536 words is also provided. The vector
processor contains one, two or four parallel, segmented pipelines to facilitate
high-speed vector processing. The vector processor control is contained in the
stream unit. The string and all logical operations are performed in the string
unit. The memory interface provides the read and write ports of central memory
for the scalar and vector processors. Each port contains a one-SWORD (512-
bit Super Word) buffer to facilitate high transfer rates. The CPU processes
input and output by issuing relatively simple high-level messages to high-
speed peripheral stations or a front-end processor connected to the
input/output ports.
9.3.2 Vector register architecture
In a vector-register processor, all vector operations except load and
store are performed among the vector registers. Such architectures are the
vector equivalent of a load-store architecture. Since the late 1980s, all major
vector computers have used a vector-register architecture, including the
Cray Research processors (Cray-1, Cray-2, X-MP, Y-MP, C90, T90 and SV1),
the Japanese supercomputers (NEC SX/2 through SX/5, Fujitsu VP200 through
VPP5000, and the Hitachi S820 and S-8300), and the mini-supercomputers
(Convex C-1 through C-4).
In a memory-memory vector processor, all vector operations are memory to
memory. The first vector computers, including CDC’s vector computers, were
of this kind. Vector-register architectures have several benefits over
memory-memory vector architectures. A memory-memory vector architecture
has to write all intermediate results to memory and later read them back from
memory. A vector-register architecture can keep intermediate results in the
vector registers, close to the vector functional units, reducing temporary
storage requirements, inter-instruction latency and memory bandwidth
requirements.
If a vector result is needed by several other vector instructions, a
memory-memory architecture must read it from memory multiple times,
whereas a vector-register machine can reuse the value from the vector
registers, further reducing memory bandwidth requirements. For these
reasons, vector-register machines have proved more effective in practice.
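The saving in memory bandwidth can be illustrated by counting the words moved to and from memory for a two-step computation, D = (A + B) * C, on vectors of n elements. This is a rough sketch under stated assumptions: only operand traffic is counted, start-up costs are ignored, and the function names are hypothetical.

```python
def memory_words_memory_memory(n):
    """Words transferred when every vector operation is memory to memory."""
    add_traffic = 2 * n + n   # read A and B, write the temporary T = A + B
    mul_traffic = 2 * n + n   # read T and C, write D = T * C
    return add_traffic + mul_traffic


def memory_words_vector_register(n):
    """Words transferred when the intermediate result stays in a vector register."""
    loads = 3 * n   # load A, B and C into vector registers
    stores = n      # store D; the temporary T never goes to memory
    return loads + stores


n = 64
print("memory-memory  :", memory_words_memory_memory(n), "words")
print("vector-register:", memory_words_vector_register(n), "words")
```

In this example the memory-memory organisation moves 6n words, while the vector-register organisation moves 4n words, because the intermediate result is neither written to memory nor read back.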
Components of a vector register processor: The major components of the
vector unit of a vector register machine are as given below:
1. Vector registers: There are many vector registers, so that different vector
operations can be performed in an overlapped manner. Every vector register
is a fixed-length bank that holds one vector of multiple elements, each
64 bits long. There are also many read and write ports. A pair of crossbars
connects these ports to the inputs and outputs of the functional units.
2. Scalar registers: The scalar registers are also linked to the functional
units with the help of the pair of crossbars. They are used for various
purposes such as computing addresses to pass to the vector
load/store unit and as a buffer for input data to the vector registers.
3. Vector functional units: These units are generally floating-point units that
are completely pipelined. They are able to initiate a new operation on each
clock cycle. They comprise all operation units that are utilised by the vector
instructions.
4. Vector load and store unit: This unit can also be pipelined and performs
an overlapped but independent transfer to or from the vector registers.
5. Control unit: This unit decodes instructions and coordinates the functional
units. It can detect data hazards as well as structural hazards. Data hazards
are conflicts in register accesses, while structural hazards are conflicts
in the use of functional units.
Figure 9.1 gives you a clear picture of the above mentioned functional units
of vector processor.
Figure 9.1: Vector Register Architecture
Types of Vector Instructions: The various types of vector instructions for a
register-register vector processor are:
(a) Vector-scalar instructions
(b) Vector-vector instructions
(c) Vector-memory instructions
(d) Gather and scatter instructions
(e) Masking instructions
(f) Vector reduction instructions
Let us discuss these.
(a) Vector-scalar instructions: These instructions combine a scalar operand
with a vector operand. If A and B are vector registers and f is a function
that is applied element by element, a vector-scalar operation can be
defined as follows:
Ai := f (scalar, Bi)
(b) Vector-vector instructions: These instructions fetch one or two vector
operands from their vector registers and place the result in another
vector register. If A, B and C are three vector registers, a vector-vector
operation can be defined as follows:
Ai := f (Bi, Ci)
(c) Vector-memory instructions: These instructions correspond to vector
load or vector store. The vector load can be defined as follows:
A := f (M) where M is a memory register
The vector store can be defined as follows:
M := f (A)
(d) Gather and scatter instructions: Gather is an operation that fetches the
non-zero elements of a sparse vector from memory as defined below:
A x V0 := f (M)
Scatter stores a vector into memory as a sparse vector as defined below:
M := f (A x V0)
Here V0 is the vector register that holds the indices of the non-zero elements.
(e) Masking instructions: These instructions use a mask vector to expand
or compress a vector as defined below:
V := f (A x VM) where VM is a mask vector
(f) Vector reduction instructions: These instructions accept one or two
vectors as input and produce a scalar as output.
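The six instruction types listed above can be illustrated with a minimal Python sketch. It is only a model: plain lists stand in for vector registers and for memory, and the function names (vector_scalar, vector_vector, gather, scatter, compress, vector_reduce) are hypothetical names chosen for this illustration, not instructions of any real machine.

```python
from functools import reduce
import operator

def vector_scalar(f, scalar, b):
    """Vector-scalar: A[i] := f(scalar, B[i])."""
    return [f(scalar, bi) for bi in b]

def vector_vector(f, b, c):
    """Vector-vector: A[i] := f(B[i], C[i])."""
    return [f(bi, ci) for bi, ci in zip(b, c)]

def gather(memory, index):
    """Gather: fetch memory[index[i]] for every i into a dense vector."""
    return [memory[i] for i in index]

def scatter(memory, index, a):
    """Scatter: store a[i] at memory[index[i]], modifying memory in place."""
    for i, value in zip(index, a):
        memory[i] = value

def compress(a, mask):
    """Masking (compress): keep the elements of a whose mask bit is set."""
    return [ai for ai, m in zip(a, mask) if m]

def vector_reduce(f, a):
    """Reduction: combine all elements of a into one scalar."""
    return reduce(f, a)

B = [1.0, 2.0, 3.0, 4.0]
C = [10.0, 20.0, 30.0, 40.0]
scaled = vector_scalar(operator.mul, 2.0, B)
summed = vector_vector(operator.add, B, C)

# A sparse vector in memory and the indices of its non-zero elements.
sparse = [0.0, 5.0, 0.0, 0.0, 7.0, 0.0, 9.0, 0.0]
index = [i for i, x in enumerate(sparse) if x != 0.0]
dense = gather(sparse, index)

# Scattering the dense vector back rebuilds the sparse layout.
restored = [0.0] * len(sparse)
scatter(restored, index, dense)

kept = compress(B, [1, 0, 1, 0])
total = vector_reduce(operator.add, B)
```

In this sketch the index list plays the role of the index vector used by gather and scatter, and the 0/1 list plays the role of the mask vector.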
Vector processor implementation (CRAY-1): CRAY-1 is one of the earliest
processors to implement vector processing and is often described as the
world's first vector supercomputer. It was introduced in 1975 by Seymour Cray.
It is basically a register-oriented, RISC-like machine that requires all operands to
be in registers. It has five kinds of registers:
(a) A registers: A set of 8 24-bit registers
(b) B registers: A set of 64 24-bit registers
(c) S registers: A set of 8 64-bit registers
(d) T registers: A set of 64 64-bit registers
(e) Vector registers: A set of 8 64-element floating point registers
There are 12 functional units in CRAY-1:
(a) 2 24-bit units for address calculation
(b) 4 64-bit integer scalar units for integer operations
(c) 6 deeply pipelined units for vector operations
CRAY-1 uses 16-bit instructions. All vector operations can be executed in one
16-bit instruction. The block diagram of the CRAY-1 architecture is shown in figure
9.2.
Figure 9.2: Architecture of CRAY-1
Self Assessment Questions
3. ______________ is a later shared-memory multiprocessor
version of the CDC Cyber 205.
4. Memory-memory vector processors can be much more efficient
when the vectors are sufficiently long. (True/False)
5. The scalar registers are linked to the functional units with the help of a
pair of ___________________ .
6. ___________ correspond to vector load or vector store.
7. Functional hazards are the conflicts in register accesses. (True/False)
Activity 1:
Find out more about a recent vector thread processor which comes in two
parts: the control processor, known as Rocket, and the vector unit, known as
Hwacha.
9.4 Vector Length and Stride Issues
This section discusses two issues that occur in real programs. The first is the
case in which the vector length in a program is not exactly 64. The second is the
way non-adjacent elements of vectors that reside in memory are handled.
First, let us study the issue of vector length.
9.4.1 Vector length
So far, we have not said anything about the actual vector size.
We simply assumed that the size of the vector register equals the size of
the vector we operate on. This is not always true. In particular, there are
two cases to handle:
• One in which the vector size is less than the vector register size, and
• The second in which the vector size is larger than the vector register
size.
To be concrete, we assume 64-element vector registers as provided by
the Cray systems. Let us look at the easier of these two problems first.
Handling smaller vectors: If the vector size is less than 64, we have to
tell the system that it should not operate on all 64 elements
in the vector registers. This is done by using the vector length
register. The Vector Length (VL) register holds the appropriate vector length.
All vector operations are performed on the first VL elements (in other
words, elements 0 to VL - 1). The following two instructions
load values into the VL register:
VL 1 (VL = 1)
VL Ak (VL = Ak, where k ≠ 0)
For instance, if the vector length is 40, the code given
below can be used to add two vectors held in registers V3 and V4:
A1 40 (A1 = 40)
VL A1 (VL = 40)
V2 V3+FV4 (V2 = V3 + V4)
Since we cannot write
VL 40
we must use the two-instruction sequence to load 40 into the VL register.
The last instruction specifies floating-point addition of vectors V3 and V4. As
VL is 40, only the first 40 elements are added. Table 9.1 below shows a
sample of Cray X-MP instructions.
Table 9.1: Sample Cray X-MP Instructions
Instruction | Meaning | Description
Vi Vj+Vk | Vi = Vj+Vk (integer add) | Add corresponding elements (in the range 0 to VL-1) of the Vj and Vk vectors and place the result in vector Vi
Vi Sj+Vk | Vi = Sj+Vk (integer add) | Add the scalar Sj to each element (in the range 0 to VL-1) of the Vk vector and place the result in vector Vi
Vi Vj+FVk | Vi = Vj+Vk (floating-point add) | Add corresponding elements (in the range 0 to VL-1) of the Vj and Vk vectors and place the floating-point result in vector Vi
Vi Sj+FVk | Vi = Sj+Vk (floating-point add) | Add the scalar Sj to each element (in the range 0 to VL-1) of the Vk vector and place the floating-point result in vector Vi
Vi ,A0,Ak | Vi = M(A0; Ak) (vector load with stride Ak) | Load into elements 0 to VL-1 of vector register Vi from memory, starting at address A0 and incrementing addresses by Ak
Vi ,A0,1 | Vi = M(A0; 1) (vector load with stride 1) | Load into elements 0 to VL-1 of vector register Vi from memory, starting at address A0 and incrementing addresses by 1
,A0,Ak Vi | M(A0; Ak) = Vi (vector store with stride Ak) | Store elements 0 to VL-1 of vector register Vi in memory, starting at address A0 and incrementing addresses by Ak
,A0,1 Vi | M(A0; 1) = Vi (vector store with stride 1) | Store elements 0 to VL-1 of vector register Vi in memory, starting at address A0 and incrementing addresses by 1
Vi Vj&Vk | Vi = Vj & Vk (logical AND) | Perform a bitwise AND on corresponding elements (in the range 0 to VL-1) of the Vj and Vk vectors and place the result in vector Vi
Vi Sj&Vk | Vi = Sj & Vk (logical AND) | Perform a bitwise AND of the scalar Sj with elements 0 to VL-1 of Vk and place the result in vector Vi
Vi Vj>Ak | Vi = Vj >> Ak (right-shift by Ak) | Right-shift elements 0 to VL-1 of Vj by Ak and place the result in vector Vi
Vi Vj<Ak | Vi = Vj << Ak (left-shift by Ak) | Left-shift elements 0 to VL-1 of Vj by Ak and place the result in vector Vi
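The effect of the VL register on a vector instruction such as the floating-point add can be sketched in Python as follows. This is a simplified model under stated assumptions: lists of 64 numbers stand in for the vector registers, the function name vector_add is hypothetical, and the sketch assumes that destination elements beyond VL - 1 are simply left as they were.

```python
REGISTER_LENGTH = 64  # elements per vector register, as assumed in the text

def vector_add(vi, vj, vk, vl):
    """Model of a vector add: only elements 0 to vl - 1 take part."""
    if not 0 <= vl <= REGISTER_LENGTH:
        raise ValueError("VL must lie between 0 and 64")
    for i in range(vl):
        vi[i] = vj[i] + vk[i]
    # Assumption of this sketch: elements vl..63 of vi are left untouched.
    return vi

v2 = [0.0] * REGISTER_LENGTH                      # destination register
v3 = [float(i) for i in range(REGISTER_LENGTH)]   # first operand
v4 = [1.0] * REGISTER_LENGTH                      # second operand

vl = 40                      # corresponds to loading 40 into VL
vector_add(v2, v3, v4, vl)   # only the first 40 elements are added
```

After the call, v2 holds the sums in elements 0 to 39, while elements 40 to 63 keep their earlier contents.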
Handling larger vectors: Smaller vector sizes can be handled by the VL
register, but this does not help with vectors larger than the vector register.
For instance, if we have 200-element vectors (i.e., N = 200), how can the vector
instructions be used to add two such vectors? Larger
vectors are handled by a method called strip mining.
In strip mining, the vector is divided into strips of 64 elements. This may
leave one odd-size piece of fewer than 64 elements, whose size
is given by N mod 64. Each strip is loaded into
a vector register and the vector addition instruction is then executed on it.
The number of strips is ⌈N/64⌉, which equals ⌊N/64⌋ + 1 whenever N is not a
multiple of 64. In this example, the
200 elements are divided into four pieces:
• Three pieces contain 64 elements each.
• One odd piece contains 8 elements.
A loop that iterates four times is then used: VL is set to 8 in one
of the iterations, and to 64 in the remaining three iterations.
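The strip-mining loop described above can be sketched in Python. This is an illustrative model only: the names strip_lengths and strip_mined_add are hypothetical, and the choice to process the odd-size piece first is an assumption of the sketch (it could equally be processed last).

```python
MVL = 64  # maximum vector length, i.e. the vector register size

def strip_lengths(n, mvl=MVL):
    """Return the VL value used in each iteration for an n-element vector."""
    lengths = []
    odd = n % mvl                 # size of the odd piece, N mod 64
    if odd:
        lengths.append(odd)       # assumption: odd piece handled first
    lengths.extend([mvl] * (n // mvl))
    return lengths

def strip_mined_add(x, y, mvl=MVL):
    """Add two equal-length vectors strip by strip."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    result = []
    start = 0
    for vl in strip_lengths(len(x), mvl):
        # One "vector instruction" operating on vl elements.
        xs = x[start:start + vl]
        ys = y[start:start + vl]
        result.extend(a + b for a, b in zip(xs, ys))
        start += vl
    return result

lengths_200 = strip_lengths(200)   # one strip of 8, three strips of 64
lengths_128 = strip_lengths(128)   # exact multiple: no odd piece
total = strip_mined_add(list(range(200)), [1] * 200)
```

For N = 200 the loop runs four times, and for N = 128 it runs only twice because there is no odd piece.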
9.4.2 Vector stride
To understand vector stride, we have to know how elements are stored in memory.
Let us first look at vectors. Because vectors are one-dimensional
arrays, storing a vector in memory is straightforward: vector
elements are stored as sequential words in memory. If we wish to fetch
40 elements, 40 contiguous words have to be read from memory. Such
elements are said to have a stride of 1, i.e., to reach the next
element, we add 1 to the index of the current element. Note that
the distance between consecutive elements is measured in number of
elements and not in bytes.
We need non-unit vector strides for multidimensional arrays. To
see why, consider two-dimensional matrices. To
store a two-dimensional matrix in memory, we must linearise
it. This can be done in one of two ways: row-major or column-major
order. Many languages, C among them, use
row-major order, in which elements are stored row by row:
row 0, row 1, row 2, and so on. In column-major order, which is used by
FORTRAN, elements are stored column by column:
column 0, column 1, and so on. For instance, consider the 4 x 4 matrix below:
Figure 9.3: Memory Layout of Vector A.
Such a matrix is stored in memory as shown in figure 9.3. Assuming row-major
order, suppose we want to access all the elements of
column 0. These elements are clearly not stored contiguously: we
have to access elements 0, 4, 8, and 12 of the memory array.
Since successive elements are 4 elements apart, we
say that the stride is 4. Vector machines provide load and store instructions
that take the stride into account. As Table 9.1 shows, the Cray
X-MP supports both unit and non-unit stride access. For instance,
the instruction
Vi ,A0,Ak
loads vector register Vi from memory starting at address A0 with stride Ak.
As unit stride is very common, a special instruction
Vi ,A0,1
is provided. Similar instructions exist for storing vectors in memory.
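A strided vector load can be sketched in Python as follows. The matrix values are made up for illustration, and the names linearise_row_major and strided_load are hypothetical; a list stands in for word-addressed memory.

```python
def linearise_row_major(matrix):
    """Store a 2-D matrix row by row in a flat memory array."""
    return [x for row in matrix for x in row]

def strided_load(memory, base, stride, vl):
    """Model of a vector load: read vl elements starting at base,
    advancing the address by stride elements each time."""
    return [memory[base + i * stride] for i in range(vl)]

# Illustrative 4 x 4 matrix; entry "rc" is in row r, column c.
matrix = [
    [11, 12, 13, 14],
    [21, 22, 23, 24],
    [31, 32, 33, 34],
    [41, 42, 43, 44],
]
memory = linearise_row_major(matrix)

row0 = strided_load(memory, base=0, stride=1, vl=4)   # unit stride
col0 = strided_load(memory, base=0, stride=4, vl=4)   # elements 0, 4, 8, 12
col2 = strided_load(memory, base=2, stride=4, vl=4)   # elements 2, 6, 10, 14
```

With row-major storage a row is read with stride 1, whereas a column of a 4 x 4 matrix needs stride 4.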
Self Assessment Questions
8. The instance of larger vectors is dealt with by a method called ____________ .
9. Vector elements are saved in the form of ____________ in memory.
9.5 Compiler Effectiveness in Vector Processors
Two factors determine how successfully a program can be run in vector mode.
The first factor is the structure of the program itself: do the
loops contain true data dependences, or can they be restructured so
that they have no such dependences? This factor is influenced by the
algorithms chosen and, to some degree, by the way they are
coded. The second factor is the capability of the compiler. No compiler
can vectorise a loop that has no parallelism among its
iterations, but there is great variation in the ability of compilers to
determine whether a loop can be vectorised.
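The difference between a loop with independent iterations and a loop with a true data dependence can be sketched in Python. The function names are hypothetical, and the "vector" versions merely model a vector instruction by reading all operands before writing any result.

```python
def independent_scalar(b, c):
    """a[i] = b[i] + c[i]: no iteration uses a result of another iteration."""
    a = [0] * len(b)
    for i in range(len(b)):
        a[i] = b[i] + c[i]
    return a

def independent_vector(b, c):
    """The same loop executed as one vector operation."""
    return [x + y for x, y in zip(b, c)]

def recurrence_scalar(a):
    """a[i] = a[i-1] + a[i]: each iteration reads the value written by
    the previous one (a true, loop-carried data dependence)."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1] + a[i]
    return a

def recurrence_naive_vector(a):
    """Naive vectorisation of the recurrence: all operands are read
    before any result is written, so the old values of a[i-1] are used."""
    old = list(a)
    return [old[0]] + [old[i - 1] + old[i] for i in range(1, len(old))]

b = [1, 2, 3, 4]
c = [10, 20, 30, 40]
same = independent_scalar(b, c) == independent_vector(b, c)

correct = recurrence_scalar(b)        # running sums
wrong = recurrence_naive_vector(b)    # differs from the scalar loop
```

The first loop gives the same answer either way, so it can be vectorised safely. The second does not, which is why a compiler must detect such dependences before vectorising a loop.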
The techniques used for vectorising programs are similar to those used for
exposing ILP; here we simply review how well these techniques perform. As an
indication of the level of vectorisation that can be achieved in scientific
programs, let us look at the vectorisation levels observed for the Perfect Club
benchmarks. These benchmarks are large, real scientific applications. Figure 9.4
shows the percentage of operations executed in vector mode for two versions of the
code running on the Cray Y-MP.
Benchmark name | Operations executed in vector mode, compiler-optimised | Operations executed in vector mode, hand-optimised | Speedup from hand optimisation
BDNA | 96.1% | 97.2% | 1.52
MG3D | 95.1% | 94.5% | 1.00
FLO52 | 91.5% | 88.7% | N/A
ARC3D | 91.1% | 92.0% | 1.01
SPEC77 | 90.3% | 90.4% | 1.07
MDG | 87.7% | 94.2% | 1.49
TRFD | 69.8% | 73.7% | 1.67
DYFESM | 68.8% | 65.6% | N/A
ADM | 42.9% | 59.6% | 3.60
OCEAN | 42.8% | 91.2% | 3.92
TRACK | 14.4% | 54.6% | 2.52
SPICE | 11.5% | 79.9% | 4.06
QCD | 4.2% | 75.1% | 2.15
Figure 9.4: Level of Vectorisation among the Perfect Club Benchmarks
when executed on the Cray Y-MP
The first version is obtained with compiler optimisation alone on the
original code, whereas the second version has been extensively hand-optimised
by a team of Cray Research programmers. The wide variation
in the level of compiler vectorisation has been observed in several studies of the
performance of applications on vector processors. The hand-optimised versions
generally show significant gains in vectorisation level for codes that the
compiler could not vectorise well by itself, with all codes now
above 50% vectorisation. Interestingly, for some benchmarks (MG3D and DYFESM,
for example) the hand-optimised code has a lower vectorisation level than the
compiler-optimised code. The vectorisation level is
therefore not sufficient by itself to determine performance.
Alternative vectorisation techniques might execute fewer instructions,
keep more values in vector registers, or allow more chaining and overlap
among vector operations, and thus improve performance even if
the vectorisation level stays the same or drops.
For instance, BDNA has almost the same vectorisation level in the two
versions, yet the hand-optimised code is more than 50% faster. There is
also wide variation in how well different compilers vectorise
programs. As a summary of the state of vectorising compilers, consider the data in
figure 9.5, which shows the extent of vectorisation for different processors
using a test suite of 100 handwritten FORTRAN kernels.
Processor | Compiler | Completely vectorised | Partially vectorised | Not vectorised
CDC CYBER 205 | VAST-2 V2.21 | 62 | 5 | 33
Convex C-series | FC5.0 | 69 | 5 | 26
Cray X-MP | CFT77 V3.0 | 69 | 3 | 28
Cray X-MP | CFT V1.15 | 50 | 1 | 49
Cray-2 | CFT2 V3.1a | 27 | 1 | 72
ETA-10 | FTN 77 V1.0 | 62 | 7 | 31
Hitachi S810/820 | FORT77/HAP V20-2B | 67 | 4 | 29
IBM 3090/VF | VS FORTRAN V2.4 | 52 | 4 | 44
NEC SX/2 | FORTRAN77 / SX V.040 | 66 | 5 | 29
Figure 9.5: Result of applying Vectorising Compilers to the 100 FORTRAN Test
Kernels
The kernels were designed to test vectorisation capability, and all of them can
be vectorised by hand.
Self Assessment Questions
10. List two factors which enable a program to run successfully in vector
mode.
11. There does not exist any variation in the capability of compilers to decide
if a loop can be vectorised. (True/False)
Activity 2:
Visit your local computer vendor and get an expert opinion about vector
processors and their working.
9.6 Summary
Vector processing is of the utmost importance in several representative
application areas. Depending upon the way the operands are fetched,
vector processors can be divided into two groups:
• Memory-memory vector architecture: operands are streamed directly from
memory to the functional units and results are written back to memory as the
vector operation proceeds.
• Vector-register architecture: operands are read into vector registers, from
which they are fed to the functional units, and the results of operations are
written to vector registers.
• Vector register architectures have several advantages over vector
memory-memory architectures.
• The vector unit of a register-register vector machine has several major
components.
• The various types of vector instructions for a register-register vector
processor are:
■ Vector-scalar Instructions
■ Vector-vector Instructions
■ Vector-memory Instructions
■ Gather and Scatter Instructions
■ Masking Instructions
■ Vector Reduction Instructions
• CRAY-1 is one of the earliest processors that implemented vector
processing.
• Two issues arise in real programs: (i) the vector length in a program is not
exactly 64; (ii) non-adjacent elements of vectors that reside in memory must be
accessed.
• The structure of the program and the capability of the compiler are the two
factors that affect the success with which a program can be run in vector mode.
9.7 Glossary
• ASC: Advanced Scientific Computer
• Data hazards: the conflicts in register accesses
• ETA-10: A later shared-memory multiprocessor version of the CDC Cyber
205.
• Functional hazards: the conflicts in functional units.
• Gather: an operation that fetches the non-zero elements of a sparse vector
from memory.
• Masking instructions: These instructions use a mask vector to expand or
compress a vector
• Scatter: an operation that stores the elements of a vector into memory as a sparse vector.
• SECDED: single-error correction, double-error detection.
• Small scale integration: it can pack 10 to 20 transistors in a single chip.
• Strip mining: the vector is partitioned into strips of 64 elements.
• Vector reduction instructions: These instructions accept one or two
vectors as input and produce a scalar as output.
9.8 Terminal Questions
1. Explain the importance of Vector Processors.
2. What are the different types of Vector Processing?
3. How is vector register architecture more advantageous over memory-
memory vector architecture?
4. Write short notes on:
a) CDC Cyber 200 model 205 computer overview
b) CRAY-1
c) Vector Length
d) Vector Stride
5. List the various functional units of Vector Processor and explain each one
in brief.
6. Explain the various types of vector instructions in detail.
7. How effective is the compiler in vector processors?
9.9 Answers
Self Assessment Questions
1. Vector processors
2. Data parallelism
3. ETA-10
4. True
5. Crossbars
6. Vector-memory instructions
7. False
8. Strip mining
9. Sequential words
10. Structure of the program & capability of the compiler
11. False
Terminal Questions
1. There are various application areas of vector processors which are of
considerable importance. Refer Section 9.2.
2. Depending upon the way the operands are fetched, vector processors can
be segregated into two groups: Memory-memory vector architecture and
Vector-register architecture. Refer Section 9.3.
3. Owing to the ability to overlap memory accesses and the possible
reuse of vector operands, vector-register vector processors are
normally more efficient than memory-memory vector
processors. Refer Section 9.3.
4. a. The CDC Cyber 205 is based on the concepts initiated for the CDC Star
100; the first commercial model was produced in 1981. Refer Section
9.3.
b. CRAY-1 is one of the earliest processors that implemented vector
processing. Refer Section 9.3.
c. The vector size may be less than the vector register size, or it
may be larger than the vector register size. Refer Section
9.4.
d. As vectors are one-dimensional arrays, storing a vector in memory is
straightforward: vector elements are stored as sequential words in memory.
Refer Section 9.4.
5. The major components of the vector unit of a register-register vector
machine are Vector Registers, Vector Functional Units, Scalar Registers
etc. Refer Section 9.3.
6. The various types of vector instructions for a register-register vector
processor are: (Refer Section 9.3.)
a. Vector-scalar Instructions
b. Vector-vector Instructions
c. Vector-memory Instructions
d. Gather and Scatter Instructions
e. Masking Instructions
f. Vector Reduction Instructions
7. As an indication of the vectorisation level that can be achieved in scientific
programs, we should look at the vectorisation levels observed for the Perfect
Club benchmarks. Refer Section 9.5.
Unit 10 SIMD Architecture
Structure:
10.1 Introduction
Objectives
10.2 Parallel Processing: An Introduction
10.3 Classification of Parallel Processing
10.4 Fine-Grained SIMD Architecture
An example: the massively parallel processor
Programming and applications
10.5 Coarse-Grained SIMD Architecture
An example: the CM5
Programming and applications
10.6 Summary
10.7 Glossary
10.8 Terminal Questions
10.9 Answers
10.1 Introduction
In the previous unit, you studied the use and effectiveness of vector
processors. You also studied the vector register architecture, vector length
and stride issues, and the concept of compiler effectiveness in
vector processors. In this unit, we will throw light on data parallel
architecture and the SIMD design space. We will also study the types of SIMD
architecture. Instruction execution in conventional computers is
sequential, which imposes a time constraint: while one program is being
executed, other tasks have to wait until it finishes. In
parallel processing, a program is divided into segments that are handled at the
same time: while one segment is being
processed, another can be fetched from memory and the results of segments
already processed can be printed, all at the same time. The
purpose of parallel processing is to reduce the execution time and hence to
speed up data processing.
Parallel processing can be achieved by dividing the data among different
units that work simultaneously, with timing and
sequencing governed by the control unit so that the desired result is obtained in
the minimum amount of time.
Objectives:
After studying this unit, you should be able to:
• discuss the concept of data parallel architecture
• describe the SIMD design space
• identify the types of SIMD architecture
• recognise fine grained SIMD architecture
• explain coarse grained SIMD architecture
10.2 Parallel Processing: An Introduction
Parallel processing is a basic part of our everyday life. The concept is
so natural that we use it without even realising it. When
we face a crisis, we ask others for help and involve them so that it can be solved
more easily. This cooperation of two or more helpers to ease the
solution of a problem may be termed parallel processing. The aim of
parallel processing is therefore to solve a particular problem more rapidly, or
to enable the solution of a problem that could not otherwise be
solved by one person. The principles of parallel processing are not
recent: evidence suggests that computational devices used over 2000
years ago also applied them.
However, early computer developers quickly identified two obstacles
to the widespread acceptance of parallel machines: the complexity of
construction and the seemingly high programming effort required. As a result
of these early setbacks, the developmental thrust shifted to computers with a
single computing unit. In addition, the availability of sequential machines
led to the development of algorithms and techniques optimised for these
architectures. The evolution of serial computers may finally be
reaching its peak because of the physical limits on their design and
performance. As consumers and end-users
demand better performance, computer designers have started considering
parallel approaches to overcome these limitations.
All contemporary computer architectures include some degree of parallelism
in their designs. Better hardware design and assembly, together with an
increasing understanding of how to deal with the difficulties of parallel
programming, have placed parallel processing at the forefront of
computer technology. Traditionally, software has been written for serial
computing, to be executed on a single computer having only one CPU (Central
Processing Unit). In serial computing, a problem is broken into
a discrete series of instructions. The instructions are executed
one after another, and only one instruction is executed at any given time,
as shown in figure 10.1.
Figure 10.1: An Example of Serial Computing
In parallel computing, multiple compute resources are used simultaneously
to solve a computational problem, for example multiple
CPUs. A problem is broken into discrete parts that can be solved concurrently.
Each part is further broken down into a series of instructions, and instructions
from each part execute simultaneously on different CPUs as shown in figure
10.2.
Thus, a computer system is said to be a Parallel Processing
System or Parallel Computer if it provides facilities for simultaneous
processing of various sets of data or simultaneous execution of multiple
instructions.
On a computer with more than one processor, each of several processes can
be assigned to its own processor so that the processes progress
simultaneously. If only one processor is available, the effect of parallel
processing can be simulated by having the processor run each process in turn
for a short time.
Parallel processing on a multiprocessor computer is said to be true parallel
processing, and parallel processing on a uni-processor computer is said to be
simulated or virtual parallel processing. Figure 10.3 illustrates the
difference between true and virtual parallel processing.
Figure 10.3: (a) Serial Processing (b) True Parallel Processing with Multiple
Processors (c) Parallel Processing Simulated by Switching one Processor
among Three Processes
Figure 10.3 (a) represents serial processing: the next process
starts only after the previous process has completed. In figure 10.3 (b), all
three processes run at the same time, each on its own processor. In figure 10.3
(c), all three processes also appear to progress together, but the single CPU
switches from one process to another, so each process
gets only about one-third of the processor time. While one process is
running, all the other processes must wait for their turn. At any
instant, therefore, only one process is running in figure 10.3 (c) and the others
are waiting, whereas in figure 10.3 (b) all three processes are
running at every instant. This is why parallel processing on a uni-processor
system, as shown in figure 10.3 (c), is called virtual parallel processing.
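The three cases of figure 10.3 can be modelled with a small Python sketch. It is a simplified illustration: each process needs a fixed number of time units, a timeline records which processes run in each time unit, and the function names and the three-unit workload are assumptions made for this example.

```python
from collections import deque

def serial(work):
    """(a) Run each process to completion before starting the next."""
    timeline = []
    for name, units in work.items():
        timeline.extend([(name,)] * units)
    return timeline

def true_parallel(work):
    """(b) One processor per process: all unfinished processes run together."""
    steps = max(work.values())
    return [tuple(n for n, u in work.items() if u > t) for t in range(steps)]

def time_sliced(work, quantum=1):
    """(c) One processor switched round-robin among the processes."""
    remaining = dict(work)
    queue = deque(work)
    timeline = []
    while queue:
        name = queue.popleft()
        run = min(quantum, remaining[name])
        timeline.extend([(name,)] * run)   # only this process runs now
        remaining[name] -= run
        if remaining[name] > 0:
            queue.append(name)             # wait for the next turn
    return timeline

work = {"P1": 3, "P2": 3, "P3": 3}   # time units needed by each process
serial_timeline = serial(work)
parallel_timeline = true_parallel(work)
sliced_timeline = time_sliced(work)
```

In this model the time-sliced schedule takes as long as the serial one in total, but every process makes progress early on, which is what gives the impression of parallel execution on a single processor.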
Self Assessment Questions
1. A problem is broken into a discrete series of _____________ .
2. ______________ provides facilities for simultaneous processing of
various set of data or simultaneous execution of multiple instructions.
3. Parallel processing in multiprocessor computer is said to be ____________
parallel processing.
4. Parallel processing in uni-processor computer is said to be ____________
parallel processing.
10.3 Classification of Parallel Processing
The core elements of parallel processing are CPUs. The essential computing
process is the execution of a sequence of instructions on a set of data. The term
stream is used here to denote a sequence of items (instructions or data) as
executed or operated on by a single processor or multiprocessor. Based on the
number of instruction and data
streams that can be processed simultaneously, Flynn classifies computer
systems into four categories. The matrix in figure 10.4 shows the four possible
classifications according to Flynn.
SISD: Single Instruction, Single Data | SIMD: Single Instruction, Multiple Data
MISD: Multiple Instruction, Single Data | MIMD: Multiple Instruction, Multiple Data
Figure 10.4: Flynn’s Classification of Computer System
In this chapter, our main focus will be Single Instruction Multiple Data (SIMD).
Single Instruction Multiple Data (SIMD)
The term single instruction implies that all processing units execute the same
instruction in any given clock cycle. The term multiple data
implies that each processing unit can work on a different data
element. Generally, this type of machine has one instruction dispatcher, a very
large array of very small-capacity instruction units and a very high-bandwidth
network. This type is suitable for specialised problems
characterised by a high degree of regularity, for example, image processing.
Figure 10.5 shows a case of SIMD processing.
Modern microprocessors can also execute the same instruction on multiple
data items. This is called Single Instruction Multiple Data (SIMD). SIMD instructions
can operate on floating-point numbers and provide significant speedups in
suitable algorithms. As the execution units for SIMD instructions typically belong
to a physical core, as many SIMD instructions can run in parallel as there are
physical cores. Using these vector-processing
capabilities in parallel can therefore give significant speedups in certain
algorithms.
Adding SIMD instructions and hardware to a CPU is a
more drastic change than adding floating-point capability. Since its
inception, the microprocessor has been a SISD device. SIMD is also referred to as
vector processing because its fundamental unit of organisation is the vector.
This is shown in figure 10.6:
Figure 10.6: Scalars and Vectors
A normal CPU operates on scalars, one at a time. A superscalar CPU
operates on multiple scalars at a given moment, but it performs a different
operation on each of them. A vector processor, on the other hand, lines up
a whole row of scalars of the same type and operates on them as a single
unit. Figure 10.7 shows the difference between SISD and SIMD.
Figure 10.7: SISD vs. SIMD
Modern superscalar SISD machines exploit a property of the instruction stream
called ‘instruction-level parallelism’. This means that multiple instructions
can be executed at the same time on the same data stream.
A SIMD machine exploits a property of the data stream called ‘data parallelism’.
In this context, you get data parallelism when you have a large
mass of uniform data that needs the same instruction performed on it. Therefore,
a SIMD machine is an entirely different class of machine from the normal
microprocessor.
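The contrast between SISD and SIMD execution can be sketched in Python by counting how many instructions are issued to add two arrays. The model is deliberately simple: the lane count of 4 and the function names sisd_add and simd_add are assumptions made for this illustration, and a Python loop only imitates what the hardware does in a single step.

```python
LANES = 4  # assumed number of data elements handled by one SIMD instruction

def sisd_add(x, y):
    """SISD: one add instruction per pair of scalars."""
    result = []
    instructions = 0
    for a, b in zip(x, y):
        result.append(a + b)
        instructions += 1
    return result, instructions

def simd_add(x, y, lanes=LANES):
    """SIMD: one add instruction operates on `lanes` pairs at once."""
    result = []
    instructions = 0
    for start in range(0, len(x), lanes):
        xs = x[start:start + lanes]
        ys = y[start:start + lanes]
        result.extend(a + b for a, b in zip(xs, ys))  # same operation, many data
        instructions += 1
    return result, instructions

x = list(range(16))
y = [10] * 16
scalar_result, scalar_count = sisd_add(x, y)
vector_result, vector_count = simd_add(x, y)
```

Both versions produce the same result, but in this model the SIMD version issues one instruction for every four elements, which is the data parallelism described above.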
Self Assessment Questions
5. SIMD stands for _______________ .
6. Flynn classified computing architectures into SISD, MISD, SIMD and ____________ .
7. SIMD is known as ________________ because its basic unit of
organisation is the vector.
8. Superscalar SISD machines use one property of the instruction stream
by the name of __________ .
Activity 1:
Explore the components of a parallel architecture that are used by an
organisation. Also, find out the type of memory used in that architecture.
10.4 Fine-Grained SIMD Architecture
Fine-grained SIMD architectures have their origin in the design scheme proposed
by Steven Unger. They are generally designed for low-level image processing
applications. The following are the features of fine-grained architecture:
• Complexity is minimal and the degree of autonomy is lowest feasible in
each Processing Element (PE).
• Economic constraints are applicable on the maximum number of PEs
provided.
• It is assumed by the programming model that there is equivalence between
the number of PEs and the number of data items, and hides any mismatch
as far as possible.
• The 4-connected nearest neighbour mesh is used as the basic
interconnection method.
• The usual programming language is a simple extension of a sequential language
with parallel-data additions.
Although no system embodies this concept completely,
certain systems come close to it. They include CLIP4, the DAP and
the MPP (all first-generation systems), and the CM1 and the MasPar1 among
later embodiments. Other systems deviate somewhat from
the classical model in the following ways:
• Processing element complexity is increased, either so as to operate on
multi-bit numbers directly or by the addition of dedicated arithmetic units.
• Enhanced connectivity arrangements are superimposed over the standard
mesh. Such arrangements include hypercube and crossbar switches.
One of the most important architectural developments in this class of system
over time has been the incorporation of ever-increasing amounts of local
memory. This reflects the experience of all users that insufficient memory
can have a catastrophic effect on performance, outweighing, in the worst
cases, the advantages of a parallel configuration. The Massively Parallel
Processor (MPP) is perhaps the most recent design to have retained the
simplicity of the fine-grained approach, and it is examined in detail in the
next section.
10.4.1 An example: The massively parallel processor
MPP is the acronym for Massively Parallel Processor. The MPP illustrates the
principles of this group particularly well, although it is not the most
recent example of a fine-grained SIMD system. The overall system design is
illustrated in figure 10.8.
Figure 10.8: The MPP Systems
The MPP uses a square array of 128 x 128 active processing elements, chosen
to match the configuration of the anticipated data sets on which the system
was intended to work. The MPP was constructed for (and used by) NASA, with
the obvious intention of processing mainly image data. The size of the array
was simply the biggest that could be achieved at the time, given the
constraints of the then-current technology and the intended processor
design. It resulted in a system constructed from 88 array cards, each of
which supported 24 processor chips (192 processors) together with their
associated local memory.
The array incorporates four additional columns of spare (inactive) processing
elements to provide some fault-tolerance. One of the major system design
considerations in highly parallel systems such as MPP is how to handle the
unavoidable device failures. The number of these is inevitably increased by
the use of custom integrated circuits, in which economic constraints lead to
poor characterisation. The problem is compounded in a data-parallel mesh-
connected array, since failure at some point of the array disrupts the very data
structures on which efficient computations are predicated. The MPP deals with
this problem by allowing a column of processors which contains one faulty
element to be switched out of use, while one of the four spare columns is
added to the edge of the array to maintain its size and shape. Naturally, if a
fault occurs during computation, the sequence of instructions following the last
dump to external memory must be repeated after replacement of the fault-
containing column.
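The column-sparing scheme can be sketched as a mapping from logical to physical columns. The figures are consistent with the card count given earlier: 88 cards x 192 processors = 16,896 processors, which equals 128 rows x 132 columns (128 active plus 4 spare). The function below is a hypothetical illustration that handles a single faulty column; it is not the MPP's actual switching logic.

```c
#define ACTIVE_COLS 128                       /* active columns in the array  */
#define SPARE_COLS    4                       /* spare columns for sparing    */
#define PHYS_COLS   (ACTIVE_COLS + SPARE_COLS)

/* Map a logical column (0..127) to a physical column (0..131).
 * Columns to the left of the faulty one are unchanged; the faulty column
 * is bypassed and every later column shifts one place towards the spares.
 * Pass faulty_col = -1 when no column has failed. */
int physical_column(int logical_col, int faulty_col)
{
    if (faulty_col >= 0 && logical_col >= faulty_col)
        return logical_col + 1;
    return logical_col;
}
```

With a fault in physical column 5, for example, logical column 5 is served by physical column 6 and logical column 127 by physical column 128, so the program still sees a 128 x 128 array.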
The processing elements are linked by a two-dimensional near-neighbour mesh.
This arrangement offers a number of important advantages over other likely
alternatives: data structures are easily maintained during shifting,
engineering is straightforward, bandwidth is high, and there is a close
conceptual match to the formulation of many image processing calculations.
The principal disadvantage of this arrangement is the slow transmission of
data between remote processors in the array. However, this only becomes
significant when comparatively small amounts of data (rather than whole
images) are to be transmitted.
The choice of 4-connectedness rather than 8-connectedness is perhaps
surprising in view of the minimal increase in complexity which the latter
involves, compared to a twofold improvement in performance on some
operations.
A special-purpose staging memory is provided for the conversion of data
formats. All highly parallel computers have problems with data input and
output, and in those which employ single-bit processors the problems are
compounded. The problem is that external source data is usually formatted as
a single string of integers. If such data is fed into a two-dimensional array
in any simple manner, a considerable amount of time is wasted before useful
processing can start, because of the mismatched format of the data.
The MPP incorporated two solutions to this problem. The first was a separate
data input/output register. The second was the staging memory, which allowed
conversion between bit-plane and integer-string formats. Used together, these
two solutions allowed the processor array to operate continuously, and so
deliver maximum throughput.
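The conversion between integer-string and bit-plane formats can be sketched in software as follows. This only illustrates the two data layouts; the function names, the 8-bit pixel width and the flat `planes` buffer are assumptions, and the sketch says nothing about how the MPP staging memory was actually built.

```c
#include <stddef.h>
#include <stdint.h>

#define BITS_PER_PIXEL 8   /* assumed pixel width */

/* Integer-string to bit-plane: planes[b * n + i] holds bit b of pixel i,
 * so plane b is the one-bit "image" that single-bit PEs would work on. */
void to_bit_planes(const uint8_t *pixels, size_t n, uint8_t *planes)
{
    for (size_t b = 0; b < BITS_PER_PIXEL; b++)
        for (size_t i = 0; i < n; i++)
            planes[b * n + i] = (uint8_t)((pixels[i] >> b) & 1u);
}

/* Bit-plane back to integer-string: reassemble each pixel from its bits. */
void from_bit_planes(const uint8_t *planes, size_t n, uint8_t *pixels)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t value = 0;
        for (size_t b = 0; b < BITS_PER_PIXEL; b++)
            value |= (uint8_t)(planes[b * n + i] << b);
        pixels[i] = value;
    }
}
```

The caller supplies a `planes` buffer of `BITS_PER_PIXEL * n` bytes. Converting to bit planes and back returns the original pixels, which is the property a format-conversion stage has to preserve.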
10.4.2 Programming and applications
The MPP system was commissioned by NASA principally for the analysis of
Landsat images (satellite imagery of Earth). This meant that, initially, most
applications on the system were in the area of image processing, although the
machine eventually proved to be of wider applicability. NASA also used the
MPP system for other applications, such as the one described below (see
figure 10.9).
Figure 10.9: MPP integrated Circuits
Stereo image analysis: The stereo analysis algorithm on the MPP was
designed to work out elevations from synthetic aperture radar images obtained
at different viewing angles during a Shuttle mission. By means of an
appropriate geometric model, elevations can be worked out from the differing
locations of corresponding pixels in a pair of images acquired at different
incidence angles, which form a pseudo stereo pair. The main difficulties
observed in the matching algorithm are:
• The brightness levels are different in corresponding areas of the two
images.
• Images have areas of low contrast and high noise.
• There are local distortions which differ from image to image.
We can overcome the first two difficulties by the use of normalised correlation
functions (a standard image processing technique) but the third arises due to
the different viewing angles. The MPP algorithm operates as follows:
• For each pixel in one of the images (the reference image) a local
neighbourhood area is defined. This is correlated with the similar area
surrounding each of the candidate match pixels in the second image.
• The measure applied is the normalised mean and variance cross
correlation function. The candidate yielding the highest correlation is
considered to be the best match, and the locations of the pixels in the two
images are compared to produce the disparity value at that point of the
reference image.
• The algorithm is iterative. It begins at low resolution, that is, with large
areas of correlation around each of a few pixels. When the first pass is
complete, the test image is geometrically warped according to the disparity
map.
• The process is then repeated at a higher resolution (usually reducing the
correlation area, and increasing the number of computed matches, by a
factor of two), a new disparity map is calculated and a new warping
applied, and so on.
• The procedure is continued either for a predetermined number of passes
or until some quality criterion is exceeded.
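The matching step of the algorithm above can be sketched as follows. This is a simplified, sequential illustration and not the MPP implementation: windows are passed as flat arrays, the function names are hypothetical, and the score returned is the signed square of the normalised cross-correlation (chosen here to avoid a square root; it ranks candidates in the same order as the correlation itself).

```c
#include <stddef.h>

/* Signed square of the normalised cross-correlation of two windows.
 * Subtracting the means and dividing by the variances makes the score
 * insensitive to brightness and contrast differences between images.
 * The result lies in [-1, 1]; 0 is returned if either window is flat. */
double signed_sq_ncc(const double *a, const double *b, size_t n)
{
    double mean_a = 0.0, mean_b = 0.0;
    for (size_t i = 0; i < n; i++) {
        mean_a += a[i];
        mean_b += b[i];
    }
    mean_a /= (double)n;
    mean_b /= (double)n;

    double cov = 0.0, var_a = 0.0, var_b = 0.0;
    for (size_t i = 0; i < n; i++) {
        double da = a[i] - mean_a;
        double db = b[i] - mean_b;
        cov   += da * db;
        var_a += da * da;
        var_b += db * db;
    }
    if (var_a == 0.0 || var_b == 0.0)
        return 0.0;
    double sq = (cov * cov) / (var_a * var_b);
    return cov < 0.0 ? -sq : sq;
}

/* Return the index of the candidate window that best matches `ref`.
 * Candidate k occupies candidates[k * n] .. candidates[k * n + n - 1]. */
size_t best_match(const double *ref, const double *candidates,
                  size_t num_candidates, size_t n)
{
    size_t best = 0;
    double best_score = signed_sq_ncc(ref, candidates, n);
    for (size_t k = 1; k < num_candidates; k++) {
        double score = signed_sq_ncc(ref, candidates + k * n, n);
        if (score > best_score) {
            best_score = score;
            best = k;
        }
    }
    return best;
}
```

The disparity at a reference pixel would then be the offset between that pixel and the position of the winning candidate. On the MPP, the correlation for every reference pixel would be evaluated in parallel across the array.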
Self Assessment Questions
9. MPP is the acronym for ___________ .
10. All highly parallel computers have problems concerned with
____________ .
10.5 Coarse-Grained SIMD Architecture
There are several technical difficulties that arise in completely fulfilling
the fine-grained SIMD ideal of one processor per data element. It may
therefore be better to begin with a coarse-grained approach and so develop a
more rational architecture. A number of parallel computer manufacturers,
including nCUBE and Thinking Machines Inc., have adopted this outlook.
Coarse-grained data-parallel architectures are often developed by
manufacturers who are more familiar with the mainstream of computer design
than with the application-specific architecture field. They frequently grow
out of MIMD programmes whose designers have discovered the complexities of
that approach and seek to mitigate them. The consequence of these roots is
systems which can employ a number of different paradigms, including MIMD,
multiple-SIMD and what is often called single program multiple data (SPMD),
in which each processor executes its own program, but all the programs are
the same and so remain in lock-step. Such systems are frequently used in this
data-parallel mode, and it is therefore reasonable to include them within the
SIMD paradigm. Naturally, when they are used in a different mode, their
operation has to be analysed in a different way. Coarse-grained SIMD systems
of this type embody the following concepts:
• Each PE is of high complexity, comparable to that of a typical
microprocessor.
• The PE is usually constructed from commercial devices rather than
incorporating a custom circuit.
• There is a (relatively) small number of PEs, on the order of a few
hundreds or thousands.
• Every PE is provided with ample local memory.
• The interconnection method is likely to be one of lower diameter and lower
bandwidth than the simple two-dimensional mesh. Networks such as the
tree, the crossbar switch and the hypercube can be utilised.
• Provision is often made for huge amounts of relatively high-speed, high-
bandwidth backup storage, often using an array of hard disks.
• The programming model assumes that some form of data mapping and
remapping will be necessary, whatever the application.
• The application field is likely to be high-speed scientific computing.
Systems of this type have a number of advantages over fine-grained SIMD
systems: the ability to take maximum advantage of the latest processor
technology, the ability to perform high-precision computations with no
performance penalty, and easier mapping to a variety of different data types,
which the smaller number of processors and improved connectivity permit.
The software required for such systems is both an advantage and a
disadvantage. The advantage lies in its closer similarity to normal
programming; the disadvantage lies in the less natural programming for some
applications. Coarse-grained systems also offer greater variety in their
designs, because each component of the design is less constrained than in a
fine-grained system. The example given below is, therefore, less specifically
representative of its class than was the MPP machine considered earlier.
10.5.1 An example: the CM5
The Connection Machine family marketed by Thinking Machines Inc. has been
one of the most commercially successful examples of niche marketing in the
computing field in recent years (one other which springs to mind is the CRAY
family). Although the first of the family, CM1, was fine-grained in concept, the
latest, CM5, is definitely coarse-grained. The first factor in its design is the
processing element, illustrated in figure 10.10. The important components of
the design include:
• internal 64-bit organisation;
• a 40MHz SPARC microprocessor;
• separate data and control network interfaces;
• up to four floating-point vector processors with separate data paths to
memory;
• 32 Mbytes of memory.
Figure 10.10: CM5 Processing Element
Taken together, these components give a peak double-precision floating-point
rate of 128 Million Floating-Point Operations Per Second (MFLOPS) and a
memory bandwidth of 512 Mbyte/sec. Achieved performance rates for some
specific operations are given in Table 10.1.
Table 10.1: Performance of the CM5 node
Function                  Performance (MFLOPS)
Matrix multiply           64
Matrix-vector multiply    100
Linpack benchmark         50
8k-point FFT              90
Peak rate                 128
There are three aspects to the system design which are of major importance.
The first is the data interconnection network, shown in figure 10.11, which is
designated by the designers a fat tree network. It is based upon the quadtree,
augmented to reduce the likelihood of blocking within the network.
Figure 10.11: CM5 Fat-Tree Connectivity Structure
Thus, at the lowest level, within what is designated an interior node of four
processors, there are at least two independent direct routes between any pair
of processors. Utilising the next level of tree, there are at least four partly
independent routes between a pair of processors. This increase in the number
of potential routes is maintained for increasing numbers of processors by
utilising higher levels of the tree structure.
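How far up the tree a message has to travel can be sketched for a plain 4-ary tree. The numbering of processors from left to right and the function name are assumptions made for this illustration; the sketch counts tree levels only and does not model the extra links that make the CM5 network a fat tree.

```c
/* Number of tree levels a message must climb before processors p and q
 * share an ancestor, in a 4-ary tree whose leaves (processors) are
 * numbered left to right starting from 0 (an assumed numbering). */
unsigned levels_to_common_ancestor(unsigned p, unsigned q)
{
    unsigned level = 0;
    while (p != q) {
        p /= 4;   /* move both endpoints up to their parent node */
        q /= 4;
        level++;
    }
    return level;
}
```

Processors 0 and 3 meet after one level, inside the same node of four, while processors 0 and 4 need two levels. With 16384 = 4^7 processors, the most distant pair meets after seven levels.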
Although this structure provides a potentially much higher bandwidth than the
ordinary quadtree, like any complex system, achieving the highest performance
depends critically on effective management of the resource.
The second aspect of system design which is of major importance is the
method of control of the processing elements. Since each of these
incorporates a complete microprocessor, the system can be used in fully
asynchronous MIMD mode. Similarly, if all processors execute the same
program, the system can operate in the SPMD mode. In fact, the designers
suggest that an intermediate method is possible, in which processing elements
act independently for part of the time, but are frequently resynchronised
globally. This technique corresponds to the implementation of SIMD with
algorithmic processor autonomy.
The final system design aspect of major importance is the data I/O method.
The design of the CM5 system seeks to overcome the problem of improving
(and therefore variable) disk access speeds by allowing any desired number
of system nodes to be allocated as disk nodes with attached backup storage.
This permits the amount and bandwidth of I/O arrangements to be tailored to
the specific system requirements. Overall, one of the main design aims
pursued for the CM5 system was scalability. This means not only that the
number of nodes in the system can vary between (in the limits) one and 16384
processors, but also that system parameters such as peak computing rate,
memory bandwidth and I/O bandwidth all increase automatically in proportion
to the number of processing elements. This is shown in Table 10.2.
Table 10.2: CM5 System Parameters
Number of processors           32      1024    16384
Number of data paths           128     4096    65536
Peak speed (GFLOPS)            4       128     2048
Memory (Gbyte)                 1       32      512
Memory bandwidth (Gbyte/s)     16      512     8192
I/O bandwidth (Gbyte/s)        0.32    10      160
Synchronisation time (µs)      1.5     3.0     4.0
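The linear scaling can be checked by multiplying the per-node figures quoted earlier (128 MFLOPS, 32 Mbyte, 512 Mbyte/s, four vector units) by the number of nodes. Read this way, the peak speed row comes out in GFLOPS: 32 x 128 MFLOPS = 4096 MFLOPS, or about 4 GFLOPS. The sketch below is a hypothetical helper; it assumes 1 G = 1024 M, which matches the rounding used in the table, and one data path per vector unit.

```c
/* Per-node figures quoted in the text for one CM5 processing element. */
#define NODE_PEAK_MFLOPS     128.0
#define NODE_MEMORY_MBYTE     32.0
#define NODE_MEM_BW_MBYTE_S  512.0
#define NODE_DATA_PATHS        4L   /* assumed: one per vector unit */

struct cm5_params {
    long   data_paths;
    double peak_gflops;
    double memory_gbyte;
    double mem_bw_gbyte_s;
};

/* Scale the per-node figures linearly with the node count.
 * 1 G = 1024 M is assumed throughout. */
struct cm5_params cm5_scale(long nodes)
{
    struct cm5_params p;
    p.data_paths     = nodes * NODE_DATA_PATHS;
    p.peak_gflops    = (double)nodes * NODE_PEAK_MFLOPS / 1024.0;
    p.memory_gbyte   = (double)nodes * NODE_MEMORY_MBYTE / 1024.0;
    p.mem_bw_gbyte_s = (double)nodes * NODE_MEM_BW_MBYTE_S / 1024.0;
    return p;
}
```

Calling `cm5_scale(32)` gives 128 data paths, 4 GFLOPS, 1 Gbyte and 16 Gbyte/s, and `cm5_scale(16384)` gives 65536 data paths, 2048 GFLOPS, 512 Gbyte and 8192 Gbyte/s. The synchronisation time is the one parameter in the table that does not scale linearly, so it is left out of the sketch.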
10.5.2 Programming and applications
A substantial CM5 machine represents a very large investment. It is therefore
usually viewed as a multi-user system, in which resources (that is, partial
systems), once allocated, are fully protected and independent. This is
ensured by a UNIX-based time-sharing operating system and priority-based job
queuing. Under this operating system, data-parallel versions of C and Fortran
are provided, together with a variety of packages for data visualisation and
scientific computation.
Self Assessment Questions
11. Two parallel computer manufacturers that have adopted the coarse-grained
approach are ____________________________ .
12. SPMD is the acronym for ___________________ .
13. ____________________ is the first of the Connection Machine family.
14. The latest system in the Connection Machine family is ________________ .
Activity 2
Visit an organisation and find out the difficulties that are faced by the
computer designers in implementing and operating the fine-grained and
coarse-grained SIMD architectures.
10.6 Summary
Let us recapitulate the important concepts discussed in this unit:
• Parallel processing can be achieved by dividing the data among different
units, which are processed simultaneously, with timing and sequencing
governed by the control unit so as to obtain the result in the minimum
amount of time.
• Parallel processing is an integral part of everyday life.
• The evolution of serial computers may finally be reaching its peak due to
the limitations imposed on the design by its physical implementation and
inherent bottlenecks.
• The core elements of parallel processing are CPUs. The essential computing
process is the execution of a sequence of instructions on a set of data.
• The term single instruction implies that all processing units execute the
same instruction at any given clock cycle. On the other hand, the term
multiple data implies that each processing unit can operate on a different
data element.
• SIMD instructions handle floating-point real numbers and also provide
important speedups in algorithms. A vector processor lines up a whole row
of the scalars, all of the same type, and operates on them as a unit.
• Modern, superscalar SISD machines exploit a property of the instruction
stream called instruction-level parallelism (ILP).
• Fine-grained SIMD architectures are based on the Steven Unger design
scheme. They are generally designed for low-level image processing
applications.
• MPP is the acronym for Massively Parallel Processor. The MPP exemplifies
the principles of this group particularly well, though it is not the most
recent example of a fine-grained SIMD system.
• The array incorporates four additional columns of spare (inactive)
processing elements to provide some fault-tolerance. One of the major
system design considerations in highly parallel systems such as MPP is
how to handle the unavoidable device failures.
• There are several technical difficulties that arise in completely fulfilling the
fine-grained SIMD ideal of one processor per data element.
10.7 Glossary
• CM1: Connection Machine 1
• CM5: Connection Machine 5
• ILP: Instruction-level Parallelism
• MIMD: Multiple Instruction Multiple Data
• MISD: Multiple Instruction Single Data
• MPP: Massively Parallel Processor
• SIMD: Single Instruction Multiple Data
• SISD: Single Instruction Single Data
• SPMD: Single Program Multiple Data
10.8 Terminal Questions
1. What do you understand by Parallel Processing? Also, explain Serial
Processor and True Parallel Processor.
2. Explain the hardware Architecture of Parallel Processing.
3. Describe the Fine-Grained SIMD Architecture. Give a suitable example.
4. Illustrate with an example the concept of Coarse-Grained SIMD
Architecture.
5. Explain the Connection Machine and describe the CRAY family.
10.9 Answers
Self Assessment Questions
1. Instructions
2. Parallel Computer
3. True
4. Simulated or Virtual
5. Single Instruction, Multiple Data
6. SISD, SIMD, MISD, and MIMD
7. Vector Processing
8. Instruction-Level Parallelism
9. Massively Parallel Processor
10. Input and Output of Data
11. nCUBE and Thinking Machines Inc.
12. Single Program Multiple Data
13. CM1
14. CM5
Terminal Questions
1. Parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem. Refer Section 10.2.
2. The core elements of parallel processing are CPUs. The essential
computing process is the execution of a sequence of instructions on a set
of data. Refer Section 10.3.
3. Fine-grained SIMD architectures are based on the Steven Unger design
scheme. They are generally designed for low-level image processing
applications. Refer Section 10.4.
4. There are several technical difficulties that arise in completely
fulfilling the fine-grained SIMD ideal of one processor per data element.
Thus, it is better to begin with the coarse-grained approach and so
develop a more rational architecture. Refer Section 10.5.
5. The Connection Machine family marketed by Thinking Machines Inc. has
been one of the most commercially successful examples of niche marketing
in the computing field in recent years (one other which springs to mind
is the CRAY family). Refer Section 10.5.
Unit 11 Vector Architecture and MIMD Architecture
Structure:
11.1 Introduction
Objectives
11.2 Vectorisation
11.3 Pipelining
11.4 MIMD Architectural Concepts
Multiprocessor
Multi-computer
11.5 Problems of Scalable Computers
11.6 Main Design Issues of Scalable MIMD Architecture
11.7 Summary
11.8 Glossary
11.9 Terminal Questions
11.10 Answers
11.1 Introduction
In the previous unit, you were introduced to data parallel architecture, in
which you studied the SIMD part. You learned about SIMD architecture and its
various aspects, such as the SIMD design space, fine-grained SIMD
architecture and coarse-grained SIMD architecture. In this unit we will
progress a step further to explain the MIMD architecture. Although we have
covered vector architecture in a prior unit, we will throw some light on it
here as well, so that the concept of MIMD can be understood in a better way.
According to the famous computer architect Jim Smith, the most efficient way
to execute a vectorisable application is on a vector processor. Vector
architectures gather groups of data elements distributed in memory and place
them in linear, sequential register files. Operations are then performed on
the data in the register files, and the results are dispersed again to
memory. MIMD architectures, on the other hand, are of great importance and
may be used in numerous application areas such as CAD (Computer Aided
Design), CAM (Computer Aided Manufacturing), modelling and simulation.
In this unit, we are going to study different features of vector architecture
and MIMD architecture, such as pipelining, MIMD architectural concepts, the
problems of scalable computers and the main design issues of scalable MIMD
architecture.
Objectives:
After studying this unit, you should be able to:
• recall the concept of vector architecture
• discuss the concept of pipelining
• describe MIMD Architectural concepts
• differentiate between multiprocessor and multicomputer
• interpret the problems of scalable computers
• solve the problems of scalable computers
• recognise the main design issues of scalable MIMD architecture
11.2 Vectorisation
Vector machines are designed to operate at the level of vectors. Now, you
will study the operation of vectors. Suppose there are two vectors, A and B,
each having 64 components. The number of components in a vector is called the
vector size, so here the vector size is n = 64. Vectors A and B are shown
below:
\[ A = (a_0, a_1, \ldots, a_{n-1}), \qquad B = (b_0, b_1, \ldots, b_{n-1}) \]
Now we want to add these two vectors and keep the result in another vector C.
The rule for adding vectors is to add the corresponding components, as shown
in the equation below:
\[ C = A + B = (a_0 + b_0,\; a_1 + b_1,\; \ldots,\; a_{n-1} + b_{n-1}) \]
We can also perform the addition by using a loop, as in high-level languages.
The loop iterates n times, where n is the vector size. In the C language, the
code can be written as:
for (i = 0; i < n; i++)
    C[i] = A[i] + B[i];
This for loop iterates n times and adds one pair of components per iteration,
so each addition is a scalar operation. Vector operations, by contrast, are
defined by vector processor instructions; for example, only one vector
instruction is required to add vectors A and B. Basically, a vector
instruction specifies four fields: one operation and three registers.
VOP Vd Vs1 Vs2
which performs
Vd = Vs1 VOP Vs2
Here, VOP is the vector operation which is performed on registers Vs1 and
Vs2 and result is stored at Vd.
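The effect of a single vector instruction of the form VOP Vd Vs1 Vs2 can be emulated in software as follows. This is a minimal sketch: the type names `vreg` and `vop`, the function `vexec` and the three operations shown are assumptions made for illustration, and a real vector unit would pipeline the element operations instead of looping over them.

```c
#include <stddef.h>

#define VLEN 64   /* elements per vector register, as in the 64-component example */

typedef struct { double e[VLEN]; } vreg;        /* hypothetical vector register */
typedef enum { VADD, VSUB, VMUL } vop;          /* hypothetical operation codes */

/* Emulate "VOP Vd Vs1 Vs2": Vd = Vs1 VOP Vs2, element by element. */
void vexec(vop op, vreg *vd, const vreg *vs1, const vreg *vs2)
{
    for (size_t i = 0; i < VLEN; i++) {
        switch (op) {
        case VADD: vd->e[i] = vs1->e[i] + vs2->e[i]; break;
        case VSUB: vd->e[i] = vs1->e[i] - vs2->e[i]; break;
        case VMUL: vd->e[i] = vs1->e[i] * vs2->e[i]; break;
        }
    }
}
```

A call such as `vexec(VADD, &c, &a, &b)` corresponds to one vector add instruction: the loop of n scalar additions is hidden inside the instruction and does not appear in the program.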
Architecture
As discussed in unit 9, a vector machine contains both a scalar unit and a
vector unit. The scalar unit has the same structural design as a conventional
processor and works on scalars. Similarly, the vector unit works on vectors.
Vector architecture has seen advancements such as the move from CISC to RISC
designs and from the memory-memory architecture to the vector-register
architecture.
In the beginning, vector machines employed the memory-memory architecture,
in which all vector operations receive their input operands from memory and
store their results in memory. The first vector machine, the CDC Star 100,
used this architecture.
The vector-register architecture resembles the RISC architecture and is used
in numerous vector machines, such as those from Hitachi and NEC. The vector
registers hold the vectors on which vector operations are performed, and the
results of the operations are also stored in vector registers. As in a RISC
processor, dedicated load and store instructions move vectors between memory
and the vector registers.
Figure 11.1 shows a vector processor architecture based on the Cray 1
system. There are five major components in this architecture:
• vector registers
• vector load/store unit
• vector functional units
• scalar unit
• main memory
Figure 11.1: A Typical Vector Processor Architecture (Based on Cray 1)
Vector Registers: Vector registers hold the input and result vectors. The
Cray 1 and many other vector processors have 8 vector registers, each
containing 64 elements of 64 bits. Some processors let the register space be
reconfigured. For example, the Fujitsu VP 200 processor allows its 8K
elements of vector register space to be divided into a programmable set of 8
to 256 vector registers. With 8 vector registers, each register holds 1024
elements; with 256 registers, each holds 32 elements.
The vector registers in figure 11.1 have one write port and two read ports,
so that vector operations on different vector registers can overlap.
Scalar Registers: Scalar registers supply the scalar inputs to vector
operations. For example, a scalar register holds the constant when every
element of a vector is multiplied by a constant, as in
B = 5 * X + Y
Here, 5 is a constant stored in a scalar register, and X and Y are vectors
held in two different vector registers. The scalar registers are also used
for the address calculations of the vector load/store unit.
Vector Load/Store Unit: The vector load/store unit moves data between memory
and the vector registers. It can overlap read and write operations on memory,
and it masks the high latency associated with main memory access. It performs
two operations:
1. Load Vector Operation: This operation moves a vector from memory to a
vector register.
2. Store Vector Operation: This operation moves a vector from a vector
register to memory.
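The scalar-vector expression B = 5 * X + Y from the discussion of scalar registers, together with the load and store operations, can be sketched as follows. The local arrays stand in for vector registers and the function name `scale_add` is hypothetical. Processing a long vector in pieces of at most `VLEN` elements (often called strip-mining) is an assumption added here so that the sketch works for any length n.

```c
#include <stddef.h>

#define VLEN 64   /* elements per vector register */

/* Hypothetical model of B = s * X + Y on a vector-register machine:
 * load X and Y from memory into vector registers, combine them with the
 * scalar s held in a scalar register, then store the result to memory. */
void scale_add(double s, const double *x, const double *y, double *b, size_t n)
{
    double vx[VLEN], vy[VLEN], vb[VLEN];   /* stand-ins for vector registers */

    for (size_t start = 0; start < n; start += VLEN) {
        size_t len = (n - start < VLEN) ? n - start : VLEN;

        for (size_t i = 0; i < len; i++) vx[i] = x[start + i];     /* vector load  */
        for (size_t i = 0; i < len; i++) vy[i] = y[start + i];     /* vector load  */
        for (size_t i = 0; i < len; i++) vb[i] = s * vx[i] + vy[i]; /* vector ops  */
        for (size_t i = 0; i < len; i++) b[start + i] = vb[i];     /* vector store */
    }
}
```

Each inner loop corresponds to one vector instruction. Because the load/store unit is separate from the functional units, a real machine could overlap the loads for the next piece with the arithmetic on the current one.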
Vector Functional Units: A vector processor has several vector functional
units, which provide:
• integer operations
• floating point operations
• logical operations
As shown in figure 11.1, the Cray 1 has six functional units. The NEC SX/2
has sixteen functional units: four shift units, four integer add/logical
units, four FP add units and four FP multiply/divide units.
Memory: The memory unit of a vector processor differs from the memory unit
used in ordinary processors. It permits pipelined data transfer to and from
memory. Interleaved memory is used to support this pipelined data transfer.
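Interleaving can be sketched as a mapping from addresses to memory banks. The sketch assumes low-order interleaving on word addresses with a hypothetical bank count of 8; neither detail is given in the text. Consecutive addresses fall in different banks, so a stream of accesses can be overlapped.

```c
#define NUM_BANKS 8   /* hypothetical number of memory banks */

/* Low-order interleaving: consecutive word addresses go to consecutive banks. */
unsigned bank_of(unsigned long addr)
{
    return (unsigned)(addr % NUM_BANKS);
}

/* Position of the word inside its bank. */
unsigned long offset_in_bank(unsigned long addr)
{
    return addr / NUM_BANKS;
}

/* Count the distinct banks used by NUM_BANKS accesses that start at
 * `base` and advance by `stride` words each time. */
unsigned banks_touched(unsigned long base, unsigned long stride)
{
    int seen[NUM_BANKS] = {0};
    unsigned count = 0;
    for (unsigned i = 0; i < NUM_BANKS; i++) {
        unsigned b = bank_of(base + i * stride);
        if (!seen[b]) {
            seen[b] = 1;
            count++;
        }
    }
    return count;
}
```

A unit-stride vector load spreads its accesses over all 8 banks, whereas a stride of 8 sends every access to the same bank and loses the benefit of interleaving.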
Self Assessment Questions
1. The first vector machine was ____________ .
2. _______ operations get the scalar inputs present in scalar registers.
11.3 Pipelining
We have discussed this concept in Units 4 and 5, but we need to recap it in
order to get a better idea of the next sections.
What is Pipelining?
Pipelining is an implementation technique by which the execution of multiple
instructions can be overlapped. In other words, it is a method which breaks
down a sequential process into several sub-operations, each of which is
executed in its own dedicated segment concurrently with the others. The main
advantage of pipelining is that it increases the instruction throughput, that
is, the number of instructions completed per unit time. Thus, a program runs
faster. In pipelining, several computations can run in distinct segments
simultaneously.
A register is associated with every segment in the pipeline to isolate the
segments from one another. Thus, each segment can operate on distinct data
simultaneously. Pipelining is also called virtual parallelism, as it provides
an appearance of parallelism only at the instruction level.
In pipelining, the CPU executes each instruction in a series of small common
steps:
1. Instruction fetching
2. Instruction decoding
3. Operand address calculation and loading
4. Instruction execution
5. Storing the result of the execution
6. Write back
The CPU while executing a sequence of instructions can pipeline these
common steps. However, in a non-pipelined CPU, instructions are executed
in strict sequence following the steps mentioned above.
To understand pipelining, let us discuss how an instruction flows through the
data path in a five-segment pipeline. Consider a pipeline with five processing
units, where each unit is assumed to take 1 cycle to finish its execution as
described in the following steps:
a) Instruction fetch cycle: In the first step, the instruction whose
address is held in the PC register is fetched from memory into the
Instruction Register (IR).
b) Instruction decode/register fetch cycle: The fetched instruction is
decoded and the source registers are read into two temporary registers.
Decoding and reading of registers are done in parallel.
c) Effective address calculation cycle: In this cycle, the addresses of the
operands are calculated and the effective addresses are placed into the
ALU output register.
d) Memory access completion cycle: In this cycle, the address calculated
during the prior cycle is used to access memory. For load and store
instructions, data either returns from memory and is placed in the Load
Memory Data (LMD) register, or is written into memory. For a branch
instruction, the PC is replaced with the branch destination address held
in the ALU output register.
e) Write-back cycle: In the last cycle, the result is written into the
register file.
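The gain in throughput from the five-segment pipeline above can be estimated with a short calculation. The sketch assumes an idealised pipeline in which every segment takes one cycle and no stalls occur; the function names are hypothetical. The first instruction needs k cycles to pass through k segments, and each later instruction completes one cycle after its predecessor.

```c
/* Cycles needed to run n instructions on an ideal k-segment pipeline:
 * k cycles for the first instruction, then one more for each of the rest. */
unsigned long pipelined_cycles(unsigned long k, unsigned long n)
{
    return (n == 0) ? 0 : k + n - 1;
}

/* Cycles needed without pipelining: each instruction takes all k steps
 * before the next one starts. */
unsigned long sequential_cycles(unsigned long k, unsigned long n)
{
    return k * n;
}

/* Speedup of the pipelined over the non-pipelined execution (n > 0). */
double pipeline_speedup(unsigned long k, unsigned long n)
{
    return (double)sequential_cycles(k, n) / (double)pipelined_cycles(k, n);
}
```

For k = 5 and n = 100, the non-pipelined CPU needs 500 cycles and the pipelined one 104, a speedup of about 4.8. As n grows, the speedup approaches the number of segments, k.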
Pipelines are of two types: linear and non-linear. A linear pipeline performs
only one pre-defined fixed function at specific times, in a forward direction
from one stage to the next. On the other hand, a dynamic pipeline which
allows feed-forward and feedback connections in addition to the streamline
connections is called a non-linear pipeline.
An instruction pipeline operates on a stream of instructions by decomposing
the instruction cycle into its three phases and overlapping them.
Superpipelining is an approach that uses more and finer-grained pipeline
stages in order to have more instructions in the pipeline at once. As RISC
instructions are simpler than those used in CISC processors, they are more
conducive to pipelining.
Self Assessment Questions
3. ___________ specifies the count of instructions completed per unit
time.
4. Pipelining is also called _______________ as it provides an essence
of parallelism only at the instruction level.
5. Linear pipelines perform only one pre-defined fixed function at specific
times in a forward direction. (True/False)
11.4 MIMD Architectural Concepts
Computers with multiple processors that are capable of executing vector
arithmetic operations using multiple instruction streams and multiple data
streams are called Multiple Instruction streams Multiple Data stream (MIMD)
computers. All multiprocessing computers are MIMD computers. The
framework of an MIMD computer is shown in figure 11.2.
Figure 11.2: The Framework of an MIMD Computer
MIMD machines consist of multiple independent processors which operate as
components of a larger system, for example parallel processors,
multiprocessors and multi-computers. There are two forms of MIMD machines:
• multiprocessors (shared-memory machines)
• multi-computers (message-passing machines)
11.4.1 Multiprocessor
Multiprocessors are systems with multiple CPUs, which are capable of
independently executing different tasks in parallel. They have the following
main features:
• They have either a shared common memory or unshared distributed
memories.
• They also share resources, for example I/O devices, system utilities,
program libraries and databases.
• They run under an integrated operating system that provides interaction
among processors and their programs at the job, task, file and data
element levels.
Types of multiprocessors
There are three types of multiprocessors, distinguished by the way in which
the shared memory is implemented (see figure 11.3). They are:
• UMA (Uniform Memory Access)
• NUMA (Non Uniform Memory Access)
• COMA (Cache Only Memory Access)
Figure 11.3: Shared Memory Multiprocessors
Because the memory of a large multiprocessor is divided into several
modules, large multiprocessors fall into these different categories. Let’s
discuss them in detail.
UMA (Uniform Memory Access): In this category every processor has the same
access time to every memory module. Hence each memory word can be read as
quickly as any other memory word. Where this is not naturally the case, the
fast references are slowed down to match the slow ones, so that programmers
cannot see the difference; this is what uniformity means here. Uniformity
makes performance predictable, which is a significant aid when writing code.
Figure 11.4 shows uniform memory access from the CPU on the left.
Figure 11.4: Uniform and Non-Uniform Memory Access
Modern UMA machines are small, single-bus multiprocessors. In the early
designs of scalable shared memory systems, large UMA machines with a
switching network and hundreds of processors were common. Well-known
examples of those multiprocessors are the NYU Ultracomputer and the
Denelcor HEP. Their designs introduced numerous features that were
important achievements for today’s parallel computer architectures.
Nevertheless, these early systems had no local main memory or cache memory,
both of which have proved important for attaining high performance in
scalable shared memory systems. The UMA approach is therefore not
appropriate for building scalable parallel computers, but it is very good
for constructing small single-bus multiprocessors, such as the Encore
Multimax of Encore Computer Corporation, introduced in the late 1980s, and
the machines of Silicon Graphics Computing Systems, introduced in the late
1990s.
NUMA (Non Uniform Memory Access): NUMA machines are intended to avoid the
memory access disadvantage of Uniform Memory Access machines. The
logically shared memory is physically distributed among all the processing
nodes of a NUMA machine, giving rise to distributed shared memory
architectures. Figure 11.4 shows non-uniform memory access between the left
and right disks.
Although these parallel computers are highly scalable, they are extremely
sensitive to the allocation of data in local memories. Accessing a memory
segment of a remote node is slower than accessing a local memory segment.
The architecture of these machines is similar to that of distributed memory
multi-computers. The major difference lies in the organisation of the
address space. In multiprocessors, a global address space is applied that is
equally visible from each processor. In other words, every CPU can
transparently access all the memory locations. In multi-computers, the
address space is replicated in the local memories of the processing
elements. This difference in the address space of the memory is also
reflected at the software level: programming NUMA machines relies on the
global address space (shared memory) principle, while programming
distributed memory multi-computers relies on the message-passing paradigm.
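The sensitivity of NUMA machines to data allocation can be made concrete
with a small sketch. The Python below is a toy model, not a description of
any real machine: the node memory size and the access costs of 1 and 100
cycles are assumed values, and the function names are invented for this
example. Every word has one global address, and the node that holds the word
is derived from that address.

```python
NODE_MEMORY_WORDS = 1024   # assumed size of each node's local memory, in words
LOCAL_CYCLES = 1           # assumed cost of an access to local memory
REMOTE_CYCLES = 100        # assumed cost of an access to a remote node


def home_node(global_address):
    """Node whose local memory holds this word of the global address space."""
    return global_address // NODE_MEMORY_WORDS


def access_cycles(cpu_node, global_address):
    """Cost of one access: cheap if the word is local, expensive otherwise."""
    if home_node(global_address) == cpu_node:
        return LOCAL_CYCLES
    return REMOTE_CYCLES


def total_cycles(cpu_node, addresses):
    """Total cost for one CPU to touch every address in the sequence."""
    return sum(access_cycles(cpu_node, a) for a in addresses)


if __name__ == "__main__":
    well_placed = range(0, 100)          # 100 words allocated on node 0
    badly_placed = range(2048, 2148)     # the same amount of data on node 2
    print("CPU 0, data on node 0:", total_cycles(0, well_placed))
    print("CPU 0, data on node 2:", total_cycles(0, badly_placed))
```

The program is the same in both runs; only the placement of the data
changes, yet under these assumed costs the second run is a hundred times
slower.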
COMA (Cache Only Memory Access): COMA machines are also non-uniform, but
in a different way. They avoid the effects of the static memory allocation
of NUMA and Cache Coherent Non-Uniform Memory Architecture (CC-NUMA)
machines. This is done in two ways:
• including large caches as node memories
• excluding main memory blocks from the local memory of nodes
Only cache memory is present in these architectures. Main memory exists
neither in the form of the distributed memory of NUMA and CC-NUMA machines
nor in the form of the central shared memory of UMA machines. Just as
virtual memory removed the need to handle memory addresses explicitly, COMA
removes the need for static allocation of data to local memories:
allocation is driven by demand. Under the cache coherence scheme, data is
always attracted to the local (cache) memory where it is required.
In COMA machines, the same cache coherence schemes can be applied as in
other shared memory systems. The difference is that in COMA machines these
schemes must also manage replacement. COMA machines are scalable parallel
architectures, so only schemes that support large-scale parallel systems can
be applied, for example directory schemes and hierarchical cache coherent
schemes. Two representative COMA architectures are the KSR1 (Kendall Square
Research high performance computer) and the DDM (Data Diffusion Machine).
11.4.2 Multi-computer
A multi-computer consists of numerous von Neumann computers that are
connected by an interconnection network. Every computer on the network
executes its own programs, accessing its local memory and sending and
receiving messages over the network. Typically, both memory and I/O are
distributed among the processors. So, each individual processor-memory-I/O
module in a multi-computer forms a node and is essentially a separate,
stand-alone, autonomous computer. A multi-computer is actually a group of
MIMD computers with physically distributed memory, as shown in figure 11.5.
Figure 11.5: The Framework of a Multi-computer
To support a larger number of processors, the memory here is distributed
among the CPUs. This yields cost-effective higher bandwidth, as most of the
accesses made by each processor are to its local memory.
Because a CPU in a multi-computer cannot directly access remote memory,
these machines are also known as NORMA (No Remote Memory Access) machines.
There are two types of multi-computers:
• MPPs (Massively Parallel Processors), which contain many processors
connected by a high-speed interconnection network. MPPs are very costly
supercomputers, such as the Cray T3E and the IBM SP/2.
• Regular PCs or workstations, possibly rack mounted, connected by a
commercial off-the-shelf interconnection network.
Basically there is not much difference between the two categories, except
that the network used in an MPP is far more expensive than the network used
with regular PCs or workstations. These self-assembled machines have several
names, for example NOW (Network of Workstations) and COW (Cluster of
Workstations).
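The two programming styles, shared memory for multiprocessors and message
passing for multi-computers, can be contrasted in a short sketch. The Python
below is only an analogy and makes several assumptions: threads stand in for
processors or nodes, a lock-protected variable stands in for the shared
memory, and a queue stands in for the interconnection network. All function
names are invented for this example.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Multiprocessor style: every worker updates one shared variable."""
    total = {"value": 0}           # stands in for a word of shared memory
    lock = threading.Lock()        # updates to shared data must be synchronised

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(chunks):
    """Multi-computer style: workers keep data private and send messages."""
    network = queue.Queue()        # stands in for the interconnection network

    def worker(chunk):
        network.put(sum(chunk))    # "send" the partial result as a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # "Receive" exactly one message per worker and combine them.
    return sum(network.get() for _ in chunks)


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_memory_sum(data))     # 21
    print(message_passing_sum(data))   # 21
```

Both versions compute the same result. The difference is where the
coordination happens: in the first, workers touch common data and must
synchronise; in the second, no data is shared and all cooperation is
expressed as explicit messages.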
Self Assessment Questions
6. All multiprocessing computers are _____________ computers.
7. In UMA machines, each memory word cannot be read as quickly as any other
memory word. (True/False).
8. NUMA stands for ____________________ .
Activity 1:
Prepare a collage depicting two columns, one for multiprocessor while the
other for multi-computer and paste diagrams, notes, pictures etc of the
various machines found under each heading.
11.5 Problems of Scalable Computers
There are two fundamental problems to be solved in any scalable computer
system:
1. Tolerate and hide latency of remote loads.
2. Tolerate and hide idling due to synchronisation among parallel processes.
Remote loads are unavoidable in scalable parallel systems which use some
form of distributed memory. Accessing local memory usually requires only
one clock cycle, while access to a remote memory cell can take two orders of
magnitude longer.
If a processor issuing such a remote load operation had to wait for the
operation to complete without doing any useful work, the remote load would
significantly slow down the computation. Since the rate of load
instructions is high in typical programs, the latency problem would
eliminate all the potential benefits of parallel activities. A typical case
is shown in figure 11.6, where P0 has to load two values, A and B, from two
remote memory blocks, M1 and Mn, in order to evaluate the expression A+B.
Figure 11.6: The Remote Load Problem
The pointers to A and B, rA and rB, are stored in the local memory of P0.
Access to A and B is realised by the "rload rA" and "rload rB" instructions,
which must travel through the interconnection network in order to fetch A
and B.
The situation is even worse if the values of A and B are not yet available
in M1 and Mn because they will be generated by some other process which is
executed later. In this case, where idling occurs due to synchronisation
among parallel processes, the original process on P0 has to wait for an
unpredictable time, resulting in unpredictable latency.
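The cost of waiting for remote loads can be estimated with simple
arithmetic. The sketch below is illustrative only: the function names are
invented here, and the sample figures (20% of instructions are loads, 10% of
loads are remote, a remote load takes 100 cycles) are assumptions, not
measurements. It assumes that the processor stalls for the full latency of
every remote load.

```python
def effective_cpi(base_cpi, load_fraction, remote_fraction, remote_latency):
    """Average cycles per instruction when each remote load stalls the CPU.

    base_cpi        -- cycles per instruction with all data local
    load_fraction   -- fraction of instructions that are loads
    remote_fraction -- fraction of loads that go to remote memory
    remote_latency  -- stall cycles caused by one remote load
    """
    stall_per_instruction = load_fraction * remote_fraction * remote_latency
    return base_cpi + stall_per_instruction


def slowdown(base_cpi, load_fraction, remote_fraction, remote_latency):
    """How many times slower the program runs because of remote loads."""
    cpi = effective_cpi(base_cpi, load_fraction, remote_fraction, remote_latency)
    return cpi / base_cpi


if __name__ == "__main__":
    # Assumed workload: 20% loads, 10% of them remote, 100-cycle remote latency.
    print(round(slowdown(1.0, 0.2, 0.1, 100), 2))
```

With these assumed figures only 2% of all instructions are remote loads, yet
the program takes about three times as long, which is why the latency must
be reduced or hidden.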
Solutions to the problems
In order to solve the above-mentioned problems, several hardware/software
solutions have been proposed and applied in various parallel computers.
They are as follows:
1. Application of cache memory
2. Pre-fetching
3. Introduction of threads and a fast context switching mechanism among
threads
4. Using non-blocking writes
These methods are used to reduce or at least hide the latency. We will now
discuss them.
1. Application of cache memory: Data replication is the first latency
reduction or hiding method. If several copies of the data are kept in
different locations, then access from those locations can be faster. One
such replication method is caching, in which a copy of the data is kept
close to the location where it is used, in addition to the location where
it belongs.
Another strategy is to keep several peer copies of equal status, unlike
the asymmetric primary/secondary relationship used in caching. Whatever
form the copies take, the major issues are when, where and by whom the
data blocks are placed. The solutions range from dynamic placement on
demand by hardware to placement at load time following compiler
directives.
2. Pre-fetching: This is the next method for reducing or hiding latency. The
data is fetched before it is required, so that the fetch overlaps with
regular execution and the data is already in place when it is needed.
Pre-fetching may be automatic or under program control. An example of
automatic pre-fetching is a cache that loads not only the requested word
but the entire line, because the following words are likely to be required
soon.
Pre-fetching can also be controlled explicitly. When the compiler
determines that data will be needed, it can insert an explicit instruction
to fetch that data. The instruction must be issued far enough in advance
for the data to arrive in time. To use this technique, the compiler must
have complete knowledge of the system and its timings, as well as control
over where the data is placed.
3. Multithreading: Multithreading is another method for reducing or hiding
latency. Most modern computers support multi-programming, in which several
processes can run at the same time. If switching between processes is made
fast enough, for example by giving each process its own memory map and
hardware registers, then when one process blocks while waiting for remote
data, the hardware can quickly switch to another process that is able to
continue. In the limiting case, the processor executes the first
instruction from thread 1, the second instruction from thread 2 and so on.
In this way the processor can be kept busy even when the individual
threads face lengthy latencies. In fact, some systems automatically switch
between the processes after every instruction to hide long latencies.
4. Non-blocking writes: The last method for reducing or hiding latency is
non-blocking writes. Normally, when a STORE instruction is carried out,
the CPU waits until the instruction completes before continuing. With
non-blocking writes, the memory operation is started but the program
continues executing without waiting.
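How multithreading hides latency can be seen in a toy simulation. The
Python below is a sketch under stated assumptions, not a model of any real
processor: every thread computes for a fixed number of cycles and then waits
a fixed number of cycles for a remote load, switching between threads costs
nothing, and the function name and sample figures are invented for this
example.

```python
def utilisation(n_threads, run_cycles, latency_cycles, total_cycles=10_000):
    """Fraction of cycles in which the processor does useful work.

    Each thread computes for run_cycles, then issues a remote load and is
    blocked for latency_cycles. Whenever the running thread blocks, the
    processor switches (at no cost, by assumption) to a ready thread.
    """
    ready_at = [0] * n_threads     # cycle at which each thread may run again
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [t for t in range(n_threads) if ready_at[t] <= cycle]
        if not runnable:
            cycle = min(ready_at)  # idle until the first remote load returns
            continue
        thread = runnable[0]
        burst = min(run_cycles, total_cycles - cycle)
        busy += burst
        cycle += burst
        ready_at[thread] = cycle + latency_cycles  # thread waits for its load
    return busy / total_cycles


if __name__ == "__main__":
    # Assumed workload: 10 cycles of work between 90-cycle remote loads.
    for n in (1, 5, 10):
        print(n, "thread(s):", utilisation(n, 10, 90))
```

With a single thread the processor is busy only one cycle in ten. With ten
threads, another thread is always ready when one blocks, so under these
assumptions the 90-cycle latency is completely hidden.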
Activity 2:
Visit a library and read books on computer architecture to find out more
ways of resolving the problems of scalable computers.
Self Assessment Questions
9. ________ are unavoidable in scalable parallel systems which use
some form of distributed memory.
10. Pre-fetching can never be controlled explicitly (True/ False).
11.6 Main Design Issues of Scalable MIMD Architecture
The main design issues in scalable parallel computers are as follows:
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design
Let’s discuss them in detail:
1. Processor design: The current generation of commodity processors
contains several built-in parallel architecture features such as
pipelining and parallel instruction issue logic. They also directly
support the building of small and mid-size multiprocessor systems by
providing atomic storage access, pre-fetching, cache coherency, message
passing, etc. However, they cannot tolerate remote memory loads and idling
due to synchronisation, which are the fundamental problems of scalable
parallel systems. To solve these problems a new approach to processor
design is needed. Multithreaded architectures offer a promising solution
for the near future.
2. Interconnection network design: Interconnection network design was also
a key problem in the data-parallel architectures, since they too aimed at
massively parallel systems.
Here, the design issues to be reconsidered are those that are relevant
when commodity microprocessors are to be connected by the network. The
central design issue in distributed memory multi-computers is the
selection of the interconnection network and the hardware support for
message passing through the network.
3. Memory system design: Memory design is the crucial topic in shared
memory multiprocessors. In these parallel systems the maintenance of a
logically shared memory plays a central role. Early multiprocessors
applied physically shared memory, which becomes a bottleneck in scalable
parallel computers. Recent generations of multiprocessors employ a
distributed shared memory supported by a distributed cache system. The
maintenance of cache coherency is a nontrivial problem which requires
careful hardware/software design.
4. I/O system design: In scalable parallel computers one of the main
problems is handling I/O devices efficiently. The problem is particularly
serious when large volumes of data must be moved between I/O devices and
remote processors. The main question is how to avoid disturbing the work
of the internal computational processors. The problem of I/O system design
appears in every class of MIMD systems.
Self Assessment Questions
11. _____________ was a key problem in the data-parallel architectures
since they aimed at massively parallel systems.
12. Early multiprocessors applied _____________ which becomes a
bottleneck in scalable parallel computers.
11.7 Summary
• A vector machine consists of a scalar unit and a vector unit. The scalar
unit works on scalars and has architecture similar to that in the traditional
processors. The vector unit is responsible for performing vector
operations.
• An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. This pipeline technique
splits up the sequential process of an instruction cycle into sub-processes
that operate concurrently in separate segments.
• Computers with multiple processors that are capable of executing vector
arithmetic operations using multiple instruction streams and multiple data
streams are called Multiple Instruction streams Multiple Data stream
(MIMD) computers.
• MIMD is divided into two categories multiprocessors (shared-memory
machines) and multi-computers (message-passing machines).
• There are two fundamental problems to be solved in any scalable
computer system: 1. tolerate and hide latency of remote loads. 2. tolerate
and hide idling due to synchronisation among parallel processes.
• The main design issues in scalable parallel computers are :
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design
11.8 Glossary
• CC-NUMA machine: Cache-Coherent Non-Uniform Memory Access
machine.
• COMA machine: Cache Only Memory Access machine.
• COW: Cluster of Workstations.
• DDM: Data Diffusion Machine.
• LMD: Load Memory Data.
• MPPs: Massively Parallel Processors.
• Multi-computer: It contains numerous von Neumann computers that are
connected by an interconnection network.
• Multiprocessor: Systems with multiple CPUs, which are capable of
independently executing different tasks in parallel.
• NORMA: No Remote Memory Access.
• NOW: Network of Workstations.
• NUMA machine: Non Uniform Memory Access machine.
• Register: It is associated with each segment in the pipeline to provide
isolation between each segment.
• UMA machine: Uniform Memory Access machine.
11.9 Terminal Questions
1. Define vectorisation.
2. List the five components of vector-register machine architecture and write
a brief note on each.
3. What do you understand by pipelining? State steps in which instructions
are executed in pipelining.
4. Differentiate between multiprocessor and multi-computer.
5. Write short notes on:
A) UMA
B) NUMA
C) COMA
6. Describe the various problems of scalable computers. Explain the ways to
resolve them.
7. What are the main design issues of scalable MIMD architecture?
11.10 Answers
Self Assessment Questions
1. CDC Star 100
2. Vector
3. Instruction throughput
4. Virtual parallelism
5. True
6. MIMD
7. False
8. Non Uniform Memory Access
9. Remote loads
10. False
11. Interconnection network design
12. Physically shared memory
Terminal Questions
1. A vector may refer to a type of one dimensional array. Vectorisation is
collecting the group of data elements distributed in memory and after that
placing them in linear sequential register files. Refer Section 11.2.
2. The five components are vector registers, scalar registers, vector
functional units, vector load/store unit, and main memory. Refer Section
11.2.
3. An implementation technique by which the execution of multiple
instructions can be overlapped is called pipelining. In pipelining, the CPU
executes each instruction in a series of small common steps. Refer
Section 11.3.
4. Multiprocessors are systems with multiple CPUs, which are capable of
independently executing different tasks in parallel. A multi-computer
contains numerous von Neumann computers that are connected by an
interconnection network. Refer Section 11.4 for more details.
5. A) UMA: In this category every processor and memory module has similar
access time. Refer Section 11.4.1.
B) NUMA: NUMA machines are intended for avoiding the memory access
disadvantage of Uniform Memory Access machines. Refer Section 11.4.1.
C) COMA: COMA machines are also non-uniform, but in a different way. They
avoid the effects of static memory allocation of NUMA and Cache
Coherent Non-Uniform Memory Architecture (CC-NUMA) machines. Refer
Section 11.4.1.
6. There are two fundamental problems to be solved in any scalable computer
system:
1. Tolerate and hide latency of remote loads.
2. Tolerate and hide idling due to synchronisation among parallel
processes.
In order to solve the above-mentioned problems several possible
hardware/software solutions were proposed and applied in various parallel
computers. Refer Section 11.5.
7. The main design issues in scalable parallel computers are:
1. Processor design
2. Interconnection network design
3. Memory system design
4. I/O system design. Refer Section 11.6
Unit 12 Storage Systems
Structure:
12.1 Introduction
Objectives
12.2 Storage System
12.3 Types of Storage Devices
Magnetic storage
Optical storage
12.4 Connecting I/O devices to CPU/Memory
Input-output vs. memory bus
Isolated versus memory-mapped I/O
Example of I/O interface
12.5 Reliability, Availability and Dependability of Storage System
12.6 RAID
Mirroring (RAID 1)
Bit-Interleaved parity (RAID 3)
Block-Interleaved distributed parity (RAID 5)
12.7 I/O Performance Measures
12.8 Summary
12.9 Glossary
12.10 Terminal Questions
12.11 Answers
12.1 Introduction
In the previous unit, you studied the concept of vectorisation and pipelining.
Also, you studied the MIMD architectural concepts, problems of scalable
computers. We also learnt the main design issues of scalable MIMD
architecture.
A computer must have a way to get information from the outside world and to
communicate results back to it. Programs and data must be entered into
computer memory in order to be processed, and the results of calculations
must be recorded or displayed for the user. To use a computer efficiently,
numerous programs and data are prepared beforehand and transferred to a
storage medium such as a disk. The information on the disk is then
transferred rapidly into the memory of the computer. The results produced by
programs are likewise transferred to high-speed storage, for example disks,
and later transferred to an output device.
In this unit, you will study storage systems and various topics such as
different types of storage systems, connecting I/O devices to CPU/memory,
and the availability, dependability and reliability of the storage system.
We will also explain the concept of RAID and the I/O performance measures.
Objectives:
After studying this unit, you should be able to:
• define storage system
• describe various types of storage devices
• describe the process of connecting I/O device to CPU/memory
• discuss the reliability, availability and dependability of storage system
• explain the concept of RAID
• discuss I/O performance measures.
12.2 Storage System
In describing a computer system, a distinction is frequently made between
computer organisation and computer architecture. By computer architecture,
we mean those system attributes which are visible to a developer. By
computer organisation, we mean the operational units and the connections
between them that realise the specification of the architecture.
Each time a PC is shut down, the contents of the PC’s random-access memory
(RAM) are lost. That is because RAM is electronic and requires a constant
source of power to retain its contents. Likewise, whenever a program ends,
the operating system discards the information that the program had placed in
RAM in order to make room for other programs. To keep information from one
computer session to another, one must store the information in a file that
is ultimately stored on disk.
Nowadays, storage systems are essential for computing. Every recognised
computing platform, from handheld devices to huge supercomputers, makes use
of storage systems to store data either temporarily or permanently.
Earlier, punch cards were used, each of which stored only a few bytes of
data. Today’s storage systems can store a very large number of bytes in
relatively little space and with relatively little power.
We have given below some definitions of storage in reference to computers.
• Storage is defined as a device which has the capability to store data.
Disk and tape drives are considered storage devices.
• Storage is the place in a computer where data is held in an
electromagnetic or optical form so that it can be accessed by a processor.
• Storage, also called computer data storage or memory, refers to the
computer components, recording media and devices that retain digital data
used for computing for some period of time.
Self Assessment Questions
1. __________ signifies those system attributes which are visible to a
developer.
2. ______________ signifies the operational units and the connections
between them that realise the specification of the architecture.
3. RAM stands for _______________ .
12.3 Types of Storage Devices
Data is stored on devices or physical components. These devices or physical
components are known as storage media. Now let us define storage devices.
Storage devices are the components of hardware which are used to either
read or write to the storage media. The different types of storage devices
which are utilised to store data as well as information are:
• Magnetic Storage
• Optical Storage
Let’s discuss them in detail as below:
12.3.1 Magnetic storage
The most common and enduring form of removable-storage technology is
magnetic storage. A magnetic storage device has a coating of some magnetic
substance on a rigid or flexible surface. The drive is equipped with a
read-write head assembly that converts data and instructions, represented
as 0s and 1s, into magnetic signals. These magnetic signals are then stored
on the medium. Storage devices such as hard drives, diskette drives and tape
drives use the same kind of medium and similar methods for reading and
writing data.
The surfaces of diskettes and magnetic tapes are coated with a magnetic
material such as iron oxide. Data is stored by polarisation: every particle
in the magnetic material aligns itself in a particular direction. A magnet
has one important advantage: its state can be maintained without a constant
supply of electricity.
The surfaces of disks are coated with numerous tiny iron particles, each of
which acts as a magnet; this is what allows data to be stored on these
disks. The read/write heads of a disk drive contain electromagnets. As a
head passes over the disk, its electromagnet produces a magnetic field that
magnetises the iron particles. The read/write heads record a string of 1s
and 0s by alternating the direction of the current in the electromagnets.
Let’s discuss three types of magnetic storage, viz. disks, hard drives and
tape drives, as below:
Disks: Disks let the user store information from one computer session to the
next. The floppy disk was introduced by IBM. The first floppy disks were
8 inches in diameter. As the disks gradually became smaller, they came to be
called diskettes. The next smaller diskette was 5.25 inches in diameter.
Earlier, 3.5-inch diskettes with 1.44 MB of storage space were the most
popular medium on microcomputers for storing data and programs. You can
easily calculate that as many as 400 pages of a printed book can be stored
on a single floppy disk. Zip disks look similar to floppy disks but are
slightly bigger and thicker.
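The 400-page figure can be checked with a short calculation. The sketch
below uses the standard geometry of a 1.44 MB diskette (2 sides, 80 tracks
per side, 18 sectors per track, 512 bytes per sector). The page size of
about 3,600 characters and the encoding of one byte per character are
assumptions made only for this example, as are the function names.

```python
# Capacity of a "1.44 MB" diskette:
# sides x tracks per side x sectors per track x bytes per sector.
FLOPPY_BYTES = 2 * 80 * 18 * 512


def bytes_per_page(capacity_bytes, pages):
    """Storage available per page if the book fills the whole disk."""
    return capacity_bytes / pages


def pages_that_fit(capacity_bytes, chars_per_page, bytes_per_char=1):
    """Whole pages of plain text that fit in the given capacity."""
    return capacity_bytes // (chars_per_page * bytes_per_char)


if __name__ == "__main__":
    print(FLOPPY_BYTES)                         # 1474560
    print(bytes_per_page(FLOPPY_BYTES, 400))    # about 3686 bytes per page
    print(pages_that_fit(FLOPPY_BYTES, 3600))   # 409
```

So 400 pages fit if a page of plain text needs no more than about 3.6 KB,
which is a plausible size for a printed page without pictures or
formatting.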
Hard drive: A PC’s hard drive is a fast drive that is normally capable of
storing several hundred megabytes of data. To reduce the chance of a
disk-head crash or other disk damage, never move a PC while it is on. Each
disk drive within a PC has a unique one-letter name.
The A: drive normally corresponds to the floppy drive. Drive B: is the
second floppy drive, if there is one. Likewise, the C: drive is normally the
hard disk. If a CD-ROM drive exists, it may be the D: or E: drive, depending
on the system's configuration. When storing a file on disk, the user
specifies the drive name to select the drive on which to record the file’s
contents. Unlike a PC’s RAM, which stores information electronically, disks
record information magnetically, much like recording a television program on
a VHS tape or a song on a cassette tape. Within a disk drive, there is a
small device called a read/write head that can record (magnetise) or read
information on the disk’s surface. Within the drive, a disk spins rapidly
past the read/write head, as shown in figure 12.1. A floppy disk, for
example, may spin past the read/write head at 300 revolutions per minute
(RPM), whereas a hard drive may spin at 3,600 to 10,000 RPM.
Figure 12.1: Disks Head and Magnetic Surface
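The spin speed determines how long the read/write head must wait for the
required data to pass beneath it. The sketch below computes the time for one
rotation and the average rotational latency for the three speeds mentioned
above. It assumes that the required data is equally likely to be anywhere on
the track, so that the average wait is half a rotation; the function names
are invented for this example.

```python
def rotation_time_ms(rpm):
    """Time for one full rotation of the disk, in milliseconds."""
    return 60_000.0 / rpm   # 60,000 ms per minute divided by rotations per minute


def average_rotational_latency_ms(rpm):
    """Average wait for the data to reach the head: half a rotation."""
    return rotation_time_ms(rpm) / 2.0


if __name__ == "__main__":
    for rpm in (300, 3_600, 10_000):
        latency = average_rotational_latency_ms(rpm)
        print(f"{rpm:>6} RPM: average rotational latency {latency:.2f} ms")
```

Under this assumption a floppy disk at 300 RPM waits 100 ms on average,
while a hard drive at 10,000 RPM waits only 3 ms.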
To better understand how the drive records information on disk, examine a
floppy disk. In the middle of the back of the floppy disk, a small metal spindle
opening can be seen. When someone inserts a floppy disk into a drive, the
drive uses this spindle opening to spin the disk.
The drive then opens the disk’s metal shutter, as shown in figure 12.2, to
access the disk’s surface. By gently sliding the shutter open, one can see
the disk media. Do not, however, touch the surface; doing so may damage the
disk and the information it contains.
Figure 12.2: Cross Section of Floppy Disk
Because disks record information magnetically on their surface, the
information does not require constant electricity, as RAM does. However,
keep disks away from devices such as phones and televisions, and from static
electricity, all of which may produce a magnetic flux that changes the
information recorded on the disk.
A PC will normally have at least two disk drives: a high-capacity, fast
hard drive and a floppy-disk drive that lets the user insert and remove
disks. Normally, the hard drive resides within the PC’s system unit.
Tape drive: A tape drive reads and writes data on the surface of a tape, in
the same way as an audio cassette recorder. The only difference is that a
tape drive records digital data. Tape storage is generally used for data that
is not needed frequently, such as backup copies of the hard disk. Because a
tape is a long strip of magnetic material, a tape drive must write data
serially. The direct access offered by media such as disks is therefore
faster than the serial access of a tape drive.
To access a particular item of data on a tape, the drive must scan through
all the data that precedes it, including data that is not required. This
results in a slow access time. Access time depends on the speed of the drive,
the position of the data on the tape and the length of the tape.
12.3.2 Optical storage
The most widely used type of optical storage medium is the compact disk
(CD). Compact disks are used in CD-R, CD-RW, CD-ROM, DVD-ROM and Photo CD
systems. Nowadays, systems with DVD-ROM drives are preferred to standard
CD-ROM units. Optical storage devices store data on a reflective surface and
read it with a beam of laser light. The thin beam is directed and focused by
means of lenses, mirrors and prisms. Because all the light in a laser beam
has the same wavelength, the beam can be focused sharply.
CD-ROM: It stands for compact disk read only memory. To read data from a
CD-ROM, a laser beam is directed at the surface of a spinning disk. The areas
that reflect the light back are read as 1s, and the areas that scatter the
light and do not reflect it back are read as 0s. This is shown in figure 12.3.
Data on this device is stored in a single long spiral track that starts near
the centre of the disk and ends near its outer edge.
Figure 12.3: Working of Optical Storage
A standard CD can store 650 MB of data or about 74 minutes of audio.
DVD-ROM: It stands for digital video (or versatile) disk read only memory. It
is a high-density medium that can store a complete movie on a single disk.
High storage capacity is achieved by storing data on both sides of the disk.
The latest versions of DVD also contain multiple layers of data tracks: the
laser beam first reads from the first layer, then moves to the second layer,
and so on.
Photo CD, CD-R, CD-RW: With a CD-Recordable (CD-R) drive, you can create your
own CD-ROM disks, which any CD-ROM drive can read. Once information has been
written to a CD-R, it cannot be changed.
With a CD-RW (CD-Rewritable) drive, the user can write data to a CD and later
overwrite it, so the data can be revised much as on a floppy disk. Photo CD is
a popular form of CD-R. It is a standard formulated by Kodak for storing
digitised photographic images on CD.
Self Assessment Questions
4. Physical components on which data is stored are called ________ .
5. RPM is the acronym for _____________ .
6. To read data from _____________ , a laser beam is directed on the
surface of a spinning disk.
12.4 Connection of I/O Devices to CPU/Memory
A computer must have a system to get information from the outside world and
must be able to communicate results to the external world. Programs and data
must be entered into the memory of the computer for processing, and the
results of computations must be recorded or displayed for the user. The I/O
interface provides a method for transferring information between internal
memory and input-output devices.
Special communication links are required to interface computer peripherals
with the CPU. A peripheral is an external device that provides input or
output for the computer, for instance a mouse, keyboard or printer. The link
is provided by the I/O bus, which connects the peripheral devices to the
CPU.
In figure 12.4, we have shown the communication link among various
peripherals and processor.
Figure 12.4: Connection of I/O Bus to Input-Output Devices
The Input-Output bus consists of the following:
• data lines
• address lines
• control lines
A general-purpose computer typically uses peripherals such as a printer and
a magnetic disk, and magnetic tape is used for backup storage. Every
peripheral device is connected to the I/O bus through an interface unit.
Each interface decodes the address and control signals received from the I/O
bus, interprets them for its peripheral and provides signals for the
peripheral controller. It also synchronises the data flow and supervises the
transfer between the processor and the peripheral. Every peripheral has its
own controller, which operates that particular electromechanical device. For
instance, a printer controller controls the paper movement, the print timing
and the selection of printing characters. A controller may be housed
separately or be physically integrated with the peripheral. The I/O bus from
the processor is attached to all peripheral interfaces.
To communicate with a particular device, the processor places the address
of that device on the address lines. Every interface attached to the I/O bus
contains an address decoder that monitors the address lines. When an
interface detects its own address, it activates the path between the bus
lines and the device that it controls. All peripherals whose address does
not match the address on the bus are disabled by their interfaces. While the
address is made available on the address lines, the processor also provides
a function code on the control lines.
The selected interface responds to the function code and proceeds to execute
it. The function code is referred to as an I/O command: in essence, it is an
instruction that is executed in the interface and its attached peripheral
unit.
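The address-matching behaviour described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class name Interface, the function place_on_bus, the device names and the addresses 0x10 to 0x12 are all hypothetical and do not describe any real bus.

```python
# Toy model (hypothetical names and addresses): every interface watches the
# address lines and responds only when the bus address equals its own.

class Interface:
    def __init__(self, address, name):
        self.address = address      # address assigned to this interface (assumed)
        self.name = name
        self.received = []          # function codes accepted so far

    def decode(self, bus_address, function_code):
        """Accept the function code only when the address matches."""
        if bus_address != self.address:
            return False            # path between bus and device stays disabled
        self.received.append(function_code)
        return True


def place_on_bus(interfaces, bus_address, function_code):
    """Broadcast address and function code; return names of responding interfaces."""
    return [i.name for i in interfaces if i.decode(bus_address, function_code)]


interfaces = [Interface(0x10, "keyboard"),
              Interface(0x11, "printer"),
              Interface(0x12, "disk")]
responders = place_on_bus(interfaces, 0x11, "control")
print(responders)
```

Only the interface whose address matches accepts the function code; the others leave their devices untouched.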
An interface may receive four kinds of commands:
• Control: A control command is issued to activate the peripheral. The
particular control command issued depends on the peripheral. Every
peripheral receives its own distinct sequence of commands, according to
its mode of operation.
• Status: A status command is used to test various status conditions in
the interface and the peripheral. For instance, the computer may want to
check the status of the peripheral before initiating a transfer. Errors
may also occur during the transfer; these are detected by the interface.
• Data output: A data output command causes the interface to respond by
transferring data from the bus into one of its registers. As an example,
consider a tape unit. The computer starts the tape moving by issuing a
control command. The processor then monitors the status of the tape by
means of a status command. When the tape is in the correct position, the
processor issues the data output command.
• Data input: A data input command causes the interface to obtain a data
item from the peripheral and place it in its buffer register. The processor
checks whether data is available by means of a status command and then
issues the data input command. The interface places the data on the data
lines, where it is accepted by the processor.
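As an illustration of how the control, status and data input commands fit together, here is a minimal Python sketch. The class InputInterface, its ready flag and the function read_all are hypothetical names chosen for this example; a real interface would do this in hardware.

```python
class InputInterface:
    """Toy interface with a one-item buffer register and a 'ready' status flag."""

    def __init__(self, device_items):
        self._device = list(device_items)   # items the peripheral will deliver
        self.buffer = None                  # buffer register of the interface
        self.ready = False                  # status: buffer holds a new item

    def command(self, kind):
        if kind == "control":               # activate the peripheral: fetch one item
            if self._device:
                self.buffer = self._device.pop(0)
                self.ready = True
            return None
        if kind == "status":                # report whether data is available
            return self.ready
        if kind == "data_input":            # put the buffer contents on the data lines
            if not self.ready:
                raise RuntimeError("no data available")
            self.ready = False
            return self.buffer
        raise ValueError(f"unknown command: {kind}")


def read_all(interface, count):
    """Processor side: control, then status check, then data input, per item."""
    items = []
    for _ in range(count):
        interface.command("control")
        if interface.command("status"):
            items.append(interface.command("data_input"))
    return items


received = read_all(InputInterface(["A", "B", "C"]), 3)
print(received)
```

The processor issues the data input command only after the status command reports that the buffer register holds a new item.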
12.4.1 Input-Output vs. memory bus
In addition to communicating with I/O, the processor must communicate with
the memory unit. Like the I/O bus, the memory bus consists of the following:
• data lines
• address lines
• read/write control lines
Computer buses can be used to communicate with memory and I/O in three ways:
• Use two separate buses, one for memory and the other for I/O.
• Use one common bus for both memory and I/O, but with separate control
lines for each.
• Use one common bus for both memory and I/O, with common control lines.
In the first technique, the computer has independent sets of data, address
and control buses, one for accessing memory and the other for I/O. This is
done in computers that have a separate input-output processor (IOP) in
addition to the central processing unit (CPU). The memory communicates with
both the CPU and the IOP through a memory bus. The IOP also communicates
with the input and output devices through a separate I/O bus with its own
data, address and control lines. The IOP thus provides an independent path
for transferring information between external devices and internal memory.
12.4.2 Isolated versus memory-mapped I/O
Information transfer between the CPU and either memory or I/O can be
carried over one common bus. In that case, a memory transfer is distinguished
from an I/O transfer by separate read and write lines. The CPU indicates
whether the address on the address lines is for a memory word or for an
interface register by enabling one of the two possible pairs of read and
write lines.
During an I/O transfer, the I/O read and I/O write control lines are
enabled. During a memory transfer, the memory read and memory write control
lines are enabled. In this configuration, the I/O interface addresses are
isolated from the addresses assigned to memory, and the arrangement is known
as the isolated I/O method of assigning addresses on a common bus. In
memory-mapped I/O, by contrast, the same address space and the same read and
write control lines serve both memory and I/O, so the interface registers of
all peripheral devices are treated as memory locations.
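The difference between the two addressing methods can be sketched in Python as follows. This is an illustrative model only: the class names, the io flag (standing in for the separate I/O read/write control lines) and the boundary IO_BASE = 0xF0 are assumptions made for the example.

```python
class IsolatedBus:
    """Isolated I/O: two address spaces, chosen by which control line is enabled."""

    def __init__(self):
        self.memory = {}
        self.io_registers = {}

    def write(self, address, value, io=False):
        # io=True models the I/O write line; io=False the memory write line
        target = self.io_registers if io else self.memory
        target[address] = value

    def read(self, address, io=False):
        target = self.io_registers if io else self.memory
        return target.get(address, 0)


class MemoryMappedBus:
    """Memory-mapped I/O: one address space; high addresses are interface registers."""

    IO_BASE = 0xF0   # assumed boundary between memory words and interface registers

    def __init__(self):
        self.memory = {}
        self.io_registers = {}

    def _target(self, address):
        return self.io_registers if address >= self.IO_BASE else self.memory

    def write(self, address, value):
        self._target(address)[address] = value

    def read(self, address):
        return self._target(address).get(address, 0)


isolated = IsolatedBus()
isolated.write(0x04, 111)            # memory word at address 0x04
isolated.write(0x04, 222, io=True)   # interface register at the same address
mapped = MemoryMappedBus()
mapped.write(0x04, 111)              # ordinary memory word
mapped.write(0xF4, 222)              # same kind of write reaches an interface register
print(isolated.read(0x04), isolated.read(0x04, io=True))
print(mapped.read(0x04), mapped.read(0xF4))
```

In the isolated model the same address can name both a memory word and an interface register, because the control line tells them apart. In the memory-mapped model the address alone decides, so part of the address range is given up to I/O.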
12.4.3 Example of I/O interface
Figure 12.5 shows an example of an I/O interface. It has two data registers
called ports, a control register, a status register, bus buffers, and timing and
control circuits. The interface communicates with the CPU through the data
bus.
CS   RS1   RS0   Register selected
0    X     X     None: data bus in high-impedance
1    0     0     Port A register
1    0     1     Port B register
1    1     0     Control register
1    1     1     Status register
Figure 12.5: Example of I/O Interface Unit
The chip select and register select inputs determine the address assigned to
the interface. The I/O read and I/O write inputs are two control lines that
specify an input or an output operation, respectively. The four registers
(port A register, port B register, control register and status register)
communicate directly with the I/O device attached to the interface. The
input-output data to and from the device can be transferred through either
port A or port B.
The interface may operate with an output device or with an input device, or
with a device that requires both input and output. If the interface is connected
to a printer, it will only output data, and if it services a character reader, it will
only input data. A magnetic disk unit is used to transfer data in both directions
but not at the same time, so the interface can use bidirectional lines. A
command is passed to the I/O device by sending a word to the appropriate
interface register.
In a system like this, the function code in the I/O bus is not needed because
control is sent to the control register, status information is received from the
status register, and data are transferred to and from ports A and B registers.
Thus the transfer of data, control, and status information is always via the
common data bus.
The distinction between data, control, or status information is determined from
the particular interface register with which the CPU communicates. The control
register gets control information from the CPU. By loading appropriate bits into
the control register, the interface and the I/O device attached to it can be
placed in a variety of operating modes. For example, port A may be defined
as an input port and port B as an output port. A magnetic tape unit may be
instructed to rewind the tape or to start the tape moving in the forward
direction. The bits in the status register are used for status conditions and for
recording errors that may occur during the data transfer. For example, a status
bit may indicate that port-A has received a new data item from the I/O device.
The interface registers communicate with the CPU through the bidirectional
data bus. The address bus selects the interface unit through the chip select
and the two register select inputs. A circuit (usually a decoder) must be
provided externally to detect the address assigned to the interface
registers. This circuit enables the chip select (CS) input when the
interface is selected by the address bus. The two register select inputs RS1
and RS0 are usually connected to the two least significant lines of the
address bus. These two inputs select one of the four registers in the
interface, as specified in the table accompanying the diagram. The content
of the selected register is transferred into the CPU via the data bus when
the I/O read signal is enabled. The CPU transfers binary information into
the selected register via the data bus when the I/O write input is enabled.
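The register selection table of figure 12.5 can be expressed as a short Python sketch. The function names and the base address 0x40 used by the external decoder are assumptions made for illustration; only the CS/RS1/RS0 table itself comes from the text.

```python
# Register selected by (RS1, RS0) when the chip is selected (CS = 1).
REGISTERS = {
    (0, 0): "Port A register",
    (0, 1): "Port B register",
    (1, 0): "Control register",
    (1, 1): "Status register",
}


def select_register(cs, rs1, rs0):
    """Return the selected register, or None when the chip is not selected."""
    if cs == 0:
        return None                  # data bus is in the high-impedance state
    return REGISTERS[(rs1, rs0)]


def decode_address(address, base=0x40):
    """Assumed external decoder: CS is enabled when the upper address bits equal base.

    The two least significant address lines drive RS1 and RS0.
    """
    cs = 1 if (address & ~0b11) == base else 0
    rs1 = (address >> 1) & 1
    rs0 = address & 1
    return select_register(cs, rs1, rs0)


for address in (0x40, 0x41, 0x42, 0x43, 0x50):
    print(hex(address), decode_address(address))
```

With this assumed base, the four consecutive addresses 0x40 to 0x43 reach the four registers, and any other address leaves the interface unselected.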
Self Assessment Questions
7. ________ is used in computers for backup storage.
8. ________ from the processor is attached to all peripheral interfaces.
9. A ____________ is issued to test various status conditions in the
interface and the peripheral.
Activity 1:
Visit an IT organisation and observe the functioning of the I/O interface and
the data lines, control lines, and I/O bus architecture. Also, check whether
the I/O system used is isolated or memory-mapped.
12.5 Reliability, Availability and Dependability of Storage System
Response time and throughput receive considerable attention in processor
design, whereas reliability receives more attention in storage systems than
in processors. The terms reliability, availability and dependability are
often confused with each other. Here is a clearer distinction:
Reliability - Is anything broken?
Availability - Is the system still available to the users?
Dependability - Can the system be trusted?
Adding hardware can therefore improve availability (for example, Error
Correcting Code (ECC) on memory), but it cannot improve reliability (the
DRAM is still broken). Reliability can only be improved by improving
environmental conditions, by building from more reliable components, or by
building with fewer components. Another term, data integrity, refers to
consistent reporting when information is lost because of failure; this is very
important in some applications.
A disk array is one innovation that improves both the availability and the
performance of storage systems. Since the price per megabyte is independent
of disk size, potential throughput can be increased by having many disk
drives and, hence, many disk arms.
Simply spreading data over multiple disks, called striping, automatically
forces accesses to several disks. (Although arrays improve throughput,
latency is not necessarily improved.) The drawback of arrays is that
reliability drops as devices are added: N devices generally have 1/N the
reliability of a single device.
Availability can nevertheless be improved by adding redundant disks, so that
if a single disk fails, the lost information can be reconstructed from the
redundant information. The only danger is a second disk failure between the
time the first disk fails and the time it is replaced (termed the mean time
to repair, or MTTR). Since the mean time to failure (MTTF) of disks is five
or more years, and the MTTR is measured in hours, redundancy can make the
availability of 100 disks much higher than that of a single disk. These
systems have become known by the acronym RAID, which stands for redundant
array of inexpensive disks. We will study this topic in the next section.
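The figures quoted above can be turned into a rough calculation. The Python sketch below is illustrative only. It assumes independent failures, uses the common definition availability = MTTF / (MTTF + MTTR), which the text does not state, and assumes a repair time of 24 hours.

```python
HOURS_PER_YEAR = 24 * 365


def array_mttf(disk_mttf_hours, n_disks):
    """MTTF of a non-redundant array: N devices have 1/N the reliability of one."""
    return disk_mttf_hours / n_disks


def availability(mttf_hours, mttr_hours):
    """Fraction of time in service (assumed definition: MTTF / (MTTF + MTTR))."""
    return mttf_hours / (mttf_hours + mttr_hours)


disk_mttf = 5 * HOURS_PER_YEAR              # five years, as in the text
hours_to_first_failure = array_mttf(disk_mttf, 100)
single_disk_availability = availability(disk_mttf, 24)   # assumed 24-hour repair

print(f"100-disk array without redundancy: a failure about every "
      f"{hours_to_first_failure:.0f} hours")
print(f"single disk availability: {single_disk_availability:.5f}")
```

Without redundancy, some disk in a 100-disk array fails roughly every few weeks, which is why the redundant information and a short repair time matter so much.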
Self Assessment Questions
10. ____________ refers to consistent reporting when information is lost
because of failure.
11. ____________ is an innovation that improves both availability and
performance of storage systems
12.6 RAID
RAID is the acronym for ‘redundant array of inexpensive disks’. There are
several approaches to redundancy, each with different overhead and
performance. The 1987 paper by Patterson, Gibson and Katz introduced the
term RAID and used a numerical classification for these schemes that has
become popular; in fact, the non-redundant disk array is sometimes called
RAID 0. One problem is discovering when a disk fails. Fortunately, magnetic
disks provide information about their own correct operation: extra check
information is recorded in each sector to detect errors within that sector.
As long as at least one sector is transferred and its error-detection
information is checked on reading, the electronics associated with the disk
can discover when a disk fails or loses information.
The levels of RAID are as follows:
12.6.1 Mirroring (RAID 1)
Mirroring, or shadowing, is the traditional solution to disk failure. It
uses twice as many disks. Every write goes simultaneously to two disks, one
non-redundant and one redundant, so that there are always two copies of the
data. If one disk fails, the system goes to its mirror to get the required
information. This technique is the most expensive solution.
12.6.2 Bit-Interleaved parity (RAID 3)
Bit-interleaved parity builds on parity, an error-detection technique in
which an extra “1” or “0” bit is added to each group of data bits so that the
total number of 1 bits is always even (or always odd). In data transmission,
the parity bit is added to each character or byte as it is sent and checked
for accuracy at the receiving end; bit-interleaved parity (BIP) is also used
at the physical layer of high-speed binary transmission to monitor errors.
In a disk array, parity reduces the cost of higher availability to 1/N, where
N is the number of disks in a group. Instead of keeping a complete copy of
the original data, we keep only enough redundant information to restore the
information lost in a failure. Reads and writes go to all disks in the
group, with one extra disk holding the check information in case there is a
failure.
RAID 3 is popular in applications with large data sets, for example
multimedia and some scientific codes. Parity is one such redundancy scheme:
the redundant disk holds the sum of all the data on the other disks. When a
disk fails, the data on all the good disks is subtracted from the parity
disk; what remains is the missing information. The assumption here is that
failures are so rare that taking longer to recover from a failure, in return
for less redundant storage, is a good trade-off. Mirroring can be considered
the special case of one data disk and one parity disk (N = 1). With a single
data disk, parity is obtained simply by duplicating the data, so mirrored
disks have the advantage of simplifying the parity calculation. The
redundancy of N = 1, however, has the highest overhead for increasing disk
availability.
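The parity reconstruction described above can be sketched in a few lines of Python. The sketch assumes that the "sum" is taken modulo 2, in which case adding and subtracting are both the bitwise XOR operation. The function names and the sample block contents are hypothetical.

```python
from functools import reduce


def parity(blocks):
    """Bytewise XOR (sum modulo 2) of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Recover the missing block by XOR-ing the parity with every surviving block."""
    return parity(surviving_blocks + [parity_block])


data = [b"DISK0", b"DISK1", b"DISK2", b"DISK3"]   # one block per data disk
p = parity(data)                                   # contents of the parity disk

lost = 2                                           # pretend the third disk failed
survivors = data[:lost] + data[lost + 1:]
recovered = reconstruct(survivors, p)
print(recovered == data[lost])
```

Because every surviving block cancels itself out of the parity, what remains is exactly the block that was on the failed disk.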
12.6.3 Block-interleaved distributed parity (RAID 5)
This level uses the same ratio of data disks to check disks as RAID 3, but
the data is accessed differently. In the prior organisation, every access
went to all disks. Some applications prefer smaller accesses, which allow
independent accesses to occur in parallel; that is the purpose of this RAID
level. Since the error-detection information in each sector is checked on
reads to see whether the data is correct, such “small reads” to each disk can
occur independently as long as the minimum access is one sector. Writes are
another matter. It would seem that each small write requires all the other
disks to be accessed, in order to read the rest of the information needed to
recalculate the parity. In a group of four data disks and one parity disk,
for example, a “small write” would require reading the other three data
disks, adding the new information, and then writing the new parity to the
parity disk and the new data to the data disk.
The key to reducing this overhead is that parity is simply a sum of
information. By watching which bits change when we write the new
information, we need only change the corresponding bits on the parity disk.
We must read the old data, compare the old data with the new data to see
which bits change, read the old parity, change the corresponding bits, and
then write the new data and the new parity. Thus, the small write involves
four disk accesses to two disks instead of accessing all disks. This
organisation is RAID 4. RAID 4 supports mixtures of large reads, large
writes, small reads and small writes. One drawback of the system is that the
parity disk must be updated on every write, so it is the bottleneck for
sequential writes.
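The small-write shortcut can be checked with a short Python sketch. It assumes XOR parity (sum modulo 2); the helper names and the sample byte values are made up for the example.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data, new_data, old_parity):
    """Compute the new parity from the old data, new data and old parity only."""
    changed_bits = xor_bytes(old_data, new_data)   # which bits change
    return xor_bytes(old_parity, changed_bits)     # flip the same bits in the parity


def full_parity(blocks):
    """Parity recomputed the expensive way, by reading every data block."""
    result = bytes(len(blocks[0]))
    for block in blocks:
        result = xor_bytes(result, block)
    return result


stripe = [b"\x0f\x10", b"\xa5\x5a", b"\x33\xcc", b"\x00\xff"]   # four data disks
old_parity = full_parity(stripe)

new_block = b"\xff\x00"
new_parity = small_write(stripe[1], new_block, old_parity)      # two reads
stripe[1] = new_block                                            # plus two writes

print(new_parity == full_parity(stripe))
```

The shortcut touches only the data disk being written and the parity disk, yet it produces the same parity as recomputing over all four data disks.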
To fix the parity-write bottleneck, the parity information is spread throughout
all the disks so that there is no single bottleneck for writes. This distributed
parity organisation is RAID 5. Figure 12.6 shows how data is distributed in
RAID 4 and RAID 5.
Figure 12.6: Block-interleaved Parity (RAID 4) versus Distributed Block-
interleaved Parity (RAID 5)
As the organisation on the right shows, in RAID 5 the parity associated with
each row of data blocks is no longer restricted to a single disk. This
organisation allows for multiple writes to occur simultaneously as long as the
parity blocks are not located in the same disks. For example, a write to block
8 on the right must also access its parity block P2, thereby occupying the first
and third disks. A second write to block 5 on the right, implying an update to
its parity block P1, accesses the second and fourth disks and thus could occur
at the same time as the prior write to block 8. Thus, RAIDs are playing an
increasing role in storage systems.
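The example of blocks 8 and 5 can be reproduced with a small Python sketch. The layout is an assumption, since the figure is not reproduced here: five disks, four data blocks per row, and the parity block moving one disk to the left on each successive row. This layout was chosen because it agrees with the disk numbers quoted in the text.

```python
N_DISKS = 5   # assumed: four data blocks plus one parity block in every row


def raid5_location(block):
    """Return (data_disk, parity_disk) for a data block, as 1-based disk numbers."""
    row = block // (N_DISKS - 1)
    parity_disk = (N_DISKS - 1 - row) % N_DISKS          # parity rotates row by row
    data_disks = [d for d in range(N_DISKS) if d != parity_disk]
    data_disk = data_disks[block % (N_DISKS - 1)]
    return data_disk + 1, parity_disk + 1


def can_overlap(block_a, block_b):
    """Two small writes can proceed together if they share no disk."""
    return set(raid5_location(block_a)).isdisjoint(raid5_location(block_b))


print("block 8:", raid5_location(8))
print("block 5:", raid5_location(5))
print("writes to 8 and 5 in parallel:", can_overlap(8, 5))
print("writes to 8 and 9 in parallel:", can_overlap(8, 9))
```

Under this assumed layout, blocks 8 and 9 share a parity block and therefore cannot be written at the same time, whereas blocks 8 and 5 use four different disks.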
Self Assessment Questions
12. RAID is the acronym for ____________ .
13. ____________ uses twice as many disks.
12.7 I/O Performance Measures
The two most common measures of I/O performance are diversity and
capacity.
• Diversity: Which I/O devices can connect to the computer system?
• Capacity: How many I/O devices can connect to a computer system?
Other traditional measures of performance are throughput (sometimes called
bandwidth) and response time (sometimes called latency). Figure 12.7 shows
the simple producer-server model. The producer creates tasks to be
performed and places them in a buffer; the server takes tasks from the
first-in-first-out buffer and performs them.
Figure 12.7: Producer-Server Model of Response Time and Through-put
Response time is defined as the time a task takes from the moment it is
placed in the buffer until the server completes it. Throughput, in simple
words, is the average number of tasks completed by the server over a period
of time. To reach the highest possible throughput, the server should never
be idle, and so the buffer should never be empty. Response time, on the
other hand, includes the time spent in the buffer and is therefore minimised
when the buffer is empty.
Improving performance does not always mean improving both response time and
throughput. Throughput is increased by adding more servers, as shown in
figure 12.8: spreading data across two disks instead of one enables tasks to
be performed in parallel. Unfortunately, this does not help response time,
unless the workload is held constant and the time spent in the buffers is
reduced because of the additional resources.
Figure 12.8: Single-Producer Model Extended with another Server and Buffer
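The producer-server model can be explored with a small simulation. The Python sketch below is a simplified illustration: it assumes a single first-in-first-out buffer feeding identical servers with a fixed service time, and the function name simulate and the workloads are invented for the example.

```python
import heapq


def simulate(arrival_times, service_time, n_servers=1):
    """FIFO buffer feeding n identical servers.

    Returns (throughput in tasks per time unit, mean response time).
    """
    free_at = [0.0] * n_servers          # time at which each server next becomes idle
    heapq.heapify(free_at)
    responses = []
    finish = 0.0
    for arrival in arrival_times:        # tasks leave the buffer in arrival order
        start = max(arrival, heapq.heappop(free_at))
        end = start + service_time
        heapq.heappush(free_at, end)
        responses.append(end - arrival)  # time in the buffer plus time in service
        finish = max(finish, end)
    return len(arrival_times) / finish, sum(responses) / len(responses)


burst = [0.0] * 8                        # buffer never empty: servers always busy
sparse = [0.0, 2.0, 4.0, 6.0]            # buffer always empty when a task arrives

for name, workload in (("burst", burst), ("sparse", sparse)):
    for servers in (1, 2):
        throughput, response = simulate(workload, 1.0, servers)
        print(f"{name:6s} servers={servers} "
              f"throughput={throughput:.2f} mean response={response:.2f}")
```

With the burst workload held constant, the second server doubles throughput and also shortens the wait in the buffer. With the sparse workload the buffer is already empty, so the extra server changes neither measure.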
How does the architect balance these conflicting demands? If the computer
is interacting with human beings, figure 12.9 suggests an answer.
[Bar chart: transaction time for four workloads: conventional interactive
workload with 1.0 sec. and with 0.3 sec. system response time, and
high-function graphics workload with 1.0 sec. and with 0.3 sec. system
response time. Each bar is divided into entry time, system response time and
think time.]
Figure 12.9: An Interactive Computer Divided into Entry Time, System
Response Time, and User Think Time
This figure presents the results of two studies of interactive environments: one
keyboard oriented and one graphical. An interaction, or transaction, with a
computer is divided into three parts:
1. Entry time - The time for the user to enter the command. The graphics
system in figure 12.9 took 0.25 seconds on average to enter a command
versus 4.0 seconds for the keyboard system.
2. System response time - The time between when the user enters the
command and the complete response is displayed.
3. Think time - The time from the reception of the response until the user
begins to enter the next command.
The sum of these three parts is called the transaction time. Several studies
report that user productivity is inversely proportional to transaction time;
transactions per hour are a measure of the work completed per hour by the
user.
The results in figure 12.9 show that a reduction in response time actually
decreases transaction time by more than the response time reduction alone:
cutting system response time by 0.7 seconds saves 4.9 seconds (34%) from
the conventional transaction and 2.0 seconds (70%) from the graphics
transaction. This seemingly implausible result is explained by human nature:
people need less time to think when given a faster response.
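The arithmetic behind these figures can be made explicit. The Python sketch below uses only the numbers quoted above; the derived values are approximate, because the percentages in the text are rounded, and the function name breakdown is hypothetical.

```python
RESPONSE_TIME_CUT = 0.7   # seconds, from the text

# (seconds saved, fraction of the original transaction), as quoted in the text
savings = {"conventional": (4.9, 0.34), "graphics": (2.0, 0.70)}


def breakdown(saved, fraction):
    """Return (implied original transaction time, saving beyond the response cut)."""
    original = saved / fraction           # e.g. 4.9 s is 34% of the original time
    extra = saved - RESPONSE_TIME_CUT     # saving that must come from elsewhere
    return round(original, 1), round(extra, 1)


results = {name: breakdown(*values) for name, values in savings.items()}
for name, (original, extra) in results.items():
    print(f"{name}: roughly {original} s per transaction originally, "
          f"{extra} s saved beyond the response time cut")
```

The saving beyond the 0.7 second response time cut is the part that the text attributes to shorter think time.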
Self Assessment Questions
14. _______ is also known as bandwidth.
15. ________________ is sometimes called latency.
Activity 2:
Visit an organisation. Find the level of reliability, availability and dependability
of the system used. Also, measure the I/O performance.
12.8 Summary
Let us recapitulate the important concepts discussed in this unit:
• A computer must have a system to get information from the outside world
and must be able to communicate results to the external world.
• Each time a PC is shut down, the contents of the PC’s random-access
memory (RAM) are lost.
• There are two main categories of storage devices: magnetic storage and
optical storage.
• A tape drive reads and writes data on the surface of a tape in the same
way as an audio cassette; the difference is that a computer tape drive
writes digital data.
• The most widely used type of optical storage medium is the compact disk
(CD), which is used in CD-ROM, DVD-ROM, CD-R, CD-RW and Photo CD
systems.
• Response time and throughput receive considerable attention in processor
design, whereas reliability receives more attention in storage systems
than in processors.
• RAID is the acronym for redundant array of inexpensive disks. There are
several approaches to redundancy that have different overhead and
performance.
• Response time is defined as the time taken by a task since it is placed in
the buffer till it is completed by the server. Throughput is the average
number of tasks completed by the server over a period of time.
12.9 Glossary
• Bus Interface: Communication link between the processor and several
peripherals.
• CD-R: Compact disk recordable
• CD-ROM: Compact disk read only memory
• CD-RW: Compact disk rewritable
• DVD-ROM: Digital video (or versatile) disk read only memory
• Input devices: Computer peripherals used to enter data into the computer.
• Input-Output Interface: This gives a method for transferring information
between internal memory and I/O devices.
• Input-Output Processor (IOP): An external processor that
communicates directly with all I/O devices and has direct memory access
capabilities.
• Output devices: Computer peripherals used to get output from the
computer.
• RAM: Random Access Memory
• RPM: Revolutions per Minute
12.10 Terminal Questions
1. What do you understand by system storage?
2. Explain briefly the various types of storage devices available.
3. Describe the communication link between the processor and several
peripherals.
4. What is the difference between isolated I/O and memory mapped I/O?
What are the advantages and disadvantages of each?
5. Give an example of an I/O interface unit.
6. Define RAID. Also explain the levels of RAID.
12.11 Answers
Self Assessment Questions
1. Computer architecture
2. Computer organisation
3. Random access memory
4. Storage media
5. Revolutions per minute
6. CD-ROM
7. Magnetic tape
8. I/O bus
9. Status command
10. Data integrity
11. Disk array
12. Redundant array of inexpensive disks
13. Mirroring or shadowing
14. Throughput
15. Response time
Terminal Questions
1. To keep information from one computer session to another, one must store
the information in a file that is ultimately stored on disk. This is
called a storage system. Refer to section 12.2.
2. There are two main categories of storage devices: magnetic storage
and optical storage. Refer to section 12.3.
3. Peripherals connected to a computer require special communication links
for interfacing them with the CPU. This is done through the I/O bus, which
connects the peripheral devices to the CPU. Refer to section 12.4.
4. In isolated I/O, memory transfers and I/O transfers use separate read and
write lines, whereas in memory-mapped I/O the interface registers are
treated as memory locations. Refer to section 12.4.2.
5. An example of an I/O interface has two data registers called ports, a control
register, a status register, bus buffers, and timing and control circuits.
Refer to section 12.4.3.
6. RAID is the acronym for redundant array of inexpensive disks. There are
several approaches to redundancy that have different overhead and
performance. Refer to section 12.6.
Unit 13 Scalable, Multithreaded And
Data Flow Architecture
Structure:
13.1 Introduction
Objectives
13.2 Multithreading
What is a thread?
Need of multithreading
Benefits of multithreading system
13.3 Principles of Multithreading
13.4 Scalable and Multithreaded Architecture
13.5 Computational Models
13.6 Von Neumann-based Multithreaded Architectures
Organisation and operation of the Von Neumann architecture
Key features
13.7 Dataflow architecture
Dataflow programming
Dataflow graph
13.8 Hybrid Multithreaded Architecture
13.9 Summary
13.10 Glossary
13.11 Terminal Questions
13.12 Answers
13.1 Introduction
In the previous unit, you studied about storage systems. You covered various
aspects such as types of storage devices, connecting I/O devices to
CPU/memory, reliability, availability and dependability, RAID, I/O performance
measures. Multithreading is a type of multitasking. Prior to Win32, the only
type of multitasking available on Windows was cooperative multitasking, which
did not have the concept of priorities. A multithreading system has the
concept of priorities and is therefore also called background processing or
pre-emptive multitasking.
Dataflow architecture is in direct contrast to the traditional Von Neumann
architecture or control flow architecture. Although dataflow architecture has
not been used in any commercially successful computer hardware, it is very
relevant in many software architectures such as database engine designs and
parallel computing frameworks. A system whose performance improves in
proportion to the hardware capacity added is said to be a scalable system.
In this unit, you will learn about multithreading, dataflow and scalable
architecture and their various aspects such as principles of multithreading,
scalable and multithreaded architecture, computational models, Von
Neumann-based multithreaded architectures, dataflow architecture and
hybrid multithreaded architecture.
Objectives:
After studying this unit, you should be able to:
• define the term multithreading
• recognise the need and benefits of multithreading
• describe the principles of multithreading
• identify scalable and multithreaded architecture
• discuss computational models
• describe Von Neumann-based multithreaded architectures
• explain dataflow architecture
• create hybrid multithreaded architecture
13.2 Multithreading
Multithreading is the capability of a processor to do multiple things at one time.
The Windows operating system uses API (Application Programming Interface)
calls to manipulate threads in multithreaded applications. Before discussing
the concepts of multithreading, let us first understand what a thread is.
13.2.1 What is a thread?
Each process, which is created when an application is run, consists of at least
one thread that contains code. All the code within a thread, when it is active,
is executed consecutively, one instruction after another. In a multithreading
system, many threads belonging to the same process run concurrently. A thread
can be viewed as an independent program counter within a process; the program
counter indicates the location of the instruction that the thread is currently
executing. A thread has the following features:
• A state of thread execution
• The saved thread context when not running.
• A stack tracing the execution path.
• Some space for local variables
Multithreading is supported by almost all modern operating systems such as
Windows XP, Solaris, Linux, and OS/2, while traditional operating systems
such as DOS and some traditional UNIX systems support only single threading.
A Java Virtual Machine (JVM) is also an example of a multithreading system.
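The features of a thread listed above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names `counter`, `worker` and `run_threads` are hypothetical and do not come from this unit. Each thread keeps its own local variable on its own stack, while all threads of the process update one counter in the address space they share.

```python
import threading

counter = {"value": 0}      # lives in the address space shared by all threads
lock = threading.Lock()     # protects the shared counter


def worker(increments):
    local_total = 0         # local variable: private to this thread's stack
    for _ in range(increments):
        local_total += 1
    with lock:              # shared data must be updated under synchronisation
        counter["value"] += local_total


def run_threads(num_threads=4, increments=1000):
    """Start several threads in one process and return the shared total."""
    counter["value"] = 0
    threads = [threading.Thread(target=worker, args=(increments,))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()       # wait until every thread has finished
    return counter["value"]
```

Calling `run_threads(4, 1000)` returns 4000, because each of the four threads adds its private total of 1000 to the shared counter.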
13.2.2 Need of multithreading
Both the single-threaded and the multithreaded process models, shown in
figure 13.1, have their own importance. Here we will discuss the need for
multithreading.
Figure 13.1: Single Threaded and Multithreaded Process Models
Multithreading is needed to create an application that is able to perform more
than one task at once. For example, all GUI (Graphical User Interface)
programs can perform more than one task (such as editing a document as well
as printing it) at a time. A multithreading system can perform the following
tasks:
• Manage inputs from many windows and devices
• Distinguish tasks of varying priority.
• Allow the user interface to remain responsive all the time
• Allocate time to the background tasks
Although these tasks can be performed using more than one process, it is
generally more efficient to use a single multithreaded application because the
system can perform a context switch more quickly for threads than processes.
Moreover, all threads of a process share the same address space and
resources, such as files and pipes.
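The editing-while-printing example can be sketched in Python as follows. This is a simplified illustration, not a real GUI program: the function names and the `log` list are assumptions made for the example. A background thread stands in for the print job while the main thread keeps handling the user's keystrokes, and both threads write to the same list because they share one address space.

```python
import threading


def print_document(pages, log):
    # Background task: stands in for a slow print job.
    for page in range(1, pages + 1):
        log.append(f"printed page {page}")


def edit_and_print(pages=3, keystrokes="abc"):
    log = []                          # shared by both threads
    printer = threading.Thread(target=print_document, args=(pages, log))
    printer.start()                   # printing proceeds in the background
    text = ""
    for key in keystrokes:            # the main thread keeps handling input
        text += key
        log.append(f"typed {key}")
    printer.join()                    # wait for the background job to finish
    return text, log
```

The relative order of the "typed" and "printed" entries in the log can differ from run to run, since the scheduler decides when each thread executes.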
13.2.3 Benefits of multithreading system
A multithreading system provides the following benefits over a system that
uses multiple processes:
• Threads improve communication between different execution traces
because they share the same user address space.
• In an existing process, creating a new thread is much less time-consuming
than creating a brand-new process.
• Terminating a thread also takes less time.
• Switching control between two threads within the same process takes
less time than switching between two processes.
Self Assessment Questions
1. The multithreading system has a concept of priorities and therefore, is
also called __________ or __________ .
2. It takes much more time to create a new thread in an existing process
than to create a brand-new process. (True/False)
13.3 Principles of Multithreading
Several important parameters characterise the performance of
multithreading. They include:
• The number of threads in each processor
• The number of remote reads in each thread
• Context switch mechanism
• Remote memory latency
• Remote memory servicing mechanism
• The number of instructions in a thread
Let’s briefly explain the implications of these issues below.
• The number of active threads: This specifies the level of parallelism. The
level of parallelism can be categorised into computation parallelism and
communication parallelism. Computation parallelism is the 'conventional'
parallelism, while communication parallelism is the means by which threads
communicate with other threads residing in other processors.
• The number of remote reads in each thread determines the number of
thread switches and, consequently, the run length. There is a thread
switch for every remote read, so the number of switches is proportional to
the number of remote reads. Hence, it is desirable that the remote reads
be distributed evenly over the life of a thread. This distribution
determines the thread run length, which is the number of uninterrupted
instructions executed between two consecutive remote reads. The
performance of multithreading is strongly affected by this factor. With a
short run length, it is hard to tolerate the latency because there are not
enough instructions to execute while the remote read is outstanding.
• Thread switch describes how control is transferred from one thread to
another. There are two types of context switches: explicit switching and
implicit switching. In implicit switching, registers are shared among
multiple threads, while in explicit switching they are not. Implicit
switching is literally implicit: from the viewpoint of the registers, there
is essentially no visible switching.
This method, thus, needs little or no switching overhead. However, the
scheduling of registers and threads can be a difficult job. In explicit
switching, threads do not share registers. A single thread uses all the
registers. Therefore, there is no issue as to how registers and threads are
scheduled.
• Communication latency is what multithreading aims to tolerate. It varies
with the technology used to build the machine and with the interconnection
network. The network bandwidth should be comparable with the processor
clock speed. A large disparity between the machine clock speed and the
network bandwidth can be problematic when using multithreading.
• The number of instructions in a thread is known as thread granularity.
Thread granularity can be classified into three categories: fine-grain,
medium-grain, and coarse-grain. Fine-grain threading usually refers to a
thread of a few to tens of instructions. It is basically for instruction-level
multithreading. Medium-grain threading is considered loop-level or
function-level threading, and it consists of hundreds of instructions.
Coarse-grain threading is treated as task-level threading, where each
thread consists of thousands of instructions.
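The interplay of the number of threads, run length, context switch cost and remote memory latency can be made concrete with a simplified analytical model that is commonly used in the literature on multithreaded processors. The model is an assumption introduced here for illustration and is not derived in this unit. All quantities are in clock cycles, and the function and parameter names are hypothetical.

```python
def saturation_point(run_length, latency, switch_cost):
    """Number of threads beyond which the processor is always busy."""
    return latency / (run_length + switch_cost) + 1


def efficiency(num_threads, run_length, latency, switch_cost):
    """Fraction of time the processor spends on useful instructions."""
    if num_threads >= saturation_point(run_length, latency, switch_cost):
        # Saturated: latency is fully hidden, only switch overhead is lost.
        return run_length / (run_length + switch_cost)
    # Below saturation: the processor idles while all threads are waiting.
    return num_threads * run_length / (run_length + switch_cost + latency)
```

With a run length of 10 cycles, a latency of 100 cycles and a switch cost of 2 cycles, a single thread keeps the processor busy only about 9% of the time, whereas ten threads reach the saturated value of about 83%. Under this model a longer run length raises efficiency for the same number of threads, which matches the remark above that a short run length makes latency hard to tolerate.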
Self Assessment Questions
3. The number of switches is proportional to the number of remote reads.
(True/ False)
4. __________ typically refers to a thread of a few to tens of instructions.
Activity 1:
Visit a library and find out more details about the various models of
multithreading like Blocked model, Forking model, Process-pool model and
Multiplexing model.
13.4 Scalable and Multithreaded Architecture
Scalability is the ability to increase the amount of processing that can be
done by adding further resources to a system. It is different from performance:
it does not improve response time but rather sustains it by providing
higher throughput. In other words, performance is the system response time
under a typical load whereas scalability is the ability of a system to increase
that load without degrading response time.
A computer architecture that is designed to work with more than one processor
is called a scalable architecture. Almost all business applications are scalable.
Scalability can be achieved in several ways, such as using more powerful
CPUs or adding extra CPUs.
There are two different modes of scaling: scale up and scale out. Scaling up
is achieved by adding extra resources to a single machine to allow an
application to service more requests. The most common ways to do this are
adding memory (RAM) or using a faster CPU. Scaling out is achieved by
adding servers to a server group so that applications scale by distributing
the processing among multiple computers. An understanding of the bottlenecks
and the applications of each scaling method is required before a particular
method can be used productively.
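One simple way to judge how well a scale-up or scale-out step worked is to compare the growth in throughput with the growth in capacity. The sketch below is an illustration only; the function name and the figures in the usage note are made up and are not taken from this unit.

```python
def scaling_efficiency(base_capacity, base_throughput,
                       new_capacity, new_throughput):
    """Ratio of throughput growth to capacity growth (1.0 = proportional)."""
    capacity_ratio = new_capacity / base_capacity
    throughput_ratio = new_throughput / base_throughput
    return throughput_ratio / capacity_ratio
```

For instance, if a server group grows from 2 to 4 servers and throughput rises from 1000 to 1900 requests per second, the efficiency is about 0.95, which is close to proportional. A value well below 1.0 suggests a bottleneck that the added resources did not relieve.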
Multithreading is the capability of a processor to utilise multiple threads of
execution simultaneously in one application. In simple words, it allows a
program to do two things at once. When an application is run, each of the
processes contains at least one thread. However, many concurrent threads
may belong to one process in a multithreading system. For example, a Java
Virtual Machine (JVM) is a system of one process with multiple threads. Most
recent operating systems, such as Solaris, Linux, Windows 2000 and OS/2,
support multiple processes with multiple threads per process. However, the
traditional operating system, MS-DOS supports a single user process and a
single thread. Some traditional UNIX systems are multiprogramming systems
as they maintain multiple user processes but only one execution path is
allowed for each process.
Self Assessment Questions
5. Almost all business applications are _________________ .
6. When an application is run, each of the processes contains at least one
__________ .
13.5 Computational Models
A mathematical model in computational science that requires extensive
computational resources to study the behaviour of a complex system by
computer simulation is known as a computational model. The system under
study is often a complex non-linear system for which simple, intuitive
analytical solutions are not readily available. Instead of deriving a
mathematical analytical solution to the problem, experimentation with the
model is done by changing the parameters of the system in the computer and
examining the differences in the outcomes of the experiments. These
computational experiments help derive/deduce the theories of operation of the
model. Mathematical language is used to describe a system. Mathematical
models are used not only in the natural sciences and engineering disciplines
but also in the social sciences; physicists, engineers, computer scientists,
and economists use them most extensively. The process of developing a
mathematical model is termed 'mathematical modelling'.
The combination of languages and the computer architecture in a common
foundation or paradigm is called a Computational Model. The concept of
computational model represents a higher level of abstraction than either the
computer architecture or the programming language. There are basically three
types of computational models as follows:
• Von Neumann model
• Dataflow model
• Hybrid multithreaded model
In the following sections we will discuss each one of these models in detail.
Self Assessment Questions
7. The combination of languages and the computer architecture in a
common foundation or paradigm is called ___________ .
8. Computational model uses mathematical language to describe a system
(True/ False)
13.6 Von Neumann-based Multithreaded Architectures
Any discussion of the foundations of computer architecture - how computers
and computer systems are structured, designed, and put into operation -
inevitably refers to the "Von Neumann architecture" as a basis for
comparison. This is because virtually every electronic computer built is
rooted in this architecture.
The Central Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit), is the core of the Von Neumann computer
architecture. The CPU interacts with a memory and an input/output (I/O)
subsystem and executes a stream of instructions. In this architecture, both
data and instructions are stored in the memory system in the same way.
Therefore, the meaning of the memory content is defined entirely by the way
it is interpreted. This is vital, for example, for a compiler that translates a
user-understandable programming language into the instruction stream
understood by the machine. The compiler produces ordinary data as its output.
However, the CPU can then execute these data as instructions. The CPU can
carry out various instructions for moving and modifying data, and for deciding
which instructions to execute next. The collection of instructions is referred
to as the instruction set, and together with the resources needed for their
execution, it is called the Instruction Set Architecture (ISA).
The execution of instructions is driven by a cyclic clock signal. Even though
several sub-steps have to be carried out to execute each instruction, modern
CPU implementation techniques can overlap these steps such that, ideally, one
instruction is completed in each clock cycle.
Clock rates of today's processors are in the range of 2 to 3.4 GHz. At one
instruction per clock cycle, this allows up to about 3.4 billion basic
operations (such as adding two numbers or copying a data item to a storage
location) to be performed each second. With the recent advancements in
technology, CPU speeds have increased rapidly. Consequently, factors such as
the slower I/O operations and the memory system limit the overall speed of a
computer system, since the speed of these components has improved at a slower
rate than CPU technology.
Caches can improve the average speed of memory systems by keeping the most
commonly used data in a fast memory that is close to the processor. Another
factor limiting CPU speed increases is the inherently sequential nature of
Von Neumann instruction execution. Now, through parallel processing
architectures, methods of executing several instructions concurrently are
being developed.
13.6.1 Organisation and operation of the Von Neumann architecture
As mentioned in section 13.6, the core of a computer system with Von
Neumann architecture is the CPU. This element fetches (i.e., reads)
instructions and data from the main memory and coordinates the complete
execution of each instruction. It is usually organised into two separate
subunits: the Arithmetic and Logic Unit (ALU), and the control unit. Figure 13.2
shows the basic components of a Von Neumann model.
Figure 13.2: The Basic Components of a Computer with Von Neumann
Architecture
The ALU combines and transforms data using arithmetic operations, such as
addition, subtraction, multiplication, and division, and logical operations, such
as bit-wise negation, AND, and OR.
The control unit interprets the instructions fetched from the memory and
manages the operation of the whole system. It establishes the sequence in
which instructions are carried out and provides all of the electrical signals
needed to control the operation of the ALU and the interfaces to the other
system components.
The memory is a set of storage cells, each of which can be in one of two
different states. One state signifies a value of “0”, and the other state signifies
a value of “1”. By distinguishing these two logical states, each cell is
capable of storing a single binary digit, or bit, of information. These bit
storage cells are logically organised into words, each of which is b bits wide.
Every word is assigned a unique address in the range [0, ..., N - 1].
The CPU identifies the word that it wants to read or write by storing its
unique address in a special memory address register (MAR). A register
temporarily stores a value within the CPU. The memory responds to
a read request by reading the value stored at the requested address and
transferring it to the CPU via the CPU-memory data bus. The value is then
temporarily stored in the memory buffer register (MBR) (also sometimes
called the memory data register) before it is used by the control unit or ALU.
For a write operation, the CPU stores the value it wants to write into the MBR
and the corresponding address in the MAR. The memory then copies the value
from the MBR into the address pointed to by the MAR.
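The read and write sequences described above can be sketched in Python. This is a toy model under stated assumptions: the class names `Memory` and `CPU` and their attributes are hypothetical, and the data bus is reduced to a plain assignment.

```python
class Memory:
    def __init__(self, num_words):
        self.cells = [0] * num_words      # words at addresses 0 .. N-1


class CPU:
    def __init__(self, memory):
        self.memory = memory
        self.mar = 0                      # memory address register
        self.mbr = 0                      # memory buffer register

    def read(self, address):
        self.mar = address                        # CPU places the address in MAR
        self.mbr = self.memory.cells[self.mar]    # memory returns the word into MBR
        return self.mbr                           # value is now usable by the CPU

    def write(self, address, value):
        self.mar = address                        # where to write
        self.mbr = value                          # what to write
        self.memory.cells[self.mar] = self.mbr    # memory copies MBR to that address
```

After `cpu = CPU(Memory(8))` and `cpu.write(3, 42)`, the call `cpu.read(3)` returns 42 and leaves 3 in the MAR and 42 in the MBR.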
Finally, the input/output (I/O) devices connect the computer system with the
outside world. These devices allow programs and data to be entered into the
system and provide a means for the system to control an output device. Each I/O
port has a unique address to which the CPU can either read or write a
value. From the CPU's point of view, an I/O device is accessed in much the same
way as memory. In fact, in a number of systems the hardware makes it
appear to the CPU that the I/O devices are memory locations. This
configuration, in which the CPU sees no difference between memory and I/O
devices, is referred to as memory-mapped I/O. In this case, no separate
I/O instructions are necessary.
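Memory-mapped I/O can be sketched as a bus that routes one address to a device instead of to RAM. The address `0xFF`, the class name and the behaviour of a load from the port are hypothetical choices made for this illustration. The point is that the same `store` operation serves both memory and the output device.

```python
class MemoryMappedBus:
    OUTPUT_PORT = 0xFF            # hypothetical address assigned to an output device

    def __init__(self, size=256):
        self.ram = [0] * size
        self.output = []          # stands in for the output device

    def store(self, address, value):
        if address == self.OUTPUT_PORT:
            self.output.append(value)     # the device receives the value
        else:
            self.ram[address] = value     # ordinary memory write

    def load(self, address):
        if address == self.OUTPUT_PORT:
            return len(self.output)       # assumed status: values written so far
        return self.ram[address]
```

A program that executes `bus.store(0x10, 7)` and then `bus.store(0xFF, 65)` uses one and the same operation for both, yet the first value ends up in RAM and the second at the device.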
13.6.2 Key features
In their basic organisation, processors with a Von Neumann architecture are
distinguished from simple pre-programmed (or hardwired) controllers by
several key features. First, the same main memory stores both
instructions and data. Consequently, instructions and data are not
distinguished. Likewise, different types of data, such as a floating-point
value, an integer value, or a character code, are not distinguished. The
meaning of a particular bit pattern depends entirely on how the CPU interprets
it. The same value stored at a particular memory location can be interpreted
as an instruction or as data at different times. For example, when a compiler
executes, it reads the source code of a program written in a high-level
language, such as FORTRAN or COBOL, and converts it into a series of
instructions that can be executed by the CPU. The memory stores the output of
the compiler like any other type of data. However, the CPU can then execute
the compiler output simply by interpreting it as instructions. Thus, the same
values stored in memory are treated as data by the compiler, but are then
treated as executable instructions by the CPU. Another consequence of this
principle is that every instruction must specify how it interprets the data
on which it operates. Thus, for example, a Von Neumann architecture will have
one set of arithmetic instructions for operating on integer values and another
set for operating on floating-point values.
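That a bit pattern has no meaning of its own can be shown with Python's standard `struct` module. The sketch packs the number 1.0 into four bytes using the 32-bit IEEE 754 format and then reads the same four bytes back twice, once as a floating-point value and once as an unsigned integer. The function name `interpret` is an assumption for this example.

```python
import struct


def interpret(raw_bytes):
    """Interpret the same four bytes as a float and as an unsigned integer."""
    as_float = struct.unpack("<f", raw_bytes)[0]
    as_integer = struct.unpack("<I", raw_bytes)[0]
    return as_float, as_integer


bit_pattern = struct.pack("<f", 1.0)   # 32-bit IEEE 754 encoding of 1.0
```

`interpret(bit_pattern)` returns `(1.0, 1065353216)`: the identical 32 bits (hexadecimal 3F800000) are the value 1.0 to a floating-point instruction and the value 1065353216 to an integer instruction.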
The second key feature is that memory is accessed by name (i.e., address)
irrespective of the bit pattern stored at each address. Due to this feature,
values stored in memory can be interpreted as addresses, data or instructions.
Therefore, programs can manipulate addresses using the same set of
instructions that the CPU uses to manipulate data. This flexibility in how
values in memory are interpreted allows very complex, dynamically changing
patterns to be generated by the CPU to access any variety of data structure
regardless of the type of value being read or written. Finally, another key
feature of the Von Neumann scheme is that the order in which a program
executes its instructions is sequential, unless that order is explicitly
altered. The program counter (PC), a special register in the CPU, holds the
address of the next instruction in memory to be executed. After each
instruction is executed, the value in the PC is incremented to point to the
next instruction in the sequence. This sequential execution order can be
changed by the program itself using branch instructions, which store a new
value into the PC register.
Alternatively, special hardware can sense some external event, such as
an interrupt, and load a new value into the PC to cause the CPU to
begin executing a new sequence of instructions. Although this concept of
performing one operation at a time greatly simplifies the writing of programs
and the design and implementation of the CPU, it also limits the potential
performance of this architecture.
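The key features discussed in this section (one memory for instructions and data, access by address, and sequential execution under a program counter that a branch can overwrite) can be combined in a toy machine. The instruction encoding (opcode times 100 plus address), the opcode numbers and the function names are invented for this sketch and do not correspond to any real instruction set.

```python
HALT, LOAD, ADD, STORE, JUMP = 0, 1, 2, 3, 4


def run(memory):
    """Fetch-decode-execute loop over a memory holding code and data alike."""
    pc = 0                                   # program counter
    acc = 0                                  # accumulator register
    while True:
        instruction = memory[pc]             # fetch the word the PC points to
        pc += 1                              # PC now points to the next word
        opcode, address = divmod(instruction, 100)   # decode
        if opcode == HALT:
            return memory
        elif opcode == LOAD:
            acc = memory[address]
        elif opcode == ADD:
            acc += memory[address]
        elif opcode == STORE:
            memory[address] = acc
        elif opcode == JUMP:
            pc = address                     # a branch stores a new value in the PC
        else:
            raise ValueError(f"unknown opcode {opcode}")


def example_program():
    # Addresses 0-5 hold instructions, addresses 10-12 hold data.
    return [LOAD * 100 + 10, JUMP * 100 + 3, ADD * 100 + 11,
            ADD * 100 + 11, STORE * 100 + 12, HALT,
            0, 0, 0, 0, 5, 7, 0]
```

Running `run(example_program())` leaves 12 in address 12: the jump at address 1 skips the first ADD, so 7 is added to 5 only once. The words at addresses 0 to 5 are ordinary integers in the list, and only the fetch step treats them as instructions.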
Self Assessment Questions
9. The instruction set together with the resources needed for their execution
is called the _______________.
10. The memory is a collection of storage cells, each of which can be in one
of two different states (True/False).
Activity 2:
Surf the internet to find out details about the Harvard architecture and
compare it with the Von Neumann architecture.
13.7 Dataflow Architecture
In a traditional computer design, the processor executes instructions, which
are stored in memory in particular sequences. In each processor, the
instructions are executed in serial order, which limits speed. There are four
possible ways of executing instructions:
1. Control-flow Method: In this mechanism, an instruction is executed when
the previous one in a defined sequence has been executed. This is the
traditional way.
2. Demand-driven Method: In this mechanism, an instruction is executed
when the results of the instruction are required by another instruction.
3. Pattern-driven Method: In this mechanism, an instruction is executed
when particular data patterns appear.
4. Dataflow Method: In the dataflow method, an instruction is executed when
the operands required become available.
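The difference between executing an operation as soon as it can run and executing it only when its result is demanded can be sketched as follows. This is a software analogy under stated assumptions, not a model of any particular machine, and all function names are hypothetical. Each operation records its name in a log when it actually executes.

```python
def make_operation(name, function, log):
    """Wrap a computation so that each execution is recorded in the log."""
    def operation():
        log.append(name)
        return function()
    return operation


def select_eager(condition, then_op, else_op):
    # Data-driven style: both operations run because their inputs are ready.
    then_value = then_op()
    else_value = else_op()
    return then_value if condition else else_value


def select_on_demand(condition, then_op, else_op):
    # Demand-driven style: an operation runs only if its result is required.
    return then_op() if condition else else_op()
```

Both functions return the same value, but with a true condition `select_eager` executes both operations while `select_on_demand` executes only the first.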
Dataflow architecture is a computer architecture that directly contrasts with
the traditional control flow architecture (Von Neumann architecture). It does not
have a program counter and the execution of instructions is solely determined
based on the availability of input arguments to the instructions. The dataflow
architecture is very relevant in many software architectures today including
parallel computing frameworks. This architecture was proposed in the 1970s
and early 1980s by Jack Dennis of Massachusetts Institute of Technology
(MIT).
13.7.1 Dataflow programming
Software written using dataflow architecture consists of a collection of
independent components running in parallel that communicate via data
channels. In a dataflow model, a node is a computational component, and an
arrow is a buffered data channel. A control algorithm is first divided into
nodes. Each concurrently executing node is a self-contained software part with
well-defined functionality.
Data channels provide the sole mechanism by which nodes can interact and
communicate with each other, which ensures lower coupling and greater
reusability. Data channels can also be implemented transparently between
processors to carry messages between components that are physically
distributed. In the dataflow architecture, a control application is composed of
function bodies and data channels, and the connections between function
bodies and data channels are described in a dataflow graph. Consequently,
designing a control application mainly involves constructing such a dataflow
graph by selecting function bodies from the design library and connecting them
together.
Additional user-defined or application-specific function bodies are also easily
supported. A model of dataflow programming is shown in figure 13.3.
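A small dataflow program in this style can be sketched in Python with threads as nodes and `queue.Queue` objects as buffered data channels. The node functions, the `DONE` end-of-stream marker and the buffer size of 2 are assumptions made for this example. The nodes share no variables and interact only through their channels.

```python
import queue
import threading

DONE = object()                       # sentinel that marks the end of the stream


def source(values, out_channel):
    for value in values:
        out_channel.put(value)        # blocks while the buffer is full
    out_channel.put(DONE)


def square(in_channel, out_channel):
    while True:
        item = in_channel.get()       # blocks until data is available
        if item is DONE:
            out_channel.put(DONE)
            return
        out_channel.put(item * item)


def run_pipeline(values):
    first, second = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
    nodes = [threading.Thread(target=source, args=(values, first)),
             threading.Thread(target=square, args=(first, second))]
    for node in nodes:
        node.start()
    results = []
    while True:                       # the caller acts as the final (sink) node
        item = second.get()
        if item is DONE:
            break
        results.append(item)
    for node in nodes:
        node.join()
    return results
```

`run_pipeline([1, 2, 3])` returns `[1, 4, 9]`. Replacing `square` with another function body that has the same channel interface changes the computation without touching the other nodes, which is the reusability argument made above.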
13.7.2 Dataflow graph
The dataflow computational model uses a directed graph to describe a
computation. This graph is called a dataflow graph or data dependency graph.
It consists of nodes and edges (arcs). Nodes represent operations and edges
represent data paths. Dataflow is a distributed model of computation as there
is no single locus of control. A dataflow graph is asynchronous, as execution
of a node starts when matching data is available at the node's input ports. In
the original dataflow models, data tokens are consumed when the node executes.
Some models were extended with "Sticky tokens", tokens that stay in place much
like a constant input and match with tokens arriving on other inputs. Nodes
can have varying granularity, from instructions to functions. Once a node is
activated and the nodal operation is performed, the outcome is called the
"Fired Results", which are passed along the arc to the waiting node. This
process is repeated until all of the nodes are fired and the final result is
created. More than one node can also be fired simultaneously. Arithmetic
operators and conditional operators act as nodes.
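The firing rule can be sketched with a tiny interpreter for the expression (a + b) * (c - d). The class and function names are hypothetical, and the sketch fires ready nodes one at a time for simplicity, although nodes that are ready together could fire simultaneously. There is no program counter: a node runs only when tokens are present on all of its input ports.

```python
import operator


class Node:
    def __init__(self, name, operation, num_inputs=2):
        self.name = name
        self.operation = operation
        self.tokens = [None] * num_inputs      # one slot per input port
        self.present = [False] * num_inputs    # which ports hold a token
        self.successors = []                   # (node, port) pairs fed by this node


def send(node, port, value, ready):
    node.tokens[port] = value
    node.present[port] = True
    if all(node.present):                      # all operands available
        ready.append(node)


def run(initial_tokens):
    ready, fired, last_result = [], [], None
    for node, port, value in initial_tokens:
        send(node, port, value, ready)
    while ready:
        node = ready.pop(0)
        last_result = node.operation(*node.tokens)    # the node fires
        node.present = [False] * len(node.present)    # its tokens are consumed
        fired.append(node.name)
        for successor, port in node.successors:
            send(successor, port, last_result, ready) # result travels along the arc
    return last_result, fired


def example():
    add = Node("add", operator.add)
    sub = Node("sub", operator.sub)
    mul = Node("mul", operator.mul)
    add.successors.append((mul, 0))
    sub.successors.append((mul, 1))
    return run([(add, 0, 2), (add, 1, 3), (sub, 0, 7), (sub, 1, 4)])
```

`example()` returns `(15, ["add", "sub", "mul"])`: the multiplication node fires last, only after the results of the addition (5) and the subtraction (3) have arrived on its two input ports.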
The Dynamic Critical Path: The dynamic critical path of a dataflow graph is
simultaneously a function of program dependences, runtime execution path,
hardware resources and dynamic scheduling. All critical events must be last
arrival events. A last-arrival event is the last event that enables data to
be latched. Events correspond to signal transitions on the edges of the
dataflow graph. Most often, the last-arrival event is the last input to reach
an operation. However, for lenient operations the last-arrival event is the
input that enables the computation of the output. In lenient execution, all
forward branches are executed simultaneously.
In a typical execution, multiple critical events may correspond to the same
hardware structure. In strict execution, the multiplier is on the critical path,
while in lenient execution, the multiplier is critical only when its result is
used by later computations.
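The difference between a strict and a lenient operation can be illustrated with a multiplier. The sketch assumes, as an illustration that this unit does not spell out, that a lenient multiplier may produce its output as soon as one input is known to be zero, since the product is then zero regardless of the other input. Inputs are given as hypothetical `(arrival_time, value)` pairs and `delay` is the assumed operation latency.

```python
def strict_ready_time(arrivals, delay=1):
    """A strict operation waits for its last input before it can start."""
    return max(time for time, _ in arrivals) + delay


def lenient_multiply_ready_time(arrivals, delay=1):
    """A lenient multiplier can finish once a zero input has arrived."""
    for time, value in sorted(arrivals):       # inputs in order of arrival
        if value == 0:
            return time + delay                # product is known to be 0
    return max(time for time, _ in arrivals) + delay
```

For inputs `[(5, 0), (20, 7)]` the strict result is ready at time 21, while the lenient result is ready at time 6, so the late input is not the last-arrival event for the lenient operation. When no input is zero, both versions give the same time.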
Self Assessment Questions
11. In a dataflow model, a ____________ is a computational component,
and an _____________ is a buffered data channel.
12. Once a node is activated and the nodal operation is performed, this is
called __________ .
13.8 Hybrid Multithreaded Architecture
The dataflow model and the Von Neumann control-flow model are generally
viewed as two extremes of a spectrum of execution models on which a variety of
architecture models are based. However, it has been argued that the two models
are in reality not orthogonal. Starting with the operational model of a pure
dataflow graph, one can easily extend the model to support Von Neumann style
program execution. A region in a dataflow graph can be grouped together as a
thread to be executed sequentially under its own private program counter
control, while the activation and synchronisation of threads are data-driven.
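The hybrid idea can be sketched as a thread that becomes ready in a data-driven way and then runs sequentially under its own program counter. All names here, including the synchronisation counter `sync_count`, are assumptions for illustration and are not taken from any of the architectures discussed in this unit.

```python
class DataDrivenThread:
    def __init__(self, name, needed_inputs, instructions):
        self.name = name
        self.sync_count = needed_inputs    # inputs that have not yet arrived
        self.instructions = instructions   # executed strictly in order
        self.inputs = {}


def deliver(thread, key, value, ready_queue):
    """Data-driven activation: the thread is enabled by its last input."""
    thread.inputs[key] = value
    thread.sync_count -= 1
    if thread.sync_count == 0:
        ready_queue.append(thread)


def run_thread(thread):
    """Von Neumann style execution under a private program counter."""
    env = dict(thread.inputs)
    pc = 0
    while pc < len(thread.instructions):
        thread.instructions[pc](env)       # execute the instruction at the PC
        pc += 1
    return env


def add_xy(env):
    env["s"] = env["x"] + env["y"]


def double_s(env):
    env["r"] = env["s"] * 2
```

A thread created as `DataDrivenThread("t", 2, [add_xy, double_s])` stays out of the ready queue after `x` is delivered and enters it only when `y` arrives. Running it with x = 3 and y = 4 then produces r = 14 without any further synchronisation between its two instructions.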
It has been argued that there are machines along this spectrum which
trade instruction scheduling simplicity for better low-level synchronisation,
and that there exists some optimal point between the two extremes, i.e. a new
hybrid model which synergistically combines features of both Von Neumann and
dataflow, as well as exposes parallelism at a desired level. Such hybrid
multithreaded architecture models have been proposed by a number of
research groups, with their origins in either static dataflow or dynamic
dataflow. Now we will study the fundamentals of some research projects.
• McGill Dataflow Architecture: It is motivated by the static dataflow
model. The McGill Dataflow Architecture Model has been proposed
based on the argument-fetching principle. The architecture departs from a
direct implementation of dataflow graphs by having instructions fetch
data from memory or registers rather than having instructions
deposit operands (tokens) in operand receivers of successor instructions.
On completion of an instruction, an event (called a signal) is posted to
notify the instructions that depend on its result. This implements a
modified model of dataflow computation called dataflow signal graphs. The
architecture includes features to support efficient loop execution
through dataflow software pipelining, and the support of threaded function
activations.
• Iannucci's Model: Iannucci combined dataflow ideas, based on his
work on the MIT Dynamic Tagged Token Dataflow Architecture (TTDA)
and the experience gained, with sequential thread execution. This
was done to define a hybrid computation model described in his Ph.D.
thesis. His ideas later took the form of a multithreaded architecture project
at the IBM Yorktown Research Centre. The architecture includes
features such as a cache memory with synchronisation controls,
prioritised processor ready queues and features for efficient
process migration to facilitate load balancing.
• P-RISC: This hybrid model explores the possibility of constructing a
multithreaded architecture around a RISC processor. The P-RISC model
splits the complex dataflow instructions into separate synchronisation,
arithmetic and fork/control instructions. This eliminates the need for
presence bits on the token store (or frame memory) as proposed in the
Monsoon machine. P-RISC also allows the compiler to group
instructions into longer threads, replacing some of the dataflow
synchronisation with conventional program-counter-based
synchronisation.
• *T (Star T): The Monsoon project at MIT was followed by the Star-T
project. It uses an extension of an off-the-shelf processor architecture to
define a multiprocessor architecture that supports fine-grain
communication and user micro-threads. The architecture is
intended to retain the latency-hiding feature of the Monsoon split-phase
global memory operations while being compatible with conventional
Massively Parallel Architectures (MPAs) based on the Von Neumann model.
Inter-node traffic consists of a tag (called a continuation, a pair
comprising a context and instruction pointer).
All inter-node communications are performed using split-phase
transactions (request and response messages); processors never block
when issuing a remote request, and the network interface for message
handling is well integrated into the processor pipeline. A separate
co-processor handles all the responses to remote requests.
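A split-phase transaction can be sketched as two separate messages linked by a continuation. The simulation below is a single-threaded illustration under stated assumptions: the message tuples, the frame store `contexts` and the handler name `on_reply` are invented for the example and do not describe the actual *T hardware. The continuation is modelled, as in the description above, as a pair of a context and an instruction pointer.

```python
from collections import deque


def simulate_split_phase(remote_memory, address):
    network = deque()                  # messages in flight between nodes
    log = []
    contexts = {"frame0": {}}          # hypothetical per-thread frame store

    def on_reply(context, value):      # code that the continuation points to
        context["loaded"] = value
        log.append(f"resumed with {value}")

    code = {"on_reply": on_reply}

    # Phase 1: issue the request together with a continuation; do not block.
    continuation = ("frame0", "on_reply")
    network.append(("request", address, continuation))
    log.append("request issued")
    log.append("other work done while waiting")

    # The remote node services the request and sends a response message.
    _, requested_address, reply_to = network.popleft()
    network.append(("response", remote_memory[requested_address], reply_to))

    # Phase 2: the response arrives and the continuation is resumed.
    _, value, (context_id, instruction_pointer) = network.popleft()
    code[instruction_pointer](contexts[context_id], value)
    return log, contexts["frame0"]
```

With `simulate_split_phase({8: 42}, 8)` the log shows the request being issued, other work proceeding, and only then the continuation resuming with the value 42. The requesting side never waits between the two phases, which is how the latency of the remote access is hidden.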
Self Assessment Questions
13. _________ combined dataflow ideas with sequential thread
execution to define a hybrid computation model.
14. P-RISC explores the possibility of constructing a multithreaded
architecture around a CISC processor. (True/ False)
13.9 Summary
• Multithreading is needed to create an application that is able to perform
more than one task at once.
• There are various needs, benefits and principles of multithreading.
• Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system.
• The combination of languages and the computer architecture in a common
foundation or paradigm is called Computational Model.
• There are basically three types of computational models as follows: Von
Neumann model, Dataflow model and Hybrid multithreaded model.
• The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit).
• Dataflow architecture does not have a program counter and the execution
of instructions is solely determined based on the availability of input
arguments to the instructions.
• A hybrid model synergistically combines features of both the Von Neumann
and dataflow models, and exposes parallelism at a desired level.
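The dataflow firing rule summarised above can be sketched in a few lines. This is an illustrative model only: the function run_dataflow, the graph encoding, and the node names sum, diff and prod are assumptions, not part of any real dataflow machine. There is no program counter; a node fires as soon as all of its input operands are available.

```python
from operator import add, mul, sub

def run_dataflow(graph, inputs):
    """graph maps node name -> (function, operand names); inputs maps name -> value."""
    tokens = dict(inputs)    # values that are currently available
    pending = dict(graph)    # nodes that have not fired yet
    fired = []               # firing order, driven only by data availability
    while pending:
        # A node is enabled when every one of its operands has a token.
        ready = [name for name, (_, operands) in pending.items()
                 if all(op in tokens for op in operands)]
        if not ready:
            raise ValueError("no node can fire: some operand never arrives")
        for name in ready:
            function, operands = pending.pop(name)
            tokens[name] = function(*(tokens[op] for op in operands))
            fired.append(name)
    return tokens, fired

# (a + b) * (c - d), with the nodes deliberately listed out of execution order.
graph = {
    "prod": (mul, ["sum", "diff"]),
    "sum": (add, ["a", "b"]),
    "diff": (sub, ["c", "d"]),
}
tokens, fired = run_dataflow(graph, {"a": 2, "b": 3, "c": 7, "d": 4})
```

Although prod is listed first, it fires last, because its operands become available only after sum and diff have fired. The two independent nodes sum and diff are enabled in the same pass, which is the parallelism the dataflow model exposes.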
13.10 Glossary
• Background processing: Another name for the multithreading system
which has a concept of priorities.
• Communication parallelism: refers to the way threads can
communicate with other threads residing in other processors.
• Computation parallelism: refers to the 'conventional' parallelism.
• GUI: Graphical User Interface. GUI programs can perform more than one
task (such as editing a document as well as printing it) at a time.
• JVM: Java Virtual Machine. It is an example of a multithreading system.
• Pre-emptive multitasking: Another name for the multithreading system
which has a concept of priorities.
• Thread granularity: refers to the number of instructions in a thread.
• Thread switch: refers to how the control of a thread is transferred to
another thread.
13.11 Terminal Questions
1. Define multithreading. What is the need of multithreading? Enumerate its
benefits.
2. Briefly explain the principles of multithreading.
3. Discuss scalable and multithreaded architectures.
4. What is meant by Computational models?
5. Write short notes on:
a) Von Neumann-based multithreaded architectures
b) Dataflow architecture
c) Hybrid multithreaded architecture
13.12 Answers
Self Assessment Questions
1. Background processing, pre-emptive multitasking
2. False
3. True
4. Fine-grain threading
5. Scalable
6. Thread
7. Computational Model
8. True
9. Instruction set architecture (ISA)
10. True
11. Node, arrow
12. Fired, results
13. Iannucci
14. False
Terminal Questions
1. Multithreading is a type of multitasking. The multithreading system has a
concept of priorities and therefore, is also called background processing
or pre-emptive multitasking. Refer Section 13.2.
2. There are important parameters which characterise the performance of
multithreading. Refer Section 13.3.
3. Scalability is defined as the ability to increase the amount of processing
that can be done by adding more resources to a system. Refer Section
13.4.
4. The combination of languages and the computer architecture in a common
foundation or paradigm is called a computational model. Refer Section 13.5.
5. a) The heart of the Von Neumann computer architecture is the Central
Processing Unit (CPU), consisting of the control unit and the ALU
(Arithmetic and Logic Unit). Refer Section 13.6.
b) Dataflow architecture is a computer architecture that directly contrasts
with the traditional control flow architecture (Von Neumann
architecture). Refer Section 13.7.
c) A hybrid model synergistically combines features of both the Von
Neumann and dataflow models, and exposes parallelism at a desired level.
Refer Section 13.8.