Computer Architecture Notes

Uploaded by Sonakshi saha

mywbut.com

Introduction to Computer System


Representation of Basic Information

The basic functional units of a computer are made of electronic circuits and work with electrical signals. We provide input to the computer in the form of electrical signals and get the output in the form of electrical signals.

There are two basic types of electrical signals, namely analog and digital. Analog signals are continuous in nature, whereas digital signals are discrete.

An electronic device that works with continuous signals is known as an analog device, and one that works with discrete signals is known as a digital device. Most present-day computers are digital in nature, and we will deal with digital computers in this course.

A computer is a digital device that works on two levels of signal, which we call High and Low. The High-level signal corresponds to some higher voltage (say 5 Volt or 12 Volt) and the Low-level signal corresponds to a lower voltage (say 0 Volt). This is one convention, known as positive logic; there are other conventions as well, such as negative logic.

Since a computer is a digital electronic device, we have to deal with these two kinds of electrical signals. But while designing a new computer system or studying the working principle of a computer, it is inconvenient to write or work with 0V and 5V.

To make it convenient for understanding, we use some logical value, say,


LOW (L) - will represent 0V and
HIGH (H) - will represent 5V

Computers are used mainly to solve numerical problems, and it is not convenient to work with this symbolic representation. For that purpose we move to a numeric representation, in which we use 0 to represent LOW and 1 to represent HIGH.

0 means LOW
1 means HIGH

To describe the working principle of a computer, we use only two numeric symbols, namely 0 and 1. All the functionality of a computer can be captured with 0 and 1, and its theoretical background corresponds to two-valued Boolean algebra.

With the symbols 0 and 1, we have a mathematical system known as the binary number system. The binary number system is used to represent and manipulate information in a computer; this information is basically strings of 0s and 1s.


The smallest unit of information represented in a computer is known as a Bit (Binary Digit), which is either 0 or 1. Four bits together are known as a Nibble, and eight bits together are known as a Byte.

Computer Organization and Architecture

Computer technology has made incredible improvements in the past half century. In the early part of computer evolution, there were no stored-program computers, the computational power was low, and on top of that the size of a computer was huge.

Today, a personal computer has more computational power, more main memory, and more disk storage, is smaller in size, and is available at an affordable cost.

This rapid rate of improvement has come both from advances in the technology used to build
computers and from innovation in computer design. In this course we will mainly deal with the
innovation in computer design.

The task that the computer designer handles is a complex one: Determine what attributes are
important for a new machine, and then design a machine to maximize performance while staying
within cost constraints.

This task has many aspects, including instruction set design, functional organization, logic
design, and implementation.

While looking at the task of computer design, both the terms computer organization and computer architecture come into the picture.

It is difficult to give precise definitions for the terms Computer Organization and Computer Architecture. But while describing computer systems we come across these terms, and in the literature computer scientists try to make a distinction between the two.

Computer architecture refers to those parameters of a computer system that are visible to a
programmer or those parameters that have a direct impact on the logical execution of a program.
Examples of architectural attributes include the instruction set, the number of bits used to
represent different data types, I/O mechanisms, and techniques for addressing memory.

Computer organization refers to the operational units and their interconnections that realize the
architectural specifications. Examples of organizational attributes include those hardware details
transparent to the programmer, such as control signals, interfaces between the computer and
peripherals, and the memory technology used.

In this course we will touch upon all these factors and finally arrive at the concept of how these attributes contribute to building a complete computer system.


Basic Computer Model and different units of Computer

The model of a computer can be described by four basic units in a high-level abstraction, as shown in Figure 1.1. These basic units are:

 Central Processor Unit

 Input Unit

 Output Unit

 Memory Unit

Figure 1.1: Basic Unit of a Computer

A. Central Processor Unit (CPU):

The central processor unit consists of two basic blocks:

• The program control unit has a set of registers and control circuit to generate control
signals.
• The execution unit or data processing unit contains a set of registers for storing data and
an Arithmetic and Logic Unit (ALU) for execution of arithmetic and logical operations.

In addition, the CPU may have some additional registers for the temporary storage of data.

B. Input Unit:

With the help of the input unit, data from outside can be supplied to the computer. A program or data is read into main storage from an input device or from secondary storage under the control of a CPU input instruction.

Examples of input devices: Keyboard, Mouse, Hard disk, Floppy disk, CD-ROM drive, etc.

C. Output Unit:

With the help of the output unit, computer results can be provided to the user, or they can be stored permanently in a storage device for future use. Output data go from main storage to an output device under the control of CPU output instructions.


Examples of output devices: Printer, Monitor, Plotter, Hard Disk, Floppy Disk, etc.

D. Memory Unit:

The memory unit is used to store data and programs. The CPU can work with the information stored in the memory unit. This memory unit is termed primary memory or the main memory module. These are basically semiconductor memories.

There are two types of semiconductor memories -

• Volatile Memory : RAM (Random Access Memory).


• Non-Volatile Memory: ROM (Read only Memory), PROM (Programmable ROM)
EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM).

Secondary Memory: There is another kind of storage device, apart from primary or main memory, which is known as secondary memory. Secondary memories are non-volatile and are used for permanent storage of data and programs.

Examples of secondary memories: Hard Disk, Floppy Disk, Magnetic Tape (magnetic devices), CD-ROM (an optical device), Thumb drive or pen drive (semiconductor memory).

Digital and Analog Signals

Signals carry information and are defined as any physical quantity that varies with time, space,
or any other independent variable. For example, a sine wave whose amplitude varies with respect
to time or the motion of a particle with respect to space can be considered as signals. A system
can be defined as a physical device that performs an operation on a signal. For example, an
amplifier is used to amplify the input signal amplitude. In this case, the amplifier performs some
operation(s) on the signal, which has the effect of increasing the amplitude of the desired
information-bearing signal.

Signals can be categorized in various ways, for example by discrete and continuous time domains. Discrete-time signals are defined only on a discrete set of times. Continuous-time signals are often referred to as continuous signals even when the signal functions are not continuous; an example is a square-wave signal.


Figure 1a: Analog Signal

Figure 1b: Digital Signal

Basic Working Principle of a Computer

Before going into the details of working principle of a computer, we will analyze how computers
work with the help of a small hypothetical computer.

In this small computer, we do not consider the input and output units. We will consider only the CPU and the memory module. Assume that somehow we have stored the program and data in main memory; we will see how the CPU can perform a job depending on the program stored there.

P.S. - Our assumption is that students understand common terms like program, CPU, memory
etc. without knowing the exact details.

Consider the Arithmetic and Logic Unit (ALU) of Central Processing Unit:

Consider an ALU which can perform four arithmetic operations and four logical operations. To distinguish between an arithmetic and a logical operation, we may use one signal line:


0 on that signal line represents an arithmetic operation, and

1 on that signal line represents a logical operation.

In a similar manner, we need another two signal lines to distinguish between the four arithmetic operations (and likewise the four logical operations).

The different operations and their binary codes are as follows:

Arithmetic Logical
000 ADD 100 OR
001 SUB 101 AND
010 MULT 110 NAND
011 DIV 111 NOR

Consider the part of the control unit whose task is to generate the appropriate signal at the right moment.

There is an instruction decoder in the CPU which decodes this information in such a way that the computer can perform the desired task.

In a simple model of the decoder, there are three input lines and, correspondingly, eight output lines. Depending on the input combination, exactly one of the output signals is activated, and it indicates the corresponding ALU operation.
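As a rough illustration, the 3-to-8 decoder and the operation codes from the table above can be modeled in software. This is a minimal sketch; the function and dictionary names are our own, not part of the original design.

```python
# Operation codes from the table above: the most significant bit selects
# the group (0 = arithmetic, 1 = logical), the other two bits the operation.
OPERATIONS = {
    0b000: "ADD", 0b001: "SUB", 0b010: "MULT", 0b011: "DIV",
    0b100: "OR",  0b101: "AND", 0b110: "NAND", 0b111: "NOR",
}

def decode(c2: int, c1: int, c0: int) -> list[int]:
    """3-to-8 decoder: return the eight output lines; exactly one is 1."""
    selected = (c2 << 2) | (c1 << 1) | c0
    return [1 if i == selected else 0 for i in range(8)]

outputs = decode(0, 1, 0)                # input combination 010
print(OPERATIONS[outputs.index(1)])      # -> MULT
```

Note that for any input combination exactly one output line is active, which is what lets a single decoder select one ALU operation at a time.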

In our simple model, we use three storage units in the CPU:

two -- for storing the operands and
one -- for storing the result.
These storage units are known as registers.

But a computer needs more storage space for its proper functioning. Some of it is inside the CPU, in the form of registers. The other, bigger chunk of storage is known as primary memory or main memory. The CPU can work only with the information available in main memory.

To access data in memory, we need two special registers: one is known as the Memory Data Register (MDR) and the other as the Memory Address Register (MAR).

Data and programs are stored in main memory. While executing a program, the CPU brings instructions and data from main memory and performs the tasks as per the instructions fetched from memory. After completing an operation, the CPU stores the result back into memory.

In the next section, we discuss the memory organization of our small machine.


Main Memory Organization

The main memory unit is the storage unit; there are several locations for storing information in the main memory module.

The capacity of a memory module is specified by the number of memory locations and the amount of information stored in each location.

A memory module of capacity 16 X 4 indicates that there are 16 locations in the memory module and that in each location we can store 4 bits of information.

We have to know how to indicate or point to a specific memory location. This is done by the address of the memory location.

We need two operations to work with memory.

READ operation: retrieve data from memory and bring it to a CPU register.
WRITE operation: store data from a CPU register to a memory location.

We need some mechanism to distinguish these two operations READ and WRITE.

With the help of one signal line, we can differentiate these two operations. If the content of this signal line is
0, we perform a READ operation; and if it is
1, we perform a WRITE operation.
To transfer data from the CPU to the memory module and vice versa, we need some connection. This is termed the DATA BUS.

The size of the data bus indicates how many bits we can transfer at a time. The size of the data bus is mainly determined by the data storage capacity of each location of the memory module.

We have to resolve the issues how to specify a particular memory location where we want to
store our data or from where we want to retrieve the data.

This can be done by the memory address. Each location can be specified with the help of a
binary address.

If we use 4 signal lines, we have 16 different combinations on these four lines, provided we use only two signal values (say 0 and 1).

So, to distinguish 16 locations, we need four signal lines. The signal lines used to identify a memory location are termed the ADDRESS BUS. The size of the address bus depends on the memory size: for a memory module with 2^n locations, we need n address lines, that is, an address bus of size n.
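The relation between memory size and address-bus width can be checked with a small calculation (a sketch; the helper name `address_lines` is our own):

```python
import math

def address_lines(locations: int) -> int:
    """Number of address lines needed to address `locations` memory words."""
    return math.ceil(math.log2(locations))

print(address_lines(16))     # 4 lines for 16 locations
print(address_lines(1024))   # 10 lines for a 1K memory
```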


We use an address decoder to decode the address present on the address bus.

For example, consider a memory module of 16 locations, where each location can store 4 bits of information.
The size of the address bus is 4 bits and the size of the data bus is 4 bits.
The address decoder is a 4 X 16 decoder.

There is a control signal named R/W.


If R/W = 0, we perform a READ operation and
if R/W = 1, we perform a WRITE operation

If the contents of the address bus are 0101, the contents of the data bus are 1100, and R/W = 1, then 1100 will be written into location 5.

If the contents of the address bus are 1011 and R/W = 0, then the contents of location 1011 will be placed on the data bus.
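These READ/WRITE conventions can be illustrated with a toy software model of the 16 X 4 memory. The `Memory` class below is purely illustrative, following the convention above: R/W = 0 reads, R/W = 1 writes.

```python
class Memory:
    """Toy model of a small word-addressable memory with an R/W signal."""

    def __init__(self, locations: int = 16, width: int = 4):
        self.cells = [0] * locations
        self.mask = (1 << width) - 1     # keep each cell to `width` bits

    def access(self, address: int, rw: int, data: int = 0):
        if rw == 1:                      # WRITE: store data-bus value
            self.cells[address] = data & self.mask
            return None
        return self.cells[address]       # READ: place cell on the data bus

m = Memory()
m.access(0b0101, rw=1, data=0b1100)      # write 1100 into location 5
print(m.access(0b0101, rw=0))            # read it back -> 12 (binary 1100)
```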

Memory Instruction

We need some more instructions to work with the computer. Apart from the instructions needed to perform tasks inside the CPU, we need some more instructions for data transfer between main memory and the CPU.
In our hypothetical machine, we used three signal lines to identify a particular instruction. If we want to include more instructions, we need an additional signal line.

Instruction Code Meaning


1000 LDAI imm Load register A with data that is given in the program
1001 LDAA addr Load register A with data from memory location addr
1010 LDBI imm Load register B with data
1011 LDBA addr Load register B with data from memory location addr
1100 STC addr Store the value of register C in memory location addr
1101 HALT Stop the execution
1110 NOP No operation
1111 NOP No operation

With this additional signal line, we can go up to 16 instructions. When the signal on this new line is 0, it indicates an ALU operation. When it is 1, it indicates one of 8 new instructions; so we can design 8 new memory-access instructions.


We have added 6 new instructions. Two codes are still unused and can be reserved for other purposes; we show them as NOP, meaning No Operation.

We have seen that for an ALU operation, the instruction decoder generates the signal for the appropriate ALU operation.

Apart from that, we need many more signals for the proper functioning of the computer. Therefore, we need a module known as the control unit, which is a part of the CPU. The control unit is responsible for generating the appropriate signals.

For example, for the LDAI instruction, the control unit must generate a signal that enables register A to store the incoming data.

One major task is to design the control unit so that it generates the appropriate signal at the appropriate time for the proper functioning of the computer.

Consider a simple problem: add two numbers and store the result in memory, say adding 7 to 5.

To solve this problem on a computer, we have to write a program. The program is machine specific and is related to the instruction set of the machine.

For our hypothetical machine, the program is as follows:

Instruction Binary HEX Memory Location


LDAI 5 1000 0101 85 (0, 1)
LDBI 7 1010 0111 A7 (2, 3)
ADD 0000 0 (4)
STC 15 1100 1111 CF (5, 6)
HALT 1101 D (7)

Main Memory Organization: Stored Program

Present-day digital computers are based on the stored-program concept introduced by Von Neumann. In this stored-program concept, programs and data are stored in a storage unit, called memory, separate from the processing unit.

The Central Processing Unit, the main component of the computer, can work only with the information stored in the storage unit.

In 1946, Von Neumann and his colleagues began the design of a stored-program computer at the Institute for Advanced Study in Princeton. This computer is referred to as the IAS computer.


The structure of IAS computer is shown in Figure 1.2.

Figure 1.2: Structure of a first generation computer : IAS

The IAS computer is having three basic units:

 The Central Processing Unit (CPU).


 The Main Memory Unit.
 The Input/Output Device.

Central Processing Unit:

This is the main unit of the computer, responsible for performing all the operations. The CPU of the IAS computer consists of a data processing unit and a program control unit.

The data processing unit contains high-speed registers intended for temporary storage of instructions, memory addresses, and data. The main action specified by an instruction is performed by the arithmetic-logic circuits of the data processing unit.

The control circuits in the program control unit are responsible for fetching instructions,
decoding opcodes, controlling the information movements correctly through the system, and
providing proper control signals for all CPU actions.


The Main Memory Unit:

It is used for storing programs and data. Each location of the memory unit is uniquely specified by its memory address. M(X) is used to indicate the location of the memory unit M with address X.

Data transfer between the memory unit and the CPU takes place with the help of the data register DR. When the CPU wants to read some information from the memory unit, the information is first brought to DR, and after that it goes to the appropriate position. Similarly, data to be stored in memory must be put into DR first, and then it is stored in the appropriate location of the memory unit.

The address of the memory location used during memory read and memory write operations is stored in the address register AR.

If the information fetched from memory is an operand of an instruction, it is moved from DR to the data processing unit (either to AC or MQ). If it is an instruction, it is moved to the program control unit (either to IR or IBR).

Two additional registers for the temporary storage of operands and results are included in data
processing units: the accumulator AC and the multiplier-quotient register MQ.

Two instructions are fetched simultaneously from M and transferred to the program control unit. The instruction that is not to be executed immediately is placed in the instruction buffer register IBR. The opcode of the other instruction is placed in the instruction register IR, where it is decoded.

In the decoding phase, the control circuits generate the required control signals to perform the
specified operation in the instruction.

The program counter (PC) is used to store the address of the next instruction to be fetched from
memory.

Input Output Device:

Input devices are used to put information into the computer. With the help of input devices we can store information in memory so that the CPU can use it. A program or data is read into main memory from an input device or from secondary storage under the control of a CPU input instruction.

Output devices are used to get information out of the computer. If some results are evaluated by the computer and stored in it, then with the help of output devices we can present them to the user. Output data go from main memory to an output device under the control of CPU output instructions.


Arithmetic and logic Unit (ALU)

The ALU is responsible for performing the operations in the computer.

The basic operations are implemented at the hardware level. The ALU provides a collection of two types
of operations:

 Arithmetic operations

 Logical operations

Consider an ALU having 4 arithmetic operations and 4 logical operations.

To identify any one of the four logical operations or four arithmetic operations, two control lines are needed. To identify which of the two groups, arithmetic or logical, is intended, another control line is needed. So, with the help of three control lines, any one of these eight operations can be identified.

Consider an ALU having four arithmetic operations: addition, subtraction, multiplication and division. Also consider that the ALU has four logical operations: OR, AND, NOT and EX-OR.

We need three control lines (say C2, C1 and C0) to identify any one of these operations. One possible combination is given here:
Control line C2 is used to identify the group, logical or arithmetic, i.e.,
C2 = 0 : arithmetic operation; C2 = 1 : logical operation.
Control lines C1 and C0 are used to identify any one of the four operations within a group.


A decoder is used to decode the instruction. The block diagram of the ALU is shown in
Figure 2.1.

Figure 2.1: Block Diagram of the ALU

The ALU has two input registers, named A and B, and one output storage register, named C. It performs the operation:

C = A op B

The input data are stored in A and B and, according to the operation specified on the control lines, the ALU performs the operation and puts the result in register C.

For example, if the contents of the control lines are 000, the decoder enables the addition operation: it activates the adder circuit, and the addition operation is performed on the data available in storage registers A and B. After the completion of the operation, the result is stored in register C.

We should have hardware implementations for the basic operations. These basic operations can then be used to implement complicated operations that are not feasible to implement directly in hardware.

Several logic gates exist in digital logic circuits. These logic gates can be used to implement the logical operations. Some of the common logic gates are mentioned here.

AND gate: The output is high only if both the inputs are high. The AND gate and its truth table are shown in Figure 2.2.


Figure 2.2: AND gate and its truth table.

OR gate: The output is high if at least one of the inputs is high. The OR gate and its truth table are shown in Figure 2.3.

Figure 2.3: OR gate and its truth table.

EX-OR gate: The output is high if exactly one of the inputs is high, i.e., if the inputs differ. The EX-OR gate and its truth table are given in Figure 2.4.

Figure 2.4: EX-OR gate and its truth table.


If we want to construct a circuit that performs the AND operation on two 4-bit numbers, the implementation of the 4-bit AND operation is as shown in Figure 2.5.

Figure 2.5: 4-bit AND operator

Arithmetic Circuit

Binary Adder:
Binary adder is used to add two binary numbers.

In general, the adder circuit needs two binary inputs and two binary outputs. The input variables designate the augend and addend bits; the output variables produce the sum and carry.

The binary addition of single bits is shown in the truth table below (C: carry bit, S: sum bit):

x y | C S
0 0 | 0 0
0 1 | 0 1
1 0 | 0 1
1 1 | 1 0

The simplified sum-of-products expressions are

S = x'y + xy' (i.e., S = x XOR y)
C = xy


The circuit implementation is

Figure 2.6: Circuit diagram and Block diagram of Half Adder

This circuit cannot handle a carry input, so it is termed a half adder. The circuit diagram and block diagram of the half adder are shown in Figure 2.6.
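The half adder behavior can be expressed directly in code, a minimal sketch using the expressions S = x XOR y and C = xy:

```python
def half_adder(x: int, y: int) -> tuple[int, int]:
    """Return (sum, carry) for two single bits: S = x XOR y, C = x AND y."""
    return x ^ y, x & y

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", half_adder(x, y))
```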

Full Adder:

A full adder is a combinational circuit that forms the arithmetic sum of three bits. It consists of
three inputs and two outputs.

Two of the input variables, denoted by x and y, represent the two bits to be added. The third input, z, represents the carry from the previous lower-order position.

The two outputs are designated by the symbols S for sum and C for carry.


The simplified expressions for S and C are

S = x'y'z + x'yz' + xy'z' + xyz
C = xy + xz + yz

We may rearrange these two expressions as follows:

S = z XOR (x XOR y)
C = xy + z(x XOR y)
The circuit diagram of the full adder is shown in Figure 2.7.

Figure 2.7: Circuit diagram and block diagram of Full Adder

The circuit diagram and block diagram of a full adder are shown in Figure 2.7. n such single-bit full adder blocks are used to make an n-bit adder.
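A single-bit full adder, following the rearranged expressions above, can be sketched as:

```python
def full_adder(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (sum, carry): S = z XOR (x XOR y), C = xy + z(x XOR y)."""
    s = z ^ (x ^ y)
    c = (x & y) | (z & (x ^ y))
    return s, c

print(full_adder(1, 1, 1))   # -> (1, 1): 1 + 1 + 1 = binary 11
```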


To demonstrate the binary addition of four bit numbers, let us consider a specific example.

Consider two binary numbers

A = 1001, B = 0011

To get the four-bit adder, we have to use 4 full adder blocks. The carry output of each lower bit is used as the carry input to the next higher bit.

The circuit of the 4-bit adder is shown in Figure 2.8.

Figure 2.8: A 4-bit adder circuit.
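The cascading of full adders in Figure 2.8 can be modeled as follows. In this sketch the bit lists are written least significant bit first, and the names are illustrative:

```python
def ripple_adder(a_bits, b_bits, carry_in=0):
    """Add two equal-length little-endian bit lists by cascading full adders."""
    def full_adder(x, y, z):
        return z ^ x ^ y, (x & y) | (z & (x ^ y))
    result, carry = [], carry_in
    for x, y in zip(a_bits, b_bits):
        s, carry = full_adder(x, y, carry)   # carry ripples to the next stage
        result.append(s)
    return result, carry

# A = 1001 (9), B = 0011 (3), as in the example above; bits LSB first
s, c = ripple_adder([1, 0, 0, 1], [1, 1, 0, 0])
print(s, c)   # -> [0, 0, 1, 1] 0, i.e. sum 1100 = 12
```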

Binary Subtractor:

The subtraction operation can be implemented with the help of the binary adder circuit, because A - B = A + (2's complement of B).

We know that the 2's complement representation of a number is treated as the negative of the given number.

We can get the 2's complement of a given number by complementing each bit and adding 1 to the result.

The circuit for subtracting A - B consists of an adder with an inverter placed between each data input B and the corresponding input of the full adder. The input carry must be equal to 1 when performing subtraction.

The operation thus performed becomes A, plus the 1's complement of B, plus 1. This is equal to A plus the 2's complement of B.

With this principle, a single circuit can be used for both addition and subtraction. The 4-bit adder-subtractor circuit is shown in Figure 2.9. It has one mode (M) selection input line, which determines the operation:

If M = 0, then S = A + B (addition).
If M = 1, then S = A + B' + 1 = A - B, where B' is the 1's complement of B (subtraction).

Figure 2.9: 4-bit adder subtractor

The circuit diagram of the 4-bit adder-subtractor is shown in Figure 2.9.

The mode input M is applied to one input of each EX-OR gate on the B lines. The operation of the EX-OR gate here is:

if M = 0, B XOR 0 = B;
if M = 1, B XOR 1 = B' (the 1's complement of B).
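The complete adder-subtractor behavior (each B bit XORed with M, and M fed in as the input carry) can be sketched arithmetically; the function name and the 4-bit masking are illustrative:

```python
def add_sub_4bit(a: int, b: int, m: int) -> int:
    """4-bit adder-subtractor: M=0 adds, M=1 subtracts (result modulo 16)."""
    b_in = (b ^ (0xF if m else 0)) & 0xF   # B XOR M applied on every bit line
    return (a + b_in + m) & 0xF            # M is also the input carry

print(add_sub_4bit(0b1001, 0b0011, 0))   # 9 + 3 = 12
print(add_sub_4bit(0b1001, 0b0011, 1))   # 9 - 3 = 6
```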

Implemental issue of some operations


Multiplication

Multiplication of two numbers in binary representation can be performed by a process of SHIFT and ADD operations. Since the binary number system allows only 0s and 1s, digit multiplication can be replaced by SHIFT and ADD operations alone, because multiplying by 1 gives the number itself and multiplying by 0 produces 0.

The multiplication process is illustrated with a numerical example.

The process consists of looking at successive bits of the multiplier, least significant bit first. If
the multiplier bit is a 1, the multiplicand is copied down, otherwise, zeros are copied down. The
numbers copied down in successive lines are shifted one position to the left from the previous
number. Finally, the numbers are added and their sum forms the product.
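The paper-and-pencil process just described can be sketched as (names are illustrative):

```python
def multiply_shift_add(multiplicand: int, multiplier: int) -> int:
    """Binary multiplication: for each 1 bit of the multiplier (LSB first),
    add a left-shifted copy of the multiplicand; for a 0 bit, add nothing."""
    product, shift = 0, 0
    while multiplier:
        if multiplier & 1:                       # multiplier bit is 1
            product += multiplicand << shift     # copy down, shifted left
        multiplier >>= 1                         # inspect the next bit
        shift += 1
    return product

print(multiply_shift_add(0b1011, 0b101))   # 11 * 5 = 55
```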

When multiplication is implemented in a digital computer, the process is changed slightly.

Instead of providing registers to store and add simultaneously as many binary numbers as there are bits in the multiplier, it is convenient to provide an adder for the summation of only two binary numbers and to successively accumulate the partial products in a register. This reduces the number of registers required.

Also, instead of shifting the multiplicand to the left, the partial product is shifted to the right.


When the corresponding bit of the multiplier is 0, there is no need to add all zeros to the partial product.

This gives an algorithm to multiply two binary numbers. Suppose the ALU does not provide a multiplication operation, but it does provide addition and shifting operations. Then we can write a microprogram for the multiplication operation and store the microprogram code in memory. When a multiplication operation is encountered, the machine executes this microcode to perform the multiplication.

Binary Multiplier, Hardware Implementation

The block diagram of the binary multiplier is shown in Figure 2.10.

Figure 2.10: Block diagram of binary multiplier

The multiplicand is stored in register B and the multiplier is stored in register Q.

The partial product is formed in register A and stored in A and Q

The counter P is initially set to a number equal to the number of bits in the multiplier. The
counter is decremented by 1 after forming each partial product. When the content of the counter
reaches zero, the product is formed and the process stops.

Initially, the multiplicand is in register B and the multiplier in Q. The register A is reset to 0.

The sum of A and B forms a partial product, which is transferred to the EA register (E is a flip-flop that holds the carry out of the addition).

Both the partial product and the multiplier are shifted to the right. The least significant bit of A is shifted into the most significant position of Q, and 0 is shifted into E.

After the shift, one bit of the partial product is shifted into Q, pushing the multiplier bits one position to the right.

The rightmost flip-flop in register Q, designated Q0, holds the bit of the multiplier that must be inspected next. If the content of this bit is 0, then it is not required to add the multiplicand; only shifting is needed. If the content of this bit is 1, then both addition and shifting are needed.

After each shift, the value of counter P is decremented, and the process continues until the counter value becomes 0.

The final result is available in the (E, A, Q) register combination.
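The register-level algorithm (multiplicand in B, multiplier in Q, partial product in A, carry flip-flop E, counter P) can be sketched as a simulation. The bit manipulations below are an illustrative rendering of the shifts described above, assuming a 4-bit word:

```python
def multiply_hw(B: int, Q: int, n: int = 4) -> int:
    """Shift-and-add multiplication using registers A, Q, flip-flop E
    and counter P; the n-bit product pair ends up in (A, Q)."""
    A, E, P = 0, 0, n
    while P > 0:
        if Q & 1:                                      # Q0 = 1: add B to A
            total = A + B
            A, E = total & ((1 << n) - 1), total >> n  # E holds the carry
        # shift (E, A, Q) right: E -> msb of A, lsb of A -> msb of Q
        Q = (Q >> 1) | ((A & 1) << (n - 1))
        A = (A >> 1) | (E << (n - 1))
        E = 0
        P -= 1                                         # decrement counter
    return (A << n) | Q                                # product in (A, Q)

print(multiply_hw(0b1001, 0b0011))   # 9 * 3 = 27
```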

To control the operation, it is required to design the appropriate control logic that is shown in the
block diagram.

The flow chart of the multiplication operation is given in the Figure 2.11.

Figure 2.11: Flow chart of the multiplication operation

The working of multiplication algorithm is shown here with the help of an example.



Problems

Q: Assume that the EX-OR gate has a propagation delay of 20 ns and that the AND and OR gates have a propagation delay of 10 ns each. What is the propagation delay of the full adder circuit? What is the propagation delay of an 8-bit adder constructed by connecting 8 full adders in a cascade?

Ans:
The propagation delay of the full adder circuit is 40 ns; i.e., if we provide both the inputs and the carry at time t, then after 40 ns we will get stable outputs at the sum and carry-out bits.


Since the 8-bit adder is constructed by connecting 8 full adders in a cascade, there will be some delay in propagating the carry bit.

Since the carry output of the i-th full adder is provided as an input to the (i+1)-th full adder, the addition operation of the (i+1)-th full adder cannot be performed until there is a stable output from the i-th full adder, which appears after 40 ns.

The first bit position takes 40 ns to provide its carry output. Similarly, each subsequent bit position takes another 40 ns to produce a stable output.

Therefore, the total propagation delay is 40 x 8 = 320 ns.
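The same delay figures can be expressed as a small calculation, using the gate delays given in the problem and the full adder structure S = z XOR (x XOR y), C = xy + z(x XOR y):

```python
XOR, AND_OR = 20, 10                   # gate delays in ns, per the problem

full_adder_delay = 2 * XOR             # sum path: two XOR gates in series
carry_path_delay = XOR + AND_OR + AND_OR   # carry path: XOR, then AND, then OR

def ripple_delay(bits: int) -> int:
    """Worst-case delay of a ripple-carry adder built from `bits` full adders."""
    return bits * full_adder_delay

print(full_adder_delay, carry_path_delay)   # both paths settle in 40 ns
print(ripple_delay(8))                      # 320 ns for the 8-bit adder
```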


Pipelining
Introduction to Pipelining

It is observed that organizational enhancements to the CPU can improve performance. We have already seen that the use of multiple registers, rather than a single accumulator, and the use of cache memory improve performance considerably. Another organizational approach, which is quite common, is instruction pipelining.

Pipelining is a particularly effective way of organizing parallel activity in a computer system. The
basic idea is very simple. It is frequently encountered in manufacturing plants, where pipelining
is commonly known as an assembly line operation.

By laying the production process out in an assembly line, products at various stages can be worked on simultaneously. This process is referred to as pipelining because, as in a pipeline, new inputs are accepted at one end before previously accepted inputs appear as outputs at the other end.

To apply the concept of pipelining to instruction execution, it is required to break the instruction into different tasks. Each task is executed in a different processing element of the CPU.

As we know, there are two distinct phases of instruction execution: one is instruction fetch and the other is instruction execution. Therefore, the processor executes a program by fetching and executing instructions, one after another.

Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of a sequence of fetch and execute steps, as shown in Figure 9.1.

Figure 9.1: Sequential Execution.


Now consider a CPU that has two separate hardware units, one for fetching instructions and
another for executing them.

The instruction fetched by the fetch unit is stored in an intermediate storage buffer.

The results of execution are stored in the destination location specified by the instruction.

For simplicity, it is assumed that the fetch and execute steps of any instruction can each be completed in one clock cycle.

The operation of the computer proceeds as follows:

• In the first clock cycle, the fetch unit fetches an instruction I1 (step F1) and stores it in the buffer at the end of the clock cycle.

• In the second clock cycle, the instruction fetch unit proceeds with the fetch operation for instruction I2 (step F2).

• Meanwhile, the execution unit performs the operation specified by instruction I1, which is already fetched and available in the buffer (step E1).

• By the end of the second clock cycle, the execution of instruction I1 is completed and instruction I2 is available.

• Instruction I2 is stored in the buffer, replacing I1, which is no longer needed.

• Step E2 is performed by the execution unit during the third clock cycle, while instruction I3 is being fetched by the fetch unit.


• Both the fetch and execute units are kept busy all the time, and one instruction is completed after each clock cycle except the first clock cycle.

• If a long sequence of instructions is executed, the instruction completion rate will be twice that achievable by sequential operation with only one unit that performs both fetch and execute.

Basic idea of instruction pipelining with hardware organization is shown in the Figure 9.2.

Figure 9.2: Hardware Organization

The pipeline execution of fetch and execution cycle is shown in the Figure 9.3

Figure 9.3: Pipeline Execution.


The processing of an instruction need not be divided into only two steps. To gain further speed
up, the pipeline must have more stages.

Let us consider the following decomposition of the instruction execution:

• Fetch Instruction (FI): Read the next expected instruction into a buffer.

• Decode Instruction (DI): Determine the opcode and the operand specifiers.

• Calculate Operand (CO): Calculate the effective address of each source operand.

• Fetch Operands (FO): Fetch each operand from memory.

• Execute Instruction (EI): Perform the indicated operation.

• Write Operand (WO): Store the result in memory.

There will be six different stages for these six subtasks. For the sake of simplicity, let us assume equal duration for all the subtasks. If the six stages are not of equal duration, there will be some waiting involved at various pipeline stages.

The timing diagram for the execution of instruction in pipeline fashion is shown in the Figure
9.4.


Figure 9.4: Timing sequence in pipeline execution.

From this timing diagram it is clear that the total execution time of 8 instructions in this 6-stage pipeline is 13 time units. The first instruction completes after 6 time units, and thereafter one instruction completes in each time unit.

Without the pipeline, the total time required to complete 8 instructions would have been 48 (6 × 8) time units. Therefore, there is a speed-up in pipeline processing, and the speed-up is related to the number of stages.


Pipeline Performance

The cycle time τ of an instruction pipeline is the time needed to advance a set of instructions one stage through the pipeline. The cycle time can be determined as

τ = max[τi] + d = τm + d,  1 ≤ i ≤ k

where

τm = maximum stage delay (delay through the stage which experiences the largest delay),
k = number of stages in the instruction pipeline,
d = time delay of a latch, needed to advance signals and data from one stage to the next.
In general, the time delay d is equivalent to a clock pulse and τm >> d.

Now suppose that n instructions are processed and these instructions are executed one after another. The total time required to execute all n instructions is

Tk = [k + (n − 1)] τ

A total of k cycles are required to complete the execution of the first instruction, and the remaining (n − 1) instructions require one cycle each, i.e. (n − 1) cycles.

The time required to execute n instructions without the pipeline is

T1 = n k τ

because to execute one instruction it will take k cycles.

The speed-up factor for the instruction pipeline compared to execution without the pipeline is defined as:

Sk = T1 / Tk = n k τ / [k + (n − 1)] τ = n k / [k + (n − 1)]


In general, the number of instructions executed is much higher than the number of stages in the pipeline. So, as n tends to infinity, we have

Sk → n k / n = k

i.e., we have a k-fold speed-up; the speed-up factor is a function of the number of stages in the instruction pipeline.

Though it has been seen that the speed-up is proportional to the number of stages in the pipeline, in practice the speed-up is less due to some practical reasons. The factors that affect the pipeline performance are discussed next.

Effect of Intermediate storage buffer:

Consider a pipeline processor, which process each instruction in four steps;

F: Fetch, Read the instruction from the memory

D: Decode, decode the instruction and fetch the source operand (S)

O: Operate, perform the operation

W: Write, store the result in the destination location.

The hardware organization of this four-stage pipeline processor is shown in the Figure 9.5.

Figure 9.5: Four stage pipeline processor


In the preceding section we have seen that the speed-up of a pipeline processor is related to the number of stages in the pipeline, i.e., the greater the number of stages in the pipeline, the faster the execution rate.

But the organization of the stages of a pipeline is a complex task and it affects the performance of the pipeline.

Problems related to a larger number of stages:

At each stage of the pipeline, there is some overhead involved in moving data from buffer to
buffer and in performing various preparation and delivery functions. This overhead can
appreciably lengthen the total execution time of a single instruction.

The amount of control logic required to handle memory and register dependencies and to
optimize the use of the pipeline increases enormously with the number of stages.

Apart from the hardware organization, there are some other reasons which may affect the performance of the pipeline.

(A) Unequal time requirement to complete a subtask:

Consider the four-stage pipeline with processing steps Fetch, Decode, Operate and Write.

Stage 3 of the pipeline is responsible for arithmetic and logic operations, and in general one clock cycle is assigned for this task.

Although this may be sufficient for most operations, some operations like divide may require more time to complete.

Figure 9.6 shows the effect of an operation that takes more than one clock cycle to complete in the Operate stage.


Figure 9.6: An operation that takes more than one clock cycle. (Timing diagram for instructions I1–I4 over clock cycles 1–9; the Operate stage of I2 occupies cycles 4–6.)

The Operate stage for instruction I2 takes 3 clock cycles, cycles 4 to 6, to perform the specified operation, and so the Write stage is doing nothing during clock cycles 5 and 6, because no data is available to write.

Meanwhile, the information in buffer B2 must remain intact until the Operate stage has completed its operation.

This means that stage 2 and stage 1 are blocked from accepting new instructions, because the information in B1 cannot be overwritten by a newly fetched instruction.

The contents of B1, B2 and B3 must always change at the same clock edge.

For this reason, the pipeline operation is said to have stalled for two clock cycles. Normal pipeline operation resumes in clock cycle 7.

Whenever the pipeline stalls, some degradation in performance occurs.
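The stall described above can be reproduced with a small, simplified model (assumptions: in-order stages, single buffers between them, per-stage latencies given in cycles) that computes the cycle in which each instruction leaves each stage:

```python
def schedule(latencies):
    """latencies[i][s] = cycles instruction i spends in stage s (F, D, O, W).
    Returns finish[i][s], the cycle in which instruction i completes stage s,
    assuming an in-order pipeline: an instruction enters a stage only after it
    finishes the previous stage AND the earlier instruction vacates this stage."""
    n, stages = len(latencies), len(latencies[0])
    finish = [[0] * stages for _ in range(n)]
    for i in range(n):
        for s in range(stages):
            ready = finish[i][s - 1] if s else 0  # done with previous stage
            free = finish[i - 1][s] if i else 0   # stage vacated by I(i-1)
            finish[i][s] = max(ready, free) + latencies[i][s]
    return finish

# Four instructions; the Operate stage of I2 takes 3 cycles instead of 1.
lat = [[1, 1, 1, 1],
       [1, 1, 3, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]]
for i, row in enumerate(schedule(lat), start=1):
    print(f"I{i} finishes W in cycle {row[-1]}")
```

This reproduces the two-cycle stall: I1 completes in cycle 4, but I2 only in cycle 7, after which one instruction again completes per cycle (I3 in 8, I4 in 9).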

Role of cache memory:

The use of cache memory solves the memory access problem


Occasionally, a memory request results in a cache miss. This causes the pipeline stage that
issued the memory request to take much longer time to complete its task and in this case the
pipeline stalls. The effect of cache miss in pipeline processing is shown in the Figure 9.7.

Figure 9.7: Effect of a cache miss in pipeline processing. (Timing diagram for instructions I1–I3 over clock cycles 1–9.)

The function performed by each stage as a function of time is shown in Figure 9.8.

Clock cycle:   1    2     3     4     5     6     7     8     9
F: Fetch       F1   F2    F2    F2    F2    F3    F4
D: Decode           D1    idle  idle  idle  D2    D3
O: Operate                O1    idle  idle  idle  O2    O3
W: Write                        W1    idle  idle  idle  W2    W3

Figure 9.8: Function performed by each stage as a function of time.

In this example, instruction I1 is fetched from the cache in cycle 1 and its execution proceeds normally.


The fetch operation for instruction I2, which starts in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further fetch requests and wait for I2 to arrive.

We assume that instruction I2 is received and loaded into buffer B1 at the end of cycle 5. This implies that the cache memory used here is four times faster than the main memory.

The pipeline resumes its normal operation at that point, and it will remain in normal operation mode for some time, because a cache miss generally transfers a whole block from main memory to the cache.

From the figure, it is clear that the Decode unit, Operate unit and Write unit remain idle for three clock cycles.

Such idle periods are sometimes referred to as bubbles in the pipeline.

Once created as a result of a delay in one of the pipeline stages, a bubble moves downstream until it reaches the last unit. A pipeline cannot stall as long as the instructions and data being accessed reside in the cache. This is facilitated by providing separate on-chip instruction and data caches.

Dependency Constraints:

Consider the following program that contains two instructions, I1 followed by I2:

I1: A ← A + 5

I2: B ← 3 * A

When this program is executed in a pipeline, the execution of I2 can begin before the execution of I1 completes. The pipeline execution is shown in Figure 9.9.

Clock cycle:   1    2    3    4    5    6
I1             F1   D1   O1   W1
I2                  F2   D2   O2   W2

Figure 9.9: Pipeline execution of two instructions

In clock cycle 3, the actual operation of instruction I1, i.e. the addition, takes place, and only then is the new updated value of A available. But in the same clock cycle 3, instruction I2 is fetching the operand that is required for its operation. Since the operation of instruction I1 is taking place in clock cycle 3 itself, instruction I2 will get the old value of A, not the updated value, and will produce a wrong result. Consider that the initial value of A is 4. The proper execution will produce the result

B = 27

But due to the pipeline action, we will get the result

B = 3 × 4 = 12

Due to the data dependency, these two instructions cannot be performed in parallel.

Therefore, no two operations that depend on each other can be performed in parallel. For correct execution, the following must be satisfied:

• The operation of the fetch stage must not depend on the operation performed during the same clock cycle by the execution stage.

• The operation of fetching an instruction must be independent of the execution results of the previous instruction.

• Data dependency arises when the destination of one instruction is used as a source in a subsequent instruction.
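The stale-operand effect can be reproduced directly (plain Python variables stand in for registers; the values A = 4, B = 27 are from the example above):

```python
# Register A before execution
A = 4

# Correct (sequential) execution: I2 reads A only after I1 has written it.
A_correct = A + 5          # I1: A <- A + 5
B_correct = 3 * A_correct  # I2: B <- 3 * A
print(B_correct)           # 27

# Pipelined execution without interlocks: I2 fetches its operand in the same
# cycle in which I1 is still computing, so it reads the OLD value of A.
A_stale = A                # value of A seen by I2's operand-fetch stage
A_new = A + 5              # I1 completes afterwards
B_wrong = 3 * A_stale
print(B_wrong)             # 12
```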


Branching

In general, when we are executing a program, the next instruction to be executed is brought from the next memory location. Therefore, in a pipeline organization, we fetch instructions one after another.

But in the case of a conditional branch instruction, the address of the next instruction to be fetched depends on the result of the execution of the branch instruction.

Since the execution of the next instruction depends on the previous branch instruction, sometimes it may be required to invalidate several instruction fetches. Consider the instruction execution sequence of Figure 9.10.

Figure 9.10: Instruction execution sequence (instructions I1, I2, … over clock cycles 1–10).

In this instruction sequence, consider that I3 is a conditional branch instruction.

The result of the branch instruction will be available at clock cycle 5. But by that time the fetch unit has already fetched instructions I4 and I5.

If the branch condition is false, then the branch won't take place and the next instruction to be executed is I4, which is already fetched and available for execution.

Now consider the case when the condition is true: we have to execute instruction I10. After clock cycle 5, it is known that the branch condition is true, and now instruction I10 has to be executed.


But the processor has already fetched instructions I4 and I5. It is required to invalidate these two fetched instructions, and the pipeline must be loaded with the new destination instruction I10.

Due to this reason, the pipeline will stall for some time. The time lost due to a branch instruction is often referred to as the branch penalty.

Time:   1    2    3    4    5    6     7     8     9     10
I1      F1   D1   O1   W1
I2           F2   D2   O2   W2
I3                F3   D3   O3   W3
I4                     F4   D4
I5                          F5
I10                              F10   D10   O10   W10
I11                                    F11   D11   O11   W11

Figure 9.11: The effect of a taken branch

The effect of the taken branch is shown in Figure 9.11. Because the branch is taken, instructions I4 and I5, which have already been fetched, are not executed, and the new instruction I10 is fetched at clock cycle 6.

There is no effective output in clock cycles 7 and 8, and so the branch penalty is 2.

The branch penalty depends on the number of stages in the pipeline. A larger number of stages results in a larger branch penalty.
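As a rough, back-of-envelope estimate of how a branch penalty degrades throughput (the formula and the workload numbers below are illustrative assumptions, not from the notes), the average cycles per instruction in steady state can be approximated as 1 + (branch fraction) × (taken fraction) × (penalty):

```python
def avg_cpi(branch_frac, taken_frac, penalty):
    """Back-of-envelope average cycles per instruction for a pipeline that
    loses `penalty` cycles on every taken branch (steady state, no other
    stalls, one instruction per cycle otherwise)."""
    return 1 + branch_frac * taken_frac * penalty

# Assumed workload: 20% branches, 60% of them taken, 2-cycle penalty
print(round(avg_cpi(0.20, 0.60, 2), 2))  # 1.24
```

With these assumed numbers, the taken branches alone cost about 24% of the ideal one-instruction-per-cycle rate, which is why the techniques in the next section are worthwhile.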

Dealing with Branches:


One of the major problems in designing an instruction pipeline is assuring a steady flow of instructions to the initial stages of the pipeline.

The primary problem is the conditional branch instruction: until the instruction is actually executed, it is impossible to determine whether the branch will be taken or not.

A variety of approaches have been taken for dealing with conditional branches:

• Multiple streams
• Prefetch branch target
• Loop buffer
• Branch prediction
• Delayed branch

Multiple streams

A single pipeline suffers a penalty for a branch instruction because it must choose one of two
instructions to fetch next and sometimes it may make the wrong choice.

A brute-force approach is to replicate the initial portions of the pipeline and allow the pipeline
to fetch both instructions, making use of two streams.

There are two problems with this approach.

• With multiple pipelines there are contention delays for access to the registers
and to memory

• Additional branch instructions may enter the pipeline (either stream) before the original branch decision is resolved. Each such instruction needs an additional stream.

Prefetch Branch target

When a conditional branch is recognized, the target of the branch is prefetched, in addition to the instruction following the branch. This target is then saved until the branch instruction is executed. If the branch is taken, the target has already been prefetched.

Loop Buffer:


A loop buffer is a small, very high speed memory maintained by the instruction fetch stage of the pipeline, containing the most recently fetched instructions, in sequence.

If a branch is to be taken, the hardware first checks whether the branch target is within the buffer. If so, the next instruction is fetched from the buffer.

The loop buffer has three benefits:

1. With the use of prefetching, the loop buffer will contain some instructions sequentially ahead of the current instruction fetch address. Thus, instructions fetched in sequence will be available without the usual memory access time.

2. If a branch occurs to a target just a few locations ahead of the address of the
branch instruction, the target will already be in the buffer. This is usual for the
common occurrence of IF-THEN and IF-THEN-ELSE sequences.

3. This strategy is particularly well suited for dealing with loops, or iterations; hence
the name loop buffer. If the loop buffer is large enough to contain all the
instructions in a loop, then those instructions need to be fetched from memory
only once, for the first iteration. For subsequent iterations, all the needed
instructions are already in the buffer.

The loop buffer is similar in principle to a cache dedicated to instructions. The differences are
that the loop buffer only retains instructions in sequence and is much smaller in size and hence
lower in cost.

Branch Prediction:

Various techniques can be used to predict whether a branch will be taken or not. The most
common techniques are:

• Predict never taken

• Predict always taken

• Predict by opcode


• Taken/not taken switch

• Branch history table.

The first three approaches are static; they do not depend on the execution history up to the time of the conditional branch instruction.

The latter two approaches are dynamic; they depend on the execution history.

Predict never taken assumes that the branch will not be taken and continues to fetch instructions in sequence.

Predict always taken assumes that the branch will be taken and always fetches from the branch target.

In these two approaches it is also possible to minimize the effect of a wrong decision.

If the fetch of an instruction after the branch will cause a page fault or protection violation, the
processor halts its prefetching until it is sure that the instruction should be fetched.

Studies analyzing program behaviour have shown that conditional branches are taken more than 50% of the time, and so if the cost of prefetching from either path is the same, then always prefetching from the branch target address should give better performance than always prefetching from the sequential path.

However, in a paged machine, prefetching the branch target is more likely to cause a page fault
than prefetching the next instruction in the sequence and so this performance penalty should
be taken into account.

The predict-by-opcode approach makes the decision based on the opcode of the branch instruction. The processor assumes that the branch will be taken for certain branch opcodes and not for others. Studies reported in [Lilja 1988] showed that the success rate is greater than 75% with this strategy.

Lilja, D., "Reducing the Branch Penalty in Pipelined Processors", Computer, July 1988.


Dynamic branch strategies attempt to improve the accuracy of prediction by recording the history of conditional branch instructions in a program. Schemes to maintain the history information:

• One or more bits can be associated with each conditional branch instruction that
reflect the recent history of the instruction.

• These bits are referred to as a taken/not taken switch that directs the processor
to make a particular decision the next time the instruction is encountered.

• Generally these history bits are not associated with the instruction in main memory, since that would unnecessarily increase the size of the instruction. With a single bit we can record whether the last execution of this instruction resulted in a branch or not.

• With only one bit of history, an error in prediction will occur twice for each use of the loop: once on entering the loop, and once on exiting it.

If two bits are used, they can be used to record the result of the last two instances of the
execution of the associated instruction.

The history information is not kept in main memory; it can be kept in a small high-speed memory.

One possibility is to associate these bits with any conditional branch instruction that is in a
cache. When the instruction is replaced in the cache, its history is lost.

Another possibility is to maintain a small table for recently executed branch instructions with
one or more bits in each entry.

The branch history table is a small cache memory associated with the instruction fetch stage of
the pipeline. Each entry in the table consists of three elements:

• The address of the branch instruction.

• Some number of history bits that record the state of use of that instruction.


• Information about the target instruction: it may be the address of the target instruction, or it may be the target instruction itself.

Delayed Branch:

Instruction
Ij        Fj    Ej
Ij+1            Fj+1
Ik                    Fk    Ek
Ik+1                        Fk+1   Ek+1
                                          Time

Figure 9.12: Instructions in pipeline with branch instruction.

Figure 9.12 shows the execution of instructions in the pipeline, where instruction Ij is a branch instruction.

The processor begins fetching instruction Ij+1 before it determines whether the current instruction, Ij, is a branch instruction.

When execution of Ij is completed and a branch must be made, the processor must discard the instruction Ij+1 that was fetched and now fetch the instruction Ik at the branch target.

The location following a branch instruction is called a branch delay slot. There may be more
than one branch delay slot, depending on the time it takes to execute a branch instruction.

The instructions in the delay slots are always fetched and at least partially executed before the
branch decision is made and the branch target address is computed.

Delayed branching is a technique to minimize the penalty incurred as a result of conditional branch instructions.


The instructions in the delay slots are always fetched, so we should arrange for the instructions in the delay slots to be fully executed whether or not the branch is taken.

The objective is to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP (no operation) instructions.

While filling up the delay slots with instructions, it is required to maintain the original semantics of the program.

For example, consider the following code segment given in Figure 9.13.

I1 LOOP Shift_left R1

I2 Decrement R2

I3 Branch_if 0 LOOP

I4 NEXT Add R1,R3

Figure 9.13: Original Program Loop

Here register R2 is used as a counter to determine the number of times the contents of register R1 are shifted left.

Consider a processor with a two-stage pipeline and one delay slot. During the execution phase of the branch instruction I3, the fetch unit will fetch instruction I4. Only after evaluating the branch condition will it be clear whether instruction I1 or I4 is executed next.

The nature of the code segment says that control remains in the loop for a number of passes depending on the initial value of R2, and when R2 becomes zero, control comes out of the loop and executes instruction I4. During the loop execution, every time there is a wrong fetch of instruction I4. The code segment can be reorganized without disturbing the original meaning of the program. The reordered code segment is shown in Figure 9.14.


LOOP Decrement R2

Branch_if 0 LOOP

Shift_left R1

NEXT Add R1,R3

Figure 9.14: Reordered instructions for program loop

In this case, the shift instruction is fetched while the branch instruction is being executed.

After evaluating the branch condition, the processor fetches the instruction at LOOP or at NEXT,
depending on whether the branch condition is true or false, respectively.

In either case, it completes execution of the shift instruction.

Logically, the program is executed as if the branch instruction were placed after the shift instruction. That is, branching takes place one instruction later than where the branch instruction appears in the instruction sequence in memory; hence the name "delayed branch".

Figure 9.15 shows the execution timing for the last two passes through the loop of reordered
instructions.

Figure 9.16 shows the execution timing for the last two passes through the loop of the original
program loop.


Instruction
Decrement   F   E
Branch          F   E
Shift               F   E
Decrement               F   E
Branch                      F   E
Shift                           F   E
Add                                 F   E
                                          Time

Figure 9.15: Execution timing for the last two passes through the loop of reordered instructions.

Instruction
Shift       F   E
Decrement       F   E
Branch              F   E
Add                     F   *
Shift                       F   E
Decrement                       F   E
Branch                              F   E
Add                                     F   E
                                              Time


Figure 9.16: Execution timing for last two passes through the loop of the original program loop.
*Note : Execution Unit Idle

Problems

Q 1: Explain the concept of instruction pipeline.

Q 2: How do you evaluate the performance enhancement of a pipeline processor with d number of phases with respect to a processor without a pipeline?

Q 3: Why is a two stage instruction pipeline unlikely to cut the instruction cycle time in half,
compared with the use of no pipeline?

Q 4: What is branch penalty?

Q 5: What do you mean by static branch strategies and dynamic branch strategies to deal with
branches in pipeline processor?

Q 6: What is a branch history table and how is it used to deal with branches?

Q 7: Explain the concept of delayed branching technique.

Q 8: What is a loop buffer? How is a loop buffer used to handle branching in a pipeline processor?


Memory
We have already mentioned that a digital computer works on the stored-program concept introduced by Von Neumann. We use memory to store information, which includes both program and data.

Due to several reasons, we have different kinds of memories, and we use different kinds of memory at different levels.

The memory of a computer is broadly divided into two categories:

 Internal and
 External

Internal memory is used by the CPU to perform its tasks, and external memory is used to store bulk information, which includes large software and data.

Memory is used to store the information in digital form. The memory hierarchy is given by:

 Register
 Cache Memory
 Main Memory
 Magnetic Disk
 Removable media (Magnetic tape)

Register:
These are part of the Central Processing Unit, so they reside inside the CPU. Information from main memory is brought to the CPU and kept in registers. Due to space and cost constraints, we have only a limited number of registers in a CPU. These are basically the fastest storage devices.

Cache Memory:
Cache memory is a storage device placed in between the CPU and main memory. These are semiconductor memories, basically fast memory devices, faster than main memory.

We cannot have a big volume of cache memory due to its higher cost and some constraints of the CPU. Due to the higher cost we cannot replace the whole main memory by faster memory. Generally, the most recently used information is kept in the cache memory; it is brought from the main memory and placed in the cache memory. Nowadays, we get CPUs with internal cache.

Main Memory:
Like cache memory, main memory is also semiconductor memory, but main memory is relatively slower. We have to first bring the information (whether it is data or program) to main memory. The CPU can work only with the information available in main memory.


Magnetic Disk:
This is a bulk storage device. We have to deal with huge amounts of data in many applications, but we don't have enough semiconductor memory to keep all this information in the computer. On the other hand, semiconductor memories are volatile in nature: they lose their contents once we switch off the computer. For permanent storage, we use magnetic disk. The storage capacity of a magnetic disk is very high.

Removable media:
For different applications, we use different data. It may not be possible to keep all the information on magnetic disk, so whichever data we are not using currently can be kept in removable media. Magnetic tape is one kind of removable medium. The CD is also a removable medium, which is an optical device.

Registers, cache memory and main memory are internal memory. Magnetic disk and removable media are external memory. Internal memories are semiconductor memories. Semiconductor memories are categorized as volatile memory and non-volatile memory.

RAM: Random Access Memories are volatile in nature. As soon as the computer is switched off, the
contents of memory are also lost.

ROM: Read only memories are non-volatile in nature. The storage is permanent, but it is read-only; we cannot store new information in ROM.

Several types of ROM are available:

• PROM: Programmable Read Only Memory; it can be programmed once as per user requirements.

• EPROM: Erasable Programmable Read Only Memory; the contents of the memory can be erased and new data stored into the memory. In this case, we have to erase the whole contents.

• EEPROM: Electrically Erasable Programmable Read Only Memory; in this type of memory the contents of a particular location can be changed without affecting the contents of other locations.

Main Memory
The main memory of a computer is semiconductor memory. The main memory unit of a computer basically consists of two kinds of memory:


RAM: Random access memory, which is volatile in nature.

ROM: Read only memory, which is non-volatile.

The permanent information is kept in ROM, and the user space is basically in RAM.

The smallest unit of information is known as a bit (binary digit), and in one memory cell we can store one bit of information. 8 bits together are termed a byte.

The maximum size of main memory that can be used in any computer is determined by the addressing
scheme.

A computer that generates 16-bit addresses is capable of addressing up to 2^16 = 64K memory locations. Similarly, for 32-bit addresses, the total capacity will be 2^32 = 4G memory locations.

In some computers, the smallest addressable unit of information is a memory word, and the machine is called word-addressable.

In other computers, an individual address is assigned to each byte of information; such a machine is called a byte-addressable computer. In this computer, one memory word contains one or more memory bytes which can be addressed individually.

In a byte-addressable 32-bit computer, each memory word contains 4 bytes. A possible way of address assignment is shown in Figure 3.1. The address of a word is always an integer multiple of 4.

The main memory is usually designed to store and retrieve data in word length quantities. The word
length of a computer is generally defined by the number of bits actually stored or retrieved in one main
memory access.

Consider a machine with a 32-bit address bus. If the word size is 32 bits, then the high-order 30 bits will specify the address of a word. If we want to access any particular byte of the word, it can be specified by the lower two bits of the address bus.


Figure 3.1: Address assignment to a 4-byte word

The data transfer between main memory and the CPU takes place through two CPU registers.

 MAR: Memory Address Register
 MDR: Memory Data Register

If the MAR is k bits long, then the total number of addressable memory locations will be 2^k.

If the MDR is n bits long, then n bits of data are transferred in one memory cycle.

The transfer of data takes place through the memory bus, which consists of the address bus and the data bus. In the above example, the size of the data bus is n bits and the size of the address bus is k bits.

It also includes control lines like Read, Write and Memory Function Complete (MFC) for coordinating data transfer. In the case of a byte-addressable computer, another control line is added to indicate a byte transfer instead of a whole-word transfer.

For a memory operation, the CPU initiates the operation by loading the appropriate data, i.e. the address, into MAR.


If it is a memory read operation, the CPU sets the read memory control line to 1. The contents of
the memory location are then brought to the MDR, and the memory control circuitry indicates
this to the CPU by setting MFC to 1.

If the operation is a memory write operation, the CPU places the data into the MDR and sets
the write memory control line to 1. Once the contents of the MDR are stored in the specified memory
location, the memory control circuitry indicates the end of the operation by setting MFC to 1.
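The read and write handshakes just described can be mimicked with a toy model. All names below are illustrative, not a real memory API:

```python
# Toy model of the CPU-memory handshake via MAR/MDR and the MFC signal,
# sketching the protocol described above (not a real implementation).
class ToyMemory:
    def __init__(self, size):
        self.cells = [0] * size

    def read(self, mar):
        """CPU loads MAR and asserts Read; memory fills MDR and sets MFC."""
        mdr = self.cells[mar]
        mfc = 1                  # Memory Function Complete
        return mdr, mfc

    def write(self, mar, mdr):
        """CPU loads MAR and MDR and asserts Write; memory stores, sets MFC."""
        self.cells[mar] = mdr
        return 1                 # MFC
```

A usage sketch: writing 42 to location 3 and reading it back returns the data together with MFC = 1.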

A useful measure of the speed of a memory unit is the time that elapses between the initiation of
an operation and its completion (for example, the time between Read and MFC). This is referred
to as the memory access time. Another measure is the memory cycle time, which is the minimum
time delay between the initiation of two independent memory operations (for example, two
successive memory read operations). Memory cycle time is slightly larger than memory access time.

Binary Storage Cell:

The binary storage cell is the basic building block of a memory unit.

The binary storage cell that stores one bit of information can be modeled by an SR latch with
associated gates. This model of binary storage cell is shown in the figure 3.2.

Figure 3.2: Binary Storage cell made up of SR-Latch

1 bit Binary Cell (BC)

The binary cell stores one bit of information in its internal latch.


Control inputs to the binary cell:

Select   Read/Write   Memory Operation
  0          X         None
  1          0         Write
  1          1         Read
The storage part is modeled here with an SR latch, but in reality it is an electronic circuit made up of
transistors.
The memory constructed with the help of transistors is known as semiconductor memory. Semiconductor
memories are termed Random Access Memory (RAM), because it is possible to access any memory
location at random.
Depending on the technology used to construct a RAM, there are two types of RAM -

SRAM: Static Random Access Memory.


DRAM: Dynamic Random Access Memory.

Dynamic RAM (DRAM):

A DRAM is made with cells that store data as charge on capacitors. The presence or absence of charge in
a capacitor is interpreted as binary 1 or 0.
Because capacitors have a natural tendency to discharge due to leakage current, dynamic RAM requires
periodic charge refreshing to maintain data storage. The term dynamic refers to this tendency of the
stored charge to leak away, even with power continuously applied.
A typical DRAM structure for an individual cell that stores one bit information is shown in the figure 3.3.

Figure 3.3: Dynamic RAM (DRAM) cell


For the write operation, a voltage signal is applied to the bit line B, a high voltage represents 1 and a low
voltage represents 0. A signal is then applied to the address line, which will turn on the transistor T,
allowing a charge to be transferred to the capacitor.

For the read operation, when a signal is applied to the address line, the transistor T turns on and the
charge stored on the capacitor is fed out onto the bit line B and to a sense amplifier.

The sense amplifier compares the capacitor voltage to a reference value and determines if the cell
contains a logic 1 or a logic 0.

The read-out from the cell discharges the capacitor, which must be restored to complete the read
operation.

Due to the discharge of the capacitor during read operation, the read operation of DRAM is termed as
destructive read out.

Static RAM (SRAM):

In an SRAM, binary values are stored using traditional flip-flops constructed from transistors. A
static RAM will hold its data as long as power is supplied to it.
A typical SRAM constructed with transistors is shown in the figure 3.4.

Figure 3.4: Static RAM (SRAM) cell


Four transistors (T1, T2, T3, T4) are cross-connected in an arrangement that produces a stable logic state.
In logic state 1, point A1 is high and point A2 is low; in this state, T1 and T4 are off, and T2 and T3 are on.
In logic state 0, point A1 is low and point A2 is high; in this state, T1 and T4 are on, and T2 and T3 are off.
Both states are stable as long as the dc supply voltage is applied.

The address line is used to open or close a switch, which is nothing but another transistor. The address
line controls two transistors (T5 and T6).
When a signal is applied to this line, the two transistors are switched on, allowing a read or write
operation.
For a write operation, the desired bit value is applied to line B, and its complement is applied to line B'.
This forces the four transistors (T1, T2, T3, T4) into the proper state.

For a read operation, the bit value is read from the line B. When a signal is applied to the address line,
the signal of point A1 is available in the bit line B.

SRAM Versus DRAM :

 Both static and dynamic RAMs are volatile; that is, they retain the information only as long
as power is applied.
 A dynamic memory cell is simpler and smaller than a static memory cell. Thus a DRAM is
more dense, i.e., its packing density is higher (more cells per unit area), and DRAM is less
expensive than the corresponding SRAM.
 DRAM requires supporting refresh circuitry. For larger memories, the fixed cost of
the refresh circuitry is more than compensated for by the lower cost of DRAM cells.
 SRAM cells are generally faster than DRAM cells. Therefore, to construct faster
memory modules (like cache memory), SRAM is used.

Internal Organization of Memory Chips

A memory cell is capable of storing 1-bit of information. A number of memory cells are
organized in the form of a matrix to form the memory chip. One such organization is shown in
the Figure 3.5.


Figure 3.5: 16 X 8 Memory Organization

Each row of cells constitutes a memory word, and all cells of a row are connected to a common line
referred to as the word line. An address decoder is used to drive the word line. At a particular instant,
one word line is enabled depending on the address present on the address bus. The cells in each column
are connected by two lines, known as bit lines. These bit lines are connected to the data input line
and data output line through a Sense/Write circuit. During a Read operation, the Sense/Write circuit
senses, or reads, the information stored in the cells selected by a word line and transmits this information
to the output data line. During a Write operation, the Sense/Write circuit receives information and stores it
in the cells of the selected word.

A memory chip consisting of 16 words of 8 bits each is usually referred to as a 16 x 8 organization. The data
input and data output lines of each Sense/Write circuit are connected to a single bidirectional data line in
order to reduce the number of pins required. For 16 words, we need an address bus of size 4. In addition
to the address and data lines, two control lines, R/W and CS, are provided. The R/W line is used to specify
whether a read or a write operation is required. The CS (Chip Select) line is required to select a given chip in a
multi-chip memory system.


Consider a slightly larger memory unit that has 1K (1024) memory cells...

128 x 8 memory chips:

If it is organized as a 128 x 8 memory chip, then it has 128 memory words of size 8 bits. So
the size of the data bus is 8 bits and the size of the address bus is 7 bits (2^7 = 128). The storage
organization of the 128 x 8 memory chip is shown in the figure 3.6.

Figure 3.6: 128 x 8 Memory Chip
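A small helper (an illustration of this note, not part of the original text) makes the relation between the number of locations and the address-bus width explicit:

```python
import math

# For a chip with n addressable locations (n a power of two), the address
# bus needs log2(n) bits: 128 locations -> 7 bits, 1024 -> 10 bits.
def address_bus_width(n_locations):
    return int(math.log2(n_locations))
```

So the 128 x 8 chip above needs a 7-bit address bus, and a 1M-location chip needs 20 bits.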

1024 x 1 memory chips:

If it is organized as a 1024 x 1 memory chip, then it has 1024 memory words of size 1 bit only.
Therefore, the size of the data bus is 1 bit and the size of the address bus is 10 bits (2^10 = 1024).

A particular memory location is identified by the contents of the memory address bus. A decoder
is used to decode the memory address. There are two ways of decoding a memory address,
depending upon the organization of the memory module.

In one case, each memory word is organized in a row. In this case, the whole memory address bus
is used together to decode the address of the specified location. The memory organization of the
1024 x 1 memory chip is shown in the figure 3.7.

Figure 3.7: 1024 x 1 Memory chip


In the second case, several memory words are organized in one row. In this case, the address bus is
divided into two groups.

One group is used to form the row address and the second group is used to form the column address.
Consider the memory organization of the 1024 x 1 memory chip. The required 10-bit address is divided into
two groups of 5 bits each to form the row and column address of the cell array. A row address selects a
row of 32 cells, all of which are accessed in parallel. However, according to the column address, only one
of these cells is connected to the external data line via the input/output multiplexer. The arrangement
of the row and column address decoders is shown in the figure 3.8.

Figure 3.8: Organization of 1k x 1 Memory chip
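The 5-bit/5-bit split described above can be sketched as follows (purely illustrative):

```python
# Splitting the 10-bit address of a 1024 x 1 chip into a 5-bit row address
# and a 5-bit column address, as in the organization described above.
def row_column(addr10):
    row = (addr10 >> 5) & 0x1F   # selects one of 32 rows (32 cells in parallel)
    col = addr10 & 0x1F          # the multiplexer then picks one of the 32 cells
    return row, col
```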

The commercially available memory chips contain a much larger number of cells. For example, a
memory unit of 1MB (megabyte) size, organized as 1M x 8, contains 2^20 x 8 memory cells. It has
2^20 memory locations and each memory location contains 8 bits of information. The size of the
address bus is 20 and the size of the data bus is 8.
The number of pins of a memory chip depends on the data bus and address bus of the memory module.
To reduce the number of pins required for the chip, another scheme for address decoding is used. The
cells are organized in the form of a square array. The address bus is divided into two groups, one for the
column address and the other for the row address. In this case, the high- and low-order 10 bits of the
20-bit address constitute the row and column address of a given cell, respectively. In order to reduce the


number of pins needed for external connections, the row and column addresses are multiplexed on ten
pins. During a Read or a Write operation, the row address is applied first. In response to a signal pulse
on the Row Address Strobe (RAS) input of the chip, this part of the address is loaded into the row
address latch.
All cells of this particular row are selected. Shortly after the row address is latched, the column address is
applied to the address pins. It is loaded into the column address latch with the help of the Column Address
Strobe (CAS) signal, similar to RAS. The information in this latch is decoded and the appropriate
Sense/Write circuit is selected.
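The two-phase RAS/CAS sequence can be sketched as follows (hypothetical function name, assuming the 1M x 1 organization described above):

```python
# Sketch of RAS/CAS address multiplexing: a 20-bit address is presented on
# 10 pins in two steps, row half first (latched on RAS), column half second
# (latched on CAS).
def multiplex_address(addr20):
    row = (addr20 >> 10) & 0x3FF   # applied first, latched by RAS
    col = addr20 & 0x3FF           # applied next, latched by CAS
    return row, col
```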

Figure 3.9: 1 MB(Mega Byte) Memory Chip


For a Write operation, the information at the input lines is transferred to the selected circuits.

Figure 3.10: Organization of a 1M x 1 Memory chip.

The 1MB (megabyte) memory chip with 20 address lines is shown in the figure 3.9. The same memory
chip (1MB) with 10 address lines (where row and column addresses are multiplexed) is shown in Figure
3.10.

Now we discuss the design of a memory subsystem using memory chips. Consider memory chips of
capacity 16K x 8. The requirement is to design a memory subsystem of capacity 64K x 16. Each memory
chip has eight lines for the data bus, but the data bus size of the memory subsystem is 16 bits.
The total requirement is for 64K memory locations, so four rows of chips are required to get the 64K
memory locations; since each row must supply 16 data bits, each row uses two 16K x 8 chips, giving
eight chips in total. For 64K memory locations, the size of the address bus is 16 bits. On the other hand,
for 16K memory locations, the size of the address bus is 14 bits.
Each chip has a control input line called Chip Select (CS). A chip can be enabled to accept data input or to
place data on the output bus by setting its Chip Select input to 1. The address bus for the 64K
memory is 16 bits wide. The high-order two bits of the address are decoded to obtain the four chip
select control signals. The remaining 14 address bits are connected to the address lines of all the chips.
They are used to access a specific location inside each chip of the selected row. The R/W inputs of all
chips are tied together to provide a common control.
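Under the decoding scheme just described, a 16-bit address splits as follows (a sketch, not taken from the figures):

```python
# Address decoding for the 64K x 16 module built from 16K x 8 chips:
# the high-order 2 bits select one of four chip rows (via Chip Select),
# the low-order 14 bits go to the address pins of every chip.
def decode_module_address(addr16):
    chip_row = (addr16 >> 14) & 0b11
    within_chip = addr16 & 0x3FFF
    return chip_row, within_chip
```

For example, address 0x4001 selects chip row 1, location 1 within that row's chips.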


Figure 3.11: 16k x 8 Memory chip

The block diagram of a 16k x 8 memory chip is shown in the figure 3.11. The block diagram of a 64k x 16
memory module constructed with the help of eight 16k x 8 memory chips is shown in the figure 3.12.

Figure 3.12: 64k x 16 Memory chip


Instruction Set & Addressing

In this Module, we have three topics, viz.

1. Various addressing modes

2. Machine Instruction

3. Instruction Format

Various Addressing Modes

We have examined the types of operands and operations that may be specified by machine
instructions. Now we have to see how the address of an operand is specified, and how the bits
of an instruction are organized to define the operand addresses and the operation of that instruction.

Addressing Modes:

The most common addressing techniques are:

o Immediate

o Direct

o Indirect

o Register

o Register Indirect

o Displacement

o Stack

All computer architectures provide more than one of these addressing modes. The question arises
as to how the control unit can determine which addressing mode is being used in a particular
instruction. Several approaches are used. Often, different opcodes will use different addressing


modes. Also, one or more bits in the instruction format can be used as a mode field. The value of
the mode field determines which addressing mode is to be used.

What is the interpretation of the effective address? In a system without virtual memory, the effective
address will be either a main memory address or a register. In a virtual memory system, the
effective address is a virtual address or a register. The actual mapping to a physical address is a
function of the paging mechanism and is invisible to the programmer.

To explain the addressing modes, we use the following notation:

A   = contents of an address field in the instruction that refers to a memory location
R   = contents of an address field in the instruction that refers to a register
EA  = actual (effective) address of the location containing the referenced operand
(X) = contents of location X

Immediate Addressing:

The simplest form of addressing is immediate addressing, in which the operand is actually
present in the instruction:

OPERAND = A

This mode can be used to define and use constants or to set initial values of variables. The
advantage of immediate addressing is that no memory reference other than the instruction fetch
is required to obtain the operand. The disadvantage is that the size of the number is restricted to
the size of the address field, which, in most instruction sets, is small compared with the word
length.

Figure 4.1: Immediate Addressing Mode

The instruction format for Immediate Addressing Mode is shown in the Figure 4.1.

Direct Addressing:

A very simple form of addressing is direct addressing, in which the address field contains the
effective address of the operand:


EA = A

It requires only one memory reference and no special calculation.

Figure 4.2: Direct Addressing Mode

The fetching of data from the memory location in case of direct addressing mode is shown in the
Figure 4.2. Here, 'A' indicates the memory address field for the operand.

Indirect Addressing:

With direct addressing, the length of the address field is usually less than the word length, thus
limiting the address range. One solution is to have the address field refer to the address of a word
in memory, which in turn contains a full-length address of the operand. This is known as indirect
addressing:

EA = (A)

Figure 4.3: Indirect Addressing Mode


The exact memory location of the operand in the case of indirect addressing mode is shown in the
Figure 4.3. Here 'A' indicates the memory address field of the required operand.

Register Addressing:

Register addressing is similar to direct addressing. The only difference is that the address field
refers to a register rather than a main memory address:

EA = R

The advantages of register addressing are that only a small address field is needed in the
instruction and no memory reference is required. The disadvantage of register addressing is that
the address space is very limited.

Figure 4.4: Register Addressing Mode.

The exact register location of the operand in the case of Register Addressing Mode is shown in the
Figure 4.4. Here, 'R' indicates a register where the operand is present.

Register Indirect Addressing:

Register indirect addressing is similar to indirect addressing, except that the address field refers
to a register instead of a memory location.
It requires only one memory reference and no special calculation.

EA = (R)

Register indirect addressing uses one less memory reference than indirect addressing, because
the first piece of information, which is nothing but the memory address of the operand, is already
available in a register. From that memory location, we then get the data or operand. In general,
register access is much faster than memory access.
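The difference in memory references can be seen in a toy sketch (addresses and values invented for illustration):

```python
# Indirect (EA = (A)) vs register-indirect (EA = (R)) addressing, modeled
# with a dict as memory and a dict as the register file.
memory = {200: 300, 300: 77}
registers = {"R1": 300}

ea_indirect = memory[200]               # extra memory access just to fetch the EA
ea_register_indirect = registers["R1"]  # EA comes from a register instead

operand = memory[ea_register_indirect]  # one memory access to fetch the operand
```

Both modes arrive at the same effective address (300) and operand (77), but register-indirect saves the first memory access.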


Displacement Addressing:

A very powerful mode of addressing combines the capabilities of direct addressing and register
indirect addressing, which is broadly categorized as displacement addressing:

EA = A + (R)

Displacement addressing requires that the instruction have two address fields, at least one of
which is explicit. The value contained in one address field (value = A) is used directly. The other
address field, or an implicit reference based on opcode, refers to a register whose contents are
added to A to produce the effective address. The general format of Displacement Addressing is
shown in the Figure 4.6.

Three of the most common uses of displacement addressing are:

• Relative addressing

• Base-register addressing

• Indexing


Figure 4.6: Displacement Addressing

Relative Addressing:

For relative addressing, the implicitly referenced register is the program counter (PC). That is,
the current instruction address is added to the address field to produce the EA. Thus, the effective
address is a displacement relative to the address of the instruction.
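For example (numbers invented), a branch with a negative displacement targets an earlier instruction:

```python
# Relative addressing: EA = (PC) + A, where A is a signed displacement
# held in the instruction's address field.
pc = 2000            # address of the current instruction
displacement = -8    # signed address field
ea = pc + displacement
```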

Base-Register Addressing:

The reference register contains a memory address, and the address field contains a displacement
from that address. The register reference may be explicit or implicit.

In some implementations, a single segment/base register is employed and is used implicitly. In
others, the programmer may choose a register to hold the base address of a segment, and the
instruction must reference it explicitly.

Indexing:

The address field references a main memory address, and the reference register contains a
positive displacement from that address. In this case also the register reference is sometimes
explicit and sometimes implicit.

Index registers are generally used for iterative tasks, so it is typical that the index register needs
to be incremented or decremented after each reference to it. Because this is such a common
operation, some systems will automatically do this as part of the same instruction cycle.

This is known as auto-indexing. We may get two types of auto-indexing: one is auto-incrementing
and the other is auto-decrementing.


If certain registers are devoted exclusively to indexing, then auto-indexing can be invoked
implicitly and automatically. If general-purpose registers are used, the auto-index operation may
need to be signaled by a bit in the instruction.

Auto-indexing using increment can be depicted as follows:


EA = A + (R)
R = (R) + 1

Auto-indexing using decrement can be depicted as follows:

EA = A + (R)
R = (R) - 1
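Both variants can be expressed as a small sketch of the equations above (illustrative only):

```python
# Auto-indexing: compute EA = A + (R), then step the register.
def auto_increment(a, r):
    ea = a + r
    return ea, r + 1     # R = (R) + 1

def auto_decrement(a, r):
    ea = a + r
    return ea, r - 1     # R = (R) - 1
```

With A = 100 and R = 4, both variants produce EA = 104; the register then becomes 5 or 3, respectively.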

In some machines, both indirect addressing and indexing are provided, and it is possible to
employ both in the same instruction. There are two possibilities: The indexing is performed
either before or after the indirection.

If indexing is performed after the indirection, it is termed postindexing

EA = (A) + (R)

First, the contents of the address field are used to access a memory location containing an
address. This address is then indexed by the register value.

With preindexing, the indexing is performed before the indirection:

EA = (A + (R) )

An address is calculated as with simple indexing. In this case, however, the calculated address
contains not the operand, but the address of the operand.
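A toy memory makes the difference between the two orders concrete (all numbers invented):

```python
# Post-indexing: EA = (A) + (R); pre-indexing: EA = (A + (R)).
memory = {100: 500, 105: 900}
A, R = 100, 5

ea_post = memory[A] + R    # indirection first, then index: 500 + 5
ea_pre = memory[A + R]     # index first, then indirection: contents of 105
```

The same A and R yield EA = 505 under post-indexing but EA = 900 under pre-indexing.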

Stack Addressing:

A stack is a linear array or list of locations. It is sometimes referred to as a pushdown list or last-
in-first-out queue. A stack is a reserved block of locations. Items are appended to the top of the
stack so that, at any given time, the block is partially filled. Associated with the stack is a pointer
whose value is the address of the top of the stack. The stack pointer is maintained in a register.
Thus, references to stack locations in memory are in fact register indirect addresses.

The stack mode of addressing is a form of implied addressing. The machine instructions need not
include a memory reference but implicitly operate on the top of the stack.
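A minimal sketch of implied stack operands:

```python
# Stack addressing: push/pop need no explicit address field; the operand
# location is implied by the stack pointer (here, the end of a Python list).
stack = []

def push(value):
    stack.append(value)

def pop():
    return stack.pop()   # operand implicitly taken from the top of the stack
```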


Machine Instruction

The operation of a CPU is determined by the instructions it executes, referred to as machine
instructions or computer instructions. The collection of different instructions is referred to as the
instruction set of the CPU.

Each instruction must contain the information required by the CPU for execution. The elements
of an instruction are as follows:

Operation Code:
Specifies the operation to be performed (e.g., add, move etc.). The operation is specified by a
binary code, known as the operation code or opcode.

Source operand reference:


The operation may involve one or more source operands; that is, operands that are inputs for the
operation.

Result operand reference:


The operation may produce a result.

Next instruction reference:


This tells the CPU where to fetch the next instruction after the execution of this instruction is
complete.

The next instruction to be fetched is located in main memory. In the case of a virtual memory
system, however, it may be either in main memory or in secondary memory (disk). In most cases,
the next instruction to be fetched immediately follows the current instruction; in those cases, there
is no explicit reference to the next instruction. When an explicit reference is needed, then the main
memory or virtual memory address must be given.

Source and result operands can be in one of the three areas:

 main or virtual memory,


 CPU register or
 I/O device.

The steps involved in instruction execution are shown in the Figure 4.7:


Figure 4.7: Steps involved in instruction execution

Instruction Representation

Within the computer, each instruction is represented by a sequence of bits. The instruction is
divided into fields, corresponding to the constituent elements of the instruction. The instruction
format is highly machine specific and mainly depends on the machine architecture. A simple
example of an instruction format is shown in the Figure 4.8. It is assumed to be a 16-bit CPU.
4 bits are used to provide the operation code, so we may have up to 16 (2^4 = 16) different
instructions. Each instruction has two operands, and 6 bits are used to specify each operand.
It is possible to reference 64 (2^6 = 64) different operands with each operand field.
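Packing and unpacking this hypothetical 16-bit format (4-bit opcode, two 6-bit operand fields) can be sketched as:

```python
# Encode/decode for the example 16-bit instruction format above:
# | opcode (4 bits) | operand 1 (6 bits) | operand 2 (6 bits) |
def encode(opcode, op1, op2):
    return (opcode << 12) | (op1 << 6) | op2

def decode(word):
    return (word >> 12) & 0xF, (word >> 6) & 0x3F, word & 0x3F
```

Decoding an encoded word recovers the original three fields, confirming the bit layout.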

It is difficult to deal with binary representation of machine instructions. Thus, it has become
common practice to use a symbolic representation of machine instructions.

Opcodes are represented by abbreviations, called mnemonics, that indicate the operations.
Common examples include:

ADD    Add
SUB    Subtract
MULT   Multiply
DIV    Division
LOAD   Load data from memory to CPU
STORE  Store data to memory from CPU

Figure 4.8: A simple instruction format.


Operands are also represented symbolically. For example, the instruction

MULT R, X    ; R <- R * X

may mean: multiply the value contained in the data location X by the contents of register R and
put the result in register R.

In this example, X refers to the address of a location in memory and R refers to a particular
register.

Thus, it is possible to write a machine language program in symbolic form. Each symbolic
opcode has a fixed binary representation, and the programmer specifies the location of each
symbolic operand.

Instruction Types

The instruction set of a CPU can be categorized as follows:

Data Processing:

Arithmetic and logic instructions. Arithmetic instructions provide computational capabilities for
processing numeric data. Logic (Boolean) instructions operate on the bits of a word as bits rather
than as numbers; logic instructions thus provide capabilities for processing any other type of
data. These operations are performed primarily on data in CPU registers.

Data Storage:

Memory instructions are used for moving data between memory and CPU registers.

Data Movement:

I/O instructions are needed to transfer program and data into memory from storage device or
input device and the results of computation back to the user.

Control:

Test and branch instructions


Test instructions are used to test the value of a data word or the status of a computation. Branch
instructions are then used to branch to a different set of instructions depending on the decision
made.


Number of Addresses

What is the maximum number of addresses one might need in an instruction? Most arithmetic
and logic operations are either unary (one operand) or binary (two operands). Thus we need a
maximum of two addresses to reference source operands. The result of an operation must be
stored, suggesting a third address. Finally, after completion of an instruction, the next instruction
must be fetched, and its address is needed.

This reasoning suggests that an instruction may need to contain four address references: two
operands, one result, and the address of the next instruction. In practice, four-address instructions
are rare. Most instructions have one, two, or three operand addresses, with the address of the
next instruction being implicit (obtained from the program counter).

Instruction Set Design

One of the most interesting, and most analyzed, aspects of computer design is instruction set
design. The instruction set defines the functions performed by the CPU. The instruction set is the
programmer's means of controlling the CPU. Thus programmer requirements must be considered
in designing the instruction set.

Most important and fundamental design issues:

Operation repertoire : How many and which operations to provide, and how
complex operations should be.
Data Types : The various type of data upon which operations are
performed.
Instruction format : Instruction length (in bits), number of addresses, size of
various fields and so on.
Registers : Number of CPU registers that can be referenced by
instructions and their use.
Addressing : The mode or modes by which the address of an operand is
specified.

Types of Operands

Machine instructions operate on data. Data can be categorised as follows :

Addresses: These basically indicate the address of a memory location. Addresses are nothing but
unsigned integers, but they are treated in a special way to indicate the address of a memory location.
Address arithmetic is somewhat different from normal arithmetic and is related to the machine
architecture.

Numbers: All machine languages include numeric data types. Numeric data are classified into
two broad categories: integer or fixed point and floating point.


Characters: A common form of data is text or character strings. Since a computer works with bits,
characters are represented by a sequence of bits. The most commonly used coding scheme is the
ASCII (American Standard Code for Information Interchange) code.

Logical Data: Normally, each word or other addressable unit (byte, halfword, and so on) is
treated as a single unit of data. It is sometimes useful to consider an n-bit unit as consisting of n
1-bit items of data, each item having the value 0 or 1. When data are viewed this way, they are
considered to be logical data. Generally, 1 is treated as true and 0 is treated as false.

Types of Operations

The number of different opcodes and their types vary widely from machine to machine.
However, some general types of operations are found in most machine architectures. These
operations can be categorized as follows:

 Data Transfer

 Arithmetic

 Logical

 Conversion

 Input Output [ I/O ]

 System Control

 Transfer Control

Data Transfer:

The most fundamental type of machine instruction is the data transfer instruction. The data
transfer instruction must specify several things. First, the location of the source and destination
operands must be specified. Each location could be memory, a register, or the top of the stack.
Second, the length of data to be transferred must be indicated. Third, as with all instructions with
operands, the mode of addressing for each operand must be specified.

The CPU has to perform several tasks to accomplish a data transfer operation. If both source and
destination are registers, then the CPU simply causes data to be transferred from one register to
another; this is an operation internal to the CPU.

If one or both operands are in memory, then the CPU must perform some or all of the following
actions:


a) Calculate the memory address, based on the addressing mode.


b) If the address refers to virtual memory, translate from virtual to actual memory
address.
c) Determine whether the addressed item is in cache.
d) If not, issue a command to the memory module.

Commonly used data transfer operation:

Operation Name Description


Move (Transfer) Transfer word or block from source to destination
Store Transfer word from processor to memory
Load (fetch) Transfer word from memory to processor
Exchange Swap contents of source and destination
Clear (reset) Transfer word of 0s to destination
Set Transfer word of 1s to destination
Push Transfer word from source to top of stack
Pop Transfer word from top of stack to destination

Arithmetic:

Most machines provide the basic arithmetic operations: add, subtract, multiply, divide, etc.
These are invariably provided for signed integer (fixed-point) numbers. They are also available
for floating-point numbers.

The execution of an arithmetic operation may involve data transfer operations to provide the
operands to the ALU inputs and to deliver the result of the ALU operation.

Commonly used arithmetic operations:

Operation Name Description


Add Compute sum of two operands
Subtract Compute difference of two operands
Multiply Compute product of two operands
Divide Compute quotient of two operands
Absolute Replace operand by its absolute value
Negate Change sign of operand
Increment Add 1 to operand
Decrement Subtract 1 from operand


Logical:

Most machines also provide a variety of operations for manipulating individual bits of a word or
other addressable units.

Most commonly available logical operations are:

Operation Name Description


AND Performs the logical operation AND bitwise
OR Performs the logical operation OR bitwise
NOT Performs the logical operation NOT bitwise
Exclusive OR Performs the logical operation Exclusive-OR bitwise
Test Test specified condition; set flag(s) based on outcome
Compare Make logical or arithmetic comparison; set flag(s) based on
outcome
Set Control Variables Class of instructions to set controls for protection purposes,
interrupt handling, timer control etc.
Shift Left (right) shift operand, introducing constant at end
Rotate Left (right) shift operation, with wraparound end
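As an illustration of the difference between the Shift and Rotate operations above (word width chosen arbitrarily as 8 bits):

```python
# A logical shift drops bits off the end; a rotate wraps them around.
def shift_left(x, n, width=8):
    return (x << n) & ((1 << width) - 1)   # bits shifted out are lost

def rotate_left(x, n, width=8):
    n %= width
    mask = (1 << width) - 1
    return ((x << n) | (x >> (width - n))) & mask   # wraparound
```

Shifting 0b10000001 left by one gives 0b00000010, while rotating it gives 0b00000011: the high bit wraps around to the low end.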

Conversion:

Conversion instructions are those that change the format or operate on the format of data. An
example is converting from decimal to binary.

Input/Output :

Input/Output instructions are used to transfer data between input/output devices and
memory/CPU register.

Commonly available I/O operations are:

Operation Name Description


Input (Read) Transfer data from specified I/O port or device to
destination
(e.g., main memory or processor register)
Output (Write) Transfer data from specified source to I/O port or device.
Start I/O Transfer instructions to I/O processor to initiate I/O
operation.
Test I/O Transfer status information from I/O system to specified
destination


System Control:

System control instructions are those used for system settings, and they can be used only in a
privileged state. Typically, these instructions are reserved for the use of operating systems. For
example, a system control instruction may read or alter the content of a control register. Another
instruction may read or modify a storage protection key.

Transfer of Control:

In most of the cases, the next instruction to be performed is the one that immediately follows the
current instruction in memory. Therefore, program counter helps us to get the next instruction.
But sometimes it is required to change the sequence of instruction execution, and for that the
instruction set should provide instructions to accomplish these tasks. For these instructions, the
operation performed by the CPU is to update the program counter to contain the address of some
instruction in memory. The most common transfer-of-control operations found in instruction sets
are: branch, skip and procedure call.

Branch Instruction

A branch instruction, also called a jump instruction, has one of its operands as the address of the
next instruction to be executed. Basically there are two types of branch instructions: Conditional
Branch instruction and unconditional branch instruction. In case of unconditional branch
instruction, the branch is made by updating the program counter to address specified in operand.
In case of conditional branch instruction, the branch is made only if a certain condition is met.
Otherwise, the next instruction in sequence is executed.
There are two common ways of generating the condition to be tested in a conditional branch
instruction.

First, most machines provide a 1-bit or multiple-bit condition code that is set as the result of some
operations. As an example, an arithmetic operation could set a 2-bit condition code with one of
the following four values: zero, positive, negative and overflow. On such a machine, there could
be four different conditional branch instructions:

BRP X ; Branch to location X if result is positive
BRN X ; Branch to location X if result is negative
BRZ X ; Branch to location X if result is zero
BRO X ; Branch to location X if overflow occurs

In all of these cases, the result referred to is the result of the most recent operation that set the
condition code.

Another approach that can be used with three address instruction format is to perform a
comparison and specify a branch in the same instruction.

For example,


BRE R1, R2, X ; Branch to X if contents of R1 = Contents of R2.
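The condition-code mechanism can be sketched in a few lines of Python. The mnemonics (BRZ, BRN, BRP) follow the text; the data structures are invented for illustration.

```python
# Hypothetical sketch of condition-code branching.

def set_cc(result):
    """Return a condition code based on the most recent arithmetic result."""
    if result == 0:
        return "zero"
    return "negative" if result < 0 else "positive"

def branch_taken(opcode, cc):
    """Decide whether a conditional branch loads the PC with its target."""
    return {"BRZ": cc == "zero",
            "BRN": cc == "negative",
            "BRP": cc == "positive"}[opcode]

cc = set_cc(7 - 7)               # a SUB instruction sets the condition code
print(branch_taken("BRZ", cc))   # True  -> PC loaded with target X
print(branch_taken("BRN", cc))   # False -> fall through to next instruction
```

Note that the branch tests the code set by the most recent operation, not the operands themselves, which is exactly the "result referred to" in the paragraph above.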

Skip Instruction

Another common form of transfer-of-control instruction is the skip instruction. Generally, the
skip implies that one instruction is to be skipped; thus the implied address equals the address of
the next instruction plus one instruction length. A typical example is the increment-and-skip-if-zero
(ISZ) instruction. For example,

ISZ R1

This instruction will increment the value of the register R1. If the result of the increment is zero,
then it will skip the next instruction.
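The ISZ semantics can be sketched as follows; the 8-bit register width and one-word instruction length are assumptions made for illustration.

```python
# Sketch of increment-and-skip-if-zero (ISZ) semantics.

WORD = 0xFF  # assume an 8-bit register that wraps around

def isz(registers, r, pc):
    """Increment register r; if the result wraps to zero, skip one instruction."""
    registers[r] = (registers[r] + 1) & WORD
    return pc + 2 if registers[r] == 0 else pc + 1  # skip advances PC one extra

regs = {"R1": 0xFF}          # incrementing wraps to zero
print(isz(regs, "R1", 100))  # 102: the next instruction is skipped
regs = {"R1": 5}
print(isz(regs, "R1", 100))  # 101: normal sequential execution
```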

Procedure Call Instruction

A procedure is a self-contained computer program that is incorporated into a larger program. At
any point in the program the procedure may be invoked, or called. The processor is instructed to
go and execute the entire procedure and then return to the point from which the call took place.
The procedure mechanism involves two basic instructions: a call instruction that branches from
the present location to the procedure, and a return instruction that returns from the procedure to
the place from which it was called. Both of these are forms of branching instructions.
Some important points regarding procedure call:

o A procedure can be called from more than one location.


o A procedure call can appear in a procedure. This allows the nesting of procedures
to an arbitrary depth.
o Each procedure call is matched by a return in the called program.

Since we can call a procedure from a variety of points, the CPU must somehow save the return
address so that the return can take place appropriately. There are three common places for storing
the return address:

• Register
• Start of procedure
• Top of stack

Consider a machine language instruction CALL X, which stands for call procedure at location X.
If the register approach is used, CALL X causes the following actions:


RN ← PC + IL
PC ← X

where RN is a register that is always used for this purpose, PC is the program counter and IL is
the instruction length. The called procedure can now save the contents of RN to be used for the
later return.

A second possibility is to store the return address at the start of the procedure. In this case,
CALL X causes

X ← PC + IL
PC ← X + 1

Both of these approaches have been used. The only limitation of these approaches is that they
prevent the use of reentrant procedures. A reentrant procedure is one in which it is possible to
have several calls open to it at the same time.

A more general approach is to use stack. When the CPU executes a call, it places the return
address on the stack. When it executes a return, it uses the address on the stack.

It may happen that the called procedure has to use the processor registers. This would
overwrite the contents of the registers and the calling environment would lose the information. So,
it is necessary to preserve the contents of the processor registers along with the return address.
The stack is used to store the contents of the processor registers. On return from the procedure
call, the contents of the stack are popped into the appropriate registers.

In addition to providing a return address, it is also often necessary to pass parameters with a
procedure call. The most general approach to parameter passing is the stack. When the processor
executes a call, it not only stacks the return address, it stacks parameters to be passed to the
called procedure. The called procedure can access the parameters from the stack. Upon return,
return parameters can also be placed on the stack. The entire set of parameters, including the
return address, that is stored for a procedure invocation is referred to as the stack frame.
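The stack-based linkage described above can be sketched in Python. The instruction length IL, the addresses and the parameter values are invented for illustration; a real machine would also stack saved registers in the frame.

```python
# Minimal sketch of stack-based procedure linkage.

IL = 1  # assume one-word instructions

def call(stack, pc, target, params):
    """CALL X: push parameters and the return address, then branch to X."""
    for p in params:
        stack.append(p)
    stack.append(pc + IL)   # return address = address after the CALL
    return target           # new PC

def ret(stack):
    """RET: pop the return address saved by the matching CALL."""
    return stack.pop()

stack = []
pc = call(stack, pc=200, target=500, params=[10, 20])  # invoke procedure at 500
print(pc)          # 500
print(stack)       # [10, 20, 201] -- the stack frame
print(ret(stack))  # 201: execution resumes after the CALL
```

Because every call pushes a fresh frame, nested and reentrant calls work naturally, which is exactly the advantage of the stack over the register and start-of-procedure schemes.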

Most commonly used transfer of control operation:

Operation Name Description


Jump (branch) Unconditional transfer, load PC with specific address
Jump conditional Test specific condition; either load PC with specific address or
do nothing, based on condition
Jump to subroutine Place current program control information in known location;
jump to specific address


Return Replace contents of PC and other register from known location


Skip Increment PC to skip next instruction
Skip Conditional Test specified condition; either skip or do nothing based on
condition
Halt Stop program execution

Instruction Format

Instruction Format:

An instruction format defines the layout of the bits of an instruction, in terms of its constituent
parts. An instruction format must include an opcode and, implicitly or explicitly, zero or more
operands. Each explicit operand is referenced using one of the addressing modes available
for that machine. The format must, implicitly or explicitly, indicate the addressing mode of each
operand. For most instruction sets, more than one instruction format is used. Four common
instruction formats are shown in Figure 4.9.

Figure 4.9: Four common Instruction formats

Instruction Length:


On some machines, all instructions have the same length; on others there may be many different
lengths. Instructions may be shorter than, the same length as, or longer than the word length.
Having all instructions be the same length is simpler and makes decoding easier, but it often
wastes space, since all instructions then have to be as long as the longest one. Possible
relationships between instruction length and word length are shown in Figure 4.10.

Figure 4.10: Some Possible relationship between instructions and word length

Generally there is a correlation between memory transfer length and instruction length: either
the instruction length should be equal to the memory transfer length, or one should be a multiple
of the other. Also, in most cases there is a correlation between memory transfer length and the
word length of the machine.

Allocation of Bits:

For a given instruction length, there is clearly a trade-off between the number of opcodes and
the power of the addressing capabilities. More opcodes obviously mean more bits in the opcode
field. For an instruction format of a given length, this reduces the number of bits available for
addressing.
The following interrelated factors go into determining the use of the addressing bits:

Number of Addressing modes:

Sometimes an addressing mode can be indicated implicitly. In other cases, the addressing mode
must be explicit, and one or more bits will be needed.

Number of Operands:


Typical instructions on today's machines provide for two operands. Each operand address in the
instruction might require its own mode indicator, or the use of a mode indicator could be limited
to just one of the address fields.

Register versus memory:

A machine must have registers so that data can be brought into the CPU for processing. With a
single user-visible register (usually called the accumulator), one operand address is implicit and
consumes no instruction bits. Even with multiple registers, only a few bits are needed to specify
the register. The more that registers can be used for operand references, the fewer bits are
needed.

Number of register sets:

A number of machines have one set of general purpose registers, with typically 8 or 16 registers
in the set. These registers can be used to store data and can be used to store addresses for
displacement addressing. The trend recently has been away from one bank of general purpose
registers and toward a collection of two or more specialized sets (such as data and displacement).

Address range:

For addresses that reference memory, the range of addresses that can be referenced is related to
the number of address bits. With displacement addressing, the range is opened up to the length of
the address register.

Address granularity:

In a system with 16- or 32-bit words, an address can reference a word or a byte at the designer's
choice. Byte addressing is convenient for character manipulation but requires, for a fixed size
memory, more address bits.
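The granularity trade-off can be made concrete with a quick calculation. The 64 KB memory size is an assumed example, not taken from the text.

```python
# Address bits needed for a fixed-size memory under different addressable units.

import math

def address_bits(memory_bytes, unit_bytes):
    """Bits needed to address memory_bytes of storage in units of unit_bytes."""
    return int(math.log2(memory_bytes // unit_bytes))

MEM = 64 * 1024                 # an assumed 64 KB memory
print(address_bits(MEM, 1))     # 16 bits with byte addressing
print(address_bits(MEM, 4))     # 14 bits with 32-bit word addressing
```

For the same memory, byte addressing costs two extra address bits over 32-bit word addressing, which is the cost referred to above.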

Variable-Length Instructions:

Instead of using a fixed-length instruction format, a designer may choose to provide a variety
of instruction formats of different lengths. This tactic makes it easy to provide a large repertoire
of opcodes, with different opcode lengths. Addressing can be more flexible, with various
combinations of register and memory references plus addressing modes. With variable-length
instructions, many variations can be provided efficiently and compactly. The principal price to
pay for variable-length instructions is an increase in the complexity of the CPU.

Number of addresses :

The processor architecture is described in terms of the number of addresses contained in each
instruction. Most arithmetic and logic instructions require multiple operands. All
arithmetic and logic operations are either unary
(one source operand, e.g. NOT) or binary (two source operands, e.g. ADD).


Thus, we need a maximum of two addresses to reference source operands. The result of an
operation must be stored, suggesting a third reference.

Three-address instruction formats are not common because they require a relatively long
instruction format to hold the three address references.

With two address instructions, and for binary operations, one address must do double duty as
both an operand and a result.

In a one-address instruction format, a second address must be implicit for a binary operation. For
the implicit reference, a processor register is used; it is termed the accumulator (AC). The
accumulator contains one of the operands and is used to store the result.

Consider a simple arithmetic expression to evaluate:


Y= (A + B) / (C * D)

The evaluation of this expression in three address instruction format, two address instruction
format and one address instruction format is shown in the Figure 4.11, Figure 4.12 and Figure
4.13 respectively.

Figure 4.11: Three address instructions


Figure 4.12: Two address instructions

Figure 4.13: One address instructions
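To make the one-address case concrete, here is a small Python sketch of an accumulator machine evaluating Y = (A + B) / (C * D), in the spirit of Figure 4.13. The mnemonics (LOAD, ADD, MPY, DIV, STOR) and the temporary location T are conventional choices, not taken verbatim from the figure.

```python
# Sketch of a one-address (accumulator) machine; AC is the implicit operand.

def run(program, memory):
    """Execute a one-address program over a dict-based memory."""
    ac = 0
    for op, addr in program:
        if op == "LOAD":
            ac = memory[addr]
        elif op == "ADD":
            ac += memory[addr]
        elif op == "MPY":
            ac *= memory[addr]
        elif op == "DIV":
            ac //= memory[addr]
        elif op == "STOR":
            memory[addr] = ac
    return memory

mem = {"A": 6, "B": 2, "C": 2, "D": 2}
program = [("LOAD", "C"), ("MPY", "D"), ("STOR", "T"),   # T  <- C * D
           ("LOAD", "A"), ("ADD", "B"), ("DIV", "T"),    # AC <- (A + B) / T
           ("STOR", "Y")]
print(run(program, mem)["Y"])   # 2, since (6 + 2) / (2 * 2) = 2
```

Note that the one-address form needs a temporary memory location T, whereas a three-address machine could hold the intermediate product in a register named directly by the instruction.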

Problems

Q 1: What are the typical elements of a machine instruction?

Q 2: What are the different categories of instructions?

Q 3: Why are transfer of control instructions needed ?

Q 4: If an instruction contains four addresses, what might be the purpose of each address ?


Q 5: List and explain the important design issues for instruction set design.

Q 6: What are the different types of operands that may be present in an instruction?

Q 7: Briefly explain the following addressing modes: immediate addressing, direct addressing,
indirect addressing, displacement addressing and relative addressing.

Q 8: What is indexed addressing and what is the advantage of auto indexing ?

Q 9: What are the advantages and disadvantages of using a variable-length instruction format ?

Q 10: An address field of an instruction contains decimal value 250. Where is the corresponding
operand located for-

a. Immediate addressing
b. Direct addressing
c. Indirect addressing
d. Register addressing
e. Register indirect addressing


Multi-Processor / Parallel Processing


Parallel Processing:

Originally, the computer has been viewed as a sequential machine. Most computer
programming languages require the programmer to specify algorithms as a sequence of
instructions. A processor executes programs by executing machine instructions in sequence,
one at a time. Each instruction is executed in a sequence of operations (fetch instruction,
fetch operands, perform operation, store result).

It is observed that, at the micro-operation level, multiple control signals are generated at the
same time. Instruction pipelining, at least to the extent of overlapping fetch and execute
operations, has been around for a long time. By looking into these phenomena, researchers
have examined whether some operations can be performed in parallel.

As computer technology has evolved, and as the cost of computer hardware has dropped,
computer designers have sought more and more opportunities for parallelism, usually to
enhance performance and, in some cases, to increase availability.

The taxonomy first introduced by Flynn is still the most common way of categorizing systems
with parallel processing capability. Flynn proposed the following categories of computer
system:

• Single Instruction, Single Data (SISD) system: A single processor executes a single
instruction stream to operate on data stored in a single memory. Uniprocessors fall
into this category.

• Single Instruction, Multiple Data (SIMD) system: A single machine instruction controls
the simultaneous execution of a number of processing elements on a lockstep basis.
Each processing element has an associated data memory, so that each instruction is
executed on a different set of data by the different processors. Vector and array
processors fall into this category.

• Multiple Instruction, Single Data (MISD) system: A sequence of data is transmitted to a
set of processors, each of which executes a different instruction sequence. This
structure has never been implemented.


• Multiple Instruction, Multiple Data (MIMD) system: A set of processors simultaneously
execute different instruction sequences on different data sets. SMPs, clusters, and
NUMA systems fit into this category.

With the MIMD organization, the processors are general purpose; each is able to process all of
the instructions necessary to perform the appropriate data transformation.

Further MIMD can be subdivided into two main categories:

• Symmetric multiprocessor (SMP): In an SMP, multiple processors share a single
memory or a pool of memory by means of a shared bus or other interconnection
mechanism. A distinguishing feature is that the memory access time to any region of
memory is approximately the same for each processor.
• Nonuniform memory access (NUMA): The memory access time to different regions of
memory may differ for a NUMA processor.

The design issues relating to SMPs and NUMA are complex, involving issues relating to physical
organization, interconnection structures, interprocessor communication, operating system
design, and application software techniques.

Symmetric Multiprocessors:

A symmetric multiprocessor (SMP) can be defined as a standalone computer system with the
following characteristic:

1. There are two or more similar processors of comparable capability.
2. These processors share the same main memory and I/O facilities and are interconnected
by a bus or other internal connection scheme.
3. All processors share access to I/O devices, either through the same channels or through
different channels that provide paths to the same device.
4. All processors can perform the same functions.
5. The system is controlled by an integrated operating system that provides interaction
between processors and their programs at the job, task, file and data element levels.

The operating system of an SMP schedules processes or threads across all of the processors. SMP
has potential advantages over uniprocessor architecture:

• Performance: A system with multiple processors will perform better than one with a
single processor of the same type if the work can be organized so that some portions
of it can be done in parallel.


• Availability: Since all the processors can perform the same functions in a symmetric
multiprocessor, the failure of a single processor does not stop the machine. Instead, the
system can continue to function at a reduced performance level.
• Incremental growth: A user can enhance the performance of a system by adding an
additional processor.
• Scaling: Vendors can offer a range of products with different price and performance
characteristics based on the number of processors configured in the system.

Organization:

The organization of a multiprocessor system is shown in Figure 10.1

Figure 10.1: Block diagram of tightly coupled multiprocessors

• There are two or more processors. Each processor is self-sufficient, including a control
unit, ALU, registers and cache.
• Each processor has access to a shared main memory and the I/O devices through an
interconnection network.
• The processor can communicate with each other through memory (messages and status
information left in common data areas).
• It may also be possible for processors to exchange signal directly.
• The memory is often organized so that multiple simultaneous accesses to separate
blocks of memory are possible.
• In some configurations each processor may also have its own private main memory and
I/O channels in addition to the shared resources.


The organization of multiprocessor system can be classified as follows:

• Time shared or common bus


• Multiport memory
• Central control unit.

Time shared Bus:

Time shared bus is the simplest mechanism for constructing a multiprocessor system. The bus
consists of control, address and data lines. The block diagram is shown in Figure 10.2.

Figure 10.2: Time shared bus

The following features are provided in time-shared bus organization:

• Addressing: It must be possible to distinguish modules on the bus to determine the
source and destination of data.
• Arbitration: Any module can temporarily function as "master". A mechanism is
provided to arbitrate competing requests for bus control, using some sort of priority
scheme.
• Time sharing: When one module is controlling the bus, other modules are locked out
and must, if necessary, suspend operation until bus access is achieved.

The bus organization has several advantages compared with other approaches:


• Simplicity: This is the simplest approach to multiprocessor organization. The physical


interface and the addressing, arbitration and time sharing logic of each processor
remain the same as in a single processor system.
• Flexibility: It is generally easy to expand the system by attaching more processors to the
bus.
• Reliability: The bus is essentially a passive medium and the failure of any attached
device should not cause failure of the whole system.

The main drawback of the bus organization is performance: all memory references pass through
the common bus, so the speed of the system is limited by the bus cycle time.

To improve performance, each processor can be equipped with local cache memory.

The use of caches leads to a new problem, known as the cache coherence problem. Each
local cache contains an image of a portion of main memory. If a word is altered in one cache, it
may invalidate the corresponding word in another cache. To prevent this, the other processors
must update or invalidate the copy in their local caches.
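The write-invalidate idea behind keeping caches coherent can be sketched as follows. This toy model (write-through policy, dict-based caches, no bus timing) is an invented illustration, not the protocol of any particular machine.

```python
# Toy write-invalidate coherence sketch for a bus-based multiprocessor.

class System:
    def __init__(self, n_cpus):
        self.memory = {}
        self.caches = [dict() for _ in range(n_cpus)]  # one local cache per CPU

    def read(self, cpu, addr):
        cache = self.caches[cpu]
        if addr not in cache:                 # miss: fetch from main memory
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, cpu, addr, value):
        for i, cache in enumerate(self.caches):
            if i != cpu:
                cache.pop(addr, None)         # invalidate other cached copies
        self.caches[cpu][addr] = value
        self.memory[addr] = value             # write-through for simplicity

sys_ = System(2)
sys_.write(0, 100, 7)      # CPU 0 writes address 100
print(sys_.read(1, 100))   # 7: CPU 1 refetches and sees the new value
```

Because the write removes stale copies before they can be read, no processor ever observes an out-of-date word, which is the property the coherence mechanism must guarantee.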

Multiport Memory:

The multiport memory approach allows the direct, independent access of main memory
modules by each processor and I/O module. The multiport memory system is shown in
Figure 10.3.

Figure 10.3: Multiport memory


The multiport memory approach is more complex than the bus approach, requiring a fair
amount of logic to be added to the memory system. Logic associated with memory is required
for resolving conflict. The method often used to resolve conflicts is to assign permanently
designated priorities to each memory port.

Non-uniform Memory Access (NUMA)

In NUMA architecture, all processors have access to all parts of main memory using loads and
stores. The memory access time of a processor differs depending on which region of main
memory is accessed. The last statement is true for all processors; however, for different
processors, which memory regions are slower and which are faster differ.

A NUMA system in which cache coherence is maintained among the caches of the various
processors is known as cache-coherent NUMA (CC-NUMA).

A typical CC-NUMA organization is shown in the Figure 10.4.

Figure 10.4: CC- NUMA Organization

There are multiple independent nodes, each of which is, in effect, an SMP organization.


Each node contains multiple processors, each with its own L1 and L2 caches, plus main
memory.

The node is the basic building block of the overall CC-NUMA organization.

The nodes are interconnected by means of some communication facility, which could be a
switching mechanism, a ring, or some other networking facility.

Each node in the CC-NUMA system includes some main memory.

From the point of view of the processors, there is only a single addressable memory, with each
location having a unique system-wide address.

When a processor initiates a memory access, if the requested memory location is not in the
processor's cache, then the L2 cache initiates a fetch operation.

If the desired line is in the local portion of the main memory, the line is fetched across the local
bus.

If the desired line is in a remote portion of the main memory, then an automatic request is sent
out to fetch that line across the interconnection network, deliver it to the local bus, and then
deliver it to the requesting cache on that bus.

All of this activity is atomic and transparent to the processor and its cache.

In this configuration, cache coherence is a central concern. For that, each node must maintain
some sort of directory that gives it an indication of the location of the various portions of memory
and also cache status information.

Interconnection Networks:

In a multiprocessor system, the interconnection network must allow information transfer
between any pair of modules in the system. The traffic in the network consists of requests (such
as read and write), data transfers, and various commands.


Single Bus:

The simplest and most economical means of interconnecting a number of modules is to use a
single bus.

Since several modules are connected to the bus and any module can request a data transfer at
any time, it is essential to have an efficient bus arbitration scheme.

In a simple mode of operation, the bus is dedicated to a particular source-destination pair for
the full duration of the requested transfer. For example, when a processor issues a read request
on the bus, it holds the bus until it receives the desired data from the memory module.

Since the memory module needs a certain amount of time to access the data, the bus will
be idle until the memory is ready to respond with the data.

Then the data is transferred to the processor. When this transfer is completed, the bus can be
assigned to handle another request.

A scheme known as the split-transaction protocol makes it possible to use the bus during the
idle period to serve another request.

Consider the following method of handling a series of read requests, possibly from different
processors.

After transferring the address involved in the first request, the bus may be reassigned to
transfer the address of the second request; assuming that this request is to a different memory
module.

At this point, we have two modules proceeding with read access cycle in parallel.

If neither module has finished with its access, the bus may be reassigned to a third request and
so on.

Eventually, the first memory module completes its access cycle and uses the bus to transfer the
data to the corresponding source.

As other modules complete their cycles, the bus is needed to transfer their data to the
corresponding sources.

The split transaction protocol allows the bus and the available bandwidth to be used more
efficiently. The performance improvement achieved with this protocol depends on the
relationship between the bus transfer time and the memory access time.

In the split-transaction protocol, performance is improved at the cost of increased bus complexity.


There are two reasons why complexity increases:

• Since a memory module needs to know which source initiated a given read request, a
source identification tag must be attached to the request.
• Complexity also increases because all modules, not just the processors, must be able to
act as bus masters.

The main limitation of a single bus is that the number of modules that can be connected to the
bus is not that large. Networks that allow multiple independent transfer operations to proceed
in parallel can provide significantly increased data transfer rate.

Crossbar Network:

Crossbar switch is a versatile switching network. It is basically a network of switches. Any
module can be connected to any other module by closing the appropriate switch. Such
networks, where there is a direct link between all pairs of nodes, are called fully connected
networks.

In a fully connected network, many simultaneous transfers are possible. If n sources need to
send data to n distinct destinations, then all of these transfers can take place concurrently.
Since no transfer is prevented by the lack of a communication path, the crossbar is called a
nonblocking switch.

In the Figure 10.5 of the crossbar interconnection network, a single switch is shown at each
cross point. In an actual multiprocessor system, the paths through the crossbar network are
much wider.

Figure 10.5: Crossbar Interconnection Network


If there are n modules in a network, then n × n cross points are needed to interconnect them.
The total number of switches becomes large as n increases.

In a crossbar switch, conflicts occur when two or more concurrent requests are made to the
same destination device. These conflicting requests are usually handled on a predetermined
priority basis.

The crossbar switch has the potential for the highest bandwidth and system efficiency.
However, because of its complexity and cost, it may not be cost-effective for a large
multiprocessor system.

Multistage Network:

The bus and crossbar systems use a single stage of switching to provide a path from a source to
a destination.

In multistage network, multiple stages of switches are used to setup a path between source and
destination.

Such networks are less costly than the crossbar structure, yet they provide a reasonably large
number of parallel paths between source and destinations.

The Figure 10.6 shows a three-stage network, called a shuffle network, that
interconnects eight modules.

The term "shuffle" describes the pattern of connections from the outputs of one stage to the
inputs of the next stage.

The switchbox in the Figure 10.6 is a switch that can route either input to either output.

If the inputs request distinct outputs, they can both be routed simultaneously in the straight
through or crossed pattern.


If both inputs request the same output, only one request can be satisfied. The other one is
blocked until the first request finishes using the switch.

Figure 10.6: Multistage Shuffle Network

A network consisting of log2 n stages can be used to interconnect n modules. In this case, there is
exactly one path through the network from any module Pi to any module Pj. Therefore, this
network provides full connectivity between sources and destinations.

Many request patterns cannot be satisfied simultaneously. For example, the connection from P2
to P7 cannot be provided at the same time as the connection from P3 to P6.

A multistage network is less expensive to implement than a crossbar network. If n nodes are to
be interconnected using this scheme, then we must use log2 n stages with n/2 switchboxes
per stage. Since each switchbox contains four switches, the total number of switches is
2n log2 n, which, for a large network, is considerably less than the n × n switches needed in a
crossbar network.
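The comparison can be checked with a quick calculation: a crossbar needs n × n cross-point switches, while a shuffle network built from 2×2 switchboxes needs log2 n stages of n/2 switchboxes, each containing four switches. The choice of n values below is arbitrary.

```python
# Switch counts: crossbar versus multistage shuffle network.

import math

def crossbar_switches(n):
    return n * n                     # one switch per cross point

def multistage_switches(n):
    stages = int(math.log2(n))       # n assumed to be a power of two
    return (n // 2) * stages * 4     # = 2 * n * log2(n)

for n in (8, 64, 1024):
    print(n, crossbar_switches(n), multistage_switches(n))
# For n = 1024: 1,048,576 crossbar switches versus 20,480 multistage switches.
```

The gap widens rapidly with n, which is why multistage networks are preferred for large systems despite allowing fewer concurrent connections.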

Multistage networks are less capable of providing concurrent connections than crossbar
switches. A connection path between a source and a destination module is indicated by red
lines in the Figure 10.6.
