Digital Design and Computer Organization BCS302
Dept. Of AI&ML, AIT-CKM
MODULE-5
Basic Processing Unit: Some Fundamental Concepts: Register Transfers, Performing ALU operations,
Fetching a word from Memory, Storing a word in memory. Execution of a Complete Instruction.
Pipelining: Basic concepts, Role of Cache memory, Pipeline Performance
Basic Processing Unit
The heart of any computer is the central processing unit (CPU).
The CPU executes all the machine instructions and coordinates the activities of all other units during the
execution of an instruction. This unit is also called the Instruction Set Processor (ISP).
The processor is generally called the central processing unit (CPU) or micro processing unit (MPU).
High-performance processors have a pipelined organization where the execution of one instruction is
started before the execution of the preceding instruction is completed.
A program is a set of instructions performing a meaningful task. An instruction is a command to the
processor and is executed by carrying out a sequence of sub-operations called micro-operations.
5.1 Some Fundamental Concepts
Execution of a program by the processor starts with fetching instructions one at a time, decoding each
instruction and performing the operations specified. Instructions are fetched from successive memory
locations until a branch or a jump instruction is encountered. The processor keeps track of the address of the
memory location containing the next instruction to be fetched using the program counter (PC), also called the
instruction pointer (IP).
After fetching an instruction, the contents of the PC are updated to point to the next instruction in the
sequence. When a branch instruction is executed, however, the PC is loaded with a different address (the
jump/branch address). The instruction register (IR) is another key register in the processor; it holds the
op-code before decoding. The IR contents are then transferred to an instruction decoder (ID) for decoding.
To execute an instruction, the processor performs the following steps:
1. Fetch the contents of the memory location pointed to by the PC. The contents of this location are
interpreted as an instruction code to be executed. Hence, they are loaded into the IR/ID. Symbolically,
this operation can be written as
IR ← [[PC]]
2. Assuming that the memory is byte addressable, increment the contents of the PC by 4 (the constant
selected by Select4 in Section 5.2), that is,
PC ← [PC] + 4
3. Decode the instruction to understand the operation and generate the control signals necessary to carry out
the operation.
4. Carry out the actions specified by the instruction in the IR.
In cases where an instruction occupies more than one word, steps 1 and 2 must be repeated as many times as
necessary to fetch the complete instruction. These two steps together are usually referred to as the fetch phase;
step 3 constitutes the decoding phase; and step 4 constitutes the execution phase.
Fig: Single Bus Organization of the Datapath inside a Processor
To study these operations in detail, let us examine the internal organization of the processor. The main building
blocks of a processor can be interconnected in a variety of ways. A very simple organization is shown in the
above figure; a more complex structure that provides higher performance will be presented at the end.
The figure shows an organization in which the arithmetic and logic unit (ALU) and all the registers are
interconnected through a single common bus, which is internal to the processor. The data and address lines of
the external memory bus are shown connected to the internal processor bus via the memory data
register, MDR, and the memory address register, MAR, respectively. Register MDR has two inputs and two
outputs.
Data may be loaded into MDR either from the memory bus or from the internal processor bus. The data
stored in MDR may be placed on either bus. The input of MAR is connected to the internal bus, and its output is
connected to the external bus. The control lines of the memory bus are connected to the instruction decoder and
control logic block. This unit is responsible for issuing the signals that control the operation of all the units
inside the processor and for interacting with the memory bus.
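The fetch, decode and execution phases described in steps 1 to 4 can be sketched as a small simulation. This
is a minimal sketch rather than a model of any real processor: the tuple encoding of instructions, the three
op-codes, the sample program and the names `run`, `memory` and `regs` are all assumptions made for
illustration, and the 4-byte instruction size is assumed to match the constant 4 used for the PC increment in
Section 5.2.

```python
# Minimal fetch/decode/execute loop (illustrative encoding, not a real ISA).
WORD = 4  # assumed bytes per instruction word, so PC advances by 4


def run(memory, regs, pc=0):
    """Execute instructions from `memory` until HALT; return the final PC."""
    while True:
        ir = memory[pc]            # step 1: IR <- [[PC]]
        pc += WORD                 # step 2: PC <- [PC] + 4
        op, *args = ir             # step 3: decode the op-code and operands
        if op == "HALT":           # step 4: carry out the operation
            return pc
        if op == "ADD":            # ADD src, dst : dst <- dst + src
            src, dst = args
            regs[dst] += regs[src]
        elif op == "BRANCH":       # BRANCH offset : PC <- updated PC + offset
            pc += args[0]
        else:
            raise ValueError(f"unknown op-code {op!r}")


# Hypothetical program: instructions live at byte addresses 0, 4, 8 and 12.
memory = {
    0: ("ADD", "R1", "R2"),
    4: ("BRANCH", 4),          # skips the instruction at address 8
    8: ("ADD", "R1", "R2"),
    12: ("HALT",),
}
regs = {"R1": 5, "R2": 10}
final_pc = run(memory, regs)
print(regs["R2"], final_pc)
```

In this run R1 is added to R2 only once, because the branch loads the PC with an address other than the next
sequential one and the second ADD is never fetched.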
An instruction can be executed by performing one or more of the following operations in some specified
sequence:
1. Transfer a word of data from one processor register to another or to the ALU.
2. Perform an arithmetic or a logic operation and store the result in a processor register.
3. Fetch the contents of a given memory location and load them into a processor register.
4. Store a word of data from a processor register into a given memory location.
5.1.1 Register Transfers
Instruction execution involves a sequence of steps in which data are transferred from one register to another.
For each register, two control signals are used to place the contents of that register on the bus or to load the data
on the bus into the register.
Fig: Input and output gating for the registers
The input and output of register Ri are connected to the bus via switches controlled by the signals Riin and Riout,
respectively. When Riin is set to 1, the data on the bus are loaded into Ri. Similarly, when Riout is set to 1, the
contents of register Ri are placed on the bus. While Riout is equal to 0, the bus can be used for transferring data
from other registers.
Suppose that we wish to transfer the contents of register R1 to register R4. This can be accomplished as follows:
o Enable the output of register R1 by setting R1out to 1. This places the contents of R1 on the processor
bus.
o Enable the input of register R4 by setting R4in to 1. This loads data from the processor bus into register
R4.
All operations and data transfers within the processor take place within time periods defined by the
processor clock. The control signals that govern a particular transfer are asserted at the start of the clock cycle.
In our example, R1out and R4in are set to 1. The registers consist of edge-triggered flip-flops.
Hence, at the next active edge of the clock, the flip-flops that constitute R4 will load the data present at their
inputs. At the same time, the control signals R1out and R4in will return to 0. We will use this simple model of
the timing of data transfers for the rest of this chapter. However, we should point out that other schemes are
possible. For example, data transfers may use both the rising and falling edges of the clock. Also, when
edge-triggered flip-flops are not used, two or more clock signals may be needed to guarantee proper transfer of
data. This is known as multiphase clocking.
Fig: Input and output gating for one register bit
An implementation for one bit of register Ri is shown in the figure. A two-input multiplexer is used to
select the data applied to the input of an edge-triggered D flip-flop. When the control input Riin is equal to 1, the
multiplexer selects the data on the bus. This data will be loaded into the flip-flop at the rising edge of the clock.
When Riin is equal to 0, the multiplexer feeds back the value currently stored in the flip-flop.
The Q output of the flip-flop is connected to the bus via a tri-state gate. When Riout is equal to 0, the
gate's output is in the high-impedance (electrically disconnected) state. This corresponds to the open-circuit
state of a switch. When Riout = 1, the gate drives the bus to 0 or 1, depending on the value of Q.
5.1.2 Performing ALU operations
The ALU performs arithmetic and logic operations on the two operands applied to its A and B inputs.
One of the operands is the output of the MUX; the other operand is obtained directly from the processor bus.
The result produced by the ALU is stored temporarily in register Z.
The sequence of operations for R3 ← [R1] + [R2] is as follows:
o R1out, Yin
o R2out, SelectY, Add, Zin
o Zout, R3in
Instruction execution proceeds as follows:
Step 1: Contents of register R1 are loaded into register Y.
Step 2: Contents of Y and of register R2 are applied to the A and B inputs of the ALU; addition is
performed and the result is stored in register Z.
Step 3: The contents of register Z are stored in register R3.
The signals are activated for the duration of the clock cycle corresponding to that step. All other signals are
inactive.
Fig: Input and output gating for the registers
5.1.3 Fetching a word from Memory
To fetch an instruction or data from memory, the processor transfers the required address to MAR. At the same
time, the processor issues a Read signal on the control lines of the memory bus.
When the requested data are received from memory, they are stored in MDR. From MDR, they are
transferred to other registers.
The response time of a memory access varies (for example, because of a cache miss or memory-mapped I/O).
To accommodate this, the MFC (Memory Function Completed) signal is used.
MFC is a signal sent from the addressed device to the processor. MFC informs the processor that the
requested operation has been completed by the addressed device.
Fig: Connection and control signals for register MDR
As an example, consider the instruction
Move (R1), R2
The actions needed to execute it are:
MAR ← [R1]
Start a Read operation on the memory bus
Wait for the MFC response from the memory
Load MDR from the memory bus
R2 ← [MDR]
These actions are carried out by the following control sequence:
1) R1out, MARin, Read; the desired address is loaded into MAR and a Read command is issued.
2) MDRinE, WMFC; MDR is loaded from the memory bus once the MFC response arrives from memory.
3) MDRout, R2in; R2 is loaded from MDR.
Here WMFC is the control signal that causes the processor's control circuitry to wait for the arrival of the MFC
signal.
Fig: Timing of a Memory Read Operation
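The three-step read sequence for Move (R1), R2 can be sketched in code to show how WMFC stretches step 2 over
as many clock cycles as the memory needs. This is a minimal sketch under assumptions: the `Memory` class, its
fixed `latency`, the address 0x100 and the stored value 42 are hypothetical and chosen only for illustration.

```python
# Sketch of the control sequence for Move (R1), R2 with a slow memory.
class Memory:
    """Toy memory that asserts MFC a fixed number of cycles after Read."""

    def __init__(self, contents, latency):
        self.contents = contents
        self.latency = latency      # assumed cycles until MFC is asserted
        self.remaining = None
        self.address = None

    def start_read(self, address):
        self.address = address
        self.remaining = self.latency

    def tick(self):
        """Advance one clock cycle; return True when MFC is asserted."""
        self.remaining -= 1
        return self.remaining == 0

    def data(self):
        return self.contents[self.address]


def move_indirect(regs, mem):
    """Execute Move (R1), R2 and return the number of clock cycles used."""
    cycles = 0
    # Step 1: R1out, MARin, Read
    regs["MAR"] = regs["R1"]
    mem.start_read(regs["MAR"])
    cycles += 1
    # Step 2: MDRinE, WMFC (the step is held until MFC arrives)
    while True:
        cycles += 1
        if mem.tick():
            regs["MDR"] = mem.data()
            break
    # Step 3: MDRout, R2in
    regs["R2"] = regs["MDR"]
    cycles += 1
    return cycles


regs = {"R1": 0x100, "R2": 0, "MAR": 0, "MDR": 0}
mem = Memory({0x100: 42}, latency=3)
cycles = move_indirect(regs, mem)
print(regs["R2"], cycles)
```

With a latency of one cycle the whole transfer would take three cycles; every additional cycle of memory
latency is spent waiting in step 2.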
5.1.4 Storing a word in memory
Consider the instruction Move R2,(R1). This requires the following sequence:
1) R1out, MARin; the desired address is loaded into MAR.
2) R2out, MDRin, Write; the data to be written are loaded into MDR and a Write command is issued.
3) MDRoutE, WMFC; the data in MDR are written into the memory location pointed to by R1.
5.2 Execution of a Complete Instruction
Let us now put together the sequence of elementary operations required to execute one instruction. Consider the
instruction
Add (R3), R1
which adds the contents of a memory location pointed to by R3 to register R1. Executing this instruction
requires the following actions:
1. Fetch the instruction.
2. Fetch the first operand (the contents of the memory location pointed to by R3).
3. Perform the addition.
4. Load the result into R1.
Fig: Control Sequence for execution of the instruction Add (R3), R1
Instruction execution proceeds as follows:
Step 1: The instruction-fetch operation is initiated by loading the contents of PC into MAR and sending
a Read request to memory. The Select signal is set to Select4, which causes the MUX to select the constant 4.
This value is added to the operand at input B (the contents of PC), and the result is stored in Z.
Step 2: The updated value in Z is moved to PC. This completes the PC increment operation, and PC now
points to the next instruction.
Step 3: The fetched instruction is moved into MDR and then to IR. Steps 1 through 3 constitute the fetch
phase. At the beginning of step 4, the instruction decoder interprets the contents of the IR. This enables
the control circuitry to activate the control signals for steps 4 through 7, which constitute the
execution phase.
Step 4: Contents of R3 are loaded into MAR and a memory Read signal is issued.
Step 5: Contents of R1 are transferred to Y to prepare for addition.
Step 6: When the Read operation is completed, the memory operand is available in MDR, and the addition is
performed.
Step 7: The sum is stored in Z and then transferred to R1. The End signal causes a new instruction fetch
cycle to begin by returning to step 1.
5.2.1 Branching Instructions
The control sequence for an unconditional branch instruction is as follows:
Fig: Control Sequence for an unconditional Branch Instruction
Instruction execution proceeds as follows:
Step 1-3: Processing starts as for any instruction, and the fetch phase ends in step 3.
Step 4: The offset value is extracted from IR by the instruction-decoding circuit. Since the updated value of
PC is already available in register Y, the offset X is gated onto the bus, and an addition operation is
performed.
Step 5: The result, which is the branch address, is loaded into the PC.
The branch instruction loads the branch target address into PC so that the processor fetches the next
instruction from the branch target address.
The branch target address is usually obtained by adding the offset to the contents of PC.
The offset X is usually the difference between the branch target address and the address immediately
following the branch instruction.
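The five steps of the unconditional branch can be walked through at register level. This is a sketch under
assumptions: the control-signal grouping in the comments is inferred from the step descriptions, since the
figure listing the signals is not reproduced in the text, and the tuple stored in IR, the addresses 1000 and
1200 and the function name `unconditional_branch` are hypothetical.

```python
# Register-level walk-through of the five-step unconditional branch.
def unconditional_branch(regs, memory):
    """Fetch the branch instruction at PC and load the target into PC."""
    # Step 1: PC -> MAR, start Read; Select4 adds the constant 4 to PC.
    regs["MAR"] = regs["PC"]
    regs["Z"] = regs["PC"] + 4
    # Step 2: Z -> PC and Y (the updated PC); memory responds with the word.
    regs["PC"] = regs["Z"]
    regs["Y"] = regs["Z"]
    regs["MDR"] = memory[regs["MAR"]]
    # Step 3: MDR -> IR, which ends the fetch phase.
    regs["IR"] = regs["MDR"]
    # Step 4: the offset field of IR is gated onto the bus and added to Y.
    opcode, offset = regs["IR"]
    if opcode != "BRANCH":
        raise ValueError("this sketch only handles a branch instruction")
    regs["Z"] = regs["Y"] + offset
    # Step 5: Z -> PC; the next fetch comes from the branch target.
    regs["PC"] = regs["Z"]
    return regs["PC"]


branch_address = 1000     # hypothetical address of the branch instruction
target_address = 1200     # hypothetical branch target
# Offset X = target address - address immediately following the branch.
offset_x = target_address - (branch_address + 4)

memory = {branch_address: ("BRANCH", offset_x)}
regs = {"PC": branch_address, "MAR": 0, "MDR": None, "IR": None, "Y": 0, "Z": 0}
new_pc = unconditional_branch(regs, memory)
print(offset_x, new_pc)
```

The offset is measured from the updated PC (1004 here), not from the address of the branch itself, which is why
register Y must hold the incremented value before step 4.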
In the case of a conditional branch, we have to check the status of the condition codes before loading a
new value into the PC. For example, for a branch-on-negative instruction, step 4 of the control sequence
becomes:
Offset-field-of-IRout, Add, Zin, If N=0 then End
If N=0, the processor returns to step 1 immediately after step 4. If N=1, step 5 is performed to load a new
value into PC.
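The decision made in step 4 can be expressed as a short function. This is an illustrative sketch: the name
`next_pc` and the numeric values are assumptions, and `updated_pc` stands for the PC value after the fetch
phase has incremented it.

```python
def next_pc(updated_pc, offset, n_flag):
    """Return the PC after a branch-on-negative, given the N condition code."""
    if n_flag == 0:
        # "If N=0 then End": step 5 is skipped and PC keeps the updated value.
        return updated_pc
    # N=1: step 5 loads the branch address (updated PC + offset) into PC.
    return updated_pc + offset


taken = next_pc(updated_pc=1004, offset=196, n_flag=1)
not_taken = next_pc(updated_pc=1004, offset=196, n_flag=0)
print(taken, not_taken)
```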
5.3 Pipelining
Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-operation
completed in a dedicated segment that operates concurrently with all other segments.
5.3.1 Basic concepts
In computer architecture, pipelining means executing machine instructions concurrently. Pipelining is used
in modern computers to achieve high performance. The speed of execution of programs is influenced by many
factors. One way to improve performance is to use faster circuit technology to build the processor and the main
memory. Another possibility is to arrange the hardware so that more than one operation can be performed at the
same time. In this way, the number of operations performed per second is increased even though the elapsed
time needed to perform any one operation is not changed.
Fig: Sequential Execution
Pipelining is a particularly effective way of organizing concurrent activity in a computer system. The basic idea
is very simple. It is frequently encountered in manufacturing plants, where pipelining is commonly known as an
assembly-line operation. The processor executes a program by fetching and executing instructions, one after
the other. Let Fi and Ei refer to the fetch and execute steps for instruction Ii. Execution of a program consists of
a sequence of fetch and execute steps, as shown in Figure Sequential Execution.
Now consider a computer that has two separate hardware units, one for fetching instructions and another
for executing them, as shown in Figure Hardware organization of pipelining. The instruction fetched by the
fetch unit is deposited in an intermediate storage buffer, B1. This buffer is needed to enable the execution unit
to execute the instruction while the fetch unit is fetching the next instruction. The results of execution are
deposited in the destination location specified by the instruction. The data operated on by the instructions
are assumed to be inside the block labeled "Execution unit".
Fig: Hardware organization of pipelining for two segments
The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction
can each be completed in one clock cycle. Operation of the computer proceeds as shown in Figure Pipelined
execution of instructions. In the first clock cycle, the fetch unit fetches an instruction I1 (step F1) and stores
it in buffer B1 at the end of the clock cycle. In the second clock cycle, the instruction fetch unit proceeds with
the fetch operation for instruction I2 (step F2). Meanwhile, the execution unit performs the operation specified
by instruction I1, which is available to it in buffer B1 (step E1). By the end of the second clock cycle, the
execution of instruction I1 is completed and instruction I2 is available. Instruction I2 is stored in B1,
replacing I1, which is no longer needed. Step E2 is performed by the execution unit during the third clock
cycle, while instruction I3 is being fetched by the fetch unit. In this manner, both the fetch and execute units
are kept busy all the time.
Fig: Pipelined execution of instructions (instruction pipelining) for two segments
A pipelined processor may process each instruction in four steps, as follows:
F Fetch: read the instruction from the memory.
D Decode: decode the instruction and fetch the source operand(s).
E Execute: perform the operation specified by the instruction.
W Write: store the result in the destination location.
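The overlap of the four steps F, D, E and W can be tabulated with a few lines of code. This is a minimal
sketch that assumes an ideal pipeline in which every step takes exactly one clock cycle and nothing stalls;
the function name `pipeline_schedule` and the choice of four instructions are illustrative.

```python
STAGES = ["F", "D", "E", "W"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {cycle: [step labels]} for an ideal pipeline with no stalls."""
    schedule = {}
    for i in range(1, num_instructions + 1):
        for s, stage in enumerate(stages):
            cycle = i + s   # instruction Ii is in stage number s during cycle i + s
            schedule.setdefault(cycle, []).append(f"{stage}{i}")
    return schedule


schedule = pipeline_schedule(4)
for cycle in sorted(schedule):
    print(cycle, " ".join(schedule[cycle]))
```

In cycle 4 all four units are busy at once (W1, E2, D3 and F4), and the four instructions finish in 7 cycles
instead of the 16 that strictly sequential execution would need.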
Fig: Instruction execution divided into four steps for four segments
Fig: Hardware organization of a 4-stage pipeline
The sequence of events for this case is shown in Figure Instruction execution divided into four steps. Four
instructions are in progress at any given time. This means that four distinct hardware units are needed, as shown
in Figure Hardware organization of a 4-stage pipeline. These units must be capable of performing their tasks
simultaneously and without interfering with one another. Information is passed from one unit to the next
through a storage buffer.
For example, during clock cycle 4, the information in the buffers is as follows:
o Buffer B1 holds instruction I3, which was fetched in cycle 3 and is being decoded by the
instruction-decoding unit.
o Buffer B2 holds both the source operands for instruction I2 and the specification of the operation to be
performed.
o Buffer B3 holds the results produced by the execution unit and the destination information for
instruction I1.
5.3.2 Role of Cache memory
Each stage in a pipeline is expected to complete its operation in one clock cycle. Hence, the clock period should
be sufficiently long to complete the task being performed in any stage. Pipelining is most effective in improving
performance if the tasks being performed in different stages require about the same amount of time.
Fetching from main memory typically takes much longer than the other basic operations inside the processor,
which would make the fetch stage far slower than the rest.
The use of cache memories solves this memory access problem. In particular, when a cache is included
on the same chip as the processor, access time to the cache is usually the same as the time needed to perform
other basic operations inside the processor. This makes it possible to divide instruction fetching and processing
into steps that are more or less equal in duration. Each of these steps is performed by a different pipeline stage,
and the clock period is chosen to correspond to the longest one.
5.3.3 Pipeline Performance
The four-stage pipelined processor completes the processing of one instruction in each clock cycle, which means
that the rate of instruction processing is four times that of sequential operation. The potential increase in
performance resulting from pipelining is proportional to the number of pipeline stages. However, this increase
would be achieved only if pipelined operation as depicted could be sustained without interruption throughout
program execution. Unfortunately, this is not the case.
Fig: Effect of an execution operation taking more than one clock cycle
Figure Effect of an execution operation taking more than one clock cycle shows an example in which the
operation specified in instruction I2 requires three cycles to complete, from cycle 4 through cycle 6. Thus, in
cycles 5 and 6, the Write stage must be told to do nothing, because it has no data to work with. Meanwhile, the
information in buffer B2 must remain intact until the Execute stage has completed its operation. This means that
stage 2 and, in turn, stage 1 are blocked from accepting new instructions because the information in B1 cannot
be overwritten.
Thus, steps D4 and F5 must be postponed, as shown in the same figure. Pipelined operation is said to have been
stalled for two clock cycles. Normal pipelined operation resumes in cycle 7.
Any condition that causes the pipeline to stall is called a hazard.
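The stall described above can be reproduced with a small scheduling sketch. It rests on assumptions that are
not spelled out in the text: instructions proceed strictly in order, each inter-stage buffer holds one
instruction, and five instructions are simulated. The function name `schedule` and the `durations` table are
hypothetical.

```python
STAGES = ["F", "D", "E", "W"]


def schedule(durations):
    """Return the start cycle of every stage for each instruction.

    durations[i][s] is the number of cycles instruction i spends in stage s.
    An instruction enters a stage only after it has finished the previous
    stage and the instruction ahead of it has moved on to the next stage.
    """
    starts = []
    for i, dur in enumerate(durations):
        row = []
        for s in range(len(STAGES)):
            ready = 1 if s == 0 else row[s - 1] + dur[s - 1]
            if i > 0:
                prev = starts[i - 1]
                if s + 1 < len(STAGES):
                    free = prev[s + 1]       # stage frees up when prev enters s+1
                else:
                    free = prev[s] + durations[i - 1][s]  # last stage: prev done
                ready = max(ready, free)
            row.append(ready)
        starts.append(row)
    return starts


one = [1, 1, 1, 1]
ideal = schedule([one] * 5)
# I2 (index 1) needs three cycles in the Execute stage.
stalled = schedule([one, [1, 1, 3, 1], one, one, one])

print(stalled[3][1], stalled[4][0])        # start cycles of D4 and F5
print(stalled[-1][-1] - ideal[-1][-1])     # extra cycles caused by the stall
```

Under these assumptions D4 and F5 both start in cycle 7 rather than cycle 5, and the last instruction finishes
two cycles later than in the ideal schedule, matching the two-cycle stall described in the text.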
Data Hazards: A data hazard is any condition in which either the source or the destination operands of
an instruction are not available at the time expected in the pipeline. As a result, some operation has to be
delayed, and the pipeline stalls.
Control/Instruction Hazards: The pipeline may also be stalled because of a delay in the availability of
an instruction. For example, this may be a result of a miss in the cache, requiring the instruction to be
fetched from the main memory. Such hazards are often called control hazards or instruction hazards.
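One common source of a data hazard is an instruction that needs a register value which an earlier instruction
has not yet written. The check below is only an illustrative sketch: the `Instr` record, the function
`depends_on` and the three sample instructions are assumptions and do not come from the text.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    name: str
    sources: Tuple[str, ...]
    dest: str


def depends_on(later: Instr, earlier: Instr) -> bool:
    """True if `later` reads a register that `earlier` writes."""
    return earlier.dest in later.sources


i1 = Instr("Mul", sources=("R2", "R3"), dest="R4")
i2 = Instr("Add", sources=("R5", "R4"), dest="R6")   # reads R4 written by i1
i3 = Instr("Sub", sources=("R7", "R8"), dest="R9")   # independent of i1

print(depends_on(i2, i1), depends_on(i3, i1))
```

If i2 follows i1 directly in the pipeline, its source operand R4 is not available when the Decode stage
expects it, so i2 has to be delayed; i3 has no such dependence.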
Fig: Instruction execution steps in successive clock cycles
Fig: Function performed by each processor stage in successive clock cycles
The effect of a cache miss on pipelined operation is illustrated in the above figures. Instruction I1 is fetched
from the cache in cycle 1, and its execution proceeds normally. However, the fetch operation for instruction I2,
which is started in cycle 2, results in a cache miss. The instruction fetch unit must now suspend any further
fetch requests and wait for I2 to arrive. We assume that instruction I2 is received and loaded into buffer B1 at
the end of cycle 5. The pipeline resumes its normal operation at that point.
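The cost of such a stall can be estimated with a short calculation. This is a sketch under stated assumptions:
every stage takes one clock cycle unless stalled, k is the number of pipeline stages, n is the number of
instructions, and s is the total number of stall cycles. The choice n = 3 is only an illustration and is not
taken from the figure.

```latex
\begin{align*}
T_{\text{sequential}} &= n\,k \\
T_{\text{pipelined}}  &= k + (n - 1) + s \\
\text{Speedup} &= \frac{T_{\text{sequential}}}{T_{\text{pipelined}}}
                = \frac{n\,k}{k + n - 1 + s}
\end{align*}
% Cache-miss example: k = 4, and the fetch of I_2 occupies cycles 2 to 5
% instead of cycle 2 alone, so s = 3.
% For n = 3: T_pipelined = 4 + 2 + 3 = 9 cycles, versus 4 + 2 + 0 = 6 cycles without the miss.
```

With s = 0 and a long instruction stream the speedup approaches k, which is the sense in which the potential
gain is proportional to the number of pipeline stages; every stall cycle moves the result away from that limit.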