Advanced Computer Architecture
Humaira Ashraf
Computer Architecture
Major components of a computer
Central Processing Unit (CPU)
memory
peripheral devices
What is a computer?
Simply put, a computer is a sophisticated electronic calculating machine that:
Accepts input information,
Processes the information according to a list of internally stored instructions and
Produces the resulting output information.
Functions performed by a computer are:
Accepting information to be processed as input.
Storing a list of instructions to process the information.
Processing the information according to the list of instructions.
Providing the results of the processing as output.
What are the functional units of a computer?
Functional units of a computer
Arithmetic and logic unit (ALU):
Performs the desired operations on the input information as determined by
instructions in the memory.
Input unit accepts information from:
Human operators,
Electromechanical devices,
Other computers.
Output unit sends results of processing:
To a monitor display,
To a printer.
Memory stores information:
Instructions,
Data.
Control unit coordinates various actions:
Input,
Output,
Processing.
[Figure: Input and Output (I/O) units connected to the Memory, which holds
instructions and data, and to the Processor, which contains the Arithmetic &
Logic and Control units.]
Information in a computer -- Instructions
Instructions specify commands to:
Transfer information within a computer (e.g., from memory to ALU).
Transfer information between the computer and I/O devices (e.g., from the
keyboard to the computer, or from the computer to a printer).
Perform arithmetic and logic operations (e.g., add two numbers, perform
a logical AND).
A sequence of instructions to perform a task is called a program, which is
stored in the memory.
The processor fetches the instructions that make up a program from the memory
and performs the operations stated in those instructions.
What do the instructions operate upon?
Information in a computer -- Data
Data are the operands upon which instructions operate.
Data could be:
Numbers,
Encoded characters.
Data, in a broad sense, means any digital information.
Computers use data that is encoded as a string of binary digits called bits.
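As a concrete illustration of the point above, here is a short Python sketch (the helper name is my own) showing a number and a character string as strings of bits:

```python
# Sketch: how data -- numbers and encoded characters -- looks as bit strings.

def to_bits(value, width):
    """Binary representation of a non-negative integer, zero-padded to `width` bits."""
    return format(value, f"0{width}b")

# A number as bits
print(to_bits(42, 8))          # 00101010

# Characters: encode them first (ASCII here), then view each byte as bits
for byte in "Hi".encode("ascii"):
    print(to_bits(byte, 8))    # 01001000, then 01101001
```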
Input unit
Binary information must be presented to a computer in a specific format. This
task is performed by the input unit:
- Interfaces with input devices.
- Accepts binary information from the input devices.
- Presents this binary information in a format expected by the computer.
- Transfers this information to the memory or processor.
[Figure: real world (keyboard, audio input) -> Input Unit -> computer
(memory and processor).]
Memory unit
Memory unit stores instructions and data.
Recall that data are represented as a series of bits; to store data, the
memory unit therefore stores bits.
The processor reads instructions and reads/writes data from/to the
memory during the execution of a program.
In theory, instructions and data could be fetched one bit at a time.
In practice, a group of bits is fetched at a time.
The group of bits stored or retrieved at a time is termed a word.
The number of bits in a word is termed the word length of a computer.
In order to read/write to and from memory, a processor must know where to look:
An address is associated with each word location.
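The word/address idea can be sketched as follows; the 4-byte word length and the tiny memory array are assumptions for illustration:

```python
# Sketch: a byte-addressable memory accessed one word at a time.
# A 32-bit (4-byte) word length is an illustrative assumption.

WORD_BYTES = 4

memory = bytearray(64)                       # 64 bytes = 16 word locations
memory[0:4] = (0x12345678).to_bytes(4, "big")

def read_word(addr):
    """Read one word starting at byte address `addr` (must be word-aligned)."""
    assert addr % WORD_BYTES == 0, "unaligned access"
    return int.from_bytes(memory[addr:addr + WORD_BYTES], "big")

print(hex(read_word(0)))                     # 0x12345678
```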
Memory unit (contd..)
Processor reads/writes to/from memory based on the memory address:
Access any word location in a short and fixed amount of time based
on the address.
Random Access Memory (RAM) provides fixed access time
independent of the location of the word.
Access time is known as Memory Access Time.
Memory and processor have to communicate with each other in order
to read/write information.
In order to reduce communication time, a small amount of RAM
(known as Cache) is tightly coupled with the processor.
Modern computers have three to four levels of RAM units with different
speeds and sizes:
Fastest, smallest known as Cache
Slowest, largest known as Main memory.
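The benefit of a small, fast cache in front of slower main memory can be quantified with the usual average-access-time formula; the hit rate and latencies below are illustrative assumptions, not figures from the slides:

```python
# Sketch: why a small fast cache reduces the effective memory access time.

def avg_access_time(hit_rate, cache_ns, main_ns):
    """Average access time when cache misses fall through to main memory."""
    return hit_rate * cache_ns + (1 - hit_rate) * main_ns

# 95% of accesses hit a 1 ns cache; the remaining 5% go to 100 ns main memory
print(avg_access_time(0.95, 1.0, 100.0))     # 5.95 ns on average
```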
Memory unit (contd..)
Primary storage of the computer consists of RAM units.
Fastest, smallest unit is Cache.
Slowest, largest unit is Main Memory.
Primary storage is insufficient to store large amounts of data and programs.
Primary storage can be added, but it is expensive.
Store large amounts of data on secondary storage devices:
Magnetic disks and tapes,
Optical disks (CD-ROMS).
Access to data stored in secondary storage is slower, but this takes advantage
of the fact that some information may be accessed infrequently.
The cost of a memory unit depends on its access time: a lower access time
implies a higher cost.
Computer Architecture (continued)
organized as
bit
byte = 8 bits (smallest addressable location)
word = 4 bytes (typically; machine dependent)
instructions consist of operation codes and addresses; typical formats:
oprn | addr1                    (one-address)
oprn | addr1 | addr2            (two-address)
oprn | addr1 | addr2 | addr3    (three-address)
Computer Architecture (continued)
Numeric data representations
integer (exact representation)
sign-magnitude
floating point (approximate representation)
scientific notation: 0.3481 x 10^6
inherently imprecise
IEEE Standard 754-1985
fields: sign, exponent (exp), significand
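The IEEE 754 field layout can be inspected directly; this sketch uses Python's standard struct module to expose the sign, exponent and significand fields of a 32-bit float:

```python
# Sketch: viewing the IEEE 754 single-precision fields of a float.
import struct

def float_fields(x):
    """Return (sign, biased exponent, significand) of x as a 32-bit IEEE 754 value."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8-bit exponent, bias 127
    significand = bits & 0x7FFFFF         # 23-bit fraction
    return sign, exponent, significand

# 1.0 is stored as sign 0, biased exponent 127, significand 0
print(float_fields(1.0))                  # (0, 127, 0)
```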
Arithmetic and logic unit (ALU)
Operations are executed in the Arithmetic and Logic Unit (ALU).
Arithmetic operations such as addition, subtraction.
Logic operations such as comparison of numbers.
In order to execute an instruction, operands need to be brought into the ALU
from the memory.
Operands are stored in general-purpose registers available in the ALU.
Access times of general-purpose registers are shorter than that of the cache.
Results of the operations are stored back in the memory or retained in the
processor for immediate use.
Output unit
Computers represent information in a specific binary form. Output units:
- Interface with output devices.
- Accept processed results provided by the computer in specific binary form.
- Convert the information in binary form to a form understood by an
output device.
[Figure: computer (memory and processor) -> Output Unit -> real world
(printer, graphics display, speakers).]
Control unit
Operation of a computer can be summarized as:
Accepts information from the input units (Input unit).
Stores the information (Memory).
Processes the information (ALU).
Provides processed results through the output units (Output unit).
Operations of Input unit, Memory, ALU and Output unit are coordinated by
Control unit.
Instructions control what operations take place (e.g. data transfer,
processing).
The control unit generates timing signals which determine when a particular
operation takes place.
How are the functional units connected?
For a computer to achieve its operation, the functional units need to
communicate with each other.
In order to communicate, they need to be connected.
[Figure: Input, Output, Memory and Processor units connected by a bus.]
Functional units may be connected by a group of parallel wires.
The group of parallel wires is called a bus.
Each wire in a bus can transfer one bit of information.
The number of parallel wires in a bus is typically equal to the word length
of the computer.
Organization of cache and main memory
[Figure: Processor <-> Cache memory <-> Main memory, connected by a bus.]
Why is the access time of the cache memory lesser than the
access time of the main memory?
Simple Machine Organization (continued)
ALU does arithmetic and logical comparisons
AC = accumulator holds results
MQ = multiplier-quotient register holds second portion of long results
MBR = memory buffer register holds data while operation executes
Simple Machine Organization (continued)
Program control determines what computer does based
on instruction read from memory
MAR = memory address register holds address of memory cell to be read
PC = program counter; address of next instruction to be read
IR = instruction register holds instruction being executed
IBR = instruction buffer register holds the right half of the instruction read from memory
Simple Machine Organization
(continued)
The machine operates on a fetch-execute cycle.
Fetch:
PC -> MAR
read M(MAR) into MBR
copy left and right instructions into IR and IBR
Execute:
address part of IR -> MAR
read M(MAR) into MBR
execute opcode
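The fetch-execute cycle above can be sketched as a toy interpreter; the three-instruction ISA (LOAD/ADD/HALT) and the accumulator are invented for illustration, not the actual IAS instruction set:

```python
# Sketch: a toy fetch-execute loop with PC, MAR, MBR and IR registers.
# The instruction set and memory contents are illustrative assumptions.

memory = {0: ("LOAD", 5), 1: ("ADD", 6), 2: ("HALT", 0), 5: 10, 6: 32}
pc, acc = 0, 0

while True:
    mar = pc                      # PC -> MAR
    mbr = memory[mar]             # read M(MAR) into MBR
    ir = mbr                      # MBR -> IR
    pc += 1                       # point at the next instruction
    opcode, addr = ir             # decode and execute the opcode
    if opcode == "LOAD":
        acc = memory[addr]
    elif opcode == "ADD":
        acc += memory[addr]
    elif opcode == "HALT":
        break

print(acc)                        # 42
```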
Architecture Families
Before the mid-60s, every new machine had a different
instruction set architecture
programs from the previous generation didn't run on the new machine
cost of replacing software became too large
IBM System/360 created the family concept
single instruction set architecture
wide range of price and performance with the same software
Performance improvements based on different detailed implementations
memory path width (1 byte to 8 bytes)
faster, more complex CPU design
greater I/O throughput and overlap
Software compatibility is now a major issue
partially offset by high-level language (HLL) software
Multiple Register Machines
Initially, machines had only a few registers
2 to 8 or 16 common
registers more expensive than memory
Most instructions operated between memory locations
operands had to start from and end up in memory
fewer, more complex instructions meant smaller programs and (supposedly)
faster execution
fewer instructions and data to move between memory and ALU
But registers are much faster than memory
roughly 30 times faster
Multiple Register Machines (continued)
Also, many operands are reused within a short time
it wastes time to load an operand again the next time it's needed
Depending on the mix of instructions and operand use, having many registers
may lead to less traffic to memory and faster execution
Most modern machines use a multiple-register architecture
maximum number about 512; commonly 32 integer and 32 floating-point registers
Register Organization
User-visible registers
Control and status registers
User-Visible Registers
General Purpose
Data
Address
Condition Codes
General-Purpose Registers
Data: Accumulator
Addressing: Segment pointers, Index registers, Stack Pointer
Control & Status Registers
Program Counter
Instruction Decoding Register
Memory Address Register
Memory Buffer Register
Example Register Org.
Instruction Cycle
It is the time in which a single instruction is fetched from memory,
decoded, and executed.
An instruction cycle comprises the following subcycles:
Instruction Cycle
Fetch
Read the next instruction from memory into the processor.
Indirect
May require memory accesses to fetch operand addresses, and
therefore more memory accesses.
Interrupt
Save the current context (e.g., the PC) and service the interrupt.
Execute
Interpret the opcode and perform the indicated operation.
Instruction Cycle
[Figure: state diagram -- Fetch, then optionally Indirect, then Execute,
with an Interrupt check between instructions.]
Data Flow (Fetch Diagram)
[Figure: fetch data flow -- PC -> MAR; address sent to memory; instruction
read into MBR and copied to IR; the control unit increments the PC.]
Data Flow (Indirect Diagram)
[Figure: indirect data flow -- the address field of the MBR -> MAR; memory is
read again, and the operand's direct address arrives in the MBR.]
Data Flow (Execute)
May take many forms
Depends on the instruction being executed
May include
Memory read/write
Input/Output
Register transfers
ALU operations
Data Flow (Interrupt Diagram)
[Figure: interrupt data flow -- the current PC is copied to the MBR, a save
address is placed in the MAR, the PC is written to memory via the MBR, and
the PC is then loaded with the interrupt routine's address.]
Instruction Level Parallel Processor
Pipelining led to the exploitation of ILP; as the degree of parallel
execution increased, so did the use of multiple pipelined execution units.
Thus superscalar ILP processors began to appear.
These ideas were first implemented in VLIW architectures.
Instruction Pipelining
Instruction processing is subdivided:
- Fetch / Execute instruction
The pipeline has two independent stages:
1st stage: fetches an instruction and buffers it.
2nd stage: temporarily free until the first stage passes it the buffered
instruction.
While the second stage is executing the instruction, the first stage fetches
and buffers the next instruction.
This is called instruction prefetch or fetch overlap.
- Purpose? To speed up instruction execution.
Two-Stage Instruction Pipeline
Instruction Processing
Fetch instruction (FI)
Decode instruction (DI)
Calculate operand addresses (CO)
Fetch operands (FO)
Execute instruction (EI)
Write operand (WO)
Successive instructions in a program sequence will overlap in execution.
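Under ideal conditions (no stalls or branches), the timing of such a six-stage pipeline follows a simple formula, sketched below:

```python
# Sketch: ideal timing of a six-stage pipeline (FI, DI, CO, FO, EI, WO).
# With no stalls, instruction i (0-based) completes in cycle depth + i.

STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def completion_cycle(i, depth=len(STAGES)):
    """Cycle (numbered from 1) in which the i-th instruction leaves the pipeline."""
    return depth + i

# Nine instructions take 6 + 8 = 14 cycles instead of 9 * 6 = 54 unpipelined
print(completion_cycle(8))       # 14
```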
Timing Diagram for Instruction Pipeline Operation
Six-Stage CPU Instruction Pipeline
The logic needed for pipelining must account for branches, interrupts, and
the problems they raise.
Alternative Pipeline Depiction
RISC Pipeline
1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory Access
5. Register write back
Improve CPU performance by
increasing clock rates
(CPU running at 2GHz!)
increasing the number of instructions to be executed in parallel
(executing 6 instructions at the same time)
How do we increase the number of
instructions to be executed?
Time and Space parallelism
Pipeline (assembly line)
Result of pipeline (e.g.)
VLIW
(very long instruction word, 1024 bits!)
Superscalar
(sequential stream of instructions)
From Sequential Instructions to Parallel Execution
Dependencies between instructions
Instruction scheduling
Preserving sequential consistency
4.2 Dependencies between instructions
Instructions often depend on each other in such a way that a particular instruction
cannot be executed until a preceding instruction or even two or three preceding
instructions have been executed.
1 Data dependencies
2 Control dependencies
3 Resource dependencies
4.2.1 Data dependencies
Read after Write
Write after Read
Write after Write
Recurrences
Data dependencies in straight-line code
(RAW)
RAW dependencies
i1: load r1, a
i2: add r2, r1, r1
flow dependencies
Data dependencies in straight-line code
(WAR)
WAR dependencies
i1: mul r1, r2, r3
i2: add r2, r4, r5
anti-dependencies
false dependencies
can be eliminated through register renaming
i1: mul r1, r2, r3
i2: add r6, r4, r5
by using compiler or ILP-processor
Data dependencies in straight-line code
(WAW)
WAW dependencies
i1: mul r1, r2, r3
i2: add r1, r4, r5
output dependencies
false dependencies
can be eliminated through register renaming
i1: mul r1, r2, r3
i2: add r6, r4, r5
by using compiler or ILP-processor
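Register renaming as described above can be sketched in a few lines; the physical-register naming scheme (p1, p2, ...) and the instruction format are illustrative assumptions:

```python
# Sketch: eliminating WAW/WAR (false) dependencies by register renaming.
# Every write to an architectural register gets a fresh physical register;
# reads are redirected to the latest mapping, so true (RAW) dependencies survive.

def rename(program):
    mapping, next_reg, out = {}, 0, []
    for op, dst, *srcs in program:
        srcs = [mapping.get(s, s) for s in srcs]   # read current mappings
        next_reg += 1
        mapping[dst] = f"p{next_reg}"              # fresh destination register
        out.append((op, mapping[dst], *srcs))
    return out

prog = [("mul", "r1", "r2", "r3"),
        ("add", "r1", "r4", "r5")]                 # WAW dependency on r1
print(rename(prog))  # [('mul', 'p1', 'r2', 'r3'), ('add', 'p2', 'r4', 'r5')]
```

After renaming, the two instructions write different physical registers and can execute in parallel.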
Data dependencies in loops
for (int i = 2; i < 10; i++) {
    x[i] = a*x[i-1] + b;
}
The iterations cannot be executed in parallel.
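Run sequentially, the recurrence looks like this in Python (a, b and x[1] are arbitrary illustrative values); each iteration consumes the value the previous one produced, which is exactly the loop-carried RAW dependence:

```python
# Sketch: the loop-carried (RAW) dependence of the recurrence above.
a, b = 2, 1
x = [0] * 10
x[1] = 1
for i in range(2, 10):
    x[i] = a * x[i - 1] + b     # needs the previous iteration's result

print(x[2], x[3])               # 3 7 -- iteration i cannot start before i-1 finishes
```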
Data dependency graphs
i1: load r1, a
i2: load r2, b
i3: add r3, r1, r2
i4: mul r1, r2, r4
i5: div r1, r2, r4
Control Dependency Graph
4.2.3 Resource dependencies
An instruction is resource-dependent on a previously issued instruction if it
requires a hardware resource which is still being used by a previously issued
instruction.
e.g., two divisions competing for a single (non-pipelined) division unit:
div r1, r2, r3
div r4, r2, r5
4.3 Instruction scheduling
Scheduling means arranging two or more instructions to be executed in parallel.
Need to detect code dependencies (detection)
Need to remove false dependencies (resolution)
Two basic approaches
Static: done by compiler
Dynamic: done by processor
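A minimal sketch of the static (compile-time) approach, assuming an acyclic dependence relation; the instruction names and dependence sets are invented for illustration:

```python
# Sketch: greedy static scheduling -- issue an instruction only after every
# instruction it depends on has been issued. Assumes the dependence
# relation is acyclic (a real compiler would also weigh latencies).

def list_schedule(instrs, deps):
    """deps maps an instruction to the set of instructions it must follow."""
    done, order = set(), []
    while len(order) < len(instrs):
        for ins in instrs:
            if ins not in done and deps.get(ins, set()) <= done:
                done.add(ins)
                order.append(ins)
                break
    return order

instrs = ["i1", "i3", "i2", "i4"]                 # program order
deps = {"i3": {"i1", "i2"}, "i4": {"i3"}}         # i3 needs i1 and i2 first
print(list_schedule(instrs, deps))                # ['i1', 'i2', 'i3', 'i4']
```

Note how i2 is hoisted ahead of i3, which would otherwise stall waiting for its operands.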
Instruction Scheduling:
ILP-instruction scheduling
4.4 Preserving sequential consistency
care must be taken to maintain the logical integrity of the program execution
parallel execution mimics sequential execution as far as the logical integrity
of program execution is concerned
e.g.
add r5, r6, r7
div r1, r2, r3
jz somewhere
Branches
Branch (or branching, branched) may also refer to the act of switching
execution to a different instruction sequence as a result of executing a
branch instruction.
Branches
Two types of branch instructions
Conditional
A programming instruction that directs the computer to another part of the
program based on the result of a compare. High-level language statements,
such as IF THEN ELSE and CASE, are used to express the compare and
conditional branch.
Branches
Unconditional
Unconditional branch instructions such as GOTO are used to unconditionally
"jump" to (begin execution of) a different instruction sequence.
Machine-level branch instructions are sometimes called jump instructions.
Conditional Branch Instructions
Condition codes:
BRP X        Branch to location X if the result is positive
BRZ X        Branch to location X if the result is zero
BRE R1,R2,X  Branch to location X if the contents of R1 = R2
Conditional Branch Instructions
Dealing with Branches
A major problem in designing an instruction pipeline is assuring a steady
flow of instructions to the initial stages of the pipeline.
Since conditional branches alter the steady flow of instructions, we must
come up with ways to execute them efficiently.
Major types of branches
Branch: To transfer control
Branch examples
8.1.2 How to check the results of operations for specified
conditions {branch} (e.g. equals 0, negative, and so on)
ISA = Instruction Set Architecture
Alternatives for checking the operation results
Result state approach: Disadvantage
The generation of the result state is not straightforward
It requires an irregular structure and occupies additional chip area
The result state is a sequential concept.
It cannot be applied without modification in architectures which have multiple
execution units.
Retaining sequential consistency for
condition checking (in VLIW or Superscalar)
Use multiple sets of condition codes or flags.
This relies on the programmer or compiler to use different sets of condition
codes or flags for the different outcomes generated by different EUs.
Use Direct Check approach.
Branch Statistics
About 20% of general-purpose code consists of branches
on average, every fifth instruction is a branch
5-10% of scientific code consists of branches
The majority of branches are conditional (80%)
75-80% of all branches are taken
Branch statistics: Taken or not Taken
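These statistics let us estimate the cost of branches; the 3-cycle taken-branch penalty below is an assumed figure for illustration:

```python
# Sketch: average cycles lost per instruction to taken branches, using the
# statistics quoted above (20% branches, ~80% of them taken) and an assumed
# 3-cycle taken-branch penalty.

def branch_overhead(branch_freq, taken_frac, penalty_cycles):
    """Extra cycles per instruction caused by taken branches."""
    return branch_freq * taken_frac * penalty_cycles

print(branch_overhead(0.20, 0.80, 3))   # 0.48 extra cycles per instruction
```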
8.1.4 The branch problem:
The delay caused in pipelining
More branch problems
A conditional branch can cause an even longer penalty:
evaluation of the specified condition needs an extra cycle
waiting for an unresolved condition (the result is not yet ready),
e.g., waiting for the result may take 10-50 cycles
In pipelines with more than 4 stages,
each branch results in a yet larger number of wasted cycles (called bubbles)
Interpretation of the concept of branch penalty
Zero-cycle branching {in no time}
8.2 Basic approaches to branch handling
Speculative vs. Multiway branching
-Delayed Branching: Occurrence of an unused
instruction slot (unconditional branch)
Basic scheme of delayed branching
Early branch detection: {for scalar Processor}
Integrated instruction fetch and branch detection
Detect branch instruction during fetching
Guess taken or not taken
Fetch next sequential instruction or target instruction
Handling of unresolved conditional branches
Blocking branch processing
The pipeline is simply stalled (stopped and waited) until the specified
condition can be resolved.
-Basic kinds of branch predictions
-The fixed prediction approach
Always not taken vs. Always Taken
Always not taken: Penalty figures
Penalty figures for the always-taken prediction approach
-Static branch prediction
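One widely used static rule (an assumption here; the slides do not name it) is backward-taken/forward-not-taken (BTFN): backward branches usually close loops and are therefore likely taken.

```python
# Sketch: BTFN static branch prediction -- predict taken if the branch
# target lies before the branch itself (a backward branch, likely a loop).

def predict_taken(branch_addr, target_addr):
    """Backward-taken / forward-not-taken (BTFN) static prediction."""
    return target_addr < branch_addr

print(predict_taken(0x100, 0x0C0))   # True  -- backward branch, likely a loop
print(predict_taken(0x100, 0x140))   # False -- forward branch
```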
Instruction Scheduling
Instruction scheduling is a compiler optimization used to improve
instruction-level parallelism, which improves performance on machines with
instruction pipelines. Put more simply, without changing the meaning of
the code, it tries to avoid pipeline stalls by rearranging the order of
instructions.
Compiler Optimization
In computing, an optimizing compiler is a compiler that tries to minimize or
maximize some attributes of an executable computer program. The most common
requirement is to minimize the time taken to execute a program; a less common
one is to minimize the amount of memory occupied. The growth of portable
computers has created a market for minimizing the power consumed by a
program. Compiler optimization is generally implemented using a sequence of
optimizing transformations: algorithms which take a program and transform it
to produce a semantically equivalent output program that uses fewer
resources.
Motivation
Single-cycle implementation
Pipeline Implementation
Pipeline Hazards
Instruction Order
Instruction Dependencies
Instruction Scheduling
Preserving Dependencies
Pipeline stall
A bubble or pipeline stall is a delay in the execution of an instruction in
an instruction pipeline.
During the decoding stage, the control unit determines whether the decoded
instruction reads from a register that the instruction currently in the
execution stage writes to. If this condition holds, the control unit stalls
the instruction by one clock cycle. It also stalls the instruction in the
fetch stage, to prevent the instruction in that stage from being overwritten
by the next instruction in the program.
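The stall check just described can be sketched as follows; the (opcode, destination, sources) instruction representation is invented for illustration:

```python
# Sketch: the decode-stage stall check -- stall when the decoding instruction
# reads a register that the executing instruction will write.

def must_stall(decoding, executing):
    """Each instruction is (opcode, dest_reg, src_regs) -- an illustrative format."""
    if executing is None:
        return False
    _, exec_dest, _ = executing
    _, _, dec_srcs = decoding
    return exec_dest in dec_srcs

add = ("add", "r3", ("r1", "r2"))
mul = ("mul", "r5", ("r3", "r4"))    # reads r3, which add writes

print(must_stall(mul, add))          # True  -- insert a bubble
print(must_stall(add, None))         # False -- nothing in the execute stage
```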
Dynamic Scheduling