BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK INTEGRATED LEARNING PROGRAMMES
STUDENT NOTES
Course Title COMPUTER ORGANIZATION AND ARCHITECTURE
Course No CSIW ZC353
Date August 2022
Course Objectives
No Objective
CO1 To understand the structure, function, organization and architecture of modern-day
computing systems.
CO2 To learn the major components of a computer and their interconnections, both with
each other and the outside world with detailed discussion of internal and external
memory and of input–output (I/O) devices.
CO3 To examines the internal architecture and organization of the processor with an
extended discussion of computer arithmetic and the instruction set architecture.
CO4 To give an introduction of parallel organization, including symmetric multiprocessing,
clusters, and multicore architecture.
CO5 To learn the Hardware Description Language and simulator to design and verify the
basic components of a computing system.
Text Book(s)
T1 Stallings William, Computer Organization & Architecture, Pearson Education, 8th Ed.,
2010.
Reference Book(s)
R1 C Hamacher, Z Vranesic and S Zaky, Computer Organization by McGrawHill, 5th Ed.
2002
R2 Hennenssy& D.A. Patterson, Computer Organization & Design, Morgan Kaufmann 4th
Ed., 2009
R3 The Essentials of Computer Organization and Architecture, Linda Null and Julia
Lobur, Jones and Bartlett publisher, 2003 [24x7 Online Book]
Module 1
Computer System Components and Interconnections
Computer Architecture refers to those attributes of a system visible to a
programmer or, put another way, those attributes that have a direct impact on the
logical execution of a program.
Computer Organization refers to the operational units and their interconnections
that realize the architectural specifications.
Examples of Architectural attributes include the instruction set, the number of bits
used to represent various data types (e.g., numbers, characters), I/O mechanisms, and
techniques for addressing memory.
Organizational attributes include those hardware details transparent to the
programmer, such as control signals; interfaces between the computer and
peripherals; and the memory technology used.
A computer is a complex system; contemporary computers contain millions of
elementary electronic components. The hierarchical nature of complex systems is
essential to both their design and their description.
The designer need only deal with a particular level of the system at a time. At each
level, the system consists of a set of components and their interrelationships.
The behavior at each level depends only on a simplified, abstracted characterization
of the system at the next lower level. At each level, the designer is concerned with
structure and function:
• Structure: The way in which the components are interrelated
• Function: The operation of each individual component as part of the structure
Function:
In general terms, there are four functions of a computer system
Data processing i.e., the computer must be able to process data.
Data storage the computer must be able to store data.
Data movement the computer must be able to move data between itself and
the outside world.
Control the computer must be able to control all these three functions.
Structure:
The main structural components are:
Central processing unit (CPU): Controls the operation of the computer and
performs its data processing functions; often simply referred to as processor.
Main memory: Stores data.
Input/Output: Moves data between the computer and its external
environment.
System interconnection: Some mechanism that provides for communication
among CPU, main memory, and I/O. A common example of system
interconnection is by means of a system bus, consisting of a number of
conducting wires to which all the other components attach.
Control unit: Controls the operation of the CPU and hence the computer
Arithmetic and logic unit (ALU): Performs the computer’s data processing
functions
Registers: Provides storage internal to the CPU
CPU interconnection: Some mechanism that provides for communication
among the control unit, ALU, and registers.
A Brief History of Computers
ENIAC was the first general purpose electronic digital computer
oBy John Mauchly and John Eckert in 1946
Von Neumann Machine (1946) called as IAS
o Based on stored program concept
PDP-1 Computer (1957)
o Developed by Digital Equipment Corporation (DEC)
o First step towards mini computers
IBM SYSTEM/360 (1964)
o First planned family of computers
Vacuum tube (1946-1957) & Transistor (1958-1964)
Integrated Circuits
o Small scale integration - 1965 on
Up to 100 devices on a chip
Medium scale integration - to 1971
100 - 3,000 devices on a chip
Large scale integration - 1971-1977
3,000 - 100,000 devices on a chip
Very large-scale integration - 1978 -1991
100,000 - 100,000,000 devices on a chip
Ultra-large-scale integration – 1991 onwards
Over 100,000,000 devices on a chip
Computer Evolution and Performance
Concept of program
COMPUTER FUNCTION: The basic function performed by a computer is execution of a
program, which consists of a set of instructions stored in memory. The processor does the
actual work by executing instructions specified in the program such as
• Fetch instructions
• Interpret instructions
• Fetch data
• Process data
• Write/store data
In its simplest form, instruction processing consists of two steps:
The processor reads (fetches) instructions from memory one at a time and executes each
instruction. Program execution consists of repeating the process of instruction fetch and
instruction execution. The instruction execution may involve several operations and depends
on the nature of the instruction
PC = Program counter
IR = Instruction register
MAR = Memory address register
MBR = Memory buffer register
I/O AR = Input/output address register
I/O BR = Input/output buffer register
The Evolution of the Intel x86 Architecture
8086
o Much more powerful (16-bit data)
o Instruction cache, pre-fetch few instructions
o 8088 (8-bit external bus) used in first IBM PC
80286
o 16 MByte memory addressable
o Up from 1MB (in 8086)
80386
o 32-bit processor with multitasking support
80486
Sophisticated powerful cache and instruction pipelining
o
Built in maths co-processor
o
Pentium
o Superscalar
o Multiple instructions executed in parallel
Pentium Pro
o Increased superscalar organization
o Aggressive register renaming
o Branch prediction and Data flow analysis
Pentium II
o MMX technology, graphics, video & audio processing
Pentium III
o Additional floating-point instructions for 3D graphics
Pentium 4
o Further floating point and multimedia enhancements
Itanium Series
o 64 bit with Hardware enhancements to increase speed
Fetch and execute cycles
The processing required for a single instruction is called an instruction cycle
divided into two steps which are referred to as the fetch cycle and the execute cycle.
Fetch Cycle:
• Program Counter (PC) holds address of next instruction to fetch
• Processor fetches instruction from memory location pointed by PC
• Increment PC
– Unless told otherwise
• Instruction loaded into Instruction Register (IR)
• Processor interprets instruction and performs required actions
Execute Cycle:
Data transfer between CPU and Main Memory
• Data transfer between CPU and I/O module
• Some arithmetic or logical operation on data
• Alteration of sequence of operations
– e.g. jump
Interrupts
• Mechanism by which other modules (e.g. I/O) may interrupt normal sequence of
processing Program
• Timer:
– Generated by internal processor timer
– Used in pre-emptive multi-tasking
I/O
– From I/O controller
• Hardware failure
– e.g. Memory parity error
Performance Assessment
Computer Performance
Performance is one of the key parameter to-
Evaluate processor hardware
Measure requirement for new systems
When we say one computer has better performance than another, what does it
mean? Criterion for the performance?
Response Time (Single computer)
Throughput (Data center)
Clock Rate
Operation performed by processor are governed by system clock (fundamental
level of processor speed measurement)
Generated by quartz crystal
A clock cycle is the basic unit of time to execute one operation.
Clock Rate (clock cycles per second in MHz or GHz) is inverse of clock cycle
time (clock period)
Performance
How a processor performs when executing a given application
Application performance depends upon
Speed of the processor
Instruction set
Choice of language
Efficiency of the compiler
Programming skill of the user
CPU Performance
To maximize performance, need to minimize execution time
performance = 1 / execution_time
Cycles Per Instruction (CPI)
• Program execution time (T) = Ic x CPI x t
– t=cycle time, Ic = no. of instructions in the program
• Instruction execution also requires memory access
– T = Ic x [p +(m x k] x t
– p=processor cycles, m = memory references
– k = ratio of memory cycles to processor cycles
• These performance factors are influenced by
– ISA, compiler, processor implementation and memory hierarchy
MIPS Rate:
• Common measure of performance
– Millions Instructions Per Second (MIPS) rate
– Ic/(T x 106)
– This can be written as: f/(CPI x 106)
Benchmark Programs
• It is a collection of a programs that provides representative test of a computer
in a particular application area
– e.g. SPEC (System Performance Evaluation Corporation) benchmark
suites
– SPEC CPU 2006 is used for measuring performance for the
computational based applications
Top-Level View of Computer Function and Interconnection
Computer Components and connections:
• All the units must be connected
• Different type of connection for different type of unit
– Memory Connections
• Receives and sends data
• Receives addresses (of locations)
• Receives control signals
– CPU Connections
• Reads instruction and data
• Writes out data (after processing)
• Sends control signals to other units
• Receives (& acts on) interrupts
Types of Buses:
A Bus is a communication pathway connecting two or more devices .
Data Bus carries data . Width is a key determinant of performance
8, 16, 32, 64 bit
Address Bus identifies the source or destination of data
e.g. CPU needs to read an instruction (data) from a given location in memory.
•
• Control and timing information
– Memory read/write signal
– Interrupt request
– Clock signals
• Synchronous
– Events determined by clock signals and synchronized on leading edge
of clock
– All devices can read the clock line
– Usually a single cycle for an event
• Asynchronous
– The occurrence of one event on a bus follows and depends on the
occurrence of a previous event
– Events on the bus are not synchronized with clock
Bus Interconnection
• More than one module controlling the bus
– e.g. CPU and DMA controller
• Only one module may control bus at a time
• Arbitration may be-
– Centralized
• Single hardware device controlling bus access
– Distributed
• Each module may claim the bus
• Control logic on all modules
PCI Bus Example:
MODULE - 2
Central Processing Unit -Computer Arithmetic
The Arithmetic and Logic Unit (ALU):
• Does the calculations
• Everything else in the computer is there to service this unit
• Handles integers
• May handle floating point (real) numbers
• May be separate FPU (math’s co-processor)
Integer Representation:
• Only have 0 & 1 to represent everything
• Positive numbers stored in binary
– e.g. 41=00101001
• No minus sign
Sign-Magnitude Representation
• Left most bit is sign bit
• 0 means positive
• 1 means negative
Two’s Complement Representation
– Most significant bit is treated as a sign bit
– 0 for positive and 1 for negative
– Positive numbers are represented same as sign magnitude representation
– Negative numbers are represented in 2’s complement form
Integer Arithmetic:
Addition and Subtraction:
• Normal binary addition
• Take twos complement of subtrahend and add to minuend
– i.e. a - b = a + (-b)
– Hardware Implementation is simple
– Addition and complement circuits required
Multiplication:
• Works out on partial product for each digit
• Takes care with place value (column)
• Adds partial products
Booth’s Algorithm for multiplication:
èMultiplicand M unchanged
èBased upon recoding the multiplier Q
to a recoded value R
èEach digit can assume a negative as well as
positive and zero values
èKnown as Signed Digit ( SD) encoding
Division Algorithm
1. Load divisor in M register and dividend into the A,Q registers
2. Shift A, Q left 1 bit position
3. If M and A have the same signs, do
A=A-M otherwise A=A+M
4. a) If preceding operation is successful or MSB of A=0,
then set Q0ß1
b) If operation is unsuccessful and MSB of A<>0, then set Q 0ß0 and restore the
previous value of A
5. Repeat steps 2 through 4 as there are bit positions in Q
6. The remainder is in A. The quotient is in Q.
Floating Point Representation:
• We need a way to represent
– numbers with fractions, e.g., 3.1416
– very small numbers (in absolute value), e.g., .00000000023
– very large numbers (in absolute value) , e.g., –3.15576 * 1046
• Representation:
– scientific: sign, exponent, significand form:
• (–1)sign * significand * 2exponent . E.g., –101.001101 * 2111001
– more bits for significand gives more accuracy
– more bits for exponent increases range
– if 1 £ significand < 10two(=2ten) then number is normalized, except
for number 0 which is normalized to significand 0
E.g., –101.001101 * 2111001 = –1.01001101 * 2111011 (normalized)
Floating-Point Arithmetic:
Floating point addition:
• Make both exponents the same
– Find the number with the smaller one
– Shift its mantissa to the right until the exponents match
• Must include the implicit Mantissa
• Add the mantissas
• Choose the largest exponent
• Put the result in normalized form
– Shift mantissa left or right until in Mantissa form
– Adjust exponent accordingly
• Handle overflow or underflow if necessary
• Round
• Renormalize if necessary if rounding produced an un normalized result
Floating point multiplication:
• Add the exponents and subtract the bias from the sum
– Example: (5+127) + (2+127) – 127 = 7+127
• Multiply the mantissas
• Put the result in normalized form
– Shift mantissa left or right until in form 1.M
– Adjust exponent accordingly
• Handle overflow or underflow if necessary
• Round
• Renormalize if necessary if rounding produced an unnormalized result
• Set S=0 if signs of both operands the same, S=1 otherwise
Floating Point Complexities:
• In addition to overflow we can have underflow (number too small)
• Accuracy is the problem with both overflow and underflow because we have
only a finite number of bits to represent numbers that may actually require
arbitrarily many bits
– limited precision Þ rounding Þ rounding error
– IEEE 754 keeps two extra bits, guard and round
– four rounding modes
– positive divided by zero yields infinity
– zero divide by zero yields not a number
– other complexities
• Implementing the standard can be tricky
MODULE - 3
Instruction Set Architecture
Machine Instruction Characteristics:
Microprocessor Registers:
• Visible registers
– Addressable during application programming
• Invisible registers
– Addressable in system programming
• 8086,8088,80286 contain 16 bit internal architecture
• 80386-Pentium 4 contain 32 bit internal architecture
Multipurpose Registers
• AX- Accumulator
– Addressable as AX, AH, or AL
– Mainly used for multiplication, division
• BX- Base Index
– Some times holds the offset address of a location
• CX – Count
– Holds count for instructions
• DX- Data
– Holds a part of the result
• BP- Base Pointer
– Points to a memory location for memory data transfers.
• DI- Destination Index
– Addresses string destination data for the string instructions.
• SI- Source Index
– Addresses string source data for the string instructions.
Special Purpose Registers
• IP- Instruction Pointer
– Addresses the next instruction
• SP- Stack Pointer
– Addresses area of memory called stack
• FLAGS
– For controlling the microprocessor operation
– Not modified for data transfer or program control operation
• IP- Instruction Pointer
– Addresses the next instruction
• SP- Stack Pointer
– Addresses area of memory called stack
• FLAGS
– For controlling the microprocessor operation
– Not modified for data transfer or program control operation
• ES- Extra Segment
– An additional data segment used by string instructions
• SS- Stack
– Memory used for the stack
Instruction Set:
• The complete collection of instructions that are understood by a CPU
– Machine Code or Binary
– Usually represented by assembly codes
– In machine code each instruction has a unique bit pattern
– For human consumption a symbolic representation is used
• e.g. ADD, SUB, LOAD
– Operands can be represented in this way
• ADD A, B
General Instruction Format
– LABEL OPCODE OPERAND(s) ; COMMENT
– Example
• Next: MOV AL,BL ; Transfers BL contents to AL
• MOV AL,12h ; Transfers 12 to AL
• MOV AL, [1234h]; Transfers one byte data from memory location
given by [DS+1234] to AL
• Label and comments are optional!!!
– OPERAND(s) may be
• Part of instruction
• Reside in registers of the processor
• Reside in memory location
Addressing Modes and Formats
An instruction must contains the information about:
How to get the operands Called as ADDRESSING MODES
Essentially it tells where the operands are available and how to get them
The addressing modes available are
Immediate
Register operand
Displacement
Base
Base with displacement
Scaled index with displacement
Base with index and displacement
Base scaled index with displacement
Relative
Immediate addressing mode
The addressing mode in which the data operand is a part of the instruction itself is known as
immediate addressing mode.
Example
MOV CX, 4929 H, ADD AX, 2387 H, MOV AL, FFH
Register addressing mode
It means that the register is the source of an operand for an instruction.
Example
MOV CX, AX ; copies the contents of the 16-bit AX register into
; the 16-bit CX register),
ADD BX, AX
Direct addressing mode
The addressing mode in which the effective address of the memory location is written
directly in the instruction.
Example
MOV AX, [1592H], MOV AL, [0300H]
Register indirect addressing mode
This addressing mode allows data to be addressed at any memory location through an offset
address held in any of the following registers: BP, BX, DI & SI.
Example
MOV AX, [BX] ; Suppose the register BX contains 4895H, then the contents
; 4895H are moved to AX
ADD CX, {BX}
Based addressing mode
In this addressing mode, the offset address of the operand is given by the sum of contents of
the BX/BP registers and 8-bit/16-bit displacement.
Example
MOV DX, [BX+04], ADD CL, [BX+08]
Indexed addressing mode
In this addressing mode, the operands offset address is found by adding the contents of SI or
DI register and 8-bit/16-bit displacements.
Example
MOV BX, [SI+16], ADD AL, [DI+16]
Based-index addressing mode
In this addressing mode, the offset address of the operand is computed by summing the base
register to the contents of an Index register.
Example
ADD CX, [AX+SI], MOV AX, [AX+DI]
Based indexed with displacement mode
In this addressing mode, the operands offset is computed by adding the base register contents.
An Index registers contents and 8 or 16-bit displacement.
Example
MOV AX, [BX+DI+08], ADD CX, [BX+SI+16]
Assembly to Machine Conversion in 8086:
• Byte one contains
– OPCODE (6 bits)
• Specifies the operation to be performed. Ex. ADD, SUB, MOV
– Register direction (D) bit
• Tells the register operand in REG field in byte two, is source or
destination operand
• D=1: Data flow to the REG field from R/M (destination)
• D=0: Data flow from the REG field to the R/M (source)
– Data size (W) bit
• Specifies whether the operation will be performed on 8 bit or 16 bit W
=0: 8 bits and W=1: 16 bits
• Register field (REG): 3 bits
– To identify the register for the first operand
• Mode field (MOD): 2 bits
• Register/Memory field (R/M): 3 bits
– 2 bit MOD field and 3 bit R/M field together specify the second operand
Assembly Language Programming
Types of Instructions
• Data movement instructions
• Arithmetic - add, subtract, increment, decrement, convert byte/word and compare.
• Logic - AND, OR, exclusive OR, shift/rotate and test.
• String manipulation - load, store, move, compare and scan for byte/word.
• Control transfer - conditional, unconditional, call subroutine and return from
subroutine.
• Input/Output instructions.
• Other - setting/clearing flag bits, stack operations, software interrupts, etc.
Module - 4
Cache Memory Organization
Memory Hierarchy
• Computer memory exhibits the widest range of
– Physical Type
• Semiconductor (RAM), Magnetic (Disk), Optical (CD)
– Physical Characteristics
• Volatile/Non Volatile, Erasable/Non Erasable
– Organization
• Physical arrangement of bits
– Performance
• Access time (Latency), Transfer time, Memory cycle time
• Location
– CPU, Internal and External
• Capacity
– Number of Words/Bytes
• Unit of transfer
– Internal
• Usually governed by data bus width
– External
• Usually a block which is much larger than a word
• Addressable unit
– Smallest location which can be uniquely addressed
– Word or byte internally and Cluster on disks
• Access Methods
– Sequential
• Start at the beginning and read through in order
• Access time for a record is location dependent
• e.g. Magnetic Tape
– Direct
• Individual blocks have unique address based on physical location
• Access is by jumping to vicinity plus sequential search
• Access time depends on location and previous location e.g. Hard Disk
• Access Methods
– Random
• Wired-in addressing mechanism
• Individual addresses identify locations exactly
– Associative
• Data is located based on a portion of its contents rather than its address
• Called as Content Addressable Memory (CAM)
Cache Memory
It is a Small amount of fast memory that sits between normal main memory and CPU
Mapping Function:
The correspondence between the main memory blocks (group of words) and in the
cache lines is specified by a mapping function
Direct Mapping
• Each block of main memory maps to only one cache line
– i.e. if a block is in cache, it must be in one specific place
• Mapping Function
– jth Block of the main memory maps to ith cache line
• i = j modulo M (M = number of cache lines)
Fully Associative Mapping
A main memory block can load into any line of cache memory
Memory address is interpreted as tag and word
Tag uniquely identifies block of main memory
Set Associative Mapping
• Cache is divided into a number of sets
• Each set contains a fixed number of lines
• A given block maps to any line in a given set
– e.g. 2 lines per set
– 2 way associative mapping
– A given block can be in one of 2 lines in the set
Cache Line Replacement:
When all lines have valid data and a new block is to be bring from main memory to
the cache
– Using replacement Algorithms
• Ex. LRU, LFU, FIFO, Random etc
– Least Recently used (LRU)
– Oldest referenced block is removed
– First in first out (FIFO)
– Replace the block that has been in cache since longest time
– Least frequently used (LFU)
– Replace block which has had fewest hits
– Random
– Replace any block randomly
Elements of Cache Design:
Cache Performance:
• More the processor references found in the cache better the performance
– If a reference found in the cache it is HIT otherwise MISS
– A penalty is associated with each MISS occurred
– More hits reduces average access time
Types of Misses
Compulsory— Initially cache is empty or no valid data in it. So the first access to a block is
always miss. Also called cold start misses or first reference misses.
Capacity—If the cache size is not enough such that it can not accommodate sufficient blocks
needed during execution of a program then frequent misses will occur. Capacity misses are
those misses that occur regardless of associativity or block size.
Conflict—If block-placement strategy is set associative or direct mapped, conflict misses (in
addition to compulsory & capacity misses) will occur because a block can be discarded and
later retrieved if too many blocks map to its set. Also called collision misses
• Hit Ratio(H)
– Ratio of number of references found in the higher level memory (M1) (i.e.
cache memory) to the total references
• Average Access time
– Ts =H*T1 + (1-H)*(T1+T2)
– T1=Access time of M1 (Cache)
– T2=Access time of M2 (Main Memory)
Cost
Cs = (C1S1+C2S2)/(S1+S2)
• Access Efficiency (T1/Ts)
– Measure of how close average access time is to M1 access time
– On chip cache access time is about 25 to 50 times faster than main memory
– Off chip cache access time is about 5 to 15 times faster than main memory
access time
Write Through Policy
• All writes go to main memory as well as cache memory
• Multiple CPUs have to monitor main memory traffic to keep local (to CPU) cache up
to date
• Implications
– Lots of traffic and Slows down writes
Write Back Policy
• Updates initially made in cache only
• Update bit for cache slot is set when update occurs
• If block is to be replaced, write to main memory only if update bit or dirty bit is set
– Implications
• Other caches get out of sync
• I/O must access main memory through cache
Module-5
Internal Memory
• Data transfer between the processor and the memory takes place through the two
registers
– MAR and MBR or MDR
• Memory Speed measurement
– Memory Access Time
– Memory Cycle Time
• Minimum time delay required between the initiation of two successive
memory operations
– Memory Cycle time is usually slightly longer than the access time!
– Memory Cycle time for Semiconductor memories ranges 10 to 100 ns
Static RAM:
• Memories that consists of circuits capable of retaining their state as long as power is
applied
• Bits stored as on/off switches
• Complex construction so larger per bit and more expensive
• Transistor arrangement gives stable logic state
• State 1
– C1 high, C2 low, T1 T4 off, T2 T3 on
• State 0
– C2 high, C1 low, T2 T3 off, T1 T4 on
• Address line transistors are T5 and T6
Dynamic RAM
• Bits stored as charge in capacitors charges leak so need refreshing even when
powered
• Simpler construction and smaller per bit so less expensive
• Address line active when bit read or written
– Transistor switch closed (current flows)
• Slower operations, used for main memory
Synchronous DRAM (SDRAM)
• Access is synchronized with an external clock
• Address is presented to RAM
• RAM finds data (CPU waits in conventional DRAM)
• Since SDRAM moves data in time with system clock, CPU knows when data will be
ready
• CPU does not have to wait, it can do something else
Flash memory
Similar technology as EEPROM i.e. single transistor controlled by trapped charge
A single cell can be read but writing is on block basis and previous contents are erased
Greater density and a lower cost per bit, and consumes less power for operation
Used in MP3 players, Cell phones, digital cameras
Larger flash memory modules are called Flash Drives (i.e. Solid State Storage Devices)
External memory
• Semiconductor memory can not be used to store large amount of information or data
– Due to high per bit cost of bits
• Large storage requirements is full filled by
– Magnetic disks, Optical disks and Magnetic tapes
– Called as secondary storage
Magnetic Disk Structure
• Disk substrate (non magnetizable material) coated with magnetizable material (e.g.
iron oxide…rust)
• Advantage of glass substrate over aluminium
– Improved surface uniformity
– Reduction in surface defects
• Reduced read/write errors
– Lower fly heights
– Better stiffness
– Better shock/damage resistance
•
Data Organization on Disk
• Concentric rings called tracks
– Gaps between tracks
– Same number of bits per track
– Constant angular velocity
• Tracks divided into sectors
• Minimum block size is one sector
• Disk rotate at const. angular velocity
– Gives pie shaped sectors
Disk Characteristics
Fixed (rare) or movable head
Fixed head
One read/write head per track mounted on fixed ridged arm
Movable head
One read/write head per side mounted on a movable arm
Removable or fixed disk
Single or double (usually) sided
Head mechanism
Contact (Floppy), Fixed gap, Flying (Winchester)
Single or multiple platter
Computing Disk Capacity
• Capacity =(# bytes/sector) x (avg. # sectors/track) x
(# tracks/surface) x (# surfaces/platter) x
(# platters/disk)
• Example:
– 512 bytes/sector, 300 sectors/track (average)
– 20,000 tracks/surface, 2 surfaces/platter
– 5 platters/disk
– Capacity = 512 x 300 x 20000 x 2 x 5 = 30.72GB
Disk Performance Parameters
• Seek time (Ts)
– Time require to positioned the head on the desired track
• Rotational delay
– Time require to positioned desired sector under r/w head
• Transfer time
• The total average access time is:
Ta = Ts+ 1/2r + b/rN
– Here Ts is Average seek time
– r is rotation speed in revolution per second
– b number of bytes to be transferred
– N number of bytes on a track
RAID Structure
• RAID ( Redundant array of independent disks)– multiple disk drives provides
reliability via redundancy.
• Disk striping uses a group of disks as one storage unit.
• RAID schemes improve performance and improve the reliability of the storage system
by storing redundant data.
– Mirroring or shadowing keeps duplicate of each disk.
– Block interleaved parity uses much less redundancy.
– RAID is arranged into six different levels.
• RAID Level 0:- RAID level 0 refers to disk arrays with striping at the level of
blocks but without any redundancy (such as mirroring or parity bits), as shown
Figure.
• RAID Level 1:- RAID level 1 refers to disk mirroring.
• RAID Level 2:- RAID level 2 is also known as memory-style code error
correcting code (ECC) organization. Memory systems have long detected certain
errors by using parity bits. Each byte in a memory system may have a parity bit
associated with it that records whether the number of bits in the byte set to 1 is even
(parity = 0) or odd (parity = 1). If one of the bits in the byte is damaged (either a 1
becomes a 0, or a 0 becomes a 1), the parity of the byte changes and thus will not
match the stored parity. Similarly, if the stored parity bit is damaged, it will not match
the computed parity. Thus, all single-bit errors are detected by the memory system.
• RAID Level 3:- RAID level 3, or bit-interleaved parity organization, improves on
level 2 by taking into account the fact that, unlike memory systems, disk controllers
can detect whether a sector has been read correctly, so a single parity bit can be used
for error correction as well as for detection. The idea is as follows: If one of the
sectors is damaged, we know exactly which sector it is, and we can figure out whether
any bit in the sector is a 1 or a 0 by computing the parity of the corresponding bits
from sectors in the other disks. If the parity of the remaining bits is equal to the stored
parity, the missing bit is 0; otherwise, it is 1. RAID level 3 is as good as level 2 but is
less expensive in the number of extra disks required (it has only a one-disk overhead),
so level 2 is not used in practice.
• RAID Level 4:- RAID level 4, or block-interleaved parity organization, uses block-
level striping, as in RAID and in addition keeps a parity block on a separate disk for
corresponding blocks from A! other disks. This scheme is diagramed in Figure (e). If
one of the disks fails, the parity block can be used with the corresponding blocks from
the other disks to restore the blocks of the failed disk.
• RAID Level 5:- RAID level 5, or block-interleaved distributed parity, differs
from level 4 by spreading data and parity among all N + 1 disks, rather than storing
data in N disks and parity in one disk. For each block, one of the disks stores the
parity, and the others store data.
• RAID Level 6:- RAID level 6, also called the P + Q redundancy scheme, is much like
RAID level 5 but stores extra redundant information to guard against multiple disk
failures. Instead of parity, error correcting codes such as the Reed-Solomon codes are
used. In the scheme shown in Figure 2 bits of redundant data are stored for every 4
bits of compared with 1 parity bit in level 5 --- and the system can tolerate two disk
failures.
Module-6
Input/Output Organization
• Wide variety of peripherals
– Impractical to incorporate the necessary logic within processor
• Delivering different amounts of data
– At different speeds, different formats and word length
• Speed mismatch between processor/memory and I/O devices
– Some I/O devices are slower than processor and memory while some are faster
than processor and memory
I/O MODULE
I/O Module Functions
• Control & Timing
– To coordinate the flow of traffic between internal resources and external
devices
– Control Signals: Read, Write, Ready, Busy, Error etc.
• Processor and Device Communication
– It must communicate with the processor and the external device before
transferring the data
– Processor communication involves
• Command decoding, Data, Status reporting, Address recognition
• Data Buffering
– To overcome the speed mismatch between CPU or memory and device
• Error Detection
– Mechanical, electrical malfunctioning and transmission
– e.g. paper jam, disk bad sector
I/O Methods:
• Programmed
• Interrupt driven
• Direct Memory Access (DMA)
Programmed I/O:
• CPU has direct control over I/O
– Sensing status
– Read/write commands
– Transferring data
• CPU waits for I/O module to complete operation
• Commands
• Control - telling module what to do
• Test - check status
• Read/Write
• Memory mapped I/O
– Devices and memory share an address space
– I/O looks just like memory read/write
– No special commands for I/O
• Large selection of memory access commands available
• Isolated I/O
– Separate address spaces
– Need I/O or memory select lines
– Special commands for I/O
Interrupt Driven I/O
CPU issues read command
I/O module gets data from peripheral whilst CPU does other work
I/O module interrupts CPU once data ready
If interrupted:-
Save context (registers values)
Process interrupt
Fetch data & store
Direct Memory Access:
• A special control unit can be provided to allow transfer of a block of data directly
between an external device and the main memory.
• No continuous intervention by the processor is required
• This control circuit can be part of I/O device interface and called as DMA controller
• Good for moving large volume of data
DMA Module and Operation:
• CPU tells DMA controller:-
– Read/Write
– Device address
– Starting address of memory block for data
– Amount of data to be transferred
• CPU carries on with other work
• DMA controller deals with transfer
• DMA controller sends interrupt when finished
DMA Transfer Method: Cycle Stealing
• Memory accesses by the processor and by the DMA controllers are interwoven
• DMA controller takes over bus for a cycle
• Transfer of one word of data
• Not an interrupt
– CPU does not switch context
• CPU suspended just before it accesses bus
– i.e. before an operand or data fetch or a data write
• Slows down CPU but not as much as CPU itself doing transfer
DMA Transfer Method: Burst Mode
• DMA controller is given exclusive access to the main memory to transfer a block of
data without interruption
• Most DMA controllers incorporate a data storage buffer
• It reads a block of data using burst mode from the main memory and stores it into its
input buffer
• Then it is transferred to the desired I/O device
Module-7
Scalar and Super Scalar Instruction Pipeline
Instruction Execution Characteristics:
• High Level Language (HLL) programs mostly comprises-
– Assignment statements (e.g. AßB)
– Conditional statements (e.g. IF, LOOP)
– Procedure call/return (e.g. Functions in C)
• Key for ISA designer-
– These statements should be supported in optimal fashion
RISC Characteristics
• Relatively few instructions
– 128 or less
• Relatively few addressing modes
– Memory access is limited to LOAD and STORE instructions
• All operations done within the registers of the CPU
– Use of overlapped register windows for optimization
• Fixed Length (4 Bytes), easily decoded instruction format
– Instruction execution time consistent
• Hardwired control
CISC Characteristics
• A large number of instructions
• Some instructions for special tasks used infrequently
• A large variety of addressing modes (5 to 20)
• Variable length instruction formats
• Instructions that manipulate operands in memory
• However, it soon became apparent that a complex instruction set has a number of
disadvantages
• These include a complex instruction decoding scheme, an increased size of the control
unit, and increased logic delays.
RISC and CISC Comparison
• Computer performance equation
– CPU Time = (instructions/program) x (avg. cycles/instruction) x
(seconds/cycle)
• Example CISC Program
– mov ax, 20
– mov bx, 5
– mul ax, bx
• Example RISC Program
mov ax,0
mov bx, 20
mov cx,5
again add ax, bx
loop again
Instruction Pipelining
• New inputs are accepted at one end before previously accepted inputs appear as
outputs at the other end The same concept can be apply to the Instruction execution
• Execution usually does not access main memory.
• Next instruction can fetched during execution of current instruction.
• Called instruction prefetch
• Six stages during Instruction execution
• Fetch Instruction (FI)
• Decode Instruction (DI)
• Calculate Operands (i.e. EAs) (CO)
• Fetch Operands (FO)
• Execute Instructions (EI)
• Write result or Operand (WO)
Pipeline Performance
• Total time required to execute n instructions for a pipeline with k stages and cycle
time ‘t’
Tk,n = [k+(n-1)]t
• Speedup factor
– Sk = nk/[k+(n-1)]
Four Stage Pipeline
• Fetch: Read the instruction from the memory
• Decode: Decode the instruction and fetch the source operand(s)
• Execute: Perform the ALU operation
• Write: Store the result in the destination location
• Note:
– Each stage in pipeline is expected to complete its operation in one clock cycle;
hence the clock period should be sufficiently long to complete the task being
performed in any stage.
Stalls/Hazards in Pipeline
• Any condition that causes the pipeline to stall is called Hazard
Data Hazards
– Any condition in which either the source or the destination operands of an
instruction are not available at the expected time in the pipeline
Resource Hazards
• One instruction may need to access memory as part of the Write stage while another
instruction is being Fetched
– If instruction and data are in same cache unit there is a conflict
– Called as Structural or Resource conflict or Resource Hazard
Instruction/Control Hazards
• If the required instruction is not available-
– Example: Cache miss for instruction fetch, Branch instruction
– Called as control hazards or Instruction hazards
Prediction Based Algorithms for Handling Conditional Branches
• Predict never taken
– Assume that jump will not happen. Always fetch next instruction
• Predict always taken
– Assume that jump will happen. Always fetch target instruction
• Predict by opcode
– Some instructions are more likely to result in a jump than others. Can get up to
75% success
• Taken/Not taken switch
– Based on previous history.
Superscalar Processor Design Issues
• Instruction level parallelism
– If instructions in a sequence are independent, execution can be overlapped
– Degree of instruction level parallelism is governed by data and procedural
dependency
• Machine Parallelism
– Ability to take advantage of instruction level parallelism
– Governed by number of parallel pipelines
• Note
– A program may not have enough instruction level parallelism to take full
advantage of machine parallelism
Instruction Issue Policy
• Instruction Issue
– Refer to the process of initiating instruction execution in the processor’s
functional units
– Occurs when instruction moves from the decode stage to the first execute
stage
• Instruction Issue Policy
– Refer to the protocol used to issue instructions
– Processor look ahead to locate instructions that can be brought into the
pipeline and executed
• Ordering issues
– Order in which instructions are fetched, executed and change contents of
registers and memory
Issue Policies
• In-order issue with in-order completion
• In-order issue with out-of-order completion
• Out-of-order issue with out-of-order completion
Machine Parallelism
• To enhance the performance of the Super Scalar Processors
– Duplication of Resources, Out of order instruction issue
– Register Renaming
Module-8
Micro-operations and Control Unit
• An instruction execution involves the execution of a sequence of sub steps, called
cycles
– Ex: Fetch cycle, Indirect cycle, execute cycle, interrupt cycle
• Each cycle in turn is made up of a sequence of more fundamental operations called
micro-operations
– Ex: Register to Register transfer, Register to external bus transfer, ALU
operation etc.
Control Unit:
• Generates the control signals to execute each micro-operation
• Causes the processor to execute micro-operations in proper sequence
• The control signals generated by the CU cause the opening and closing of logic
gates, resulting in the execution of the micro-ops
Types of Micro-operations:
• Transfer data between registers
• Transfer data from register to external interface
• Transfer data from external interface to register
• Perform arithmetic or logical ops
Control Unit –Inputs
• Clock
– To keep time
• Instruction register
– Op-code for current instruction determines which micro-instructions are
performed
• Flags
– State of CPU
– Results of previous operations
• From control bus
– Interrupts, Acknowledgements
Hardwired Implementation of Control unit:
• Flags
– Needed by CU to determine the status of the processor and outcome of the
previous ALU operation. (e.g. ISZ X)
• Instruction register
– Op-code causes different control signals for each different instruction
– Unique logic for each op-code
• Clock
– To keep time
– Repetitive sequence of pulses. Useful for measuring duration of micro-ops
– Different control signals at different times within instruction cycle
– Need a counter with different control signals for t1, t2 etc.
At the end of an instruction cycle, the control unit must feed back to the counter to reinitialize
it at T1
• Control Signals from Control Bus
– System bus control lines provides signals to the control unit
• Output of the CU
– Control signals within the processor
• Data movement between Registers
• Activate ALU operations
– Control signals to control bus
• To the memory
• To the I/O modules
Micro-programmed Control Unit:
Logic of the control unit is specified by a micro program
Microgram is a sequence of micro instructions
E.g. MAR ß (IRadd) (IRout, MARin, Mread)
Microinstruction consists of micro-operations
Eg: IRout, MARin, PCin etc
A micro operation is enabled by one or a set of control signals
Hence micro instruction is defined by set of control signals, represented by sequence of
1s and 0s. Also called as Control Word (CW)
Sequence of instructions to control complex operations is called micro-programming or
firmware
It is a logic circuit that does the sequencing through microinstructions and generates
control signals to execute each microinstruction
Microinstruction Format
• Assign one bit position to each control signal in each micro instruction
– Today’s large microprocessor
• Many instructions and associated register-level hardware
• Many control points to be manipulated
• Large control memory àCostly stuff
– Most signals are not needed simultaneously
• Signals are mutually exclusive
– Ex. One ALU operation can be done at a time, PCin and PCout
cannot be active at same time
• Inference: Signals can be grouped
– Mutually exclusive signals are placed in same group
Micro-instruction Types
• Vertical micro-programming
– Each micro-instruction specifies single (or few) micro-operations to be
performed
• Considerable encoding of control information requires external
memory word decoder to identify the exact control line being
manipulated
• Horizontal micro-programming
– Each micro-instruction specifies many different micro-operations to be
performed in parallel
Microinstruction Sequencing
• If each machine instruction having a separate microroutine in control memory then:
– Simple sequencing logic will work
– But control memory size will increase
– For example if most of the instructions involve several addressing modes then
for each combination a separate microroutine needs to be stored, which leads
to lots of duplication
• So organization of memory should be such that common part can be shared by each
variation of a machine instruction
– Requires complicated sequencing logic
– Moreover branches are not always made to a single branch address
Function of Micro-programmed Control
1. Sequence logic unit issues read command
2. Word specified in control address register is read into control buffer register
3. Control buffer register contents generates control signals and next address information
4. Sequence logic loads new address into control address register based on next address
information from control buffer register and ALU flags
Module-9
Multiprocessor Organizations
Parallel Processor Architectures
Parallel Organizations –SIMD
• Operate element wise on vectors of data
– e.g. MMX and SSE instructions in x86
• All processors execute the same instruction at the same time
– Each with different data address, etc.
• Simplifies synchronization
• Reduced instruction control hardware
• Works best for highly data-parallel applications
Symmetric Multiprocessors (SMP):
• A stand alone computer with the following characteristics
– Two or more similar processors of comparable capacity
– Processors share same memory and I/O
– Processors are connected by a bus or other internal connection
– Memory access time is approximately the same for each processor
– All processors can perform the same functions (hence symmetric)
– System controlled by integrated operating system
– Providing interaction between processors
Interaction at job, task, file and data element levels
SMP Advantages
• OS of an SMP schedules processes across all the processors
• Performance
– If some work can be done in parallel
• Availability
– Since all processors can perform the same functions, failure of a single
processor does not halt the system
• Incremental growth
– User can enhance performance by adding additional processors
• Scaling
– Vendors can offer range of products based on number of processors
Cache Coherence
– Multiple copies of same data in different caches exist simultaneously, can
result in an inconsistent view of memory
• Two common write policies:
– Write back policy can leads to inconsistency
– Write through can also give problems unless caches monitor memory traffic
• Cache coherence is defined as the situation in which all cached copies of shared data
have the same value at all times
Software Solutions
• Compiler and operating system deal with problem
– Attractive because of, overhead transferred from run time to compile time and
design complexity transferred from hardware to software
• However, software tends to make conservative decisions
– Simple Approach: Prevent any shared data variables from being cached. This
leads to inefficient cache utilization???
– Efficient Approach: Analyze code to determine safe periods for caching shared
variables. Compiler needs to insert instructions to enforce cache coherence
during the critical periods
• Hardware based solutions are referred as cache coherence protocols
• Cache coherence protocols
– Directory protocols
– Snoopy protocols
Directory Protocols
• A directory (in main memory) contains the global state information about the contents
of the various local caches
• Individual cache controller requests are checked against directory stored in main
memory and appropriate transfers are performed
• Pros & Cons
– Creates central bottleneck
– Communication overhead between cache controllers and central controller
– Effective in large scale systems having multiple buses or with complex
interconnection schemes
Snoopy Protocols
• These protocols distribute cache coherence responsibility among cache controllers
• Cache must recognize when a line that it holds is shared
• Updates announced to other caches by broadcast mechanism
• Each cache able to “snoop” these broadcast messages
• Suited to bus based multiprocessor
• Increases the bus traffic due to broadcasting
• Two basic approaches are used with snoopy protocols:
– Write Invalidate (Multiple Readers one Writer)
– Write update (Multiple Writers and multiple Readers)
Write Invalidate
• Multiple readers, one writer
• When a write is required, all other caches of the line are invalidated
• Writing processor then has exclusive access until line required by another processor
• Used in Pentium II and PowerPC systems
• State of every line is marked as Modified, Exclusive, Shared or Invalid
Write Update
• Multiple readers and multiple writers
• Updated word is distributed to all other processors
• Neither of these two approaches is superior to the other under all possibilities
– Performance depends upon the number of local caches and the pattern of
memory reads and writes
• Some systems use an adaptive mixture of both solutions
MESI Protocol
• Modified- The line in the cache has been modified
• Exclusive-The line in the cache is same as in main memory and is not present in any
other cache
• Shared-The line in the cache is same as in the main memory and it present in another
cache
• Invalid-The line in the cache does not contain valid data
Clusters
• It is a group of interconnected whole computers working together as unified resource
gives illusion of being one machine
• Basically used for Server applications
• Each computer called a node
• Benefits
o High performance & High availability
o Absolute & Incremental scalability
Superior price/performance
Operating Systems Design Issues
• Failure Management
– High availability
• Failure occurs then quires in progress are lost
– Fault tolerant
• Ensures all resources are always available through redundant shared
disks and transaction backing system
• Failover: Switching applications & data from failed system to
alternative within cluster
• Failback: Restoration of applications and data to original system after
problem is fixed
• Load balancing
– Incremental scalability
– Automatically include new computers in scheduling
– Middleware needs to recognise that processes may switch between machines
• Parallelizing Computation
– Single application executing in parallel on a number of machines in cluster
– Parallelizing Complier
• Determines at compile time which parts can be executed in parallel
• Split off for different computers
– Parallelized Application
• Application written from scratch to be parallel
• Message passing to move data between nodes
• Hard to program but gives best end result!!!
– Parametric computing
• If a problem is repeated execution of algorithm on different sets of data
• e.g. simulation using different scenarios (different start conditions or
parameters)
• Needs parametric processing tools to organize and run in effective
manner
Blade Servers
• Common implementation of cluster approach
• Server houses multiple server modules (blades) in single chassis
– Save space
– Improve system management
– Chassis provides power supply
– Each blade has processor, memory, disk
Cluster vs. SMP
• Both provide multiprocessor support to high demand applications
• Both are available commercially
– SMP for longer time
• SMP
– Easier to manage and control
– Closer to single processor systems
• Scheduling is main difference to move from uni-processor to SMP
• Less physical space and Lower power consumption
• Clustering
– Superior incremental & absolute scalability
– Superior availability
Non-uniform Memory Access (NUMA)
• What is Uniform memory access?
– All processors have access to all parts of memory using load & store
– Access time to all regions of memory is the same
• What is Non-uniform memory access?
– All processors have access to all parts of memory using load & store
– Access time of processor differs depending on region of memory
– Different processors access different regions of memory at different speeds
• Cache coherent NUMA
– Cache coherence is maintained among the caches of the various processors
– Significantly different from SMP and clusters
– SMP has practical limit to number of processors
– Bus traffic limits to between 16 and 64 processors
• In clusters each node has own memory
– Apps do not see large global memory
– Coherence maintained by software not hardware
• NUMA retains SMP flavour while giving large scale multiprocessing
– e.g. Silicon Graphics Origin NUMA 1024 MIPS R10000 processors
• Objective is to maintain transparent system wide memory while permitting
multiprocessor nodes, each with it’s own bus or internal interconnection system
***********************