MICROPROCESSOR AND COMPUTER
ARCHITECTURE ASSESSMENT
SOLUTIONS
1. DIFFERENCES BETWEEN MICROPROCESSOR AND
MICROCONTROLLER
A. DIFFERENCES AND APPLICATIONS:
While both microprocessors and microcontrollers are essential components in
digital systems, they serve distinct purposes and possess fundamental
architectural differences.
• Microprocessor:
◦ Definition: A microprocessor is a multi-purpose, programmable,
clock-driven, register-based electronic device that reads binary
instructions from a storage device (memory), accepts binary data
as input, and processes data according to those instructions,
providing results as output. It is essentially the central processing
unit (CPU) on a single integrated circuit.
◦ Architecture: It contains only the CPU (Arithmetic Logic Unit,
Control Unit, Registers) and requires external components like
RAM, ROM, I/O ports, timers, and serial ports to function as a
complete system.
◦ Complexity & Power: Generally more complex, powerful, and
capable of higher clock speeds, handling larger data widths (e.g.,
32-bit, 64-bit). They are designed for general-purpose computing
tasks.
◦ Cost: Typically more expensive due to their complexity and
external component requirements.
◦ Applications: Personal computers (desktops, laptops), servers,
workstations, mobile phones (as the main application processor),
advanced gaming consoles, and complex digital signal processing
systems. They are suited for tasks requiring high computational
power and flexibility.
• Microcontroller:
◦ Definition: A microcontroller (often abbreviated as MCU) is a
compact integrated circuit designed to govern a specific operation
in an embedded system. It is a "computer-on-a-chip" as it
integrates a CPU, a fixed amount of RAM, ROM/Flash memory, I/O
ports, and sometimes other peripherals like timers, ADCs, DACs,
and serial communication interfaces (UART, SPI, I2C) all on a single
chip.
◦ Architecture: Self-contained, meaning it can operate
independently without additional external components for basic
functionality. This makes them ideal for embedded applications
where space and cost are critical.
◦ Complexity & Power: Less powerful and typically operate at lower
clock speeds, handling smaller data widths (e.g., 8-bit, 16-bit). They
are optimized for specific control-oriented tasks, often real-time.
◦ Cost: Generally less expensive due to their integrated nature and
lower complexity, leading to a lower bill of materials (BOM) cost for
end products.
◦ Applications: Embedded systems such as home appliances
(washing machines, microwave ovens), automotive control systems
(engine control, ABS), industrial automation (PLCs), robotics,
remote controls, medical devices, toys, point-of-sale systems, and
IoT devices. They are designed for dedicated control and
monitoring tasks.
B. INTERNAL ARCHITECTURE OF 8085 MICROPROCESSOR:
The 8085 is an 8-bit microprocessor, meaning it processes 8 bits of data at a
time. Its architecture comprises several functional blocks that work cohesively
to execute instructions and manage data flow:
• Arithmetic Logic Unit (ALU):
The ALU is the digital circuit that performs arithmetic operations
(addition, subtraction, increment, decrement) and logical operations
(AND, OR, XOR, NOT, compare). It is the computational core of the
microprocessor. The 8085's ALU operates on 8-bit data values.
• Control Unit (CU):
The Control Unit is responsible for managing and coordinating all
operations within the microprocessor. It generates timing and control
signals (like read, write, I/O/Memory, ALE) required to control the flow of
data between the microprocessor and peripheral devices. It fetches
instructions from memory, decodes them, and then directs the ALU and
other units to perform the necessary operations. It includes components
like the Instruction Register (IR), Instruction Decoder, and Timing and
Control Logic.
• Registers:
Registers are small, high-speed storage locations within the CPU used to
temporarily hold data and instructions during processing. The 8085 has
several types of registers:
◦ Accumulator (A Register): An 8-bit register that is central to
almost all arithmetic and logical operations. The result of an ALU
operation is often stored in the accumulator. It also plays a vital
role in I/O operations.
◦ General-purpose Registers (B, C, D, E, H, L): These are six 8-bit
registers (B, C, D, E, H, L) that can be used to store data. They can
also be paired to work as 16-bit register pairs (BC, DE, HL) for
addressing or storing 16-bit data. The HL pair is particularly
important as it is often used as a memory pointer.
◦ Program Counter (PC): A 16-bit register that stores the memory
address of the next instruction to be fetched and executed. It is
automatically incremented after each instruction fetch.
◦ Stack Pointer (SP): A 16-bit register that points to the top of the
stack in memory. The stack is a LIFO (Last-In, First-Out) data
structure used for temporary storage during subroutine calls and
interrupts.
◦ Flag Register (or Status Register): An 8-bit register containing 5
flip-flops that indicate the status of the most recent arithmetic or
logical operation. These flags are: Sign (S), Zero (Z), Auxiliary Carry
(AC), Parity (P), and Carry (CY). These flags are used for conditional
branching.
• Address Bus and Data Bus:
◦ Address Bus: A unidirectional 16-bit bus (A0-A15) used by the
microprocessor to send memory addresses or I/O port addresses
to external devices. With 16 address lines, the 8085 can address
2^16 = 65,536 (64 KB) unique memory locations.
◦ Data Bus: A bidirectional 8-bit bus (D0-D7) used for transferring
data between the microprocessor and memory or I/O devices. The
8085 has a multiplexed address/data bus, meaning the lower 8 bits
of the address (A0-A7) and the 8-bit data (D0-D7) share the same
physical lines (AD0-AD7). An external latch is used to de-multiplex
these lines.
2. MACHINE CYCLES
A. DEFINITION AND TIMING DIAGRAM OF MVI A, 08H:
A Machine Cycle is the basic operation that a microprocessor performs to
access memory or an I/O device. Each machine cycle consists of a specific
number of T-states (clock periods), which are the smallest units of time in the
microprocessor's operation. Different instructions require different numbers
of machine cycles for their execution.
The instruction MVI A, 08H (Move Immediate 08H to Accumulator) is a 2-
byte instruction in the 8085 microprocessor:
1. The first byte is the opcode for MVI A (3EH).
2. The second byte is the immediate data 08H.
This instruction typically requires 2 machine cycles for its complete execution:
1. Opcode Fetch (OF) Cycle: This cycle fetches the opcode (MVI A) from memory.
2. Memory Read (MR) Cycle: This cycle fetches the immediate data (08H) from memory.
Timing Diagram Description for MVI A, 08H:
A timing diagram visually represents the behavior of various control and data
signals over time during the execution of an instruction. For MVI A, 08H ,
the timing diagram would show the following key signals and their states:
• Opcode Fetch Cycle (4 T-states):
◦ T1:
▪ ALE (Address Latch Enable) goes HIGH: Indicates that
the multiplexed AD0-AD7 lines contain the lower 8 bits of the
address. A0-A15 contain the full 16-bit address of the opcode.
▪ IO/M goes LOW: Indicates a memory operation.
▪ S1=1, S0=1 : Indicates an Opcode Fetch.
▪ The microprocessor puts the 16-bit address of the opcode on
the address bus (A8-A15 on A bus, A0-A7 on AD bus).
◦ T2:
▪ ALE goes LOW: Latch the lower address bits from AD0-AD7
into an external latch.
▪ RD (Read) goes LOW: Signals memory to output data.
▪ Memory places the opcode (e.g., 3EH for MVI A) onto the data
bus (AD0-AD7).
◦ T3:
▪ The opcode is read by the microprocessor and stored in the
Instruction Register.
▪ RD goes HIGH: Ends the read operation.
◦ T4:
▪ Internal decoding and execution of the opcode begins. The
Program Counter is incremented to point to the next memory
location (where the immediate data is stored).
• Memory Read Cycle (3 T-states):
◦ T1 (of MR cycle):
▪ ALE goes HIGH: Indicates AD0-AD7 contain lower 8 bits of
the address of the immediate data. A0-A15 carry the full 16-
bit address.
▪ IO/M goes LOW: Indicates a memory operation.
▪ S1=1, S0=0: Indicates a Memory Read.
◦ T2 (of MR cycle):
▪ ALE goes LOW.
▪ RD goes LOW.
▪ Memory places the immediate data ( 08H ) onto the data bus
(AD0-AD7).
◦ T3 (of MR cycle):
▪ The immediate data 08H is read by the microprocessor and
moved into the Accumulator.
▪ RD goes HIGH.
▪ The Program Counter is incremented again.
B. PROCESS TO CALCULATE 20-BIT ADDRESS:
The method of address calculation varies significantly between 8-bit
microprocessors like the 8085 and 16-bit/higher microprocessors like the
8086 due to their different addressing capabilities.
• 8085 Microprocessor:
◦ The 8085 has 16 address lines (A0-A15). This means it can directly
address 2^16 = 65,536 bytes, or 64 KB, of memory.
◦ Address calculation is straightforward: a 16-bit address directly
points to a memory location. There is no concept of segmentation
for memory access in the 8085. The maximum addressable
memory is 64KB, which is directly mapped using the 16-bit address
bus.
• 8086 Microprocessor:
◦ The 8086 has 20 address lines (A0-A19). This allows it to address
2^20 = 1,048,576 bytes, or 1 MB, of memory.
◦ To generate a 20-bit physical address from 16-bit registers, the
8086 employs a segmentation mechanism. Memory is divided
into logical segments, each up to 64 KB in size. A physical address
is calculated using a 16-bit segment address and a 16-bit offset
address.
◦ Calculation Process: The 20-bit physical address (PA) is calculated
by shifting the 16-bit segment register value left by 4 bits
(effectively multiplying it by 16 or 10H) and then adding the 16-bit
offset address to the result.
Physical Address (20-bit) = (Segment Register
Value × 10H) + Offset Register Value
For example, if the Data Segment (DS) register contains 2000H
and the effective address (offset) is 1234H :
▪ Segment Register Value shifted: 2000H × 10H = 20000H
▪ Offset Register Value: 1234H
▪ Physical Address = 20000H + 1234H = 21234H
◦ This segmentation scheme allows the 8086 to manage a 1MB
memory space using 16-bit registers, providing flexibility and
support for multitasking environments.
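As a quick cross-check of the arithmetic above, here is a minimal Python sketch (the function name physical_address is illustrative, not part of any 8086 toolchain):

    def physical_address(segment, offset):
        """20-bit 8086 physical address: shift the segment left 4 bits, add the offset."""
        return ((segment << 4) + offset) & 0xFFFFF  # keep only 20 bits

    assert physical_address(0x2000, 0x1234) == 0x21234  # matches the worked example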
3. ASSEMBLY LANGUAGE PROGRAM IN 8085
A. PROGRAM TO FIND GREATEST NUMBER:
This 8085 assembly program finds the greatest number from a sequence of
bytes stored in memory starting from address 3000H and stores the result
at 3050H . The number of bytes to compare is initialized in register B (here, 4
bytes).
LXI H, 3000H   ; Load HL with the starting address of the numbers (3000H)
MVI B, 04H     ; Initialize B with the count of numbers (here, 4)
MOV A, M       ; A <- first byte (at 3000H); treat it as the current maximum
INX H          ; HL -> next byte (3001H)
DCR B          ; One number has already been consumed

LOOP:          ; Start of the comparison loop
CMP M          ; Compare A (current max) with the byte at HL.
               ; Flags are set on (A - M): A < M gives CY=1, A >= M gives CY=0.
JNC NEXT       ; No carry (A >= M): the current maximum stands, skip the update.
MOV A, M       ; Carry set (A < M): the byte at HL is larger; make it the new maximum.

NEXT:          ; Next iteration (or update skipped)
INX H          ; Advance HL to the next byte
DCR B          ; Decrement the count
JNZ LOOP       ; Repeat until every number has been compared

LXI H, 3050H   ; Point HL at the result location (3050H)
MOV M, A       ; Store the maximum (in A) at 3050H
HLT            ; Halt the processor

Note the direction of the conditional jump: CMP M sets the Carry Flag when
A < M, so the update MOV A, M must be skipped only when the carry is clear
(JNC). Using JC here would skip the update exactly when the new byte is
larger, which is the error analyzed below. Alternatively, DE can be loaded
with 3050H up front (LXI D, 3050H) and the result stored with STAX D,
leaving HL free as the source pointer throughout.
Explanation of the Corrected Logic:
The core of finding the greatest number lies in iterating through the list,
maintaining a 'current maximum', and updating it whenever a larger number
is encountered. The corrected assembly logic for the comparison part would
be:
1. Load the first number into the Accumulator (A), considering it the initial
maximum.
2. Decrement the count of numbers to process.
3. Loop through the remaining numbers:
◦ Compare the current number (from memory, pointed by HL) with
the Accumulator (current maximum).
◦ If the current number from memory is greater than the
Accumulator's content, then update the Accumulator with this new
number. Otherwise, keep the Accumulator's content as is.
◦ Increment HL to point to the next number.
◦ Decrement the count.
◦ Repeat until all numbers are processed.
4. Store the final maximum value from the Accumulator to the designated
result memory location.
The most common and efficient way to implement the comparison
(if (A < M) { A = M; }) in 8085 is:

CMP M      ; Compare A with M. If A < M, the Carry Flag (CY) is set.
JNC NEXT   ; If CY is clear (A >= M), jump to NEXT and skip the update.
MOV A, M   ; If CY is set (A < M), M is greater, so move M to A (update the max).
NEXT:
INX H
DCR B
JNZ LOOP
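For reference, the same compare-and-update loop can be mirrored in Python (variable names chosen to match the registers; an illustrative sketch only):

    def find_greatest(data):
        # data plays the role of memory at 3000H onward; 'a' is the accumulator
        a = data[0]          # MOV A, M  -- first byte is the initial maximum
        for m in data[1:]:   # INX H / DCR B / JNZ LOOP
            if a < m:        # CMP M sets CY when A < M; JNC skips the update
                a = m        # MOV A, M  -- adopt the larger value
        return a             # stored at 3050H by MOV M, A

    assert find_greatest([0x12, 0x56, 0x34, 0x4F]) == 0x56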
B. STATE DIAGRAM FOR SIMPLE CPU:
A state diagram illustrates the different states a CPU can be in during
instruction execution and the transitions between these states. The
fundamental states are Fetch, Decode, and Execute, forming the Instruction
Cycle.
States:
1. Fetch State:
◦ Action: CPU retrieves the next instruction's opcode from the
memory location pointed to by the Program Counter (PC).
◦ Internal Changes: PC is incremented to point to the next
instruction. The fetched opcode is placed into the Instruction
Register (IR).
◦ Signals: Memory Read (RD) signal active, address placed on
address bus.
2. Decode State:
◦ Action: The Control Unit (CU) interprets the opcode currently held
in the Instruction Register. It determines what operation needs to
be performed, what operands are required, and where they are
located.
◦ Internal Changes: Control signals are generated based on the
decoded instruction, preparing the ALU, registers, and memory for
the upcoming execution phase.
3. Execute State:
◦ Action: The actual operation specified by the instruction is
performed. This involves specific micro-operations. The nature of
the execute phase depends entirely on the instruction.
◦ Example Micro-operations for given instructions:
▪ ADD (e.g., ADD B):
▪ If operand is in a register (like B), its content is sent to
one input of the ALU. The Accumulator's content is sent
to the other ALU input.
▪ ALU performs addition (A + B).
▪ Result is stored back in the Accumulator. Flag registers
are updated.
▪ AND (e.g., ANI 05H):
▪ If operand is immediate (05H), it's fetched from memory
(if not already fetched as part of opcode). Accumulator
content is fetched.
▪ ALU performs bitwise AND operation (A AND 05H).
▪ Result is stored back in the Accumulator. Flags updated.
▪ JMP (e.g., JMP 2000H):
▪ The new target address (2000H) is fetched (if not already
fetched).
▪ The value 2000H is loaded into the Program Counter
(PC). This alters the flow of control.
▪ No ALU operation typically involved for a direct jump.
▪ INC (e.g., INR B):
▪ The content of register B is sent to the ALU.
▪ ALU performs an increment operation (B + 1).
▪ Result is stored back in register B. Flags may or may not
be updated depending on the instruction. (8085 INR/
DCR affect all flags except CY).
State Transitions:
[RESET/START] --> Fetch --(Instruction Ready)--> Decode --(Decoded)--> Execute
                   ^  ^                                                  |  |
                   |  `--(Operation Complete: fetch next instruction)----'  |
                   `-----(Branch/Jump: fetch from new target address)-------'
This cycle repeats continuously as long as the CPU is operating. Interrupts
can cause a transition out of the normal cycle to an Interrupt Service Routine
(ISR) and then back.
4. VERTICAL MICROINSTRUCTION SEQUENCING
A. EXPLANATION:
In the design of a CPU's control unit, microinstructions are fundamental
commands that control the individual functional units (e.g., ALU, registers,
buses) within the CPU. A sequence of microinstructions forms a
microprogram, which in turn implements a machine language instruction.
Vertical Microinstruction Sequencing refers to a microprogrammed control
unit design approach where each microinstruction is highly encoded. This
means that a single bit or a small field within the microinstruction represents
a specific control function (e.g., "Add", "Load A").
• Characteristics:
◦ Highly Encoded: Each bit or small field in the microinstruction
controls a specific operation or set of mutually exclusive
operations. For example, a 3-bit field might select one of 8 ALU
operations (ADD, SUB, AND, OR, etc.); a decoding sketch appears at
the end of this subsection.
◦ Minimal Parallelism: Vertical microinstructions typically specify a
limited number of micro-operations to be performed concurrently,
usually one main operation per clock cycle. This leads to longer
microprograms (more microinstructions per machine instruction).
◦ Shorter Microinstruction Word Length: Due to encoding, the
width (number of bits) of each microinstruction is relatively small.
This saves space in the control memory (ROM).
◦ More Complex Decoding Logic: The decoding logic for vertical
microinstructions is more complex than for horizontal ones, since
each encoded field must pass through decoders to be translated into
the individual control signals. The sequencing logic itself,
however, remains simple.
◦ Flexible: Easier to add new machine instructions or modify existing
ones by changing the microprogram, as microinstructions are
concise.
• Sequencing Mechanism:
The sequencing of vertical microinstructions is managed by a
microprogram sequencer, which often includes a microprogram counter
(μPC), mapping logic, and a dispatch ROM/PLA. When a machine
instruction is fetched and decoded, its opcode is mapped to the starting
address of its corresponding microprogram in the control memory. The
μPC then increments to fetch successive microinstructions. Conditional
branching within the microprogram (e.g., for handling different
addressing modes or flag conditions) is supported by special fields in
the microinstruction that can modify the μPC or enable branch logic.
• Advantages: Reduces the size of the control memory. Easier to modify
the control logic.
• Disadvantages: Requires more complex decoding logic for each
microinstruction. Slower execution due to sequential nature and less
parallelism.
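To make the encoding concrete, here is a small Python sketch of how a hypothetical 3-bit ALU field in an 8-bit vertical microinstruction might be decoded into one of eight operations (the field layout and operation names are assumptions for illustration, not a real control-store format):

    # Hypothetical 8-bit vertical microinstruction: [alu:3 | src:2 | dst:2 | wr:1]
    ALU_OPS = ["NOP", "ADD", "SUB", "AND", "OR", "XOR", "INC", "DEC"]

    def decode(microinstruction):
        alu = (microinstruction >> 5) & 0b111  # encoded field -> one of 8 ALU ops
        src = (microinstruction >> 3) & 0b11   # source register select
        dst = (microinstruction >> 1) & 0b11   # destination register select
        wr  = microinstruction & 0b1           # write-back enable
        return ALU_OPS[alu], src, dst, bool(wr)

    # 0b001_01_10_1 -> ('ADD', 1, 2, True): ADD, src reg 1, dst reg 2, write-back on
    print(decode(0b00101101))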
B. ARITHMETIC PIPELINE:
An Arithmetic Pipeline is a technique used in computer architecture to
speed up the execution of arithmetic operations (like floating-point addition,
multiplication, or division) by breaking them down into a sequence of smaller,
independent stages. Each stage performs a specific sub-operation, and these
stages are connected in a linear fashion, allowing multiple operations to be
processed concurrently, albeit in different stages of completion.
Analogy: Similar to an assembly line, where different workers perform
different tasks simultaneously on different products, a pipeline allows
different parts of an arithmetic operation to be performed in parallel on
different data sets. For example, a floating-point addition might be broken
into: 1) Exponent comparison, 2) Mantissa alignment, 3) Mantissa addition, 4)
Normalization.
Benefits: Pipelining increases the throughput (number of operations
completed per unit time) of the system, even though the latency (time to
complete a single operation) might slightly increase due to inter-stage delays.
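A minimal cycle-count model of the four-stage floating-point-add pipeline named above illustrates the throughput gain (Python; idealized timing of one stage per cycle, ignoring stalls):

    STAGES = ["compare exponents", "align mantissas", "add mantissas", "normalize"]

    def pipelined_cycles(n_ops, n_stages=len(STAGES)):
        # first result emerges after n_stages cycles, then one result per cycle
        return n_stages + (n_ops - 1)

    def sequential_cycles(n_ops, n_stages=len(STAGES)):
        # without pipelining, every operation occupies all stages in turn
        return n_ops * n_stages

    # 100 additions: 400 cycles unpipelined versus 103 pipelined
    print(sequential_cycles(100), pipelined_cycles(100))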
Hazards in Pipelining:
Pipelining introduces challenges known as hazards, which can prevent the
pipeline from operating at its peak efficiency, potentially requiring stalling
(pausing some stages) or other corrective measures.
1. Structural Hazards:
◦ Definition: Occur when two or more instructions in different
stages of the pipeline require the same hardware resource
simultaneously.
◦ Example: If a single memory unit is used for both instruction fetch
and data access (load/store), and an instruction needs to fetch data
while another instruction needs to fetch its opcode, a structural
hazard arises. Or, if a single ALU is used for both address
calculation and arithmetic operations.
◦ Mitigation: Duplicating hardware resources (e.g., separate
instruction and data caches/memory – Harvard architecture), or
introducing stalls (bubbles) in the pipeline until the resource
becomes free.
2. Data Hazards:
◦ Definition: Occur when an instruction depends on the result of a
previous instruction that is still in the pipeline and has not yet
completed its write-back stage.
◦ Types:
▪ RAW (Read After Write) / True Dependency: An instruction
tries to read a source operand before a preceding instruction
has written to it. (e.g., ADD R1, R2 then SUB R3, R1 ).
This is the most common and critical data hazard.
▪ WAR (Write After Read) / Anti-Dependency: An instruction
tries to write to a destination operand before a preceding
instruction has read its source operand. (e.g., ADD R1, R2
then MOV R2, R4 ). Less common in classic pipelines, often
managed by register renaming.
▪ WAW (Write After Write) / Output Dependency: An
instruction tries to write to a destination operand before a
preceding instruction has written to the same destination.
(e.g., MUL R1, R2 then ADD R1, R3 ). Less common,
often managed by instruction reordering or register
renaming.
◦ Mitigation:
▪ Stalling (Bubbles): Pausing the pipeline until the dependent
data is available.
▪ Forwarding (Bypassing): Routing the result of an operation
from an earlier pipeline stage directly to a later stage that
needs it, without waiting for it to be written back to the
register file.
▪ Operand Fetch Optimization: Intelligent scheduling of
instructions.
3. Control Hazards (Branch Hazards):
◦ Definition: Occur when the pipeline makes a wrong guess about
which instruction to fetch next due to conditional branches, jumps,
or calls. When a branch instruction is fetched, the CPU doesn't
know the target address until the branch condition is evaluated
later in the pipeline.
◦ Example: BEQ R1, R2, LABEL . The instructions immediately
following BEQ are fetched before the condition (R1 == R2) is
known. If the branch is taken, the fetched instructions are incorrect
and must be flushed.
◦ Mitigation:
▪ Stalling: Simply pause the pipeline until the branch target is
known. (Least efficient).
▪ Branch Prediction: Guessing the outcome of the branch
(taken or not taken) and fetching instructions based on the
guess. If the guess is wrong, the pipeline is flushed, incurring
a penalty.
▪ Delayed Branch: The instruction immediately following a
branch instruction is always executed, regardless of whether
the branch is taken or not. This instruction is placed there by
the compiler to fill the pipeline delay slot.
▪ Speculative Execution: Execute instructions along the
predicted path, but commit results only if the prediction was
correct.
5. RTL CODE FOR BOOTH’S ALGORITHM
A. RTL CODE EXAMPLE (CONCEPTUAL FOR 7 X 3):
Booth's Algorithm is a multiplication algorithm that multiplies two signed
binary numbers in two's complement representation. It is particularly efficient
when there are long sequences of 1s in the multiplier. Here, we'll illustrate a
conceptual RTL (Register Transfer Level) description of the main steps for the
multiplication of 7 (0111) and 3 (0011) using a simplified 4-bit representation.
Let Multiplicand (M) = 7 (0111) and Multiplier (Q) = 3 (0011).
Registers needed:
• A : Accumulator (initially 0, will hold partial product, size N+1 bits for N-
bit multiplication, here 5 bits)
• Q : Multiplier (initially 3, size N bits, here 4 bits)
• Q_1 : A single bit (initially 0), holds the bit to the right of the LSB of Q.
• M : Multiplicand (7, size N bits, here 4 bits)
• Count : N (number of bits in Q), here 4.
Initial State:
A <- 00000 ; 5 bits for partial product
Q <- 0011 ; Multiplier (3)
Q_1 <- 0     ; extra bit to the right of Q, initially 0 (since Q[0] = 1, the first pair Q[0]Q_1 is 10)
M <- 0111 ; Multiplicand (7)
Count <- 4
Booth's Iteration (Simplified RTL - High Level):
The core logic involves checking the last two bits of the effective multiplier
(Q[0] and Q_1) and performing operations followed by an arithmetic right
shift. The operations are:
• 00 or 11 : No operation (0 * M), just shift.
• 01 : Add M to A, then shift. (A <- A + M)
• 10 : Subtract M from A, then shift. (A <- A - M)
Detailed Steps (Illustrative RTL for one iteration):
// Loop N (Count) times
LOOP_START:
// Check Q[0] and Q_1
IF (Q[0] == 1 AND Q_1 == 0) THEN
    A <- A - M       ; i.e., A <- A + (-M), with -M in two's complement;
                     ; for M = 00111 (7), -M = 11001 in A's 5-bit width
ELSE IF (Q[0] == 0 AND Q_1 == 1) THEN
    A <- A + M
END IF

// Arithmetic right shift of the combined register (A, Q, Q_1)
Q_1    <- Q[0]               ; old LSB of Q becomes Q_1
Q      <- RightShift(Q)      ; Q shifts right by one bit
Q[MSB] <- A[0]               ; LSB of A moves into the MSB of Q
A      <- ArithShiftRight(A) ; A shifts right; its MSB (sign bit) is replicated
Count <- Count - 1
IF (Count != 0) THEN
GOTO LOOP_START
END IF
// Result is in A and Q combined (A:Q)
After 4 iterations, the product (21 decimal, 00010101 in binary) is found in
the combined A:Q register pair.
This is a simplified RTL. A full RTL implementation would typically involve
more precise descriptions of bit-level transfers, state transitions, and control
signals within a finite state machine.
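The RTL above can be cross-checked with a short Python model of Booth's algorithm (bit widths and names follow the register layout described; an illustrative sketch, not production code):

    def booth_multiply(m, q, n=4):
        """Booth's algorithm for n-bit two's-complement operands."""
        acc_bits = n + 1                     # A is N+1 bits, as above
        a, q_1, mask = 0, 0, (1 << acc_bits) - 1
        neg_m = (-m) & mask                  # two's complement of M in A's width
        for _ in range(n):                   # Count iterations
            pair = ((q & 1) << 1) | q_1
            if pair == 0b10:                 # Q[0]Q_1 = 10 -> A <- A - M
                a = (a + neg_m) & mask
            elif pair == 0b01:               # Q[0]Q_1 = 01 -> A <- A + M
                a = (a + (m & mask)) & mask
            # arithmetic right shift of the combined (A, Q, Q_1)
            q_1 = q & 1
            q = (q >> 1) | ((a & 1) << (n - 1))
            sign = a >> (acc_bits - 1)       # replicate A's sign bit
            a = (a >> 1) | (sign << (acc_bits - 1))
        return (a << n) | q                  # product sits in A:Q

    assert booth_multiply(7, 3) == 21        # the 7 x 3 example above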
B. IMPORTANCE OF CACHE MEMORY:
Cache memory is a small, very fast memory type that acts as a buffer
between the CPU and main memory (RAM). Its primary purpose is to speed
up data access by storing copies of data from frequently used main memory
locations. This reduces the average time it takes for the CPU to access data,
thereby improving overall system performance.
How it works (Locality of Reference):
The effectiveness of cache memory is based on the principle of "locality of
reference":
• Temporal Locality: If a data item is accessed, it is likely to be accessed
again soon. (e.g., loop variables, function parameters).
• Spatial Locality: If a data item is accessed, data items whose addresses
are close to it are likely to be accessed soon. (e.g., array elements,
sequential instruction fetches).
When the CPU needs data, it first checks the cache. This is called a "cache hit".
If the data is found in the cache, it is retrieved much faster than from main
memory. If the data is not in the cache ("cache miss"), the CPU must fetch it
from main memory. When a cache miss occurs, not only the requested data
but also a block of surrounding data (due to spatial locality) is fetched into the
cache, anticipating future use.
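A trivial Python loop exhibits both kinds of locality (illustrative only):

    data = list(range(4096))

    total = 0
    for x in data:    # spatial locality: adjacent elements arrive a cache block at a time
        total += x    # temporal locality: 'total' is touched on every iteration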
Key Benefits and Importance:
1. Bridge the Speed Gap: CPUs operate at much higher speeds than main
memory (RAM). Cache memory, being faster, acts as a high-speed buffer,
minimizing the performance bottleneck caused by slower main memory
access times. Without cache, the CPU would frequently have to wait for
data, leading to significant idle cycles.
2. Improved System Performance: By reducing the average memory
access time (AMAT), cache memory significantly boosts the overall
execution speed of programs. A higher cache hit rate translates directly
to better system throughput (a worked AMAT example follows at the end
of this section).
3. Energy Efficiency: Accessing data from cache consumes less power
than accessing it from main memory, contributing to more energy-
efficient computing, especially crucial for portable devices.
4. Multi-level Caching: Modern CPUs often employ multiple levels of
cache (L1, L2, L3).
◦ L1 Cache: Smallest, fastest, closest to the CPU core, often split into
instruction and data caches.
◦ L2 Cache: Larger, slightly slower than L1, typically unified (stores
both instructions and data).
◦ L3 Cache: Largest, slowest among caches, but still much faster
than RAM. Often shared across multiple CPU cores.
This hierarchy allows for a balance between speed, size, and cost.
5. Optimizing CPU Utilization: By ensuring that the CPU has data readily
available, cache memory keeps the processing units busy, maximizing
their utilization and preventing bottlenecks.
In essence, cache memory is a critical component for achieving high
performance in modern computer systems, allowing CPUs to operate at
their potential despite the inherent latency of main memory.
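As the worked AMAT example referenced earlier (Python; the timing figures are assumed for illustration):

    def amat(hit_time, miss_rate, miss_penalty):
        # Average Memory Access Time = hit time + miss rate x miss penalty
        return hit_time + miss_rate * miss_penalty

    # Assumed figures: 1 ns cache hit, 5% miss rate, 100 ns main-memory penalty
    print(amat(1.0, 0.05, 100.0))   # 6.0 ns, versus 100 ns with no cache at all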
6. FLOWCHART OF PROGRAMMED I/O VS.
INTERRUPT DRIVEN I/O
A. FLOWCHART DESCRIPTION (WITHOUT DRAWING):
These two methods dictate how a CPU interacts with I/O devices to
transfer data. They differ fundamentally in how the CPU's attention is
managed.
Programmed I/O Flowchart Description:
Programmed I/O (PIO) involves the CPU directly controlling the I/O
operations. The CPU issues commands to an I/O module and then
continuously checks the status of the I/O device until the operation is
complete. This method is CPU-intensive and inefficient for slow I/O
devices.
1. Start: CPU begins execution of a program.
2. CPU Needs I/O: Program requires data transfer to/from an I/O
device.
3. Issue Command: CPU issues a command to the I/O module (e.g.,
"start disk read", "send data to printer").
4. Loop: Check Status Register: CPU enters a tight loop, repeatedly
reading the status register of the I/O device.
5. Device Ready?:
▪ No: CPU continues to poll (loop back to "Check Status
Register"). This is "busy-waiting" and wastes CPU cycles.
▪ Yes: The status register indicates the device is ready (e.g.,
data available, transfer complete).
6. Transfer Data: CPU transfers data word by word between the I/O
device and memory/registers.
7. More Data?:
▪ Yes: Go back to "Transfer Data" for the next word (or to
"Check Status Register" if device needs to be polled again for
each word).
▪ No: All data transferred.
8. End I/O Operation: CPU proceeds with other program tasks.
Key Characteristic: CPU is fully dedicated to the I/O operation and
cannot perform other useful tasks until the I/O is complete. It is suitable
for fast I/O devices where polling overhead is minimal or for simple,
single-tasking systems.
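The busy-wait loop at the heart of programmed I/O can be sketched as follows (Python; the device object and its register names are hypothetical stand-ins for an I/O module):

    READY = 0x01  # assumed ready-flag bit in the status register

    class MockDevice:
        """Stand-in for an I/O module with STATUS and DATA registers."""
        def __init__(self):
            self.status, self.sent = READY, []
        def read_status(self):
            return self.status
        def write_data(self, byte):
            self.sent.append(byte)

    def programmed_io_write(device, data):
        for byte in data:
            # Busy-wait: poll the status register until the device reports ready.
            # Every pass through this loop is a CPU cycle doing no useful work.
            while not (device.read_status() & READY):
                pass
            device.write_data(byte)

    dev = MockDevice()
    programmed_io_write(dev, b"OK")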
Interrupt Driven I/O Flowchart Description:
Interrupt-driven I/O allows the CPU to perform other tasks while an I/O
device is processing. The I/O device notifies the CPU via an interrupt
signal when it is ready for data transfer or when an operation is
complete, thus freeing the CPU for other computations.
1. Start: CPU begins execution of a main program.
2. CPU Needs I/O: Program requires data transfer to/from an I/O
device.
3. Issue Command & Enable Interrupts: CPU issues a command to
the I/O module (e.g., "start disk read") and enables the
corresponding interrupt for that device.
4. CPU Continues Other Tasks: CPU immediately returns to
executing other instructions of the main program, without waiting
for the I/O device.
5. I/O Device Processes: The I/O device performs its operation
autonomously.
6. I/O Device Ready?:
▪ No: Device is still busy. CPU continues main program.
▪ Yes: Device completes operation or is ready for data transfer.
It sends an interrupt request signal to the CPU.
7. CPU Receives Interrupt: If CPU interrupts are enabled, the CPU
temporarily suspends its current task.
8. Save Context: CPU saves the state of its current program (e.g.,
registers, PC) onto the stack.
9. Jump to Interrupt Service Routine (ISR): CPU jumps to a
predefined memory location that contains the start of the ISR for
that specific device.
10. Execute ISR: The ISR performs the necessary data transfer (e.g.,
reads data from device buffer) or handles the device's status.
11. Restore Context: After the ISR completes, CPU restores the saved
context from the stack.
12. Resume Main Program: CPU returns to the exact point in the
main program where it was interrupted, continuing its execution.
13. End: Program execution completes.
Key Characteristic: CPU is utilized more efficiently as it is not tied up in
polling. It responds to I/O events only when notified, leading to better
responsiveness and throughput in multi-tasking environments.
B. FLYNN’S CLASSIFICATION:
Flynn's Classification is a widely used taxonomy for computer
architectures, proposed by Michael J. Flynn in 1966. It categorizes
architectures based on the number of instruction streams and data
streams they can process simultaneously.
1. SISD (Single Instruction Single Data):
▪ Description: A uniprocessor computer that executes a single
instruction stream on a single data stream. This is the most
traditional architecture.
▪ Characteristics: Sequential execution. Only one CPU, one
memory unit.
▪ Examples: Early mainframe computers, traditional single-
core personal computers before the advent of multi-core
processors. Most traditional Von Neumann architectures fall
into this category.
2. SIMD (Single Instruction Multiple Data):
▪ Description: A single instruction is executed simultaneously
on multiple data streams. All processing elements perform
the same operation on different data elements.
▪ Characteristics: Exploits data-level parallelism. Suitable for
tasks with high data parallelism, such as vector operations,
image processing, and scientific computations (see the sketch
after this list).
▪ Examples: Vector processors (e.g., Cray X-MP), Graphics
Processing Units (GPUs) with their many-core architecture,
array processors, processor extensions like SSE/AVX in
modern CPUs.
3. MISD (Multiple Instruction Single Data):
▪ Description: Multiple processing elements execute different
instructions on the same data stream. This architecture is
rarely seen in practice and is more theoretical.
▪ Characteristics: Potentially useful for fault-tolerant systems
where multiple processors perform the same task and results
are compared, or for pipelined architectures where data flows
through different functional units.
▪ Examples: Some highly specialized fault-tolerant systems or
systolic arrays for specific computations. No widely
recognized general-purpose computer falls strictly into this
category.
4. MIMD (Multiple Instruction Multiple Data):
▪ Description: Multiple processing elements simultaneously
execute different instructions on different data streams. This
is the most common type of parallel computer today.
▪ Characteristics: Exploits both instruction-level and data-level
parallelism. Highly scalable and flexible. Can be tightly
coupled (shared memory) or loosely coupled (distributed
memory).
▪ Examples: Multi-core processors (e.g., Intel Core i7, AMD
Ryzen), symmetric multiprocessing (SMP) systems, clusters of
workstations, supercomputers (e.g., IBM Blue Gene), and
modern cloud computing infrastructures.
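The SISD/SIMD contrast can be sketched in Python with NumPy, whose vectorized operations are typically executed on SIMD units such as SSE/AVX (a loose illustration; actual instruction selection is up to the library and hardware):

    import numpy as np

    a = np.arange(1000, dtype=np.float32)
    b = np.ones_like(a)

    # SISD style: one instruction stream handling one data element at a time
    c_sisd = [x + y for x, y in zip(a, b)]

    # SIMD style: one 'add' applied across many data elements per instruction
    c_simd = a + b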
7. SHORT NOTES
WALLACE TREE:
A Wallace Tree is an efficient hardware implementation for
multiplication, primarily used in digital circuits to speed up the process
of summing partial products. In binary multiplication, the product of two
N-bit numbers results in approximately 2N partial products. The
traditional way to sum these products is sequentially, which is slow. The
Wallace tree structure significantly reduces the time complexity of this
summation.
◦ How it works: Instead of summing partial products two at a time
sequentially, a Wallace tree uses a tree-like structure of adders to
sum groups of three partial products (using a 3:2 compressor,
typically a Full Adder) in parallel. Each stage of the tree takes three
inputs and produces two outputs (a sum and a carry). This process
reduces the number of partial products by one-third in each stage
until only two rows of partial products remain. These final two rows
are then added using a fast carry-propagate adder (e.g., a carry-
lookahead adder) to produce the final product.
◦ Advantages: Significantly reduces the propagation delay and thus
speeds up multiplication compared to ripple-carry or array
multipliers. It's particularly effective for large numbers (e.g., 32-bit,
64-bit multiplication) where carry propagation through many
stages would be a bottleneck.
◦ Application: Widely used in high-performance digital signal
processors (DSPs), floating-point units (FPUs) in modern CPUs, and
other applications requiring rapid multiplication.
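One 3:2 reduction step can be modeled directly in Python: three rows of partial-product bits go in, a sum row and a carry row come out (a sketch on integer bit masks, not a gate-level model):

    def compressor_3_2(row_a, row_b, row_c):
        """Full-adder (3:2 compressor) applied bitwise across three rows."""
        s = row_a ^ row_b ^ row_c                                        # sum bits
        c = ((row_a & row_b) | (row_b & row_c) | (row_a & row_c)) << 1   # carry bits
        return s, c

    def wallace_reduce(rows):
        # repeatedly compress groups of three rows until only two remain
        while len(rows) > 2:
            next_rows = []
            for i in range(0, len(rows) - 2, 3):
                next_rows.extend(compressor_3_2(rows[i], rows[i+1], rows[i+2]))
            next_rows.extend(rows[len(rows) - len(rows) % 3:])  # leftovers pass through
            rows = next_rows
        return sum(rows)  # final fast carry-propagate addition

    # partial products of 7 x 3 (multiplier bits 1 and 1): 7<<0 and 7<<1
    assert wallace_reduce([7 << 0, 7 << 1, 0]) == 21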
UART:
UART stands for Universal Asynchronous Receiver-Transmitter. It is a
hardware communication protocol that facilitates asynchronous serial
communication between two devices. "Asynchronous" means there is no
shared clock signal between the transmitting and receiving devices;
instead, synchronization is achieved through start and stop bits.
◦ Functionality: A UART converts parallel data from a CPU or
microcontroller into a serial stream for transmission and converts
incoming serial data back into parallel data for the receiving
device. It handles the timing and data formatting.
◦ Key Features:
▪ Start Bit: A logic LOW (0) bit that signals the start of a data
packet.
▪ Data Bits: Typically 5 to 9 bits (commonly 8 bits) representing
the actual data.
▪ Parity Bit (Optional): An extra bit used for error checking
(even or odd parity).
▪ Stop Bit(s): One or two logic HIGH (1) bits that signal the end
of the data packet.
▪ Baud Rate: The rate at which data is transferred, measured in
bits per second (bps). Both sender and receiver must be
configured to the same baud rate.
◦ Communication Process:
1. Transmitting: The CPU sends parallel data to the UART's
transmit buffer. The UART adds start, parity (if enabled), and
stop bits, then shifts the data out serially, one bit at a time.
2. Receiving: The UART samples the incoming serial line. Upon
detecting a start bit, it synchronizes with the incoming data
stream, samples the data bits at the correct baud rate, checks
for parity errors, removes the start/stop bits, and converts the
serial data back to parallel format, making it available to the
CPU.
◦ Applications: Widely used for communication between
microcontrollers and other peripherals (GPS modules, Bluetooth
modules, Wi-Fi modules), for debugging via serial consoles, and in
various embedded systems for device-to-device communication.
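Frame construction can be sketched in a few lines of Python (8-N-1 by default; the function and constants are illustrative):

    def uart_frame(byte, parity=None):
        """Build one UART frame as a list of line levels, LSB first."""
        bits = [0]                                   # start bit: line pulled LOW
        data = [(byte >> i) & 1 for i in range(8)]   # 8 data bits, LSB first
        bits += data
        if parity == "even":
            bits.append(sum(data) % 2)               # make the count of 1s even
        elif parity == "odd":
            bits.append(1 - sum(data) % 2)
        bits.append(1)                               # stop bit: line returns HIGH
        return bits

    print(uart_frame(0x41))  # 'A' -> [0, 1, 0, 0, 0, 0, 0, 1, 0, 1]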
RISC AND CISC:
RISC (Reduced Instruction Set Computing) and CISC (Complex
Instruction Set Computing) are two fundamental design philosophies
for processor architectures, primarily differing in the complexity and
number of instructions in their instruction sets.
◦ RISC (Reduced Instruction Set Computing):
▪ Design Philosophy: Emphasizes simplicity and efficiency of
individual instructions. The idea is that compilers can
generate efficient code using a small, highly optimized set of
simple instructions, which can then be executed very quickly,
often within a single clock cycle.
▪ Instruction Set:
▪ Fewer instructions, typically around 100 or fewer.
▪ Simple, fixed-length instructions, often one instruction
per word.
▪ Instructions execute in a single clock cycle or a very few
cycles.
▪ Load/Store architecture: Only load and store instructions
access memory. All other operations (arithmetic, logical)
occur on operands held in registers.
▪ Larger number of general-purpose registers to minimize
memory access.
▪ Pipelining is easier to implement efficiently due to fixed
instruction length and simple execution steps.
▪ Compiler Role: More complex, as it needs to break down
high-level language constructs into a sequence of many
simple RISC instructions.
▪ Performance: Achieves high performance through
instruction-level parallelism (pipelining, superscalar
execution) and high clock frequencies.
▪ Examples: ARM (found in smartphones, tablets, embedded
systems), MIPS, SPARC, PowerPC (initially).
◦ CISC (Complex Instruction Set Computing):
▪ Design Philosophy: Aims to accomplish tasks with as few
assembly instructions as possible. CISC instructions can
perform multiple low-level operations (e.g., a single
instruction for memory access, arithmetic operation, and
register update).
▪ Instruction Set:
▪ Large number of instructions, often hundreds or
thousands.
▪ Complex, variable-length instructions, which can vary
from a few bytes to many bytes.
▪ Instructions can take multiple clock cycles to execute.
▪ Instructions can directly manipulate memory operands.
▪ Fewer general-purpose registers, as instructions can
operate directly on memory.
▪ Pipelining is more challenging due to variable
instruction lengths and complex operations.
Microprogrammed control units are often used.
▪ Compiler Role: Simpler, as complex high-level operations can
often be translated directly into single CISC instructions.
▪ Performance: Historically aimed for high "code density" (less
memory usage for programs). Modern CISC processors often
translate complex instructions into a sequence of simpler
"micro-operations" which are then executed on an internal
RISC-like core, thus blending aspects of both philosophies.
▪ Examples: Intel x86 (used in most personal computers and
servers), Motorola 68000.
In modern processor design, the distinction between pure RISC and
CISC has blurred, with many architectures incorporating features from
both to achieve optimal performance and efficiency.