Computer Architecture Notes

The document discusses the concepts of architecture and microarchitecture, focusing on Instruction Set Architecture (ISA) and its implementation in hardware. It covers the evolution of ISAs, various machine models, performance considerations, and the importance of pipelining in modern processors. Additionally, it addresses memory technology, cache design, and the challenges associated with optimizing performance and efficiency in computing systems.


1. Architecture (Big A Architecture)
* Instruction Set Architecture (ISA): This refers to the abstraction layer provided
to software, essentially acting as the interface between software and hardware. It
is a high-level view of the machine's functionality, visible to the programmer.
* Programmer Visible State: This includes components like memory, registers, and
the fundamental data types the machine can operate on (e.g., byte, word, or
floating-point numbers).
* Instructions and Execution Semantics: It defines how the machine executes
instructions like "add," "subtract," etc., and handles more complex operations,
including I/O and interrupts.
* Stability: The instruction set is designed to remain relatively unchanged,
providing a consistent programming environment across different implementations of
the same architecture.
* Trade-offs: The design of an ISA involves various trade-offs such as speed, cost,
and energy. These considerations affect how applications run but leave room for
flexibility in how the architecture is implemented.
2. Microarchitecture (Organization)
* Implementation of the ISA: Microarchitecture is concerned with how to implement
the ISA in hardware. It refers to the underlying structure of the processor that
actually executes the instructions.
* Trade-offs in Microarchitecture: The design of a microarchitecture involves many
subtle decisions on how to balance factors such as performance, energy efficiency,
and manufacturing cost.
3. History and Evolution of ISA and Microarchitecture
* IBM’s Contribution: IBM played a key role in this evolution by introducing the
IBM System/360 in the 1960s, widely regarded as the first ISA defined separately
from the hardware that implements it. The System/360 consolidated multiple product
lines into a unified instruction set, allowing for different microarchitectural
implementations while maintaining compatibility at the ISA level.
4. Machine Model
The "machine model" of an instruction set architecture (ISA) refers to an abstract
representation of a computer processor that defines how software interacts with the
hardware. It outlines the available instructions, data types, addressing modes, and
other architectural details that enable the execution of programs on the processor.
5. Machine Model Types:
* Stack-Based Architecture:
* Operands are pushed onto the stack, and operations are done by popping
operands off the stack and pushing results back. This model is simpler but may have
performance inefficiencies due to the need for repeated memory accesses.
* Accumulator-Based Architecture:
* One operand is implicit (usually the accumulator, a dedicated register that
holds intermediate results), while the other operand comes from memory. This
simplifies instruction encoding but limits flexibility.
* Register-Memory Architecture:
* Operands come from memory and registers, and sometimes the destination is
also specified. This is more flexible than the accumulator model.
* Register-Register (Load-Store) Architecture (memory is accessed only through
load/store instructions such as LDR and STR):
* All operands come from and results go to registers. This is highly
efficient as it minimizes memory accesses, using registers for both operands and
results.
* Example: MIPS architecture, where multiple operands can be loaded into
registers before performing operations.
1. Performance Considerations:
* Stack-based models may lead to inefficiencies because of redundant memory
accesses (e.g., reloading values). In contrast, architectures like MIPS with
registers can load values once and perform multiple operations without needing to
reload them.
* The trade-off is that having more operands named (like in the register-
register model) requires more instruction space, but it improves performance by
reducing memory references.
2. Optimization with Registers:
* To optimize stack-based architectures, memory references can be minimized by
storing part of the stack in registers. This reduces the overhead of memory access
during operations but requires careful management of the stack size.
1. Instruction Set Architectures (ISA): ISAs include the fundamental machine model
and operations. They define how many registers a processor has and how it accesses
them (stack-based, accumulator, register-register, or register-memory
architecture).
2. Classes of Instructions:
* Data Transfer Instructions: These involve moving data between registers and
memory (e.g., loads, stores, and moves to control registers). (LD, ST, MFC1, MTC1,
MFC0, MTC0)
* Arithmetic Logic Unit (ALU) Instructions: Operations like addition,
subtraction, and comparisons (e.g., Set Less Than).( ADD, SUB, AND, OR, XOR, MUL,
DIV, SLT, LUI)
* Control Flow Instructions: Branches, jumps, and traps.( BEQZ, JR, JAL, TRAP,
ERET)
* Floating Point Instructions: Operations on floating point numbers, such as
addition, subtraction, and comparison.( ADD.D, SUB.S, MUL.D, C.LT.D, CVT.S.W)
* Conversion Operations: Converting between data types, such as floating point
to integer.
* Multimedia Instructions: These handle single instruction multiple data (SIMD),
useful in data parallelism and vector units. (ADD.PS, SUB.PS, MUL.PS, C.LT.PS)
3. Complex Instructions:
* Examples like REP MOVSB in x86, which copies strings, and complex operations
in VAX architecture that could perform advanced functions like Fast Fourier
Transforms.
4. Addressing Modes (illustrated with a short C sketch after this list):
* Register-Based Addressing: Involves registers to perform operations without
accessing memory.
* Immediate Addressing: Using constant values directly in operations.
* Displacement Addressing: Adding a constant to a register and using the result
as a memory address.
* Register Indirect: Accessing memory indirectly through a register.
* Absolute Addressing: Using a constant memory address directly.
* Memory Indirect: A two-step memory access through a register.
* Program Counter (PC) Relative Addressing: Used for position-independent code,
adding a displacement to the current PC value.
* Scaled Addressing: In x86, involves adding and multiplying register values,
useful for array indexing.
5. Data Types:
* Binary Integer: Includes one's complement and two's complement arithmetic.
* Binary Coded Decimal (BCD): Encodes each decimal digit with four bits for
exact calculations.
* Floating Point: Various types, often following the IEEE 754 standard, with
different precision and range trade-offs.
* Packed Vector Data: Used in multimedia extensions like MMX for simultaneous
operations on multiple data elements.
* Address Data Type: Separate from binary integers, used in older architectures
for address manipulation.
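
As a rough illustration of the addressing modes listed above, the C sketch below
maps ordinary source-level accesses to the modes a compiler might use; the exact
mode chosen for each line depends on the target ISA and compiler, so the mappings
in the comments are assumptions, not guarantees.

    /* Sketch: how common C accesses can map onto addressing modes.
       The mode named in each comment is typical for a MIPS- or x86-like
       target, but is an assumption rather than a fixed rule. */
    #include <stdio.h>

    int table[8] = {10, 20, 30, 40, 50, 60, 70, 80};

    int main(void) {
        int i = 3;
        int *p = &table[2];

        int a = i + 5;      /* immediate: the constant 5 is encoded in the instruction */
        int b = *p;         /* register indirect: address comes from the register holding p */
        int c = p[1];       /* displacement: constant offset (4 bytes) added to that register */
        int d = table[i];   /* scaled (x86): base address + i * 4 computed in one mode */

        printf("%d %d %d %d\n", a, b, c, d);
        return 0;
    }
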
Microcoded microarchitecture is a computer architecture that uses microcode to
translate machine instructions into sequences of circuit-level operations.
* A microprogrammed control unit (MCU) is a part of a computer's central
processing unit (CPU) that uses microinstructions to control the execution of
instructions.
* Microcode is a set of instructions stored in a processor's internal memory
that acts as an intermediary between the processor's hardware and the machine code.
Introduction to Pipelines
* Pipelines: Used in various systems, like factories and microprocessors, to
streamline processes by allowing multiple stages of work to happen simultaneously.
* Idealized Pipeline: Involves all objects going through every stage without
skipping or sharing resources between stages. Each stage should have equal
processing time, and operations should flow independently.
Pipelines in Microprocessors
* Microprocessor Pipelines: Different from assembly lines because instructions
can depend on previous instructions. This creates complexities like data
dependencies and control hazards.
* Unpipelined Processor: Processes instructions one at a time, taking one full
cycle to complete each instruction, leading to a long cycle time but only one cycle
per instruction.
Transition to Pipelined Design
* Breaking the Process: The instruction execution is broken down into stages,
each taking one cycle. This reduces the overall cycle time but increases the number
of cycles per instruction.
* Pipeline Stages: Common stages include:
* Instruction Fetch (IF): Retrieve the instruction.
* Instruction Decode/Register Fetch (ID/RF): Decode instruction and fetch
registers.
* Execution (EX): Perform the operation (e.g., arithmetic).
* Memory Access (MEM): Access data memory if needed.
* Write Back (WB): Write the result back to the register.

Challenges and Optimizations


* Multi-Cycle Pipelined Processor: Involves completing each instruction in
multiple cycles across different stages of the pipeline, allowing parallel
execution of multiple instructions.
* Pipeline Registers: Added between stages to separate instructions and maintain
correct operation flow.
* Performance Analysis: Focuses on optimizing factors like clock cycle time,
throughput (number of instructions processed per time unit), and efficiency.
Iron Law of Processor Performance
* Performance Metrics: The key aspects to consider are instruction count, cycles
per instruction, and clock cycle time. Optimizing these can enhance processor
performance.
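
These three factors are commonly combined into the "iron law": execution time =
instruction count x CPI x clock cycle time. The C sketch below is a minimal worked
example; the instruction count, CPI values, and cycle times are invented numbers
used only to make the arithmetic concrete.

    /* Iron law of processor performance:
       time = instructions * cycles_per_instruction * seconds_per_cycle
       Hypothetical numbers compare an unpipelined design (CPI = 1, long cycle)
       with a pipelined design (CPI slightly above 1 due to hazards, short cycle). */
    #include <stdio.h>

    int main(void) {
        double instructions = 1e9;   /* one billion dynamic instructions (assumed) */

        double unpipelined = instructions * 1.0 * 5e-9;   /* CPI 1.0, 5 ns cycle */
        double pipelined   = instructions * 1.2 * 1e-9;   /* CPI 1.2, 1 ns cycle */

        printf("unpipelined: %.2f s\n", unpipelined);
        printf("pipelined:   %.2f s\n", pipelined);
        printf("speedup:     %.2fx\n", unpipelined / pipelined);
        return 0;
    }
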
1. Hazards in Pipelines:
* Structural Hazards: Occur when an instruction needs a resource that is already
in use by another stage.
* Data Hazards: Happen when instructions depend on the result of earlier
instructions. For example, one instruction produces data that a subsequent
instruction needs.
* Control Hazards: Arise from control flow changes like branches or jumps, where
the processor needs to decide which instructions to execute next.
Approaches to resolving structural hazards
* Schedule: Programmer explicitly avoids scheduling instructions that would
create structural hazards
* Stall: Hardware includes control logic that stalls until earlier
instruction is no longer using contended resource
* Duplicate: Add more hardware to design so that each instruction can access
independent resources at the same time
Approaches to resolving data hazards
* Schedule: Programmer explicitly avoids scheduling instructions that would
create data hazards
* Stall: Hardware includes control logic that freezes earlier stages until
preceding instruction has finished producing data value
* Bypass: Hardware data path allows values to be sent to an earlier stage
before the preceding instruction has left the pipeline (see the sketch after
this list)
* Speculate: Guess that there is not a problem, if incorrect kill speculative
instruction and restart
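
The bypass idea boils down to a comparison of register numbers between pipeline
stages: if the instruction currently in EX writes a register that the instruction
in ID reads, the value can be forwarded instead of stalling. The struct fields and
function names in the sketch below are invented for illustration and do not
correspond to any particular textbook datapath.

    /* Hypothetical forwarding (bypass) condition for a 5-stage pipeline.
       Register 0 is treated as a hard-wired zero register, as in MIPS. */
    #include <stdio.h>

    typedef struct {
        int writes_reg;    /* 1 if the instruction writes a register */
        int dest;          /* destination register number */
        int src1, src2;    /* source register numbers */
    } Instr;

    int needs_bypass(const Instr *in_ex, const Instr *in_id) {
        return in_ex->writes_reg && in_ex->dest != 0 &&
               (in_ex->dest == in_id->src1 || in_ex->dest == in_id->src2);
    }

    int main(void) {
        Instr add = {1, 5, 1, 2};   /* add r5, r1, r2 */
        Instr sub = {1, 6, 5, 3};   /* sub r6, r5, r3 -- reads r5 produced by add */
        printf("bypass needed: %d\n", needs_bypass(&add, &sub));   /* prints 1 */
        return 0;
    }
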
Control hazards
Control hazards occur in pipelined processors when the control flow of a program
(e.g., branches, jumps) creates uncertainty about the next instruction to fetch.
This uncertainty can stall the pipeline, leading to performance degradation.
Managing control hazards is critical to improving the efficiency of modern CPUs.
Examples:
* Branch Instructions: Conditional or unconditional branches.
* Jump Instructions: Direct jumps or register-indirect jumps.
* Exceptions
* Interrupts
What do we need to calculate next PC?
* For Jumps :- Opcode, offset and PC
* For Jump Register :- Opcode and Register value
* For Conditional Branches :- Opcode, PC, Register (for condition), and
offset
* For all other instructions :- Opcode and PC
How to avoid control hazards
* In case of a Jump instruction: Speculate. Assume the next instruction is at
PC + 4 (no jump taken). If the assumption is wrong, the incorrectly fetched
instruction is discarded, i.e., the instruction following the jump is killed.
* Use a multiplexer (mux) to insert a "no-op" (no operation) into the pipeline
while the jump or branch is being resolved. Redirect the pipeline to the correct
instruction address once the jump/branch target is known.
________________

Techniques to Handle Control Hazards


To mitigate the performance loss due to control hazards, various techniques are
employed:
1. Pipeline Stall:
* The simplest solution is to stall the pipeline until the branch condition
is resolved.
* Disadvantage: Reduces throughput as stalls waste cycles.
2. Branch Prediction:
* Speculate the branch outcome (taken/not taken) and fetch the next
instruction accordingly.
* Static Prediction: Fixed strategy (e.g., always predict "not taken").
* Dynamic Prediction: Uses hardware to track branch behavior and improve
accuracy.
3. Speculative Execution:
* Assume a particular outcome (e.g., PC+4) and begin execution.
* If the speculation is incorrect, discard the incorrect instructions
(pipeline flush).
* Requires mechanisms to kill or overwrite speculative instructions.
4. Delayed Branch:
* Rearrange instructions so that the pipeline always executes useful
instructions during a branch.
* Example: Insert "delay slots" with instructions that execute regardless of
branch outcome.
5. Instruction Reordering:
* Rearrange instructions to avoid dependencies and reduce stalls.
6. Multiple Branch Prediction:
* Predict multiple branches ahead in the pipeline.
* Complex but effective in deeply pipelined architectures.
________________

Hardware Components for Handling Control Hazards


1. Branch Target Buffer (BTB):
* Stores branch addresses and outcomes.
* Speeds up branch resolution by caching predictions.
2. Multiplexers (MUX):
* Used to select between multiple possible next instructions.
* Insert no-ops or branch targets dynamically.
3. Extra Adders:
* Compute branch targets early in the pipeline to minimize delays.
4. Pipeline Flush Mechanism:
* Clears the pipeline of incorrect instructions when a branch misprediction
occurs.
________________

Evaluation of Techniques

Technique              | Advantage                                      | Disadvantage
Stall                  | Simple to implement                            | Reduces throughput significantly
Branch Prediction      | Improves efficiency for predictable branches   | Misprediction penalties
Speculative Execution  | Boosts performance for well-predicted branches | High complexity, power consumption
Delayed Branch         | Utilizes delay slots effectively               | Requires compiler support
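
To make the misprediction penalty column concrete, the small calculation below
estimates the effective CPI under a simple model; the branch frequency, prediction
accuracy, and penalty are assumed values chosen only for illustration.

    /* Effective CPI including branch mispredictions (assumed example numbers):
       CPI = base_CPI + branch_fraction * misprediction_rate * penalty_cycles */
    #include <stdio.h>

    int main(void) {
        double base_cpi     = 1.0;    /* ideal pipelined CPI */
        double branch_frac  = 0.20;   /* 20% of instructions are branches (assumed) */
        double mispred_rate = 0.10;   /* 90% prediction accuracy (assumed) */
        double penalty      = 3.0;    /* cycles lost per misprediction (assumed) */

        double cpi = base_cpi + branch_frac * mispred_rate * penalty;
        printf("effective CPI = %.2f\n", cpi);   /* prints 1.06 */
        return 0;
    }
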
________________

Memory Technology
1. Memory array: Register file (read/write wordline, read/write bitline)
2. Memory array: SRAM (bit, bit_b, word)
3. Memory array: DRAM (bit, word)

Memory Technology Trade-offs

Latches/Registers: low capacity, low latency, high bandwidth
Register File
SRAM
DRAM: high capacity, high latency, low bandwidth

Moving down this list, capacity increases while latency grows and bandwidth falls.
Caches and Memory Hierarchy
In a typical computer, the processor is connected to a memory system that stores
data and instructions. However, accessing data from main memory (e.g., DRAM) can be
significantly slower than the processor's speed. This gap in performance is the
motivation for introducing caches.
________________
Why Do We Need Caches?
1. Latency Issue:
* Accessing DRAM can take hundreds to thousands of clock cycles in modern
processors.
* For instance, a 2 GHz superscalar processor accessing a 100 ns DRAM can
execute about 800 instructions in the time it takes to fetch one piece of data from
memory.
2. Bandwidth Issue:
* Bandwidth, i.e., the amount of data that can be accessed per unit of time,
demanded by a fast processor is higher than what off-chip main memory can
easily supply.
3. Physical Constraints:
* Larger memory systems mean data has to travel farther, limited by physical
constraints like the speed of light and wire resistance. This increases access
time.
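
The 800-instruction figure quoted above follows from a back-of-the-envelope
calculation like the one below; the 4-instructions-per-cycle issue width is an
assumption made only to match the quoted number.

    /* Instructions a processor could have executed while waiting on one DRAM access. */
    #include <stdio.h>

    int main(void) {
        double clock_hz     = 2e9;      /* 2 GHz processor */
        double dram_latency = 100e-9;   /* 100 ns DRAM access */
        double issue_width  = 4.0;      /* assumed 4-wide superscalar issue */

        double stall_cycles = clock_hz * dram_latency;      /* 200 cycles */
        double lost_instrs  = stall_cycles * issue_width;   /* 800 instructions */

        printf("stall cycles: %.0f, instructions not executed: %.0f\n",
               stall_cycles, lost_instrs);
        return 0;
    }
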
________________

Introducing the Cache


A cache is a small, fast memory placed close to the processor. The memory hierarchy
is designed with the following characteristics:
* Small and Fast: Cache memory is much smaller but faster than DRAM.
* On-Chip vs. Off-Chip Bandwidth:
* On-chip communication (e.g., SRAM caches) is much faster and more
bandwidth-efficient than off-chip memory access (e.g., DRAM).
* Off-chip connections involve physical constraints, like solder balls or
pins, which are bulkier than on-chip wires.
The cache is effective only if it stores data the processor needs frequently.
Otherwise, the added complexity and power consumption of a cache would be wasteful.
________________

Memory Hierarchy
1. Registers:
* Smallest and fastest memory within the processor.
* Limited capacity, typically used for immediate computations.
2. SRAM (Cache):
* Larger than registers but still small and faster than DRAM.
* Divided into multiple levels: L1 (closest to the processor, smallest and
fastest), L2, and sometimes L3.
3. DRAM (Main Memory):
* Larger and slower, used for storing active programs and data.
4. Storage (e.g., HDD, SSD):
* Largest and slowest, used for long-term data storage.
________________

Exploiting Access Patterns in Caches


Caches exploit two main principles of memory access:
1. Temporal Locality:
* Recently accessed data is likely to be accessed again soon.
* Example: A loop repeatedly executing the same instructions.
2. Spatial Locality:
* Data near recently accessed addresses is likely to be accessed soon.
* Example: Accessing consecutive array elements.
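
The toy loop below shows both kinds of locality at once; it only illustrates the
access pattern and does not measure cache behaviour.

    /* Temporal locality: sum and i are reused on every iteration.
       Spatial locality: a[0], a[1], a[2], ... are adjacent in memory, so each
       cache line fetched for one element also brings in its neighbours. */
    #include <stdio.h>

    int main(void) {
        int a[1024];
        long sum = 0;

        for (int i = 0; i < 1024; i++)
            a[i] = i;

        for (int i = 0; i < 1024; i++)   /* sequential walk over the array */
            sum += a[i];

        printf("sum = %ld\n", sum);
        return 0;
    }
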
________________
Key Challenges in Cache Design
1. Balancing Size and Speed:
* Larger caches can store more data but may increase latency.
* A multi-level cache hierarchy (L1, L2, L3) balances these trade-offs.
2. Cache Coherence in Multicore Systems:
* In systems with multiple processors, ensuring all cores have a consistent
view of memory is complex.
3. Eviction Policies:
* Deciding which data to replace in the cache when it is full (e.g., Least
Recently Used, or LRU policy).
________________

Big endian and little endian are two ways of storing data in a computer's memory.
The difference between the two is the order in which the bytes are stored.

Big endian
* The most significant byte is stored first, at the lowest memory address
* Big endian is the dominant order in network protocols
Little endian
* The least significant byte is stored first, at the lowest memory address
* Most PCs use little-endian format
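
A quick way to observe byte order on a particular machine is to inspect the bytes
of a multi-byte integer, as in the sketch below; the output naturally depends on
the host it is run on.

    /* Prints the byte layout of a 32-bit value to reveal the host's endianness. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t value = 0x01020304;
        unsigned char *bytes = (unsigned char *)&value;

        printf("bytes at increasing addresses: %02x %02x %02x %02x\n",
               bytes[0], bytes[1], bytes[2], bytes[3]);
        /* little endian prints: 04 03 02 01 (least significant byte at lowest address)
           big endian prints:    01 02 03 04 (most significant byte at lowest address) */
        return 0;
    }
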
________________

3. Cache Components and Functionality


* Basic setup:
* Data flows between the processor, cache, and main memory.
* Caches store both the data and an address tag to identify where the data
belongs in main memory.
* Cache line:
* Contains two key parts:
1. Tag: A subset of the main memory address for identification.
2. Block: The actual data.
* Includes a valid bit to indicate if the data in the cache is valid.
* Load process:
1. Processor checks if the requested data exists in the cache (cache hit).
2. If not (cache miss), it fetches the data from main memory, replaces a
cache line (based on a replacement policy), and stores the new data in the cache.
________________

4. Cache Classifications
Caches are classified based on:
1. Block placement: Where data can be stored in the cache.
2. Block identification: How data is found in the cache.
3. Block replacement: What happens to existing data when new data is added.
4. Write strategy: How the cache handles data writes (e.g., updating main
memory or only the cache).
________________

5. Types of Caches
* Direct-mapped cache:
* Each block of data has one fixed location in the cache.
* Uses a simple indexing function: block number mod number of blocks in the
cache (see the sketch after this list).
* Example: Block 12 in a cache with 8 blocks → 12 mod 8 = 4, so it can only go in
slot 4.
* Advantage: Simple and easy to build.
* Limitation: May cause frequent conflicts if multiple blocks map to the same
location.
* Set-associative cache:
* Each block can go into one of multiple locations (ways).
* Example: A 2-way set-associative cache with 4 buckets → Block 12 (12 mod 4
= 0) can go into one of two ways in bucket 0.
* Modern processors like Intel Core i7 use higher associativity (e.g., 8-way
or 16-way) for better performance.
* Advantage: Reduces conflicts compared to direct-mapped caches.
* Fully associative cache:
* A block can be stored anywhere in the cache.
* Advantage: No restrictions on placement.
* Limitation: Complex and expensive to implement.
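
The placement rules above come down to simple arithmetic on the block number. The
sketch below splits an address into offset, index, and tag for a direct-mapped
cache; the geometry (8 blocks of 64 bytes) and the example address are assumed
values.

    /* Splitting an address for a direct-mapped cache with 8 blocks of 64 bytes:
         offset = address mod block_size
         index  = block_number mod number_of_blocks   (the "block number % cache size" rule)
         tag    = block_number / number_of_blocks     */
    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_SIZE 64u
    #define NUM_BLOCKS 8u

    int main(void) {
        uint32_t address = 0x1A34;   /* assumed example address */

        uint32_t block_number = address / BLOCK_SIZE;
        uint32_t offset = address % BLOCK_SIZE;
        uint32_t index  = block_number % NUM_BLOCKS;
        uint32_t tag    = block_number / NUM_BLOCKS;

        printf("block %u -> index %u, tag %u, offset %u\n",
               block_number, index, tag, offset);
        return 0;
    }
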
________________

This section explains the strategies used for handling cache writes and
differentiates between them based on how and where data is written when changes
occur.
Cache Write Policies
1. Cache Hit
a. Write Through
* Writes data to both the cache and the main memory simultaneously.
* Ensures consistency but increases bandwidth usage
* Advantages: Simple design, data in memory is always consistent with cache.
* Disadvantages: Increases memory traffic, leading to slower performance.
b. Write Back
* Writes data only to the cache initially. The main memory is updated only
when the cache block is evicted (removed from cache).
* Uses a dirty bit to track modified blocks, avoiding unnecessary write-
backs.
* Advantages: Reduces memory traffic and bandwidth demand, improving
performance.
* Disadvantages: More complex to design, as it requires mechanisms to track
dirty blocks.
2. Cache Miss
a. No Write Allocate
* Writes directly to main memory without bringing the block into the cache.
* Advantages: Simpler design.
* Disadvantages: Cache remains unutilized for future accesses to the same
block.
b. Write Allocate
* Fetches the block into the cache, then writes to it.
* Advantages: Allows future writes and reads to use the cache for better
performance.
* Disadvantages: Higher initial latency as the block must be fetched.
________________

Common Combinations
* Write Through & No Write Allocate
* Simpler design, ensures data consistency but increases traffic to main
memory.
* Write Back & Write Allocate
* Reduces memory traffic, improves performance but requires more complexity
to track dirty blocks.
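
A minimal sketch of the write back plus write allocate combination is shown below:
a write only updates the cache and sets the dirty bit, and main memory is updated
when the dirty block is later evicted. The one-block "cache" and the function
names are invented purely to expose the dirty-bit mechanism.

    /* Toy one-block cache illustrating write back + write allocate. */
    #include <stdio.h>

    #define MEM_BLOCKS 16

    static int memory[MEM_BLOCKS];
    static int cached_block = -1;   /* which memory block the cache currently holds */
    static int cached_value;
    static int dirty;

    static void access_block(int block, int is_write, int value) {
        if (cached_block != block) {            /* miss */
            if (cached_block >= 0 && dirty)     /* write back the old dirty block */
                memory[cached_block] = cached_value;
            cached_block = block;               /* write allocate: fetch the block */
            cached_value = memory[block];
            dirty = 0;
        }
        if (is_write) { cached_value = value; dirty = 1; }   /* write hits cache only */
    }

    int main(void) {
        access_block(3, 1, 42);   /* write to block 3: memory is not touched yet */
        printf("memory[3] before eviction: %d\n", memory[3]);   /* still 0 */
        access_block(7, 0, 0);    /* miss on block 7 evicts dirty block 3 */
        printf("memory[3] after eviction:  %d\n", memory[3]);   /* now 42 */
        return 0;
    }
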
________________

1. Cache Structure:
* A cache typically has data blocks, tags, and a valid bit for each entry.
* Data blocks hold actual data, the tag helps identify the memory address
stored in the block, and the valid bit indicates if the block is in use.
2. Replacement Policies:
* Random: Simplest but less effective for temporal locality.
* Least Recently Used (LRU): Replaces the block that hasn't been accessed for
the longest time.
* Accurate LRU tracking is complex for highly associative caches.
* First In, First Out (FIFO): Replaces the oldest block in the set.
* Not Most Recently Used (NMRU): Replaces any block except the most recently
accessed one.
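
The sketch below picks an LRU victim for one small set by keeping a "last used"
timestamp per way; real hardware typically uses cheaper approximations, and the
way count and timestamps here are assumed example values.

    /* LRU victim selection for a 4-way set: evict the way whose last_used
       timestamp is oldest. */
    #include <stdio.h>

    #define WAYS 4

    int main(void) {
        int last_used[WAYS] = {12, 7, 30, 19};   /* hypothetical access times */

        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (last_used[w] < last_used[victim])
                victim = w;

        printf("evict way %d (least recently used)\n", victim);   /* way 1 */
        return 0;
    }
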
1. Goal of Cache (Iron Law): The main purpose of a cache is to improve
performance by speeding up data retrieval, which also reduces power consumption.
By decreasing the average time it takes to access data, the memory component of
clocks per instruction shrinks, so programs run faster. Caches try to avoid
accessing slower main memory, which takes much more time.
2. Types of Cache Misses:
* Compulsory Miss: Occurs when accessing a block of data for the first time.
You can't avoid this miss, but techniques like prefetching data (anticipating
future access) can reduce it.
* Capacity Miss: Happens when the cache is too small to store all the needed
data, causing data to be evicted and reloaded. Larger caches generally lower the
miss rate.
* Conflict Miss: Arises when two or more pieces of data compete for the same
cache location due to insufficient associativity (how the cache handles multiple
entries). Properly managing cache mappings can reduce conflicts.
3. Cache Size and Miss Rate:
* Larger caches generally have fewer misses because they can store more data,
but this is not always true in every situation. The relationship between cache size
and miss rate depends on the type of data access pattern.
* A common rule of thumb is that doubling the cache size typically reduces
the miss rate by a factor of the square root of two. This is called the "square
root rule" (see the worked example after this list).
4. Cache Block Size:
* Larger cache blocks (chunks of data) can reduce overhead and improve
performance by transferring more data at once. However, if the block size is too
large, it may lead to wasting memory bandwidth because only a small part of the
block might be needed.
* Smaller block sizes can improve efficiency in cases with more random data
access but may increase overhead.
5. Improving Cache Efficiency:
* Reducing Cache Access Time: Small and simple caches are sometimes better as
they can reduce access time, which is beneficial for performance.
* Increasing Cache Associativity: Using more ways in a cache (e.g.,
converting a 2-way associative cache to 4-way) can reduce conflict misses but
increases power consumption and may slightly add to access time.
6. Empirical Guidelines:
* A rule of thumb is that increasing cache size will reduce the miss rate,
but only up to a certain point. Doubling the cache size typically improves
performance, but with diminishing returns.
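
As a worked example of the square root rule mentioned in point 3, the sketch below
shows how the miss rate shrinks as the cache is repeatedly doubled; the 10%
starting miss rate is an arbitrary assumed figure.

    /* Square root rule of thumb: doubling the cache size divides the miss rate
       by roughly sqrt(2). Compile with -lm for the math library. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double miss_rate = 0.10;   /* assumed miss rate for the baseline cache size */

        for (int doublings = 0; doublings <= 3; doublings++) {
            printf("cache size x%d: miss rate ~ %.3f\n", 1 << doublings, miss_rate);
            miss_rate /= sqrt(2.0);
        }
        return 0;
    }
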
________________

A superscalar processor is a central processing unit (CPU) that can execute
multiple instructions at once during a clock cycle. It is also known as a
multiple-issue processor.
* A superscalar processor checks for resource conflicts to determine which
instructions can be executed simultaneously.
* It uses a pipeline stage logic circuit and an instruction window to detect
and select independent instructions.
* It uses techniques like parallel instruction decoding, speculative
execution, and out-of-order execution.
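
The check for independent instructions can be sketched as a simple
register-dependence test between two candidate instructions, as below; the
structure and field names are invented for illustration, and the test ignores
structural conflicts over functional units.

    /* Hypothetical dual-issue check: two instructions may issue together if the
       second neither reads nor overwrites the first one's destination register. */
    #include <stdio.h>

    typedef struct { int dest; int src1; int src2; } Op;

    int can_dual_issue(Op first, Op second) {
        int raw = (second.src1 == first.dest) || (second.src2 == first.dest);
        int waw = (second.dest == first.dest);
        return !raw && !waw;
    }

    int main(void) {
        Op a = {3, 1, 2};   /* r3 = r1 op r2 */
        Op b = {5, 4, 6};   /* r5 = r4 op r6, independent of a */
        Op c = {7, 3, 1};   /* r7 = r3 op r1, depends on a */

        printf("issue a with b: %d, issue a with c: %d\n",
               can_dual_issue(a, b), can_dual_issue(a, c));   /* 1, 0 */
        return 0;
    }
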
Benefits
* Superscalar processors can improve instructions per cycle (IPC) and reduce the
cycles per instruction (CPI) count
* They can achieve high performance when the instructions to be executed are
independent.
* Superscalar processors are commonly used in desktop and server systems.
* The Intel Core i7 processor is an example of a superscalar processor.
________________

Superscalar processors can be either in-order or out-of-order.

* In-order: instructions are issued and executed in program order, as written.
* Out-of-order: instructions are executed in whatever order keeps the hardware
busy, as long as data dependencies are respected, which is more efficient.
Fetch Logic and Alignment Constraints (look into Lecture 4 PPT)
For an instruction fetched immediately after a jump, the fetch cycle still
executes, but the remaining cycles are killed because the preceding jump
instruction redirects control flow.
________________
