Computer Architecture
Computer organisation
Memory components
▪ D latch
o Stores the state value while the clock input C is not asserted
o When C is asserted the value of input D replaces the value of Q
▪ flip-flop
o The output is equal to the value of the stored state
o The internal state is changed only on clock edge
▪ A register is a flip-flop with several bits
o An n-bit register consists of n flip-flops, with n inputs, n outputs and 1 clock
o Various types of registers are available commercially
▪ In a shift register the output of flip-flop i is connected to the input of flip-flop i+1
▪ A register file is an array of registers
o Each register can be read by supplying its register number
▪ Random Access Memory
o Larger amounts of memory than registers
o Slower access than registers
o Organised as arrays of 2^m rows of n bits
▪ m bits needed to select a row
▪ Read Write Selector (RWS) control bit
• RWS = 0, RAM reads the address and the contents are
available in O
• RWS=1, RAM writes I at the address
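The behaviour of these components can be sketched in software. The following is a minimal behavioural model, not a circuit description. The class and method names (DLatch, DFlipFlop, RAM, update, access) are made up for illustration, and the flip-flop is assumed to react to the rising clock edge, which the notes above do not specify.

```python
class DLatch:
    """Level-sensitive storage: Q follows D while C is asserted."""

    def __init__(self):
        self.q = 0

    def update(self, c, d):
        if c:                # C asserted: the value of D replaces Q
            self.q = d
        return self.q        # C not asserted: the stored state is kept


class DFlipFlop:
    """Edge-triggered storage (assumption: rising edge)."""

    def __init__(self):
        self.q = 0
        self._prev_c = 0

    def update(self, c, d):
        if c and not self._prev_c:   # state changes only on the clock edge
            self.q = d
        self._prev_c = c
        return self.q


class RAM:
    """2**m rows of n bits, driven by a Read Write Selector (RWS) bit."""

    def __init__(self, m, n):
        self.m, self.n = m, n
        self.rows = [0] * (2 ** m)   # m address bits select one of 2**m rows

    def access(self, rws, address, i=0):
        if not 0 <= address < 2 ** self.m:
            raise ValueError("address does not fit in m bits")
        if rws == 1:                 # RWS = 1: write input I at the address
            self.rows[address] = i & ((1 << self.n) - 1)
            return None
        return self.rows[address]    # RWS = 0: contents available in O


latch = DLatch()
latch.update(c=1, d=1)
latch_held = latch.update(c=0, d=0)   # C not asserted: Q keeps 1

ff = DFlipFlop()
ff.update(c=1, d=1)                   # rising edge: captures 1
ff_held = ff.update(c=1, d=0)         # C still high, no new edge: Q keeps 1

ram = RAM(m=4, n=8)
ram.access(rws=1, address=3, i=0xAB)
ram_out = ram.access(rws=0, address=3)
print(latch_held, ff_held, hex(ram_out))
```

The difference between the latch and the flip-flop shows in the second call to each: both hold their value here, but the latch would follow D for as long as C stayed asserted, whereas the flip-flop ignores D until the next edge.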
Memory hierarchy
Memory organisation
▪ Endianness
o The order in which the bytes of a multi-byte value are stored in memory
▪ Big-Endian
o Byte with most significant value: stored first (lowest memory address)
o Used in data networking and mainframes
o Motorola 68000 and PowerPC G5 are big-endian
▪ Little-Endian
o Byte with least significant value: stored first (lowest memory address)
o Intel x86 and AMD64 processor families and most microprocessors
▪ Some architectures support both
o E.g. Arm and IBM POWER in full, recent x86 and x86-64 have limited
support (movbe)
Big endian
▪ The location address points to the big end of the number
o Like writing the bytes left-to-right
Little Endian
▪ The location address points to the little end of the number
o Like writing the bytes right-to-left
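As a small illustration of the two byte orders, the sketch below lays out the same 32-bit value both ways using Python's built-in int.to_bytes. The value 0x12345678 is an arbitrary example.

```python
import sys

value = 0x12345678

# Big-endian: the most significant byte (0x12) sits at the lowest address.
big = value.to_bytes(4, byteorder="big")

# Little-endian: the least significant byte (0x78) sits at the lowest address.
little = value.to_bytes(4, byteorder="little")

print(big.hex(" "))      # 12 34 56 78
print(little.hex(" "))   # 78 56 34 12

# Reading the bytes back only gives the original value
# when the same byte order is used.
restored = int.from_bytes(little, byteorder="little")
misread = int.from_bytes(little, byteorder="big")
print(hex(restored), hex(misread))

# Byte order of the machine running this script.
print(sys.byteorder)
```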
Endianness in Python
▪ Handling binary data
o stored in files
o or from network connections
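One common way to handle such binary data in Python is the standard struct module, where the first character of the format string selects the byte order (">" or "!" for big-endian/network order, "<" for little-endian). The record layout below (a 32-bit number followed by a 16-bit port) is a made-up example, not a real file or protocol format.

```python
import struct

# Hypothetical record: one unsigned 32-bit value and one unsigned 16-bit value.
number_out, port_out = 0xDEADBEEF, 80

# Pack in big-endian order, as is usual for data sent over a network.
record = struct.pack(">IH", number_out, port_out)
print(record.hex(" "))   # de ad be ef 00 50

# The reader must use the same byte order to recover the values.
number, port = struct.unpack(">IH", record)

# Decoding with the wrong byte order silently yields different numbers.
wrong_number, wrong_port = struct.unpack("<IH", record)
print(number, port, wrong_number, wrong_port)
```

The same format strings work on bytes read from a file opened in binary mode ("rb") or received from a socket.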
Computer organisation
Central Processor Unit
Grouping operations together
Computation – programs
▪ Compute the sum of two vectors
o Vectors = data; data is stored in memory
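A minimal sketch of this computation, written as an explicit loop so that each memory access is visible. The function name vector_sum is chosen for illustration only.

```python
def vector_sum(a, b):
    """Return c with c[i] = a[i] + b[i]; a and b are data held in memory."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    c = [0] * len(a)
    for i in range(len(a)):
        # Each iteration reads a[i] and b[i] from memory,
        # adds them, and writes the result back to c[i].
        c[i] = a[i] + b[i]
    return c


result = vector_sum([1, 2, 3], [10, 20, 30])
print(result)   # [11, 22, 33]
```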
Data operations
Processors
▪ M Chips
o N cores/chip
o T threads/core
▪ LLC – last level cache memory
▪ What do we need?
o A program – sequence of instructions
▪ Or multiple sequences... if concurrent/parallel
o Data – operands should reach the instructions
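The M, N and T figures multiply to give the number of hardware threads a system can run at once. The sketch below shows that arithmetic. The function name and the example figures (2 chips, 24 cores, 2 threads per core) are illustrative assumptions, not a description of a specific machine.

```python
import os


def hardware_threads(chips, cores_per_chip, threads_per_core):
    """Total hardware threads = M chips x N cores/chip x T threads/core."""
    return chips * cores_per_chip * threads_per_core


# Hypothetical system: 2 chips, 24 cores per chip, 2 threads per core.
total = hardware_threads(chips=2, cores_per_chip=24, threads_per_core=2)
print(total)   # 96

# Number of logical CPUs the operating system reports for this machine
# (may be None if it cannot be determined).
print(os.cpu_count())
```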
Hardware Thread
▪ Each hardware thread independently...
o Fetches instructions*
o Decodes
o Issues load memory accesses*
o Executes*
o Stores results*
*When a single thread executes per core, that thread has all the core's
resources available!
- Memory bandwidth
- Functional units
▪ Multithreading
o Execute multiple threads in parallel
Software Thread
▪ The instruction flow of a given running program. Any program has at least one
thread.
o Single-Threaded
o Multi-Threaded: execute multiple threads in parallel or concurrently
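A multi-threaded program can be sketched with Python's standard threading module. Here each software thread adds one slice of two vectors. The function names and the choice of 3 threads are illustrative assumptions. Note that in the standard CPython build the global interpreter lock means these threads run concurrently rather than truly in parallel for this kind of CPU-bound work.

```python
import threading


def add_slice(a, b, c, start, stop):
    """Work done by one software thread: c[i] = a[i] + b[i] on its slice."""
    for i in range(start, stop):
        c[i] = a[i] + b[i]


def threaded_vector_sum(a, b, num_threads=3):
    n = len(a)
    c = [0] * n
    chunk = (n + num_threads - 1) // num_threads   # ceiling division
    threads = []
    for t in range(num_threads):
        start = t * chunk
        stop = min(start + chunk, n)
        # Each thread writes a disjoint slice of c, so no locking is needed.
        th = threading.Thread(target=add_slice, args=(a, b, c, start, stop))
        threads.append(th)
        th.start()
    for th in threads:
        th.join()                                  # wait for all threads
    return c


summed = threaded_vector_sum(list(range(10)), [1] * 10)
print(summed)
```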
Hardware multithreading
▪ Each hardware thread independently...
o Fetches instructions*
o Decodes
o Issues load memory accesses*
o Executes*
o Stores results*
*When a single thread executes per core, that thread has all the core's
resources available!
- Memory bandwidth
- Functional units
Detailed memory access
Sample code
▪ Computing on vectors a, b, and c
▪ Accesses reference main memory locations, not cache locations
o Cache memories are transparently managed by the hardware
o Memory coherency: any read from any processor to a particular memory address
returns the most recently written value to that address
o Memory consistency: ensure writes to different memory addresses will be seen in the
correct order from all processors
Code generation details
Code execution details
Core details
▪ Instructions use registers to bring data to the thread
o Load/store instructions move data between memory and registers (also mov, add, mul...)
o Computation instructions use the ALUs to process data (add, mul...)
o Control instructions break the execution sequence (conditionally...)
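The three instruction classes can be illustrated with a toy interpreter. This is a sketch of an invented mini instruction set (load, store, add, mul, jnz), not the encoding of any real processor; all mnemonics, register names and the run function are hypothetical.

```python
def run(program, memory):
    """Execute a list of (opcode, operands...) tuples on a toy machine."""
    regs = {}
    pc = 0                                   # program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1                              # default: next instruction
        if op == "load":                     # memory -> register
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "store":                  # register -> memory
            rs, addr = args
            memory[addr] = regs[rs]
        elif op == "add":                    # computation in the ALU
            rd, ra, rb = args
            regs[rd] = regs[ra] + regs[rb]
        elif op == "mul":
            rd, ra, rb = args
            regs[rd] = regs[ra] * regs[rb]
        elif op == "jnz":                    # control: jump if register != 0
            rs, target = args
            if regs[rs] != 0:
                pc = target                  # breaks the execution sequence
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


mem = [3, 4, 0]                              # a, b and space for the result
prog = [
    ("load", "r0", 0),
    ("load", "r1", 1),
    ("add", "r2", "r0", "r1"),
    ("jnz", "r2", 5),                        # sum is non-zero: skip the mul
    ("mul", "r2", "r0", "r1"),
    ("store", "r2", 2),
]
run(prog, mem)
print(mem)   # [3, 4, 7]
```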
Complete processor/memory system
▪ Most commonly, systems have two or more chips
o NUMA – Non-Uniform Memory Access
Example of multiprocessor motherboard
Software/hardware mapping
Current processor chips
▪ Intel Xeon E7 v4 family
o 14 nm technology
o 24 cores / hyperthreading (2), 2.2 – 3.4 GHz.
o L3 cache 60MB.
o MAX CPU supported 8 sockets
o MAX RAM 3.07 TB, 1866 MHz, 4 memory channels
o PCIe x4, x8, x16
▪ IBM Power 9
o 14 nm technology
o 24 cores / SMT (8), 3.0 – 4.0 GHz.
o L1 caches 32+32 KB
o L2 cache 512 KB.
o L3 cache 120MB
o MAX CPU supported 4-8 and more sockets
o 2 TB MAX RAM DDR4
o PCIe v4 x4, x8, x16
▪ Intel KNL – Xeon Phi 72x5
o 14 nm technology
o 72 cores 1.5 – 1.6 GHz.
o L2 cache 36 MB.
o MAX CPU supported 1 socket?
o MAX RAM 384 GB DDR4
o PCIe v3 x4, x8, x16
▪ ARM Cortex-A77
o 7 nm technology
o aarch64 – ARMv8-A
o 4-8 cores
o DynamIQ Technology – (big.LITTLE)
▪ ARM – A64FX (Fujitsu)
o 7 nm
o ARMv8.2
o 48 cores
o 512-bit SIMD Scalable Vector Extensions (SVE)
▪ Apple M3
o 3 nm technology
o 4.05 GHz performance, 2.76 GHz efficiency
o aarch64 – ARMv8.6-A
o 4 performance cores + 4 efficiency cores
o L1 cache 192+128 KiB per performance core
o L1 cache 128+64 KiB per efficiency core
o L2 cache 16 MiB
o RAM 8-24 GB
o GPU 8-10 cores
Computer organisation
Input/Output components
▪ The I/O Bus extends the access to
o Accelerators (GPUs, FPGAs)
o Disks
o Network
o Human-Machine Interface Peripherals
Accelerators
Access to accelerators/devices/peripherals
SATA and HMI Peripherals
Storage and file systems
Networking
▪ Send/receive information to
o Servers
o Network-attached disks
▪ Protocols
o Low-level – Ethernet packets
o High-level – TCP/IP
▪ Control based on memory mapped configuration registers
o Access from the OS
▪ Data transfers based on DMA engines
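At the low level, an Ethernet frame starts with a 14-byte header: destination MAC address, source MAC address and a 16-bit EtherType in network (big-endian) byte order. The sketch below builds and parses such a header with the standard struct module. It leaves out the preamble and frame check sequence that the hardware handles, and the MAC addresses and payload are made-up example values.

```python
import struct

# "!" selects network (big-endian) byte order:
# 6-byte destination MAC, 6-byte source MAC, 16-bit EtherType.
ETH_HEADER = struct.Struct("!6s6sH")

ETHERTYPE_IPV4 = 0x0800


def build_frame(dst, src, ethertype, payload):
    """Prepend an Ethernet header to the payload bytes."""
    return ETH_HEADER.pack(dst, src, ethertype) + payload


def parse_frame(frame):
    """Split a frame into header fields and payload."""
    dst, src, ethertype = ETH_HEADER.unpack_from(frame)
    return dst, src, ethertype, frame[ETH_HEADER.size:]


frame = build_frame(
    dst=bytes.fromhex("ffffffffffff"),       # broadcast address
    src=bytes.fromhex("020000000001"),       # example locally administered MAC
    ethertype=ETHERTYPE_IPV4,
    payload=b"hello",
)
dst, src, ethertype, payload = parse_frame(frame)
print(dst.hex(":"), src.hex(":"), hex(ethertype), payload)
```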
Virtual Machine (VM)