Fundamentals of
Operating Systems
Build efficient software by understanding how the OS
works
husseinnasser
Introduction
● Welcome
● Who is this course for?
● Course Outline
Introduction to OS
Understanding the need for an OS
Why OS?
Why do we need an operating system?
● Scheduling is critical
● Fair share of resources to all processes
○ CPU/Memory/IO
[Diagram: the OS scheduling processes P1 and P2]
System Architecture
The components the operating system operates
● Persistent storage
○ HDD, SSD
○ Slower than memory
● Network
● User space
○ Browser
○ Postgres
● Protected kernel space
○ Kernel code
○ Device drivers
○ TCP/IP stack
Program vs Process
A process is a program in motion
[Diagram: C code → compile → assembly → executable file (program, e.g. postgres.exe) → run → process]
[Diagram: loading your process into memory next to processes 1-3 costs time; the process gets a stack, heap, static data, and CPU registers (r0-r3, pc)]
[Diagram sequence: the program counter (pc) starts at 0xaabbcc00 in the text (code) section; the CPU reads the instruction at the pc from memory into the instruction register (ir), increments the pc to 0xaabbcc04, executes, and repeats for the next instruction.]
Note that while I'm representing the second instruction fetch as a memory access, it is often retrieved from the CPU L caches. This is because when we fetch data from memory, we fetch not only what we need but much more, in what is called a burst.
This is the beauty of understanding caches and the power of keeping related code nearby.
[Diagram sequence continues: execute; increment pc to 0xaabbcc08; load (fetch) the add instruction; execute the add (r0=1 + r1=3 = 4); increment pc to 0xaabbcc12 and fetch; execute the store, writing the result to memory (c=4 at 0xffeeddcc).]
Writing from the CPU to memory also involves the CPU L caches, which we will discuss further as we progress through the course.
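To make the cycle concrete, here is a minimal sketch (my own illustration, not from the slides) of a fetch/decode/execute loop in C; the opcode names and the tiny three-register machine are made up for the example:

```c
#include <stdio.h>

/* Hypothetical opcodes for the c = a + b walkthrough. */
enum { LOAD_IMM, ADD, STORE, HALT };

int main(void) {
    /* "text section": {opcode, register, value/register} triples */
    int code[][3] = {
        {LOAD_IMM, 0, 1},  /* r0 = 1 (a)          */
        {LOAD_IMM, 1, 3},  /* r1 = 3 (b)          */
        {ADD, 0, 1},       /* r0 = r0 + r1        */
        {STORE, 0, 0},     /* memory[0] = r0 (c)  */
        {HALT, 0, 0},
    };
    int r[2] = {0}, memory[1] = {0}, pc = 0;
    for (;;) {
        int *ir = code[pc++];              /* fetch, then increment pc */
        switch (ir[0]) {                   /* decode                   */
        case LOAD_IMM: r[ir[1]] = ir[2]; break;           /* execute  */
        case ADD:      r[ir[1]] += r[ir[2]]; break;
        case STORE:    memory[ir[2]] = r[ir[1]]; break;
        case HALT:     printf("c=%d\n", memory[0]); return 0;
        }
    }
}
```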
Summary
The Stack
Introducing the Stack
● The stack is a brilliant data structure
● Functions have local variables
● Each function gets a frame
[Diagram: stack with frames for main (int a at 0xff999999, int b at 0xff999995, int c at 0xff999991), function1, and function2 (int z)]
● Stack pointer (sp) is a CPU register
● Used to allocate and deallocate memory
● Points at the end of the stack
Allocate (push)
[Diagram: main's frame with int c at 0xff999991; sp moves to 0xff99998d]
● The stack pointer changes
● We need a fixed reference
● Base pointer (frame pointer)
○ Also a CPU register
● To reference variable a, use bp
● Variable b is at bp - 4
Nested calls
● Main calls func1
● Main pauses, func1 executes
● Base and stack pointer change
[Diagram: stack from 1024 down; main's a (1024), b (1020), c (1018); func1's x and y; bp=1014, sp=1006]
From now on I'll switch to decimal memory addresses for readability.
Nested calls
● But what if func1 returns?
● sp = sp + 8
● We lost main's base pointer
● We need to save it
[Diagram: same stack; bp and sp both point into func1's frame]
Previous bp
● main calls func1
[Diagram: main's frame (a, b, c) with bp=1024, sp=1014]
Previous bp
● CPU allocates func1's memory
● The old bp, x, and y need 12 bytes
● sp = sp - 12
[Diagram: sp moves from 1014 to 1002]
● bp can now be safely changed
● bp and sp now reference func1
● func1 is active
[Diagram: func1's frame holds the old bp (main's bp) at 1014, int x at 1010, int y at 1006; bp=1014, sp=1002]
Func1 returns
● When func1 is done
● Use the saved old bp to restore the current bp
● That is a memory read
Func1 returns
● Set sp back
● Deallocate: sp = sp + 12
● Main is now active
● But where in main should we go?
Return address
● Main still has work to do
● It called func1 but lost its place
● Need to store a return address
● Which becomes the pc on return
● Stored in the link register (lr) in the CPU
[Diagram: func1's frame now also holds main's return address]
Summary
Main start
(Diagrams in this walkthrough show the stack growing down from 1024, code at addresses 374-400 (func1 at 374, main around 385-400), and the CPU registers pc, bp, sp, lr, and r0-r3 at each step.)
● Main start: process registers are initialized; sp points to the top of the stack (1024) by default, and bp and lr point to where the kernel was before the process executes.
● Main called: set pc = 385; we fetch the instruction from memory, execute it, and expand the stack by 20. Once we have fetched it we increment the pc, so the pc points to the next instruction.
● Save registers: add one to the pc, fetch the next instruction, and execute it; we save bp and lr to our stack because we might overwrite them.
● Set base pointer: next instruction, set our base pointer for main (bp = 1024).
● a=1: next instruction, store 1 in r0; this is a pure CPU operation.
● a=1: next instruction, store r0 to memory (a=1 at 1016).
● b=3: we execute the next two instructions at once, setting 3 in r0 and storing it in memory (b=3 at 1012).
● call func1: we prepare to call func1; first we need to set the return address to the next instruction in main, so lr will be 394 in this case.
● Jump to func1: we set the program counter to the first line of func1, which is 374, load the instruction, and increment the pointer.
● Store main's bp in func1: we store main's bp, which is in the register, into func1's frame for later.
● Set bp for func1: bp now points to func1's frame (1004).
● z=1: we store 1 in r0 and push it to func1's memory.
● z=z+1: add 1 to r0.
● z=z+1: store r0 to z's address, bp - 4, address 1000.
● About to exit func1: we load bp back to main's original bp.
● Jump to main, restore vars: we deallocate func1 and jump to main by setting pc = lr; we load instruction 394, execute it, and restore a from memory to r0 (r0 is 1), because we lost the registers when we executed func1.
● Restore b: restore b to r1.
● c=a+b: add r0 and r1 and store the result in r3.
● c=a+b: store r3 to memory (c=4 at 1008).
● Terminate main: restore the kernel's bp and lr and quit.
● Terminate main: sp is popped back to the top (1024).
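Putting the trace together, here is a plausible C source for the program being walked through (my reconstruction from the slides; variable names match the diagrams):

```c
/* main's frame holds a, b, c; func1's frame holds z plus the saved
 * bp and the return address (lr). */
int func1(void) {
    int z = 1;     /* store 1 in r0, push to func1's memory */
    z = z + 1;     /* add 1 to r0, store to bp - 4          */
    return z;
}

int main(void) {
    int a = 1;     /* stored at 1016 in the trace */
    int b = 3;     /* stored at 1012              */
    int c;
    func1();       /* lr = 394; jump to 374       */
    c = a + b;     /* restore a and b; c = 4 at 1008 */
    return c;
}
```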
Stack overflow
Data Section
Fixed size; global and static variables; shared by all functions
● Stores static and global variables
● Referenced directly by memory address
● Has read-only and read-write parts
● All functions can access it
● Fixed size, like the code section
[Diagram: memory layout with the data (static) section between the stack and the text/code section]
Example
(Diagrams show the stack at 1024, the code, and a data section with A=10 at 708 and B=20 at 704.)
● Start a process: registers are initialized, sp = bp = 1024.
● Load A (10) into r0 and B (20) into r1, directly by their data-section addresses.
● We sum r0 and r1, store the result (30) in r2, and store that in the sum local variable, which is at the sp - 4 offset.
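As a minimal C sketch of the same idea (names taken from the diagram), A and B live in the data section while sum lives on main's stack:

```c
#include <stdio.h>

int A = 10;  /* global: data section, fixed address, visible to all functions */
int B = 20;

int main(void) {
    int sum = A + B;  /* load A and B into registers, add, store to the stack */
    printf("%d\n", sum);
    return 0;
}
```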
Summary
Heap
Large, dynamic place for memory allocations
Heap section
● Stores large data
● Remains until explicitly removed
● Dynamic, grows low to high
● malloc, free, new
[Diagram: memory layout with the heap (around 0x1000005) between the data section and the stack]
Pointers
[Diagram: a stack variable int *a holding the heap address 0x333333]
Example
(Diagrams show main's stack growing down from 2048, my code at 640-704, _free at 800, _malloc at 900, and the heap page at 1024.)
● Start a process: main's frame holds bp, lr, and ptr (not yet assigned); sp = bp = 2048.
● We call malloc; it saves main's bp and lr so we know where to return.
● Calling malloc: malloc finds a free heap address (1024) and returns it in r0.
● Malloc exits: pc is set to lr, and bp and sp are set back to main's; we store the returned address in the main stack where ptr is (ptr=1024).
● Store value to heap: we write through the pointer (*ptr=10 at address 1024), then increment it (*ptr=11).
● Free: we load the pointer into r0, which becomes the first parameter, call free, and update the pc; the address is freed.
● Main returns.
[Further slides: double-free, performance (Go, Java), and the program break.]
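A minimal C sketch of the example above (the heap address 1024 is just the slide's illustration):

```c
#include <stdlib.h>

/* ptr is a local on main's stack; the 4 bytes it points to live on the heap. */
int main(void) {
    int *ptr = malloc(sizeof(int));  /* heap allocation, e.g. address 1024 */
    if (ptr == NULL) return 1;
    *ptr = 10;        /* store value to heap */
    *ptr = *ptr + 1;  /* *ptr is now 11      */
    free(ptr);        /* explicit deallocation; using ptr after this is a bug */
    return 0;
}
```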
Summary
Memory Management
What is memory and how the OS manages it
What is memory?
Random access memory
● Stores data
● Volatile
○ RAM - Random access memory
● Non-Volatile
○ ROM - Read only memory
Static RAM
● SRAM
● Complex, expensive but fast
○ 1 bit -> 1 flip flop -> 6 transistors
● Flip-flops, constant power
● Used in CPU caches, SSDs
● Access is always fast
Dynamic RAM
● DRAM
● Abundant, cheaper, but slower
○ 1 bit -> 1 capacitor, 1 transistor
● Capacitors, lose their state
● need to be refreshed
● Access is slow
○ Read -> lose the charge
○ Store it
○ Write it back
Asynchronous DRAM
[Timing diagram: a read request is received, then the response arrives when the DRAM is ready]
Synchronous DRAM
● SDRAM
● Synchronized RAM and CPU clocks
● No wasted cycles
[Timing diagram: a read request is received, then responses (1/n), (2/n), (3/n) arrive on successive clock cycles]
Double Data Rate
● DDR SDRAM
● Two transfers per cycle
● Up and down (on the rising and falling clock edges)
[Timing diagram: a read request is received, then responses (1/n), (2/n), (3/n) arrive on both edges of each cycle]
DDR4 SDRAM
The prefetch buffer is how many bits I can transfer per memory access.
DDR5 SDRAM
● Two channels, Channel A and Channel B (32-pin IO bus each)
The 64-bit physical memory address is encoded with these parameters so the CPU knows which RAM to access; this is called the DRAM address.
Opening a row
[Diagram: a DRAM bank with an inactive row and a column select]
More on this: consider looking up sense amplifiers, which are shared in each bank. The opened row is cached in the amplifiers, so to open another row we need to close it and flush the cache back to the row. When a write happens, the write goes to the shared amplifiers first as well.
Summary
● RAM
● DRAM
● SDRAM
● Prefetch buffer IO
● DDR4, DDR5
Read from Memory
[Diagram: the process's stack, code, and heap in RAM; pc points at address 640]
The kernel wants to read the instruction at the pc, so it asks the CPU to load the instruction at address 640.
Read from Memory
[Diagram: the read of address 640 returns 64 bytes from RAM in 50-100 ns]
We get back not just one instruction but 64 bytes worth! Remember the burst length; the data is then cached in the CPU L caches. This takes around 100 ns.
Write to Memory
[Diagram: the process writes through ptr, changing the heap value at 1024 from *ptr=11 to *ptr=22]
Note about alignment
● 1, 4 or 8 bytes
● Certain sizes are placed at specific addresses
● E.g. 4-byte values are placed at addresses divisible by 4
In the first figure, the structure allocates a one-byte char (which can go anywhere), but next we need a double, and that must go at an address divisible by 8; the next one available is address 8, leaving bytes 1-7 empty. That is followed by a 2-byte short, which we can put at address 16, and then the 4-byte int at address 20.
However, in the second figure we allocate less memory for the structure: we define the 1-byte char at offset 0, then the short at the 2-byte-aligned address 2, then the 4-byte int right after at address 4 (divisible by 4), and then we allocate the double at address 8.
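A quick C sketch of the two layouts (typical results on a 64-bit system; exact padding is compiler-dependent):

```c
#include <stdio.h>

struct padded {          /* member order from the first figure       */
    char   c;            /* offset 0                                 */
    double d;            /* must be 8-aligned -> offset 8 (1-7 pad)  */
    short  s;            /* offset 16                                */
    int    i;            /* offset 20; sizeof is typically 24        */
};

struct reordered {       /* second figure: reordered to waste less   */
    char   c;            /* offset 0                                 */
    short  s;            /* offset 2                                 */
    int    i;            /* offset 4                                 */
    double d;            /* offset 8; sizeof is typically 16         */
};

int main(void) {
    printf("%zu %zu\n", sizeof(struct padded), sizeof(struct reordered));
    return 0;
}
```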
Summary
Virtual memory
Layer on top of physical memory
Agenda
● Fragmentation
● Shared Memory
● Isolation
● Large Programs
Fragmentation
● One space
● Memory must be contiguous
[Diagram: physical memory addresses 0 to 1,000,000]
Fragmentation
● Process 1 is loaded
● Large Process 2 is loaded
● Process 3 is loaded
[Diagram: Process 1 near 0, Process 2 from 10,000, Process 3 up to 800,000]
Fragmentation
● Processes 1 and 3 terminate
● Large Process 4 wants to load
[Diagram: the freed holes around Process 2 are individually too small for Process 4]
External vs Internal Fragmentation
● Memory allocation happens in blocks
● External fragmentation
● Internal fragmentation
External Fragmentation
● We have enough space in total, but it's not contiguous
● Happens often with variable-size blocks
● e.g. segmentation
[Diagram: example of external fragmentation; a large and a small free block, but no space for the purple process]
Internal Fragmentation
● Fixed-allocated blocks
● Internal space is wasted
● OS doesn’t know the space is unused
● Happens with fixed-size blocks, especially large ones
[Diagram: example of internal fragmentation]
Virtual memory and fragmentation
● Let's use a fixed block size and call it paging
● Each process has a virtual address space
● We map logical pages to physical pages
● The mapping is stored in the process's page table
● Page size is often 4 KB
● Many-to-one
Virtual memory and fragmentation
[Diagram: processes A, B, and C each have contiguous virtual pages (P1-P5); the mapping scatters them across physical memory (A.P1-A.P4, B.P1-B.P3, C.P1-C.P5) with no external fragmentation]
Page Tables
● Another layer requires translation
● Map virtual address to physical
● Page table has this mapping
● Each process has its own page table
● Page table is stored in memory
Page tables
[Diagram: per-process page tables map virtual addresses (VMA) to physical addresses (PA), e.g. A maps 4000→28,000, 8000→40,000, 12,000→44,000, 16,000→48,000]
Actual page tables store page numbers for more compact storage; the CPU stores the pointer to the current process's page table in the PTBR (page table base register).
Shared Memory
Shared memory
● Sharing memory is a challenge
● Spin up 5 processes of the same program
[Diagram: each process carries its own copy of the code in physical memory]
Shared memory and Virtual Memory
● With virtual memory, we load the code once and map all the virtual pages to the same physical address!
[Diagram: process A's and process B's virtual pages map to the same physical page (A.P2, B.P1)]
Shared Libraries
● Most processes use libraries
● OS loads the library code once
● Map the virtual page to the library physical code
● Libc
● /proc/[pid]/maps
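You can see these shared mappings yourself; here is a small sketch (Linux-specific) that prints the process's own memory map, where libc appears mapped into the address space:

```c
#include <stdio.h>

/* Dump this process's virtual memory map from /proc/self/maps. */
int main(void) {
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) return 1;
    char line[512];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);   /* includes shared library mappings like libc */
    fclose(f);
    return 0;
}
```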
Use cases for shared memory
● Multi-processes / multi-threading
● Databases shared buffers
● NGINX/Proxies
● forking
● CoW - copy on write
[Diagram: processes A, B, and C with their virtual address spaces]
Isolation
Isolation
● Physical memory addresses tell a process where it is
● A process can attempt to load an address it isn't supposed to
● Virtual memory solves this
● Each process has a full virtual address space
● Most of it isn't mapped
Isolation with Virtual memory
[Diagram: A's, B's, and C's virtual pages map into physical memory; C's P3, P4, and P5 are offloaded to disk]
C's P3, P4 and P5 haven't been used for a while, so the OS keeps those VM entries but frees the physical memory allocation.
Swap
[Diagram: A's and B's pages remain mapped; C's offloaded pages live in the swap file]
The next time process C accesses P3, P4 or P5, the OS will issue a page fault. Remember, the mapping entry still exists, but there is a bit that says this page isn't in physical memory, so the OS loads it from the swap file into memory and updates the new physical address (because we might load it somewhere completely different). This also means that the index into the swap file is stored in the page table, so we know where in the swap file the page lives.
Limitations of Virtual Memory
● Additional layer of translation (CPU can’t read virtual addresses)
● More maintenance (page tables)
● Page faults (kernel mode switch)
● More complex CPU architecture (MMU/TLB)
● TLB cache misses (MySQL 8.x vs 5.x)
DMA
Direct Memory Access
Peripheral reads
● Data from network/disk must pass through the CPU
● Keyboard -> CPU -> RAM
● Network -> CPU -> RAM
● Slow at times with large transfers
DMA
● Allow direct access from network/disk to RAM
● DMA controller initializes the operation
● Start the direct transfer
DMA Controller
Notes about DMA
● Must use physical addresses
● The DMA controller often doesn't have an MMU
● It knows nothing of virtual memory
● Kernel-allocated DMA memory must not be swapped
● IOMMU (an MMU for IO devices)
O_DIRECT
● Very important option in file systems and databases
● Allows bypassing the file system cache
● Direct from disk to user-space (database)
● Uses DMA
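A hedged sketch of O_DIRECT in C (Linux; the file name is hypothetical, and O_DIRECT typically requires block-aligned buffers and sizes):

```c
#define _GNU_SOURCE          /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.db", O_RDONLY | O_DIRECT);   /* bypass the page cache */
    if (fd < 0) return 1;
    void *buf;
    if (posix_memalign(&buf, 512, 4096) != 0) return 1;  /* aligned buffer */
    ssize_t n = read(fd, buf, 4096);   /* DMA straight into our buffer */
    (void)n;
    free(buf);
    close(fd);
    return 0;
}
```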
Pros and Cons
● Efficient transfers
● No VM Management
● Less CPU overhead
● But security concerns and complexity
● Initialization cost
● Can’t be used for interrupts (Keyboard/mouse), CPU is faster
Summary
● Large transfers take time
● Going from peripheral to CPU then memory is slow
● DMA allows direct access
● It has limitations and requires lots of work to get right
CPU Components
Basic components of the CPU
CPU
Basic Components
[Diagram: a processor with two cores; each core has an ALU, CU, registers, MMU, and L1/L2 caches; the cores share an L3 cache and a bus]
[Diagram: four cores with per-core L2 caches, two shared L3 caches, and two main memory DIMMs (DIMM1, DIMM2) connected by buses (DSM)]
CU
● Control Unit
● Fetches instructions
● Decodes instructions
● Executes instructions
MMU
● Memory Management Unit
● Responsible for memory access
● Translates virtual to physical addresses
○ TLB - Translation Lookaside Buffer
○ The TLB must* be flushed on context switch
L Caches
● L1, L2 and L3
● L1 local to core
○ ~1 ns, ~128 KB
● L2 local to core
○ ~5 ns, ~256 KB-2 MB
● L3 shared between all cores
○ ~15 ns, ~64 MB
● Main memory
○ 50-100 ns
L Caches
● Memory reads are cached at all levels
● The L1 cache comes in two types
○ L1D (data) and L1I (instructions)
○ So the CU can fetch data and instructions at the same time
● L2 and L3 are unified
● Cache invalidation challenges
● Some CPUs only have L1 and a shared L2
CPU Architecture
● RISC - Reduced Instruction Set
○ Simple instructions - each does a single task - single cycle
○ Low power, predictable
○ ARM
● CISC - Complex Instruction Set
○ One instruction, lots of tasks, multiple cycles
○ More power, less predictable
○ x86 (Intel/AMD)
CISC vs RISC - Example
● Example: a = a + b, where a and b are integers
● CISC - 1 instruction
○ ADD a, b (a and b here are addresses)
○ Takes two memory locations
○ Reads both, adds, then stores in a
● RISC - 4 instructions
○ LDR r0, a
○ LDR r1, b
○ ADD r0, r1
○ STR a, r0
Clock Speed
● How many cycles per second
● e.g. 3GHz = 3 billion clock cycles per second
● In RISC this could mean 3 billion instructions per second
● Less in CISC
● Remember cost of fetching/decoding (pipelining helps)
Display CPU Info (mac)
● Show L caches
○ sysctl -a | grep cachesize
● Show number of cores
○ sysctl -n hw.physicalcpu
● Show CPU architecture
○ uname -m
Summary
● CPU internals
● Caches
● Architecture
Fetch
(Diagrams show a core's CU, ALU, registers, MMU, and L1i/L2 caches next to RAM; the pc points at address 640.)
● Fetch: we fetch the instruction by asking the MMU; the MMU translates the virtual memory address to a physical one (640), and we check the L caches. The address isn't there, so we go to memory and fetch a whole cache line, 64 bytes (50-100 ns).
● Update L caches: we then update the L caches with the 64 bytes we just received. The L2 cache is a general cache, but L1i is the instruction cache.
● Decode: the CU gets the instruction at 640 and starts decoding it. Remember, instructions are opcodes that need to be converted to control signals understood by the ALU; here sub is converted to a "-".
● Decode: the CU also fetches the values of the operands from the registers (sp here, value 100) and prepares them for the ALU.
● Execute: the ALU computes the subtraction; the next step is to write the new value (88) back to the sp register. The CU takes over to write (the ALU doesn't write or read from/to registers).
● Fetch next instruction: increment the program counter and fetch the next instruction; luckily we have it hot in the L1i cache. Fast!
Teaser Question
● What if you called a function
● Function code is NOT within the 64 bytes cache
● Cache misses!
● Inlining
● MySQL 8.x
Summary
● Instruction life cycle
● Each stage has a component
● Can be overlapped
● RAM access is slower than cache
With Pipelining
[Diagram: instruction 1 moves through Fetch → Decode → Execute → Write while later instructions enter the pipeline behind it]
[Diagrams: a big task split in two; processes 1 and 2 sharing a core]
Hyper threading
● Sharing cores
● Hyper-threading exposes a single core as multiple logical cores
● Dedicated registers (e.g. pc); shared CU/ALU/L caches
[Diagram: one physical core presenting two logical cores/hardware threads]
SIMD
● Single Instruction Multiple Data
● With a single instruction, add multiple values
● Vectors
● Instead of executing 4 instructions, execute 1 instruction on 4 values
● Gaming/DB B-trees
● E.g. ARM Neon

Traditional:
Add a1,b1
Add a2,b2
Add a3,b3
Add a4,b4

SIMD:
Add [a1,a2,a3,a4], [b1,b2,b3,b4]
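As a sketch using ARM Neon intrinsics (Neon is mentioned in the slides; the function and variable names here are mine):

```c
#include <arm_neon.h>  /* ARM Neon SIMD intrinsics */

/* Add four int pairs with one SIMD instruction instead of four scalar adds. */
void add4(const int32_t *a, const int32_t *b, int32_t *out) {
    int32x4_t va = vld1q_s32(a);       /* load [a1,a2,a3,a4] into a vector */
    int32x4_t vb = vld1q_s32(b);       /* load [b1,b2,b3,b4]               */
    int32x4_t vc = vaddq_s32(va, vb);  /* one instruction, four adds       */
    vst1q_s32(out, vc);                /* store the four results           */
}
```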
Summary
● CPU is mostly idle
● Need to keep it busy
● Pipelining
● Parallelism
● Hyperthreading
● SIMD
Process Management
A dive in processes
Process vs Thread
Understanding the difference
Process
● Has dedicated code, stack, heap, and data sections
● Has context in the CPU (pc, lr, etc.)
● Described by a Process Control Block
[Diagram: a process's memory layout: stack, heap, data (static), text/code]
Process Control Block (PCB)
● Kernel needs metadata about the process
● PCB Contains
○ PID, Process State, Program counter, registers
○ Process Control info (running/stopped, priority)
○ Page Table (Virtual memory to physical mapping)
○ Accounting (CPU/memory usage)
○ Memory management info (Pointer to code/stack etc.)
○ IO info (File descriptors)
○ IPC info, semaphores, mutexes, shared memory, messages.
Process table (kernel)
● Kernel needs to manage processes
● A mapping table from PID to PCB
● The Process Table

Kernel Process Table:
PID | PCB
100 | 0x222222
200 | 0x333333

[Diagram: each PCB holds the process's state (e.g. PCB 100: 100, running, 0x555555, pc...) and points into its memory layout]
Thread
● A thread is a lightweight process
[Diagram: one process with the main process stack plus a stack per thread, all sharing the data (static) and text/code sections]
Thread Control Block (TCB)
● Kernel needs metadata about the thread
● TCB Contains
○ TID, Thread State, Program counter, registers
○ Process Control info (running/stopped, priority)
○ Accounting (CPU/memory usage)
○ Memory management info (Pointer to stack etc.)
○ Pointer to parent PCB
Thread table (kernel)
● Kernel needs to manage threads
● A mapping table from TID to TCB
● The Thread Table
● Quick lookup
● Lives in kernel space

Kernel Thread Table:
PID | TID | TCB
100 | 1001 | 0x222222
100 | 1002 | 0x333333

[Diagram: each TCB holds the thread's state (e.g. TCB 1001: running, 0x555555, pc1...) and points to its stack within the shared process memory]
Shared Memory
● Multiple processes/threads can share memory
● mmap
● Virtual memory different, physical memory same
● Shared Buffers in databases
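A hedged sketch of shared memory with mmap (an anonymous shared mapping that survives fork; POSIX/Linux):

```c
#define _DEFAULT_SOURCE    /* for MAP_ANONYMOUS on glibc */
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* One physical page, mapped shared: different processes, same memory. */
    int *shared = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) return 1;
    *shared = 0;
    if (fork() == 0) {     /* child */
        *shared = 42;      /* visible to the parent: same physical page */
        _exit(0);
    }
    wait(NULL);            /* parent: *shared now reads 42 */
    munmap(shared, sizeof(int));
    return 0;
}
```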
Postgres Processes
● Postgres uses processes
● Should Postgres move to threads?
● Long running discussions
Fork
● Fork creates a new process
● Child must have new virtual memory
● But OS uses CoW so pages can be shared unless a write happens
● Redis Asynchronous durability
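A small sketch of fork and CoW semantics in C (the printed values are what POSIX guarantees; the page copying happens under the hood):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int a = 1;                  /* shared, read-only, until someone writes  */
    pid_t pid = fork();
    if (pid == 0) {             /* child */
        a = 2;                  /* CoW: kernel copies the page, then writes */
        printf("child a=%d\n", a);
        _exit(0);
    }
    wait(NULL);
    printf("parent a=%d\n", a); /* still 1: the parent's page is untouched */
    return 0;
}
```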
Copy on Write
[Diagram: process A's page table maps P1-P4 (VMA 4000-16,000) to physical pages A.P1-A.P4 (e.g. 4000→28,000, 8000→40,000, 12,000→44,000, 16,000→48,000)]
● B is a fork of A; initially they are identical, and their page tables point to the same physical memory.
● B then changes P1: a new page is allocated for B, A.P1 is copied into it, and B's page table is updated to point to it (e.g. B's 4000→12,000 while A keeps 4000→28,000).
Python CoW bug
● Python bug with None, True, False
● Their refcounts were constantly updated
● Forks were triggering CoW
Summary
● Processes have dedicated VM
● Threads belong to the same process
● Threads share parent process memory
Context Switching
A critical OS function
CPU Process
● CPU doesn’t really know what a process is
● OS loads data into CPU registers (pc, sp, bp, etc.)
● Pointer to Page Table mapping (ptbr)
● Called “Context”
● Executes instructions
Context Switch
● To switch “context” we save current context and load new context
● “Save” the current registers to current process PCB (memory write)
● “Load” the new process PCB to CPU registers (Memory read)
● pc, bp, sp, lr, ptbr and more
(Diagrams show processes 100 and 200, each with code, data, heap, and a PCB storing pc and ptbr values.)
● The CPU is executing process 100's instructions. Lots of registers are in place, but we will only show pc and ptbr for simplicity. ptbr is the page table base register, pointing to the physical memory address of the page table where the mapping is stored; pc is the program counter, which points to the virtual memory address of the current instruction.
● The CPU moves to the next instruction; the pc register is currently 4012, but the PCB's pc value in memory still has the old value.
● The OS initiates a context switch; process 100's state must be saved to its PCB, and the process now moves to the ready state.
● The OS then loads process 200's PCB into the CPU's registers; process 200 is now executing and in the running state. We also need to flush the TLB.
TLB flush
● The TLB caches virtual memory mappings
● Processes CANNOT share VM mappings
● Flushing is slow
● Threads of the same process are faster to switch
○ Same memory and paging, so no flush is needed between them
TLB ASID
● Address space id
● Identify the process in the TLB
● 255 values
● Avoid TLB flushing on context switch
● ARM/Intel
Credit arm.com
When does a context switch happen?
● Scheduling algorithms
● Preemptive multitasking
● IO wait
Preemptive multitasking
● Some processes run for a long time
● OS must switch those out
● Time slice
● Windows 3.1 bug where other processes starve
Scheduling algorithms
● What processes/ threads get what CPU core for how long
● Many papers have been written on this topic
● Complex and hard to find a fair algorithm
● You want to schedule threads of the same process on the same core
Scheduling algorithms
● First come first serve
● Shortest Job First
● Round Robin
Summary
● Context is an OS concept
● The CPU executes instructions only
● Context is saved or loaded (thus switched)
● Threads are more efficient in context switch
Concurrency
Split your program into processes or threads
CPU time is precious
● CPU is a commodity
● Need to keep it busy
● Can one task be split into concurrent tasks?
CPU Bound vs IO Bound Workload
● A CPU-bound workload mostly uses the CPU
○ Encryption, compression, DB planning, sorting, protocol parsing (HTTP/2, QUIC)
○ We want to limit starvation of those
● An IO-bound workload mostly does IO
○ Database queries, network connection writes/reads, file reads/writes
Multi-threaded vs Multi-process
● Multi-process
○ Spin up multiple processes
○ Isolated
○ e.g. NGINX, Postgres
● Multi-threaded
○ A parent process spins up multiple threads
○ Threads share memory with the parent
○ E.g. MySQL, libuv
Concurrency issues
● With concurrency comes challenges
● Two threads touching the same variable (thread safety)
● Two processes writing to the same shared memory
[Diagram: a=1; T1 and T2 both run a++. Each reads a=1, computes a=a+1, and stores a=2, so the final value is 2 and one increment is lost.]
Mutexes
● A mutex is a binary lock (mutual exclusion)
● Thread 1 acquires the mutex and succeeds
● Thread 2 tries to acquire the same mutex and waits (blocks)
● Thread 1 does the change and releases
● Thread 2 immediately unblocks, gets the mutex, and makes its change
[Diagram: with the mutex, T1's a++ takes a from 1 to 2 and T2's takes it from 2 to 3]
Even if both threads attempt to obtain the mutex at the same time, the OS will serialize them; the second one to come in will be blocked (it's atomic), or one will be chosen by the OS. Here thread 1 wins. Both T1 and T2 are executing on different cores (T2 will probably get switched out, making room for other threads).
● Thread 1 reads a and increments it (the CPU executes the instructions); a is now 2 in T1's core register.
● Thread 1 writes a=2 back to memory.
● Thread 1 releases the mutex; immediately T2 gets it (the OS context-switches T2 back in).
● Thread 2 resumes, gets the latest version of a, which is 2, increments it to 3, writes it back to memory, and releases the mutex.
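A minimal pthread sketch of this exact scenario (my code, not the course's):

```c
#include <pthread.h>
#include <stdio.h>

int a = 1;
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    pthread_mutex_lock(&m);    /* the second thread blocks here          */
    a = a + 1;                 /* read, add, store: serialized by the lock */
    pthread_mutex_unlock(&m);  /* the owning thread must unlock          */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%d\n", a);       /* always 3 with the mutex */
    return 0;
}
```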
Mutexes Gotchas
● Mutex has ownership
● The thread that locks Mutex MUST unlock
● If a thread terminates the mutex can remain locked
● Can cause deadlock
Semaphores
● Semaphores can be used for mutual exclusion
● Signal increments, Wait decrements (atomically)
● Wait blocks when semaphore = 0
● Any thread with access to the semaphore can signal/wait
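And a binary-semaphore version of the same increment (POSIX semaphores; sem_init is the Linux-style API, some platforms prefer named semaphores):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

sem_t s;
int a = 1;

void *increment(void *arg) {
    sem_wait(&s);   /* decrement; blocks while the semaphore is 0 */
    a = a + 1;
    sem_post(&s);   /* increment; any thread may signal           */
    return NULL;
}

int main(void) {
    sem_init(&s, 0, 1);   /* initial value 1 -> binary semaphore */
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%d\n", a);
    return 0;
}
```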
Summary
● Concurrent programming improves performance
● Can use processes or threads
● Challenging as need to deal with race conditions
● Using Mutex/Semaphores help
husseinnasser
Storage Management
The internals of disk storage
husseinnasser
Persistent Storage
Why Persist?
Persistence
● RAM is volatile, if power goes we lose data
● Need to persist data for certain use cases
● Magnetic tape, HDD, SSD, Flash
HDD
● HDD consists of platters, heads, tracks and sectors
● Traditionally the OS addressed using CHS method
○ Cylinder/ Head / Sector
● Geometrical Sector vs Disk sector
● The OS “knows” the physical layout of the HDD
[Diagram: platters with heads; a cylinder spans the same track across platters; a geometrical sector (pie slice) vs a disk sector (one piece of a track)]
[Diagram: LBAs 1-8 map onto the physical pages A-H of an SSD block]
LBA 1 maps to page A, 7 to G, and so on.
SSD Write
● We want to write to LBA 1
● We find a free physical block and map it
We write to LBA 1, which maps to PBA A.
SSD Write
● Write 4 KiB to LBA 1
● Now we want to update LBA 1 with new content
● There is no in-place update in an SSD; it's a remove and an add
We write to LBA 1, which maps to PBA A.
SSD Write - Update
● LBA 1 now maps to PBA B
● A is marked invalid (it must be erased before reuse)
● We can't erase a single page
We write to LBA 1, which maps to PBA B.
SSD Write - More updates
● We continue writing
● LBA 1 points to D
● F, G and H are used by LBAs 6, 7 and 8
● One more update to LBA 1
SSD Write - More updates
● LBA 1 now points to E
● One more update to LBA 1, but everything is full
● The entire block is now invalid, so we can safely erase it
SSD Write - Garbage collection
● First erase the block
● Then we update LBA 1
[Diagram: erasing the block]
SSD Write Amplification
● Now LBA 1 points to A
● E is invalid
● This is called write amplification
● We wanted to do a single write but ended up doing more
Write amplification occurs when we try to write a page but end up doing more work behind the scenes, slowing the operation.
Wear leveling
● NAND cells have a write limit
○ Write endurance / program-erase cycles
● Cold vs hot pages
○ Some pages are written once and never touched
○ While other pages are updated all the time
● Some pages will die before others
Pages A, B, C and D were written once and never touched; pages E, F, G and H are always getting used. E, F, G and H will wear out before the rest, and that part of the SSD becomes unusable.
Over-provisioning
● The SSD over-provisions an area (OP) for GC/WL
● Cold pages are moved to the OP area
● The cold block is erased
● Hot pages are moved into it
[Diagrams: cold pages A-D move to the OP area; their old block is erased; hot pages E-H are relocated]
Mismatch block to page size
● The LBA doesn't have to match the page size!
● e.g. LBA = 4 KiB and page = 8 KiB
● Requires mapping to addresses within a page
● Causes issues
LBA 1 maps to page A at address 0 and LBA 2 maps to page A at address 2000.
Summary
File Systems
Layer above storage
File System
● An abstraction above persistent storage
● Users like files/directories
● Writing and reading a file translates to blocks
● Promotes caching
● Must allocate a block (1 or more LBAs)
[Diagram: the file system lives in the OS, between users and the drive]
Examples of File Systems
● FAT (FAT16, FAT32)
● NTFS
● APFS (Apple File System)
● EXT4
● XFS
● btrfs
Some terminologies
● PBA - Physical Block Address - Internal to the drive
○ Aka physical sector size
● LBA - Logical Block Address - Exposed to OS
○ aka logical sector size
● File system block size
Example
LBA:  0  1  2  3  4  5  6  7
Next: -  -  3  E  -  -  2  -

File name: test.txt
Start LBA: 6

File test.txt starts at LBA 6. To read the entire file we go to position 6 and find that the next LBA is 2; we go to LBA 2 and find that the next LBA is 3; we go to LBA 3 and find that it is the end of the file, no more chain! So to read the entire file we read LBAs 6, 2 and 3.
32 bit is really 28
● We need to reserve some bits
● End of chain, is dirty, free, other uses
● LBA = 512 bytes (the old standard)
● So we can only address 2^28 * 512 bytes ≈ 128 GB!
● Very low! So we need to play a trick
● Meet clusters
Clustering
● A cluster is a logical grouping of LBAs (aka blocks)
● LBA size = 512 bytes
● E.g. 8 LBAs = 1 cluster (4 KiB)
● E.g. cluster 0 -> LBAs 0-7, cluster 1 -> LBAs 8-15 (first LBA = C * 8)
● We can address much more (~1 TB)
test.txt
Cluster: 0  1  2  3  4  5  6  7
Next:    -  -  -  E  -  -  3  -

File name: test.txt
Start block/cluster: 6

File test.txt starts at cluster 6. To read the entire file we go to position 6 and find that the next cluster is 3; we go to cluster 3 and find that it's the end of the file. So to read the file we need to convert clusters 6 and 3 to LBAs: cluster 6 is LBAs 48-55 and cluster 3 is LBAs 24-31.
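The cluster-to-LBA arithmetic as a small C sketch (8 LBAs per cluster, as in the example):

```c
#include <stdio.h>

#define LBAS_PER_CLUSTER 8   /* cluster C covers LBAs C*8 .. C*8+7 */

int main(void) {
    int chain[] = {6, 3};    /* test.txt's cluster chain from the slide */
    for (int i = 0; i < 2; i++) {
        int first = chain[i] * LBAS_PER_CLUSTER;
        printf("cluster %d -> LBAs %d-%d\n",
               chain[i], first, first + LBAS_PER_CLUSTER - 1);
    }
    return 0;                /* prints 48-55 and 24-31 */
}
```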
Clustering
● The larger the cluster, the more disk we can address
● But it creates more internal fragmentation
● Wasted space that can't be used by others
[Diagram: a partially used cluster; the remainder is waste]
Blocks everywhere
● When formatting a file system you specify a file system block size
● It must not exceed the virtual page size: 1024, 2048 or 4096 bytes
● LBA size = logical sector size
● PBA size = physical sector size (on the device)
● An FS block maps to a collection of LBAs (derived like clusters)
[Diagrams: FS block -> LBA -> PBA mappings, e.g. 4096 -> 512 -> 512, 4096 -> 1024 -> 2048, and 4096 -> 1024 -> 4096]
OS Page Cache
● File system blocks read from disk are cached
● Block number maps to virtual memory page
● FS Block <= OS VM Page (often)
● Reads check the cache first, then disk, and update the cache
● Writes go to the cache first, then disk
● Block number maps to LBAs
OS Page Cache (Cached Read)
● User wants to read block number 3
● OS first checks the page cache
● Block 3 is cached and mapped to a VM page
● The content is copied to the user buffer
[Diagram: page cache (memory) in front of the disk]
OS Page Cache (Missed Read)
● User wants to read block number 7
● OS first checks the page cache
● Block 7 isn't in the page cache, so we go to disk
● Translate block 7 to LBAs, read from disk, and update the cache
● 1 block = 8 LBAs (B * 8 .. B * 8 + 7)
[Diagram: read block 7 (miss) becomes read LBAs 56-63 from disk]
OS Page Cache (Write)
● User wants to write to block 7
● OS writes to the page cache, creating a new entry (buffered)
● LATER the OS flushes the page cache to disk
● OR when fsync() is called
[Diagram: the buffered write of block 7 goes to the page cache, then later to LBAs 56-63 on disk]
Too many fsyncs is not good…
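A small sketch of a buffered write followed by an explicit flush (the file name is hypothetical):

```c
#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.log", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    write(fd, "commit\n", 7);  /* lands in the page cache first          */
    fsync(fd);                 /* force dirty pages to disk; expensive
                                  if called too often                    */
    close(fd);
    return 0;
}
```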
Page cache woes
● Faster reads
○ Two apps can share page-cached files
● Can cause corruption
○ A write goes to the cache, then the OS crashes
○ Torn database pages
File Modes
Partitions
● Each partition can have its own FS
[Diagram: partitions P1 and P3 spanning different LBA ranges (e.g. P1 at LBAs 2-4, P3 at LBAs 13-16)]
Partition Alignment
● A partition starting at an odd LBA
● Say 8 LBAs map to 1 PBA
● FS block size is 8 LBAs
● Partition starts at LBA 2
● So one fs block spans LBAs 2-9
○ Meaning it touches both PBA 1 & PBA 2 (Credit: IBM)
● Overlapping!
● Performance issues
Moving File Systems
● Successful file system migrations
● Allegro moved their Kafka from ext4 to XFS
● The cost of journal commits
● Metadata updates
Summary
● File systems expose files and directories
● The file system itself has structure and storage
● Linked to the OS page cache
● Translation of POSIX read, fs block, to pages
A simple read
What really happens behind a read?
POSIX read example: test.dat (5000 bytes)

Block: 0  1  2  3  4  5  6  7
Next:  -  -  -  E  -  -  3  -

File name: test.dat
Start block: 6
Check the page cache
● Query the OS file system page cache
● Block 3 is cached but 6 is not
● Next we send a command to the disk controller

Page cache:
Block | VM memory
1     | x
2     | y
3     | z
Read from disk controller
● The OS sends a read command to the disk
● Block 6 is LBA 6 (1 block = 1 LBA)
● LBA 6, for length 1
OS updates cache
● OS gets the content and updates its cache (block 6 now maps to page j)
OS returns to user
● OS takes the content from z and j
● z has 4096 bytes, but the user only asked for 1000
● The OS copies the memory to the user buffer
● Returns success / unblocks
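As a sketch, the entire flow above is triggered by one POSIX call:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[5000];                         /* test.dat is 5000 bytes */
    int fd = open("test.dat", O_RDONLY);
    if (fd < 0) return 1;
    ssize_t n = read(fd, buf, sizeof buf);  /* walks blocks 6 then 3,
                                               via the page cache     */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```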
Summary
● So many layers
● Translation of POSIX read, fs block, to pages
● Page cache/delayed
Socket Management
The networking part of the OS
Network Fundamentals
Brief intro on Networking
Client-Server Architecture
A revolution in networking
Client-Server Architecture
● Machines are expensive, applications are complex
● Separate the application into two components
● Expensive workload can be done on the server
● Clients call servers to perform expensive tasks
● Remote procedure call (RPC) was born
Client-Server Architecture Benefits
● Servers have beefy hardware
● Clients have commodity hardware
● Clients can still perform lightweight tasks
● Clients no longer require dependencies
● However, we need a communication model
OSI Model
Open Systems Interconnection model
[Diagram: the OSI layers (Application, Presentation, Session, Transport, Network, Data Link, Physical) stacked on both client and server]
TCP/IP Model
● Much simpler than OSI just 4 layers
● Application (Layer 5, 6 and 7)
● Transport (Layer 4)
● Internet (Layer 3)
● Data link (Layer 2)
● Physical layer is not officially covered in the model
[Diagrams: hosts A and B with MAC addresses 00:00:5e:00:53:aa and 00:00:3a:12:31:0b on a shared medium; two networks N1 (192.168.1.0/24) and N2 (192.168.2.0/24) where host 192.168.1.3 wants to talk to 192.168.2.2]
[Diagram: an IP address, e.g. 1.2.3.4]
IP Address
● Layer 3 property
● Can be set automatically or statically
● Network and Host portion
● 4 bytes in IPv4 - 32 bits
Network vs Host
● a.b.c.d/x (a, b, c, d are integers); x is the number of network bits and the rest are host bits
● Example: 192.168.254.0/24
● The first 24 bits (3 bytes) are the network; the remaining 8 are for hosts
● This means we can have 2^24 (16,777,216) networks, and each network has 2^8 (256) host addresses
● Also called a subnet
● Also called a subnet
Subnet Mask
● 192.168.254.0/24 is also called a subnet
● The subnet has a mask 255.255.255.0
● Subnet mask is used to determine whether an IP is in the same subnet
Default Gateway
● Most networks consist of hosts and a default gateway
● Host A can talk to B directly if both are in the same subnet
● Otherwise A sends it to someone who might know, the gateway
● The Gateway has an IP Address and each host should know its gateway
E.g. Host 192.168.1.3 wants to talk to 192.168.1.2
● 255.255.255.0 & 192.168.1.3 = 192.168.1.0
● 255.255.255.0 & 192.168.1.2 = 192.168.1.0
● Same subnet! No need to route
E.g. Host 192.168.1.3 wants to talk to 192.168.2.2
● 255.255.255.0 & 192.168.1.3 = 192.168.1.0
● 255.255.255.0 & 192.168.2.2 = 192.168.2.0
● Not the same subnet! The packet is sent to the default gateway 192.168.1.100
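The same check as a C sketch (inet_pton parses the dotted strings; the network portions are compared after masking):

```c
#include <arpa/inet.h>
#include <stdio.h>

/* AND each IP with the mask and compare the network portions. */
int same_subnet(const char *a, const char *b, const char *mask) {
    struct in_addr ia, ib, im;
    inet_pton(AF_INET, a, &ia);
    inet_pton(AF_INET, b, &ib);
    inet_pton(AF_INET, mask, &im);
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}

int main(void) {
    printf("%d\n", same_subnet("192.168.1.3", "192.168.1.2",
                               "255.255.255.0"));  /* 1: no routing needed */
    printf("%d\n", same_subnet("192.168.1.3", "192.168.2.2",
                               "255.255.255.0"));  /* 0: send to gateway   */
    return 0;
}
```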
Summary
● IP Address
● Network vs Host
● Subnet and subnet mask
● Default Gateway
UDP
User Datagram Protocol
UDP
● Stands for User Datagram Protocol
● Layer 4 protocol
● Ability to address processes in a host using ports
● Simple protocol to send and receive data
● Prior communication not required (a double-edged sword)
● Stateless: no knowledge is stored on the host
● 8-byte header per datagram
[Diagram: hosts A (10.0.0.1) and B (10.0.0.2) exchanging datagrams]
Summary
● UDP is a simple layer 4 protocol
● Uses ports to address processes
● Stateless
UDP Pros
● Simple protocol
● Header size is small so datagrams are small
● Uses less bandwidth
● Stateless
● Consumes less memory (no state stored in the server/client)
● Low latency - no handshake, ordering, retransmission or guaranteed delivery
UDP Cons
● No acknowledgement
● No guaranteed delivery
● Connection-less - anyone can send data without prior knowledge
● No flow control
● No congestion control
● No ordered packets
● Security - can be easily spoofed
TCP
Transmission Control Protocol
TCP
● Stands for Transmission Control Protocol
● Layer 4 protocol
● Ability to address processes in a host using ports
● “Controls” the transmission unlike UDP which is a firehose
● Connection
● Requires handshake
● 20-byte header per segment (can go up to 60)
● Stateful
TCP Connection
● Connection is a Layer 5 (session)
● Connection is an agreement between client and server
● Must create a connection to send data
● Connection is identified by 4 properties
○ SourceIP-SourcePort
○ DestinationIP-DestinationPort
TCP Connection
● Can’t send data outside of a connection
● Sometimes called socket or file descriptor
● Requires a 3-way TCP handshake
● Segments are sequenced and ordered
● Segments are acknowledged
● Lost segments are retransmitted
Connection Establishment
● App1 on 10.0.0.1 wants to send data to AppX on 10.0.0.2
● App1 sends SYN to AppX to synchronize sequence numbers
● AppX sends SYN/ACK to synchronize its sequence number
● App1 ACKs AppX's SYN
● Three way handshake
Sending data
● App1 sends data to AppX
● App1 encapsulates the data in a segment and sends it
● AppX acknowledges the segment
● Hint: can App1 send a new segment before the ack of the old segment arrives?
Acknowledgment
● App1 sends segments 1, 2 and 3 to AppX
● AppX acknowledges all of them with a single ACK 3
[Diagram: 10.0.0.2:22 sends ACK 3 back to 10.0.0.1:5555; each side tracks the connection through a file descriptor]
Lost data
● App1 sends segments 1, 2 and 3 to AppX
● Seg 3 is lost, so AppX acknowledges up to 2
● App1 resends Seg 3
Closing Connection
● App1 wants to close the connection
● App1 sends FIN, AppX ACKs
● AppX sends FIN, App1 ACKs
● Four way handshake
[Diagram: 10.0.0.1:5555 sends FIN to 10.0.0.2:22; apps on each host are addressed by port (AppX 22, AppY 443, AppZ 80, App1 5555, App2 7712, App3 2222)]
Summary
● Stands for Transmission Control Protocol
● Layer 4 protocol
● “Controls” the transmission unlike UDP which is a firehose
● Introduces Connection concept
● Retransmission, acknowledgement, guaranteed delivery
● Stateful, connection has a state
Socket
Connection, receive and send queues
● Connection establishment: an incoming SYN goes to the socket's SYN queue; after the SYN/ACK and the final ACK, the connection moves to the accept queue, and accept() copies it to the process (PID).
● Problems with accepting connections
● Receive: arriving data is ACKed into the connection's receive buffer; read() copies it to the process.
● Send buffers: send() copies data into the send buffer; the kernel transmits it and handles the ACK.
● Problems with reading and sending
Socket Programming
Patterns
Common socket patterns
Single Listener/Single Worker Thread
Single Listener/Multiple Worker threads
Single Listener/Multiple Worker threads with load balancing
Multiple Acceptor Threads single Socket
Multiple Listeners on the same port
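A minimal single listener/single worker sketch in C (port 8080 is an arbitrary choice; error handling omitted):

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(s, (struct sockaddr *)&addr, sizeof addr);
    listen(s, 128);                        /* accept-queue backlog        */
    for (;;) {
        int c = accept(s, NULL, NULL);     /* pops from the accept queue  */
        char buf[1024];
        ssize_t n = read(c, buf, sizeof buf);  /* copy from receive buffer */
        if (n > 0) write(c, buf, n);           /* copy into send buffer    */
        close(c);
    }
}
```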
Asynchronous IO
Non blocking reads and writes
Blocking operations
[Diagram: a process reads con1, con2, con3, ... con10 in order; green connections have data in their receive buffers, gray ones are empty]
This is just pseudocode for simplicity: green indicates the connection's receive buffer has data to be read, gray indicates an empty receive buffer (the client hasn't sent anything). In this example the process gets blocked reading connection 3 until some data arrives there, while the other connections are starved.
Asynchronous I/O
● Read blocks when there is no data in the receive buffer
● Accept blocks when there are no connections in the accept queue
● Readiness approach
○ Ask the OS to tell us when a file is ready
○ When it is ready, we call it without blocking
○ select, epoll, kqueue
○ Doesn't work with storage files
● Completion approach
○ Ask the OS (or a worker thread) to do the blocking IO
○ When completed, notify us
○ IOCP, io_uring
Select (polling)
● Select takes a collection of file descriptors for the kernel to monitor
● Select is blocking (with a timeout)
● When any is ready, select returns
● The process checks which one is ready (a loop with FD_ISSET)
● The process calls read/write/accept etc. on the ready file descriptor
[Diagram: select over con1-con10]
In this example, only connection 9 has data in its receive buffer, but the client is monitoring all 10 connections. The select call sends all 10 file descriptors; the kernel checks them one by one (all 10), discovers connection 9 has data, unblocks, and returns.
The client then resumes (its program counter is incremented), loops through all connections to check which one is ready, O(n), finds that connection 9 is ready, and issues the read on it with no blocking.
Select pros and cons
● Pros
○ Avoid reading (unready) resources
○ Async
● Cons
○ Slow: we have to loop through all of them, O(n)
○ Lots of copying between kernel and user space
○ Supports only a fixed number of file descriptors
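A sketch of the select loop in C (conns is an array of already-accepted sockets; this is my illustration, not the course's code):

```c
#include <sys/select.h>
#include <unistd.h>

void serve(int *conns, int n) {
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        int maxfd = -1;
        for (int i = 0; i < n; i++) {       /* copy all fds in, every call */
            FD_SET(conns[i], &readfds);
            if (conns[i] > maxfd) maxfd = conns[i];
        }
        select(maxfd + 1, &readfds, NULL, NULL, NULL); /* block until ready */
        for (int i = 0; i < n; i++) {       /* O(n) scan for readiness     */
            if (FD_ISSET(conns[i], &readfds)) {
                char buf[1024];
                read(conns[i], buf, sizeof buf);  /* won't block now */
            }
        }
    }
}
```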
epoll (eventing)
● Register an interest list of the fds (once) in the kernel
● Kernel keeps working and updates the ready list
● User calls epoll_wait, kernel builds an events array of ready list
[Diagram: an epoll instance lives in the kernel; connections are registered once and incoming packets populate its ready list]
● Again, this is pseudocode: we first create the epoll instance, which lives in kernel memory, then we add the connections/file descriptors to it.
● The process moves on; meanwhile the kernel keeps receiving packets and, as it does so, populates the ready list for the connections with data. In this case we got data for con9 and con4.
● At some point the code calls epoll_wait; if the ready list is empty the wait blocks, otherwise we immediately get back the array of events in user space, only what is ready. The events count is now 2 elements.
● The user calls read (non-blocking) on connection 9 and connection 4, emptying their receive buffers.
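The same flow as real code (a sketch of the Linux epoll API):

```c
#include <sys/epoll.h>
#include <unistd.h>

void serve(int *conns, int n) {
    int ep = epoll_create1(0);              /* epoll instance in the kernel */
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = conns[i] };
        epoll_ctl(ep, EPOLL_CTL_ADD, conns[i], &ev);  /* interest list, once */
    }
    struct epoll_event events[64];
    for (;;) {
        int ready = epoll_wait(ep, events, 64, -1);  /* blocks if none ready */
        for (int i = 0; i < ready; i++) {   /* only ready fds, no O(n) scan */
            char buf[1024];
            read(events[i].data.fd, buf, sizeof buf);
        }
    }
}
```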
epoll drawbacks
● Complex
○ Level-triggered vs edge-triggered
● Linux only
● Too many syscalls
● Doesn’t work on files
io_uring
● Based on completion
● The kernel does the work
● Shared memory; the user puts a "job" in the submission queue
● The kernel does the work and writes the results to the completion queue
[Diagram: the user process and kernel share memory holding a submission queue and a completion queue of read()/write()/accept() jobs]
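A hedged sketch using the liburing helper library (one read submitted and reaped; error handling omitted):

```c
#include <fcntl.h>
#include <liburing.h>   /* liburing, a common helper library for io_uring */

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);            /* shared SQ/CQ rings */
    int fd = open("test.dat", O_RDONLY);
    char buf[4096];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);  /* describe the job */
    io_uring_submit(&ring);                      /* kernel does the work  */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              /* wait for completion   */
    int bytes = cqe->res;                        /* result of the read    */
    (void)bytes;
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```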
More OS Concepts
Virtualization
[Diagram: a hypervisor on the hardware running Full OS1, Full OS2 and Full OS3]
● Can limit CPU/memory for each VM
● E.g. VMware, Oracle VirtualBox
Containerization
● Containers are isolated by namespaces
○ Mount (fs) namespace
○ Network namespace
○ Process namespace
● Limited by cgroups
○ CPU/memory usage
● All containers share the kernel!
● E.g. Docker
[Diagram: containers 1-3 on top of the base OS tools and a shared base kernel on the hardware]
namespace
● A kernel feature for isolation
● Mount namespace -> isolates mounts
● PID namespace -> isolates processes
● NIC namespace -> isolates networks
[Diagram: a container (Con1) on the base OS tools and base kernel]
cgroup
● Control group kernel feature
● Assigns certain cpu and memory to processes
● So that containers can’t starve others
What happens on a new container
● New namespaces created
● It can only see what is mounted
● Usually a directory in host is created
● Container gets 1 mount to that directory
● Can’t see anything else
What happens on a new container
● The OS necessary tools are loaded
● E.g. ubuntu apt-get, ifconfig etc.
● Then user code/processes are loaded
○ E.g. node
● But there is a better way
overlayfs
● If every new container got its own copy of the base OS tools, we would quickly run out of disk
● Meet overlay fs
● The base OS layer is read-only (shared)
● Only the changes are maintained per container
network
● A new container gets its own NIC namespace
● It won't see the host's NICs (unless explicitly given)
● Docker often has one network that all containers are added to
● A new container gets a new MAC address and gets DHCPed
processes
● A new container can spin up any number of processes
● A container can only see its own processes
● PID 1 can exist in all containers; they are isolated
● Each process has a global ID that the base kernel sees
● PID namespace
Summary
● Hardware
● Kernel
● Virtual machines
● Containers