
husseinnasser

Fundamentals of
Operating Systems
Build efficient software by understanding how the OS
works
Introduction
● Welcome
● Who is this course for?
● Course Outline
Introduction to OS
Understanding the need for an OS

Why OS?
Why do we need an operating system?
OS
● The Operating System is software
● Resources need to be managed
● Your apps talk to the OS
● Most OSs are general purpose
Scheduling
● Scheduling is critical
● Fair share of resources to all processes
○ CPU/Memory/IO
[Diagram: processes P1 and P2 sharing the CPU]
Why OS
● OS APIs abstract the hardware
● Your app works on any hardware*
● Occasional breaking changes (64-bit/32-bit)
● Can you build your app without an OS?
Understand the OS
● The OS shouldn't be a black box
● Building efficient apps
[Diagram: Your App → OS?]
Summary

● The Operating System manages resources
● Schedules different processes
● Maintains compatibility
● Understanding it makes you a better engineer
System Architecture
The components the operating system operates on
System Architecture
● The system has resources
● The kernel is the OS's core component
● The kernel manages the resources
● The OS has more tools for usability
CPU

● Central Processing Unit


● Consists of cores
● Each core has a clock speed
● Executes machine level instructions
● Fast Caches
Memory

● Random Access Memory


● Fast but Volatile
● Stores process state and data
● Limited
● Slower than CPU cache
Storage

● Persisted storage
● HDD, SSD
● Slower than memory
Network

● Communicates with other hosts


● NIC (Network Interface Controller)
● Protocol implementations
Kernel

● The core of the OS


● We focus in this course on the kernel's tasks
● Manages the resources
● The OS is more than the kernel
○ GUI, command lines, UI and tooling
○ E.g. top is a tool that lists processes
○ Distros are all about this extra tooling
File System

● Storage is mostly blocks of bytes
● We like to work with files
● The file system is an abstraction
● How files are stored on disk
● btrfs, ext4, FAT32, NTFS, tmpfs
Program vs Process
● Program is the compiled executable
● Process is an instance of the program
● Process is a program in motion
● Program is a process at rest
● Executable file format
[Diagram: postgres.exe spawning postgres pid 1, 2, 3; mysql.bin spawning mysql pid 1, 2]

Process Management
● Kernel manages processes
● Schedules processes on CPUs
● Switches one process for another
● Grants access to resources like storage
User space vs Kernel Space
● User space
○ Browser
○ Postgres
● Protected kernel space
○ Kernel code
○ Device drivers
○ TCP/IP stack
● Isolated (io_uring kind of broke that rule)
[Diagram: process address space from 0x00000000 (user) up through 0xC0000000–0xFFFFFFFF (kernel)]
Device Drivers

● Drivers are software in the kernel
● They manage hardware devices
● NVMe, keyboard, NIC, etc.
● Interrupt Service Routine
System Calls
● To jump from user to kernel mode
● The user makes a system call
● read(), write() — (malloc() itself is a library function that uses system calls like brk/mmap underneath)
● Mode switch
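A minimal sketch (my example, not from the slides) of system calls in action — each read() and write() below crosses from user space into the kernel and back:

```c
#include <unistd.h>
#include <fcntl.h>

int main(void) {
    char buf[64];
    int fd = open("/etc/hostname", O_RDONLY);   // syscall: open
    if (fd < 0) return 1;
    ssize_t n = read(fd, buf, sizeof(buf));     // syscall: read (user -> kernel -> user)
    if (n > 0) write(STDOUT_FILENO, buf, n);    // syscall: write
    close(fd);                                  // syscall: close
    return 0;
}
```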
Summary

● The system has resources
● The kernel is the OS's core component
● The kernel manages the resources
● The OS has more tools for usability
The Anatomy of a Process
What is in a process
Program vs Process
Process is a Program in motion
Program
● Code is compiled and linked for a CPU
● Produces an executable file (program)
● Only works on that CPU architecture
● At rest it follows an executable file format
● Lives on disk
Process
● When a program is run, we get a process
● A process lives in memory
● Uniquely identified with an id (pid)
● Instruction pointer / program counter
● Process Control Block (PCB)
[Diagram: process memory layout from min address to max address — text/code, data+static, heap, stack]

Program vs Process
[Diagram: postgres.exe on disk; three running instances — postgres pid 1 (pc 0x111111), pid 2 (pc 0x222222), pid 3 (pc 0x333333)]
Producing machine code

C code → (Compile) → Assembly → Machine code

Note: machine code is different from assembly. For the sake of illustration I'll use assembly as machine code for readability, as most instructions are 1 to 1.
Program
[Diagram: the executable file (program) on disk]

Process (in memory)
[Diagram: the process memory layout — text (code) from 0xaabbcc00 to 0xaabbcc12, static, heap, and the stack at 0xffeeddcc holding c: 4 — alongside the CPU registers r0=1, r1=3, r2, r3, pc=0xaabbcc00]


Demo
● Spin up a process in Linux
● Use a debugger to attach to the process
● Inspect and look at the program counter
Summary

● A process is a program in motion
● A program has an executable file format
● A process has a counter for the current instruction
● Process state is stored in the PCB
Simple Process Execution
Walking through a simple process, focusing on the program counter
Producing machine code

C code → (Compile) → Assembly → Machine code

Note: machine code is different from assembly. For the sake of illustration I'll use assembly as machine code for readability, as most instructions are 1 to 1.

Program
[Diagram: the executable (program) on disk]
Run the program
[Diagram: memory holding Process 1, Process 2, Process 3 and Your Process]
Costs time!
● Register access - 1 ns
● L1 cache - 2 ns
● L2 cache - 7 ns
● L3 cache - 15 ns
● Main memory - 100 ns
● SSD - 150 µs
● HDD - 10 ms
New Process
[Diagram: memory holding Process 1, 2, 3 and Your Process; the new process's text (code) section spans 0xaabbcc00–0xaabbcc12, with the stack at 0xffeeddcc]

Machine code in these diagrams is read bottom to top — low address to high address.
Start
pc = 0xaabbcc00 — the program counter (instruction pointer) points at the first instruction in the text (code) section.

Load/fetch instruction
The CPU issues a memory read and loads the instruction at the pc into the instruction register (ir).

Execute instruction
The instruction executes: r0 = 1.

Increment PC
pc = 0xaabbcc04. We add 4 bytes — the size of one instruction (depends on the CPU).

Load next instruction
Note that while I'm representing the second instruction fetch as a memory access, it is often retrieved from the CPU L caches. This is because when we fetch data from memory we fetch not only what we need, but much more — what is called a burst. This is the beauty of understanding caches and the power of using nearby code.

Execute
r1 = 3.

Increment
pc = 0xaabbcc08.

Load and execute add
r2 = r0 + r1 = 4.

Increment/Fetch
pc = 0xaabbcc12.

Execute, store to memory
The result (c = 4) is written from the register to the stack at 0xffeeddcc — a memory write.

Writing from the CPU to memory also involves the CPU L caches, which we will discuss further as we progress through the course.
Summary
● Walked through a simple process
● The program counter (pc) is important
● The PC points to the current instruction
● Fetch, load, execute cycle!
The Stack
Introducing the Stack

Stack
● The stack is a brilliant data structure
● A function has local variables
● Each function gets a frame
● Grows from high to low addresses
● Stack space is limited
[Diagram: frames for function1 (int a, b, c at 0xff999999–0xff999991), function2 (int z), function3 (int y)]
Stack Pointer
● Move using pointers
● The stack pointer (sp) is a CPU register
● Allocate, deallocate memory
● Points at the end of the stack
[Diagram: allocate (push) — sp moves from 0xff999999 down to 0xff99998d as main's frame grows to hold int a, int b, int c (12 bytes)]
Base Pointer
● The stack pointer changes
● We need a fixed reference
● Base pointer (bp), aka frame pointer — also a CPU register
● To reference variable a, use bp
● Variable b is at bp - 4
[Diagram: bp at 0xff999999 and sp at 0xff99998d around main's frame (int a, int b, int c)]
Nested calls
● Main calls func1
● Main pauses, func1 executes
● Base and stack pointer change
[Diagram: main's frame (int a, b, c) from 1024 down to 1014; func1's frame (int x, int y) from 1014 down to 1006; bp = 1014, sp = 1006]

From now on I'll switch to decimal memory addresses for readability.

Nested calls
● But what if func1 returns?
● sp = sp + 8
● We lost main's base pointer
● We need to save it
Previous bp
● main calls func1
● The CPU allocates func1's memory: the old bp, x and y need 12 bytes
● sp = sp - 12
● func1's frame stores the old bp (main's bp)
● bp can now be safely changed; bp and sp reference func1
● func1 is active
[Diagram: main's frame from 1024 to 1014; func1's frame from 1014 down to 1002 holding old bp (main bp), int x, int y; bp = 1014, sp = 1002]
Func1 returns
● When func1 is done, use the saved old bp to restore the current bp — that is a memory read
● Set sp back: deallocate, sp = sp + 12
● Main is now active
● But where in main should we go?
Return address
● Main still has work to do
● It called func1 but lost its place
● We need to store a return address, which becomes the pc
● Stored in the link register (lr) in the CPU
[Diagram: func1's frame holds old bp (main bp), main's return address, int x, int y]
Summary
● The stack grows from high to low
● Used for function calls
● Stack variables die quickly (watch out for pointers)
● Works with CPU registers: bp, sp, lr, pc
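A quick way to see the frames from this section in practice (my own minimal sketch, not from the slides): print the addresses of locals across nested calls — on typical platforms where the stack grows from high to low, deeper frames land at lower addresses.

```c
#include <stdio.h>

// Each call gets its own frame; deeper frames sit at lower addresses
// when the stack grows downward.
void func2(void) {
    int y = 3;
    printf("func2 frame: %p\n", (void *)&y);
}

void func1(void) {
    int x = 2;
    printf("func1 frame: %p\n", (void *)&x);
    func2();
}

int main(void) {
    int a = 1;
    printf("main  frame: %p\n", (void *)&a);
    func1();
    return 0;
}
```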
Process Execution with Stack
One function calling another

Example
[Diagram: the code section — main starts around 385, func1 starts at 374]

To make it readable, I made the memory addresses read as sequential integers. In reality each instruction is 4 bytes, which makes each address jump by 4 — so 400, 396, 392 and so on.
Start a process
The program is started, a process is created, and the code is loaded into the code section (addresses 374–400).

Process starts
The process registers are initialized: sp points to the top of the stack (1024) by default, and bp and lr point to where the kernel was before the process executes (0).

Main called
Main is called: pc is set to main's first instruction, which we fetch from memory and execute, expanding the stack by 20 bytes (sp = 1004). Once fetched, we increment the pc so it points to the next instruction.

Save registers
Add one to the pc, fetch the next instruction and execute it: we save the kernel's bp and lr (both 0) to our stack, because we might overwrite them.

Set base pointer
Next instruction: set our base pointer for main (bp = 1024).

a=1
Next instruction: store 1 in r0 — a pure CPU operation. The instruction after that stores r0 to memory (a = 1 at address 1016).

b=3
We execute the next two instructions at once: set 3 in r0 and store it in memory (b = 3 at address 1012).

call func1
We prepare to call func1: first we set the return address to the next instruction in main — lr will be 394 in this case.

Jump to func1
We set the program counter to the first line of func1, which is 374, load the instruction and increment the pointer. The stack expands for func1 (sp = 996).

Store main's bp in func1
We store main's bp, currently in the register (1024), into func1's frame for later.

Set bp for func1
Now that we have the old bp stored, we can use the register for func1 (bp = 1004).

z=1
We store 1 in r0 and push it to func1's memory.

z=z+1
Add 1 to r0 (r0 = 2), then store r0 to z's address — bp - 4, address 1000.

About to exit func1
We load the saved value back into bp, restoring main's original bp (1024).

Jump to main, restore vars
We deallocate func1 and jump back to main by setting pc = lr. We load instruction 394, execute it, and restore a from memory to r0 (r0 = 1) — we lost the register values when we executed func1.

Restore b
Restore b to r1 (r1 = 3).

c=a+b
Add r0 and r1 and store the result in r3 (4). The next instruction stores r3 in memory (c = 4 at address 1008).

Terminate main
Restore the kernel's bp and lr and quit. sp is popped back to the top (1024).
Stack overflow
● Protect against infinite function calls
○ Recursion
○ Large local variables
● The stack has a limit
● The limit can be overridden by the compiler
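A minimal sketch (my own example, not from the slides) of how unbounded recursion exhausts the limited stack space:

```c
#include <stdio.h>

// Each call pushes a new frame (locals + saved bp/lr). With no base
// case, frames keep growing downward until the stack limit is hit and
// the OS kills the process (typically with SIGSEGV).
static long depth = 0;

void recurse(void) {
    char frame[1024];          // large local variables make it fail faster
    frame[0] = 0;
    printf("depth %ld at %p\n", ++depth, (void *)frame);
    recurse();                 // no base case: guaranteed stack overflow
}

int main(void) {
    recurse();
    return 0;
}
```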
Summary
● The kernel does so much work
● The stack plays an important role in function calls
● Add parameters and return values
Data Section
Fixed size; global and static variables; shared by all functions

Data section
● Stores static and global variables
● Referenced directly by memory address
● Read-only and read-write parts
● All functions can access it
● Fixed size, like the code section
[Diagram: memory layout — text/code, data (static), stack]
Example

Start a process
The program is started, a process is created, code is loaded into the code section, and the data section is filled with the global variables (A = 10 at 708, B = 20 at 704). sp = bp = 1024.

Load A
To load global A, we reference the start of the data section + 0, where A is, and put it in r0 (r0 = 10).

Load B
To load global B, we reference the start of the data section - 4, where B is, and put it in r1 (r1 = 20).

Sum and store in local var
We sum r0 and r1, store the result in r2 (30), and store that in the local variable sum, which is at offset sp - 4.
Summary
● The data section stores globals and statics
● Shared between all functions
● Read and write
● Watch out for concurrency

Heap
A large, dynamic place for memory allocations

Heap section
● Stores large data
● Remains until explicitly removed
● All functions can access it
● Dynamic, grows from low to high
● malloc, free, new
[Diagram: memory layout — text/code, data (static), heap growing upward from 0x1000001, stack at the top]

Pointers
● Point to a memory address in the heap
● A pointer can live in the stack, data or heap
● Stores the address of the first byte
● The pointer type helps determine the size
[Diagram: int *a on the stack and char *c in the data section, pointing at heap allocations at 0x333333 and 0x222222]
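A minimal sketch (mine, not the author's) of the heap lifecycle the next example walks through — allocate with malloc, use via the pointer, release with free:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // ptr itself lives in main's stack frame;
    // the 4 bytes it points to live on the heap.
    int *ptr = malloc(sizeof(int));
    if (ptr == NULL) return 1;

    *ptr = 10;        // write through the pointer to the heap
    *ptr += 1;        // heap value is now 11

    printf("*ptr = %d at %p\n", *ptr, (void *)ptr);

    free(ptr);        // release the heap block; ptr is now dangling
    ptr = NULL;       // defensive: avoid dangling/double-free bugs
    return 0;
}
```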
Example

Start a process
The program is started, a process is created, and the code is loaded into the code section; malloc and free are also in the code section (shared memory). mycode starts at 640, _free at 800, _malloc at 900; sp = bp = 2048.

Init pointer
We load and execute instructions 640–656. main's frame now holds the saved bp, lr and ptr (ptr = 0), with sp = 2036.

Calling malloc
We call malloc; it saves main's bp and lr so we know where to return (lr = 668, sp = 2028). Malloc uses r0 as a parameter to allocate 4 bytes on the heap (it's actually more), then returns the address back in r0 (1024).

Malloc returns
Malloc exits: pc is set to lr, and bp and sp are set back to main's. We store the returned address in main's stack where ptr is (ptr = 1024).

Store value to heap
We load 10 into r1 and write it to [r0], the address the pointer holds (*ptr = 10).

Add 1
We add 1 and store r1 in the heap again (*ptr = 11).

Call free
We load the pointer into r0, which becomes the first parameter, call free and update the pc; the address is freed.

Main returns
Free exits, pc is set to lr, and control returns to the kernel.


Memory leaks
● Freeing heap memory is important
● Otherwise we get a memory leak in the heap
● E.g. losing a pointer on the stack when the function returns
● Refcounting and garbage collection help
[Diagram: the stack pointer int *a is gone, but the heap allocation at 0x333333 remains — leaked]
Dangling pointers
● Memory is freed but active pointers to it still exist
● Leads to errors — segfaults
● fun2 frees *b (0x333333) and returns
● fun1 tries to use *a but fails — *a is a dangling pointer
[Diagram: fun1's int *a and fun2's int *b both point at 0x333333, which has been freed]
Double-free
● Memory freed twice
● Leads to a crash
● fun2 frees *b (0x333333) and returns
● fun1 also tries to free *a and crashes the app
[Diagram: both pointers reference the already-freed block at 0x333333]
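A compact sketch (mine, not the author's) showing both bugs at once — the aliasing that creates the dangling pointer, and the second free:

```c
#include <stdlib.h>

int main(void) {
    int *a = malloc(sizeof(int));   // one heap block...
    int *b = a;                     // ...two pointers aliasing it
    *a = 42;

    free(b);                        // block is freed; a and b now dangle

    // BUG 1 (dangling pointer): reading through a is undefined behavior
    // printf("%d\n", *a);

    // BUG 2 (double-free): freeing the same block again corrupts the
    // allocator's bookkeeping and typically aborts the program
    // free(a);

    a = b = NULL;                   // the fix: null out aliases after free
    return 0;
}
```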
Performance
● The stack has built-in memory management
● Stack variables are close together
● Stack space is limited
● Heap access is random
● Cache locality favors the stack
Escape analysis
● The compiler allocates on the stack when possible
● Go, Java
[Diagram: *ptr=11 allocated on the heap vs placed directly in main's stack frame]
Program Break
● Where the process ends
● Points to the top of the heap
● Use brk, sbrk to allocate/deallocate
● Not recommended
[Diagram: the program break moves up past the newly allocated area above *ptr=11]
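A minimal sketch (mine) of moving the program break directly with sbrk — a real but discouraged POSIX call, as the slide notes; normally malloc does this for you:

```c
#define _DEFAULT_SOURCE   // expose sbrk on glibc
#include <stdio.h>
#include <unistd.h>

int main(void) {
    void *before = sbrk(0);          // sbrk(0) returns the current break
    int *p = sbrk(sizeof(int));      // grow the heap by 4 bytes
    if (p == (void *)-1) return 1;

    *p = 11;                         // the new space is usable memory
    void *after = sbrk(0);

    printf("break moved from %p to %p, *p = %d\n", before, after, *p);
    sbrk(-(long)sizeof(int));        // shrink the break back (deallocate)
    return 0;
}
```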
Summary
● The heap stores large data
● Remains until explicitly removed
● All functions can access it
● Dynamic, grows from low to high
● malloc, free, new

Memory Management
What memory is and how the OS manages it

What is memory?
Random access memory
Memory

● Stores data
● Volatile
○ RAM - Random access memory
● Non-Volatile
○ ROM - Read only memory
Static RAM
● SRAM
● Complex, expensive, but fast
○ 1 bit -> 1 flip-flop -> 6 transistors
● Flip-flops, constant power
● Used in CPU caches, SSDs
● Access is always fast
Dynamic RAM
● DRAM
● Abundant, cheaper, but slower
○ 1 bit -> 1 capacitor, 1 transistor
● Capacitors lose their state
● Need to be refreshed
● Access is slow
○ Read -> lose the charge
○ Store it
○ Write it back
Asynchronous DRAM
● Asynchronous is slow (missed cycles)
● Wasted clock signals
● Cycles wasted on refresh
[Diagram: a read followed by a delay before each third of the response is received]
Synchronous DRAM
● SDRAM
● RAM and CPU clocks are synchronized
● No wasted cycles
[Diagram: a read whose response parts (1/n, 2/n, 3/n) arrive on consecutive clock cycles]
Double Data Rate
● DDR SDRAM
● Two transfers per cycle
● On the up and down edges of the clock
[Diagram: responses (1/n), (2/n), (3/n) delivered on both clock edges]
DDR4 SDRAM
● 64 data lines, 64 pins (8 x 8)
● DDR4 prefetch buffer = 8 bits per IO pin
● 64 pins x 8 bits each = 64 bytes
● The CPU often needs 64 bytes minimum
● Called a burst
● Many cycles required for a "read"
[Diagram: DDR4 SDRAM with a 64-pin IO bus — read address X returns 64 bytes]

The prefetch buffer is how many bits can be transferred per memory access on each pin.
DDR5 SDRAM
● Two channels, 32 pins each
● DDR5 prefetch buffer = 16 bits per pin
● 16 bits x 32 pins = 64 bytes per channel
● Double the bandwidth, less contention
[Diagram: channel A and channel B, each a 32-pin IO bus, reading addresses X and Y independently — 64 bytes each]
DRAM Internals
● DIMM
● Bank
● Rows
● Columns
● Cells (1 cell = 1 bit)

The 64-bit physical memory address is encoded with these parameters so the CPU knows which RAM to access — called the DRAM address.
Opening a row
● A bank has many rows (e.g. 32K)
● A row has many columns (e.g. ~1024)
● A column has many cells (e.g. 16 or 32 bits)
● Each cell is a capacitor storing 1 bit (0 or 1)
● Can only have one opened row per bank
● Slow; many banks help
[Diagram: a bank with one opened row among inactive rows and columns]

More on this: consider looking up sense amplifiers, shared in each bank. The opened row is cached in the amplifiers, so to open another row we need to close and flush the cache back to the row. When a write happens, it also goes to the shared amplifiers first.
Summary

● RAM
● DRAM
● SDRAM
● Prefetch buffer io
● DDR4, DDR5
Read/Write from Memory
How the CPU reads and writes
Read from Memory
[Diagram: the process layout in RAM — mycode at 640 (pc), _free at 800, _malloc at 900, *ptr=11 at 1024, and the malloc and main frames up to 2048]

The kernel wants to read the instruction at the pc, so it asks the CPU to load the instruction at address 640.

We get back not just one instruction but 64 bytes' worth — remember the burst length — which is then cached in the CPU L caches. This takes around 50–100 ns.

Write to Memory
We want to set the value 22 on the pointer at address 1024 (4 bytes). The CPU sends "write value 22 to address 1024" to the RAM; we write and check for errors.
Note about alignment
● Data types are aligned
● 1, 4 or 8 bytes
● Certain sizes are placed at specific addresses
● E.g. 4-byte values are placed at addresses divisible by 4
[Diagram: two struct layouts over bytes 0–23]

In the first figure the structure allocated a char — one byte that can go anywhere — but next we need a double, and that must go at an address divisible by 8; the next one available is address 8, leaving bytes 1–7 empty. It is followed by a 2-byte short, which we can put at address 16, then the 4-byte int at address 20.

In the second figure we allocated less memory for the structure: we defined the 1-byte char at address 0, then the short at the 2-byte-aligned address 2, then the 4-byte int right after at address 4 (divisible by 4), then we allocated the double at address 8.
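A small sketch (my own) making the two layouts above concrete — reordering the same four fields changes the padding and the struct size:

```c
#include <stdio.h>

// Field order from the first figure: char, double, short, int.
// The double forces 7 padding bytes after the char.
struct padded {
    char   c;   // offset 0
    double d;   // offset 8 (must be 8-aligned)
    short  s;   // offset 16
    int    i;   // offset 20
};              // sizeof == 24 on typical 64-bit ABIs

// Field order from the second figure: char, short, int, double.
struct compact {
    char   c;   // offset 0
    short  s;   // offset 2
    int    i;   // offset 4
    double d;   // offset 8
};              // sizeof == 16

int main(void) {
    printf("padded:  %zu bytes\n", sizeof(struct padded));
    printf("compact: %zu bytes\n", sizeof(struct compact));
    return 0;
}
```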
Summary

● Memory access takes 50-100ns


● CPU sends read or write request
● We get a burst of sequential data
Virtual memory
A layer on top of physical memory
Agenda

● Limitations of Physical memory


● What is Virtual Memory
● Example
● Limitations of Virtual memory
Limitations of Physical memory

● Fragmentation
● Shared Memory
● Isolation
● Large Programs
Fragmentation

Fragmentation
● One address space
● Memory must be contiguous
[Diagram: physical memory addresses from 0 to 1,000,000]

Fragmentation
● Process 1 is loaded
● Large Process 2 is loaded
● Process 3 is loaded
[Diagram: Process 1 near the bottom, Process 2 from 10,000 to 800,000, Process 3 from 800,000 to 1,000,000]

Fragmentation
● Processes 1 and 3 terminate
● Large Process 4 wants to load
● Process 4 can't load — there is enough total space, but it's fragmented
External vs Internal Fragmentation
● Memory allocation happens in blocks
● External fragmentation
● Internal fragmentation

External Fragmentation
● We have the space, but it's not contiguous
● Happens often with variable-size blocks
● e.g. segmentation
[Diagram: large and small blocks with gaps between them — no contiguous room for a new process]

Internal Fragmentation
● Fixed-size allocated blocks
● Internal space is wasted
● The OS doesn't know the space is unused
● Happens with fixed-size blocks, especially large ones
[Diagram: a fixed-size block that is only partially used]
Virtual memory and fragmentation
● Let us use a fixed block size — call it paging
● Each process has a virtual address space
● We map logical pages to physical pages
● The mapping is stored in the process's page table
● Page size is often 4 KB
● Many-to-1 (see the sketch below)
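A small sketch (mine) of the arithmetic behind paging: with 4 KB pages, a virtual address splits into a page number (looked up in the page table) and an offset (kept as-is). The frame number below is a made-up mapping for illustration:

```c
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE 4096u  // 4 KB pages -> the low 12 bits are the offset

int main(void) {
    uint64_t vaddr = 0x12345;                 // an example virtual address
    uint64_t page  = vaddr / PAGE_SIZE;       // index into the page table
    uint64_t off   = vaddr % PAGE_SIZE;       // byte offset within the page

    // Pretend the page table maps this virtual page to physical frame 7:
    uint64_t frame = 7;
    uint64_t paddr = frame * PAGE_SIZE + off; // resulting physical address

    printf("vaddr 0x%llx -> page %llu, offset 0x%llx -> paddr 0x%llx\n",
           (unsigned long long)vaddr, (unsigned long long)page,
           (unsigned long long)off, (unsigned long long)paddr);
    return 0;
}
```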
Virtual memory and fragmentation
[Diagram: Process A (pages P1–P4), Process B (P1–P3) and Process C (P1–P5) virtual address spaces, each 0–10,000, mapped onto scattered physical pages (A.P1…A.P4, B.P1…B.P3, C.P1…C.P5) in one physical memory]
Page Tables
● Another layer requires translation
● Map virtual address to physical
● Page table has this mapping
● Each process has its own page table
● Page table is stored in memory
Page tables
[Diagram: processes A, B and C with their page tables; e.g. A's table maps VMA 4000 → PA 28,000, 8000 → 40,000, 12,000 → 44,000 and 16,000 → 48,000, placing A.P1–A.P4 among B's and C's pages in physical memory]

Actual page tables store page numbers for more compact storage. The CPU stores the pointer to the current process's page table in the PTBR (page table base register).
Shared Memory

Shared memory
● Sharing memory is a challenge
● Spin up 5 processes of the same program
● All processes have the same code
● Lots of duplicate memory
[Diagram: five processes, each with its own duplicate copy of the code in physical memory]
Shared memory and Virtual Memory
● With virtual memory, we load the code once and map all the virtual pages to the same physical address!
[Diagram: Process A's P2 and Process B's P1 both map to the same physical page (A.P2, B.P1)]
Shared Libraries
● Most processes use libraries
● OS loads the library code once
● Map the virtual page to the library physical code
● Libc
● /proc/[pid]/maps
Use cases for shared memory
● Multi-process / multi-threading
● Database shared buffers
● NGINX / proxies
● Forking
● CoW - copy on write
[Diagram: Process A, B and C virtual address spaces]

Isolation
Isolation
● Physical memory addresses tell a process where it is
● A process can attempt to load an address it isn't supposed to
● Virtual memory solves this
● Each process has a full virtual address space
● Most of it isn't mapped
Isolation with Virtual memory
● Process A's address 1000 is different from Process B's 1000
● They point to different physical addresses
● A process has no way to reference specific physical addresses directly
[Diagram: Process A, B and C virtual address spaces mapped to disjoint physical pages; some of C's pages are offloaded to disk]
Swap (large programs)

Not enough memory
● Physical memory has a limit
● If we load too many processes, we run out of memory
● So we fail to spin up new processes
● But virtual memory helps!
Swap
[Diagram: Process A, B and C virtual pages; C's P3, P4 and P5 are offloaded to disk (swap)]
C's P3, P4 and P5 haven't been used for a while, so the OS keeps those VM entries but frees the physical memory allocation.

[Diagram: a memory read of an address in C's P4 triggers a load from disk]
The next time process C accesses P3, P4 or P5, the OS will issue a page fault. Remember, the mapping entry still exists, but a bit says this page isn't in physical memory, so the OS loads it from the swap file into memory and updates the physical address (because we might load it somewhere completely different). This also means the index into the swap file is stored in the page table, to know where in the swap file the page lives.
Limitations of Virtual Memory
● Additional layer of translation (CPU can’t read virtual addresses)
● More maintenance (page tables)
● Page faults (kernel mode switch)
● More complex CPU architecture (MMU/TLB)
● TLB cache misses (MySQL 8.x vs 5.x)
DMA
Direct Memory Access
Peripherals Read
● Data from network/disk must pass through CPU
● Keyboard -> CPU -> RAM
● Network -> CPU -> RAM
● Slow at times with large transfers
DMA
● Allows direct access from network/disk to RAM
● The DMA controller initializes the operation
● Then starts the direct transfer
[Diagram: the DMA controller sitting between peripherals and RAM]
Notes about DMA
● Must use physical addresses
● The DMA controller often doesn't have an MMU
● It knows nothing of virtual memory
● Kernel-allocated memory used for DMA must not be swapped
● IOMMU (allows IO devices to work with virtual addresses)
O_DIRECT
● Very important option in file systems and databases
● Allows bypassing the file system cache
● Direct from disk to user-space (database)
● Uses DMA
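A minimal sketch (mine) of opening a file with O_DIRECT on Linux — the buffer must be aligned (typically to the logical block size), which is why posix_memalign is used; "datafile" is a placeholder path:

```c
#define _GNU_SOURCE          // O_DIRECT is Linux-specific
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    // Bypass the page cache: data moves disk -> user buffer via DMA.
    int fd = open("datafile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    // O_DIRECT requires alignment; 4096 matches a common block size.
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;

    ssize_t n = read(fd, buf, 4096);   // the length must also be aligned
    printf("read %zd bytes directly from disk\n", n);

    free(buf);
    close(fd);
    return 0;
}
```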
Pros and Cons
● Efficient transfers
● No VM Management
● Less CPU overhead
● But security concerns and complexity
● Initialization cost
● Can’t be used for interrupts (Keyboard/mouse), CPU is faster
Summary
● Large transfers take time
● Going from peripheral to CPU and then to memory is slow
● DMA allows direct access
● It has limitations and requires lots of work to get right
Inside the CPU
How the CPU works

CPU Components
Basic components of the CPU
Basic Components
● ALU (Arithmetic Logic Unit)
● CU (Control Unit)
● MMU (Memory Management Unit)
● Registers
● Caches (L1, L2, L3)
● Bus
[Diagram: a CPU core containing the ALU, CU, registers, MMU and L1/L2 caches]
[Diagram: a processor with two cores; each core has its own ALU, CU, registers, MMU, L1 and L2 caches, and the cores share an L3 cache and a bus to main memory]

[Diagram: cores 1–4 split across two groups, each with its own L2/L3 caches, bus and local main memory (DIMM1, DIMM2), connected via DSM]
ALU
● Arithmetic Logic Unit
● Arithmetic: + - / *
● Logic: XOR / OR / AND
● The core of compute
CU
● Control Unit
● Fetches instructions
● Decodes instructions
● Executes instructions
Registers
● Small, ultrafast units of storage (32 or 64 bits)
● In the CPU core
● Many register types
○ PC, IR, SP, BP
○ General purpose
MMU
● Memory Management Unit
● Responsible for memory access
● Translates virtual to physical addresses
○ TLB — Translation Lookaside Buffer
○ The TLB must* be flushed on context switch
L Caches
● L1, L2 and L3
● L1 local to the core — 1 ns, ~128 KB
● L2 local to the core — 5 ns, ~256 KB–2 MB
● L3 shared between all cores — 15 ns, ~64 MB
● Main memory — 50–100 ns
L Caches
● Memory reads are cached at all levels
● The L1 cache comes in two types
○ L1D (data) and L1I (instructions)
○ The CU can fetch data and instructions at the same time
● L2 and L3 are unified
● Cache invalidation challenges
● Some CPUs only have L1 and L2 (shared)
CPU Architecture
● RISC - Reduced Instruction Set
○ Simple instructions - each single task - single cycle
○ Low power, predictable
○ Arm
● CISC - Complex Instruction Set
○ One instruction, lots of tasks, multiple cycle
○ More power, unpredictable
○ x86 (Intel/AMD)
CISC vs RISC - Example
● Example: a = a + b, where a and b are integers
● CISC - 1 instruction
○ ADD a, b (a and b here are addresses)
○ Takes two memory locations
○ Reads, adds, then stores in a
● RISC - 4 instructions
○ LDR r0, a
○ LDR r1, b
○ ADD r0, r1
○ STR r0, a
Clock Speed
● How many cycles per second
● e.g. 3GHz = 3 billion clock cycles per second
● In RISC could mean 3 billion instructions per second
● Less in CISC
● Remember cost of fetching/decoding (pipelining helps)
Display CPU Info (mac)
● Show L caches
○ sysctl -a | grep cachesize
● Show number of cores
○ sysctl -n hw.physicalcpu
● Show CPU architecture
○ uname -m
Summary
● CPU internals
● Caches
● Architecture
Instruction Life Cycle
Fetch, Decode, Execute, Read, Write
Instruction
● Fetch from memory (MMU)
● Decode (CU)
● Execute (ALU)
● Memory read (optional)
● Write (to register/memory)
Example
[Diagram: RAM layout — mycode at 640 (pc), _free at 800, _malloc at 900, *ptr=11 at 1024, the malloc frame at 2028 and the main frame up to 2048]
Fetch
We fetch the instruction by asking the MMU. The MMU translates the virtual memory address to a physical one (640); we check the L caches, the address isn't there, so we go to memory and fetch a whole cache line — 64 bytes (50–100 ns).

Update L caches
We then update the L caches with the 64 bytes we just received. The L2 cache is a general cache, while L1i is the instruction cache.

Decode
The CU gets the instruction at 640 and starts decoding it. Remember, instructions are opcodes that need to be converted into control signals understood by the ALU; here a sub is converted to a subtraction. The CU also fetches the values of the operands from the registers (sp = 100 here) and prepares them for the ALU.

Execute
The ALU executes the instruction and prepares the result: sp = 100 - 12 = 88.

Write
The next step is to write the new value back to the sp register. The CU takes over to write (the ALU doesn't read or write registers).

Fetch next instruction
Increment the program counter and fetch the next instruction — luckily we have it hot in the L1i cache. Cache hit — fast!
Teaser Question
● What if you called a function
● Function code is NOT within the 64 bytes cache
● Cache misses!
● Inlining
● MySQL 8.x
Summary
● Instruction life cycle
● Each stage has component
● Can be overlapped
● RAM access is slower than cache
Pipelining and Parallelism
Making the CPU efficient
CPU mostly idle
● Notice how parts of the CPU are mostly idle
● Pipelining helps
● While decoding, we can fetch another instruction
● While the ALU is executing, we can decode another instruction

Pipelining
[Diagram: without pipelining, instruction 2's fetch waits for instruction 1's write; with pipelining, the fetch/decode/execute/write stages of consecutive instructions overlap]
Parallelism
● An app can spin up multiple processes/threads
● Each goes onto its own CPU core
[Diagram: a big task split between Process 1 and Process 2, each running on its own core]
Hyper-threading
● Sharing cores
● Hyper-threading exposes a single core as multiple logical cores
● Dedicated registers (e.g. pc); shared CU/ALU/L caches
[Diagram: Process 1 and Process 2 each on a logical core (hardware thread) of one physical core; without hyper-threading, Process 2 is queued]
SIMD
● Single Instruction Multiple Data
● With a single instruction, add multiple values
● Vectors
● Instead of executing 4 instructions, execute 1 instruction on 4 values
● Gaming / DB B-trees
● E.g. ARM Neon

Traditional:
Add a1,b1
Add a2,b2
Add a3,b3
Add a4,b4

SIMD:
Add [a1,a2,a3,a4], [b1,b2,b3,b4]
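A small sketch of the same idea using ARM NEON intrinsics (my example; it assumes an ARM target where <arm_neon.h> is available):

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    int32_t a[4] = {1, 2, 3, 4};
    int32_t b[4] = {10, 20, 30, 40};
    int32_t c[4];

    int32x4_t va = vld1q_s32(a);      // load 4 ints into one vector register
    int32x4_t vb = vld1q_s32(b);
    int32x4_t vc = vaddq_s32(va, vb); // ONE instruction adds all 4 lanes
    vst1q_s32(c, vc);                 // store the 4 results back

    printf("%d %d %d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```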
Summary
● CPU Is mostly idle
● Need to keep it busy
● Pipelining
● Parallelism
● Hyperthreading
● SIMD
CPU Wait Times
CPU cannot wait
CPU time is precious
● The CPU is precious
● The more idle the CPU, the more waste
● IO causes waits

CPU Wait times
● Demo
Process Management
A dive into processes

Process vs Thread
Understanding the difference
Process
● An instance of a program
● Has dedicated code, stack, heap and data sections
● Has context in the CPU (pc, lr, etc.)
● Process Control Block
[Diagram: process memory layout — text/code, data (static), heap, stack]
Process Control Block (PCB)
● Kernel needs metadata about the process
● PCB Contains
○ PID, Process State, Program counter, registers
○ Process Control info (running/stopped, priority)
○ Page Table (Virtual memory to physical mapping)
○ Accounting (CPU/memory usage)
○ Memory management info (Pointer to code/stack etc.)
○ IO info (File descriptors)
○ IPC info, semaphores, mutexes, shared memory, messages.
Process table (kernel)
● The kernel needs to manage processes
● A mapping table from PID to PCB
● The Process Table
● Quick lookup
● In kernel space
[Diagram: kernel process table mapping PID 100 → PCB 0x222222 (100, running, page table 0x555555, pc…) and PID 200 → PCB 0x333333 (200, stopped, 0x9999999, pc…), each PCB pointing at its process's memory]
Thread
● A thread is a lightweight process
● Shares the code, heap, data sections and the PCB
● The stack and pc are per-thread
● Thread stacks live in the same VM
[Diagram: one process with the main process stack plus thread 1 and thread 2 stacks, all sharing the same heap, data (static) and text/code sections]
Thread Control Block (TCB)
● Kernel needs metadata about the thread
● TCB Contains
○ TID, Thread State, Program counter, registers
○ Process Control info (running/stopped, priority)
○ Accounting (CPU/memory usage)
○ Memory management info (Pointer to stack etc.)
○ Pointer to parent PCB
Thread table (kernel)
● The kernel needs to manage threads
● A mapping table from TID to TCB
● The Thread Table
● Quick lookup
● In kernel space
[Diagram: kernel thread table mapping PID 100 / TID 1001 → TCB 0x222222 (running, pc1) and PID 100 / TID 1002 → TCB 0x333333 (stopped, pc2); both TCBs point to parent PCB 100]
Shared Memory
● Multiple processes/threads can share memory
● mmap
● Virtual memory different, physical memory same
● Shared Buffers in databases
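A minimal sketch (mine) of sharing memory across a fork with mmap — both parent and child see the same physical page through their own virtual mappings:

```c
#define _DEFAULT_SOURCE   // MAP_ANONYMOUS on older glibc
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    // MAP_SHARED | MAP_ANONYMOUS: one physical page, visible to both
    // the parent and any forked child.
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED) return 1;
    *counter = 0;

    if (fork() == 0) {        // child
        *counter = 42;        // the write lands in the shared physical page
        _exit(0);
    }
    wait(NULL);               // parent waits, then sees the child's write
    printf("parent sees counter = %d\n", *counter);

    munmap(counter, sizeof(int));
    return 0;
}
```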
Postgres Processes
● Postgres uses processes
● Should Postgres move to threads?
● Long running discussions
Fork
● Fork creates a new process
● Child must have new virtual memory
● But OS uses CoW so pages can be shared unless a write happens
● Redis Asynchronous durability
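A tiny sketch (mine) of fork and CoW behavior: the child starts with the parent's pages; its write triggers a private copy, so the parent's value is untouched:

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int x = 1;                 // lives in a page shared (CoW) after fork

    pid_t pid = fork();        // the child gets a copy-on-write view of memory
    if (pid == 0) {
        x = 99;                // write -> the kernel copies the page for the child
        printf("child:  x = %d\n", x);
        _exit(0);
    }
    wait(NULL);
    printf("parent: x = %d\n", x);   // still 1: the parent's page was never touched
    return 0;
}
```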
Copy on Write
[Diagram: process A's pages P1–P4 mapped through its page table (VMA 4000 → PA 28,000; 8000 → 40,000; 12,000 → 44,000; 16,000 → 48,000) to physical pages A.P1–A.P4]
A is the only running process; B is about to be forked from A.

[Diagram: A and B side by side with identical page tables pointing at the same physical pages]
B is a fork of A. Initially they are identical: their page tables point to the same physical memory.

[Diagram: A writes to an address in page P1]
A makes a change to an address in page P1, so P1 needs to be copied for B.

[Diagram: a new physical page B.P1 is allocated at 12,000, and B's entry for VMA 4000 now points to it]
A new page is allocated for B, A.P1 is copied into it, and the page table is updated to point to it.
Python CoW bug
● A Python bug with the None, True and False singletons
● Their refcounts were constantly updated
● Forks were triggering CoW
Summary
● Processes have dedicated VM
● Threads belong to the same process
● Threads share the parent process's memory
Context Switching
A critical OS function
CPU Process
● CPU doesn’t really know what a process is
● OS loads data into CPU registers (pc, sp, bp, etc.)
● Pointer to Page Table mapping (ptbr)
● Called “Context”
● Executes instructions
Context Switch
● To switch “context” we save current context and load new context
● “Save” the current registers to current process PCB (memory write)
● “Load” the new process PCB to CPU registers (Memory read)
● pc, bp, sp, lr, ptbr and more
[Diagram: CPU registers pc=4008, ptbr=0x555555; Process 100 (running) and Process 200 (ready), their PCBs in the kernel process table, and their page tables at 0x555555 and 0x444444]
The CPU is executing process 100's instructions. Lots of registers are in place, but we will only show pc and ptbr for simplicity. Ptbr is the page table base register, pointing to the physical memory address of the page table where the mapping is stored. Pc is the program counter, which points to the virtual memory address of the current instruction.

[Diagram: the pc register is now 4012, while PCB 100 still records pc=4008]
The CPU moves to the next instruction. The pc register is now 4012, but the pc value in the PCB in memory still has the old value.

[Diagram: PCB 100 updated to pc=4012; process 100 marked ready]
The OS initiates a context switch: process 100's state must be saved to its PCB, and the process moves to the ready state.

[Diagram: CPU registers now pc=8008, ptbr=0x444444; process 200 marked running]
The OS then loads process 200's PCB into the CPU registers. Process 200 is now executing, in the running state. We also need to flush the TLB.
TLB flush
● TLB stores Virtual memory mapping cache
● Processes CANNOT share VM mapping
● Slow
● Threads of same process are faster to switch
○ Same memory, paging
○ As long as threads of the same process
TLB ASID
● Address space id
● Identify the process in the TLB
● 255 values
● Avoid TLB flushing on context switch
● ARM/Intel

Credit arm.com
When does a context switch happen?
● Scheduling algorithms
● Preemptive multitasking
● IO wait
Preemptive multitasking
● Some processes run for a long time
● OS must switch those out
● Time slice
● Windows 3.1 bug where other processes starve
Scheduling algorithms
● What processes/ threads get what CPU core for how long
● Many papers have been written on this topic
● Complex and hard to find a fair algorithm
● You want to schedule threads of same process on the same core
Scheduling algorithms
● First come first serve
● Shortest Job First
● Round Robin
Summary
● Context is an OS concept
● The CPU only executes instructions
● Context is saved or loaded (thus "switched")
● Threads are more efficient to context switch
Concurrency
Split your program into processes or threads

CPU time is precious
● The CPU is a precious commodity
● We need to keep it busy
● Can one task be split into concurrent tasks?
CPU Bound vs IO Bound Workloads
● A CPU-bound workload mostly uses the CPU
○ Encryption, compression, DB planning, sorting, protocol parsing (HTTP/2, QUIC)
○ We want to limit starvation of those
● An IO-bound workload mostly does IO
○ Database queries, network connection writes/reads, file reads/writes
Multi-threaded vs Multi-process
● Multi-process
○ Spin up multiple processes
○ Isolated
○ e.g. NGINX, Postgres
● Multi-threaded
○ A parent process spins up multiple threads
○ They share memory with the parent
○ e.g. MySQL, libuv
[Diagram: processes P1, P2 vs one process P3 containing threads T1 and T2]
Concurrency issues
● With concurrency come challenges
● Two threads touching the same variable (thread safety)
● Two processes writing to the same shared memory

Concurrency issues
● Two threads both increment a global variable a (initially 1)
● T1 reads a = 1, T2 reads a = 1
● T1 increments a in a register, T2 also increments
● T1 stores 2 to memory, T2 also stores 2 (it should be 3)
[Diagram: T1 and T2 both run a++; both read 1 and both write back 2 — one increment is lost]
Mutexes
● A mutex is a binary lock (mutual exclusion)
● Thread 1 acquires the mutex and succeeds
● Thread 2 tries to acquire the same mutex and waits (blocks)
● Thread 1 makes the change and releases the mutex
● Thread 2 immediately unblocks, gets the mutex and makes its change

Even if both threads attempt to obtain the mutex at the same time, the OS will serialize them; the second one to come in will be blocked (it's atomic), or one will be chosen by the OS. Here thread 1 wins. Both T1 and T2 are executing on different cores (T2 will probably get switched out, making room for other threads).

Thread 1 reads a and increments it (the CPU executes the instructions); a is now 2 in T1's core register.
Thread 1 writes a back to the stack/heap memory; a is now 2 in memory.
Thread 1 releases the mutex; immediately T2 gets it (the OS will context switch T2 back in).
Thread 2 resumes, gets the latest version of a, which is 2, increments it to 3, writes it back to memory, and releases the mutex. a is now 3.
Mutexes Gotchas
● Mutex has ownership
● The thread that locks Mutex MUST unlock
● If a thread terminates the mutex can remain locked
● Can cause deadlock
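A minimal sketch (mine, compile with -lpthread) of the counter race above, fixed with a pthread mutex; remove the lock/unlock pair and the final count will usually fall short:

```c
#include <pthread.h>
#include <stdio.h>

static long a = 0;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&m);    // only one thread may increment at a time
        a++;                       // read, add, store — now serialized
        pthread_mutex_unlock(&m);  // the owning thread must unlock
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld\n", a);        // always 2000000 with the mutex
    return 0;
}
```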
Semaphores
● Semaphores can be used for mutual exclusion
● Signal increments, Wait decrements (atomically)
● Wait blocks when semaphore = 0
● Any thread with access to the semaphore can signal/wait
Summary
● Concurrent programming improves performance
● Can use processes or threads
● Challenging, as we need to deal with race conditions
● Mutexes and semaphores help
Storage Management
The internals of disk storage

Persistent Storage
Why persist?

Persistence
● RAM is volatile; if power goes, we lose the data
● We need to persist data for certain use cases
● Magnetic tape, HDD, SSD, Flash
HDD
● An HDD consists of platters, heads, tracks and sectors
● Traditionally the OS addressed it using the CHS method
○ Cylinder / Head / Sector
● Geometrical sector vs disk sector
● The OS "knows" the physical layout of the HDD
[Diagram: platters with cylinders, heads, geometrical sectors and disk sectors — e.g. access disk sector: cylinder yellow, head 2, sector C]

LBA
● Later, the OS exposed a new API: LBA
● Logical Block Addressing
● The entire disk is an array of blocks
● The disk controller does the "translation"
● Additional cost, but disks can be improved independently
[Diagram: LBAs 1–8 over the disk — accessing LBA 3 makes the controller translate to cylinder yellow, head 2, sector C]
SSD
● SSD uses NAND technology
● Physical page (4 KiB, 16 KiB, etc.)
● Physical block (a collection of pages)
● The minimum read/write is a page; erasing happens per block
● Logical blocks map to pages (the Flash Translation Layer)
[Diagram: LBAs 1–8 mapped onto pages A–H across two blocks — LBA 1 maps to page A, 7 to G, and so on]
SSD Write
● We want to write to LBA 1
● We find a free physical page and map it — LBA 1 maps to page A

SSD Write - Update
● Now we want to update LBA 1 with new content
● There is no in-place update in an SSD; it's a remove and an add
● LBA 1 now maps to page B
● A is marked invalid (it must be erased before reuse)
● We can't erase a single page

SSD Write - More updates
● We continue writing; LBA 1 now points to D
● F, G and H are used by LBAs 6, 7 and 8
● One more update: LBA 1 now points to E
● One more update to LBA 1, but everything is full
● The first block is now entirely invalid, so we can safely erase it

SSD Write Garbage collection
● First erase the block
● Then update LBA 1

SSD Write Amplification
● Now LBA 1 points to A, and E is invalid
● This is called write amplification
● We wanted to do a single write but ended up doing more

Write amplification occurs when we try to write a page but end up doing more work behind the scenes, slowing the operation.
Wear leveling
● NAND cells have write limit
○ Write endurance (program/erase cycles)
● Cold vs hot pages
○ Page written once and never touched
○ While other pages are updated all the time
● Some pages will die before others

Pages A, B, C and D were written once and never touched again; pages E, F, G and H are used all the time. E, F, G and H will wear out before the rest, and that part of the SSD becomes unusable.
Over-provisioning
● SSDs over-provision an area (OP) for GC/WL
● Cold pages are moved to OP
● Erase the cold block
● Move hot pages
[Diagram: an OP area alongside LBAs 1-8; cold pages A-D are copied into OP, their block is erased, and hot pages can then be moved in]
Mismatch block to page size
● LBA doesn’t have to match page size!
● e.g. LBA = 4KiB and Page is 8 KiB
● Mapping to addresses in a page
● Causes issues
[Diagram: LBAs 1-8 over pages A-D; LBA 1 maps to page A at address 0 and LBA 2 maps to page A at address 2000]
Summary
● We need persisted storage
● Storage is abstracted as blocks
● The OS accesses logical blocks
husseinnasser

File Systems
Layer above storage
File System
● An abstraction above persistent storage
● Users like files/directories
● Reading and writing a file translates to block operations
● Promotes caching
● Must allocate a block (1 or more LBAs)
[Diagram: the file system layer lives in the OS, between files and disk blocks]
Examples of File Systems
● FAT (FAT16, FAT32)
● NTFS
● APFS (Apple File System)
● EXT4
● XFS
● btrfs
Some terminologies
● PBA - Physical Block Address - internal to the drive
○ Aka physical sector size
● LBA - Logical Block Address - exposed to the OS
○ Aka logical sector size
● File System Block Size
○ Minimum read/write size by the fs
● 1 PBA -> 1 or more LBAs
● 1 FS block = 1 or more LBAs
[Diagram example: 2048-byte fs blocks over 1024-byte LBAs on a 4096-byte PBA; FS block 1 = LBAs 1-2, FS block 2 = LBAs 3-4, all within PBA 1]
FAT32
● File Allocation Table
● The basic idea is an array of 32-bit integers
● The index is the LBA (or Logical Sector traditionally)
● The content is the next LBA, until we reach end.
test.txt

LBA 0 1 2 3 4 5 6 7
- - 3 E - - 2 -

File
Name: test.txt
Start LBA: 6
File test.txt starts at LBA 6. To read the entire file we go to position 6 and find that the next LBA is 2, go to LBA 2 and find that the next LBA is 3, then go to LBA 3 and find that it is the end of the file: no more chain! So to read the entire file we read LBAs 6, 2 and 3.
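A toy sketch of that chain walk (the table contents and END marker are made up to match the slide, not real FAT32 on-disk values):

#include <stdio.h>

#define END 0x0FFFFFFF                 /* hypothetical end-of-chain marker */

int main(void) {
    /* fat[i] holds the next LBA in the chain: fat[6]=2, fat[2]=3, fat[3]=END */
    unsigned fat[8] = {0, 0, 3, END, 0, 0, 2, 0};
    unsigned lba = 6;                  /* start LBA from the file's entry */
    for (;;) {
        printf("read LBA %u\n", lba);  /* prints 6, 2, 3 */
        if (fat[lba] == END) break;
        lba = fat[lba];
    }
    return 0;
}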
32 bit is really 28
● We need to reserve some bits
● End of chain, Is dirty, free, other uses
● LBA = 512 byte (the old standard)
● So we can only address 2^28 * 512 bytes ~ 128 GiB!
● Very low! So we need to play a trick
● Meet clusters
Clustering
● Cluster is a logical grouping of LBAs (aka blocks)
● LBA size = 512 byte
● E.g. 8 LBAs = 1 cluster (4 KiB)
● E.g. cluster 0 -> LBAs 0-7, cluster 1 -> LBAs 8-15 (cluster C starts at LBA C * 8)
● We can address much more (~1 TB)
test.txt
Blocks 0 1 2 3 4 5 6 7
- - - E - - 3 -
LBAs/logical sectors
File
Name: test.txt
Start Block/Cluster: 6
File test.txt starts at cluster 6. To read the entire file we go to position 6, find that the next cluster is 3, go to cluster 3 and find that it's the end of the file. To read the file we then convert clusters 6 and 3 to LBAs: cluster 6 is LBAs 48-55 and cluster 3 is LBAs 24-31.
Clustering
● The larger the cluster, the more disk we can address
● But creates more internal fragmentation
● Wasted space, can’t be used by others.

[Diagram: a file's last cluster is only partially used; the rest of the cluster is wasted space]
Blocks everywhere
● When formatting a file system you specify the file system block size
● Must not exceed the virtual page size: 1024, 2048, 4096 bytes
● LBA size = logical sector size
● PBA size = physical sector size (on the device)
● An fs block maps to a collection of LBAs (derived like clusters)
[Diagram: three configurations of fs block -> LBA -> PBA sizes: 4096 -> 512 -> 512 (1 fs block = 8 LBAs, 1 LBA per PBA), 4096 -> 1024 -> 2048 (4 LBAs, 2 per PBA), 4096 -> 1024 -> 4096 (4 LBAs in 1 PBA)]
OS Page Cache
● File system blocks read from disk are cached
● Block number maps to virtual memory page
● FS Block <= OS VM Page (often)
● Reads check the cache first, then disk, and update the cache
● Writes go to the cache first, then disk
● Block number maps to LBAs
OS Page Cache (Cached Read)
● User wants to read block number 3
● OS first checks the page cache
● Block 3 is cached and mapped to a VM page
● Content is copied to the user buffer
[Diagram: read block 3, a page cache hit; no disk access needed]
OS Page Cache (Missed Read)
● User wants to read block number 7
● OS first checks the page cache
● Block 7 isn’t in the page cache, we go to disk
● Translate block 7 to LBAs and read from disk and update the cache
● 1 block = 8 LBAs (block B maps to LBAs B * 8 through B * 8 + 7)
[Diagram: read block 7, a miss; read LBAs 56-63 from disk, then update the cache]
OS Page Cache (Write)
● User wants to write to Block 7
● OS writes to the page cache creating a new entry
● LATER the OS flushes the page cache to disk
● OR until fsync() is called

[Diagram: a buffered write to block 7 goes to the page cache; LBAs 56-63 are flushed to disk later]
Too many fsyncs are not good…
Page cache woes

● Faster reads
○ Two apps can share page-cached files
● Can cause corruption
○ Write goes to cache then OS crashes
○ Torn database pages
File Modes

● A file must be opened to be used


● Different modes in open
● Some examples
● O_APPEND - append mode
● O_DIRECT - skips page cache
● O_SYNC - write always flushes cache (slow)
Postgres/MySQL and File Modes in WAL
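A sketch of the WAL-style pattern those databases use (file name and record are illustrative; O_DIRECT would additionally require aligned buffers):

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("wal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return 1;
    const char rec[] = "commit\n";
    write(fd, rec, sizeof rec - 1); /* lands in the page cache first */
    fsync(fd);                      /* force the flush to disk: durable but slow */
    close(fd);
    return 0;
}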
Partitions
● Disks are exposed as a big array of LBAs (logical sectors)
● Partitions start at an LBA and end at an LBA
● Provides logical segmentation
● E.g. Partition 1 is LBA 1 - LBA 4
● Each partition can have its own FS
● Each FS can have a different block size (cluster)
[Diagram: one disk's LBAs 1-16 divided into partitions P1, P2 and P3]
Partition Alignment
● A partition starting at an odd LBA
● Say 8 LBAs map to 1 PBA
● FS block size is 8 LBAs
● Partition starts at LBA 2
● One fs block spans LBAs 2-9
○ Which means it straddles PBA 1 & PBA 2 (Credit: IBM)
● Overlapping!
● Performance issues
Moving File Systems
● Successful file system migrations
● Allegro moved their Kafka from ext4 to XFS
● The cost of journal commits
● Metadata updates
Summary
● File systems expose files and directories
● The file system itself has structure and storage
● Linked to the OS page cache
● Translation of a POSIX read to fs blocks to pages
husseinnasser

A simple read
What really happens behind a read?
POSIX
Read example: test.dat (5000 bytes)
● Read the entire file “test.dat”, 5000 bytes
● Logical sector size 4096, physical sector size 4096
● Page size 4096
● File system block 4096
● 1 block = 1 LBA (sector)
● Two blocks:
○ One full block and
○ another block with the remaining 904 bytes
Read file Blocks
● Issue read command on the file
● The OS reads the file system metadata
● Find the start block, follow until end of file
● Blocks 6,3 (Still need LBAs)
test.dat

Block 0 1 2 3 4 5 6 7
File - - - E - - 3 -
Name: test.dat
Start Block: 6
Check the page cache
● Query the OS file system page cache
● Block 3 is cached (VM page z) but 6 is not
● Next we send a command to the disk controller
[Page cache: block 1 -> x, block 2 -> y, block 3 -> z]
Read from disk controller
● The OS sends a read command to the disk
● Block 6 is LBA 6 (1 block = 1 LBA)
● Read LBA 6, for length 1
● The disk controller converts the LBA to a PBA
● Returns the data to the OS
OS updates cache
● The OS gets the content and updates its cache
[Page cache: block 1 -> x, block 2 -> y, block 3 -> z, block 6 -> j]
OS returns to user
● The OS takes the content from pages j and z
● j holds block 6's full 4096 bytes; from z (block 3) the user only needs the remaining 904 bytes
● Copies the memory to the user buffer
● Returns success/unblocks
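From user space, the entire walkthrough above is a single call. A minimal sketch:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[5000];
    int fd = open("test.dat", O_RDONLY);
    if (fd < 0) return 1;
    ssize_t n = read(fd, buf, sizeof buf); /* blocks 6 then 3, behind the scenes */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}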
Summary
● So many layers
● Translation of a POSIX read to fs blocks to pages
● Page cache / delayed writes
husseinnasser

Socket Management
The networking part of the OS
husseinnasser

Network Fundamentals
Brief intro on Networking
Client-Server Architecture
A revolution in networking
Client-Server Architecture
● Machines are expensive, applications are complex
● Separate the application into two components
● Expensive workload can be done on the server
● Clients call servers to perform expensive tasks
● Remote procedure call (RPC) was born
Client-Server Architecture Benefits
● Servers have beefy hardware
● Clients have commodity hardware
● Clients can still perform lightweight tasks
● Clients no longer require dependencies
● However, we need a communication model
husseinnasser

OSI Model
Open Systems Interconnection model
husseinnasser

Why do we need a communication model?


● Agnostic applications
○ Without a standard model, your application must have knowledge of the
underlying network medium
○ Imagine if you had to author a different version of your app so that it
works on wifi vs ethernet vs LTE vs fiber
● Network Equipment Management
○ Without a standard model, upgrading network equipment becomes difficult
● Decoupled Innovation
○ Innovation can happen in each layer separately without affecting the rest of the model
husseinnasser

What is the OSI Model?


● 7 Layers each describe a specific networking component
● Layer 7 - Application - HTTP/FTP/gRPC
● Layer 6 - Presentation - Encoding, Serialization
● Layer 5 - Session - Connection establishment, TLS
● Layer 4 - Transport - UDP/TCP
● Layer 3 - Network - IP
● Layer 2 - Data link - Frames, Mac address Ethernet
● Layer 1 - Physical - Electric signals, fiber or radio waves
husseinnasser

The OSI Layers - an Example (Sender)


● Example sending a POST request to an HTTPS webpage
● Layer 7 - Application
○ POST request with JSON data to HTTPS server
● Layer 6 - Presentation
○ Serialize JSON to flat byte strings
● Layer 5 - Session
○ Request to establish TCP connection/TLS
● Layer 4 - Transport
○ Sends SYN request target port 443
● Layer 3 - Network
○ The SYN is placed in IP packet(s) and the source/dest IPs are added
● Layer 2 - Data link
○ Each packet goes into a single frame and adds the source/dest MAC addresses
● Layer 1 - Physical
○ Each frame becomes a string of bits, which is converted into either a radio signal (wifi), an electric signal (ethernet), or light (fiber)
● Take it with a grain of salt, it's not always cut and dry
husseinnasser

The OSI Layers - an Example (Receiver)


● Receiver computer receives the POST request the other way around
● Layer 1 - Physical
○ Radio, electric or light is received and converted into digital bits
● Layer 2 - Data link
○ The bits from Layer 1 are assembled into frames
● Layer 3 - Network
○ The frames from layer 2 are assembled into IP packets
● Layer 4 - Transport
○ The IP packets from layer 3 are assembled into TCP segments
○ Deals with Congestion control/flow control/retransmission in case of TCP
○ If Segment is SYN we don’t need to go further into more layers as we are still processing the connection request
● Layer 5 - Session
○ The connection session is established or identified
○ We only arrive at this layer when necessary (three way handshake is done)
● Layer 6 - Presentation
○ Deserialize flat byte strings back to JSON for the app to consume
● Layer 7 - Application
○ Application understands the JSON POST request and your express json or apache request receive event is triggered
● Take it with a grain of salt, it's not always cut and dry
Client sends an HTTPS POST request

Client Server

Application Application

Presentation Presentation

Session Session

Transport Segment Transport

Network Packet Network

Data Link Frame Data Link

Physical Physical
[Diagram: on both client and server, each message carries source/dest ports (layer 4), source/dest IPs (layer 3), and source/dest MACs (layer 2)]

Across networks

Client Server

Application Application

Presentation Presentation

Session Session

Transport Router Transport

Network Switch Network Network

Data Link Data Link Data Link Data Link

Physical Physical Physical Physical


Across networks
Layer 7 Load
Client Balancer/CDN Backend
Server

Application Application Application

Presentation Presentation Presentation


Layer 4
Proxy,
Session Firewall Session Session

Transport Transport Transport Transport

Network Network Network Network

Data Link Data Link Data Link Data Link

Physical Physical Physical Physical


husseinnasser

The shortcomings of the OSI Model


● OSI Model has too many layers which can be hard to comprehend
● Hard to argue about which layer does what
● Simpler to deal with Layers 5-6-7 as just one layer, application
● TCP/IP Model does just that
husseinnasser

TCP/IP Model
● Much simpler than OSI just 4 layers
● Application (Layer 5, 6 and 7)
● Transport (Layer 4)
● Internet (Layer 3)
● Data link (Layer 2)
● Physical layer is not officially covered in the model
husseinnasser

OSI Model Summary


● Why do we need a communication model?
● What is the OSI Model?
● Example
● Each device in the network doesn’t have to map the entire 7 layers
● TCP/IP is simpler model
husseinnasser

A B

Host to Host communication


How messages are sent between hosts
husseinnasser

Host to Host communication


● I need to send a message from host A to host B
● Usually a request to do something on host B (RPC)
● Each host network card has a unique Media Access Control address (MAC)
● E.g. 00:00:5e:00:53:af

A B

00:00:5e:00:53:aa 00:00:3a:12:31:0b
husseinnasser

Host to Host communication


● A sends a message to B specifying the MAC address
● Everyone in the network will “get” the message but only B will accept it

husseinnasser

Host to Host communication


● Imagine millions of machines?
● We need a way to eliminate the need to send it to everyone
● The address needs to get better
● We need routability, meet the IP Address
husseinnasser

Host to Host communication


● The IP Address is built in two parts
● One part to identify the network, the other is the host
● We use the network portion to eliminate many networks
● The host part is used to find the host
● Still needs MAC addresses!
Host A on network N1 wants to talk to Host B on network N2

[Diagram: hosts A, B and C on network N1, and hosts A, B and C on network N2]
Host 192.168.1.3 wants to talk to 192.168.2.2

192.168.1.3 192.168.2.3

192.168.1.2 192.168.2.2

192.168.1.1 192.168.2.1

192.168.1.0/24 192.168.2.0/24
husseinnasser

But my host has many apps!


● It's not enough just to address the host
● The host is running many apps, each with different requirements
● Meet ports
● You can send an HTTP request on port 80, a DNS request on port 53 and an
SSH request on port 22 all running on the same server!
husseinnasser

Host to Host communication - Summary


● Hosts need addresses
● MAC Addresses are great but not scalable in the Internet
● Internet Protocol Address solves this by routing
● Layer 4 ports help create finer addressability to the process level
husseinnasser

1.2.3.4

The IP building blocks


Understanding the IP Protocol
husseinnasser

IP Address
● Layer 3 property
● Can be set automatically or statically
● Network and Host portion
● 4 bytes in IPv4 - 32 bits
husseinnasser

Network vs Host
● a.b.c.d/x (a, b, c, d are integers); x is the number of network bits and the rest are host bits
● Example 192.168.254.0/24
● The first 24 bits (3 bytes) are network; the remaining 8 are for hosts
● This means we can have 2^24 (16,777,216) networks and each network has
2^8 (256) host addresses
● Also called a subnet
husseinnasser

Subnet Mask
● 192.168.254.0/24 is also called a subnet
● The subnet has a mask 255.255.255.0
● Subnet mask is used to determine whether an IP is in the same subnet
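The check itself is a bitwise AND. A small sketch using the /24 example:

#include <stdint.h>
#include <stdio.h>

static int same_subnet(uint32_t a, uint32_t b, uint32_t mask) {
    return (a & mask) == (b & mask);   /* compare the network portions */
}

int main(void) {
    uint32_t ip1  = (192u << 24) | (168u << 16) | (1u << 8) | 3; /* 192.168.1.3 */
    uint32_t ip2  = (192u << 24) | (168u << 16) | (1u << 8) | 2; /* 192.168.1.2 */
    uint32_t mask = 0xFFFFFF00;                                  /* 255.255.255.0 */
    printf("%s\n", same_subnet(ip1, ip2, mask)
           ? "same subnet, talk directly"
           : "send to the default gateway");
    return 0;
}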
husseinnasser

Default Gateway
● Most networks consists of hosts and a Default Gateway
● Host A can talk to B directly if both are in the same subnet
● Otherwise A sends it to someone who might know, the gateway
● The Gateway has an IP Address and each host should know its gateway
E.g. Host 192.168.1.3 wants to talk to 192.168.1.2

● 192.168.1.3 applies subnet


mask to itself and the
destination IP 192.168.1.2
● 255.255.255.0 & 192.168.1.3 192.168.2.3

192.168.1.3 =
192.168.1.0
● 255.255.255.0 & 192.168.1.2 192.168.2.2
192.168.1.2 =
192.168.1.0
● Same subnet! No need to route
192.168.1.1 192.168.2.1
192.168.1.0/24 192.168.2.0/24
E.g. Host 192.168.1.3 wants to talk to 192.168.2.2

● 192.168.1.3 applies subnet


mask to itself and the
destination IP 192.168.2.2 192.168.2.100
● 255.255.255.0 & 192.168.1.3 192.168.2.3

192.168.1.3 =
192.168.1.0
● 255.255.255.0 & 192.168.1.2 192.168.1.100 192.168.2.2
192.168.2.2 =
192.168.2.0
● Not the same subnet! The packet is sent to the default gateway 192.168.1.100
192.168.1.1 192.168.2.1
192.168.1.0/24 192.168.2.0/24
husseinnasser

Summary
● IP Address
● Network vs Host
● Subnet and subnet mask
● Default Gateway
husseinnasser

UDP
User Datagram Protocol
husseinnasser

UDP
● Stands for User Datagram Protocol
● Layer 4 protocol
● Ability to address processes in a host using ports
● Simple protocol to send and receive data
● Prior communication not required (a double-edged sword)
● Stateless: no knowledge is stored on the host
● 8-byte datagram header
husseinnasser

UDP Use cases


● Video streaming
● VPN
● DNS
● WebRTC

A B
husseinnasser

Multiplexing and demultiplexing


● IP targets hosts only
● Hosts run many apps, each with different requirements
● Ports identify the “app” or “process”
● The sender multiplexes all its apps into UDP
● The receiver demultiplexes UDP datagrams to each app

App1-port 5555 AppX-port 53


App2-port 7712 AppY-port 68
App3-port 2222 AppZ-port 6978

10.0.0.1 10.0.0.2
husseinnasser

Source and Destination Port


● App1 on 10.0.0.1 sends data to AppX on 10.0.0.2
● Destination Port = 53
● AppX responds back to App1
● We need Source Port so we know how to send back data
● Source Port = 5555

10.0.0.1 5555 53 10.0.0.2 AppX-port 53


App1-port 5555
App2-port 7712 AppY-port 68
App3-port 2222 AppZ-port 6978

10.0.0.1 10.0.0.2 53 5555 10.0.0.1 10.0.0.2


husseinnasser

Summary
● UDP is a simple layer 4 protocol
● Uses ports to address processes
● Stateless
husseinnasser

UDP Pros and Cons


The power and drawbacks of UDP
husseinnasser

UDP Pros
● Simple protocol
● Header size is small so datagrams are small
● Uses less bandwidth
● Stateless
● Consumes less memory (no state stored in the server/client)
● Low latency - no handshake , order, retransmission or guaranteed delivery
husseinnasser

UDP Cons
● No acknowledgement
● No guaranteed delivery
● Connection-less - anyone can send data without prior knowledge
● No flow control
● No congestion control
● No ordered packets
● Security - can be easily spoofed
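A minimal POSIX sketch of how simple UDP is (the address and port are made up): no handshake and no connection state; one sendto() is the whole exchange.

#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);       /* datagram socket */
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(53);                      /* e.g. DNS */
    inet_pton(AF_INET, "10.0.0.2", &dst.sin_addr); /* AppX's host */
    const char msg[] = "hello";
    sendto(fd, msg, sizeof msg - 1, 0, (struct sockaddr *)&dst, sizeof dst);
    close(fd);
    return 0;
}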
husseinnasser

TCP
Transmission Control Protocol
husseinnasser

TCP
● Stands for Transmission Control Protocol
● Layer 4 protocol
● Ability to address processes in a host using ports
● “Controls” the transmission unlike UDP which is a firehose
● Connection
● Requires handshake
● 20-byte segment header (can grow to 60)
● Stateful
husseinnasser

TCP Use cases


● Reliable communication
● Remote shell
● Database connections
● Web communications
● Any bidirectional communication

A B
husseinnasser

TCP Connection
● Connection is a Layer 5 (session)
● Connection is an agreement between client and server
● Must create a connection to send data
● Connection is identified by 4 properties
○ SourceIP-SourcePort
○ DestinationIP-DestinationPort
husseinnasser

TCP Connection
● Can’t send data outside of a connection
● Sometimes called socket or file descriptor
● Requires a 3-way TCP handshake
● Segments are sequenced and ordered
● Segments are acknowledged
● Lost segments are retransmitted
husseinnasser

Multiplexing and demultiplexing


● IP targets hosts only
● Hosts run many apps, each with different requirements
● Ports identify the “app” or “process”
● The sender multiplexes all its apps into TCP connections
● The receiver demultiplexes TCP segments to each app based on connection pairs

App1-port 5555 AppX-port 53


App2-port 7712 AppY-port 68
App3-port 2222 AppZ-port 6978

10.0.0.1 10.0.0.2
husseinnasser

Connection Establishment
● App1 on 10.0.0.1 want to send data to AppX on 10.0.0.2
● App1 sends a SYN to AppX to synchronize sequence numbers
● AppX sends a SYN/ACK to synchronize its own sequence number
● App1 ACKs AppX SYN.
● Three way handshake

10.0.0.1 5555 SYN 22 10.0.0.2


AppX-port 22
App1-port 5555 AppY-port 443
App2-port 7712 AppZ-port 80
App3-port 2222 10.0.0.2 22 SYN/ACK 5555 10.0.0.1

10.0.0.1 10.0.0.2 10.0.0.1:5555:


10.0.0.1:5555: 10.0.0.1 5555 ACK 22 10.0.0.2 10.0.0.2:22
10.0.0.2:22
File descriptor
File descriptor
husseinnasser

Sending data
● App1 sends data to AppX
● App1 encapsulate the data in a segment and send it
● AppX acknowledges the segment
● Hint: Can App1 send new segment before ack of old segment arrives?

10.0.0.1 5555 ls 22 10.0.0.2


AppX-port 22
App1-port 5555 AppY-port 443
App2-port 7712 AppZ-port 80
App3-port 2222 10.0.0.2 22 ACK 5555 10.0.0.1

10.0.0.1 10.0.0.2 10.0.0.1:5555:


10.0.0.1:5555: 10.0.0.2:22
10.0.0.2:22
File descriptor
File descriptor
husseinnasser

Acknowledgment
● App1 sends segment 1,2 and 3 to AppX
● AppX acknowledges all of them with a single ACK 3

10.0.0.1 5555 seq1 22 10.0.0.2


AppX-port 22
App1-port 5555 AppY-port 443
10.0.0.1 5555 seq2 22 10.0.0.2
App2-port 7712 AppZ-port 80
App3-port 2222
10.0.0.1 5555 seq3 22 10.0.0.2
10.0.0.1
10.0.0.2 10.0.0.1:5555:
10.0.0.1:5555: 10.0.0.2:22
10.0.0.2:22

File descriptor
File descriptor
10.0.0.2 22 ACK3 5555 10.0.0.1
husseinnasser

Lost data
● App1 sends segment 1,2 and 3 to AppX
● Seq 3 is lost, so AppX acknowledges only up to 2
● App1 resends seq 3

10.0.0.1 5555 seq1 22 10.0.0.2


AppX-port 22
App1-port 5555 AppY-port 443
10.0.0.1 5555 seq2 22 10.0.0.2
App2-port 7712 AppZ-port 80
App3-port 2222
10.0.0.1 5555 seq3 22 10.0.0.2
10.0.0.1
10.0.0.2 10.0.0.1:5555:
10.0.0.1:5555: 10.0.0.2:22
10.0.0.2:22

10.0.0.2 22 ACK2 5555 10.0.0.1 File descriptor


File descriptor

10.0.0.1 5555 seq3 22 10.0.0.2

10.0.0.2 22 ACK3 5555 10.0.0.1


husseinnasser

Closing Connection
● App1 wants to close the connection
● App1 sends FIN, AppX ACK
● AppX sends FIN, App1 ACK
● Four way handshake
10.0.0.1 5555 FIN 22 10.0.0.2
AppX-port 22
App1-port 5555 AppY-port 443
App2-port 7712 AppZ-port 80
App3-port 2222 10.0.0.2 22 ACK 5555 10.0.0.1

10.0.0.1 10.0.0.2 10.0.0.1:5555:


10.0.0.1:5555: 10.0.0.2 22 FIN 5555 10.0.0.1 10.0.0.2:22
10.0.0.2:22
File descriptor
File descriptor 10.0.0.1 5555 ACK 22 10.0.0.2
husseinnasser

Summary
● Stands for Transmission Control Protocol
● Layer 4 protocol
● “Controls” the transmission unlike UDP which is a firehose
● Introduces Connection concept
● Retransmission, acknowledgement, guaranteed delivery
● Stateful, connection has a state
husseinnasser

Sockets, Connections and


Queues
Kernel networking structures
Socket
● When a process listens on an IP/port it produces a socket
● A socket is a file (at least in Linux)
● The process owns the socket
● Can be shared during fork
[Diagram: Process A holds the socket's file descriptor]
SYN Queue, Accept Queue
● When a socket is created we get two queues with it
● SYN queue - stores incoming SYNs
● Accept queue - stores completed connections
● The size of the queues is determined by the backlog
● Not really queues but hash tables
[Diagram: Process A owns socket S with its SYN queue and accept queue]
Connection, Receive and Send queues
● Completed connections are placed in the accept queue
● When a process “accepts”, a connection is created
● Accept returns a file descriptor for the connection
● Two new queues are created with the connection
● The send queue stores the connection's outgoing data
● The receive queue stores the connection's incoming data
[Diagram: connection C with its send queue and receive queue]
Connection Establishment

● TCP Three way handshake


● SYN/SYN-ACK/ACK
● But what happens on the backend?
SYN
SYN/ACK

ACK
Connection Establishment

● Server Listens on an address:port


● Client connects
● Kernel does the handshake creating
a connection
● Backend process “Accepts” the
connection
Connection Establishment
● Kernel creates a socket & two queues SYN and Accept
● Client sends a SYN
● Kernel adds it to the SYN queue, replies with SYN/ACK
● Client replies with ACK
● Kernel finishes the connection
● Kernel removes the SYN from the SYN queue
● Kernel adds the full connection to the accept queue
● Backend accepts the connection, which is removed from the accept queue
● A file descriptor is created for the connection
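The server side of all of the above, as a minimal POSIX sketch (port and backlog are illustrative): listen() creates the socket with its SYN/accept queues, and accept() pops a completed connection.

#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(s, (struct sockaddr *)&addr, sizeof addr);
    listen(s, 128);                /* 128 = backlog, sizing the queues */
    int c = accept(s, NULL, NULL); /* blocks until a completed connection exists */
    /* c is the connection's file descriptor, with its own send/receive queues */
    close(c);
    close(s);
    return 0;
}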
Connection Establishment
[Diagram: the SYN lands in the SYN queue; after SYN/ACK and the client's ACK the connection moves to the accept queue; accept() copies it out to the process (PID)]
Problems with accepting connections

● Backend doesn’t accept fast enough


● Clients who don’t ACK
● Small backlog
Socket Sharding
● Normally listening on active port/ip fails
● But you can override that with SO_REUSEPORT
● Two distinct sockets in different processes on the same ip/port pair
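A sketch of the sharding setup (the function name and port are illustrative); each worker process calls this, and the kernel load-balances incoming connections across the listeners:

#include <netinet/in.h>
#include <sys/socket.h>

int shard_listener(unsigned short port) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one); /* before bind() */
    struct sockaddr_in a = {0};
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_ANY);
    a.sin_port = htons(port);
    bind(s, (struct sockaddr *)&a, sizeof a); /* now succeeds in every process */
    listen(s, 128);
    return s;
}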
Summary

● Kernel manages networking


● Socket represents a port/ip
● Each connected client gets a connection
● Kernel managed data structures
husseinnasser

Reading and Sending Data


Receive vs Send buffers
Send and receive buffers

● Client sends data on a connection


● Kernel puts data in receive queue
● Kernel ACKs (possibly delayed) and updates the window
● App calls read to copy data
Receive buffers
[Diagram: incoming data lands in the receive buffer and is ACKed; read() copies it into the process]
Send buffers
[Diagram: send() copies data into the send buffer; the kernel transmits it and handles the ACKs]
Problems with reading and sending

● Backend doesn’t read fast enough


● Receive queue is full
● Client slows down
husseinnasser

Socket Programming
Patterns
Common socket patterns
Single Listener/Single Worker Thread
Single Listener/Multiple Worker threads
Single Listener/Multiple Worker threads with load balancing
Multiple Acceptor Threads single Socket
Multiple Listeners on the same port
husseinnasser

Asynchronous IO
Non blocking reads and writes
Blocking operations

● Read, write and accept are blocking


● The process cannot move its program counter
● Read blocks when no data
● accept blocks when no connections
● Leads to context switches and slow down

[Diagram: pseudo code in which the process calls read(con1), read(con2), read(con3)… Green marks a connection whose receive buffer has data to read; gray marks an empty receive buffer (the client hasn't sent anything). The process gets blocked on reading connection 3 until some data arrives there, while the other connections are starved.]
Asynchronous I/O
● Read blocks when no data in receive buffer
● Accept blocks when no connections in accept queue
● Ready approach
○ Ask the OS to tell us if a file is ready
○ When it is ready, we call it without blocking
○ select, epoll, kqueue
● Completion approach
○ Ask the OS (or a worker thread) to do the blocking IO
○ When completed, notify
○ IOCP, io_uring
● The readiness approach doesn't work with storage files
Select (polling)
● Select takes a collection of file descriptors for kernel to monitor
● Select is blocking (with a timeout)
● When any is ready, select returns
● Process checks which one is ready (loop w/FD_ISSET)
● Process calls read/write/accept etc. on the file descriptor
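A sketch of that loop (connection setup omitted; the fd array is illustrative):

#include <sys/select.h>
#include <unistd.h>

void serve(int cons[], int n) {
    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        int maxfd = 0;
        for (int i = 0; i < n; i++) {      /* copy every fd in, on every call */
            FD_SET(cons[i], &rfds);
            if (cons[i] > maxfd) maxfd = cons[i];
        }
        select(maxfd + 1, &rfds, NULL, NULL, NULL); /* block until one is ready */
        for (int i = 0; i < n; i++) {      /* O(n) scan for the ready ones */
            if (FD_ISSET(cons[i], &rfds)) {
                char buf[4096];
                read(cons[i], buf, sizeof buf); /* won't block: data is there */
            }
        }
    }
}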
[Diagram: only connection 9 has data in its receive buffer, but the client is monitoring all 10 connections. The select call passes all 10 file descriptors; the kernel checks them one by one, discovers connection 9 has data, unblocks and returns.]
[Diagram: the process's program counter advances; it loops through all connections checking which is ready, O(n), finds connection 9 ready and issues a read on it with no blocking.]
Select pros and cons
● Pros
○ Avoid reading (unready) resources
○ async
● Cons
○ Slow: we have to loop through all of them, O(n)
○ Lots of copying between kernel and user space
○ Supports only a fixed number of file descriptors (FD_SETSIZE)
epoll (eventing)
● Register an interest list of the fds (once) in the kernel
● Kernel keeps working and updates the ready list
● User calls epoll_wait, kernel builds an events array of ready list

[Diagram: (1) epoll_create registers con1, con2, con3; (2) incoming packets mark con3 ready in the kernel; (3) the process calls epoll_wait; (4) the events array returns con3]
[Diagram: we first create the epoll instance, which lives in kernel memory, then add the connections/file descriptors to it.]
[Diagram: the process moves on; meanwhile the kernel keeps receiving packets and populates the ready list for connections with data, here con4 and con9.]
[Diagram: at some point the code calls epoll_wait; if the ready list is empty the call blocks, otherwise the kernel immediately returns an events array containing only what is ready, here 2 elements: con4 and con9.]
[Diagram: the user then reads (non-blocking) connections 9 and 4, emptying their receive buffers; the ready list is now empty.]
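The same loop with epoll, as a Linux-only sketch (setup omitted): interest is registered once, and each wait returns only the ready connections.

#include <sys/epoll.h>
#include <unistd.h>

void serve(int cons[], int n) {
    int ep = epoll_create1(0);             /* the instance lives in kernel memory */
    for (int i = 0; i < n; i++) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = cons[i] };
        epoll_ctl(ep, EPOLL_CTL_ADD, cons[i], &ev); /* register interest once */
    }
    struct epoll_event ready[64];
    for (;;) {
        int k = epoll_wait(ep, ready, 64, -1); /* blocks until the ready list fills */
        for (int i = 0; i < k; i++) {          /* only ready fds come back */
            char buf[4096];
            read(ready[i].data.fd, buf, sizeof buf);
        }
    }
}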
epoll drawbacks
● Complex
○ Level-triggered vs edge-triggered
● Only on Linux
● Too many syscalls
● Doesn’t work on files
io_uring
● Based on completion
● Kernel does the work
● Shared memory, user puts “job”
● kernel does the work and writes results
[Diagram: a submission queue and a completion queue live in memory shared between the user process and the kernel. The process writes jobs (read/write/accept) at the submission queue's tail and consumes results from the completion queue's head; the kernel pops jobs from the submission queue's head and writes completed job results at the completion queue's tail.]
io_uring
● Fast, non-blocking
● Security is a big issue (shared memory)
● Google disabled it for now
Cross platform
● Node (through libuv) supports async IO on all platforms
Summary
● Some kernel syscalls are blocking
● Process put to sleep
● Asynchronous io is a way around it
● Ready based or completion based
husseinnasser

More OS Concepts
husseinnasser

Compilers and Linkers


On Programming Languages
Machine Code

● Programs run on machine code


● Specific to the CPU
● Each CPU family has a different instruction set
● RISC vs CISC
Assembly

● Closest to the machine code


● Still sometimes CPU specific
● Easier to write
● Not easy enough though
High level languages

● HLL are more convenient


● Abstractions to hide complexity*
● Need to compile for a CPU
● Compile turns code to machine code
● Linking creates executable file

*That is good and bad


Compiling
● Compiling produces machine code in the form of object files
● Each object file may represent a source file
● Object files are not ready to be run
● They need to be linked and create an executable
● E.g. gcc, clang, rustc
Linking
● The linker creates an executable file
● Finds and links all required object files into one file
● The file is an “executable”
● Executable files have types
● E.g. ld, gold, lld, mold (new)
Executable files formats
● The executable file is a program
● Specific format so the OS knows how to create process
● Created by linker
● E.g. ELF on Linux
Executable files formats
● PE (.exe) -> Windows
● ELF -> Linux
● Mach-O -> macOS
Interpreted Languages
● Compiled Program doesn’t work everywhere
● Must match the CPU/OS
● Can I write my code once and run it everywhere?
● Interpreted languages
● Python/Javascript/Java
Interpreted Languages
● Must have a runtime, python.exe
● Python.exe is a compiled program for every OS/CPU
● Your code hello.py runs everywhere
○ Windows python.exe hello.py
○ Linux ./python hello.py
● Same for Node and Javascript
Interpreted Languages
● The trick is that each line is interpreted
● If you see “+” do “this”, if you see “-” do “this”
● Slower
● Byte code (not string code)
Just in Time Compilation (JIT)
● I’m interpreting this code a lot
● Let me compile it directly to machine code
● Put it on the heap
● Mark memory as executable
● Point the CPU program counter to it
Garbage Collection
● Memory management is tricky
● Some languages manage it for you
○ Go, Python, Java
● In some languages you have to do it yourself
○ C, C++
● Garbage collection is part of the runtime
● Tags every object and tracks it
● Can cause slow downs
Summary
● Compiling vs Linking
● Compiled Languages
● Interpreted Languages
● Garbage collection
Kernel and User Space
Mode switch
Kernel vs User
● A process's virtual address space has two parts: kernel and user
● Kernel maps to dedicated physical memory
● User pages map to different physical memory
○ Page tables help the mapping
[Diagram: process address space 0x00000000-0xFFFFFFFF; user space below 0xC0000000, kernel above]
Modes
● The CPU can be in user or kernel mode
● In user mode, user code executes
● In kernel mode, kernel code executes
○ Syscalls, drivers
● User mode can't access kernel pages
● Kernel mode can access both
Kernel mode switch cost
● System code/stack lives in the kernel area
● Kernel functions run on the kernel stack
● The process invokes a syscall (e.g. read())
● The CPU is put in kernel mode
● Also happens on page faults
Kernel mode switch
● Stores the current base pointer on the kernel stack
● Stores the return address as well
● Stores all user registers/state to memory
● Once done, user registers are restored
● User mode activates, execution resumes
Cost
● Mode switch (store all registers and restore them)
● Memory access
● Security checks and validation
● System call number lookup
● Process stats and runtime are accounted differently in kernel mode
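A rough sketch to feel that cost (numbers vary wildly by CPU, kernel and mitigations; this is illustrative, not a benchmark):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000000; i++)
        syscall(SYS_getpid);   /* each call is a user -> kernel -> user round trip */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("~%ld ns per syscall\n", ns / 1000000);
    return 0;
}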
Summary
● Kernel mode vs user mode
● Security to protect kernel
● Kernel time is not process time
● Cost of mode switch
Virtualization and
Containerization
Modern OS techniques
One Machine One OS
● Very limiting
● One machine multiple OS?
● One at a time (switch at startup)
● Virtual Machines
● Containerization (jails)
Multiple native OS
● Many OSes on top of the hardware
● One active at a time
● Switch requires a reboot
● High isolation
[Diagram: Full OS1, OS2, OS3, each with its own kernel, side by side on the hardware]
Virtualization
● Many OSes on top of one base OS
● A hypervisor (software) controls the upper OSes
● Proxies syscalls to the lower kernel
● Full isolation but lots of redundancy
[Diagram: Full OS1, OS2, OS3 on a hypervisor, which sits on the base OS tools, base kernel and hardware]
Virtualization
● Can limit CPU/Memory for each VM
● E.g. VMware, Oracle VirtualBox
Containerization
● Containers are isolated by namespaces
○ Mount fs namespace
○ Network namespace
○ Process namespace
● Limited by cgroups
○ CPU/memory usage
● All containers share the kernel!
● E.g. docker
[Diagram: Container1, Container2, Container3 on top of the base OS tools, base kernel and hardware]
namespace
● Kernel feature for isolation
● Mount namespace -> isolates mounts
● PID namespace -> isolates processes
● NIC namespace -> isolates networks
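A minimal Linux sketch of the isolation primitive itself (needs root; the hostname is made up): after unshare, the new UTS namespace sees its own hostname while the host's is untouched.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    if (unshare(CLONE_NEWUTS) != 0) {     /* new UTS namespace for this process */
        perror("unshare");
        return 1;
    }
    sethostname("container1", 10);        /* visible only inside the namespace */
    char name[64];
    gethostname(name, sizeof name);
    printf("inside namespace: %s\n", name);
    return 0;
}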
cgroup
● Control group kernel feature
● Assigns certain cpu and memory to processes
● So that containers can’t starve others
What happens on a new container
● New namespaces created
● It can only see what is mounted
● Usually a directory in host is created
● Container gets 1 mount to that directory
● Can’t see anything else
What happens on a new container
● The OS necessary tools are loaded
● E.g. ubuntu apt-get, ifconfig etc.
● Then user code/processes are loaded
○ E.g. node
● But there is a better way
overlayfs
● A new container gets its own copy of the base OS tools
● This would quickly fill the disk
● Meet overlayfs
● The base OS layer is read-only (shared)
● Only changes are maintained
network
● A new container gets its own NIC namespace
● It won't see the host's NICs (unless explicitly given)
● Docker typically has one default network that all containers join
● A new container gets a new MAC address and is assigned an IP via DHCP
processes
● A new container can spawn any number of processes
● A container can only see its own processes
● PID 1 can exist in all containers; they are isolated
● Each process has a global PID that the base kernel sees
● PID namespace
Summary
● Hardware
● Kernel
● Virtual machines
● Containers
