Operating Systems
Last Lecture: Scheduling
Basic concepts
Preemptible vs. Non-preemptible resources
Dispatcher vs. Scheduler
Preemptive vs. Non-preemptive scheduling
Scheduling Performance Metrics
Different scheduling algorithms
First Come First Served (FCFS)
Shortest Job First (SJF)
• Shortest Remaining Time First (SRTF)
Round Robin (RR)
Priority
Gantt Chart for analyzing scheduling algorithms
Today: Advanced Scheduling
So far have seen basic scheduling algorithms
No one-size-fits-all scheduler
Workloads
Performance requirements
Today
Advanced Scheduling Concepts
• Multilevel Queue Scheduling
• Multiprocessor Scheduling
• Real-time Scheduling
Linux Scheduling
Multilevel Queue
Ready queue is partitioned into separate queues. E.g.,
foreground (interactive), background (batch)
Each queue has its own scheduling algorithm. E.g.,
foreground – RR, background – FCFS
Scheduling must be done between the queues. E.g.,
Fixed priority scheduling; (i.e., serve all from foreground then
from background).
• Problem: starvation.
Time slice – each queue gets a certain amount of CPU time which
it can schedule amongst its processes; i.e., 80% to foreground in
RR, 20% to background in FCFS
• Problem: how to split?
Multilevel Queue Example
4
Advantages and Disadvantages of Multilevel
Queue
Advantage: flexible
Disadvantage
Complex compared to basic algorithms
How to set priorities or time slice for queues?
Process role may change
5
Multilevel Feedback Queue
Process can move between the various queues; aging can be
implemented this way
Multilevel-feedback-queue scheduler defined by the following
parameters:
number of queues
scheduling algorithms for each queue
method used to determine when to upgrade a process
method used to determine when to demote a process
method used to determine which queue a process will enter when
that process needs service
6
Example of Multilevel Feedback Queue
Three queues:
Q0 – RR with time quantum 8 milliseconds
Q1 – RR time quantum 16 milliseconds
Q2 – FCFS
Scheduling
New process goes to Q0
When scheduled, if doesn’t finish in 8 ms, moved to Q1
When scheduled again, if doesn’t finish in 16 ms, moved to Q2
Multilevel Feedback Queues
8
Multiprocessor Scheduling Issues
Shared-memory Multiprocessor
processes
CPU0 CPU1 CPU2 CPU3
How to allocate processes to CPU?
Symmetric Multiprocessor
Architecture
Shared Memory
$ $ $ $
CPU0 CPU1 CPU2 CPU3
Small number of CPUs
Same access time to main memory
Private cache
SMP: Global Queue of Processes
One ready queue shared across all CPUs
CPU0 CPU1 CPU2 CPU3
Advantages
Good CPU utilization
Fair to all processes
Disadvantages
Not scalable (contention for global queue lock)
Poor cache locality
Linux 2.4 uses global queue
SMP: Per-CPU Queue of Processes
Static partition of processes to CPUs
Advantages CPU0 CPU1 CPU2 CPU3
Easy to implement
Scalable (no contention on ready queue)
Better cache locality
Disadvantages
Load-imbalance (some CPUs have more processes)
• Unfair to processes and lower CPU utilization
12
SMP: Hybrid Approach
Use both global and per-CPU queues
Balance jobs across queues
CPU0 CPU1 CPU2 CPU3
Processor Affinity
Add process to a CPU’s queue if recently run on the CPU
• Cache state may still present
Linux 2.6 uses a very similar approach
13
SMP: “Gang” Scheduling
Multiple processes need coordination
Should be scheduled simultaneously
CPU0 CPU1 CPU2 CPU3
Dispatcher on each CPU does not act independently
Coscheduling (gang scheduling): run a set of processes
simultaneously
Global context-switch across all CPUs
14
Real-Time Scheduling
Real-time processes have timing constraints
Expressed as deadlines or rate requirements
E.g. gaming, video/music player, autopilot…
Hard real-time systems – required to complete a critical
task within a guaranteed amount of time
Soft real-time computing – requires that critical
processes receive priority over less fortunate ones
Linux supports soft real-time
15
Today: Advanced Scheduling
Advanced Scheduling Concepts
Multilevel Queue Scheduling
Multiprocessor Scheduling
Real-time Scheduling
Linux Scheduling
16
Linux Scheduling Overview
Multilevel Queue Scheduler
Each queue associated with a priority
A process’s priority may be adjusted dynamically
Two classes of processes
Real-time processes: always schedule highest priority processes
• FCFS (SCHED_FIFO) or RR (SCHED_RR) for processes with
same priority
Normal processes: priority with aging
• RR for processes with same priority (SCHED_NORMAL)
• Aging is implemented efficiently
17
Priority Partition
Total 140 priorities [0, 140)
Smaller integer = higher priority
Real-time: [0,100)
Normal: [100, 140)
MAX_PRIO and MAX_RT_PRIO
include/linux/sched.h
18
Linux Scheduling Goals
Boost interactivity
Fast response to user despite high load
Achieved by inferring interactive processes and dynamically increasing
their priorities
Avoid starvation
Scale well with number of processes
O(1) scheduling overhead
SMP goals
Scale well with number of processors
Load balance: no CPU should be idle if there is work
CPU affinity: no random bouncing of processes
Reference: Documentation/sched-design.txt
19
runqueue data structure
kernel/sched.c
struct prio_array
Array of priority queues
struct runqueue
Two arrays, active and expired
20
task_struct scheduling related fields
static_prio and time_slice: used to compute and store
time-slice
prio: dynamic priority
Index to prio_array
rt_priority: real time priority
prio = 99 – rt_priority
include/linux/sched.h
21
Scheduling Algorithm
1. Find highest priority non-empty queue in rq->active
2. next = first process on that queue
3. Adjust next’s priority
4. Context switch to next
5. When next used up its time slice, insert next to the
right queue and call schedule again
schedule() in kernel/sched.c
22
Find Highest Priority Non-empty Queue
Use the bitmap field of runqueue
140 queues 5 integers
Time complexity: O(1)
depends on the number of priority levels, not the number of
processes
Implementation: only a few compares to find the first
that is non-zero
Hardware instruction to find the first 1-bit
• bsfl on Intel
sched_find_first_bit() in include/asm-i386/bitops.h
23
Adjusting Priority
Goal: dynamically increase priority of interactive process
How to determine interactive?
Sleep ratio
Mostly sleeping: I/O bound
Mostly running: CPU bound
Implementation: sleep_avg in task_struct
Before switching out a process, subtract from sleep_avg how many
ticks a task ran
• schedule()
Before switching in a process, add to sleep_avg how many ticks it
was blocked up to MAX_SLEEP_AVG (10 ms)
• schedule() recalc_task_prio() effective_prio()
24
Calculating Time Slices
Higher priority processes also get bigger time-slice
task_timeslice() in sched.c
If (static_priority < 120) time_slice = (140-static_priority) *
20
If (static_priority >= 120) time_slice = (140-static_priority)
*5
How to set static_priority?
Default: 120 (even for realtime processes)
Set use sys_nice() or sys_setpriority()
• Both call set_user_nice()
25
Example Time Slices
Priority: Static Pri Niceness Quantum
Highest 100 -20 800 ms
High 110 -10 600 ms
Normal 120 0 100 ms
Low 130 10 50 ms
Lowest 139 20 5 ms
26
On each timer interrupt …
scheduler_tick()
Called on each tick
• timer_interrupt do_timer_interrupt do_timer_interrupt_hook
update_process_times
If realtime and SCHED_FIFO, do nothing
SCHED_FIFO is non-preemptive
If realtime and SCHED_RR and used up time slice, move to end of rq-
>active[prio]
If SCHED_NORMAL and used up time slice
If not interactive or starving expired queue, move to end of rq->expired[prio]
Otherwise, move to end of rq->active[prio]
• Boost interactive
Else // SCHED_NORMAL, and not used up time slice
Break large time slice into pieces TIMESLICE_GRANULARITY
Aging: the Traditional Algorithm
for(pp = proc; pp < proc+NPROC; pp++) {
if (pp->prio != MAX)
pp->prio++;
if (pp->prio > curproc->prio)
reschedule();
}
Problem: O(N). Every process is examined on each
schedule() call!
This code is taken almost verbatim from 6th Edition
Unix, circa 1976.)
Simulate Aging
struct runqueue has active and expired
schedule() only runs processes from active , and puts them on
expired when use up time slice
When a queue in active is empty, look for next-highest priority
queue
After running all queues on active, swap active and expired
Advantage: O(1)
Processes are touched only when they start or stop running
schedule() in kernel/sched.c
29
Real-Time Scheduling
Linux has soft real-time scheduling
No hard real-time guarantees
All real-time processes are higher priority than any
conventional processes
Processes with priorities [0, 99] are real-time
saved in rt_priority in the task_struct
scheduling priority of a real time task is: 99 - rt_priority
Process can be converted to real-time via
sched_setscheduler system call
30
Real-Time Policies
First-in, first-out: SCHED_FIFO
Static priority
Process is only preempted for a higher-priority process
No time quanta; it runs until it blocks or yields voluntarily
RR within same priority level
Round-robin: SCHED_RR
As above but with a time quanta
Normal processes have SCHED_OTHER scheduling
policy
31
Multiprocessor Scheduling
Each processor has a separate run queue
Each processor only selects processes from its own
queue to run
Yes, it’s possible for one processor to be idle while
others have jobs waiting in their run queues
Periodically, the queues are rebalanced: if one
processor’s run queue is too long, some processes are
moved from it to another processor’s queue
32
Locking Runqueues
To rebalance, the kernel sometimes needs to move
processes from one runqueue to another
This is actually done by special kernel threads
Naturally, the runqueue must be locked before this
happens
The kernel always locks runqueues in order of
increasing indexes
Why? Deadlock prevention!
33
Processor Affinity
Each process has a bitmask saying what CPUs it can
run on
Normally, of course, all CPUs are listed
Processes can change the mask
The mask is inherited by child processes (and
threads), thus tending to keep them on the same CPU
Rebalancing does not override affinity
34
Load Balancing
To keep all CPUs busy, load balancing pulls tasks
from busy runqueues to idle runqueues.
If schedule finds that a runqueue has no runnable tasks
(other than the idle task), it calls load_balance
load_balance also called via timer
schedule_tick calls rebalance_tick
Every tick when system is idle
Every 100 ms otherwise
35
Load Balancing
load_balance looks for the busiest runqueue (most
runnable tasks) and takes a task that is (in order of
preference):
inactive (likely to be cache cold)
high priority
load_balance skips tasks that are:
likely to be cache warm (hasn't run for
cache_decay_ticks time)
currently running on a CPU
not allowed to run on the current CPU (as indicated
by the cpus_allowed bitmask in the task_struct)
36
Optimizations
If next is a kernel thread, borrow the MM mappings
from prev
User-level MMs are unused.
Kernel-level MMs are the same for all kernel threads
If prev == next
Don’t context switch
37