If all reference bits in the table are set, then second chance degrades to FIFO, but also
requires a complete search of the table for every page-replacement.
      As long as there are some pages whose reference bits are not set, then any page
       referenced frequently enough gets to stay in the page table indefinitely.
      This algorithm is also known as the clock algorithm, from the hands of the clock moving
       around the circular queue.
                    Fig: Second-chance ( clock ) page-replacement algorithm
Enhanced Second-Chance Algorithm
      The enhanced second chance algorithm looks at the reference bit and the modify bit (
       dirty bit ) as an ordered page, and classifies pages into one of four classes:
            1. ( 0, 0 ) - Neither recently used nor modified.
            2. ( 0, 1 ) - Not recently used, but modified.
            3. ( 1, 0 ) - Recently used, but clean.
            4. ( 1, 1 ) - Recently used and modified.
      This algorithm searches the page table in a circular fashion ( in as many as four passes ),
       looking for the first page it can find in the lowest numbered category. I.e. it first makes a
       pass looking for a ( 0, 0 ), and then if it can't find one, it makes another pass looking for a
       ( 0, 1 ), etc.
      The main difference between this algorithm and the previous one is the preference for
       replacing clean pages if possible.
Counting-Based Page Replacement
      There are several algorithms based on counting the number of references that have been
       made to a given page, such as:
                                                 76
           o Least Frequently Used, LFU: Replace the page with the lowest reference count. A
              problem can occur if a page is used frequently initially and then not used any
              more, as the reference count remains high. A solution to this problem is to right-
              shift the counters periodically, yielding a time-decaying average reference count.
           o Most Frequently Used, MFU: Replace the page with the highest reference count.
              The logic behind this idea is that pages that have already been referenced a lot
              have been in the system a long time, and we are probably done with them,
              whereas pages referenced only a few times have only recently been loaded, and
              we still need them.
      In general counting-based algorithms are not commonly used, as their implementation is
       expensive and they do not approximate OPT well.
Page-Buffering Algorithms
There are a number of page-buffering algorithms that can be used in conjunction with the afore-
mentioned algorithms, to improve overall performance and sometimes make up for inherent
weaknesses in the hardware and/or the underlying page-replacement algorithms:
      Maintain a certain minimum number of free frames at all times. When a page-fault
       occurs, go ahead and allocate one of the free frames from the free list first, to get the
       requesting process up and running again as quickly as possible, and then select a victim
       page to write to disk and free up a frame as a second step.
      Keep a list of modified pages, and when the I/O system is otherwise idle, have it write
       these pages out to disk, and then clear the modify bits, thereby increasing the chance of
       finding a "clean" page for the next potential victim.
      Keep a pool of free frames, but remember what page was in it before it was made free.
       Since the data in the page is not actually cleared out when the page is freed, it can be
       made an active page again without having to load in any new data from disk. This is
       useful when an algorithm mistakenly replaces a page that in fact is needed again soon.
Applications and Page Replacement
      Some applications ( most notably database programs ) understand their data accessing
       and caching needs better than the general-purpose OS, and should therefore be given
       reign to do their own memory management.
      Sometimes such programs are given a raw disk partition to work with, containing raw
       data blocks and no file system structure. It is then up to the application to use this disk
       partition as extended memory or for whatever other reasons it sees fit.
Allocation of Frames
We said earlier that there were two important tasks in virtual memory management: a page-
replacement strategy and a frame-allocation strategy. This section covers the second part of that
pair.
                                                 77
Minimum Number of Frames
     The absolute minimum number of frames that a process must be allocated is dependent
      on system architecture, and corresponds to the worst-case scenario of the number of
      pages that could be touched by a single ( machine ) instruction.
     If an instruction ( and its operands ) spans a page boundary, then multiple pages could be
      needed just for the instruction fetch.
     Memory references in an instruction touch more pages, and if those memory locations
      can span page boundaries, then multiple pages could be needed for operand access also.
     The worst case involves indirect addressing, particularly where multiple levels of indirect
      addressing are allowed. Left unchecked, a pointer to a pointer to a pointer to a pointer to
      a . . . could theoretically touch every page in the virtual address space in a single machine
      instruction, requiring every virtual page be loaded into physical memory simultaneously.
      For this reason architectures place a limit ( say 16 ) on the number of levels of indirection
      allowed in an instruction, which is enforced with a counter initialized to the limit and
      decremented with every level of indirection in an instruction - If the counter reaches zero,
      then an "excessive indirection" trap occurs. This example would still require a minimum
      frame allocation of 17 per process.
Allocation Algorithms
     Equal Allocation - If there are m frames available and n processes to share them, each
      process gets m / n frames, and the leftovers are kept in a free-frame buffer pool.
     Proportional Allocation - Allocate the frames proportionally to the size of the process,
      relative to the total size of all processes. So if the size of process i is S_i, and S is the sum
      of all S_i, then the allocation for process P_i is a_i = m * S_i / S.
     Variations on proportional allocation could consider priority of process rather than just
      their size.
     Obviously all allocations fluctuate over time as the number of available free frames, m,
      fluctuates, and all are also subject to the constraints of minimum allocation. ( If the
      minimum allocations cannot be met, then processes must either be swapped out or not
      allowed to start until more free frames become available. )
Global versus Local Allocation
     One big question is whether frame allocation ( page replacement ) occurs on a local or
      global level.
     With local replacement, the number of pages allocated to a process is fixed, and page
      replacement occurs only amongst the pages allocated to this process.
     With global replacement, any page may be a potential victim, whether it currently
      belongs to the process seeking a free frame or not.
     Local page replacement allows processes to better control their own page fault rates, and
      leads to more consistent performance of a given process over different system load levels.
                                                 78
     Global page replacement is overall more efficient, and is the more commonly used
      approach.
Non-Uniform Memory Access
     The above arguments all assume that all memory is equivalent, or at least has equivalent
      access times.
     This may not be the case in multiple-processor systems, especially where each CPU is
      physically located on a separate circuit board which also holds some portion of the
      overall system memory.
     In these latter systems, CPUs can access memory that is physically located on the same
      board much faster than the memory on the other boards.
     The basic solution is akin to processor affinity - At the same time that we try to schedule
      processes on the same CPU to minimize cache misses, we also try to allocate memory for
      those processes on the same boards, to minimize access times.
     The presence of threads complicates the picture, especially when the threads get loaded
      onto different processors.
     Solaris uses an lgroup as a solution, in a hierarchical fashion based on relative latency.
      For example, all processors and RAM on a single board would probably be in the same
      lgroup. Memory assignments are made within the same lgroup if possible, or to the next
      nearest lgroup otherwise. ( Where "nearest" is defined as having the lowest access time. )
Thrashing
     If a process cannot maintain its minimum required number of frames, then it must be
      swapped out, freeing up frames for other processes. This is an intermediate level of CPU
      scheduling.
     But what about a process that can keep its minimum, but cannot keep all of the frames
      that it is currently using on a regular basis? In this case it is forced to page out pages that
      it will need again in the very near future, leading to large numbers of page faults.
     A process that is spending more time paging than executing is said to be thrashing.
Cause of Thrashing
     Early process scheduling schemes would control the level of multiprogramming allowed
      based on CPU utilization, adding in more processes when CPU utilization was low.
     The problem is that when memory filled up and processes started spending lots of time
      waiting for their pages to page in, then CPU utilization would lower, causing the schedule
      to add in even more processes and exacerbating the problem! Eventually the system
      would essentially grind to a halt.
     Local page replacement policies can prevent one thrashing process from taking pages
      away from other processes, but it still tends to clog up the I/O queue, thereby slowing
                                                79
       down any other process that needs to do even a little bit of paging ( or any other I/O for
       that matter. )
                                     Fig: Thrashing
      To prevent thrashing we must provide processes with as many frames as they really need
       "right now", but how do we know what that is?
      The locality model notes that processes typically access memory references in a given
       locality, making lots of references to the same general area of memory before moving
       periodically to a new locality, as shown in Figure 9.19 below. If we could just keep as
       many frames as are involved in the current locality, then page faulting would occur
       primarily on switches from one locality to another. ( E.g. when one function exits and
       another is called. )
Working-Set Model
The working set model is based on the concept of locality, and defines a working set window, of
length delta. Whatever pages are included in the most recent delta page references are said to be
in the processes working set window, and comprise its current working set, as illustrated in
below Figure:
                                     Fig: Working-set model
                                                80
     The selection of delta is critical to the success of the working set model - If it is too small
      then it does not encompass all of the pages of the current locality, and if it is too large,
      then it encompasses pages that are no longer being frequently accessed.
     The total demand, D, is the sum of the sizes of the working sets for all processes. If D
      exceeds the total number of available frames, then at least one process is thrashing,
      because there are not enough frames available to satisfy its minimum working set. If D is
      significantly less than the currently available frames, then additional processes can be
      launched.
     The hard part of the working-set model is keeping track of what pages are in the current
      working set, since every reference adds one to the set and removes one older page. An
      approximation can be made using reference bits and a timer that goes off after a set
      interval of memory references:
               o For example, suppose that we set the timer to go off after every 5000
                   references ( by any process ), and we can store two additional historical
                   reference bits in addition to the current reference bit.
          o Every time the timer goes off, the current reference bit is copied to one of the two
               historical bits, and then cleared.
          o If any of the three bits is set, then that page was referenced within the last 15,000
               references, and is considered to be in that processes reference set.
          o Finer resolution can be achieved with more historical bits and a more frequent
               timer, at the expense of greater overhead.
Page-Fault Frequency
     A more direct approach is to recognize that what we really want to control is the page-
      fault rate, and to allocate frames based on this directly measurable value. If the page-fault
      rate exceeds a certain upper bound then that process needs more frames, and if it is below
      a given lower bound, then it can afford to give up some of its frames to other processes.
     ( I suppose a page-replacement strategy could be devised that would select victim frames
      based on the process with the lowest current page-fault frequency. )
                                    Fig: Page-fault frequency
                                                81
Memory-mapped files
Rather than accessing data files directly via the file system with every file access, data files can
be paged into memory the same as process files, resulting in much faster accesses ( except of
course when page-faults occur. ) This is known as memory-mapping a file.
Basic Mechanism
      Basically a file is mapped to an address range within a process's virtual address space,
       and then paged in as needed using the ordinary demand paging system.
      Note that file writes are made to the memory page frames, and are not immediately
       written out to disk. ( This is the purpose of the "flush( )" system call, which may also be
       needed for stdout in some cases. See the time killer program for an example of this. )
      This is also why it is important to "close( )" a file when one is done writing to it - So that
       the data can be safely flushed out to disk and so that the memory frames can be freed up
       for other purposes.
      Some systems provide special system calls to memory map files and use direct disk
       access otherwise. Other systems map the file to process address space if the special
       system calls are used and map the file to kernel address space otherwise, but do memory
       mapping in either case.
      File sharing is made possible by mapping the same file to the address space of more than
       one process, as shown in below Figure. Copy-on-write is supported, and mutual
       exclusion techniques may be needed to avoid synchronization problems.
                                   Fig: Memory-mapped files.
                                                 82
      Shared memory can be implemented via shared memory-mapped files ( Windows ), or it
       can be implemented through a separate process ( Linux, UNIX. )
Shared Memory in the Win32 API
      Windows implements shared memory using shared memory-mapped files, involving
       three basic steps:
           1. Create a file, producing a HANDLE to the new file.
           2. Name the file as a shared object, producing a HANDLE to the shared object.
           3. Map the shared object to virtual memory address space, returning its base address
               as a void pointer ( LPVOID ).
This is illustrated in below Figure
                  Fig: Shared memory in Windows using memory-mapped I/O
Memory-Mapped I/O
      All access to devices is done by writing into ( or reading from ) the device's registers.
       Normally this is done via special I/O instructions.
      For certain devices it makes sense to simply map the device's registers to addresses in the
       process's virtual address space, making device I/O as fast and simple as any other
       memory access. Video controller cards are a classic example of this.
      Serial and parallel devices can also use memory mapped I/O, mapping the device
       registers to specific memory addresses known as I/O Ports, e.g. 0xF8. Transferring a
       series of bytes must be done one at a time, moving only as fast as the I/O device is
       prepared to process the data, through one of two mechanisms:
           o Programmed I/O ( PIO ), also known as polling. The CPU periodically checks
               the control bit on the device, to see if it is ready to handle another byte of data.
           o Interrupt Driven. The device generates an interrupt when it either has another
               byte of data to deliver or is ready to receive another byte.
                                                83
Allocating Kernel Memory
     Previous discussions have centered on process memory, which can be conveniently
      broken up into page-sized chunks, and the only fragmentation that occurs is the average
      half-page lost to internal fragmentation for each process ( segment. )
     There is also additional memory allocated to the kernel, however, which cannot be so
      easily paged. Some of it is used for I/O buffering and direct access by devices, example,
      and must therefore be contiguous and not affected by paging. Other memory is used for
      internal kernel data structures of various sizes, and since kernel memory is often locked (
      restricted from being ever swapped out ), management of this resource must be done
      carefully to avoid internal fragmentation or other waste. ( I.e. you would like the kernel to
      consume as little memory as possible, leaving as much as possible for user processes. )
      Accordingly there are several classic algorithms in place for allocating kernel memory
      structures.
Buddy System
     The Buddy System allocates memory using a power of two allocator.
     Under this scheme, memory is always allocated as a power of 2 ( 4K, 8K, 16K, etc ),
      rounding up to the next nearest power of two if necessary.
     If a block of the correct size is not currently available, then one is formed by splitting the
      next larger block in two, forming two matched buddies. ( And if that larger size is not
      available, then the next largest available size is split, and so on. )
     One nice feature of the buddy system is that if the address of a block is exclusively ORed
      with the size of the block, the resulting address is the address of the buddy of the same
      size, which allows for fast and easy coalescing of free blocks back into larger blocks.
           o Free lists are maintained for every size block.
           o If the necessary block size is not available upon request, a free block from the
              next largest size is split into two buddies of the desired size. ( Recursively
              splitting larger size blocks if necessary. )
           o When a block is freed, its buddy's address is calculated, and the free list for that
              size block is checked to see if the buddy is also free. If it is, then the two buddies
              are coalesced into one larger free block, and the process is repeated with
              successively larger free lists.
           o See the ( annotated ) Figure below for an example.
                                                84
                                 Fig: Buddy System Allocation
Slab Allocation
     Slab Allocation allocates memory to the kernel in chunks called slabs, consisting of one
      or more contiguous pages. The kernel then creates separate caches for each type of data
      structure it might need from one or more slabs. Initially the caches are marked empty,
      and are marked full as they are used.
     New requests for space in the cache is first granted from empty or partially empty slabs,
      and if all slabs are full, then additional slabs are allocated.
     ( This essentially amounts to allocating space for arrays of structures, in large chunks
      suitable to the size of the structure being stored. For example if a particular structure were
      512 bytes long, space for them would be allocated in groups of 8 using 4K pages. If the
      structure were 3K, then space for 4 of them could be allocated at one time in a slab of
      12K using three 4K pages.
     Benefits of slab allocation include lack of internal fragmentation and fast allocation of
      space for individual structures
     Solaris uses slab allocation for the kernel and also for certain user-mode memory
      allocations. Linux used the buddy system prior to 2.2 and switched to slab allocation
      since then.
                                      Fig: Slab Allocation
                                               85
Other Considerations
Prepaging
     The basic idea behind prepaging is to predict the pages that will be needed in the near
      future, and page them in before they are actually requested.
     If a process was swapped out and we know what its working set was at the time, then
      when we swap it back in we can go ahead and page back in the entire working set, before
      the page faults actually occur.
     With small ( data ) files we can go ahead and prepage all of the pages at one time.
     Prepaging can be of benefit if the prediction is good and the pages are needed eventually,
      but slows the system down if the prediction is wrong.
Page Size
     There are quite a few trade-offs of small versus large page sizes:
     Small pages waste less memory due to internal fragmentation.
     Large pages require smaller page tables.
     For disk access, the latency and seek times greatly outweigh the actual data transfer
      times. This makes it much faster to transfer one large page of data than two or more
      smaller pages containing the same amount of data.
     Smaller pages match locality better, because we are not bringing in data that is not really
      needed.
     Small pages generate more page faults, with attending overhead.
     The physical hardware may also play a part in determining page size.
     It is hard to determine an "optimal" page size for any given system. Current norms range
      from 4K to 4M, and tend towards larger page sizes as time passes.
TLB Reach
     TLB Reach is defined as the amount of memory that can be reached by the pages listed in
      the TLB.
     Ideally the working set would fit within the reach of the TLB.
     Increasing the size of the TLB is an obvious way of increasing TLB reach, but TLB
      memory is very expensive and also draws lots of power.
     Increasing page sizes increases TLB reach, but also leads to increased fragmentation loss.
     Some systems provide multiple size pages to increase TLB reach while keeping
      fragmentation low.
     Multiple page sizes requires that the TLB be managed by software, not hardware.
Inverted Page Tables
                                               86
      Inverted page tables store one entry for each frame instead of one entry for each virtual
       page. This reduces the memory requirement for the page table, but loses the information
       needed to implement virtual memory paging.
      A solution is to keep a separate page table for each process, for virtual memory
       management purposes. These are kept on disk, and only paged in when a page fault
       occurs. ( I.e. they are not referenced with every memory access the way a traditional page
       table would be. )
Program Structure
      Consider a pair of nested loops to access every element in a 1024 x 1024 two-
       dimensional array of 32-bit ints.
      Arrays in C are stored in row-major order, which means that each row of the array would
       occupy a page of memory.
      If the loops are nested so that the outer loop increments the row and the inner loop
       increments the column, then an entire row can be processed before the next page fault,
       yielding 1024 page faults total.
      On the other hand, if the loops are nested the other way, so that the program worked
       down the columns instead of across the rows, then every access would be to a different
       page, yielding a new page fault for each access, or over a million page faults all together.
      Be aware that different languages store their arrays differently. FORTRAN for example
       stores arrays in column-major format instead of row-major. This means that blind
       translation of code from one language to another may turn a fast program into a very slow
       one, strictly because of the extra page faults.
I/O Interlock and Page Locking
There are several occasions when it may be desirable to lock pages in memory, and not let them
get paged out:
      Certain kernel operations cannot tolerate having their pages swapped out.
      If an I/O controller is doing direct-memory access, it would be wrong to change pages in
       the middle of the I/O operation.
      In a priority based scheduling system, low priority jobs may need to wait quite a while
       before getting their turn on the CPU, and there is a danger of their pages being paged out
       before they get a chance to use them even once after paging them in. In this situation
       pages may be locked when they are paged in, until the process that requested them gets at
       least one turn in the CPU.
                                                87
              Figure :The reason why frames used for I/O must be in memory.
Operating-System Examples ( Optional )
Windows
     Windows uses demand paging with clustering, meaning they page in multiple pages
      whenever a page fault occurs.
     The working set minimum and maximum are normally set at 50 and 345 pages
      respectively. ( Maximums can be exceeded in rare circumstances. )
     Free pages are maintained on a free list, with a minimum threshold indicating when there
      are enough free frames available.
     If a page fault occurs and the process is below their maximum, then additional pages are
      allocated. Otherwise some pages from this process must be replaced, using a local page
      replacement algorithm.
     If the amount of free frames falls below the allowable threshold, then working set
      trimming occurs, taking frames away from any processes which are above their
      minimum, until all are at their minimums. Then additional frames can be allocated to
      processes that need them.
     The algorithm for selecting victim frames depends on the type of processor:
     On single processor 80x86 systems, a variation of the clock ( second chance ) algorithm
      is used.
     On Alpha and multiprocessor systems, clearing the reference bits may require
      invalidating entries in the TLB on other processors, which is an expensive operation. In
      this case Windows uses a variation of FIFO.
                                             88
Solaris
      Solaris maintains a list of free pages, and allocates one to a faulting thread whenever a
       fault occurs. It is therefore imperative that a minimum amount of free memory be kept on
       hand at all times.
      Solaris has a parameter, lotsfree, usually set at 1/64 of total physical memory. Solaris
       checks 4 times per second to see if the free memory falls below this threshold, and if it
       does, then the page out process is started.
      Pageout uses a variation of the clock ( second chance ) algorithm, with two hands rotating
       around through the frame table. The first hand clears the reference bits, and the second
       hand comes by afterwards and checks them. Any frame whose reference bit has not been
       reset before the second hand gets there gets paged out.
      The Pageout method is adjustable by the distance between the two hands, ( the handspan
       ), and the speed at which the hands move. For example, if the hands each check 100
       frames per second, and the handspan is 1000 frames, then there would be a 10 second
       interval between the time when the leading hand clears the reference bits and the time
       when the trailing hand checks them.
      The speed of the hands is usually adjusted according to the amount of free memory, as
       shown below. Slowscan is usually set at 100 pages per second, and fastscan is usually set
       at the smaller of 1/2 of the total physical pages per second and 8192 pages per second.
                                   Fig: Solaris Page Scanner
      Solaris also maintains a cache of pages that have been reclaimed but which have not yet
       been overwritten, as opposed to the free list which only holds pages whose current
       contents are invalid. If one of the pages from the cache is needed before it gets moved to
       the free list, then it can be quickly recovered.
      Normally page out runs 4 times per second to check if memory has fallen below lotsfree.
       However if it falls below desfree, then page out will run at 100 times per second in an
                                               89
        attempt to keep at least desfree pages free. If it is unable to do this for a 30-second
        average, then Solaris begins swapping processes, starting preferably with processes that
        have been idle for a long time.
       If free memory falls below minfree, then page out runs with every page fault.
       Recent releases of Solaris have enhanced the virtual memory management system,
        including recognizing pages from shared libraries, and protecting them from being paged
        out.
3.3 DEAD LOCKS
3.3.1. System Model
    ●   For the purposes of deadlock discussion, a system can be modeled as a collection of limited
        resources, which can be partitioned into different categories, to be allocated to a number of
        processes, each having different needs.
    ●   Resource categories may include memory, printers, CPUs, open files, tape drives, CD-ROMS,
        etc.
    ●   By definition, all the resources within a category are equivalent, and a request of this category can
        be equally satisfied by any one of the resources in that category. If this is not the case ( i.e. if
        there is some difference between the resources within a category ), then that category needs to be
        further divided into separate categories. For example, "printers" may need to be separated into
        "laser printers" and "color inkjet printers".
    ●   Some categories may have a single resource.
    ●   In normal operation a process must request a resource before using it, and release it when it is
        done, in the following sequence:
             1. Request - If the request cannot be immediately granted, then the process must wait until
                 the resource(s) it needs become available. Example: system calls open( ), malloc( ),
                 new( ), and request( ).
             2. Use - The process uses the resource.
                Example: prints to the printer or reads from the file.
            3. Release - The process relinquishes the resource. so that it becomes available for other
               processes.
                Example: close( ), free( ), delete( ), and release( ).
    ●   For all kernel-managed resources, the kernel keeps track of what resources are free and which are
        allocated, to which process they are allocated, and a queue of processes waiting for this resource
        to become available. Application-managed resources can be controlled using mutexes or wait( )
        and signal( ) calls, ( i.e. binary or counting semaphores. )
    ●   A set of processes is deadlocked when every process in the set is waiting for a resource that is
        currently allocated to another process in the set ( and which can only be released when that other
        waiting process makes progress. )
3.3.2. Deadlock Characterization
Necessary Conditions:
There are four conditions that are necessary to achieve deadlock:
                                                     90