Unit-III: File Systems
Subhrendu Chattopadhyay, IDRBT
 Disclaimer: A few of the Images are taken from the Internet and Textbooks   1
FAT   File Allocation Table
                              2
FAT
•   Simple file system popularized by MS-DOS
•   First introduced in 1977
•   Most devices today use the FAT32 spec from 1996
•   FAT12, FAT16, VFAT, FAT32, etc.
•   Still quite popular today
•   Default format for USB sticks and memory cards
•   Used for EFI boot partitions
•   Name comes from the index table used to track directories and files
   FAT Layout
Superblock                                          FAT
 ● Stores basic info about the file system            ●       File allocation table (FAT)
 ● FAT version, location of boot files                ●       Marks which blocks are free or in-use
 ● Total number of blocks                             ●       Linked-list structure to manage large
 ● Index of the root directory in the FAT                     files
                                                                               Data Block
                                                                                ● Store file and directory data
                                                                                ● Each block is a fixed size (4KB – 64KB)
                                                                                ● Files may span multiple blocks
                                       https://elixir.bootlin.com/linux/v6.11-rc4/source/mm
FAT
•   Entry size depends on FAT-12, FAT-16, FAT-32
•   Directories are special files
       •     Contains a list of entries inside the directory file
•   Possible values for FAT entries:
       •     0 – entry is empty
       •     1 – reserved by the OS
       •     1 < N < 0xFFFF – next block in a chain
       •     0xFFFF – end of a chain
EXT2   Extended File System
                              6
Ext2
•   Inode: Every file or directory is represented by an
    inode. Special data structure to index
    files/directories
•   Block Groups: The file system is divided into block
    groups, which contain a fixed number of inodes,
    data blocks, and a superblock. This layout improves
    performance and allows for better disk usage.
                     Directories are files. Contains
                     the list of entries in the directory               ●   Each inode can directly point to 12 blocks
                                                                        ●   Can also indirectly point to blocks at 1, 2, and 3
    Ext2                                                                    levels of depth
                                                                    Name            Inode
                              bin
                                                                    .               0
                 /                  home
                                                                    bin             1
                                                                    home            2
                                                                    initrd.img      3
                                       Inode      Data
                                                            Inode                                  Data Block
                                      Bitmap     Bitmap
Root inode = 0
                                                    /*
Ext2: Inode Entries                                  * Structure of an inode on the disk
                                                     */
                                                    struct ext2_inode { … }
   Field   Size   Description
   Mode    2B     File type and permissions (indicates file type, owner, group, and other permissions).
   UID     2B     User ID of the file's owner.
   Size    4B     File size in bytes. For directories, it is the total size of the directory contents
   Atime   4B     Last access time (in seconds since Unix epoch).
   Ctime   4B     Creation time or last inode status change time.
   Mtime   4B     Last modification time.
   Dtime   4B     Deletion time. Stores when the file was deleted (if applicable).
                             https://elixir.bootlin.com/linux/v6.11-rc4/source/fs/ext2/ext2.h#L290
Ext2: Inode Entries
Field                     Size   Description
GID                       2B     Group ID of the file.
Links count               2B     Number of hard links to the file.
Flags                     4B     File attributes (e.g., secure deletion, immutability, compression).
Blocks                    4B     Total number of 512-byte blocks used by the file (including indirect blocks).
Pointers to data Blocks   60B    12 direct, 1 single indirect, 1 double indirect, 1 triple indirect block pointers.
Generation number         4B     File versioning number, used in network file systems (e.g., NFS).
File ACL                  4B     Access control list information for additional file permissions.
Directory ACL             4B     ACL information specific to directories.
Fragment address          4B     Not commonly used; holds file fragment information if the file is fragmented.
OS-specific               12B    OS-specific data fields for additional functionality.
Ext2: Bitmaps: Block Bitmap
•   The block bitmap tracks the allocation status of data blocks within a block group.
•   The size of the block bitmap depends on the number of blocks in the block group. Each bit in the
    bitmap represents a single block in the block group.
      • 1: The block is allocated (in use).
      • 0: The block is free (available for allocation).
•   Usage:
      • When a file is created or expanded, the filesystem scans the block bitmap to find free blocks
           (bits set to 0). Once blocks are allocated, the corresponding bits in the block bitmap are set to
           1.
      • During file deletion, the blocks are deallocated, and the corresponding bits in the bitmap are
           cleared (set to 0), marking those blocks as free for future use.
       Block Bitmap:       | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
                           |---|---|---|---|---|---|---|---|---|---|
                           |   |   |   |   |   |   |   |   |   |   |
                           | B1| B2| B3| B4| B5| B6| B7| B8| B9|B10|
Ext2: Bitmaps: inode Bitmap
•   The inode bitmap tracks the allocation status of inodes within a block group.
•   The size of the inode bitmap depends on the number of inodes in the block group. Each bit in the
    inode bitmap represents a single inode.
      • 1: The inode is allocated (in use, assigned to a file or directory).
      • 0: The inode is free (available for use).
•   Usage:
      • When a new file or directory is created, the inode bitmap is scanned to find a free inode (bit
           set to 0). The inode is then allocated, and the corresponding bit in the bitmap is set to 1.
      • When a file or directory is deleted, the inode is deallocated, and its corresponding bit is
           cleared to 0, marking the inode as free.
       Inode Bitmap:      | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
                          |---|---|---|---|---|---|---|---|---|---|
                          |   |   |   |   |   |   |   |   |   |   |
                          | I1| I2| I3| I4| I5| I6| I7| I8| I9|I10|
Ext2: Write Operation Example
•   Let’s go through an example where you write a 10 KB file to an EXT2 filesystem, assuming the block size is 4 KB.
•   Open the File:
       •     The kernel opens the file, looks up its inode, and initializes the necessary metadata.
       •     Write 10 KB of Data:
       •     The kernel splits the 10 KB of data into blocks. Since the block size is 4 KB, the file will require 3 blocks: two full 4 KB
             blocks and one partial 2 KB block.
•   Block Allocation:
       •     The kernel checks the block bitmap and allocates 3 blocks for the file. The block pointers in the inode are updated to
             point to these blocks.
•   Data Write:
       •     The kernel writes the first 4 KB of data to the first block, the next 4 KB to the second block, and the remaining 2 KB to
             the third block.
•   Update Inode:
       •     The inode is updated to reflect the new file size (10 KB) and the block pointers are updated. The modification time
             (i_mtime) is also updated.
•   Delayed Write:
       •     The data may initially be written to the page cache and only later written to the disk during a sync operation or a
             periodic flush.
Ext2: Inconsistency
•   Many operations results in multiple, independent writes to the file system
     • Example: append a block to an existing file
           • Update the free data bitmap
           • Update the inode
           • Write the user data
•   What happens if the computer crashes in the middle of this process?
Ext2: Inconsistency
•   The disk guarantees that sector writes are atomic
      • No way to make multi-sector writes atomic
•   How to ensure consistency after a crash?
      • Don’t bother to ensure consistency
            • Accept that the file system may be inconsistent after a crash
            • Run a program that fixes the file system during bootup
            • File system checker (fsck)
      • Use a transaction log to make multi-writes atomic
            • Log stores a history of all writes to the disk
            • After a crash the log can be “replayed” to finish updates
            • Journaling file system
Approach-1: Check file-system
•   Key idea: fix inconsistent file systems during bootup
      • Unix utility called fsck (chkdsk on Windows)
      • Scans the entire file system multiple times, identifying and correcting inconsistencies
•   Why during bootup?
      • No other file system activity can be going on
      • After fsck runs, bootup/mounting can continue
fsck
•   Superblock: validate the superblock, replace it with a backup if it is corrupted
•   Free blocks and inodes: rebuild the bitmaps by scanning all inodes
•   Reachability: make sure all inodes are reachable from the root of the file system
•   inodes: delete all corrupted inodes, and rebuild their link counts by walking the directory tree
•   directories: verify the integrity of all directories
•   … and many other minor consistency checks
•   Advantages of fsck
       •   Doesn’t require the file system to do any work to ensure consistency
       •   Makes the file system implementation simpler
•   Disadvantages of fsck
       •   Very complicated to implement the fsck program
       •   Many possible inconsistencies that must be identified
       •   Many difficult corner cases to consider and handle
       •   fsck is super slow
       •   Scans the entire file system multiple times
       •   Imagine how long it would take to fsck a 40 TB RAID array
Approach-2: Journaling
•   Key idea: make writes transactional by using a write-ahead log
      • Commonly referred to as a journal
      • Ext3 and NTFS use journaling
      • After the log is written, the writes execute normally
      • In essence, the log records transactions
Approach-2: Journaling
•   What happens after a crash…
     • If the writes to the log are interrupted?
           • The transaction is incomplete
           • The user’s data is lost, but the file system is consistent
     • If the writes to the log succeed, but the normal writes are interrupted?
           • The file system may be inconsistent, but…
           • The log has exactly the right information to fix the problem
Approach-2: Journaling
•   Advantages of journaling
      • Robust, fast file system recovery
      • No need to scan the entire journal or file system
      • Relatively straightforward to implement
•   Disadvantages of journaling
      • Write traffic to the disk is doubled
      • Especially the file data, which is probably large
B+ Tree   Use of B+ tree in File Systems
                                           21
inode and it’s problem
•   Recall: inodes use indirection to acquire additional blocks of pointers
•   Problem: inodes are not efficient for large files
•   Example: for a 100MB file, you need 25600 block pointers (assuming 4KB blocks)
•   This is unavoidable if the file is 100% fragmented
•   However, what if large groups of blocks are contiguous?
      • Extents are better suited for contiguous files
•   B Trees are widely used for file system representation (WAFL, ZFS, BTRFS)
      • Logarithmic time key search, insert and remove
      • Well represents sparse files
      • The File System as a large tree made up of fixed size pages
•   Shadowing:
      • Technique to support atomic updates over persistent data structures
      • Implement snapshots, crash recovery, write-batching, RAID
Shadowing
•   To update an on-disk page (the page is in the disk, not available in the memory)
      • Read the entire page in the memory
      • Modify the page
      • Write in an alternate location
•   When a page is shadowed, its location on the disk changes
      • Update (and shadow) the immediate ancestor of the
      • page with the new address
      • Propagates up to the file system root
Copy-on-Write (COW)
•   The core idea behind copy-on-write is to delay the copying of data until it is absolutely necessary,
    saving resources like memory and disk space. Instead of creating a full copy of data when it’s shared
    or duplicated, the system shares the same data with multiple processes or users and only creates a
    copy when someone tries to modify the shared data.
•   Some filesystems, like Btrfs, ZFS, and snapshots in EXT4 (with certain configurations), use
    copy-on-write to efficiently manage file modifications and snapshots.
•   Snapshots: When creating a snapshot, the filesystem doesn’t immediately duplicate the entire file or
    directory. Instead, it marks the snapshot as sharing the original data. If the original data is modified,
    the filesystem only copies the blocks being modified, leaving the unmodified blocks shared between
    the original file and the snapshot.
btrfs
        25
History
•   A file system based on COW principle -- initially designed at Oracle Corporation for use in Linux
•   The development began in 2007, since November 2013 it has been marked as stable
•   Principal Btrfs author: Chris Mason
•   “to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but
    also means being able to administer and to manage it with a clean interface.”
History & Basics
•   A file system based on COW principle -- initially designed at Oracle Corporation for use in Linux
•   The development began in 2007, since November 2013 it has been marked as stable
•   Principal Btrfs author: Chris Mason
•   “to let Linux scale for the storage that will be available. Scaling is not just about addressing the storage but
    also means being able to administer and to manage it with a clean interface.”
•   Page, block: A 4KB contiguous region on disk and in memory. This is the standard Linux page size.
•   Extent: A contiguous on-disk area. It is page aligned, and its length is a multiple of pages.
•   Copy-on-write (COW): Creating a new version of an extent or a page at a different location
      • The data is loaded from disk to memory, modified, and then written elsewhere
      • Do not update the original location in place, risking a power failure and partial update
BTRFS B-tree
•   A generic structure with three types of data structures: keys, items, and block headers
•   Block header: A fixed size data structure, holds fields like checksums, flags, filesystem ids, generation
    number, etc.
•   Key: describes an object address,
•   Item: is a key with additional offset and size fields.
•   Internal tree nodes hold only [key, block-pointer] pairs
•   Leaf nodes hold arrays of [item, data] pairs.
•   The offset field describes data held in an extent.
BTRFS B-tree
•   A leaf stores
      • an array of items in the beginning
      • a reverse sorted data array at the ends
      • These arrays grow towards each other.
BTRFS: Small files
•   Small files that occupy less than one leaf block are packed into the b-tree inside the extent item
      • Key offset if the byte offset of the data in the file
      • The size field indicates how much data is stored
BTRFS: Large files
•   Large files are stored in extents -- contiguous on-disk areas that hold user-data without additional headers
    or formatting
•   Extent maintains a [disk block, disk num blocks] pair to record the area of disk corresponding to the file.
BTRFS: Large files
•   A directory holds an array of dir_item elements
•   Maps a file name (string) to a 64bit object_id
•   Directory lookup index - [dir_item_key, filename 64bit hash]
BTRFS
•   BTRFS does its own device management
                                              BTRFS
          Traditional File Systems with DMs