KEMBAR78
Linux Kernel - Virtual File System | PDF
Virtual File System (VFS) in Linux Kernel
Adrian Huang | Nov, 2021
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr norandmaps
* Userspace: ASLR is disabled
* Legacy BIOS
Agenda
• Kernel components affected by an IO request
• Object relationship in VFS
• Ext4 file system format
• dentry (Directory Entry) and inode (Index Node)
• Example: ext4
• Anatomy of `mount` system call
• Anatomy of `ls` system call
Kernel components affected by an IO request
VFS
Page Cache/Buffer Cache
(fs/buffer.c)
Disk
filesystem
Disk
filesystem
Block Device
File
Mapping Layer
Generic Block Layer
I/O Scheduler Layer
Block Device Driver
Disk Disk Disk
submit_bio()
open(“/mnt/a.txt”, …) open(“/mnt/b.txt”, …)
Process A
File Object File Object File Object
open file table
dentry dentry
dentry cache
inode inode
inode cache
inode table
Kernel - VFS
dentry in data blocks
Disk
Object relationship in VFS
dentry cache list
• Linked via hlist_bl_{head,node} (hash
table with bit lock)
inode cache list
• Linked via hlist_{head,node} (hash table)
Note
Super
Block
Group
Descriptors
Reserved GDT
blocks
Block
bitmap
inode
bitmap
inode table:
inode descriptor
Data
Block
Data
Block
. . .
0 1
Total inode count
Total block count
Free blocks
Free inodes
…
Block Group 0 . . . Block Group N
Boot Sector
Partition
mode
Owner UID, Owner GID
i_atime (last access time)
i_ctime (last inode change time)
i_mtime (last data modification time)
size
ext4_extent_idx
ei_block
ei_leaf_lo
ei_leaf_hi
ei_unused
ext4_extent_header
eh_magic = 0xF30A
eh_entries
eh_max
eh_depth
eh_generation
ext4_extent_idx
ei_block
ei_leaf_lo
ei_leaf_hi
ei_unused
ext4_extent
ee_block
ee_len
ei_start_hi
ei_start_lo
ext4_extent
ee_block
ee_len
ei_start_hi
ei_start_lo
. . .
. . .
root
index node leaf node Disk blocks
Extent Tree
Ext4 file system format
Storage size addressed by a block group:
4096 (bytes) * 8 * 4096 (block size) = 128MB
Super
Block
Group
Descriptors
Reserved GDT
blocks
Block
bitmap
inode
bitmap
inode table:
inode descriptor
Data
Block
Data
Block
. . .
0 1
Total inode count
Total block count
Free blocks
Free inodes
…
Block Group 0 . . . Block Group N
Boot Sector
Partition
mode
Owner UID, Owner GID
i_atime (last access time)
i_ctime (last inode change time)
i_mtime (last data modification time)
size
ext4_extent_idx
ei_block
ei_leaf_lo
ei_leaf_hi
ei_unused
ext4_extent_header
eh_magic = 0xF30A
eh_entries
eh_max
eh_depth
eh_generation
ext4_extent_idx
ei_block
ei_leaf_lo
ei_leaf_hi
ei_unused
ext4_extent
ee_block
ee_len
ei_start_hi
ei_start_lo
ext4_extent
ee_block
ee_len
ei_start_hi
ei_start_lo
. . .
. . .
root
index node leaf node Disk blocks
Extent Tree
Ext4 file system format
inode# Purpose
0 Doesn't exist; there is no inode 0
1 List of defective blocks
2 Root directory
3 User quota
4 Group quota
5 Boot loader
6 Undelete directory
7 Reserved group descriptors inode. ("resize inode")
8 Journal inode
9 The "exclude" inode
10 Replica inode
11 First non-reserved inode. Usually this is the lost+found
directory. See s_first_ino in the superblock.
Relationship between inode and directory entry (dentry)
inode number link count
inode bitmap: 19 * 4096 = 77824 (0x13000)
inode 1-16 inode 17
`dumpe2fs 16MiB.img`
Relationship between inode and directory entry (dentry)
inode entry: 35 * 4096 + 128 * (2 – 1) [inode #2] = 0x23080
`dumpe2fs 16MiB.img`
inode structure data of inode #2
Relationship between inode and directory entry (dentry)
mode
Owner UID, Owner GID
i_atime
i_ctime
i_mtime
size
ext4_extent_header
eh_magic = 0xF30A
eh_entries = 1
eh_max = 4
eh_depth = 0
eh_generation = 0
root
Disk blocks
ext4_extent
ee_block = 0
ee_len = 0x0001
ei_start_hi = 0x0000
ei_start_lo = 0x00000004
Relationship between inode and dentry – root (inode #2)
ext4_inode (inode #2)
leaf node
. 2
name inode
Directory Entry (dentry)
Block #4: A list of struct ext4_dir_entry_2
.. 2 -> 2646185
lost+found 11
a123 12
a789 17
Block #0-3
Block #5
inode structure data of inode #2 Root Directory
dentry in data block of inode #2
type = directory
i_links_count = 5
Relationship between inode and dentry – ‘b456’ (inode #13)
i-node table
. 2
name inode
Directory Entry (dentry) - /
.. 2 -> 2646185
lost+found 11
a123 12
a789 17
2
12
. 12
name inode
.. 2
b456 13
hello.c 15
hello-hard-link.c 15
Directory Entry (dentry) - /a123
13
. 13
name inode
.. 12
test.c 16
test-hard-link.c 16
test-soft-link.c 14
Directory Entry (dentry) - /a123/b456
14
16
Data block pointers
type = directory
i_links_count = 3
Data block pointers
type = directory
i_links_count = 2
Data block pointers
type = symbolic link
i_links_count = 1
Data block pointers
= “test.c”
type = file
i_links_count = 2
Data block pointers
File Data
/a123/b456/test.c
size = 6
size = 100
inode #
type = directory
i_links_count = 5
Relationship between inode and dentry – ‘test.c’ (inode #16)
i-node table
. 2
name inode
Directory Entry (dentry) - /
.. 2 -> 2646185
lost+found 11
a123 12
a789 17
2
12
. 12
name inode
.. 2
b456 13
hello.c 15
hello-hard-link.c 15
Directory Entry (dentry) - /a123
13
. 13
name inode
.. 12
test.c 16
test-hard-link.c 16
test-soft-link.c 14
Directory Entry (dentry) - /a123/b456
14
16
Data block pointers
type = directory
i_links_count = 3
Data block pointers
type = directory
i_links_count = 2
Data block pointers
type = symbolic link
i_links_count = 1
Data block pointers
= “test.c”
type = file
i_links_count = 2
Data block pointers
File Data
/a123/b456/test.c
size = 6
size = 100
inode structure data of inode #16
inode entry: 35 * 4096 + 128 * (16 – 1) [inode #16] = 0x23780
address inode structure of inode #16
Data block pointer represented by
struct ext4_extent
i_links_count
inode #
type = directory
i_links_count = 5
Relationship between inode and dentry – ‘test-soft-link.c’ (inode #14)
i-node table
. 2
name inode
Directory Entry (dentry) - /
.. 2 -> 2646185
lost+found 11
a123 12
a789 17
2
12
. 12
name inode
.. 2
b456 13
hello.c 15
hello-hard-link.c 15
Directory Entry (dentry) - /a123
13
. 13
name inode
.. 12
test.c 16
test-hard-link.c 16
test-soft-link.c 14
Directory Entry (dentry) - /a123/b456
14
16
Data block pointers
type = directory
i_links_count = 3
Data block pointers
type = directory
i_links_count = 2
Data block pointers
type = symbolic link
i_links_count = 1
Data block pointers
= “test.c”
type = file
i_links_count = 2
Data block pointers
File Data
/a123/b456/test.c
size = 6
size = 100
inode structure data of inode #14
inode entry: 35 * 4096 + 128 * (14 – 1) [inode #14] = 0x23680
address inode structure of inode #14
i_links_count
Data block pointer
Hard Link
1. Refer to an inode number
2. Must reside on the same file system
3. Cannot be made to a directory
Soft Link
1. Refer to a filename instead of an inode number
2. Can be linked to a file in a different file system
3. Can be made to a directory
inode #
mount – mount a new ext4 file system
• stat() – get file status from inode
o Name resolution → get dentry → get inode
• openat()
o Name resolution
o Update open file table
• mount()
o Name resolution
o Add a mount entry to ‘mount’ hash table
o Parse file system
stat("/adrian/img/src.img“, …)
user_path_at(): fill a path struct (dentry and mount info) after resolving a pathname
vfsmount
mnt_root
mnt_sb
dentry
d_name
path
mnt
dentry
d_inode
path struct
• stat() – get file status from inode
o Name resolution → get dentry → get inode
stat("/adrian/img/src.img“, …)
1. namei: convert a “name” to an “inode”
2. nameidata: store the current status when walking a path
nameidata
last
name
root
inode
path
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
stat("/adrian/img/src.img“, …)
path_lookupat():
1. Configure members of nameidata struct based on absolute pathname or relative pathname
2. Name resolution
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
path
mnt
dentry
current->fs->root
d_inode
* Data storage (not a data pointer)
Note
link_path_walk(): name resolution – round 1
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
* Data storage (not a data pointer)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
path
mnt
dentry
current->fs->root
d_inode
qstr
name =
“adrian/img/src.img”
u32 hash
u32 len = 6
u64 hash_len
union
variable ‘name’ = “adrian/img/src.img”
dentry
d_name
d_inode
path
mnt
dentry
variable ‘name’ = “img/src.img”
hash
1
2
3
4
5
6
Note
local variable
link_path_walk()
• stop until the last dentry
is resolved
• for-loop in this function
qstr
name = “adrian”
u32 hash
u32 len = 6
u64 hash_len
union
link_path_walk(): name resolution – round 1
nameidata
last (S*)
name
root (S*)
inode
path (S*)
* Data storage (not a data pointer)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
path
mnt
dentry
current->fs->root
qstr
name =
“adrian/img/src.img”
u32 hash
u32 len = 6
u64 hash_len
union
variable ‘name’ = “adrian/img/src.img”
dentry
d_name
d_inode
path
mnt
dentry
variable ‘name’ = “img/src.img”
hash
1
2
3
4
5
6
Note
local variable
lookup_slow()
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
* Data storage (not a data pointer)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
path
mnt
dentry
current->fs->root
d_inode
qstr
name =
“img/src.img”
u32 hash
u32 len = 3
u64 hash_len
union
variable ‘name’ = “img/src.img”
dentry
d_name
d_inode
path
mnt
dentry
variable ‘name’ = “src.img”
hash
1
2
3
4
5
6
Note
link_path_walk(): name resolution – round 2
qstr
name = “img”
u32 hash
u32 len = 3
u64 hash_len
union
link_path_walk()
• stop until the last dentry
is resolved
• for-loop in this function
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
* Data storage (not a data pointer)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
path
mnt
dentry
current->fs->root
d_inode
qstr
name = “src.img”
u32 hash
u32 len = 3
u64 hash_len
union
variable ‘name’ = “src.img”
dentry
d_name
d_inode
path
mnt
dentry
variable ‘name’ = “”
2
Note
Function return
link_path_walk(): name resolution – round 3
1
hash
qstr
name = “img”
u32 hash
u32 len = 3
u64 hash_len
union
link_path_walk()
• stop until the last dentry
is resolved
• for-loop in this function
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
path
mnt
dentry
current->fs->root
d_inode
qstr
name = “src.img”
u32 hash
u32 len = 6
u64 hash_len
union
variable ‘name’ = “src.img”
dentry
d_name
qstr
name = “src.img”
u32 hash
u32 len = 6
u64 hash_len
union
d_inode
path
mnt
dentry
variable ‘name’ = “”
3
4
5
6
Function return
lookup_last(): name resolution
hash
1
2
* Data storage (not a data pointer)
Note
lookup_last()
• get the corresponding dentry
based on the filename
• “while loop”: Follow the
symlink in the final
component
vfsmount
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len
u64 hash_len
union
nameidata
last (S*)
name
root (S*)
inode
path (S*)
task_struct
mm
nameidata
filename
name
uptr (userland pointer)
refcnt
aname
char iname[] =
“/adrian/img/src.img"
char __user *name =
“/adrian/img/src.img”
local variable
path
mnt
dentry
current->fs->root
d_inode
qstr
name = “src.img”
u32 hash
u32 len = 6
u64 hash_len
union
variable ‘name’ = “src.img”
dentry
d_name
qstr
name = “src.img”
u32 hash
u32 len = 6
u64 hash_len
union
d_inode
path
mnt
dentry
variable ‘name’ = “”
3
4
5
6
Function return
lookup_last(): name resolution
hash
1
2
* Data storage (not a data pointer)
Note
mount – mount a new ext4 file system
hlist_head.first
hlist_head.first
.
.
mount_hashtable mount
mnt_parent
struct dentry
*mnt_mountpoint
struct vfsmount mnt
**pprev *next
const char *mnt_devname =
“none”
mount
mnt_parent
struct dentry
*mnt_mountpoint
**pprev
struct hlist_node mnt_hash
*next
const char *mnt_devname =
“none”
struct hlist_node mnt_hash
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ramfs_sops
s_root
s_bdev
s_id[32] = “rootfs”
s_fs_info
s_type = rootfs_fs_type
d_parent
mnt_root
mnt_sb
struct vfsmount mnt
mount – before mounting a new ext4 file system
d_sb
hlist_head.first
hlist_head.first
.
.
mount_hashtable mount
mnt_parent
struct dentry
*mnt_mountpoint
struct vfsmount mnt
**pprev *next
const char *mnt_devname =
“none”
mount
mnt_parent
struct dentry
*mnt_mountpoint
**pprev
struct hlist_node mnt_hash
*next
const char *mnt_devname =
“none”
struct hlist_node mnt_hash
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ramfs_sops
s_root
s_bdev
s_id[32] = “rootfs”
s_fs_info
s_type = rootfs_fs_type
d_parent
mnt_root
mnt_sb
struct vfsmount mnt
mount – after mounting a new ext4 file system
mount
mnt_parent
struct dentry
*mnt_mountpoint
**pprev *next
const char *mnt_devname =
“/dev/loop0”
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
struct vfsmount mnt
super_block
s_op = ext4_sops
s_id[32] = “loop0”
s_type = ext4_fs_type
d_parent
struct hlist_node mnt_hash
dentry
d_name
d_parent
qstr
name = “mnt”
u32 hash
u32 len = 3
u64 hash_len
union
/adrian/mnt
d_sb
d_sb
hlist_head.first
hlist_head.first
.
.
mount_hashtable mount
mnt_parent
struct dentry
*mnt_mountpoint
struct vfsmount mnt
**pprev *next
const char *mnt_devname =
“none”
mount
mnt_parent
struct dentry
*mnt_mountpoint
**pprev
struct hlist_node mnt_hash
*next
const char *mnt_devname =
“none”
struct hlist_node mnt_hash
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ramfs_sops
s_root
s_bdev
s_id[32] = “rootfs”
s_fs_info
s_type = rootfs_fs_type
d_parent
mnt_root
mnt_sb
struct vfsmount mnt
mount – after mounting a new ext4 file system
mount
mnt_parent
struct dentry
*mnt_mountpoint
**pprev *next
const char *mnt_devname =
“/dev/loop0”
mnt_root
mnt_sb
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
struct vfsmount mnt
super_block
s_op = ext4_sops
s_id[32] = “loop0”
s_type = ext4_fs_type
d_parent
struct hlist_node mnt_hash
dentry
d_name
d_parent
qstr
name = “mnt”
u32 hash
u32 len = 3
u64 hash_len
union
/adrian/mnt
d_sb
d_sb
`ls –color mnt`
`ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …)
`ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …)
`ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …)
path
mnt
dentry
vfsmount
mnt_root
mnt_sb
qstr
name = “mnt”
u32 hash
u32 len = 3
u64 hash_len
union
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ramfs_ops
s_root
s_bdev
s_id[32] = “rootfs”
s_fs_info
d_inode
dentry
d_name
d_inode
d_parent
s_type = rootfs_fs_type
path
mnt
dentry
vfsmount
mnt_root
mnt_sb
qstr
name = “mnt /”
u32 hash
u32 len = 3 1
u64 hash_len
union
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ext4_sops
s_root
s_bdev
s_id[32] = “loop0”
s_fs_info
d_inode
dentry
d_name
d_inode
d_parent
s_type = ext4_fs_type
/adrian/mnt data structures in the original file system /adrian/mnt data structures in newly mounted file system
traverse_mounts()
`ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …)
The original mount point is changed the newly one when running name resolution
path
mnt
dentry
vfsmount
mnt_root
mnt_sb
qstr
name = “mnt /”
u32 hash
u32 len = 3 1
u64 hash_len
union
dentry
d_name
qstr
name = “/”
u32 hash
u32 len = 1
u64 hash_len
union
super_block
s_blocksize
s_maxbytes
s_op = ext4_sops
s_root
s_bdev
s_id[32] = “loop0”
s_fs_info
d_inode
dentry
d_name
d_inode
d_parent
s_type = ext4_fs_type
/adrian/mnt data structures in newly mounted file system
flags: path->dentry->d_flags
`ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …)
REF-walk
[Two concurrent mechanisms for name resolution]
1. REF-walk: concurrent management with refcounts and spinlock
2. RCU-walk: faster name resolution lookup

Linux Kernel - Virtual File System

  • 1.
    Virtual File System(VFS) in Linux Kernel Adrian Huang | Nov, 2021 * Based on kernel 5.11 (x86_64) – QEMU * SMP (4 CPUs) and 8GB memory * Kernel parameter: nokaslr norandmaps * Userspace: ASLR is disabled * Legacy BIOS
  • 2.
    Agenda • Kernel componentsaffected by an IO request • Object relationship in VFS • Ext4 file system format • dentry (Directory Entry) and inode (Index Node) • Example: ext4 • Anatomy of `mount` system call • Anatomy of `ls` system call
  • 3.
    Kernel components affectedby an IO request VFS Page Cache/Buffer Cache (fs/buffer.c) Disk filesystem Disk filesystem Block Device File Mapping Layer Generic Block Layer I/O Scheduler Layer Block Device Driver Disk Disk Disk submit_bio()
  • 4.
    open(“/mnt/a.txt”, …) open(“/mnt/b.txt”,…) Process A File Object File Object File Object open file table dentry dentry dentry cache inode inode inode cache inode table Kernel - VFS dentry in data blocks Disk Object relationship in VFS dentry cache list • Linked via hlist_bl_{head,node} (hash table with bit lock) inode cache list • Linked via hlist_{head,node} (hash table) Note
  • 5.
    Super Block Group Descriptors Reserved GDT blocks Block bitmap inode bitmap inode table: inodedescriptor Data Block Data Block . . . 0 1 Total inode count Total block count Free blocks Free inodes … Block Group 0 . . . Block Group N Boot Sector Partition mode Owner UID, Owner GID i_atime (last access time) i_ctime (last inode change time) i_mtime (last data modification time) size ext4_extent_idx ei_block ei_leaf_lo ei_leaf_hi ei_unused ext4_extent_header eh_magic = 0xF30A eh_entries eh_max eh_depth eh_generation ext4_extent_idx ei_block ei_leaf_lo ei_leaf_hi ei_unused ext4_extent ee_block ee_len ei_start_hi ei_start_lo ext4_extent ee_block ee_len ei_start_hi ei_start_lo . . . . . . root index node leaf node Disk blocks Extent Tree Ext4 file system format Storage size addressed by a block group: 4096 (bytes) * 8 * 4096 (block size) = 128MB
  • 6.
    Super Block Group Descriptors Reserved GDT blocks Block bitmap inode bitmap inode table: inodedescriptor Data Block Data Block . . . 0 1 Total inode count Total block count Free blocks Free inodes … Block Group 0 . . . Block Group N Boot Sector Partition mode Owner UID, Owner GID i_atime (last access time) i_ctime (last inode change time) i_mtime (last data modification time) size ext4_extent_idx ei_block ei_leaf_lo ei_leaf_hi ei_unused ext4_extent_header eh_magic = 0xF30A eh_entries eh_max eh_depth eh_generation ext4_extent_idx ei_block ei_leaf_lo ei_leaf_hi ei_unused ext4_extent ee_block ee_len ei_start_hi ei_start_lo ext4_extent ee_block ee_len ei_start_hi ei_start_lo . . . . . . root index node leaf node Disk blocks Extent Tree Ext4 file system format
  • 7.
    inode# Purpose 0 Doesn'texist; there is no inode 0 1 List of defective blocks 2 Root directory 3 User quota 4 Group quota 5 Boot loader 6 Undelete directory 7 Reserved group descriptors inode. ("resize inode") 8 Journal inode 9 The "exclude" inode 10 Replica inode 11 First non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock. Relationship between inode and directory entry (dentry) inode number link count
  • 8.
    inode bitmap: 19* 4096 = 77824 (0x13000) inode 1-16 inode 17 `dumpe2fs 16MiB.img` Relationship between inode and directory entry (dentry)
  • 9.
    inode entry: 35* 4096 + 128 * (2 – 1) [inode #2] = 0x23080 `dumpe2fs 16MiB.img` inode structure data of inode #2 Relationship between inode and directory entry (dentry)
  • 10.
    mode Owner UID, OwnerGID i_atime i_ctime i_mtime size ext4_extent_header eh_magic = 0xF30A eh_entries = 1 eh_max = 4 eh_depth = 0 eh_generation = 0 root Disk blocks ext4_extent ee_block = 0 ee_len = 0x0001 ei_start_hi = 0x0000 ei_start_lo = 0x00000004 Relationship between inode and dentry – root (inode #2) ext4_inode (inode #2) leaf node . 2 name inode Directory Entry (dentry) Block #4: A list of struct ext4_dir_entry_2 .. 2 -> 2646185 lost+found 11 a123 12 a789 17 Block #0-3 Block #5 inode structure data of inode #2 Root Directory dentry in data block of inode #2
  • 11.
    type = directory i_links_count= 5 Relationship between inode and dentry – ‘b456’ (inode #13) i-node table . 2 name inode Directory Entry (dentry) - / .. 2 -> 2646185 lost+found 11 a123 12 a789 17 2 12 . 12 name inode .. 2 b456 13 hello.c 15 hello-hard-link.c 15 Directory Entry (dentry) - /a123 13 . 13 name inode .. 12 test.c 16 test-hard-link.c 16 test-soft-link.c 14 Directory Entry (dentry) - /a123/b456 14 16 Data block pointers type = directory i_links_count = 3 Data block pointers type = directory i_links_count = 2 Data block pointers type = symbolic link i_links_count = 1 Data block pointers = “test.c” type = file i_links_count = 2 Data block pointers File Data /a123/b456/test.c size = 6 size = 100 inode #
  • 12.
    type = directory i_links_count= 5 Relationship between inode and dentry – ‘test.c’ (inode #16) i-node table . 2 name inode Directory Entry (dentry) - / .. 2 -> 2646185 lost+found 11 a123 12 a789 17 2 12 . 12 name inode .. 2 b456 13 hello.c 15 hello-hard-link.c 15 Directory Entry (dentry) - /a123 13 . 13 name inode .. 12 test.c 16 test-hard-link.c 16 test-soft-link.c 14 Directory Entry (dentry) - /a123/b456 14 16 Data block pointers type = directory i_links_count = 3 Data block pointers type = directory i_links_count = 2 Data block pointers type = symbolic link i_links_count = 1 Data block pointers = “test.c” type = file i_links_count = 2 Data block pointers File Data /a123/b456/test.c size = 6 size = 100 inode structure data of inode #16 inode entry: 35 * 4096 + 128 * (16 – 1) [inode #16] = 0x23780 address inode structure of inode #16 Data block pointer represented by struct ext4_extent i_links_count inode #
  • 13.
    type = directory i_links_count= 5 Relationship between inode and dentry – ‘test-soft-link.c’ (inode #14) i-node table . 2 name inode Directory Entry (dentry) - / .. 2 -> 2646185 lost+found 11 a123 12 a789 17 2 12 . 12 name inode .. 2 b456 13 hello.c 15 hello-hard-link.c 15 Directory Entry (dentry) - /a123 13 . 13 name inode .. 12 test.c 16 test-hard-link.c 16 test-soft-link.c 14 Directory Entry (dentry) - /a123/b456 14 16 Data block pointers type = directory i_links_count = 3 Data block pointers type = directory i_links_count = 2 Data block pointers type = symbolic link i_links_count = 1 Data block pointers = “test.c” type = file i_links_count = 2 Data block pointers File Data /a123/b456/test.c size = 6 size = 100 inode structure data of inode #14 inode entry: 35 * 4096 + 128 * (14 – 1) [inode #14] = 0x23680 address inode structure of inode #14 i_links_count Data block pointer Hard Link 1. Refer to an inode number 2. Must reside on the same file system 3. Cannot be made to a directory Soft Link 1. Refer to a filename instead of an inode number 2. Can be linked to a file in a different file system 3. Can be made to a directory inode #
  • 14.
    mount – mounta new ext4 file system • stat() – get file status from inode o Name resolution → get dentry → get inode • openat() o Name resolution o Update open file table • mount() o Name resolution o Add a mount entry to ‘mount’ hash table o Parse file system
  • 15.
    stat("/adrian/img/src.img“, …) user_path_at(): filla path struct (dentry and mount info) after resolving a pathname vfsmount mnt_root mnt_sb dentry d_name path mnt dentry d_inode path struct • stat() – get file status from inode o Name resolution → get dentry → get inode
  • 16.
    stat("/adrian/img/src.img“, …) 1. namei:convert a “name” to an “inode” 2. nameidata: store the current status when walking a path nameidata last name root inode path task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable
  • 17.
    stat("/adrian/img/src.img“, …) path_lookupat(): 1. Configuremembers of nameidata struct based on absolute pathname or relative pathname 2. Name resolution vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32 hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable path mnt dentry current->fs->root d_inode * Data storage (not a data pointer) Note
  • 18.
    link_path_walk(): name resolution– round 1 vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32 hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) * Data storage (not a data pointer) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” path mnt dentry current->fs->root d_inode qstr name = “adrian/img/src.img” u32 hash u32 len = 6 u64 hash_len union variable ‘name’ = “adrian/img/src.img” dentry d_name d_inode path mnt dentry variable ‘name’ = “img/src.img” hash 1 2 3 4 5 6 Note local variable link_path_walk() • stop until the last dentry is resolved • for-loop in this function qstr name = “adrian” u32 hash u32 len = 6 u64 hash_len union
  • 19.
    link_path_walk(): name resolution– round 1 nameidata last (S*) name root (S*) inode path (S*) * Data storage (not a data pointer) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” path mnt dentry current->fs->root qstr name = “adrian/img/src.img” u32 hash u32 len = 6 u64 hash_len union variable ‘name’ = “adrian/img/src.img” dentry d_name d_inode path mnt dentry variable ‘name’ = “img/src.img” hash 1 2 3 4 5 6 Note local variable lookup_slow()
  • 20.
    vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) * Data storage (not a data pointer) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable path mnt dentry current->fs->root d_inode qstr name = “img/src.img” u32 hash u32 len = 3 u64 hash_len union variable ‘name’ = “img/src.img” dentry d_name d_inode path mnt dentry variable ‘name’ = “src.img” hash 1 2 3 4 5 6 Note link_path_walk(): name resolution – round 2 qstr name = “img” u32 hash u32 len = 3 u64 hash_len union link_path_walk() • stop until the last dentry is resolved • for-loop in this function
  • 21.
    vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) * Data storage (not a data pointer) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable path mnt dentry current->fs->root d_inode qstr name = “src.img” u32 hash u32 len = 3 u64 hash_len union variable ‘name’ = “src.img” dentry d_name d_inode path mnt dentry variable ‘name’ = “” 2 Note Function return link_path_walk(): name resolution – round 3 1 hash qstr name = “img” u32 hash u32 len = 3 u64 hash_len union link_path_walk() • stop until the last dentry is resolved • for-loop in this function
  • 22.
    vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable path mnt dentry current->fs->root d_inode qstr name = “src.img” u32 hash u32 len = 6 u64 hash_len union variable ‘name’ = “src.img” dentry d_name qstr name = “src.img” u32 hash u32 len = 6 u64 hash_len union d_inode path mnt dentry variable ‘name’ = “” 3 4 5 6 Function return lookup_last(): name resolution hash 1 2 * Data storage (not a data pointer) Note lookup_last() • get the corresponding dentry based on the filename • “while loop”: Follow the symlink in the final component
  • 23.
    vfsmount mnt_root mnt_sb dentry d_name qstr name = “/” u32hash u32 len u64 hash_len union nameidata last (S*) name root (S*) inode path (S*) task_struct mm nameidata filename name uptr (userland pointer) refcnt aname char iname[] = “/adrian/img/src.img" char __user *name = “/adrian/img/src.img” local variable path mnt dentry current->fs->root d_inode qstr name = “src.img” u32 hash u32 len = 6 u64 hash_len union variable ‘name’ = “src.img” dentry d_name qstr name = “src.img” u32 hash u32 len = 6 u64 hash_len union d_inode path mnt dentry variable ‘name’ = “” 3 4 5 6 Function return lookup_last(): name resolution hash 1 2 * Data storage (not a data pointer) Note
  • 24.
    mount – mounta new ext4 file system
  • 25.
    hlist_head.first hlist_head.first . . mount_hashtable mount mnt_parent struct dentry *mnt_mountpoint structvfsmount mnt **pprev *next const char *mnt_devname = “none” mount mnt_parent struct dentry *mnt_mountpoint **pprev struct hlist_node mnt_hash *next const char *mnt_devname = “none” struct hlist_node mnt_hash dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ramfs_sops s_root s_bdev s_id[32] = “rootfs” s_fs_info s_type = rootfs_fs_type d_parent mnt_root mnt_sb struct vfsmount mnt mount – before mounting a new ext4 file system d_sb
  • 26.
    hlist_head.first hlist_head.first . . mount_hashtable mount mnt_parent struct dentry *mnt_mountpoint structvfsmount mnt **pprev *next const char *mnt_devname = “none” mount mnt_parent struct dentry *mnt_mountpoint **pprev struct hlist_node mnt_hash *next const char *mnt_devname = “none” struct hlist_node mnt_hash dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ramfs_sops s_root s_bdev s_id[32] = “rootfs” s_fs_info s_type = rootfs_fs_type d_parent mnt_root mnt_sb struct vfsmount mnt mount – after mounting a new ext4 file system mount mnt_parent struct dentry *mnt_mountpoint **pprev *next const char *mnt_devname = “/dev/loop0” mnt_root mnt_sb dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union struct vfsmount mnt super_block s_op = ext4_sops s_id[32] = “loop0” s_type = ext4_fs_type d_parent struct hlist_node mnt_hash dentry d_name d_parent qstr name = “mnt” u32 hash u32 len = 3 u64 hash_len union /adrian/mnt d_sb d_sb
  • 27.
    hlist_head.first hlist_head.first . . mount_hashtable mount mnt_parent struct dentry *mnt_mountpoint structvfsmount mnt **pprev *next const char *mnt_devname = “none” mount mnt_parent struct dentry *mnt_mountpoint **pprev struct hlist_node mnt_hash *next const char *mnt_devname = “none” struct hlist_node mnt_hash dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ramfs_sops s_root s_bdev s_id[32] = “rootfs” s_fs_info s_type = rootfs_fs_type d_parent mnt_root mnt_sb struct vfsmount mnt mount – after mounting a new ext4 file system mount mnt_parent struct dentry *mnt_mountpoint **pprev *next const char *mnt_devname = “/dev/loop0” mnt_root mnt_sb dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union struct vfsmount mnt super_block s_op = ext4_sops s_id[32] = “loop0” s_type = ext4_fs_type d_parent struct hlist_node mnt_hash dentry d_name d_parent qstr name = “mnt” u32 hash u32 len = 3 u64 hash_len union /adrian/mnt d_sb d_sb
  • 28.
  • 29.
    `ls –color mnt`:openat(AT_FDCWD, "/adrian/mnt“, …)
  • 30.
    `ls –color mnt`:openat(AT_FDCWD, "/adrian/mnt“, …)
  • 31.
    `ls –color mnt`:openat(AT_FDCWD, "/adrian/mnt“, …)
  • 32.
    path mnt dentry vfsmount mnt_root mnt_sb qstr name = “mnt” u32hash u32 len = 3 u64 hash_len union dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ramfs_ops s_root s_bdev s_id[32] = “rootfs” s_fs_info d_inode dentry d_name d_inode d_parent s_type = rootfs_fs_type path mnt dentry vfsmount mnt_root mnt_sb qstr name = “mnt /” u32 hash u32 len = 3 1 u64 hash_len union dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ext4_sops s_root s_bdev s_id[32] = “loop0” s_fs_info d_inode dentry d_name d_inode d_parent s_type = ext4_fs_type /adrian/mnt data structures in the original file system /adrian/mnt data structures in newly mounted file system traverse_mounts() `ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …) The original mount point is changed the newly one when running name resolution
  • 33.
    path mnt dentry vfsmount mnt_root mnt_sb qstr name = “mnt/” u32 hash u32 len = 3 1 u64 hash_len union dentry d_name qstr name = “/” u32 hash u32 len = 1 u64 hash_len union super_block s_blocksize s_maxbytes s_op = ext4_sops s_root s_bdev s_id[32] = “loop0” s_fs_info d_inode dentry d_name d_inode d_parent s_type = ext4_fs_type /adrian/mnt data structures in newly mounted file system flags: path->dentry->d_flags `ls –color mnt`: openat(AT_FDCWD, "/adrian/mnt“, …) REF-walk [Two concurrent mechanisms for name resolution] 1. REF-walk: concurrent management with refcounts and spinlock 2. RCU-walk: faster name resolution lookup