Physical Memory Models: the ways the Linux
kernel addresses physical memory (physical
page frames)
Adrian Huang | June, 2022
* Kernel 5.11 (x86_64)
Agenda
• Four physical memory models
✓Purpose: page descriptor <-> PFN (Page Frame Number)
• Sparse memory model
• Sparse Memory Virtual Memmap: subsection
• page->flags
Four Physical Memory Models
• Flat Memory Model (CONFIG_FLATMEM)
✓UMA (Uniform Memory Access) with mostly contiguous physical memory
• Discontinuous Memory Model (CONFIG_DISCONTIGMEM)
✓NUMA (Non-Uniform Memory Access) with mostly contiguous physical memory
✓Removed since v5.14 because the sparse memory model can cover this scope
• https://lore.kernel.org/linux-mm/20210602105348.13387-1-rppt@kernel.org/
• Sparse Memory (CONFIG_SPARSEMEM)
✓NUMA with discontiguous physical memory
• Sparse Memory Virtual Memmap (CONFIG_SPARSEMEM_VMEMMAP)
✓NUMA with discontiguous physical memory: a quick way to get the page struct and PFN
Memory Model – Flat Memory
(Diagram: ‘struct page *mem_map’ — a page structure array in kernel virtual address space, struct page #0..#n — mapped 1:1 to page frames #0..#n in physical memory)
Note
1. [mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames
✓ Allocate/init page structures based on the node's memory info (struct pglist_data)
▪ Refer to: pglist_data.node_start_pfn & pglist_data.node_spanned_pages
2. Scenario: contiguous page frames (no memory holes) in UMA
3. Drawbacks
✓ Wastes node_mem_map space if there are memory holes
✓ Does not support memory hotplug
4. Check the kernel function alloc_node_mem_map() in mm/page_alloc.c
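The flat model's page <-> PFN conversion is plain array indexing. A minimal user-space sketch — the scaled-down mem_map and struct page here are hypothetical stand-ins; the real macros live in include/asm-generic/memory_model.h:

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's struct page. */
struct page { unsigned long flags; };

/* First valid PFN; 0 on x86. */
#define ARCH_PFN_OFFSET 0UL

/* FLATMEM: one global array, one descriptor per page frame. */
static struct page mem_map[16];

/* pfn <-> page is plain array indexing into mem_map[]. */
static struct page *pfn_to_page(unsigned long pfn)
{
    return &mem_map[pfn - ARCH_PFN_OFFSET];
}

static unsigned long page_to_pfn(struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

This is exactly why flat mode wastes space on holes: every PFN in the span needs a slot in mem_map, present or not.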
Memory Model – Discontinuous Memory
(Diagram: ‘struct pglist_data *node_data[]’ — a per-node pointer array in kernel virtual address space; each node's node_mem_map holds struct page #0..#n for that node's page frames, e.g. Node #0: page frames #0-#999, Node #1: page frames #1000-…)
Note
1. [node_mem_map] Dynamic page structure: pre-allocate all page structures based on the number of page frames
✓ Allocate/init page structures based on the node's memory info (struct pglist_data)
▪ Refer to: pglist_data.node_start_pfn & pglist_data.node_spanned_pages
2. Scenario: each node has contiguous page frames (no memory holes) in NUMA
3. Drawbacks
✓ Wastes node_mem_map space if there are memory holes
✓ Does not support memory hotplug
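The per-node lookup can be sketched as below — a user-space model with hypothetical node data and a trimmed-down pglist_data; the real kernel resolves the owning node via pfn_to_nid() rather than a linear search:

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };

/* Hypothetical, trimmed-down pglist_data: one node_mem_map per node. */
struct pglist_data {
    struct page  *node_mem_map;
    unsigned long node_start_pfn;
    unsigned long node_spanned_pages;
};

static struct page node0_pages[8], node1_pages[8];
static struct pglist_data node_data[2] = {
    { node0_pages, 0,  8 },   /* node 0: PFN 0..7   */
    { node1_pages, 16, 8 },   /* node 1: PFN 16..23 */
};

/* DISCONTIGMEM idea: find the owning node, then index its map. */
static struct page *pfn_to_page(unsigned long pfn)
{
    int nid;
    for (nid = 0; nid < 2; nid++) {
        struct pglist_data *pgdat = &node_data[nid];
        if (pfn >= pgdat->node_start_pfn &&
            pfn <  pgdat->node_start_pfn + pgdat->node_spanned_pages)
            return &pgdat->node_mem_map[pfn - pgdat->node_start_pfn];
    }
    return NULL;  /* PFN falls in a hole between nodes */
}
```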
Memory Model – Sparse Memory
(Diagram: global **mem_section two-dimensional array — an array of ‘struct mem_section *’ roots, each pointing to an array of ‘struct mem_section’ entries; every present section's struct page array (#0..#n, #m..#m+n-1) covers that section's page frames; Node #1 shown as hotplug)
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB
4. [NUMA] Mitigates memory-hole waste because page structures are only allocated per present “struct mem_section”
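The section-based lookup can be sketched like this — a hypothetical scaled-down geometry of 4 pages per section; the real kernel encodes node and flag bits into section_mem_map rather than storing a raw pointer:

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };

/* Hypothetical scaled-down geometry: 4 pages per section. */
#define PFN_SECTION_SHIFT 2
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)

struct mem_section { struct page *section_mem_map; };

/* Only present sections carry a page array (section #1 is a hole). */
static struct page sec0_pages[PAGES_PER_SECTION];
static struct page sec2_pages[PAGES_PER_SECTION];
static struct mem_section sections[3] = {
    { sec0_pages }, { NULL }, { sec2_pages },
};

/* SPARSEMEM idea: pfn -> section number -> that section's page array.
 * (The real kernel encodes the map so indexing by the full pfn works;
 * here we mask off the in-section offset for clarity.) */
static struct page *pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn >> PFN_SECTION_SHIFT;
    struct mem_section *ms = &sections[nr];

    if (!ms->section_mem_map)
        return NULL;              /* memory hole: section not present */
    return &ms->section_mem_map[pfn & (PAGES_PER_SECTION - 1)];
}
```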
Memory Model – Sparse Memory Virtual Memmap
(Diagram: same **mem_section two-dimensional array as the sparse model, but every section's struct page array lives in the contiguous vmemmap virtual area — struct page #0 .. #m+n-1 — so page <-> PFN becomes simple pointer arithmetic)
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB
4. [NUMA] Mitigates memory-hole waste because page structures are only allocated per present “struct mem_section”
5. Employs a virtual memory map (vmemmap/vmemmap_base) – a quick way to get the page struct and PFN
6. Default configuration in the Linux kernel
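With vmemmap the conversion collapses to pointer arithmetic. A user-space sketch with a hypothetical small vmemmap array standing in for the 1TB virtual area at vmemmap_base (in the kernel that area is sparsely backed, with real pages mapped only behind present sections):

```c
#include <assert.h>

struct page { unsigned long flags; };

/* Hypothetical stand-in for the 1TB virtual memory map area. */
static struct page vmemmap[64];

/* CONFIG_SPARSEMEM_VMEMMAP: pfn <-> page is pure pointer arithmetic,
 * with no mem_section walk on the lookup path. */
#define pfn_to_page(pfn)  (&vmemmap[(pfn)])
#define page_to_pfn(page) ((unsigned long)((page) - vmemmap))
```

This is the "quick way" the slide refers to: the per-section indirection still exists for hotplug bookkeeping, but not on the hot conversion path.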
Memory Model – Sparse Memory Virtual Memmap: Detail
(Diagram: PFN field layout — bits [14:0] index the page within a section (PAGES_PER_SECTION = 32768), bits [22:15] the section within a root (SECTIONS_PER_ROOT = 256), bits [33:23] the root (NR_SECTION_ROOTS = 2048). Each present ‘struct mem_section’ has its 32768 struct pages (covering 128 MB of memory) placed at the matching offset in the vmemmap area, and 128 MB sections can be hot-added or hot-removed independently)
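The bit-field split above can be checked numerically. The constants are the x86_64 values from the diagram (SECTIONS_PER_ROOT is normally PAGE_SIZE / sizeof(struct mem_section) = 4096 / 16 = 256); the helper names are illustrative:

```c
#include <assert.h>

#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 27                                 /* 128MB  */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)   /* 15     */
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)         /* 32768  */
#define SECTIONS_PER_ROOT 256UL
#define NR_SECTION_ROOTS  2048UL

/* Decompose a PFN into (root, section-in-root, page-in-section). */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}
static unsigned long section_nr_to_root(unsigned long nr)
{
    return nr / SECTIONS_PER_ROOT;
}
static unsigned long section_nr_to_index(unsigned long nr)
{
    return nr % SECTIONS_PER_ROOT;
}
```

For example, physical address 0x4_5000_0000 (the node-1 base in the later memblock example) is PFN 0x450000 = 4521984, which lands in section #138 of root #0.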
Sparse Memory Model
1. How to know available memory pages in a system?
2. Page Table Configuration for Direct Mapping
3. Sparse Memory Model Initialization – Detail
How to know available memory pages in a system?
BIOS e820 → memblock (e820__memblock_setup()) → zone page frame allocator (__free_pages_core())
[Call path] memblock frees the available memory space to the zone page frame allocator
The zone page frame allocator detail will be discussed in another session: physical memory management
setup_arch() -- Focus on memory portion
setup_arch
Reserve memblock for kernel code +
data/bss sections, page #0 and init ramdisk
e820__memory_setup
Setup init_mm struct for members
‘start_code’, ‘end_code’, ‘end_data’ and ‘brk’
memblock_x86_reserve_range_setup_data
e820__reserve_setup_data
e820__finish_early_params
efi_init
dmi_setup
e820_add_kernel_range
trim_bios_range
max_pfn = e820__end_of_ram_pfn()
kernel_randomize_memory
e820__memblock_setup
init_mem_mapping
x86_init.paging.pagetable_init
early_alloc_pgt_buf
reserve_brk
init_memory_mapping()
• Create 4-level page table (direct mapping) based on
‘memory’ type of memblock configuration.
x86_init.paging.pagetable_init()
• Init sparse
• Init zone structure
x86 - setup_arch() -- init_mem_mapping() – Page Table
Configuration for Direct Mapping
init_mem_mapping
probe_page_size_mask
setup_pcid
memory_map_top_down(ISA_END_ADDRESS, end)
init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL)
init_range_memory_mapping(start, last_start)
split_mem_range
kernel_physical_mapping_init
add_pfn_range_mapped
early_ioremap_page_table_range_init [x86 only]
load_cr3(swapper_pg_dir)
__flush_tlb_all
init_memory_mapping() -> kernel_physical_mapping_init()
• Create 4-level page table (direct mapping) based on
‘memory’ type of memblock configuration.
split_mem_range()
• Split the input memory range (start and end addresses) into groups by page size
✓ Try larger page sizes first
▪ 1G huge page -> 2M huge page -> 4K page
while (last_start > map_start)
init_memory_mapping(start, end, PAGE_KERNEL)
for_each_mem_pfn_range() → memblock stuff
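The split_mem_range() idea can be sketched as a greedy loop — a simplified user-space model, not the kernel's code, which handles head/tail alignment explicitly and checks CPU support for each page size:

```c
#include <assert.h>

#define SZ_4K (1UL << 12)
#define SZ_2M (1UL << 21)
#define SZ_1G (1UL << 30)

struct map_range { unsigned long start, end, page_size; };

/* Greedy sketch: at each position pick the largest page size that is
 * both aligned and fits, merging neighbours that use the same size. */
static int split_mem_range(unsigned long start, unsigned long end,
                           struct map_range *out)
{
    static const unsigned long sizes[] = { SZ_1G, SZ_2M, SZ_4K };
    int n = 0;

    while (start < end) {
        unsigned long step = SZ_4K;
        int i;

        for (i = 0; i < 3; i++) {
            if (!(start & (sizes[i] - 1)) && start + sizes[i] <= end) {
                step = sizes[i];
                break;
            }
        }
        if (n && out[n - 1].page_size == step)
            out[n - 1].end = start + step;   /* extend the last range */
        else
            out[n++] = (struct map_range){ start, start + step, step };
        start += step;
    }
    return n;
}
```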
Page Table Configuration for Direct Mapping
64-bit virtual address layout (x86_64, default configuration; the kernel-space bases can be dynamically randomized by KASLR — Kernel Address Space Layout Randomization, "arch/x86/mm/kaslr.c"):
• User space: 0 – 0x0000_7FFF_FFFF_FFFF (128TB)
• Empty space up to 0xFFFF_8000_0000_0000, then a guard hole (8TB), LDT remap for PTI (0.5TB), and an unused hole (0.5TB)
• page_offset_base = 0xFFFF_8880_0000_0000: page frame direct mapping (64TB) — covers physical memory from 0: ZONE_DMA (0–16MB), ZONE_DMA32, ZONE_NORMAL
• vmalloc_base = 0xFFFF_C900_0000_0000: vmalloc/ioremap (32TB), followed by an unused hole (1TB)
• vmemmap_base = 0xFFFF_EA00_0000_0000: virtual memory map (1TB) — stores the page frame descriptors
• __START_KERNEL_map = 0xFFFF_FFFF_8000_0000: kernel text mapping from physical address 0; kernel code [.text, .data…] starts at __START_KERNEL = 0xFFFF_FFFF_8100_0000 (1GB or 512MB)
• MODULES_VADDR: modules (1GB or 1.5GB)
• FIXADDR_START .. FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000: fix-mapped address space (expanded to 4MB: 05ab1d8a4b36)
• Unused hole (2MB): 0xFFFF_FFFF_FFE0_0000 – 0xFFFF_FFFF_FFFF_FFFF
Reference: Documentation/x86/x86_64/mm.rst
Note: refer to page #5 in the slide deck “Decompressed vmlinux: linux kernel initialization from page table configuration perspective”
init_mem_mapping() – Page Table Configuration for Direct Mapping
Note
• 2-socket server with 32GB memory
setup_arch() -- init_mem_mapping() – Page Table
Configuration for Direct Mapping
init_memory_mapping() -> kernel_physical_mapping_init()
• Create 4-level page table (direct mapping) based on
‘memory’ type of the memblock configuration.
x86 - setup_arch() -- x86_init.paging.pagetable_init()
x86_init.paging.pagetable_init
native_pagetable_init
Remove mappings at the end of physical
memory from the boot-time page table
paging_init
pagetable_init
__flush_tlb_all
sparse_init
zone_sizes_init
permanent_kmaps_init
x86_init.paging.pagetable_init
native_pagetable_init
paging_init
sparse_init
zone_sizes_init
x86 x86_64
Configure the number of PFNs for each zone
free_area_init
Sparse Memory Model Initialization: sparse_init()
sparse_init
memblocks_present
pnum_begin = first_present_section_nr();
nid_begin = sparse_early_nid(__nr_to_section(pnum_begin));
for_each_mem_pfn_range(..)
memory_present(nid, start, end)
1. for_each_mem_pfn_range(): Walk through available memory range
from memblock subsystem
Allocate pointer array of section root if necessary
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
sparse_index_init
set_section_nid
section_mark_present
Configure ‘ms->section_mem_map’ via
sparse_encode_early_nid()
for_each_present_section_nr(pnum_begin + 1, pnum_end)
sparse_init_nid
sparse_init_nid [Cover the last node]
Mark the present bit for each allocated mem_section
Configure ms->section_mem_map flag bits
1. Allocate a mem_section_usage struct
2. Configure ms->section_mem_map with the valid page descriptor address
[During boot]
Temporary: Store nid in
ms->section_mem_map
[During boot]
Temporary: get nid from ms->section_mem_map
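The memory_present() loop in the flow above can be sketched with a hypothetical scaled-down section size; the real function also allocates section roots on demand and stashes the node id in section_mem_map:

```c
#include <assert.h>

/* Hypothetical scaled-down geometry: 16 pages per section. */
#define PFN_SECTION_SHIFT 4
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS   64UL

static unsigned char section_present[NR_MEM_SECTIONS];

/* memory_present() idea: for every available [start, end) PFN range
 * reported by memblock, mark each covered section as present. */
static void memory_present(unsigned long start, unsigned long end)
{
    unsigned long pfn;

    start &= ~(PAGES_PER_SECTION - 1);   /* align down to a section */
    for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION)
        section_present[pfn >> PFN_SECTION_SHIFT] = 1;
}
```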
memory_present(): walk the memblock ‘memory’ regions and mark the covered sections present
(Diagram sequence, 2-socket/32GB example — memblock.memory regions:
memblock_region #0: base = 0x1000, size = 0x9f000, nid = 0
memblock_region #1: base = 0x100000, size = 0x2ff0_0000, nid = 0
memblock_region #2: base = 0x3004_2000, size = 0x1d6_e000, nid = 0
…
memblock_region #7: base = 0x4_5000_0000, size = 0x3_ffc0_0000, nid = 1
For each region, memory_present() allocates the needed roots of the **mem_section two-dimensional array (NR_SECTION_ROOTS = 2048 roots x SECTIONS_PER_ROOT = 256 entries) on demand and marks each covered ‘struct mem_section’ present: first sections #0-#6 of root #0 (node 0), then sections #138-#255 of root #0 and sections #0-#9 of root #1 (node 1). At this stage each marked section has section_mem_map = 0 and flag bits O=1, E=0, P=1, M=0.)
P: Present, M: Memory map, O: Online, E: Early
sparse_init_nid(): configure section_mem_map
(Diagram: on a per-node basis, sparse_init_nid() allocates one ‘struct mem_section_usage’ array covering the node's number of available ‘struct mem_section’ entries (map_count), allocates the struct page arrays, and maps them into the page table at vmemmap = VMEMMAP_START = vmemmap_base. The populated vmemmap then holds struct page #0-#32767 for section #0 of root #0, #32768-#65535 for section #1, up to #229375 for sections #2-#6, #4521984-#8388607 for sections #138-#255, and #8388608-#8683520… for sections #0-#9 of root #1. Each initialized section's section_mem_map now points into this area, with flag bits O=1, E=1, P=1, M=1.)
Note
Allocate page structs for each mem_section and map them to the page table (Virtual Memory Map)
Re-visit sparse memory
• Sparse Memory: refer to section_mem_map
• Sparse Memory with vmemmap: refer to vmemmap
Sparse Memory Virtual Memmap:
subsection
1. Introduction
2. Subsection users?
3. pageblock_flags: pageblock migration type
Sparse Memory Virtual Memmap: subsection (1/4)
(Diagram: PFN field layout with SECTION_SIZE_BITS = 27 — bits [14:0] PAGES_PER_SECTION offset, [22:15] SECTIONS_PER_ROOT index, [33:23] NR_SECTION_ROOTS index; the in-section bits split further into [8:0] page-in-subsection and [14:9] subsection index. ‘struct mem_section_usage’ holds subsection_map[1] (bitmap) and pageblock_flags[]; subsections #0-#63 of a section cover struct page #0-#511, …, #32256-#32767)
• subsection_map: bitmap indicating whether the corresponding subsection is valid
• pageblock_flags: pages of a subsection share the same flag (migration type)
sparsemem vmemmap *only*
Sparse Memory Virtual Memmap: subsection (2/4)
Some macros are expanded manually
Note
Sparse Memory Virtual Memmap: subsection (3/4)
(Diagram: same PFN field and subsection layout as 1/4)
• PAGES_PER_SUBSECTION = 512 pages
✓ 512 pages * 4KB = 2MB → a 2MB huge page on x86_64
Sparse Memory Virtual Memmap: subsection (4/4)
• SUBSECTION_SIZE
✓ (1UL << 21) = 2MB → a 2MB huge page on x86_64
(Diagram: same PFN field layout as 1/4)
Some macros are expanded manually
Note
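The subsection constants can be derived from SECTION_SIZE_BITS as follows — x86_64 values from the slides; the helper name is illustrative, not the kernel's exact API:

```c
#include <assert.h>

/* x86_64 values from the diagram. */
#define PAGE_SHIFT            12
#define SECTION_SIZE_BITS     27
#define SUBSECTION_SHIFT      21                               /* 2MB */
#define PFN_SUBSECTION_SHIFT  (SUBSECTION_SHIFT - PAGE_SHIFT)  /* 9   */
#define PAGES_PER_SUBSECTION  (1UL << PFN_SUBSECTION_SHIFT)    /* 512 */
#define SUBSECTIONS_PER_SECTION \
        (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))        /* 64  */

/* Which subsection of its section does a PFN fall into? */
static unsigned long pfn_to_subsection_idx(unsigned long pfn)
{
    return (pfn >> PFN_SUBSECTION_SHIFT) & (SUBSECTIONS_PER_SECTION - 1);
}
```

Struct page #32256 from the diagram is the first page of the last subsection (#63) of its section.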
subsection: subsection_map users?
(Diagram: same subsection layout as 1/4)
• Init stage
✓ paging_init -> zone_sizes_init -> free_area_init -> subsection_map_init -> subsection_mask_set
➢ Set the corresponding bitmap bits for the specific subsection
• Reference stage
✓ pfn_section_valid(struct mem_section *ms, unsigned long pfn)
➢ Users
▪ [mm/page_alloc.c: 5089] free_pages -> virt_addr_valid -> __virt_addr_valid -> pfn_valid -> pfn_section_valid
▪ [drivers/char/mem.c: 416] mmap_kmem -> pfn_valid -> pfn_section_valid ➔ /dev/mem (`man mem`)
▪ …
subsection_map users
(Diagram: same subsection layout as 1/4)
• Hotplug stage
✓ Add
➢ #A1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_add -> acpi_memory_enable_device ->
__add_memory -> add_memory_resource -> arch_add_memory -> add_pages -> __add_pages -> sparse_add_section
-> section_activate -> fill_subsection_map -> subsection_mask_set
➢ #A2 [drivers/dax/kmem.c: 43] dev_dax_kmem_probe -> add_memory_driver_managed -> add_memory_resource ->
same with #A1
✓ Remove
➢ #R1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_remove -> __remove_memory ->
try_remove_memory -> arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section ->
section_deactivate -> clear_subsection_map
➢ #R2 [drivers/dax/kmem.c: 139] dev_dax_kmem_remove -> remove_memory -> try_remove_memory -> same with #R1
subsection_map users
subsection: subsection_map users?
pageblock_flags: pageblock migration type
(Diagram: same subsection layout as 1/4)
unsigned long pageblock_flags[4] (dynamically allocated): each unsigned long packs sixteen 4-bit migration-type (MT) fields, one per pageblock — [0] covers subsections #0-#15, [1] #16-#31, [2] #32-#47, [3] #48-#63
The migration type is configured in setup_arch -> … -> memmap_init_zone
pageblock: set migration type
free_area_init
print zone ranges and early memory node ranges
for_each_mem_pfn_range(..)
print memory range for each memblock
subsection_map_init
mminit_verify_pageflags_layout
setup_nr_node_ids
init_unavailable_mem
for_each_online_node(nid)
free_area_init_node
node_set_state
check_for_memory
get_pfn_range_for_nid
calculate_node_totalpages
pgdat_set_deferred_range
free_area_init_core
free_area_init_core
memmap_init
for (j = 0; j < MAX_NR_ZONES; j++)
memmap_init_zone
subsection_map_init
subsection_mask_set
for (nr = start_sec; nr <= end_sec; nr++)
bitmap_set
calculate arch_zone_{lowest, highest}_possible_pfn[]
for (pfn = start_pfn; pfn < end_pfn;)
set_pageblock_migratetype
__init_single_page
set_pageblock_migratetype
• [System init stage] each pageblock is initialized to MIGRATE_MOVABLE
Example zone: present_pages = 1311744, divided into pageblocks #0..#N, where N = round_up(present_pages / pageblock_size) - 1
pageblock size:
• CONFIG_HUGETLB_PAGE=Y → 512 pages (= huge page size)
• CONFIG_HUGETLB_PAGE=N → 1024 pages (MAX_ORDER - 1)
Example: pageblocks = round_up(1311744 / 512) = 2562 (16 + 2544 + 2 = 2562 across the zone's memory ranges)
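Packing one 4-bit migration type per pageblock into pageblock_flags[] can be sketched as below — illustrative helper names and a stand-in migratetype value, not the kernel's exact set/get_pageblock_migratetype() implementation:

```c
#include <assert.h>

#define NR_PAGEBLOCK_BITS 4
#define MT_MOVABLE        1   /* stand-in for MIGRATE_MOVABLE */

/* 16 four-bit fields per 64-bit word -> 4 words cover 64 pageblocks,
 * matching the pageblock_flags[4] layout in the diagram. */
static unsigned long pageblock_flags[4];

static void set_pageblock_mt(unsigned long blk, unsigned long mt)
{
    unsigned long *word = &pageblock_flags[blk / 16];
    unsigned long shift = (blk % 16) * NR_PAGEBLOCK_BITS;

    *word = (*word & ~(0xFUL << shift)) | (mt << shift);
}

static unsigned long get_pageblock_mt(unsigned long blk)
{
    return (pageblock_flags[blk / 16] >>
            ((blk % 16) * NR_PAGEBLOCK_BITS)) & 0xFUL;
}
```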
pageblock_flags: pageblock migration type
[CONFIG_HUGETLB_PAGE=y]
pages per subsection = pages per pageblock = 512 pages (order = 9)
page->flags layout (bit 63 … bit 0)
• no sparsemem, or sparsemem vmemmap: [Node | Zone | … | flags]
• no sparsemem, or sparsemem vmemmap, + last_cpupid: [Node | Zone | LAST_CPUPID | … | flags]
• sparsemem: [Section | Node | Zone | … | flags]
• sparsemem + last_cpupid: [Section | Node | Zone | LAST_CPUPID | … | flags]
• sparsemem wo/ node: [Section | Zone | … | flags]
Note
1. last_cpupid: support for NUMA balancing (the NUMA-optimizing scheduler)
2. sparsemem: enabled by CONFIG_SPARSEMEM
page->flags layout: sparsemem vmemmap + last_cpupid
Kernel Configuration: qemu – v5.11 kernel
...
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
…
CONFIG_NR_CPUS=64
…
CONFIG_NODES_SHIFT=10
…
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
…
# CONFIG_KASAN is not set
[63:54] Node | [53:52] Zone (2-bit) | [51:38] LAST_CPUPID | [37:23] unused | [22:0] flags (enum pageflags, 23-bit)
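The field boundaries above can be expressed as shifts — a user-space sketch of this specific configuration (NODES_SHIFT=10, NR_CPUS=64, sparsemem vmemmap, no KASAN); the kernel derives the real values in include/linux/page-flags-layout.h:

```c
#include <assert.h>

#define NODES_SHIFT       10
#define ZONES_SHIFT       2
#define LAST_CPUPID_SHIFT 14   /* NR_CPUS_BITS (6) + 8 pid bits */

#define NODES_PGSHIFT       (64 - NODES_SHIFT)                  /* 54 */
#define ZONES_PGSHIFT       (NODES_PGSHIFT - ZONES_SHIFT)       /* 52 */
#define LAST_CPUPID_PGSHIFT (ZONES_PGSHIFT - LAST_CPUPID_SHIFT) /* 38 */

/* Extract the node id and zone number from a page->flags word. */
static unsigned long page_to_nid(unsigned long flags)
{
    return flags >> NODES_PGSHIFT;
}
static unsigned long page_zonenum(unsigned long flags)
{
    return (flags >> ZONES_PGSHIFT) & ((1UL << ZONES_SHIFT) - 1);
}
```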
sparsemem + last_cpupid: [Section | Node | Zone | LAST_CPUPID | … | flags]
page->flags: section field (sparsemem wo/ vmemmap)
Sparse Memory: Refer to section_mem_map
Memory Model – Sparse Memory (sparsemem wo/ vmemmap)
(Diagram: same sparse memory model layout as earlier — the **mem_section two-dimensional array, with each present section's struct page array; Node #1 shown as hotplug)
Note
1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames
✓ Refer to: the memblock structure
2. Supports physical memory hotplug
3. Minimum unit: mem_section – PAGES_PER_SECTION = 32768
✓ Each memory section addresses 32768 * 4KB (page size) = 128MB
4. [NUMA] Mitigates memory-hole waste because page structures are only allocated per present “struct mem_section”
Reference
• https://www.kernel.org/doc/html/v5.17/vm/memory-model.html
Backup
/sys/devices/system/memory/block_size_bytes
(Flowchart, ignoring the SGI UV system platform:
• System memory < 64GB? → Y: block_size_bytes = 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB)
• N → bare metal (!X86_FEATURE_HYPERVISOR)? → Y: block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB)
• N: find the largest allowed block size that aligns to the memory end (check ‘max_pfn’), in the range 0x8000_0000 down to 0x800_0000)
* Source code: arch/x86/mm/init_64.c: probe_memory_block_size()
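The flowchart can be sketched as a user-space model of probe_memory_block_size() — ‘bare_metal’ stands in for the X86_FEATURE_HYPERVISOR check, and the SGI UV handling is omitted as in the slide:

```c
#include <assert.h>

#define MIN_MEMORY_BLOCK_SIZE 0x8000000UL    /* 128 MB */
#define MAX_BLOCK_SIZE        0x80000000UL   /* 2 GB   */

static unsigned long probe_block_size(unsigned long mem_end, int bare_metal)
{
    unsigned long bz;

    if (mem_end < (64UL << 30))          /* system memory < 64GB */
        return MIN_MEMORY_BLOCK_SIZE;
    if (bare_metal)                      /* !X86_FEATURE_HYPERVISOR */
        return MAX_BLOCK_SIZE;

    /* Virtualized: largest block size that aligns to the memory end. */
    for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
        if ((mem_end & (bz - 1)) == 0)
            break;
    return bz;
}
```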

  • 8.
    struct mem_section page frame .... pageframe Physical Memory struct mem_section struct mem_section … struct mem_section * page frame .... page frame struct page #m+n-1 .... struct page #m struct page #n .... struct page #0 Node #1 Node #0 … struct mem_section * Memory Model – Sparse Memory Virtual Memmap vmemmap Memory Section (two-dimension array) Note 1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames ✓ Refer from: memblock structure 2. Support physical memory hotplug 3. Minimum unit: PAGES_PER_SECTION = 32768 ✓ Each memory section addresses the memory size: 32768 * 4KB (page size) = 128MB 4. [NUMA] : reduce the memory hole impact due to “struct mem_section” 5. Employ virtual memory map (vmemmap/ vmemmap_base) – A quick way to get page struct and pfn 6. Default configuration in Linux kernel
  • 9.
    Memory Model –Sparse Memory Virtual Memmap: Detail SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 struct mem_section Physical Memory struct mem_section … struct mem_section * page frame struct page #32767 .... struct page #0 … struct mem_section * vmemmap **mem_section (two-dimension array) struct mem_section struct mem_section … . . . 0 0 0 255 255 struct page .... struct page struct page .... struct page 2047 + … page frame page frame … page frame page frame … Hot add Hot add Hot remove .... + page frame 128 MB PFN
  • 10.
    SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 struct mem_section PhysicalMemory struct mem_section … struct mem_section * page frame struct page #32767 .... struct page #0 … struct mem_section * vmemmap **mem_section (two-dimension array) struct mem_section struct mem_section … . . . 0 0 0 255 255 struct page .... struct page struct page .... struct page 2047 + … page frame page frame … page frame page frame … Hot add Hot add Hot remove .... + page frame 128 MB PFN Memory Model – Sparse Memory Virtual Memmap: Detail
  • 11.
    Sparse Memory Model 1.How to know available memory pages in a system? 2. Page Table Configuration for Direct Mapping 3. Sparse Memory Model Initialization – Detail
  • 12.
    How to knowavailable memory pages in a system? BIOS e820 memblock Zone Page Frame Allocator e820__memblock_setup() __free_pages_core() [Call Path] memblock frees available memory space to zone page frame allocator Zone page allocator detail will be discussed in another session: physical memory management
  • 13.
    setup_arch() -- Focuson memory portion setup_arch Reserve memblock for kernel code + data/bss sections, page #0 and init ramdisk e820__memory_setup Setup init_mm struct for members ‘start_code’, ‘end_code’, ‘end_data’ and ‘brk’ memblock_x86_reserve_range_setup_data e820__reserve_setup_data e820__finish_early_params efi_init dmi_setup e820_add_kernel_range trim_bios_range max_pfn = e820__end_of_ram_pfn() kernel_randomize_memory e820__memblock_setup init_mem_mapping x86_init.paging.pagetable_init early_alloc_pgt_buf reserve_brk init_memory_mapping() • Create 4-level page table (direct mapping) based on ‘memory’ type of memblock configuration. x86_init.paging.pagetable_init() • Init sparse • Init zone structure
  • 14.
    x86 - setup_arch()-- init_mem_mapping() – Page Table Configuration for Direct Mapping init_mem_mapping probe_page_size_mask setup_pcid memory_map_top_down(ISA_END_ADDRESS, end) init_memory_mapping(0, ISA_END_ADDRESS, PAGE_KERNEL) init_range_memory_mapping(start, last_start) split_mem_range kernel_physical_mapping_init add_pfn_range_mapped early_ioremap_page_table_range_init [x86 only] load_cr3(swapper_pg_dir) __flush_tlb_all init_memory_mapping() -> kernel_physical_mapping_init() • Create 4-level page table (direct mapping) based on ‘memory’ type of memblock configuration. split_mem_range() • Split different the groups of page size based on the input memory range (start address and end address) ✓ Try larger page size first ▪ 1G huge page -> 2M huge page -> 4K page while (last_start > map_start) init_memory_mapping(start, end, PAGE_KERNEL) for_each_mem_pfn_range() → memblock stuff
  • 15.
    Page Table Configurationfor Direct Mapping Kernel Space 0x0000_7FFF_FFFF_FFFF 0xFFFF_8000_0000_0000 128TB Page frame direct mapping (64TB) ZONE_DMA ZONE_DMA32 ZONE_NORMAL page_offset_base 0 16MB 64-bit Virtual Address Kernel Virtual Address Physical Memory 0 0xFFFF_FFFF_FFFF_FFFF Guard hole (8TB) LDT remap for PTI (0.5TB) Unused hole (0.5TB) vmalloc/ioremap (32TB) vmalloc_base Unused hole (1TB) Virtual memory map – 1TB (store page frame descriptor) … vmemmap_base 64TB *page … *page … *page … Page Frame Descriptor vmemmap_base page_ofset_base = 0xFFFF_8880_0000_0000 vmalloc_base = 0xFFFF_C900_0000_0000 vmemmap_base = 0xFFFF_EA00_0000_0000 * Can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c") Default Configuration Kernel text mapping from physical address 0 Kernel code [.text, .data…] Modules __START_KERNEL_map = 0xFFFF_FFFF_8000_0000 __START_KERNEL = 0xFFFF_FFFF_8100_0000 MODULES_VADDR 0xFFFF_8000_0000_0000 Empty Space User Space 128TB 1GB or 512MB 1GB or 1.5GB Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36) FIXADDR_START Unused hole (2MB) 0xFFFF_FFFF_FFE0_0000 0xFFFF_FFFF_FFFF_FFFF FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 Reference: Documentation/x86/x86_64/mm.rst Note: Refer from page #5 in the slide deck Decompressed vmlinux: linux kernel initialization from page table configuration perspective
  • 16.
    init_mem_mapping() – PageTable Configuration for Direct Mapping Note • 2-socket server with 32GB memory
  • 17.
    init_mem_mapping() – PageTable Configuration for Direct Mapping Note • 2-socket server with 32GB memory
  • 18.
    setup_arch() -- init_mem_mapping()– Page Table Configuration for Direct Mapping init_memory_mapping() -> kernel_physical_mapping_init() • Create 4-level page table (direct mapping) based on ‘memory’ type of the memblock configuration.
  • 19.
    x86 - setup_arch()-- x86_init.paging.pagetable_init() x86_init.paging.pagetable_init native_pagetable_init Remove mappings in the end of physical memory from the boot time page table paging_init pagetable_init __flush_tlb_all sparse_init zone_sizes_init permanent_kmaps_init x86_init.paging.pagetable_init native_pagetable_init paging_init sparse_init zone_sizes_init x86 x86_64 cfg number of pfn for each zone free_area_init
  • 20.
    Sparse Memory ModelInitialization: sparse_init() sparse_init memblocks_present pnum_begin = first_present_section_nr(); nid_begin = sparse_early_nid(__nr_to_section(pnum_begin)); for_each_mem_pfn_range(..) memory_present(nid, start, end) 1. for_each_mem_pfn_range(): Walk through available memory range from memblock subsystem Allocate pointer array of section root if necessary for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) sparse_index_init set_section_nid section_mark_present cfg ‘ms->section_mem_map’ via sparse_encode_early_nid() for_each_present_section_nr(pnum_begin + 1, pnum_end) sparse_init_nid sparse_init_nid [Cover last cpu node] Mark the present bit for each allocated mem_section cfg ms->section_mem_map flag bits 1. Allocate a mem_section_usage struct 2. cfg ms->section_mem_map with the valid page descriptor [During boot] Temporary: Store nid in ms->section_mem_map [During boot] Temporary: get nid in ms->section_mem_map
  • 21.
    memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 struct mem_section #0 struct mem_section #255 … **mem_section Initialized object Initialized object Uninitialized object
  • 22.
    memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 struct mem_section #255 … **mem_section struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #0 struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object P: Present, M: Memory map, O: Online, E: Early
  • 23.
    memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 … struct mem_section #255 … **mem_section 0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object
  • 24.
    memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 … struct mem_section #255 … **mem_section 0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section * #2047 … struct mem_section * #0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 . . . Initialized object Initialized object Uninitialized object
  • 25.
    memblock bottom_up current_limit memory reserved memblock_type cnt max total_size *regions name memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memory_present() memblock_type cnt max total_size *regions name memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 . . . struct mem_section section_mem_map=0 struct mem_section_usage *usage O=1 E=0 P=1 M=0 . . . Initialized object Initialized object Uninitialized object
  • 26.
    memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 sparse_init_nid(): cfg mem_section_map memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section #5 struct mem_section #0 struct mem_section #6 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 . . . struct page #65535 struct page #32767 struct page #0 struct page #32768 … .... ... vmemmap = VMEMMAP_START = vmemmap_base section #0 section_roots #0 section #1 section_roots #0 struct mem_section_usage #n … struct mem_section_usage #0 Per-node basis Number of available ‘struct mem_section (map_count)’. Initialized object Uninitialized object Allocate page structs for each mem_section and map them to the page table (Virtual Memory Map) Note struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 27.
    memblock_region #0 base =0x1000 size = 0x9f000 flags nid = 0 memblock_region #1 base = 0x100000 size = 2ff00000 flags nid = 0 memblock_region #2 base = 0x3004_2000 size = 0x1d6_e000 flags nid = 0 memblock_region #7 base = 0x4_5000_0000 size = 0x3_ffc0_0000 flags nid = 1 struct mem_section * #2047 … struct mem_section * #0 … struct mem_section #255 … **mem_section struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 struct mem_section #5 struct mem_section #0 . . . struct mem_section #6 … struct mem_section #138 struct mem_section * #1 … struct mem_section #9 struct mem_section #0 … struct mem_section #255 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 . . . struct page #0 vmemmap = VMEMMAP_START = vmemmap_base section #0, section_roots #0 section #1, section_roots #0 struct mem_section_usage #n … struct mem_section_usage #0 Per-node basis Number of available ‘struct mem_section (map_count)’. … struct page #32767 struct page #32768 … struct page #65535 … struct page #229375 … … struct page #4521984 … struct page #8388607 struct page #8388608 … struct page #8683520 section #2-6, section_roots #0 section #138-255, section_roots #0 … section #0-9, section_roots #1 Initialized object Allocated & Uninitialized object Unallocated object sparse_init_nid(): cfg mem_section_map Allocate page structs for each mem_section and map them to the page table (Virtual Memory Map) Note struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 28.
    64-bit Virtual Address KernelSpace 0x0000_7FFF_FFFF_FFFF 0xFFFF_8000_0000_0000 128TB Page frame direct mapping (64TB) ZONE_DMA ZONE_DMA32 ZONE_NORMAL page_offset_base 0 16MB 64-bit Virtual Address Kernel Virtual Address Physical Memory 0 0xFFFF_FFFF_FFFF_FFFF Guard hole (8TB) LDT remap for PTI (0.5TB) Unused hole (0.5TB) vmalloc/ioremap (32TB) vmalloc_base Unused hole (1TB) Virtual memory map – 1TB (store page frame descriptor) … vmemmap_base 64TB *page … *page … *page … Page Frame Descriptor vmemmap_base page_ofset_base = 0xFFFF_8880_0000_0000 vmalloc_base = 0xFFFF_C900_0000_0000 vmemmap_base = 0xFFFF_EA00_0000_0000 * Can be dynamically configured by KASLR (Kernel Address Space Layout Randomization - "arch/x86/mm/kaslr.c") Default Configuration Kernel text mapping from physical address 0 Kernel code [.text, .data…] Modules __START_KERNEL_map = 0xFFFF_FFFF_8000_0000 __START_KERNEL = 0xFFFF_FFFF_8100_0000 MODULES_VADDR 0xFFFF_8000_0000_0000 Empty Space User Space 128TB 1GB or 512MB 1GB or 1.5GB Fix-mapped address space (Expanded to 4MB: 05ab1d8a4b36) FIXADDR_START Unused hole (2MB) 0xFFFF_FFFF_FFE0_0000 0xFFFF_FFFF_FFFF_FFFF FIXADDR_TOP = 0xFFFF_FFFF_FF7F_F000 Reference: Documentation/x86/x86_64/mm.rst Note: Refer from page #5 in the slide deck Decompressed vmlinux: linux kernel initialization from page table configuration perspective
  • 29.
    struct mem_section section_mem_map struct mem_section_usage*usage O=1 E=1 P=1 M=1 . . . struct page #0 vmemmap = VMEMMAP_START = vmemmap_base section #0, section_roots #0 section #1, section_roots #0 … struct page #32767 struct page #32768 … struct page #65535 … struct page #229375 … … struct page #4521984 … struct page #8388607 struct page #8388608 … struct page #8683520 section #2-6, section_roots #0 section #138-255, section_roots #0 … section #0-9, section_roots #1 Re-visit sparse memory Sparse Memory: Refer to section_mem_map Sparse Memory with vmemmap: Refer to vmemmap struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=0 M=1
  • 30.
    Sparse Memory VirtualMemmap: subsection 1. Introduction 2. Subsection users? 3. pageblock_flags: pageblock migration type
  • 31.
    Sparse Memory VirtualMemmap: subsection (1/4) SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • subsection_map: bitmap to indicate if the corresponding subsection is valid • pageblock_flags: pages of a subsection have the same flag (migration type) sparsemem vmemmap *only*
  • 32.
    Sparse Memory VirtualMemmap: subsection (2/4) Some macros are expanded manually Note
  • 33.
    Sparse Memory VirtualMemmap: subsection (3/4) SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • PAGES_PER_SUBSECTION = 512 pages ✓ 512 pages * 4KB = 2MB → 2MB huge page in x86_64
  • 34.
    Sparse Memory VirtualMemmap: subsection (4/4) • SUBSECTION_SIZE ✓ (1UL << 21) = 2MB → 2MB huge page in x86_64. SECTIONS_PER_ROOT PAGES_PER_SECTION 0 14 15 22 NR_SECTION_ROOTS 23 33 PFN 63 SECTION_SIZE_BITS = 27 PAGES_PER_SUBSECTION SUBSECTIONS_PER _SECTION 14 9 0 8 Some macros are expanded manually Note
  • 35.
    subsection: subsection_map users? structmem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • init stage ✓ paging_init -> zone_sizes_init -> free_area_init -> subsection_map_init -> subsection_mask_set ➢ Set the corresponding bit map for the specific subsection • Reference stage ✓ pfn_section_valid(struct mem_section *ms, unsigned long pfn) ➢ Users ▪ [mm/page_alloc.c: 5089] free_pages -> virt_addr_valid -> __virt_addr_valid -> pfn_valid -> pfn_section_valid ▪ [drivers/char/mem.c: 416] mmap_kmem -> pfn_valid -> pfn_section_valid ➔ /dev/mem (`man mem`) ▪ … subsection_map users
  • 36.
    struct mem_section section_mem_map struct mem_section_usage*usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … • Hotplug stage ✓ Add ➢ #A1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_add -> acpi_memory_enable_device -> __add_memory -> add_memory_resource -> arch_add_memory -> add_pages -> __add_pages -> sparse_add_section -> section_activate -> fill_subsection_map -> subsection_mask_set ➢ #A2 [drivers/dax/kmem.c: 43] dev_dax_kmem_probe -> add_memory_driver_managed -> add_memory_resource -> same with #A1 ✓ Remove ➢ #R1 [drivers/acpi/acpi_memhotplug.c: 311] acpi_memory_device_remove -> __remove_memory -> try_remove_memory -> arch_remove_memory -> __remove_pages -> __remove_section -> sparse_remove_section -> section_deactivate -> clear_subsection_map ➢ #R2 [drivers/dax/kmem.c: 139] dev_dax_kmem_remove -> remove_memory -> try_remove_memory -> same with #R1 subsection_map users subsection: subsection_map users?
  • 37.
    pageblock_flags: pageblock migrationtype struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type Migration type is configured in setup_arch -> … -> memmap_init_zone
  • 38.
    pageblock_flags: pageblock migrationtype struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type
  • 39.
    pageblock: set migrationtype free_area_init print zone ranges and early memory node ranges for_each_mem_pfn_range(..) print memory range for each memblock subsection_map_init mminit_verify_pageflags_layout setup_nr_node_ids init_unavailable_mem for_each_online_node(nid) free_area_init_node node_set_state check_for_memory get_pfn_range_for_nid calculate_node_totalpages pgdat_set_deferred_range free_area_init_core free_area_init_core memmap_init for (j = 0; j < MAX_NR_ZONES; j++) memmap_init_zone subsection_map_init subsection_mask_set for (nr = start_sec; nr <= end_sec; nr++) bitmap_set calculate arch_zone_{lowest, highest}_possible_pfn[] for (pfn = start_pfn; pfn < end_pfn;) set_pageblock_migratetype __init_single_page set_pageblock_migratetype • [System init stage] each pageblock is initialized to MIGRATE_MOVABLE
  • 40.
    zone present_pages = 1311744 Page .. . pageblock #0 Page pageblock #1 Page pageblock #N CONFIG_HUGETLB_PAGE Number of Pages Y 512 = Huge page size N 1024 (MAX_ORDER - 1) pageblock size N = round_up(present_pages / pageblock_size) - 1 Example pageblocks = round_up(1311744 / 512) = 2562 pageblock 16 + 2544 + 2 = 2562 1 1 2 2
  • 41.
    pageblock_flags: pageblock migrationtype struct mem_section section_mem_map struct mem_section_usage *usage O=1 E=1 P=1 M=1 subsection #63 … subsection #0 struct mem_section_usage subsection_map[1] (bitmap) pageblock_flags[0] struct page #0 … struct page #511 … … struct page #32767 struct page #32256 subsection subsection section … … unsigned long pageblock_flags[4] 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT 4-bit MT . . . 4-bit MT 4-bit MT [0] Dynamically allocated [1] [2] [3] subsection #0: Migration Type subsection #16: Migration Type subsection #32: Migration Type subsection #48: Migration Type [CONFIG_HUGETLB_PAGE=y] pages of subsection = pages of pageblock = 512 pages (order = 9)
  • 42.
  • 43.
    Node Zone …flags Node Zone … flags LAST_CPUPID Node Zone … flags Section Node Zone flags Section Zone … flags Section … LAST_CPUPID No sparsemem or sparsemem vmemmap No sparsemem or sparsemem vmemmap + last_cpupid sparsemem sparsemem + last_cpupid sparsemem wo/ node 1. last_cpupid: Support for NUMA balancing (NUMA-optimizing scheduler) 2. sparsemem: Enabled by CONFIG_SPARSEMEM Note … page->flags layout 0 63
  • 44.
    page->flags layout: sparsememvmemmap + last_cpupid Kernel Configuration: qemu – v5.11 kernel ... CONFIG_NUMA_BALANCING=y CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y … CONFIG_NR_CPUS=64 … CONFIG_NODES_SHIFT=10 … CONFIG_SPARSEMEM_MANUAL=y CONFIG_SPARSEMEM=y CONFIG_NEED_MULTIPLE_NODES=y CONFIG_SPARSEMEM_EXTREME=y CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y CONFIG_SPARSEMEM_VMEMMAP=y … # CONFIG_KASAN is not set Node Zone … flags (enum pageflags) LAST_CPUPID 0 22 38 52 54 63 23-bit pageflags 2-bit zone
  • 45.
    page->flags layout -sparsemem vmemmap + last_cpupid Kernel Configuration: qemu – v5.11 kernel ... CONFIG_NUMA_BALANCING=y CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y … CONFIG_NR_CPUS=64 … CONFIG_NODES_SHIFT=10 … CONFIG_SPARSEMEM_MANUAL=y CONFIG_SPARSEMEM=y CONFIG_NEED_MULTIPLE_NODES=y CONFIG_SPARSEMEM_EXTREME=y CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y CONFIG_SPARSEMEM_VMEMMAP=y … # CONFIG_KASAN is not set Node Zone … flags (enum pageflags) LAST_CPUPID 0 22 38 52 54 63
  • 46.
    Node Zone flags Section… LAST_CPUPID sparsemem + last_cpupid page->flags: section field (sparsemem wo/ vmemmap) Sparse Memory: Refer to section_mem_map
  • 47.
    Memory Model –Sparse Memory (sparsemem wo/ vmemmap) struct mem_section page frame .... page frame Physical Memory **mem_section struct mem_section struct mem_section … struct mem_section * page frame .... page frame .... struct page #0 struct page #n .... struct page #0 Node #1 (hotplug) Node #0 … struct mem_section * 1. [section_mem_map] Dynamic page structure: pre-allocate page structures based on the number of available page frames ✓ Refer from: memblock structure 2. Support physical memory hotplug 3. Minimum unit: mem_section - PAGES_PER_SECTION = 32768 ✓ Each memory section addresses the memory size: 32768 * 4KB (page size) = 128MB 4. [NUMA] : reduce the memory hole impact due to “struct mem_section” Note struct page #m+n-1
  • 48.
  • 49.
  • 50.
  • 51.
    /sys/devices/system/memory/block_size_bytes System memory < 64GB? block_size_bytes= 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB) block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB) * Ignore SGI UV system platform !X86_FEATURE_HYPERVISOR? Find the largest allowed block size that aligns to memory end (check ‘max_pfn’) Range: 0x8000_0000 - 0x800_0000 Y N Y N * Source code: arch/x86/mm/init_64.c: probe_memory_block_size()
  • 52.
    /sys/devices/system/memory/block_size_bytes System memory < 64GB? block_size_bytes= 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB) block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB) * Ignore SGI UV system platform !X86_FEATURE_HYPERVISOR? Find the largest allowed block size that aligns to memory end (check ‘max_pfn’) Range: 0x8000_0000 - 0x800_0000 Y N Y N * Source code: arch/x86/mm/init_64.c: probe_memory_block_size()
  • 53.
    /sys/devices/system/memory/block_size_bytes System memory < 64GB? block_size_bytes= 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB) block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB) * Ignore SGI UV system platform !X86_FEATURE_HYPERVISOR? Find the largest allowed block size that aligns to memory end (check ‘max_pfn’) Range: 0x8000_0000 - 0x800_0000 Y N Y N QEMU – Guest OS * Source code: arch/x86/mm/init_64.c: probe_memory_block_size()
  • 54.
    /sys/devices/system/memory/block_size_bytes System memory < 64GB? block_size_bytes= 0x800_0000 (MIN_MEMORY_BLOCK_SIZE = 128 MB) block_size_bytes = 0x8000_0000 (MAX_BLOCK_SIZE = 2 GB) * Ignore SGI UV system platform !X86_FEATURE_HYPERVISOR? Find the largest allowed block size that aligns to memory end (check ‘max_pfn’) Range: 0x8000_0000 - 0x800_0000 Y N Y N QEMU – Guest OS * Source code: arch/x86/mm/init_64.c: probe_memory_block_size()