The document provides a comprehensive overview of the Linux kernel, covering its architecture, compilation, booting process, and the role of various subsystems. Key topics include kernel source organization, initialization processes, system calls, and memory management. It also addresses the methods for contributing to the kernel, such as creating patches and engaging with the community.
Topics to be covered
o Introduction
o Kernel Source Organization
o Compilation Process
o Booting Process
o Loading of Kernel
o Initialization Process
o Working of Kernel
o Subsystem of Kernel
o Introduction to common Kernel APIs
o Kernel Symbols usage
o Introduction to mailing lists & how to contribute to the kernel tree
- (creating a patch and submitting it)
Compilation Process
After configuration, when the user types 'make zImage' or 'make
bzImage', the resulting bootable kernel image is stored as
arch/i386/boot/zImage or arch/i386/boot/bzImage respectively.
Here is how the image is built:
I. C and assembly source files are compiled into ELF relocatable object format
(.o) and some of them are grouped logically into archives (.a) using ar(1).
II. Using ld(1), the above .o and .a are linked into vmlinux which is a statically
linked, non-stripped ELF 32-bit LSB 80386 executable file.
III. System.map is produced by running nm on vmlinux; irrelevant or uninteresting
symbols are grepped out.
IV. Enter directory arch/i386/boot.
V. Bootsector asm code bootsect.S is preprocessed either with or without
-D__BIG_KERNEL__, depending on whether the target is bzImage or zImage,
into bbootsect.s or bootsect.s respectively.
VI. bbootsect.s is assembled and then converted into 'raw binary' form called
bbootsect (or bootsect.s assembled and raw-converted into bootsect for
zImage).
VII. Setup code setup.S (setup.S includes video.S) is preprocessed into bsetup.s for
bzImage or setup.s for zImage. In the same way as the bootsector code, the
difference is marked by -D__BIG_KERNEL__ present for bzImage. The result is
then converted into 'raw binary' form called bsetup.
VIII. Enter directory arch/i386/boot/compressed and convert
/usr/src/linux/vmlinux to $tmppiggy (tmp filename) in raw binary format,
removing .note and .comment ELF sections.
IX. gzip -9 < $tmppiggy > $tmppiggy.gz
X. Link $tmppiggy.gz into ELF relocatable (ld -r) piggy.o.
XI. Compile compression routines head.S and misc.c (still in
arch/i386/boot/compressed directory) into ELF objects head.o and misc.o.
XII. Link together head.o, misc.o and piggy.o into bvmlinux (or vmlinux for
zImage, don't mistake this for /usr/src/linux/vmlinux!). Note the difference
between -Ttext 0x1000 used for vmlinux and -Ttext 0x100000 for bvmlinux,
i.e. for bzImage compression loader is high-loaded.
XIII. Convert bvmlinux to 'raw binary' bvmlinux.out removing .note and .comment
ELF sections.
XIV. Go back to arch/i386/boot directory and, using the program tools/build, cat
together bbootsect, bsetup and compressed/bvmlinux.out into bzImage
(delete extra 'b' above for zImage). This writes important variables like
setup_sects and root_dev at the end of the bootsector.
Let us see how this kernel works.
Let's start with the boot process.
Booting Process
I. BIOS selects the boot device.
II. BIOS loads the bootsector from the boot device.
III. Bootsector loads setup, decompression routines and
compressed kernel image.
IV. The kernel is uncompressed in protected mode.
V. Low-level initialization is performed by asm code.
VI. High-level C initialization.
Initializations – asm
I. Initialize segment values.
II. Initialize page tables.
III. Enable paging by setting PG bit in %cr0.
IV. Zero-clean BSS (on SMP, only first CPU does this).
V. Copy the first 2k of bootup parameters (kernel command
line).
VI. Check CPU type using EFLAGS and, if possible, cpuid; this can
detect 386 and higher.
VII. The first CPU calls start_kernel(), all others call
arch/i386/kernel/smpboot.c:initialize_secondary() if
ready=1, which just reloads esp/eip and doesn't return.
Initializations – high-level
I. Take a global kernel lock (it is needed so that only one
CPU goes through initialization).
II. Perform arch-specific setup (memory layout analysis,
copying boot command line again, etc.).
III. Print Linux kernel "banner" containing the version.
IV. Initialize traps.
V. Initialize irqs.
VI. Initialize data required for scheduler.
VII. Initialize time keeping data.
VIII. Initialize softirq subsystem.
IX. Parse boot commandline options.
X. Initialize console.
XI. If module support was compiled into the kernel, initialize
the dynamic module loading facility.
XII. If "profile=" command line was supplied, initialize
profiling buffers.
XIII. kmem_cache_init(), initialize most of slab allocator.
XIV. Enable interrupts.
XV. Calculate BogoMips value for this CPU.
XVI. Call mem_init() which calculates max_mapnr,
totalram_pages and high_memory and prints out the
"Memory: ..." line.
XVII. kmem_cache_sizes_init(), finish slab allocator
initialization.
XVIII. Initialize data structures used by procfs.
XIX. fork_init(), create uid_cache, initialise max_threads
based on the amount of memory available and configure
RLIMIT_NPROC for init_task to be max_threads/2.
XX. Create various slab caches needed for VFS, VM, buffer
cache, etc.
XXI. If System V IPC support is compiled in, initialise the IPC
subsystem. Note that for System V shm, this includes
mounting an internal (in-kernel) instance of shmfs
filesystem.
XXII. If quota support is compiled into the kernel, create and
initialise a special slab cache for it.
XXIII. Perform arch-specific "check for bugs" and, whenever
possible, activate workarounds for processor/bus/etc
bugs. Comparing various architectures reveals that "ia64
has no bugs" and "ia32 has quite a few bugs"; a good
example is the "f00f bug", which is only checked if the kernel is
compiled for less than 686 and worked around
accordingly.
Finally the kernel is ready to move_to_user_mode()
XXIV. Set a flag to indicate that a schedule should be invoked
at "next opportunity" and create a kernel thread init()
which execs execute_command if supplied via "init=" boot
parameter, or tries to exec /sbin/init, /etc/init,
/bin/init, /bin/sh in this order; if all these fail, panic
with "suggestion" to use "init=" parameter.
XXV. Go into the idle loop, this is an idle thread with pid=0.
Working of Kernel
After exec()ing the init program from one of the
standard places, the kernel has no direct control over
the program flow.
Its role, from now on, is to provide processes with
system calls, as well as to service asynchronous
events.
Multitasking has been set up, and it is now init which
manages multiuser access by fork()ing system
daemons and login processes.
System Call Implementation
• The mechanism to signal the kernel is a software interrupt.
• The program incurs an exception, and the system then switches to kernel
mode and executes the exception handler, here the system call handler.
• The defined software interrupt on x86 is the int $0x80 instruction.
• It triggers a switch to kernel mode and the execution of exception
vector 128, which is the system call handler.
• The system call handler is the aptly named function system_call(). It is
architecture dependent and typically implemented in assembly in
entry.S.
• Later x86 processors added a feature known as sysenter. This feature
provides a faster, more specialized way of trapping into the kernel to
execute a system call than using the int interrupt instruction.
Denoting the Correct System Call
• On x86, the syscall number is fed to the kernel via the eax register.
• Before causing the trap into the kernel, user-space sticks in eax the
number corresponding to the desired system call.
• The system call handler then reads the value from eax.
• The system_call() function checks the validity of the given system call
number by comparing it to NR_syscalls.
• If it is larger than or equal to NR_syscalls, the function returns
-ENOSYS. Otherwise, the specified system call is invoked:
    call *sys_call_table(,%eax,4)
Because each element in the system call table is 32 bits (four bytes), the
kernel multiplies the given system call number by four to arrive at its
location in the system call table.
Parameter Passing
In addition to the system call number, most syscalls require that
one or more parameters be passed to them. The easiest way to
do this is via the same means that the syscall number is passed:
• The parameters are stored in registers. On x86, the registers
ebx, ecx, edx, esi, and edi contain, in order, the first five
arguments.
• In the unlikely case of six or more arguments, a single register
is used to hold a pointer to user-space where all the
parameters are stored.
The return value is sent to user-space also via register. On x86,
it is written into the eax register.
We have seen how system calls are
implemented. But what about the
system calls themselves?
System calls are calls into the subsystems of the kernel.
Now let us look at the subsystems of the kernel.
Subsystem of Kernel
• Human Interface
• System Interface
• Process Management
• Memory Management
• Storage Handling
• Networking
Human Interface
The subsystem of the kernel required to handle the input and output of
the system.
It controls the functionality of:
• Keyboard
• Console screen
• Mouse
• Etc.
System Interface
Device drivers are part of the system interface, which is
responsible for interfacing the system with the
peripherals and system hardware components.
Types of drivers:
• Character Drivers
• Block Drivers
• USB Drivers
• Network Drivers
Process Management
From the kernel's point of view, a process is an entry in the process table.
Nothing more.
The process table, then, is one of the most important data structures
within the system, together with the memory-management tables and the
buffer cache. The individual item in the process table is the task_struct
structure, defined in include/linux/sched.h.
The process table is both an array and a doubly linked list, as well as a
tree. In the early kernels described here, the physical implementation is a
static array of pointers, whose length is NR_TASKS, a constant defined in
include/linux/tasks.h, and each structure resides in a reserved memory page.
The list structure is achieved through the pointers next_task and prev_task.
After booting is over, the kernel is always working on behalf of one of the
processes, and the global variable current, a pointer to a task_struct
item, is used to record the running one. current is only changed by the
scheduler, in kernel/sched.c. When, however, all processes must be
looked at, the macro for_each_task is used. It is considerably faster than
a sequential scan of the array, when the system is lightly loaded.
A process is always running in either "user mode" or "kernel mode". The
main body of a user program is executed in user mode and system calls
are executed in kernel mode.
System calls, within the kernel, exist as C language functions, their
"official" name being prefixed by "sys_". A system call named, for
example, burnout invokes the kernel function sys_burnout().
Creating processes
A Unix system creates a process through the fork() system call, and process
termination is performed either by exit() or by receiving a signal.
The Linux implementation for them resides in kernel/fork.c and
kernel/exit.c.
The main task of fork() is filling in the data structure for the new process.
Relevant steps, apart from filling in fields, are:
• getting a free page to hold the task_struct
• finding an empty process slot (find_empty_process())
• getting another free page for the kernel_stack_page
• copying the parent's LDT to the child
• duplicating the mmap information of the parent
sys_fork() also manages file descriptors and inodes.
Destroying processes
Exiting from a process is trickier, because the parent process must be
notified about any child that exits.
Moreover, a process can exit by being kill()ed by another process (these
are Unix features).
The file exit.c is therefore the home of sys_kill() and the various flavors
of sys_wait(), in addition to sys_exit().
Executing programs
• After fork()ing, two copies of the same program are running. One of them
usually exec()s another program.
• The exec() system call must locate the binary image of the executable file,
load and run it.
• The Linux implementation of exec() supports different binary formats. This is
accomplished through the linux_binfmt structure.
• Loading of shared libraries is implemented in the same source file as exec() is,
but let's stick to exec() itself.
• Unix systems provide the programmer with six flavors of the exec()
function. All but one of them can be implemented as library functions, and the
Linux kernel implements sys_execve() alone.
It performs quite a simple task: loading the head of the executable and trying to
execute it. If the first two bytes are "#!", then the first line is parsed and an
interpreter is invoked; otherwise the registered binary formats are tried
sequentially.
State
As a process executes, it changes state according to its circumstances.
Linux processes have the following states:
• Running: The process is either running or it is ready to run
• Waiting: The process is waiting for an event or for a resource. Linux
differentiates between two types of waiting process: interruptible and
uninterruptible.
• Stopped: The process has been stopped, usually by receiving a signal.
A process that is being debugged can be in a stopped state.
• Zombie: This is a halted process which, for some reason, still has a
task_struct data structure in the task vector. It is what it sounds like, a
dead process.
The scheduler needs this information in order to decide fairly which process in
the system most deserves to run.
Process Handling - Schedulers
History of Schedulers
• O(n) scheduler - kernels before 2.6
• O(1) scheduler - Ingo Molnar - 2.6.0 to 2.6.22
• Rotating Staircase Deadline Scheduler - Con Kolivas - out-of-tree patch set
• Completely Fair Scheduler (CFS) - Ingo Molnar - mainline since 2.6.23
• Brain Fuck Scheduler (BFS) - Con Kolivas - out-of-tree patch set (e.g. for 3.18.1)
Memory Management
Linux uses segmentation + paging, which simplifies notation.
Linux uses only 4 segments:
• 2 segments (code and data/stack) for KERNEL SPACE, from 3 GB to 4 GB
• 2 segments (code and data/stack) for USER SPACE, from 0 GB to 3 GB
Storage Handling
The Virtual Filesystem (sometimes called the Virtual File Switch or more
commonly simply the VFS) is the subsystem of the kernel that implements
the file and filesystem-related interfaces provided to user-space programs.
The VFS is the glue that enables system calls such as open(), read(), and
write() to work regardless of the filesystem or underlying physical
medium.
Networking
This layer is responsible for handling network packets.
The required protocol stacks are implemented here.
It is also responsible for encrypting and decrypting network
packets.
How To Program
How to use the features of the kernel or change existing things in the kernel.
Kernel Common APIs
Kernel APIs are documented here:
https://www.kernel.org/doc/htmldocs/kernel-api/
• Data Types
• Basic C Library Functions
• Basic Kernel Library Functions
• Memory Management in Linux
• Kernel IPC facilities
• FIFO Buffer
• relay interface support
• Module Support
• Hardware Interfaces
• Firmware Interfaces
• ……. Etc.
Kernel Symbol Usage
When modules are loaded, they are dynamically linked into the kernel. As with
user-space, dynamically linked binaries can call only into external functions that
are explicitly exported for use. In the kernel, this is handled via special directives
called EXPORT_SYMBOL() and EXPORT_SYMBOL_GPL().
Functions that are exported are available for use by modules. Functions that are
not exported cannot be invoked from modules.
The set of kernel symbols that are exported are known as the exported kernel
interfaces or even the kernel API.
Exporting a symbol is easy. After the function is declared, it is usually followed by
an EXPORT_SYMBOL(). For example,
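/* Return the beard color stored in the pirate structure. */
int get_pirate_beard_color(void)
{
        return pirate->beard->color;
}
/* Make the function callable from loadable modules. */
EXPORT_SYMBOL(get_pirate_beard_color);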
Introduction to Mailing Lists & How to Contribute
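git diff (review your uncommitted changes)
git commit (record them with a descriptive message)
git show (inspect the resulting commit)
git format-patch (turn commits into patch files)
git send-email (mail the patches to the maintainers and the mailing list)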
References
The Linux Documentation Project (TLDP)
http://www.tldp.org/LDP/lki/lki.html
Kernelnewbies.org
http://kernelnewbies.org/Documentation/Subsystems
Free-electrons
http://free-electrons.com
http://lxr.free-electrons.com
Kernel Map
http://www.makelinux.net/kernel_map/
Notes
Initializations (asm): control registers of x86: http://en.wikipedia.org/wiki/Control_register ; http://www.eecg.toronto.edu/~amza/www.mindsec.com/files/x86regs.html
BSS: Block Started by Symbol
General registers: EAX EBX ECX EDX
Segment registers: CS DS ES FS GS SS
Index and pointer registers: ESI EDI EBP EIP ESP
Indicator: EFLAGS
Initializations (high-level), "profile=": http://man7.org/linux/man-pages/man7/bootparam.7.html
It is possible to enable a kernel profiling function, if one wishes to find out where the kernel is spending its CPU cycles.
System Call Implementation: It is not possible for user-space applications to execute kernel code directly. They cannot simply make a function call to a method existing in kernel-space because the kernel exists in a protected memory space. If applications could directly read and write to the kernel's address space, system security and stability would go out the window.
Instead, user-space applications must somehow signal the kernel that they want to execute a system call and have the system switch to kernel mode, where the system call can be executed in kernel-space by the kernel on behalf of the application.
Denoting the Correct System Call: Simply entering kernel-space alone is not sufficient because there are multiple system calls, all of which enter the kernel in the same manner. Thus, the system call number must be passed into the kernel.
User mode and kernel mode: The stack used by the process in the two execution modes is different. A conventional stack segment is used for user mode, while a fixed-size stack (one page, owned by the process) is used in kernel mode. The kernel stack page is never swapped out, because it must be available whenever a system call is entered.
Creating processes: The Local Descriptor Table (LDT) is a memory table used in the x86 architecture in protected mode and containing memory segment descriptors: start in linear memory, size, executability, writability, access privilege, actual presence in memory, etc.
Process states: Interruptible waiting processes can be interrupted by signals, whereas uninterruptible waiting processes are waiting directly on hardware conditions and cannot be interrupted under any circumstances.
Schedulers: The main idea behind the CFS is to maintain balance (fairness) in providing processor time to tasks.