
Interrupts and Interrupt Handling. Part 4.

Initialization of non-early interrupt gates

This is the fourth part about interrupts and exceptions handling in the Linux kernel. In the previous part we saw the first early #DB and #BP exception handlers from arch/x86/kernel/traps.c. We stopped right after the early_trap_init function that is called from the setup_arch function defined in arch/x86/kernel/setup.c. In this part we will continue to dive into interrupts and exceptions handling in the Linux kernel for x86_64, starting from the place where we left off in the last part. The first thing related to interrupts and exceptions handling is the setup of the #PF or page fault handler with the early_trap_pf_init function. Let's start from it.

Early page fault handler

The early_trap_pf_init function is defined in arch/x86/kernel/traps.c. It uses the set_intr_gate macro that fills the Interrupt Descriptor Table with the given entry:

void __init early_trap_pf_init(void)
{
#ifdef CONFIG_X86_64
    set_intr_gate(X86_TRAP_PF, page_fault);
#endif
}

This macro is defined in arch/x86/include/asm/desc.h. We already saw macros like this in the previous part - set_system_intr_gate and set_intr_gate_ist. This macro checks that the given vector number is not greater than 255 (the maximum vector number) and calls the _set_gate function, just as set_system_intr_gate and set_intr_gate_ist did:

#define set_intr_gate(n, addr)                                      \
    do {                                                            \
        BUG_ON((unsigned)n > 0xFF);                                 \
        _set_gate(n, GATE_INTERRUPT, (void *)addr, 0, 0,            \
                  __KERNEL_CS);                                     \
        _trace_set_gate(n, GATE_INTERRUPT, (void *)trace_##addr,    \
                        0, 0, __KERNEL_CS);                         \
    } while (0)
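
To make the _set_gate call above a bit more concrete, here is a rough, hypothetical userspace sketch of what one x86_64 IDT entry looks like and how a handler address, segment selector, gate type, privilege level and IST index get packed into it. The struct and the fill_gate_sketch helper are illustrative names only, not the kernel's actual gate_desc/pack_gate code:

#include <stdint.h>
#include <stdio.h>

/* A simplified view of one 16-byte x86_64 IDT entry. Bit-field layout
 * is shown for readability; the kernel packs the same information in
 * its own gate_desc structure. */
struct idt_gate_sketch {
    uint16_t offset_low;     /* handler address, bits 0..15 */
    uint16_t selector;       /* code segment selector, e.g. __KERNEL_CS */
    uint16_t ist    : 3;     /* Interrupt Stack Table index, 0 = none */
    uint16_t zero   : 5;
    uint16_t type   : 4;     /* 0xE = interrupt gate, 0xF = trap gate */
    uint16_t zero2  : 1;
    uint16_t dpl    : 2;     /* privilege level allowed to use the gate */
    uint16_t p      : 1;     /* present bit */
    uint16_t offset_middle;  /* handler address, bits 16..31 */
    uint32_t offset_high;    /* handler address, bits 32..63 */
    uint32_t reserved;
} __attribute__((packed));

/* Pack an address and attributes into a gate, roughly what _set_gate
 * ends up writing into idt_table[n]. */
static void fill_gate_sketch(struct idt_gate_sketch *g, unsigned type,
                             uint64_t addr, unsigned dpl, unsigned ist,
                             uint16_t selector)
{
    g->offset_low    = addr & 0xFFFF;
    g->offset_middle = (addr >> 16) & 0xFFFF;
    g->offset_high   = (uint32_t)(addr >> 32);
    g->selector      = selector;
    g->ist           = ist;
    g->zero          = 0;
    g->type          = type;
    g->zero2         = 0;
    g->dpl           = dpl;
    g->p             = 1;
    g->reserved      = 0;
}

int main(void)
{
    struct idt_gate_sketch g;

    /* made-up handler address and selector, purely for illustration */
    fill_gate_sketch(&g, 0xE, 0xffffffff81a00000ULL, 0, 0, 0x10);
    printf("entry size: %zu bytes, type: %#x\n", sizeof(g), (unsigned)g.type);
    return 0;
}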

The set_intr_gate macro takes two parameters:

vector number of an interrupt;
address of an interrupt handler.

In our case they are:

X86_TRAP_PF - 14;
page_fault - the interrupt handler entry point.

The X86_TRAP_PF is an element of the enum that is defined in arch/x86/include/asm/traps.h:

enum {
    ...
    ...
    ...
    X86_TRAP_PF,        /* 14, Page Fault */
    ...
    ...
    ...
}

When early_trap_pf_init is called, set_intr_gate expands to a call of _set_gate which fills the IDT with the handler for the page fault. Now let's look at the implementation of the page_fault handler. The page_fault handler is defined in the arch/x86/entry/entry_64.S assembly source code file, as all exception handlers are. Let's look at it:

trace_idtentry page_fault do_page_fault has_error_code=1

We saw in the previous part how the #DB and #BP handlers are defined. They were defined with the idtentry macro, but here we can see trace_idtentry. This macro is defined in the same source code file and depends on the CONFIG_TRACING kernel configuration option:

#ifdef CONFIG_TRACING
.macro trace_idtentry sym do_sym has_error_code:req
idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#else
.macro trace_idtentry sym do_sym has_error_code:req
idtentry \sym \do_sym has_error_code=\has_error_code
.endm
#endif

We will not dive into exception Tracing now. If CONFIG_TRACING is not set, we can see that the trace_idtentry macro just expands to the normal idtentry. We already saw the implementation of the idtentry macro in the previous part, so let's start from the page_fault exception handler.

As we can see in the idtentry definition, the handler of page_fault is the do_page_fault function which is defined in arch/x86/mm/fault.c and, as all exception handlers, it takes two arguments:

regs - the pt_regs structure that holds the state of an interrupted process;
error_code - the error code of the page fault exception.

Let's look inside this function. First of all we read the content of the cr2 control register:

dotraplinkage void notrace
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();
    ...
    ...
    ...
}

This register contains the linear address which caused the page fault. In the next step we make a call of the exception_enter function from include/linux/context_tracking.h. The exception_enter and exception_exit functions belong to the context tracking subsystem in the Linux kernel, used by RCU to remove its dependency on the timer tick while a processor runs in userspace. Almost in every exception handler we will see similar code:

enum ctx_state prev_state;
prev_state = exception_enter();
...
... // exception handler here
...
exception_exit(prev_state);

The exception_enter function checks that context tracking is enabled with context_tracking_is_enabled and, if it is enabled, we get the previous context with this_cpu_read (more about this_cpu_* operations you can read in the Documentation). After this it calls the context_tracking_user_exit function which informs the context tracking subsystem that the processor is exiting userspace mode and entering the kernel:

static inline enum ctx_state exception_enter(void)
{
    enum ctx_state prev_ctx;

    if (!context_tracking_is_enabled())
        return 0;

    prev_ctx = this_cpu_read(context_tracking.state);
    context_tracking_user_exit();

    return prev_ctx;
}

The state can be one of the:

enum ctx_state {
    IN_KERNEL = 0,
    IN_USER,
} state;

And in the end we return the previous context. Between exception_enter and exception_exit we call the actual page fault handler:

__do_page_fault(regs, error_code, address);

The __do_page_fault is defined in the same source code file as do_page_fault - arch/x86/mm/fault.c. In the beginning of __do_page_fault we check the state of the kmemcheck checker. kmemcheck detects and warns about some uses of uninitialized memory. We need to check it because a page fault can be caused by kmemcheck:

    if (kmemcheck_active(regs))
        kmemcheck_hide(regs);
    prefetchw(&mm->mmap_sem);

After this we can see the call of prefetchw which executes the instruction with the same name (available with the X86_FEATURE_3DNOW feature) to fetch the cache line for exclusive ownership. The main purpose of prefetching is to hide the latency of a memory access. In the next step we check that we got the page fault not in kernel space, with the following condition:

if (unlikely(fault_in_kernel_space(address))) {
...
...
...
}

where fault_in_kernel_space is:

static int fault_in_kernel_space(unsigned long address)
{
    return address >= TASK_SIZE_MAX;
}

The TASK_SIZE_MAX macro expands to:

#define TASK_SIZE_MAX   ((1UL << 47) - PAGE_SIZE)

or 0x00007ffffffff000.
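
As a quick sanity check of that number, here is a minimal userspace sketch, assuming the usual 4096-byte PAGE_SIZE:

#include <stdio.h>

int main(void)
{
    unsigned long page_size = 4096;                 /* assumed PAGE_SIZE */
    unsigned long task_size_max = (1UL << 47) - page_size;

    printf("%#lx\n", task_size_max);                /* prints 0x7ffffffff000 */
    return 0;
}
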
Pay attention to the unlikely macro. There are two such macros in the Linux kernel:

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

You can often find these macros in the code of the Linux kernel. The main purpose of these macros is optimization. Sometimes we need to check a condition and we know that it will rarely be true or false. With these macros we can tell the compiler about this. For example:

static int proc_root_readdir(struct file *file, struct dir_context *ctx)
{
    if (ctx->pos < FIRST_PROCESS_ENTRY) {
        int error = proc_readdir(file, ctx);
        if (unlikely(error <= 0))
            return error;
    ...
    ...
    ...
}

Here we can see the proc_root_readdir function which will be called when the Linux VFS needs to read the root directory contents. If a condition is marked with unlikely, the compiler can place the code of that branch out of the hot path, so the common case falls straight through after the branching.
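
For illustration, here is a minimal userspace sketch of the same pattern; the parse_byte function below is made up, and the point is only that likely/unlikely are thin wrappers around the __builtin_expect GCC/Clang builtin:

#include <stdio.h>

#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int parse_byte(int c)
{
    if (unlikely(c < 0)) {
        /* rare error path - the compiler may move it out of the hot path */
        fprintf(stderr, "invalid byte\n");
        return -1;
    }
    /* common path stays fall-through */
    return c * 2;
}

int main(void)
{
    printf("%d\n", parse_byte(21)); /* prints 42 */
    return 0;
}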

Now let's get back to our address check. The comparison between the given address and 0x00007ffffffff000 lets us know whether the page fault happened in kernel mode or in user mode. After this check the __do_page_fault routine will try to understand the problem that provoked the page fault exception and then will pass the address to the appropriate routine. It can be a kmemcheck fault, a spurious fault, a kprobes fault and so on. We will not dive into the implementation details of the page fault exception handler in this part, because we would need to know many different concepts provided by the Linux kernel, but we will see it in the chapter about memory management in the Linux kernel.

Back to start_kernel

There are many different function calls after the early_trap_pf_init in the setup_arch function, from different kernel subsystems, but none of them is related to interrupts and exceptions handling. So we have to go back where we came from - the start_kernel function from init/main.c. The first thing after setup_arch is the trap_init function from arch/x86/kernel/traps.c. This function initializes the remaining exception handlers (remember that we already set up 3 handlers: #DB - debug exception, #BP - breakpoint exception and #PF - page fault exception). The trap_init function starts from the check of the Extended Industry Standard Architecture:

#ifdef CONFIG_EISA
    void __iomem *p = early_ioremap(0x0FFFD9, 4);

    if (readl(p) == 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24))
        EISA_bus = 1;
    early_iounmap(p, 4);
#endif

Note that it depends on the CONFIG_EISA kernel configuration parameter which represents EISA support. Here we use the early_ioremap function to map I/O memory in the page tables. We use the readl function to read the first 4 bytes from the mapped region and, if they are equal to the EISA string, we set EISA_bus to one. In the end we just unmap the previously mapped region. More about early_ioremap you can read in the part which describes Fix-Mapped Addresses and ioremap.
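
The 'E' + ('I'<<8) + ('S'<<16) + ('A'<<24) expression relies on x86 being little-endian, so reading the four bytes "EISA" as one 32-bit value gives exactly that sum. A small userspace sketch to convince yourself (memcpy stands in for what readl would return):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char sig[4] = { 'E', 'I', 'S', 'A' };
    uint32_t as_read;

    /* read the four bytes as one 32-bit value, little-endian */
    memcpy(&as_read, sig, sizeof(as_read));

    uint32_t expected = 'E' + ('I' << 8) + ('S' << 16) + ('A' << 24);

    printf("%#x %#x %s\n", as_read, expected,
           as_read == expected ? "match" : "mismatch");
    return 0;
}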

After this we start to fill the Interrupt Descriptor Table with the different interrupt gates. First of all we set #DE or Divide Error and #NMI or Non-maskable Interrupt:

set_intr_gate(X86_TRAP_DE, divide_error);
set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);

We use the set_intr_gate macro to set the interrupt gate for the #DE exception and set_intr_gate_ist for the #NMI. You may remember that we already used these macros when we set the interrupt gates for the page fault handler, the debug handler and so on; you can find an explanation in the previous part. After this we set up exception gates for the following exceptions:

set_system_intr_gate(X86_TRAP_OF, &overflow);
set_intr_gate(X86_TRAP_BR, bounds);
set_intr_gate(X86_TRAP_UD, invalid_op);
set_intr_gate(X86_TRAP_NM, device_not_available);

Here we can see:

#OF or Overflow exception. This exception indicates that an overflow trap occurred when the special INTO instruction was executed;

#BR or BOUND Range exceeded exception. This exception indicates that a BOUND-range-exceeded fault occurred when a BOUND instruction was executed;

#UD or Invalid Opcode exception. Occurs when the processor attempted to execute an invalid or reserved opcode, an instruction with invalid operand(s), and so on;

#NM or Device Not Available exception. Occurs when the processor tries to execute an x87 FPU floating point instruction while the EM flag in the control register cr0 is set.

In the next step we set the interrupt gate for the #DF or Double fault exception:

set_intr_gate_ist(X86_TRAP_DF, &double_fault, DOUBLEFAULT_STACK);

This exception occurs when the processor detects a second exception while calling an exception handler for a prior exception. Usually, when the processor detects another exception while trying to call an exception handler, the two exceptions can be handled serially. If the processor cannot handle them serially, it signals the double-fault or #DF exception.

The following set of the interrupt gates is:

set_intr_gate(X86_TRAP_OLD_MF, &coprocessor_segment_overrun);
set_intr_gate(X86_TRAP_TS, &invalid_TSS);
set_intr_gate(X86_TRAP_NP, &segment_not_present);
set_intr_gate_ist(X86_TRAP_SS, &stack_segment, STACKFAULT_STACK);
set_intr_gate(X86_TRAP_GP, &general_protection);
set_intr_gate(X86_TRAP_SPURIOUS, &spurious_interrupt_bug);
set_intr_gate(X86_TRAP_MF, &coprocessor_error);
set_intr_gate(X86_TRAP_AC, &alignment_check);

Here we can see the setup for the following exception handlers:

#CSO or Coprocessor Segment Overrun - this exception indicates that the math coprocessor of an old processor detected a page or segment violation. Modern processors do not generate this exception.

#TS or Invalid TSS exception - indicates that there was an error related to the Task State Segment.

#NP or Segment Not Present exception - indicates that the present flag of a segment or gate descriptor is clear during an attempt to load one of the cs, ds, es, fs, or gs registers.

#SS or Stack Fault exception - indicates that one of the stack-related conditions was detected, for example a not-present stack segment is detected when attempting to load the ss register.

#GP or General Protection exception - indicates that the processor detected one of a class of protection violations called general-protection violations. There are many different conditions that can cause a general-protection exception. For example, loading the ss, ds, es, fs, or gs register with a segment selector for a system segment, writing to a code segment or a read-only data segment, referencing an entry in the Interrupt Descriptor Table (following an interrupt or exception) that is not an interrupt, trap, or task gate, and many many more.

Spurious Interrupt - a hardware interrupt that is unwanted.

#MF or x87 FPU Floating-Point Error exception - caused when the x87 FPU has detected a floating point error.

#AC or Alignment Check exception - indicates that the processor detected an unaligned memory operand when alignment checking was enabled.

After we set up these exception gates, we can see the setup of the Machine-Check exception:

#ifdef CONFIG_X86_MCE
    set_intr_gate_ist(X86_TRAP_MC, &machine_check, MCE_STACK);
#endif

Note that it depends on the CONFIG_X86_MCE kernel configuration option and indicates that the processor detected an internal machine error or a bus error, or that an external agent detected a bus error. The next exception gate is for the SIMD Floating-Point exception:

set_intr_gate(X86_TRAP_XF, &simd_coprocessor_error);

which indicates that the processor has detected an SSE or SSE2 or SSE3 SIMD floating-point exception. There are six classes of numeric exception conditions that can occur while executing a SIMD floating-point instruction:

Invalid operation
Divide-by-zero
Denormal operand
Numeric overflow
Numeric underflow
Inexact result (Precision)

In the next step we fill the used_vectors array which is defined in the arch/x86/include/asm/desc.h header file and represents a bitmap:

DECLARE_BITMAP(used_vectors, NR_VECTORS);

We set the bits of the first 32 interrupts in it (more about bitmaps in the Linux kernel you can read in the part which describes cpumasks and bitmaps):

for (i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
    set_bit(i, used_vectors);

where FIRST_EXTERNAL_VECTOR is:

#define FIRST_EXTERNAL_VECTOR   0x20
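
A rough userspace sketch of what the used_vectors bitmap amounts to: an array of unsigned long where every interrupt vector owns one bit. The set_bit_sketch/test_bit_sketch helpers below are simplified stand-ins for the kernel's set_bit/test_bit, not the real (atomic) implementations:

#include <limits.h>
#include <stdio.h>

#define NR_VECTORS            256
#define FIRST_EXTERNAL_VECTOR 0x20
#define BITS_PER_LONG         (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS(n)       (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static unsigned long used_vectors[BITMAP_LONGS(NR_VECTORS)];

static void set_bit_sketch(unsigned int nr, unsigned long *map)
{
    map[nr / BITS_PER_LONG] |= 1UL << (nr % BITS_PER_LONG);
}

static int test_bit_sketch(unsigned int nr, const unsigned long *map)
{
    return (map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1;
}

int main(void)
{
    /* mark the first 32 (exception) vectors as used, like trap_init does */
    for (unsigned int i = 0; i < FIRST_EXTERNAL_VECTOR; i++)
        set_bit_sketch(i, used_vectors);

    printf("vector 14 used: %d, vector 0x80 used: %d\n",
           test_bit_sketch(14, used_vectors),
           test_bit_sketch(0x80, used_vectors));
    return 0;
}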

After this we set up the interrupt gate for the ia32_syscall and add 0x80 to the used_vectors bitmap:

#ifdef CONFIG_IA32_EMULATION
    set_system_intr_gate(IA32_SYSCALL_VECTOR, ia32_syscall);
    set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

There is a CONFIG_IA32_EMULATION kernel configuration option on x86_64 Linux kernels. This option provides the ability to execute 32-bit processes in compatibility mode. In the next parts we will see how it works; in the meantime we only need to know that there is yet another interrupt gate in the IDT with the vector number 0x80.
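
As a hedged illustration of that gate: on an x86_64 kernel built with CONFIG_IA32_EMULATION, userspace can still enter the kernel through vector 0x80 using the 32-bit syscall ABI. The snippet below calls getpid (number 20 in the i386 syscall table); it is a demonstration only and will fail on kernels without the emulation enabled:

#include <stdio.h>

int main(void)
{
    long pid;

    /* i386 ABI: syscall number in eax, return value comes back in eax;
     * the compat entry path does not preserve r8-r11, so list them as
     * clobbered to stay on the safe side */
    asm volatile("int $0x80"
                 : "=a" (pid)
                 : "a" (20L)
                 : "r8", "r9", "r10", "r11", "memory");

    printf("getpid via int $0x80: %ld\n", pid);
    return 0;
}
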
In the next step we map the IDT to the fixmap area:

__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
idt_descr.address = fix_to_virt(FIX_RO_IDT);

and write its address to the idt_descr.address (more about fix-mapped addresses you can read in the second part of the Linux kernel memory management chapter). After this we can see the call of the cpu_init function that is defined in arch/x86/kernel/cpu/common.c. This function initializes all the per-cpu state. In the beginning of cpu_init we do the following things: first of all we wait while the current cpu is initialized, then we call the cr4_init_shadow function which stores a shadow copy of the cr4 control register for the current cpu, and we load the CPU microcode if needed, with the following function calls:

wait_for_master_cpu(cpu);
cr4_init_shadow();
load_ucode_ap();

Next we get the Task State Segment for the current cpu and the orig_ist structure which represents the original Interrupt Stack Table values, with:

t = &per_cpu(cpu_tss, cpu);
oist = &per_cpu(orig_ist, cpu);

As we got the values of the Task State Segment and Interrupt Stack Table for the current processor, we clear the following bits in the cr4 control register:

cr4_clear_bits(X86_CR4_VME|X86_CR4_PVI|X86_CR4_TSD|X86_CR4_DE);

With this we disable the vm86 extension, virtual interrupts, the timestamp (RDTSC can only be executed with the highest privilege) and the debug extension. After this we reload the Global Descriptor Table and the Interrupt Descriptor Table with:

switch_to_new_gdt(cpu);
loadsegment(fs, 0);
load_current_idt();

After this we set up the array of Thread-Local Storage Descriptors, configure NX and load the CPU microcode. Now it is time to set up and load the per-cpu Task State Segments. We go in a loop through all the exception stacks, of which there are N_EXCEPTION_STACKS or 4, and fill them with Interrupt Stack Tables:

if (!oist->ist[0]) {
    char *estacks = per_cpu(exception_stacks, cpu);

    for (v = 0; v < N_EXCEPTION_STACKS; v++) {
        estacks += exception_stack_sizes[v];
        oist->ist[v] = t->x86_tss.ist[v] =
            (unsigned long)estacks;
        if (v == DEBUG_STACK-1)
            per_cpu(debug_stack_addr, cpu) = (unsigned long)estacks;
    }
}
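
A simplified userspace model of what this loop computes may help: given one contiguous per-cpu buffer, each IST slot ends up holding the top of its stack, because x86 stacks grow down and estacks is advanced before it is stored. The fixed STACK_SIZE below is a made-up stand-in for exception_stack_sizes[]:

#include <stdio.h>

#define N_EXCEPTION_STACKS 4
#define STACK_SIZE         4096

int main(void)
{
    static char exception_stacks[N_EXCEPTION_STACKS * STACK_SIZE];
    unsigned long ist[N_EXCEPTION_STACKS];
    char *estacks = exception_stacks;

    for (int v = 0; v < N_EXCEPTION_STACKS; v++) {
        estacks += STACK_SIZE;           /* advance to the end of stack v */
        ist[v] = (unsigned long)estacks; /* IST entries hold the stack top */
    }

    for (int v = 0; v < N_EXCEPTION_STACKS; v++)
        printf("IST[%d] top offset: %lu\n",
               v, ist[v] - (unsigned long)exception_stacks);
    return 0;
}
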
As we have filled the Task State Segments with the Interrupt Stack Tables we can set the TSS descriptor for the current processor and load it with:

set_tss_desc(cpu, t);
load_TR_desc();

where the set_tss_desc macro from arch/x86/include/asm/desc.h writes the given descriptor to the Global Descriptor Table of the given processor:

#define set_tss_desc(cpu, addr) __set_tss_desc(cpu, GDT_ENTRY_TSS, addr)

static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
{
    struct desc_struct *d = get_cpu_gdt_table(cpu);
    tss_desc tss;

    set_tssldt_descriptor(&tss, (unsigned long)addr, DESC_TSS,
                          IO_BITMAP_OFFSET + IO_BITMAP_BYTES +
                          sizeof(unsigned long) - 1);
    write_gdt_entry(d, entry, &tss, DESC_TSS);
}

and the load_TR_desc macro expands to the ltr or Load Task Register instruction:

#define load_TR_desc()  native_load_tr_desc()

static inline void native_load_tr_desc(void)
{
    asm volatile("ltr %w0"::"q" (GDT_ENTRY_TSS*8));
}
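
A short aside on the GDT_ENTRY_TSS*8 operand above: a segment selector is just the descriptor index shifted left by three bits, plus a table-indicator bit and a two-bit requested privilege level, which is why multiplying the GDT index by 8 yields the selector that ltr loads. A small sketch (make_selector is a hypothetical helper, and the GDT_ENTRY_TSS value of 8 is an assumption for illustration):

#include <stdint.h>
#include <stdio.h>

static uint16_t make_selector(unsigned index, unsigned use_ldt, unsigned rpl)
{
    /* bits 3..15: descriptor index, bit 2: table indicator, bits 0..1: RPL */
    return (uint16_t)((index << 3) | ((use_ldt & 1) << 2) | (rpl & 3));
}

int main(void)
{
    unsigned gdt_entry_tss = 8;   /* assumed value, for illustration only */

    /* index * 8 == (index << 3) with TI = 0 (GDT) and RPL = 0 */
    printf("TSS selector: %#x\n", make_selector(gdt_entry_tss, 0, 0));
    return 0;
}
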
In the end of the trap_init function we can see the following code:

set_intr_gate_ist(X86_TRAP_DB, &debug, DEBUG_STACK);
set_system_intr_gate_ist(X86_TRAP_BP, &int3, DEBUG_STACK);
...
...
...
#ifdef CONFIG_X86_64
    memcpy(&nmi_idt_table, &idt_table, IDT_ENTRIES * 16);
    set_nmi_gate(X86_TRAP_DB, &debug);
    set_nmi_gate(X86_TRAP_BP, &int3);
#endif

Here we copy the idt_table to the nmi_idt_table and set up the exception handlers for the #DB or Debug exception and the #BP or Breakpoint exception. You may remember that we already set these interrupt gates in the previous part, so why do we need to set them up again? We set them up again because when we initialized them before, in the early_trap_init function, the Task State Segment was not ready yet, but now it is ready after the call of the cpu_init function.

That's all. Soon we will consider all the handlers of these interrupts/exceptions.

Conclusion

It is the end of the fourth part about interrupts and interrupt handling in the Linux kernel. We saw the initialization of the Task State Segment in this part and the initialization of different interrupt handlers such as Divide Error, the Page Fault exception and so on. You can note that we saw just the initialization stuff and will dive into the details of the handlers for these exceptions. In the next part we will start to do it.

If you have any questions or suggestions write me a comment or ping me at twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes please send me a PR to linux-insides.

Links

page fault
Interrupt Descriptor Table
Tracing
cr2
RCU
this_cpu_* operations
kmemcheck
prefetchw
3DNow
CPU caches
VFS
Linux kernel memory management
Fix-Mapped Addresses and ioremap
Extended Industry Standard Architecture
INT instruction
INTO
BOUND
opcode
control register
x87 FPU
MCE exception
SIMD
cpumasks and bitmaps
NX
Task State Segment
Previous part
