Linux Insides - Interrupts - Linux Interrupts 1
Introduction

This is the first part of a new chapter of the linux insides book. We came a long way in the
previous chapter of this book. We started from the earliest steps of kernel initialization and
finished with the launch of the first init process. We saw several initialization steps related
to the various kernel subsystems, but we did not dig deep into the details of these subsystems.
In this chapter, we will try to understand how the various kernel subsystems work and how they
are implemented. As you can already tell from the chapter's title, the first subsystem will be
interrupts.

What is an Interrupt?

We have already come across the word interrupt in several parts of this book. We even saw a
couple of examples of interrupt handlers. In the current chapter we will start from the theory,
i.e.

    What are interrupts?
    What are interrupt handlers?

We will then continue to dig deeper into the details of interrupts and how the Linux kernel
handles them.

On modern machines, interrupts from devices reach the CPU through an Advanced Programmable
Interrupt Controller (APIC), which consists of two separate devices: the Local APIC and the
I/O APIC.

The first, the Local APIC, is located on each CPU core. The local APIC is responsible for
handling the CPU-specific interrupt configuration.

The second, the I/O APIC, provides multi-processor interrupt management. It is used to
distribute external interrupts among the CPU cores. More about the local and I/O APICs will be
covered later in this chapter. As you can understand, interrupts can occur at any time. When an
interrupt occurs, the operating system must handle it immediately. But what does it mean to
handle an interrupt? When an interrupt occurs, the operating system must ensure the following
steps:

    The kernel must pause execution of the current process (preempt the current task);
    The kernel must search for the handler of the interrupt and transfer control to it
    (execute the interrupt handler);
    After the interrupt handler completes execution, the interrupted process can resume
    execution.

Of course there are numerous intricacies involved in this procedure of handling interrupts, but
the above 3 steps form its basic skeleton.

The addresses of the interrupt handlers are maintained in a special location referred to as the
Interrupt Descriptor Table or IDT. The processor uses a unique number to recognize the type of
interrupt or exception. This number is called the vector number. A vector number is an index
into the IDT. The number of vectors is limited: a vector number can be from 0 to 255. You can
note the following range check on the vector number within the Linux kernel source code:

  BUG_ON((unsigned)n > 0xFF);

You can find this check within the Linux kernel source code related to interrupt setup (e.g. the
set_intr_gate in arch/x86/kernel/idt.c). The first 32 vector numbers, from 0 to 31, are reserved
by the processor and used for the processing of architecture-defined exceptions and interrupts.
You can find the table with the description of these vector numbers in the second part of the
Linux kernel initialization process - Early interrupt and exception handling. Vector numbers from
32 to 255 are designated as user-defined interrupts and are not reserved by the processor. These
interrupts are generally assigned to external I/O devices to enable those devices to send
interrupts to the processor.

Now let's talk about the types of interrupts. Broadly speaking, we can split interrupts into 2
major classes:

    External or hardware-generated interrupts
    Software-generated interrupts

The first, external interrupts, are received through the Local APIC or pins on the processor
which are connected to the Local APIC. The second, software-generated interrupts, are caused by
an exceptional condition in the processor itself (sometimes using special architecture-specific
instructions). A common example of an exceptional condition is division by zero. Another example
is exiting a program with the syscall instruction.

As mentioned earlier, an interrupt can occur at any time for a reason which the code and CPU
have no control over. On the other hand, exceptions are synchronous with program execution and
can be classified into 3 categories:

    Faults
    Traps
    Aborts

A fault is an exception reported before the execution of a "faulty" instruction (which can then
be corrected). If it is corrected, the interrupted program can resume.

Next, a trap is an exception which is reported immediately following the execution of the trap
instruction. Traps also allow the interrupted program to be continued, just as a fault does.

Finally, an abort is an exception that does not always report the exact instruction which caused
the exception and does not allow the interrupted program to be resumed.

Also, we already know from the previous part that interrupts can be classified as maskable and
non-maskable. Maskable interrupts are interrupts which can be blocked and unblocked with the
following two x86_64 instructions: cli and sti. We can find them in the Linux kernel source
code:

  static inline void native_irq_disable(void)
  {
          asm volatile("cli": : :"memory");
  }

and

  static inline void native_irq_enable(void)
  {
          asm volatile("sti": : :"memory");
  }

These two instructions modify the IF flag bit within the flags register (RFLAGS). The sti
instruction sets the IF flag and the cli instruction clears it. Non-maskable interrupts are
always reported. Usually any failure in the hardware is mapped to such non-maskable interrupts.

If multiple exceptions or interrupts occur at the same time, the processor handles them in order
of their predefined priorities. The following table lists the priorities from the highest to the
lowest:

  +----------------------------------------------------------------+
  |              |                                                 |
  |   Priority   | Description                                     |
  |              |                                                 |
  +--------------+-------------------------------------------------+
  |              | Hardware Reset and Machine Checks               |
  |     1        | - RESET                                         |
  |              | - Machine Check                                 |
  +--------------+-------------------------------------------------+
  |              | Trap on Task Switch                             |
  |     2        | - T flag in TSS is set                          |
  |              |                                                 |
  +--------------+-------------------------------------------------+
  |              | External Hardware Interventions                 |
  |              | - FLUSH                                         |
  |     3        | - STOPCLK                                       |
  |              | - SMI                                           |
  |              | - INIT                                          |
  +--------------+-------------------------------------------------+
  |              | Traps on the Previous Instruction               |
  |     4        | - Breakpoints                                   |
  |              | - Debug Trap Exceptions                         |
  +--------------+-------------------------------------------------+
  |     5        | Nonmaskable Interrupts                          |
  +--------------+-------------------------------------------------+
  |     6        | Maskable Hardware Interrupts                    |
  +--------------+-------------------------------------------------+
  |     7        | Code Breakpoint Fault                           |
  +--------------+-------------------------------------------------+
  |     8        | Faults from Fetching Next Instruction           |
  |              | Code-Segment Limit Violation                    |
  |              | Code Page Fault                                 |
  +--------------+-------------------------------------------------+
  |              | Faults from Decoding the Next Instruction       |
  |              | Instruction length > 15 bytes                   |
  |     9        | Invalid Opcode                                  |
  |              | Coprocessor Not Available                       |
  |              |                                                 |
  +--------------+-------------------------------------------------+
  |     10       | Faults on Executing an Instruction              |
  |              | Overflow                                        |
  |              | Bound error                                     |
  |              | Invalid TSS                                     |
  |              | Segment Not Present                             |
  |              | Stack fault                                     |
  |              | General Protection                              |
  |              | Data Page Fault                                 |
  |              | Alignment Check                                 |
  |              | x87 FPU Floating-point exception                |
  |              | SIMD floating-point exception                   |
  |              | Virtualization exception                        |
  +--------------+-------------------------------------------------+

Now that we know a little about the various types of interrupts and exceptions, it is time to
move on to a more practical part. We start with the description of the Interrupt Descriptor
Table. As mentioned earlier, the IDT stores the entry points of the interrupt and exception
handlers. The IDT is similar in structure to the Global Descriptor Table which we saw in the
second part of the Kernel booting process, but it has some differences. Instead of descriptors,
the IDT entries are called gates. On the x86 architecture, an entry can be one of the following
gates:

    Interrupt gates
    Task gates
    Trap gates

On x86_64, only long-mode interrupt gates and trap gates can be referenced. Like the Global
Descriptor Table, the Interrupt Descriptor Table is an array of 8-byte gates on x86 and an array
of 16-byte gates on x86_64. We can remember from the second part of the Kernel booting process
that the Global Descriptor Table must contain a NULL descriptor as its first element. Unlike the
Global Descriptor Table, the first element of the Interrupt Descriptor Table may be a gate; a
NULL entry is not mandatory. For example, you may remember that we loaded the Interrupt
Descriptor Table with NULL gates only in the earlier part, while transitioning into protected
mode:

  /*
   * Set up the IDT
   */
  static void setup_idt(void)
  {
          static const struct gdt_ptr null_idt = {0, 0};
          asm volatile("lidtl %0" : : "m" (null_idt));
  }

This code is from arch/x86/boot/pm.c. The Interrupt Descriptor Table can be located anywhere in
the linear address space, and its base address must be aligned on an 8-byte boundary on x86 or a
16-byte boundary on x86_64. The base address of the IDT is stored in a special register, the
IDTR. There are two instructions on x86-compatible processors to access the IDTR register:

    LIDT
    SIDT

The first instruction, LIDT, loads the base address and limit of the IDT from the specified
operand into the IDTR. The second instruction, SIDT, reads the contents of the IDTR and stores
them into the specified operand. The IDTR register is 48 bits wide on x86 and contains the
following information:

  +-----------------------------------+----------------------+
  |                                   |                      |
  |     Base address of the IDT       |   Limit of the IDT   |
  |                                   |                      |
  +-----------------------------------+----------------------+
  47                                16 15                    0

Looking at the implementation of setup_idt, we prepared a null_idt and loaded it into the IDTR
register with the lidt instruction. Note that null_idt has the gdt_ptr type, which is defined as:
  struct gdt_ptr {
          u16 len;
          u32 ptr;
  } __attribute__((packed));

Here we can see the definition of the structure with two fields of 2 bytes and 4 bytes (48 bits
in total), matching the diagram. Now let's look at the structure of the IDT entries. On x86_64
the IDT is an array of 16-byte entries which are called gates. They have the following
structure:

  127                                                                             96
  +-------------------------------------------------------------------------------+
  |                                                                               |
  |                                Reserved                                       |
  |                                                                               |
  +-------------------------------------------------------------------------------+
  95                                                                              64
  +-------------------------------------------------------------------------------+
  |                                                                               |
  |                                 Offset 63..32                                 |
  |                                                                               |
  +-------------------------------------------------------------------------------+
  63                               48 47      46  44   42    39             34    32
  +-------------------------------------------------------------------------------+
  |                                  |       |  D  |   |     |      |   |   |     |
  |       Offset 31..16              |   P   |  P  | 0 |Type |0 0 0 | 0 | 0 | IST |
  |                                  |       |  L  |   |     |      |   |   |     |
  +-------------------------------------------------------------------------------+
  31                                   16 15                                      0
  +-------------------------------------------------------------------------------+
  |                                      |                                        |
  |          Segment Selector            |                 Offset 15..0           |
  |                                      |                                        |
  +-------------------------------------------------------------------------------+

To form an index into the IDT, the processor scales the exception or interrupt vector by
sixteen. The processor handles the occurrence of exceptions and interrupts just like it handles
procedure calls made with the call instruction. The processor uses the unique vector number of
the interrupt or exception as the index to find the necessary Interrupt Descriptor Table entry.
Now let's take a closer look at an IDT entry.

As we can see, the IDT entry on the diagram consists of the following fields:

    bits 0-15 - the first part (bits 15..0) of the offset of the interrupt handler's entry
    point within the code segment chosen by the segment selector;
    bits 16-31 - the segment selector of the code segment which contains the entry point of the
    interrupt handler;
    IST - a new special mechanism in x86_64, which is described below;
    DPL - Descriptor Privilege Level;
    P - Segment Present flag;
    bits 48-63 - the second part (bits 31..16) of the handler offset;
    bits 64-95 - the third part (bits 63..32) of the handler offset;
    bits 96-127 - the last bits, reserved by the CPU.

The remaining Type field describes the type of the IDT entry. There are three different kinds of
gates for interrupts:

    Interrupt gate
    Trap gate
    Task gate
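The field layout above can be made concrete with a small user-space sketch. It is not kernel code: the names demo_gate, demo_pack_gate, demo_gate_offset and demo_gate_address are hypothetical, and the type values 0xE (interrupt gate) and 0xF (trap gate) are the long-mode encodings assumed here. The sketch splits a 64-bit handler address across the three offset fields, packs IST, Type, DPL and P into bits 32..47, and scales a vector number by sixteen to locate a gate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space mirror of a 16-byte x86_64 IDT gate. */
struct demo_gate {
    uint16_t offset_low;     /* bits 0..15:   handler offset 15..0   */
    uint16_t segment;        /* bits 16..31:  code segment selector  */
    uint16_t bits;           /* bits 32..47:  IST, Type, DPL, P      */
    uint16_t offset_middle;  /* bits 48..63:  handler offset 31..16  */
    uint32_t offset_high;    /* bits 64..95:  handler offset 63..32  */
    uint32_t reserved;       /* bits 96..127: reserved               */
} __attribute__((packed));

static_assert(sizeof(struct demo_gate) == 16, "a long-mode gate is 16 bytes");

#define DEMO_GATE_INTERRUPT 0xEu  /* assumed long-mode interrupt gate type */
#define DEMO_GATE_TRAP      0xFu  /* assumed long-mode trap gate type      */

/* Build a present gate from a handler address and its attributes. */
static struct demo_gate demo_pack_gate(uint64_t handler, uint16_t selector,
                                       unsigned ist, unsigned type, unsigned dpl)
{
    struct demo_gate g;

    g.offset_low    = (uint16_t)(handler & 0xFFFF);
    g.segment       = selector;
    g.bits          = (uint16_t)((ist & 0x7u)           /* bits 32..34 */
                               | ((type & 0xFu) << 8)   /* bits 40..43 */
                               | ((dpl & 0x3u) << 13)   /* bits 45..46 */
                               | (1u << 15));           /* bit 47: P   */
    g.offset_middle = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high   = (uint32_t)(handler >> 32);
    g.reserved      = 0;
    return g;
}

/* Reassemble the 64-bit handler offset from its three parts. */
static uint64_t demo_gate_offset(const struct demo_gate *g)
{
    return (uint64_t)g->offset_low
         | ((uint64_t)g->offset_middle << 16)
         | ((uint64_t)g->offset_high << 32);
}

/* Address of the gate for a vector: the vector is scaled by sixteen. */
static uint64_t demo_gate_address(uint64_t idt_base, unsigned vector)
{
    return idt_base + (uint64_t)vector * sizeof(struct demo_gate);
}
```

For example, packing an arbitrary handler address with selector 0x10, IST index 1 and DPL 0 as an interrupt gate yields 0x8E01 in the bits field, and demo_gate_offset returns the original address.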
The IST or Interrupt Stack Table is a new mechanism in x86_64. It is used as an alternative to
the legacy stack-switch mechanism. Previously the x86 architecture provided a mechanism to
automatically switch stack frames in response to an interrupt. The IST is a modified version of
the x86 stack switching mode. This mechanism unconditionally switches stacks when it is enabled,
and it can be enabled for any interrupt through the IDT entry associated with that interrupt (we
will soon see it). From this we can understand that IST is not necessary for all interrupts;
some interrupts can continue to use the legacy stack switching mode. The IST mechanism provides
up to seven IST pointers in the Task State Segment or TSS, which is a special structure that
contains information about a process. The TSS is used for stack switching during the execution
of an interrupt or exception handler in the Linux kernel. Each pointer is referenced by an
interrupt gate from the IDT.

The Interrupt Descriptor Table is represented by an array of gate_desc structures:

  extern gate_desc idt_table[];

where gate_desc is a typedef for struct gate_struct, which is defined in
arch/x86/include/asm/desc_defs.h:

  struct gate_struct {
          u16             offset_low;
          u16             segment;
          struct idt_bits bits;
          u16             offset_middle;
  #ifdef CONFIG_X86_64
          u32             offset_high;
          u32             reserved;
  #endif
  } __attribute__((packed));

Each active thread has its own kernel stack on x86_64. The size of this stack is defined by
THREAD_SIZE:

  #define PAGE_SHIFT      12
  #define PAGE_SIZE       (_AC(1,UL) << PAGE_SHIFT)
  ...
  ...
  ...
  #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

The PAGE_SIZE is 4096 bytes and the THREAD_SIZE_ORDER depends on the KASAN_STACK_ORDER. The
KASAN_STACK_ORDER in turn depends on the CONFIG_KASAN kernel configuration parameter and is
defined as:

  #ifdef CONFIG_KASAN
      #define KASAN_STACK_ORDER 1
  #else
      #define KASAN_STACK_ORDER 0
  #endif

KASan is a runtime memory debugger. Thus, the THREAD_SIZE will be 16384 bytes if CONFIG_KASAN is
disabled or 32768 bytes if this kernel configuration option is enabled. These stacks contain
useful data as long as a thread is alive or in a zombie state. While the thread is in user
space, the kernel stack is empty except for the thread_info structure (details about this
structure are available in the fourth part of the Linux kernel initialization process) at the
end of the stack. The active or zombie threads aren't the only threads with their own stack.
There also exist specialized stacks that are associated with each available CPU. These stacks
are active when the kernel is executing on that CPU. When user space is executing on the CPU,
these stacks do not contain any useful information. Each CPU has a few special per-cpu stacks as
well. The first is the interrupt stack used for the external hardware interrupts. Its size is
determined as follows:

  #define IRQ_STACK_ORDER (2 + KASAN_STACK_ORDER)
  #define IRQ_STACK_SIZE (PAGE_SIZE << IRQ_STACK_ORDER)

That is, 16384 bytes when CONFIG_KASAN is disabled. The per-cpu interrupt stack is represented
by the irq_stack struct and the fixed_percpu_data struct in the Linux kernel for x86_64:

  #ifdef CONFIG_X86_64
  struct fixed_percpu_data {
          /*
           * GCC hardcodes the stack canary as %gs:40. Since the
           * irq_stack is the object at %gs:0, we reserve the bottom
           * 48 bytes of the irq stack for the canary.
           */
          char            gs_base[40];
          unsigned long   stack_canary;
  };
  #endif
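As a rough check of the numbers in this part, the following user-space sketch recomputes the stack sizes and the canary offset. It is only an illustration: demo_fixed_percpu_data and demo_stack_size are hypothetical names, the kasan_enabled argument stands in for the CONFIG_KASAN option, and the struct assumes the two fields (a 40-byte gs_base followed by an unsigned long stack_canary) that the text describes.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)

static_assert(DEMO_PAGE_SIZE == 4096, "a page is 4096 bytes");

/* Hypothetical stand-in for the kernel's fixed_percpu_data layout. */
struct demo_fixed_percpu_data {
    char          gs_base[40];
    unsigned long stack_canary;
};

/* The canary must sit 40 bytes above the base that gs points to. */
static_assert(offsetof(struct demo_fixed_percpu_data, stack_canary) == 40,
              "stack canary is expected at offset 40");

/*
 * THREAD_SIZE and IRQ_STACK_SIZE use the same formula in the text:
 * PAGE_SIZE << (2 + KASAN_STACK_ORDER), where the order is 1 with
 * KASan enabled and 0 otherwise.
 */
static unsigned long demo_stack_size(int kasan_enabled)
{
    unsigned int kasan_stack_order = kasan_enabled ? 1 : 0;

    return DEMO_PAGE_SIZE << (2 + kasan_stack_order);
}
```

With kasan_enabled set to 0 the function returns 16384, and with 1 it returns 32768, matching the sizes quoted above.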
The irq_stack struct contains a 16-kilobyte array. Also, you can see that fixed_percpu_data
contains two fields:

    gs_base - The gs register always points to the bottom of the fixed_percpu_data. On x86_64,
    the gs register is shared by the per-cpu area and the stack canary (you can read more about
    per-cpu variables in the special part). All per-cpu symbols are zero-based and gs points to
    the base of the per-cpu area. You already know that the segmented memory model is abolished
    in long mode, but we can set the base address for the two segment registers, fs and gs, with
    Model Specific Registers, and these registers can still be used as address registers. If you
    remember the first part of the Linux kernel initialization process, we set the gs register
    there:

            movl    $MSR_GS_BASE,%ecx
            movl    initial_gs(%rip),%eax
            movl    initial_gs+4(%rip),%edx
            wrmsr

    where initial_gs points to the fixed_percpu_data:

      SYM_DATA(initial_gs,        .quad INIT_PER_CPU_VAR(fixed_percpu_data))

    stack_canary - The stack canary for the interrupt stack is a stack protector used to verify
    that the stack hasn't been overwritten. Note that gs_base is a 40-byte array. GCC requires
    the stack canary to be at a fixed offset from the base of gs, and this offset must be 40 on
    x86_64 and 20 on x86.

The fixed_percpu_data is the first datum in the per-cpu area. We can see it in the System.map:

  0000000000000000   D   __per_cpu_start
  0000000000000000   D   fixed_percpu_data
  00000000000001e0   A   kexec_control_code_size
  0000000000001000   D   cpu_debug_store
  0000000000002000   D   irq_stack_backing_store
  0000000000006000   D   cpu_tss_rw

We can see its definition in the code:

  DECLARE_PER_CPU_FIRST(struct fixed_percpu_data, fixed_percpu_data) __visible;

Now it's time to look at the initialization of the fixed_percpu_data. Besides the
fixed_percpu_data definition, we can see the definition of the following per-cpu variables in
arch/x86/include/asm/processor.h:

  DECLARE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
  ...
  DECLARE_PER_CPU(unsigned int, irq_count);
  ...
  /* Per CPU softirq stack pointer */
  DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);

The first and third are the stack pointers for hardware and software interrupts. As the names of
the variables suggest, they point to the top of the stacks. The second, irq_count, is used to
check whether a CPU is already on an interrupt stack or not. Initialization of the
hardirq_stack_ptr is located in the irq_init_percpu_irqstack function in
arch/x86/kernel/irq_64.c:

  int irq_init_percpu_irqstack(unsigned int cpu)
  {
          if (per_cpu(hardirq_stack_ptr, cpu))
                  return 0;
          return map_irq_stack(cpu);
  }

Here we go over all the CPUs one by one and set up the hardirq_stack_ptr. If it is not set yet,
map_irq_stack is called to initialize it, so that it points into the irq_stack_backing_store of
the given CPU at an offset of IRQ_STACK_SIZE, either with guard pages or without them when KASan
is enabled.

After the initialization of the interrupt stack, we need to initialize the gs register within
  0000000000009000   D   gdt_page
                                                                                               arch/x86/kernel/cpu/common.c:
    void load_percpu_segment(int cpu)
    {
            ...
            ...
            ...
            __loadsegment_simple(gs, 0);
            wrmsrl(MSR_GS_BASE, cpu_kernelmode_gs_base(cpu));
            ...
            load_stack_canary_segment();
    }

and as we already know the gs register points to the bottom of the interrupt stack.

    movl     $MSR_GS_BASE,%ecx
    movl     initial_gs(%rip),%eax
    movl     initial_gs+4(%rip),%edx
    wrmsr

    SYM_DATA(initial_gs,
    .quad INIT_PER_CPU_VAR(fixed_percpu_data))

Here we can see the wrmsr instruction, which loads the data from edx:eax into the
Model specific register selected by the ecx register. In our case the model specific register
is MSR_GS_BASE, which contains the base address of the memory segment pointed to by
the gs register. edx:eax holds the value stored in initial_gs, which is the base
address of our fixed_percpu_data.

We already know that x86_64 has a feature called Interrupt Stack Table or IST and
this feature provides the ability to switch to a new stack for events like a non-maskable
interrupt, double fault, etc. There can be up to seven IST entries per-cpu. Some of them
are:

    DOUBLEFAULT_STACK
    NMI_STACK
    DEBUG_STACK
    MCE_STACK

or

    #define DOUBLEFAULT_STACK 1
    #define NMI_STACK 2
    #define DEBUG_STACK 3
    #define MCE_STACK 4

All interrupt-gate descriptors, which switch to a new stack with the IST, are initialized
within the idt_setup_from_table function. That function initializes every gate descriptor
within the struct idt_data def_idts[] array. For example:

    static const __initconst struct idt_data def_idts[] = {
            ...
            INTG(X86_TRAP_NMI,              nmi),
            ...
            INTG(X86_TRAP_DF,               double_fault),
            ...
    };

where nmi and double_fault are entry points created at arch/x86/entry/entry_64.S:

    idtentry double_fault                      do_double_fault                     has_er
    ...
    ...
    ...
    SYM_CODE_START(nmi)
    ...
    ...
    ...
    SYM_CODE_END(nmi)

for the given interrupt handlers declared at arch/x86/include/asm/traps.h:

    asmlinkage void nmi(void);
    asmlinkage void double_fault(void);

When an interrupt or an exception occurs, the new ss selector is forced to NULL and the
ss selector’s rpl field is set to the new cpl. The old ss, rsp, register flags, cs, rip
are pushed onto the new stack. In 64-bit mode, the size of interrupt stack-frame pushes is
fixed at 8 bytes, so we get the following stack:

    +---------------+
    |               |
    |      SS       |    40
    |      RSP      |    32
    |     RFLAGS    |    24
    |      CS       |    16
    |      RIP      |    8
    |   Error code  |    0
    |               |
    +---------------+
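
To make the offsets of the stack frame concrete, here is a minimal sketch in C that models the frame as a struct of 8-byte slots. The names intr_frame_sketch and saved_rip are hypothetical and exist only for this illustration; they are not kernel identifiers. The sketch also assumes the case where the CPU pushes an error code, which then occupies the lowest slot at offset 0 with rip at offset 8, below the cs slot at offset 16.

```c
#include <assert.h>   /* static_assert, assert */
#include <stddef.h>   /* offsetof */
#include <stdint.h>   /* uint64_t */

/*
 * Hypothetical illustration type, not a kernel structure.
 * Fields are listed from the lowest address (where rsp points after the
 * CPU has finished pushing) up to the highest address.
 * Assumption: the exception pushes an error code.
 */
struct intr_frame_sketch {
    uint64_t error_code; /* offset 0  */
    uint64_t rip;        /* offset 8:  where the interrupted code resumes */
    uint64_t cs;         /* offset 16 */
    uint64_t rflags;     /* offset 24 */
    uint64_t rsp;        /* offset 32: stack pointer of the interrupted code */
    uint64_t ss;         /* offset 40 */
};

/* Every push is 8 bytes wide, so the offsets follow directly. */
static_assert(offsetof(struct intr_frame_sketch, rip) == 8, "RIP slot");
static_assert(offsetof(struct intr_frame_sketch, cs) == 16, "CS slot");
static_assert(offsetof(struct intr_frame_sketch, rflags) == 24, "RFLAGS slot");
static_assert(offsetof(struct intr_frame_sketch, rsp) == 32, "RSP slot");
static_assert(offsetof(struct intr_frame_sketch, ss) == 40, "SS slot");
static_assert(sizeof(struct intr_frame_sketch) == 48, "six 8-byte slots");

/* Read the saved instruction pointer from a frame that starts at sp. */
static inline uint64_t saved_rip(const void *sp)
{
    const struct intr_frame_sketch *frame = sp;
    return frame->rip;
}
```

The static assertions are checked at compile time, so the file only builds if the struct layout matches the offsets in the diagram. For an interrupt or exception without an error code, the lowest slot would be absent and every offset would shift down by 8.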
That's all.
Conclusion
It is the end of the first part of Interrupts and Interrupt Handling in the Linux kernel. We
covered some theory and the first steps of initialization of stuff related to interrupts and
exceptions. In the next part we will continue to dive into the more practical aspects of
interrupts and interrupt handling.
Please note that English is not my first language, and I am really sorry for any
inconvenience. If you find any mistakes, please send me a PR to linux-insides.
Links
       PIC
       Advanced Programmable Interrupt Controller
       protected mode
       long mode
       kernel stacks
       Task State Segment