The Intel Assembly Manual - CodeProject
All in one: x86, x64, Virtualization, multiple cores, along with new additions
Introduction
  This is my full and final article about Intel assembly. It includes all the previous hardware articles (Internals, Virtualization,
  Multicore, DMMI) along with some new information (HIMEM.SYS, Flat mode, EMM386.EXE, Expanded Memory, DPMI
  information).
  Reading this through will enable you to understand how the operating systems work, how the memory is allocated and addressed
  and, perhaps how to make your own OS-level drivers and applications.
  To help you understand what's happening, the github project includes many aspects of the article (and I'm still adding stuff). It's a
  ready-to-run tool which includes a Bochs binary, VMWare and VirtualBox configurations and a Visual Studio solution. The entire
  project is built in assembly using Flat Assembler.
Assemblers like TASM or MASM will not work, for they only support specific architectures.
  Bochs is the best environment to experiment, because it includes a hardware GUI debugger (I'm proud of developing it myself)
  which can help you understand the internals. Debugging without Bochs is impossible, because the debuggers are either real mode
  only (like MSDOS Debug) and assume you will always have some sort of control (which is not the case in most debugging areas),
  or are able to run only in an existing environment (like Visual Studio).
  If you have good C knowledge, then this will be a benefit in understanding the internals. Assembly knowledge is recommended, but
  you can follow the article even if you know nothing about assembly.
Generic Information
https://www.codeproject.com/Articles/1273844/The-Intel-Assembly-Manual?display=Print                                                       1/28
10/01/2019                                                  The Intel Assembly Manual - CodeProject
  The CPU is the unit that executes assembly instructions. The way they are executed depends on the running mode of the
  processor, and there are 4 modes:
         Real mode
         Protected mode (in two versions, segmented and flat)
         Long mode
         Virtualization (not exactly a mode, but we will talk about it later)
The next paragraphs in this chapter discuss various elements of the assembly language in general.
  Memory
  Physically, the memory is one big array. If you have 4GB, you could describe it as unsigned char mem[4294967296].
  However, the way it is used greatly differs depending on the processor mode and the configuration of the operating system.
  Therefore, you do not access it as a big array.
This is (oversimplified for now) what approximately happens in assembly with a function:
   x:
   mov   ax,[first stack element]
   mov   bx,[second stack element]
   add   ax,bx
   ret   4
   main:
   push 5
   push 10 ; the order is different, but let's forget about that now
   call x
   ; ax contains the result
  The variables "a" and "b" are "pushed" to temporary memory (which is now 4 bytes less if int = 16 bits). The function is called, and
  then it returns with the stack cleared and ax containing the return value. Note that the above is a big oversimplification of what the
  assembly code actually looks like, but let's pass for now.
  Registers
  In addition to memory, each CPU has some auxiliary places to store data, called registers. Which registers are available depends
  on the current running mode. Some registers have special meanings, some are for generic purposes.
  Interrupts
  An interrupt is a piece of code that interrupts other running code. For the moment, just assume it's a function that can run while you
  are inside another function. There are interrupts that are automatically generated by the CPU, and interrupts that are "called" by
  software. The way they work depends on the running mode, and there can be a maximum of 256 interrupts (numbered 0 to 255).
  Exceptions
  An exception is an interrupt triggered by either the CPU (for example, when a divide by zero occurs in your C++ code, int 00
  functions are executed), or by using the API (via the throw keyword, for example), which generates a software interrupt. In the
  lower level we are discussing, there is no difference between exceptions and interrupts.
Now that we have an idea of the basics, let's proceed to CPU modes.
Real Mode
  Architecture
  Real mode is the oldest mode. DOS runs in it. Windows 3.0 also runs in it when started with the /r switch. Everything is 16 bit. It is
  the weakest mode of operation, but not the simplest one. Memory is addressed through a 20 bit address bus, making it possible to access
  up to 1MB of memory. Available memory over this limit is useless in real mode.
  Segmentation
  Memory is not accessed as an array, but in segments. Each pointer is described by a 16 bit segment, which is a memory address
  divided by 16, and a 16 bit offset, which describes how far from the start of the segment we will go. So we will see some simple (in hex) examples:
  We can see that segments can overlap. Specifying the 0ffffh segment and an offset of 0010h or larger results in wrapping. A segment's
  maximum capacity is 64KB. Although we can go up to segment FFFF, only the lower 640KB were available for DOS applications,
  because the upper segments (from A000 up) were reserved for video memory and the BIOS.
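  As a rough illustration of the segment:offset arithmetic described above, here is a small C sketch. The function name
  real_mode_address is made up for this example (it is not part of the github project), and it models only the 20 address lines
  of the original 8086/8088.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only: turns a real mode
   segment:offset pair into the 20-bit address an 8086 puts on the bus. */
uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset; /* segment * 16 + offset */
    return linear & 0xFFFFFu;                            /* only 20 address lines: wrap at 1MB */
}
```

  With this sketch, 0000:0010 and 0001:0000 both resolve to 00010h, which is the overlap mentioned above, while FFFF:0010
  wraps around to 00000h.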
  All segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any
  segment). Any application can read from or write to any part of memory, including the part in which the OS resides. That is why a
  real mode OS is a single tasking OS and if one app crashes, you have to reboot.
Registers
         Four general purpose registers: AX, BX, CX, DX. The upper 8 bit part of them can be accessed as AH, BH, CH, DH and the
         lower part as AL, BL, CL, DL.
         A register to hold the offset of the currently executing code: IP.
         Four registers to be used as pointers: SI, DI, BP, SP. SP points to the end of the available stack memory. Each time we push
         something to the stack, SP decreases. On POP, SP increases. These registers have no 8 bit splits.
         Four registers to contain segments: CS, holding always the segment of the currently executing code, DS,ES and SS. SS
         holds the segment of the stack memory, DS holds the segment of the data, and ES is an auxiliary register.
The 386 CPU adds more registers, also accessible in real mode:
         32 bit extensions to the non segment registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP.
         Two more auxiliary segment registers, GS and FS.
         Control registers CR0, CR2 and CR3 (CR1 is reserved, and CR4 was added with later processors).
         6 debug registers, DR0, DR1, DR2, DR3, DR6, DR7, used for hardware breakpoints.
  ESI, EDI, EBP and ESP can be used as pointers. If their high bits are not zero, then an exception occurs (unless you are in Unreal
  mode, discussed below).
  An EXE file might have multiple segments, so an EXE can be more than 64KB. DS and ES initially point to the PSP. When an EXE
  is loaded, "relocations" are resolved. A relocation is a position within the executable that the assembler leaves as empty, to be filled
  with a segment value which would only be known at run time.
  Interrupts
  All the functions that DOS and BIOS provide are available through real mode software interrupts. In real mode, the first 1024 bytes
  of RAM (starting at 0000:0000) contain a table of 256 segment:offset pointers, one for each interrupt. In 286+ this location can be changed
  by the LIDT instruction, which points to a 6 byte array:
         Bytes 0-1 contain the full length of the IDT, maximum 1KB => 256 entries.
         Bytes 2-5 contain the physical address of the first entry of the IDT, in memory.
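  To make the layout of the real mode interrupt table a bit more concrete, the following C sketch reads one entry from it.
  The far_ptr type and the read_vector function are hypothetical names chosen for this example, and ivt is assumed to be a
  plain copy of the first 1024 bytes of RAM rather than live memory.

```c
#include <stdint.h>

/* Hypothetical type and helper, for illustration only. */
typedef struct {
    uint16_t offset;
    uint16_t segment;
} far_ptr;

/* 'ivt' is assumed to be a copy of the 1024 bytes starting at 0000:0000. */
far_ptr read_vector(const uint8_t *ivt, uint8_t number)
{
    const uint8_t *entry = ivt + (uint32_t)number * 4;   /* 4 bytes per interrupt */
    far_ptr p;
    p.offset  = (uint16_t)(entry[0] | (entry[1] << 8));  /* low word: offset (little endian) */
    p.segment = (uint16_t)(entry[2] | (entry[3] << 8));  /* high word: segment */
    return p;
}
```

  Calling read_vector(ivt, 0x21) on such a copy would give the segment:offset of the handler that int 21h jumps to.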
Some interrupts are automatically issued by the processor when some event occurs. In real mode, the most significant are:
Software interrupts provide various services to real mode apps. The most important interrupts are:
Using the excellent Ralf Brown Interrupt List you can learn about every interrupt in the world.
  Models
  Because of the segmented memory, different sets of programming models were created, which mostly resulted in incompatibilities
  between compilers and libraries. C pointers were described as near or far, depending on whether they included a segment or not:
         The tiny model. Everything has to be included in a single segment (COM file). Pointers are near.
         The small model. One segment for the code, one for the data. All pointers are near.
         The medium model. One data segment, multiple code segments. Code pointers far, data pointers near.
         The compact model. One code segment, multiple data segments. Code pointers near, data pointers far.
         The large model. Multiple code and data segments, code and data pointers far. Single data structures still limited to 64KB.
         The huge model. Multiple code and data segments, all pointers far; data pointers are normalized ("huge"), so a single data
         structure can exceed 64KB.
  Benefits
  The only benefit in real mode is that you have DOS and BIOS functions available as software interrupts. Therefore, all techniques
  used by DOS extenders (which allowed applications to run in protected mode) involved temporarily switching to real mode to call
  DOS.
  Here is a quick hello world in tiny model:
  This very simple program calls two DOS functions. The first is function 9 (ah register) which accepts a pointer of the string to be
  written to the screen in DS:DX (DS already has the segment, it's a com file). The second is function 4C, which terminates the
  program.
         ShowMsg:
             mov ax,DATA16
             mov ds,ax             ; Load DS with our "default data segment"
             mov ax,0900h
             mov dx,Msg
             int 21h               ; Call a DOS function: AX = 0900h (Show Message),
                                   ; DS:DX = address of a buffer, int 21h = show message
             retf                  ; FAR return; we were called from
                                   ; another segment so we must pop IP and CS.
         Main:
             mov ax,CODE16_2
             mov es,ax
             call far [es:ShowMsg] ; Call a procedure in another segment.
                                   ; CS/IP are pushed to the stack.
             mov ax,4c00h          ; Call a DOS function: AX = 4c00h (Exit), int 21h = exit
             int 21h
  This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to
  display text.
  How does the assembler know the actual value of the data16, code16, code16_2, and stack16 segments? It doesn't.
  It puts null values in their place and creates entries in the EXE file (known as "relocations"), so that the loader, once it has copied the
  code to memory, writes the true values of the segments at the specified addresses. Because this relocation map lives in the EXE
  header, which COM files do not have, COM files cannot have multiple segments even if they sum to less than 64KB in total.
  Problems
         Any program can overwrite any other program, so there is no multitasking capability.
         Up to 1MB memory only, and the upper 384K were reserved for video memory and the BIOS, so only 640K were available.
         Mixing far and near pointers between applications and libraries led to incompatibilities and, usually, crashes.
         If something goes wrong, the PC has to reboot.
  Expanded Memory
  To cope with the 640KB limitation, an additional kind of memory, called expanded memory or EMS memory, was created. This
  was not a processor feature, but rather a hardware extension (an ISA card) with a driver that performed bank switching,
  i.e. replaced portions of the installed memory with memory from that card. It offered up to 32MB more, but it was mapped into one of the
  high segments (A000, B000, C000, D000, E000 or F000), which means that this extra memory could not all be visible
  simultaneously. The driver that came with the expansion card had to be installed in config.sys and, using the LIM EMS protocol,
  offered its services via interrupt 67h.
   EMSName db 'EMMXXXX0',0
   mov dx,EMSName        ; device driver name
   mov ax,3D00h                 ; open device-access/file sharing mode
   int 21h
   jc   NotThere
   mov bx,ax                    ; put handle in proper place
   mov ax,4407h                 ; IOCTL - get output status
   int 21h
   jc   NotThere
   cmp al,0FFh
   jne NotThere
   mov ah,3Eh                   ; close device
   int 21h
   jmp ItIsThere
Allocating EMS
Release EMS
  A20 line
  We saw that the maximum address is FFFF:000F, because increasing the offset further results in wrapping. That is true because the 8088
  CPU has only 20 address lines. However 286+ added a 21st line (known as the A20 line) and, when it is enabled, FFFF:0010 to
  FFFF:FFFF can be used without wrapping (almost 64KB more). This memory (known as High Memory Area, HMA) is now
  accessible from real mode and it can be used by HIMEM.SYS to load parts of DOS in it and therefore make more low memory
  available for applications.
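  The effect of the A20 line on addressing can be sketched in C as follows. Both function names are invented for this example,
  and the model is deliberately simple: when A20 is disabled, bit 20 of the address is just forced to zero.

```c
#include <stdint.h>

/* Hypothetical helper: address seen on the bus of a 286+ for a real mode
   segment:offset pair, depending on the state of the A20 line. */
uint32_t address_with_a20(uint16_t segment, uint16_t offset, int a20_enabled)
{
    uint32_t linear = ((uint32_t)segment << 4) + offset;
    if (!a20_enabled)
        linear &= ~(1u << 20);   /* A20 forced low: bit 20 is dropped, so 1MB wraps to 0 */
    return linear;
}

/* Size in bytes of the High Memory Area, FFFF:0010 through FFFF:FFFF. */
uint32_t hma_size(void)
{
    return address_with_a20(0xFFFF, 0xFFFF, 1)
         - address_with_a20(0xFFFF, 0x0010, 1) + 1;
}
```

  In this model hma_size() comes out as 65520 bytes, which is the "almost 64KB" mentioned above.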
Enabling or disabling A20 manually requires us to communicate with the keyboard controller:
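   WaitKBC:
      mov cx,0ffffh
   A20L:
      in al,64h        ; read the keyboard controller status port
      test al,2        ; bit 1 set = input buffer still full
      loopnz A20L      ; retry until it is empty or CX runs out
      ret
   ChangeA20:
      call WaitKBC
      mov al,0d1h      ; command: write to the controller output port
      out 64h,al
      call WaitKBC
      mov al,0dfh      ; use 0dfh to enable and 0ddh to disable.
      out 60h,al
      ret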
Protected Mode
  Architecture
  Protected mode solves the real mode problems. In particular:
  DOS never ran in protected mode. Windows 3.0 ran in 16-bit segmented protected mode when started with the /s switch. Windows
  95+, Linux and the rest of the 32-bit OSes run in flat protected mode, but before looking at flat mode we will dive into the complex
  mechanisms that protected mode has. Flat mode greatly simplifies many complex things in normal segmented protected mode.
  Protected mode introduces "rings", that is, levels of authorization. There are four rings (Ring 0, 1, 2 and 3), of which Ring 0 is
  the most authorized and Ring 3 is the least authorized. Code running in a less privileged ring cannot access (without the OS
  supervision) code in a more privileged ring.
  Memory
  Segments are no longer fixed in place, nor do they have a fixed 64KB size. A protected mode segment can have any size, from 1
  byte to 4GB. Each segment has its own limitations (read, write, execute access) and its own protection ring.
  Registers
  The same set of registers that exist in real mode are available. Also, every register can be used as an index, for example mov
  ax,[ebx] will work.
  Bits                          Meaning
  0-15                          Limit low 16 bits
  16-31                         Base low 16 bits
  32-39                         Base medium 8 bits
  40                             Ac
  41                             RW
  42                             DC
  43                             Ex
  44                             S
  45-46                          Priv
  47                             Pr
  48-51                          Limit upper 4 bits
  52-53                          Reserved (0)
  54                             Sz
  55                             Gr
  56-63                          Base upper 8 bits
          The base is a 32-bit value that indicates the address in memory at which this segment starts.
          The limit is a 20-bit value indicating the length of the segment, depending on the Gr bit. If the Gr bit is 1, then the actual
          limit is the limit value * 4096.
          The Ex flag is 1, to indicate a code segment, or 0, to indicate a data segment.
          The DC flag has a different meaning, depending on the Ex flag:
                 For a code segment (Ex = 1), if DC is 0 then the segment is non conforming. A non conforming segment can only be
                 called from a segment with the same privilege level. If DC is 1 then the segment is conforming and can also be
                 called from segments with lower privilege. For example, ring 3 code can call into a ring 2 conforming
                 segment.
                 For a data segment (Ex = 0), if DC is 0 then the data segment expands up, else it expands down. An expanding
                 down segment starts from its limit and ends at its base, with the address going the reverse way. This flag was
                 created so a stack segment could be easily expanded, but it is not used today.
          The RW flag also has a different meaning, depending on the Ex flag:
                 For a code segment (Ex = 1), if 0, then the segment is not readable. If 1, then the code segment is readable.
                 For a data segment (Ex = 0), if 0, the segment is read only, else read-write.
                 Note that a code segment is not writable. However, because segment base addresses can overlap, you can create a
                 writable data segment with the same base address and limit as a code segment.
          The Sz flag can be:
                 0, in which case the default for opcodes is 16-bit. The segment can still execute 32-bit commands (386+) by putting
                 the 0x66 or 0x67 prefix to them.
                 1 (386+), in which case the default for opcodes is 32-bit. The segment can still execute 16-bit commands by putting
                 the 0x66 or 0x67 prefix to them.
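  To show how the fields in the table above fit together, here is a C sketch that packs a descriptor into a 64-bit value. The
  function make_descriptor and the way it groups bits 40-47 into one "access" byte and bits 52-55 into a "flags" nibble are
  choices made for this example only, not something taken from the github project.

```c
#include <stdint.h>

/* Hypothetical helper: packs one 8-byte segment descriptor.
   access = bits 40-47 (Ac, RW, DC, Ex, S, Priv, Pr)
   flags  = bits 52-55 (two reserved bits, Sz, Gr)              */
uint64_t make_descriptor(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags)
{
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFF);              /* bits 0-15:  limit low 16 bits  */
    d |= (uint64_t)(base & 0xFFFF) << 16;         /* bits 16-31: base low 16 bits   */
    d |= (uint64_t)((base >> 16) & 0xFF) << 32;   /* bits 32-39: base medium 8 bits */
    d |= (uint64_t)access << 40;                  /* bits 40-47: access byte        */
    d |= (uint64_t)((limit >> 16) & 0xF) << 48;   /* bits 48-51: limit upper 4 bits */
    d |= (uint64_t)(flags & 0xF) << 52;           /* bits 52-55: reserved, Sz, Gr   */
    d |= (uint64_t)((base >> 24) & 0xFF) << 56;   /* bits 56-63: base upper 8 bits  */
    return d;
}
```

  For example, make_descriptor(0, 0xFFFFF, 0x9A, 0xC) describes a present, ring 0, readable 32-bit code segment that starts
  at 0 and, because Gr is set, covers the whole 4GB.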
  In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. And you can put anything to them,
  no matter where it points. And you can read and write and execute from that segment. In protected mode, these registers are
  loaded with selectors. The selectors are indices to the GDT and have the following format:
  Bits    Meaning
  0-1     RPL. Requested protection level, must be equal or lower to the segment PL.
  2       0 to take the entry from GDT, 1 from the LDT (see below)
  3-15    0-based index to the table.
  In protected mode, you can't just select random values to the segment registers like in real mode. You must put valid values or you
  will get an exception. The exception is the first entry in the GDT table, which is always set to 0. CPU does not read information from
  entry 0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value to a segment register (DS, ES, FS,
  GS) without causing an exception.
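  The selector layout can be illustrated with a few lines of C. The selector_fields type and decode_selector function are
  hypothetical names used only for this sketch.

```c
#include <stdint.h>

/* Hypothetical type and helper, for illustration only. */
typedef struct {
    unsigned rpl;    /* bits 0-1: requested protection level */
    unsigned table;  /* bit 2: 0 = GDT, 1 = LDT              */
    unsigned index;  /* bits 3-15: 0-based index to the table */
} selector_fields;

selector_fields decode_selector(uint16_t selector)
{
    selector_fields f;
    f.rpl   = selector & 0x3;
    f.table = (selector >> 2) & 0x1;
    f.index = selector >> 3;
    return f;
}
```

  Because each table entry is 8 bytes long, a selector with RPL 0 that refers to the GDT is simply the byte offset of the entry:
  selector 0x08 decodes to index 1, and selector 0x1B decodes to index 3 with RPL 3.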
The GDT is loaded to the CPU by executing the LGDT instruction, which points to a 6-byte array:
         Bytes 0-1 contain the full length of the GDT, maximum 64KB => 8192 entries.
         Bytes 2-5 contain the physical address of the first entry of the GDT, in memory.
  Interrupts
  The interrupt table is now 8 bytes long for each defined interrupt, having the following structure:
   struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
   {
     .ofs0_15 dw ofs0_15
     .sel dw sel
     .zero db zero
     .flags db flags                      ; bit 7: P, bits 5-6: DPL, bit 4: 0, bits 0-3: gate type
     .ofs16_31 dw ofs16_31
   }
  Each interrupt also has a protection level. The LIDT instruction has the same functionality as in real mode, pointing to a 6 byte
  array (containing the size and the physical location of the first entry).
After the LIDT command is executed, real mode interrupts no longer work, so a real mode debugger is useless.
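  For readers more comfortable with C, the same 8-byte entry can be sketched as a struct plus a helper that fills it. The names
  idt_entry and make_gate are assumptions of this example; only the field layout follows the IDT_STR structure above.

```c
#include <stdint.h>

/* C mirror of the 8-byte IDT_STR layout (hypothetical names). */
typedef struct {
    uint16_t ofs0_15;   /* handler offset, low 16 bits        */
    uint16_t sel;       /* code segment selector of the handler */
    uint8_t  zero;      /* always 0                           */
    uint8_t  flags;     /* present bit, DPL and gate type     */
    uint16_t ofs16_31;  /* handler offset, high 16 bits       */
} idt_entry;

idt_entry make_gate(uint32_t handler, uint16_t code_selector, uint8_t flags)
{
    idt_entry e;
    e.ofs0_15  = (uint16_t)(handler & 0xFFFF);
    e.sel      = code_selector;
    e.zero     = 0;
    e.flags    = flags;  /* e.g. 0x8E: present, DPL 0, 32-bit interrupt gate */
    e.ofs16_31 = (uint16_t)(handler >> 16);
    return e;
}
```

  A call such as make_gate(handler_address, 0x08, 0x8E) would describe a ring 0 interrupt gate whose handler lives in the code
  segment selected by 0x08.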
         Task Segments
         Call Gates
         Interrupt Gates
         Trap Gates (same as interrupt gates, with the exception that when a trap occurs, interrupts are still enabled)
         0000 - Reserved
         0001 - Available 16-bit TSS
         0010 - Local Descriptor Table (LDT)
         0011 - Busy 16-bit TSS
         0100 - 16-bit Call Gate
         0101 - Task Gate
         0110 - 16-bit Interrupt Gate
         0111 - 16-bit Trap Gate
         1000 - Reserved
         1001 - Available 32-bit TSS
         1010 - Reserved
         1011 - Busy 32-bit TSS
         1100 - 32-bit Call Gate
         1101 - Reserved
         1110 - 32-bit Interrupt Gate
         1111 - 32-bit Trap Gate
  Call Gates
  Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code.
  You specify a 1100 type entry in the GDT with the following format:
   struct CALLGATE
   {
       unsigned short offs0_15;
       unsigned short selector;
       unsigned char argnum:5;   // number of arguments to copy to the stack from the current stack
       unsigned char r:3;        // Reserved
       unsigned char type:5;     // 1100
       unsigned char dpl:2;      // DPL of this gate
       unsigned char P:1;        // Present bit
       unsigned short offs16_31;
   };
  Using CALL FAR with the selector of this callgate (the offset is ignored) will switch to the gate and execute the higher level privilege
  commands. If argnum specifies parameters to be copied, the CPU pushes the caller's SS and ESP on the new stack, copies the
  parameters, and then pushes CS and EIP. Using RETF will return from the gate call.
Call gates are slow mechanisms to transit between rings in the CPU.
  In addition to the far call and jmp, a context switch can be triggered by using a Task Gate Descriptor. Unlike TSS Descriptors,
  task-gate descriptors can be in the GDT, LDT or IDT (so you can force a task switching when an interrupt occurs).
         Enable A20
         Set the GDT
         Set the IDT (if you need interrupts in protected mode)
         Enter protected mode with the MSW or the CR0 register.
You use the MSW register (in 286), or, in 386+ CR0:
   ; 386+
   mov eax,cr0
   or eax,1
   mov cr0,eax
   ; 286
   smsw ax
   or al,1
   lmsw ax
  After that, you must execute a far jump to a protected mode code segment in order to clear possible invalid command cache. If this
  code segment is a 16-bit code segment, you must do:
   cli
   mov eax,cr0
   and eax,0fffffffeh
   mov cr0,eax
   mov ax,data16
   mov ds,ax
   mov ax,stack16
   mov ss,ax
   mov sp,1000h ; assuming that stack16 is 1000h bytes in length
   mov bx,RealMemoryInterruptTableSavedWithSidt
   lidt [bx]
   sti
   ; (Real mode debugger works here) ...
  On the 286, you cannot get back to real mode this way, because LMSW cannot clear the protected mode flag; the only way back is a
  processor reset, which keeps the memory intact. Before forcing this reset, you register a routine to be executed after it with the following code:
   MOV ax,40h
   MOV es,ax
   MOV di,67h
   MOV al,8fh
   OUT 70h,al
   MOV ax,ShutdownProc
   STOSW
   MOV ax,cs
   STOSW
   MOV al,0ah
   OUT 71h,al
   MOV al,8dh
   OUT 70h,al
Problems
  While you can access all the memory directly, there is still a lot of segmentation and slow task switching or slow movement between
  rings.
  Paging
  Paging is the method to redirect a memory address to another address. The requested address is called linear address and the
  target address is called physical address. When a linear address is the same as a physical address, we say that we are in a "see
  through" area.
To accomplish paging, two tables are used: the page directory and the page table.
The Page Directory is an array of 1024 32-bit entries with the following format:
P,R,U,W,D,A,N,S,G,AA,Addr
         P - Page is present in memory. This flag allows the OS to swap pages out to disk, clear P, and reload them when a
         page fault is generated because software attempts to access the page.
         R - Page is Read Write if set, else Read only. This restriction applies only to ring 3 unless the WP bit in CR0 is set.
         U - If unset, only ring 0 can access this page.
         W - If set, write-through is enabled.
         D - If set, the page will not be cached. The CPU caches the page tables in its Translation Lookaside Buffer (TLB).
         A - Set when the page is accessed (not automatically, like the GDT bit).
         N - Set to 0.
         S - Set to 0. If Page Size Extensions (PSE) are enabled, S can be 1, in which case the page size is 4MB instead, and the
         pages must be 4MB aligned. This mode is introduced to avoid lots of small pages, at the expense of more memory wasted if
         the needed memory is somewhat larger than 4MB. Fortunately, modes can be mixed.
         G - Set to 0.
         Addr - The upper 20 bits (the lower 12 are ignored because it must be 4096-aligned) of the address of the Page Table that this Page
         Directory entry points to.
The Page Table is an array of 1024 32-bit entries with a similar format:
P,R,U,W,C,A,D,N,G,AA,Addr
To enable paging:
         Load CR3 with the address of the first entry in the Page Directory (must be 4096-aligned).
         Set CR0 bit 31. This requires protected mode, with the exception of LOADALL (see below).
  Once the tables are loaded, they are cached into TLB. Reloading the CR3 will reset the cache. 486+ also has an INVLPG
  instruction to reset only a particular page cache, not the entire TLB.
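  The two-level lookup described above can be sketched in C. The names linear_parts, split_linear and physical_address are
  made up for this example, and the sketch covers only plain 4KB pages (no PSE, no permission checks).

```c
#include <stdint.h>

/* Hypothetical type: the three parts of a 32-bit linear address. */
typedef struct {
    uint32_t dir_index;    /* bits 22-31: entry in the Page Directory */
    uint32_t table_index;  /* bits 12-21: entry in the Page Table     */
    uint32_t offset;       /* bits 0-11:  byte inside the 4KB page    */
} linear_parts;

linear_parts split_linear(uint32_t linear)
{
    linear_parts p;
    p.dir_index   = linear >> 22;
    p.table_index = (linear >> 12) & 0x3FF;
    p.offset      = linear & 0xFFF;
    return p;
}

/* Combines the Addr field of a Page Table entry with the page offset. */
uint32_t physical_address(uint32_t page_table_entry, uint32_t linear)
{
    return (page_table_entry & 0xFFFFF000u) | (linear & 0xFFFu);
}
```

  For the linear address 0x00403025, split_linear gives Page Directory entry 1, Page Table entry 3 and offset 0x025. If that
  Page Table entry holds 0x0009A007, the access ends up at physical address 0x0009A025.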
Architecture
The segmented protected mode is very complex. Using paging, protected mode can be "flat", enabling the following:
         All processes get a 4GB virtual address space. Protection is done at the paging level. All segments are 4GB, all segment
         selectors always point to the same segment.
         Programming is way simpler since only "near" pointers are needed.
         The OS can map shared libraries (residing once in physical memory) to multiple virtual destinations per application.
         The application only sees memory paged to its own virtual address space, so processes are protected by hardware.
  In addition, all modern OSes now use only 2 of the 4 protection rings, ring 0 for their kernel and ring 3 for all the user applications.
  Call gates are no longer used.
SYSENTER/SYSEXIT
  To make transitions between user mode (ring 3) and kernel mode (ring 0) faster, a method other than call gates had to be
  implemented. SYSENTER/SYSEXIT instructions are the current way to switch between ring 3 and ring 0. You use WRMSR to set
  the ring 0 values for CS (MSR 0x174), ESP (MSR 0x175) and EIP (MSR 0x176). ECX must hold the ring 3 stack pointer for SYSEXIT and EDX
  must hold the ring 3 EIP for SYSEXIT. The value stored for CS must select the first of four consecutive GDT entries: the first is the ring 0 code, the
  second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. This layout is fixed, so in order to use
  SYSENTER your GDT table must contain these entries in this order.
  These opcodes only support switching between ring 3 and ring 0, but they are much faster. They are used today instead of the
  much slower call gates.
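  The fixed selector layout can be written down as a small C sketch. The type sysenter_selectors and the function
  selectors_from_msr are hypothetical names; the arithmetic assumes the four descriptors sit in consecutive 8-byte GDT slots, as
  described above.

```c
#include <stdint.h>

/* Hypothetical type: the four selectors implied by one MSR value. */
typedef struct {
    uint16_t ring0_code;
    uint16_t ring0_data;
    uint16_t ring3_code;
    uint16_t ring3_data;
} sysenter_selectors;

/* sysenter_cs is the value written to MSR 0x174 with WRMSR. */
sysenter_selectors selectors_from_msr(uint16_t sysenter_cs)
{
    sysenter_selectors s;
    uint16_t base = sysenter_cs & 0xFFFC;          /* ignore the RPL bits        */
    s.ring0_code = base;                           /* used by SYSENTER for CS    */
    s.ring0_data = (uint16_t)(base + 8);           /* used by SYSENTER for SS    */
    s.ring3_code = (uint16_t)((base + 16) | 3);    /* used by SYSEXIT for CS, RPL 3 */
    s.ring3_data = (uint16_t)((base + 24) | 3);    /* used by SYSEXIT for SS, RPL 3 */
    return s;
}
```

  With 0x08 in the MSR, the four selectors come out as 0x08, 0x10, 0x1B and 0x23, so GDT entries 1 to 4 must hold the ring 0
  code, ring 0 data, ring 3 code and ring 3 data descriptors.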
Software multitasking
  Task gates are no longer used by today's operating systems. Instead, they apply software multitasking to switch between
  processes:
  Because a software scheduler saves only what is necessary for task switching, it is faster than the segmented mode hardware
  switching.
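  To give a feeling for how little a software scheduler has to remember, here is a toy round-robin model in C. Everything in it
  (the task and scheduler types, MAX_TASKS, switch_task) is invented for this sketch; it only simulates the bookkeeping and
  assumes that each task's registers are already pushed on its own stack, so that the stack pointer is all that needs saving.

```c
#include <stdint.h>

#define MAX_TASKS 4   /* arbitrary size for this sketch */

/* Hypothetical per-task state: only the saved stack pointer is kept. */
typedef struct {
    uint32_t saved_esp;
    int      runnable;
} task;

typedef struct {
    task tasks[MAX_TASKS];
    int  current;      /* index of the task that is running now */
} scheduler;

/* Saves the outgoing task's stack pointer, picks the next runnable task in
   round-robin order and returns the stack pointer that should be loaded. */
uint32_t switch_task(scheduler *s, uint32_t current_esp)
{
    s->tasks[s->current].saved_esp = current_esp;
    for (int step = 1; step <= MAX_TASKS; step++) {
        int candidate = (s->current + step) % MAX_TASKS;
        if (s->tasks[candidate].runnable) {
            s->current = candidate;
            break;
        }
    }
    return s->tasks[s->current].saved_esp;
}
```

  A real kernel would call something like this from its timer interrupt and then load the returned value into ESP; a hardware task
  switch through a TSS, by contrast, always saves and reloads the complete register set.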
Unreal mode
  Because protected mode cannot call DOS or BIOS interrupts, it is generally not very useful to DOS applications. However, a 'bug' in
  the 386+ processor turned out to be a feature called unreal mode. The unreal mode is a method to access the entire 4GB of
  memory from real mode. This trick is undocumented, however a large number of applications are using it. The trick is based on the
  fact that a segment register can be loaded with a selector for a 4GB data segment (set in the GDT), and when the CPU goes back to real mode the
  register's "invisible part" remains intact, still holding a 4GB limit.
         Enable A20.
         Enter protected mode.
         Load a segment register (ES or FS or GS) with a 4GB data segment.
         Return to real mode.
  286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed (but see LOADALL
  below).
  LOADALL
  At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0xF 0x5 in 286, 0xF 0x7 in 386).
  LOADALL was used, as the name implies, to load all the registers (including the GDTR and IDTR) from one table in memory. In 286
  LOADALL (which was not accessible from 386), this table was fixed at memory address 0x800, whereas in 386 LOADALL it reads
  the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way if any of the values loaded by LOADALL
  are valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:
         To access the entire memory from real mode without entering protected mode and unreal mode.
         To run real mode code with paging.
         To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping
         each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the
         desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386,
         where LOADALL eventually faded out.
LOADALL cannot switch the 286 back to real mode, but using LOADALL removes the need to enter protected mode altogether.
  LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure,
  probably to induce the programmers to take advantage of the new VM86 mode.
HIMEM.SYS
  Protected mode is complex and, without a debugger available, it is prone to lots of unsolvable crashes. To help the programmers,
  Microsoft created a driver that was able to manage protected mode from a normal 16-bit DOS application, allowing it to access high
  memory. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially from big apps. HIMEM
  puts the CPU in unreal mode (or it uses LOADALL in 286) and provides a simple interface to the applications that want more
  memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of
  DOS COMMAND.COM to reside in the high memory area when config.sys had a DOS=HIGH directive.
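Detect HIMEM.SYS
All the following functions are provided by the entry point returned in ES:BX by the above interrupt (in the XMS specification: int 0x2F with AX = 0x4300 detects the driver and AX = 0x4310 returns the entry point).
Detect/Enable/Disable A20
Allocate HMA
AH = 0x1
Free HMA
AH = 0x2
Allocate, free and move an extended memory block (function names as given in the XMS specification)
AH = 0x9
AH = 0xA
AH = 0xB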
  HIMEM.SYS moves memory in order to defragment it. Locking memory is useful when you will access the memory directly, within
  protected mode. Actually, because HIMEM puts the CPU in unreal mode, you can use the very same returned pointers directly.
  VM86 Mode
  Many of the existing applications were real-mode at the time protected mode was introduced. Even today, many (mostly games) are
  played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be
  created.
  The VM86 mode is enabled by a flag in the EFLAGS register (VM, bit 17). It gives the task a normal 16-bit DOS memory map of 640KB which is forwarded
  via paging to the actual memory - this makes it possible to run multiple DOS applications at the same time without any
  risk of one application overwriting another. EMM386.EXE puts the processor into that state. The OS monitors
  the process, making sure that it won't execute something illegal. Normally you also want to map all your
  other critical structures (GDT, IDT etc) above 1MB so they are not visible to any VM86 process.
   mov ebp,esp
   push dword [ebp+4]
   push dword [ebp+8]
   pushfd
   or dword [esp], (1 << 17)                ; set the VM flag (bit 17) in the pushed EFLAGS
   push dword [ebp+12]                   ; cs
   push dword [ebp+16]                   ; eip
   iret
  Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by
  the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not
  actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.
  All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The
  interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only
  protected mode interrupts are executed.
  VM86 is not available in long mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code,
  you need an emulator such as DOSBox.
  Many applications were also written to take advantage of the expanded memory, but the modern standard was the protected mode.
  EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an
  application that would like to use expanded memory can use it via EMM386.EXE, which provides an LIM EMS int 0x67 interface. In
  addition, EMM386 allowed "devicehigh" and "loadhigh" commands in CONFIG.SYS, allowing applications to get loaded to these
  high segments if possible.
  Enabling PAE (CR4 bit 5) means that now you have 3 paging levels: in addition to the Page Directory and the Page Table, you
  now have the PDPT, the Page Directory Pointer Table, which has four 64-bit entries. Each of the PDPT entries points to a Page Directory of
  4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries) and
  points to a Page Table of 4KB (like in normal paging), and each entry in the new Page Table is also 64 bits long,
  so there are 512 entries. One Page Directory therefore maps only 512 * 512 * 4KB = 1GB, a quarter of the original mapping, which is why four PDPT entries are
  supported. The first entry maps the first 1GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.
  The "S" bit in the PDT entry now has a different meaning: if not set, the entry points to a Page Table of 4KB pages, but if set, this
  entry does not point to a PT; it describes a 2MB page by itself. So you can have different levels of paging traversal depending
  on the S bit.
There is a new flag in the 64-bit Page Directory and Page Table entries as well, the NX bit (bit 63) which, if set, prevents code execution in that page.
  This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to
  4GB. The memory can be up to 64GB but a process cannot see the entire memory.
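To make the three-level traversal concrete, here is a minimal C sketch (not part of the github project) that splits a 32-bit linear address into the indices used under PAE with 4KB pages. The names pae_index and pae_split are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 4 PDPT entries * 512 PD entries * 512 PT entries * 4KB pages = 4GB. */
static_assert(4ULL * 512 * 512 * 4096 == 1ULL << 32, "PAE tables cover 4GB");

/* Hypothetical helper type: the indices selected by one linear address. */
struct pae_index {
    unsigned pdpt;    /* bits 30-31: one of the 4 PDPT entries */
    unsigned pd;      /* bits 21-29: one of 512 PD entries     */
    unsigned pt;      /* bits 12-20: one of 512 PT entries     */
    unsigned offset;  /* bits 0-11: byte inside the 4KB page   */
};

struct pae_index pae_split(uint32_t linear)
{
    struct pae_index ix;
    ix.pdpt   = (linear >> 30) & 0x3;
    ix.pd     = (linear >> 21) & 0x1FF;
    ix.pt     = (linear >> 12) & 0x1FF;
    ix.offset = linear & 0xFFF;
    return ix;
}
```

For a 2MB page (S bit set in the PD entry) the pt field would not be used and the low 21 bits would form the offset instead; the sketch does not model that case.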
  Direct Memory Access drivers however have a problem, because they don't use paged memory. If working in 32 bits, the driver has
  to manage the paging tables itself in order to be able to manipulate memory over 4GB and this could mean incompatibilities with
  the operating system, unless a safe DMA API was exposed to the driver. For this reason, PAE quickly faded out in favor of 64-bit
  operating systems, in which PAE paging remains a required part of long mode.
DPMI
  For DOS applications, unreal mode was not enough; eventually, fully 32-bit applications had to be supported. DPMI (DOS
  Protected Mode Interface) was a driver that provided a (relatively complex) interface to applications wishing to run in 32 bit protected
  mode. DOS extenders, based on DPMI, like DOS4GW and DOS32A were created to support applications (mostly games) that
  wanted to run in 32 bit while still having access to DOS interrupts. DPMI catches the interrupt call, switches to real mode, executes
  the interrupt and goes back to protected mode. DPMI even allows multitasking and multiple "virtual" 32 bit machines.
The application terminates via function 0x4C of int 0x21 (as in real mode). The rest of the DPMI functions are provided through int 0x31.
Many good games like The Dig were running under DPMI.
Long Mode
Architecture
  Whatever methods were created to overcome the 4GB limit of the x86, they would eventually lead to full 64-bit processors. Having
  discussed all the protected mode complexities, we are lucky to observe that the x64 CPU architecture is much simpler. The x64 CPU
  has 3 operation modes:
         Real mode
         Protected mode (called legacy mode)
         Long mode, containing two submodes:
                 Compatibility mode, 32 bit. This allows a 64-bit OS to run 32-bit applications natively.
                 64-bit mode
To work in Long mode, the programmer must take into consideration the facts below:
         Unlike Protected mode, which can run with or without paging, long mode runs only with PAE and paging and in flat mode. All
         the segments are flat, from 0 to 0xFFFFFFFFFFFFFFFF, and all memory addressing is linear. DS, ES, SS are ignored. The
         "flat" mode is the only valid mode in long mode. No segmentation.
         You can get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this
         can work because Control Registers are accessible from real mode).
         Although in theory any 64-bit value could be used as an address, in practice we don't yet need 2^64 bytes of memory. Therefore,
         current implementations only implement 48-bit addressing, which forces all pointers to have bits 47-63 either all 0 or all 1.
         This means that you have 2 ranges of valid "canonical" addresses, one from 0 to 0x00007FFF'FFFFFFFF and one from
         0xFFFF8000'00000000 through 0xFFFFFFFF'FFFFFFFF, for a total of 256TB of space. Most OSes reserve the upper area for
         the kernel, and the lower area for the user space.
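The canonical address rule can be checked with a few lines of C. This is only a sketch under the 48-bit assumption stated above; VA_BITS and is_canonical are names invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits, as described in the text. */
#define VA_BITS 48

static_assert(VA_BITS > 1 && VA_BITS < 64, "the check needs spare upper bits");

/* True if bits 47..63 of addr are either all 0 or all 1. */
bool is_canonical(uint64_t addr)
{
    uint64_t upper = addr >> (VA_BITS - 1);            /* bits 47..63 */
    return upper == 0 || upper == (UINT64_MAX >> (VA_BITS - 1));
}
```

With this helper, 0x00007FFF'FFFFFFFF and 0xFFFF8000'00000000 are accepted, while the first address after the lower range, 0x00008000'00000000, is rejected.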
Registers
When running in 64-bit mode, the following 64-bit extensions are available:
These registers are only available in 64-bit mode. In all other modes, including compatibility mode, they are not available.
  GDT/IDT
  Bit 53 of a GDT code segment descriptor, previously reserved, is now the "L" bit. When it is 1, the Sz bit must be 0, and this indicates 64-bit code (the
  combination L = 1 and Sz = 1 is reserved and will throw an exception if used). The limits are always 0 to 0xFFFFFFFFFFFFFFFF
  and the base is always 0.
  If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to
  call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds two bytes for the length of the
  GDT and 8 bytes for its linear address.
  Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. All the segments are flat,
  and everything is done via paging. However GS and FS can still be used as auxiliary registers and their values are still subject to
  verification from the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for that purpose.
  IDT is similar to the protected mode's, the difference being the fact that each entry is expanded to contain the 64-bit
  address of the interrupt handler:
   struc IDT_STR
   {
     .ofs0_15 dw ofs0_15
     .sel dw sel
     .zero db zero                        ; bits 0-2 IST index, bits 3-7 zero
     .flags db flags                      ; bits 0-3 gate type, bit 4 zero, bits 5-6 DPL, bit 7 P
     .ofs16_31 dw ofs16_31
     .ofs32_63 dd ofs32_63
     .reserved dd 0                       ; reserved, must be zero
   }
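For readers more comfortable with C, the same 16-byte entry can be sketched as a struct together with a helper that spreads a handler address over the three offset fields. The names idt_gate64 and make_gate are assumptions made for this example, and the flags value 0x8E (present, DPL 0, 64-bit interrupt gate) is just one common choice.

```c
#include <assert.h>
#include <stdint.h>

/* One long mode IDT entry; field names follow the FASM struc above. */
struct idt_gate64 {
    uint16_t ofs0_15;   /* handler address, bits 0-15        */
    uint16_t sel;       /* code segment selector             */
    uint8_t  ist;       /* bits 0-2 IST index, rest zero     */
    uint8_t  flags;     /* type, DPL and P bits              */
    uint16_t ofs16_31;  /* handler address, bits 16-31       */
    uint32_t ofs32_63;  /* handler address, bits 32-63       */
    uint32_t reserved;  /* must be zero                      */
};

static_assert(sizeof(struct idt_gate64) == 16, "a long mode gate is 16 bytes");

/* Hypothetical helper: build a gate for a handler at the given address. */
struct idt_gate64 make_gate(uint64_t handler, uint16_t selector, uint8_t flags)
{
    struct idt_gate64 g;
    g.ofs0_15  = (uint16_t)(handler & 0xFFFF);
    g.sel      = selector;
    g.ist      = 0;                      /* no interrupt stack table */
    g.flags    = flags;
    g.ofs16_31 = (uint16_t)((handler >> 16) & 0xFFFF);
    g.ofs32_63 = (uint32_t)(handler >> 32);
    g.reserved = 0;
    return g;
}
```

The members happen to be naturally aligned, so no packing attribute is needed for the struct to come out at exactly 16 bytes; the static_assert guards that assumption.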
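  There is no VM86 or unreal mode in long mode, and DPMI cannot be used there either. The LDT and call gates still exist, but their descriptors are
  expanded to 16 bytes. Missing VM86 is the reason that 64-bit OSes cannot run 16 bit DOS software without an emulator.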
  In long mode the paging system adds a new top level structure, the PML4T, which has 512 64-bit entries, each pointing to one
  PDPT, and now the PDPT has 512 entries as well (instead of 4 in the x86 mode). So now you can have 512 PDPTs, which means
  that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512
  PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of
  256TB (512GB * 512 PML4T entries) can be addressed.
  This is another reason not to use the entire 64 bits for addressing. Using all of them would force us to have 6 levels of paging,
  where now four are needed.
  Each of the "S" bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the
  traversal ends here. If the PDPT S flag is 1, then the page size is 1GB.
  There is an Intel draft about PML5, a new top level structure which would allow 5 levels of paging, when CPUs support 57
  bits of addressing.
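The four-level walk can be sketched in C by splitting a 48-bit virtual address into its table indices, 9 bits per level. The names lm_index and lm_split are invented for this illustration, and only the 4KB page case is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* 512 entries at each of 4 levels, 4KB pages: 512^4 * 4KB = 256TB = 2^48. */
static_assert(512ULL * 512 * 512 * 512 * 4096 == 1ULL << 48, "4 levels cover 256TB");

/* Hypothetical helper type: the indices selected by one virtual address. */
struct lm_index {
    unsigned pml4;    /* bits 39-47 */
    unsigned pdpt;    /* bits 30-38 */
    unsigned pd;      /* bits 21-29 */
    unsigned pt;      /* bits 12-20 */
    unsigned offset;  /* bits 0-11  */
};

struct lm_index lm_split(uint64_t vaddr)
{
    struct lm_index ix;
    ix.pml4   = (unsigned)((vaddr >> 39) & 0x1FF);
    ix.pdpt   = (unsigned)((vaddr >> 30) & 0x1FF);
    ix.pd     = (unsigned)((vaddr >> 21) & 0x1FF);
    ix.pt     = (unsigned)((vaddr >> 12) & 0x1FF);
    ix.offset = (unsigned)(vaddr & 0xFFF);
    return ix;
}
```

Note that the first address of the upper canonical range, 0xFFFF8000'00000000, lands on PML4T entry 256, which is why the upper half of that table typically belongs to the kernel.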
         Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" (identity mapped) area.
         Set PAE, by setting bit 5 of CR4.
         Create the new page tables and load CR3 with them. Because CR3 is still 32 bits wide before entering Long mode, the top level page
         table must reside in the lower 4GB.
         Enable Long mode by setting the LME bit (bit 8) of the EFER MSR (0xC0000080). Note that this does not enter Long mode, it just enables it.
         Enable paging. Enabling paging activates and enters Long mode.
  Because the rdmsr/wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by
  setting both the PE and PG bits of CR0 simultaneously.
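The bits involved in the steps above can be summarized in a small C model. It does not touch any real register; it only shows, under the bit positions given in the text, which combination has to be in place for long mode to become active. The macro and function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PE   (1u << 0)     /* protected mode enable            */
#define CR0_PG   (1u << 31)    /* paging enable                    */
#define CR4_PAE  (1u << 5)     /* physical address extension       */
#define EFER_MSR 0xC0000080u   /* MSR number used with rdmsr/wrmsr */
#define EFER_LME (1u << 8)     /* long mode enable                 */

static_assert(CR4_PAE == 0x20, "PAE is bit 5 of CR4");

/* Model only: long mode is active once PAE, LME, PE and PG are all set. */
bool long_mode_active(uint32_t cr0, uint32_t cr4, uint32_t efer)
{
    return (cr0 & CR0_PE) && (cr0 & CR0_PG) &&
           (cr4 & CR4_PAE) && (efer & EFER_LME);
}
```

Setting PAE and LME alone leaves the model inactive; it is the final write to CR0 (PE and PG together when coming from real mode) that flips it.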
Entering 64-bit
Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:
  The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses. Note that you
  must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-
  bit default, you have to use the 066h prefix.
The only thing you have to do in 64-bit mode is to reset the RSP:
   mov rsp,STACK64
   shl rsp,4
   add rsp,stack64_end
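The three instructions above turn a real mode segment and an offset into a linear address (segment * 16 + offset). The same arithmetic as a tiny C sketch, with real_to_linear being a name invented for the example:

```c
#include <assert.h>
#include <stdint.h>

/* The highest address reachable this way is 0xFFFF:0xFFFF = 0x10FFEF. */
static_assert(((0xFFFFu << 4) + 0xFFFFu) == 0x10FFEFu, "top of real mode addressing");

/* Linear address of seg:ofs, as computed by the shl/add pair above. */
uint64_t real_to_linear(uint16_t seg, uint16_t ofs)
{
    return ((uint64_t)seg << 4) + ofs;
}
```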
  SS, DS and ES are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that
  segment's selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with
  linear addresses. "Flat" mode is not only the default, it is the only one for 64-bit.
  Once you are in 64-bit mode, the default operand size for the opcodes (except for jmp/call) is still 32-bit. So a REX prefix
  (0x40 to 0x4F) with its W bit set (0x48 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a "code64" segment.
  In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 for the
  length and 8 for the location).
  This gets you back to compatibility mode. 64-bit OSes keep jumping from 64-bit to compatibility mode in order to be able to run both
  64-bit and 32-bit applications.
Multiple Cores
  A single CPU can execute one instruction at a time. Multitasking on single processors is generally fast switching (at the software
  level) between the registers/paging of each running process, and this is so fast that it appears that processes run
  simultaneously.
  A multiple core CPU is similar to having many single CPUs that share the same memory. Everything else (registers, modes, etc)
  is specific to each CPU. That means that if we have an 8 core processor, we have to execute the same procedure 8 times to put it
  e.g. in long mode. We can have one processor in real mode, another processor in protected mode, another processor in long
  mode etc.
Discovery
  Each core has its own local Advanced Programmable Interrupt Controller (APIC), and a set of tables, found in memory, will provide us the information we
  need about the cores. First we discover the presence of APIC:
   mov eax,1
   cpuid
   bt edx,9
   jc ApicFound
  Second, we search for the Advanced Configuration and Power Interface (ACPI) tables in memory. The first of them, the Root System Description Pointer (RSDP),
  resides somewhere in BIOS memory, between physical addresses 0xE0000 and 0xFFFFF, and it has the following header:
   struct RSDPDescriptor
   {
     char Signature[8];
     uint8_t Checksum;
     char OEMID[6];
     uint8_t Revision;
     uint32_t RsdtAddress;
   };
  The above RSDP Descriptor contains the signature value which, for the first ACPI table, is 0x2052545020445352 (the ASCII string "RSD PTR " read as a
  little-endian 64-bit number). If this signature
  is not found in the memory, then we don't have ACPI and therefore we cannot discover any additional CPU cores.
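Each descriptor also has a checksum: all the bytes of the table, added together, must give 0 in the low 8 bits. It is verified with the following algorithm:
   IsChecksumValid:             ; In: FS:EDI = table, ECX = length. Out: EAX = 1 if valid, 0 if not
       PUSH ECX
       PUSH EDI
       XOR EAX,EAX
       .St:
       ADD AL,[FS:EDI]          ; add one byte, keeping only the low 8 bits of the sum
       INC EDI
       DEC ECX
       JNZ .St
       TEST AL,AL               ; a valid table sums to 0
       MOV EAX,0                ; MOV does not change the flags set by TEST
       JNZ .F
       MOV EAX,1
       .F:
       POP EDI
       POP ECX
       RETF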
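The search and the checksum can also be sketched in C over an ordinary byte buffer, which makes the logic easy to try outside of a real BIOS area. This is only an illustration: find_rsdp and acpi_checksum_ok are invented names, the 16-byte search step reflects the ACPI rule that the RSDP lies on a 16-byte boundary, and only the 20-byte ACPI 1.0 part of the structure is checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RSDP_V1_SIZE 20   /* size of struct RSDPDescriptor above */

/* True if all bytes of the table add up to 0 modulo 256. */
int acpi_checksum_ok(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}

/* Returns the offset of a valid RSDP inside buf, or -1 if none is found. */
long find_rsdp(const uint8_t *buf, size_t len)
{
    assert(buf != NULL);
    for (size_t off = 0; off + RSDP_V1_SIZE <= len; off += 16) {
        if (memcmp(buf + off, "RSD PTR ", 8) == 0 &&
            acpi_checksum_ok(buf + off, RSDP_V1_SIZE))
            return (long)off;
    }
    return -1;
}
```

On real hardware buf would be the memory between 0xE0000 and 0xFFFFF; here it is just whatever buffer the caller passes in.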
  In case we succeed in finding an ACPI 2.0 table and its ExtendedChecksum is verified, then we must use the XsdtAddress (which
  always points to lower 4GB) to find the other tables. If it is an ACPI 1.0 then we use the RsdtAddress.
  Having found the address, we use it to locate the first ACPI table. The starting table contains pointers to all the other tables (32 bit, or
  64 bit if ACPI 2.0+) after the header. This physical address is above 1MB and hence it is only accessible from protected (or
  unreal) mode. There are many ACPI tables but we are only interested in a few of them.
   struct ACPISDTHeader
     {
     char Signature[4];
     uint32_t Length;             // fixed-width types: unsigned long is 64 bits on some compilers
     uint8_t Revision;
     uint8_t Checksum;
     char OEMID[6];
     char OEMTableID[8];
     uint32_t OEMRevision;
     uint32_t CreatorID;
     uint32_t CreatorRevision;
     };
  The first table that we will find contains the pointers to all other ACPI tables after this header. The Length member contains the
  length of the entire table, including the header.
  To find how many processors we have, we find the "MADT" table, a table which has the signature "APIC" in its header. After the
  standard header, we have:
         At offset 0x24, the Local APIC Address, which we will need later.
         At offset 0x2C, the rest of the MADT table contains a sequence of variable length records which enumerate the interrupt
         devices. Each record begins with 2 header bytes, one for the type and one for the length. If the type byte is 0, then
         6 bytes follow the length byte, describing a physical CPU. The first byte is the ACPI Processor ID and the
         second byte is the APIC ID of this processor.
Looping through the above table will reveal all the installed processors along with their ACPI and APIC IDs.
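The loop over the MADT records might look like the following C sketch. It works on a plain byte buffer holding the whole table, and madt_count_cpus is a name invented for the example. It counts every type 0 record and does not look at the remaining bytes of the record, so a real implementation would probably want additional checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MADT_RECORDS_OFFSET 0x2C   /* records start here, after the header */

/* Counts the type 0 (processor) records and stores up to max_ids APIC IDs. */
int madt_count_cpus(const uint8_t *madt, size_t len, uint8_t *apic_ids, int max_ids)
{
    assert(madt != NULL);
    size_t pos = MADT_RECORDS_OFFSET;
    int count = 0;

    while (pos + 2 <= len) {
        uint8_t type    = madt[pos];       /* record type             */
        uint8_t rec_len = madt[pos + 1];   /* length including header */

        if (rec_len < 2 || pos + rec_len > len)
            break;                         /* malformed record: stop  */

        if (type == 0 && rec_len >= 8) {
            /* byte 2 is the ACPI Processor ID, byte 3 the APIC ID */
            if (apic_ids != NULL && count < max_ids)
                apic_ids[count] = madt[pos + 3];
            count++;
        }
        pos += rec_len;
    }
    return count;
}
```

The len argument would be the Length member of the table header, so the loop stops exactly at the end of the table.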
Initial Startup
  A CPU can communicate with another CPU by issuing an "Interprocessor Interrupt" (IPI). To prepare the APIC to manage
  interrupts, we have to enable it through the "Spurious Interrupt Vector Register", at offset 0xF0:
  After that, we are ready to send IPIs. An IPI is sent by using the Interrupt Command Register of the
  Local APIC. This consists of two 32-bit registers, one at offset 0x300 and one at offset 0x310 (all Local APIC registers are aligned
  to 16 bytes):
         The register at 0x310 is the one we write first, and it contains the Local APIC ID of the processor we want to send the interrupt to,
         in bits 24 - 27 (bits 24 - 31 on xAPIC processors).
         The register at 0x300 has the following structure:
         The register at 0x300 has the following structure:
   struct R300
       {
       unsigned int VectorNumber:8;          // Starting page for SIPI
       unsigned int DestinationMode:3;       // 0 normal, 1 low, 2 SMI, 4 NMI, 5 Init, 6 SIPI
       unsigned int DestinationModeType:1;   // 0 for physical, 1 for logical
       unsigned int DeliveryStatus:1;        // 0 - message delivered
       unsigned int R1:1;
       unsigned int InitDeAssertClear:1;
       unsigned int InitDeAssertSet:1;
       unsigned int R2:2;
       unsigned int DestinationType:2;       // 0 normal, 1 send to me, 2 send to all, 3 send to all except me
       unsigned int R3:12;                   // a 12-bit field does not fit in an unsigned char
       };
  Writing to register 0x300 will actually send the IPI (that is why you must write to 0x310 first). Note that if DestinationType is
  not 0, the Destination target in the register 0x310 is ignored. Under Windows, IPIs are sent with an IRQL level 29.
  As we know, the CPU starts in real mode from the 0xF000:0xFFF0 position, but this is only true for the first CPU. All other CPUs stay
  "asleep" until woken up, in a special state called Wait-for-SIPI. The main CPU awakes other CPUs by sending a SIPI (Startup
  Inter-Processor Interrupt) which contains the startup address for that CPU. Later on, there are other Inter-processor Interrupts to
  communicate between the CPUs.
  To awake the processor, we send two special IPIs. The first is the "Init" IPI, DestinationMode 5, which resets the target CPU and leaves it
  waiting for a SIPI. The second IPI is the SIPI, DestinationMode 6, which starts the CPU. Because the processor starts in real mode, we have to give it a
  real mode address: VectorNumber holds the number of the 4096-byte page where execution begins.
  The starting address must therefore be 4096 aligned.
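Because the layout of C bit-fields depends on the compiler, the two register values can also be composed with plain shifts. The sketch below only computes the 32-bit values for the 0x310 and 0x300 registers; it does not write to any APIC. The function names are invented, and setting bit 14 (level assert) for both IPIs is a common convention that is assumed here rather than taken from the article.

```c
#include <assert.h>
#include <stdint.h>

#define ICR_DELIVERY_INIT (5u << 8)    /* DestinationMode 5 in bits 8-10 */
#define ICR_DELIVERY_SIPI (6u << 8)    /* DestinationMode 6 in bits 8-10 */
#define ICR_LEVEL_ASSERT  (1u << 14)   /* assumption: level = assert     */

/* Value for the register at 0x310: target APIC ID in the top byte. */
uint32_t icr_high(uint8_t apic_id)
{
    return (uint32_t)apic_id << 24;
}

/* Value for the register at 0x300 that sends an INIT IPI. */
uint32_t icr_low_init(void)
{
    return ICR_DELIVERY_INIT | ICR_LEVEL_ASSERT;
}

/* Value for the register at 0x300 that sends a SIPI for start_address. */
uint32_t icr_low_sipi(uint32_t start_address)
{
    /* the address must be 4096 aligned and below 1MB (real mode) */
    assert((start_address & 0xFFF) == 0 && start_address < 0x100000);
    return ICR_DELIVERY_SIPI | ICR_LEVEL_ASSERT | (start_address >> 12);
}
```

For a startup routine placed at 0x8000, the SIPI value carries page number 8 in its low byte.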
Later Communication
  Apart from INIT and SIPI, which we saw above, the local APIC can be used to send a normal interrupt, i.e., merely
  executing INT XX in the context of the target CPU. We have to take into consideration the following:
         If the CPU is in HLT state, the interrupt awakes it, and when the interrupt returns the CPU resumes with the instruction after
         the HLT opcode. If interrupts are also disabled with CLI, then we must send an NMI (a delivery mode in the APIC Interrupt Command Register) to wake
         the CPU.
         If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
         The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined
         in IDT.
         The Local APIC is common to all CPUs (memorywise), therefore, we must lock for write access (mutex) before we can
         issue the interrupt.
         Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the
         interrupt, if any) in a separate memory area.
         The interrupt might fail, so you have to rely on some inter-CPU communication (via shared memory and mutexes) to verify
         the delivery.
         Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". This is similar to the
         out 020h,al that acknowledged the old interrupt controller in the past. Now we write the value 0 to the EOI register (LocalApic + 0xB0).
Synchronization
  Since the CPUs share the same memory, it is crucial to synchronize write and read accesses to critical parts of it. In Windows of
  course we have mutexes ready to be used, but here some extra work has to be done. We can create our own mutex variable as
  follows:
  Note the pause opcode (equal to rep nop). This is a hint to the CPU that we are inside a spin loop, which improves
  performance because the CPU avoids speculating aggressively on the memory reads of the loop.
Our problem is to wait for a mutex, then grab it when it is free (similar to WaitForSingleObject()). This code is not going to work:
   .Loop1:
   CMP byte [edi],0xff
   JZ .OutLoop1
   pause
   JMP .Loop1
   .OutLoop1:
   .MutexIsFree:
   DEC byte [edi]
  The reason is that, between the JZ command (which has verified that the mutex is free) and before the DEC [edi] is executed,
  another CPU might grab the mutex (race condition).
Fortunately for us, the CPU provides a LOCK CMPXCHG opcode which atomically grabs the lock for us:
   .Loop1:
   CMP byte [edi],0xff
   JZ .OutLoop1
   pause
   JMP .Loop1
   .OutLoop1:
   ; Lock is free, can we grab it?
   mov bl,0xfe
   MOV AL,0xFF
   LOCK CMPXCHG [EDI],bl
   JNZ .Loop1 ; Write failed, another CPU grabbed the lock first
   .OutLoop2: ; Lock Acquired
  We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests [edi] if it is still 0xFF (the value in AL), and
  if yes, then it writes BL to it and sets the ZF. If another CPU has done the same meanwhile, the ZF is cleared and BL is not
  moved to the [edi].
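The same idea can be expressed in portable C11 with stdatomic.h, where the compare-and-exchange call plays the role of LOCK CMPXCHG. This is a sketch rather than the project's code: the values 0xFF (free) and 0xFE (held) mirror the listing above, the function names are invented, and the pause hint is left out because standard C has no portable equivalent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MUTEX_FREE 0xFF   /* same values as the assembly listing */
#define MUTEX_HELD 0xFE

typedef atomic_uchar mutex_t;

void mutex_init(mutex_t *m)
{
    atomic_init(m, MUTEX_FREE);
}

/* One atomic attempt: succeeds only if the mutex is still free. */
bool mutex_try_lock(mutex_t *m)
{
    unsigned char expected = MUTEX_FREE;            /* the value kept in AL */
    return atomic_compare_exchange_strong(m, &expected, MUTEX_HELD);
}

/* Spin until the mutex looks free, then try to grab it atomically. */
void mutex_lock(mutex_t *m)
{
    for (;;) {
        while (atomic_load(m) != MUTEX_FREE) {
            /* spin; the assembly version issues pause here */
        }
        if (mutex_try_lock(m))
            return;                                 /* lock acquired */
        /* another CPU beat us to it: go back to waiting */
    }
}

void mutex_unlock(mutex_t *m)
{
    assert(atomic_load(m) == MUTEX_HELD);
    atomic_store(m, MUTEX_FREE);
}
```

As in the assembly version, the plain read in the inner loop is only an optimization; correctness comes from the atomic compare-and-exchange.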
Virtualization
  Virtualization, technically, is a "system" inside the system. It's a clone of the processor running inside the same processor. It is not
  very complex to set up and it greatly enhances computing since you are able to run another OS inside an existing OS.
  Each CPU (called Host) can run one Virtual Machine (called guest) at a time. You can configure multiple guests per CPU and
  pause/resume each guest, much like multitasking. If you have 8 CPU cores of course, you can have 8 guests running
  simultaneously.
           mov eax,1
           cpuid
           bt ecx,5                      ; CPUID.1:ECX bit 5 reports VMX support
           jc VMX_Supported
           jmp VMX_NotSupported
           mov rax,cr4
           bts rax,13                    ; set CR4.VMXE (bit 13) to enable VMX
           mov cr4,rax
         Configure a VMXON structure. This is a 4096-aligned CPU-specific array and its size must be the number we got from
         the IA32_VMX_BASIC register. A VMXON structure contains:
                 4 bytes for the revision (also taken from the IA32_VMX_BASIC register),
                 4 bytes that are used for VMX Abort data (we will check that later),
                 The rest of the bytes are used by VMCS groups (we will check that later).
  VMCS Groups
  The rest of the VMCS (that is, after the first 8 bytes: revision + VMX Abort) is divided into 6 subgroups:
         Guest State
         Host State
         Non root controls
         VMExit controls
         VMEntry controls
         VMExit information
  Each of the above fields contains important information. We will look at them one by one. To mark a VMCS for further
  reading/writing with VMREAD or VMWRITE, you would first initialize its first 4 bytes to the revision (as with the VMXON structure
  above), and then execute a VMPTRLD with its address.
  Appendix H of the 3B Intel Manual has a list of all indices. For example, the index of the RIP of the guest is 0x681e. To write the
  value 0 to that field, we would use:
   mov rax,0681eh
   mov rbx,0
   vmwrite rax,rbx
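The index values are not arbitrary: as far as I can tell from Intel's description of the field encoding, each one packs an access type, an index, a field type and a width. The following C sketch decodes an encoding according to that layout; the struct and function names are invented, and the bit positions should be checked against the manual before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Guest RIP (0x681E) is expected to decode as a natural-width field. */
static_assert(((0x681Eu >> 13) & 3u) == 3u, "0x681E is natural width");

/* Hypothetical helper type: the parts of a VMCS field encoding. */
struct vmcs_field {
    unsigned high_access;  /* bit 0: 1 = upper half of a 64-bit field       */
    unsigned index;        /* bits 1-9                                      */
    unsigned type;         /* bits 10-11: 0 control, 1 read-only data,      */
                           /*             2 guest state, 3 host state       */
    unsigned width;        /* bits 13-14: 0 16-bit, 1 64-bit, 2 32-bit,     */
                           /*             3 natural width                   */
};

struct vmcs_field vmcs_decode(uint32_t encoding)
{
    struct vmcs_field f;
    f.high_access = encoding & 1;
    f.index       = (encoding >> 1) & 0x1FF;
    f.type        = (encoding >> 10) & 3;
    f.width       = (encoding >> 13) & 3;
    return f;
}
```

Decoding 0x681E this way gives a guest state field of natural width with index 15, which matches its role as the guest RIP.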
  Not all features are always present in all processors. We must check the VMX MSRs for available features before using them.
  Intel's 3B Appendix G contains all these MSRs. To read an MSR, you put its number in ECX and execute the rdmsr opcode. The
  result is in EDX:EAX.
         IA32_VMX_BASIC (0x480): Basic VMX information including revision, VMCS size, memory types and others.
         IA32_VMX_PINBASED_CTLS (0x481): Allowed settings for pin-based VM execution controls.
         IA32_VMX_PROCBASED_CTLS (0x482): Allowed settings for processor based VM execution controls.
         IA32_VMX_PROCBASED_CTLS2 (0x48B): Allowed settings for secondary processor based VM execution controls.
         IA32_VMX_EXIT_CTLS (0x483): Allowed settings for VM Exit controls.
         IA32_VMX_ENTRY_CTLS (0x484): Allowed settings for VM Entry controls.
         IA32_VMX_MISC (0x485): Miscellaneous data, such as RDTSC options, unrestricted guest
         availability, activity state and others.
         IA32_VMX_CR0_FIXED0 (0x486) and IA32_VMX_CR0_FIXED1 (0x487): Indicate the bits of CR0 that must be
         1 (FIXED0) and the bits that are allowed to be 1 (FIXED1) in VMX operation.
         IA32_VMX_CR4_FIXED0 (0x488) and IA32_VMX_CR4_FIXED1 (0x489): Same for CR4.
         IA32_VMX_VMCS_ENUM (0x48A): enumerator helper for VMCS.
         IA32_VMX_EPT_VPID_CAP (0x48C): provides information for capabilities regarding VPIDs and EPT.
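The "allowed settings" MSRs are typically used to adjust a desired control value before it is written to the VMCS: the low 32 bits mark the controls that must be 1 and the high 32 bits the controls that may be 1. A minimal C sketch of that adjustment, and of the matching check for the CR0/CR4 FIXED0/FIXED1 pairs, could look like this (the function names are invented and the MSR values are passed in as plain numbers):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Force the bits the CPU requires and drop the bits it does not support. */
uint32_t vmx_adjust_controls(uint32_t desired, uint64_t capability_msr)
{
    uint32_t must_be_one = (uint32_t)capability_msr;          /* low dword  */
    uint32_t may_be_one  = (uint32_t)(capability_msr >> 32);  /* high dword */

    /* a sane MSR never requires a bit that it also forbids */
    assert((must_be_one & ~may_be_one) == 0);
    return (desired | must_be_one) & may_be_one;
}

/* True if a CR0/CR4 value respects the FIXED0 and FIXED1 MSRs. */
bool cr_value_allowed(uint64_t cr, uint64_t fixed0, uint64_t fixed1)
{
    bool required_set = (cr & fixed0) == fixed0;   /* FIXED0 bits must be 1 */
    bool forbidden_ok = (cr & ~fixed1) == 0;       /* only FIXED1 bits may  */
    return required_set && forbidden_ok;
}
```

In a real hypervisor capability_msr would come from rdmsr on, for example, IA32_VMX_PINBASED_CTLS; the numbers used when trying the sketch are arbitrary.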
This contains the following information (in parentheses, the size in bits):
                 Selector (16)
                 Base address (64)
                 Segment limits (32)
                 Access rights (32)
         IA32_DEBUGCTL (64)
         IA32_SYSENTER_CS (32)
         IA32_SYSENTER_ESP (64)
         IA32_SYSENTER_EIP (64)
         IA32_PERF_GLOBAL_CTRL (64)
         IA32_PAT (64)
         IA32_EFER (64)
         SMBASE (32)
         Activity State (32) - 0 Active, 1 Inactive (HLT executed), 2 Triple fault occurred, 3 waiting for startup IPI (SIPI).
         Interruptibility state (32) - a state that defines some features that should be blocked in the VM.
         Pending debug exceptions (64) - to facilitate hardware breakpoints with DR7.
         VMCS Link pointer (64) - reserved, set to 0xFFFFFFFFFFFFFFFF.
         VMX Preemption timer value (32)
         Page Directory pointer table entries (4x64), pointers to pages.
This group defines how the guest will start. The guest can be started in two modes:
  Starting a guest in protected mode still allows the guest to turn later into long mode. If a guest expects a real mode start but
  unrestricted guest is not available, then you can start in VM86 mode.
  These fields configure what is allowed to be executed in the guest and what is not. Everything not allowed causes a
  VMEXIT. The sections are:
                 Primary: Single Step, TSC HLT INVLPG MWAIT CR3 CR8 DR0 I/O Bitmaps
                 Secondary: EPT, Descriptor Table Change, Unrestricted Guest and others
         Exception bitmap (32b): One bit for each exception. If bit is 1, the exception causes a VMExit.
         I/O bitmap addresses (2x64b): Controls when IN/OUT cause VMExit.
         Time Stamp Counter offset
         CR0/CR4 guest/host masks
         CR3 Targets
         APIC Access
         MSR Bitmaps
  For example, you can configure it so an exception would make it to the host, instead of being caught in the guest. Similarly you
  might not allow GDT changes, Control Register changes etc.
These fields tell the CPU what to load and what to discard in case of a VMExit:
These fields tell the CPU what to inject to the guest in case of an exit:
Basic information
EPT
  EPT (Extended Page Tables) is a mechanism that translates guest physical addresses to host physical addresses. Its table structure closely follows the long mode
  paging mechanism.
Manual Exits
  A guest that knows that it is a guest might want to deliberately exchange information with its host. For this reason, the instruction
  VMCALL is provided to manually trigger an exit.
DMMI
  DPMI works, but a long mode driver is also needed. Therefore I have decided to create a TSR service, included in the github
  project. I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS,
  using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.
  To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET, and ES:BX+2 should point to
  a dword 'dmmi'.
  Int 0xF0 provides the following functions to all modes (real, protected, long)
         AH = 0, verify existence. Return values, AX = 0xFACE if the driver exists, DL = total CPUs. This function is
         accessible from real, protected and long mode.
         AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:
                  0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread is run with FS capable of unreal mode
                  addressing, must use RETF to return.
                  1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
                  2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread
                  must terminate with RET.
                  3, begin virtualized thread. BH contains the virtualization mode (currently only mode 2 = protected mode
                  virtualization is supported), and EDX the virtualized linear stack. The thread must return with RETF or VMCALL.
          AH = 4, execute real mode interrupt. AL is the interrupt number, BP holds the AX value and BX, CX, DX, SI, DI are
          passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
          AH = 5, mutex functions.
                  AL = 0 => Initialize mutex at ES:DI (real mode), EDI linear (protected mode), RDI linear (long mode)
                  AL = 1 => Lock mutex
                  AL = 2 => Unlock mutex
                  AL = 3 => Wait for mutex
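
  As a sketch of how functions 0 and 1 fit together, the real mode fragment below verifies the driver and then starts an (un)real mode thread on CPU 1. The labels start_worker and worker are hypothetical, the carry flag convention is an assumption of this example, and the text above does not say what function 1 returns, so no return value is checked. The code assumes the INT 0xF0 vector has already been validated, since calling an uninstalled vector would crash.

```asm
use16
; Returns CF clear if the thread was requested, CF set if DMMI or a
; second CPU is not available.
start_worker:
    mov ah, 0           ; function 0: verify existence
    int 0xF0
    cmp ax, 0xFACE
    jne .fail
    cmp dl, 2           ; DL = total CPUs, at least 2 are needed for index 1
    jb .fail
    push cs
    pop es              ; ES:DX = seg:ofs of the new thread
    mov dx, worker
    mov bl, 1           ; CPU index 1
    mov ax, 0x0100      ; AH = 1 begin thread, AL = 0 (un)real mode thread
    int 0xF0
    clc
    ret
.fail:
    stc
    ret

worker:
    ; runs on CPU 1, FS is capable of unreal mode addressing here
    retf                ; (un)real mode threads must return with RETF
```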
  If you have more than one CPU, your DOS applications/games can now directly access the full 2^64 byte address space and all your
  CPUs, while still being able to call DOS directly. To avoid calling int 0xF0 directly from assembly and to make the
  driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the
  main thread, INT 0x21 is executed directly. If you call INT 0x21 from a protected or long mode thread, then INT
  0xF0 function AX = 0x0421 is executed automatically.
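
  For illustration, this is roughly what a direct call to function 4 looks like from a 32 bit protected mode thread, here asking DOS to print a '$' terminated string (INT 0x21, AH = 9, DS:DX = string). MSG_SEG and MSG_OFS are hypothetical values standing for the real mode address of a string that the application has placed in conventional memory.

```asm
use32
MSG_SEG = 0x2000            ; hypothetical real mode segment of the string
MSG_OFS = 0x0100            ; hypothetical offset of the string

print_from_pm:
    mov esi, MSG_SEG shl 16 ; high 16 bits of ESI are loaded into DS
    xor edi, edi            ; high 16 bits of EDI are loaded into ES (unused)
    mov edx, MSG_OFS        ; DX is passed to the interrupt
    mov bp, 0x0900          ; BP holds the AX value: AH = 9, print string
    mov ax, 0x0421          ; AH = 4 execute real mode interrupt, AL = 0x21
    int 0xF0
    ret
```

  With the redirection handler installed, the same thread can simply issue INT 0x21 and let the driver perform this call.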
  The project
  The full GitHub project includes many functions discussed in this article. It is arranged in 4 filters: 16 bit code, 32 bit code, data,
  DMMI client and configuration files.
  The fact that you made it to the end means that you are truly determined. Have fun and good luck!
  References
          http://www.fysnet.net/emsinfo.htm, EMS info
          http://www.ctyme.com/rbrown.htm, Ralf Brown Interrupt List
          http://bochs.sourceforge.net, Bochs
          https://github.com/Himmele/My-Blog-Repository/blob/master/Operating%20Systems/Build%20Your%20Own%20OS/Protected%20Mode%20Tutorial.txt,
          Till Gerken PM Tutorial
          https://wiki.osdev.org/Context_Switching, Task Switching
          http://www.sudleyplace.com/dpmione/dpmispec1.0.pdf, DPMI specification
          http://www.delorie.com/djgpp/doc/dpmi/, DJGPP DPMI examples
          http://www.sudleyplace.com/swat/, 386SWAT protected mode debugger
          http://dos32a.narechk.net/index_en.html, DOS32A DPMI extender
  License
  This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
                                           Michael Chourdakis
                                            Engineer
                                            Greece
I have a PhD in Digital Signal Processing and Artificial Intelligence and I specialize in Pro Audio and AI applications.
  Article Copyright 2019 by Michael Chourdakis. Everything else Copyright © CodeProject, 1999-2019. Last Updated 10 Jan 2019.