10/01/2019 The Intel Assembly Manual - CodeProject

The Intel Assembly Manual


Michael Chourdakis, 10 Jan 2019

All in one: x86, x64, Virtualization, multiple cores, along with new additions

Github link: https://github.com/WindowsNT/asm. The entire project.

Introduction

This is my full and final article about Intel assembly. It includes all the previous hardware articles (Internals, Virtualization,
Multicore, DMMI) along with some new information (HIMEM.SYS, Flat mode, EMM386.EXE, Expanded Memory, DPMI
information).

Reading this through will enable you to understand how operating systems work, how memory is allocated and addressed
and, perhaps, how to make your own OS-level drivers and applications.

To help you understand what's happening, the GitHub project includes many aspects of the article (and I'm still adding stuff). It's a
ready-to-run tool which includes a Bochs binary, VMware and VirtualBox configurations and a Visual Studio solution. The entire
project is built in assembly using Flat Assembler.

Assemblers like TASM or MASM will not work, for they only support specific architectures.

Bochs is the best environment to experiment, because it includes a hardware GUI debugger (I'm proud of developing it myself)
which can help you understand the internals. Debugging without Bochs is impossible, because the debuggers are either real mode
only (like MSDOS Debug) and assume you will always have some sort of control (which is not the case in most debugging areas),
or are able to run only in an existing environment (like Visual Studio).

If you have good C knowledge, then this will be a benefit in understanding the internals. Assembly knowledge is recommended, but
you can follow the article even if you know nothing about assembly.

Generic Information

Architecture and CPU


Assembly is a language in which everything must be done manually. A single printf() call will perhaps take thousands of
assembly instructions to execute. While this article does not attempt to teach you assembly, bear in mind
that a lot of work is needed to achieve even the smallest result (that is actually why higher level languages were created).
Assembly language is also specific to the architecture (here, we discuss Intel x86 and x64), whereas a language like C is portable.

Assembly has a small (comparatively) set of commands:

Commands that move data between various places


Commands that execute mathematical operations (simple to complex)
Commands that check conditions (like if)
Other commands (to be later discussed)

https://www.codeproject.com/Articles/1273844/The-Intel-Assembly-Manual?display=Print 1/28
The CPU is the unit that executes assembly instructions. The way they are executed depends on the running mode of the
processor, and there are 4 modes:

Real mode
Protected mode (in two versions, segmented and flat)
Long mode
Virtualization (not exactly a mode, but we will talk about it later)

The next paragraphs in this chapter discuss various elements of the assembly language in general.

Memory
Physically, the memory is one big array. If you have 4GB, you could describe it as unsigned char mem[4294967296].
However, the way it is used greatly differs depending on the processor mode and the configuration of the operating system.
Therefore, you do not access it as a big array.

Stack and Functions


The stack is ordinary memory that is set up for temporary storage. Parameters passed to a function are "pushed" to the stack and, when the
function ends, they are "popped" so the stack is cleared. A C function's local variables also live there, which is why they vanish when the
function terminates. The stack memory is, technically, nothing but normal memory used for special purposes.

This is (oversimplified for now) what approximately happens in assembly with a function:

int x(int a,int b)
{
return a + b;
}

int c = x(5,10); // result c = 15

x:
mov ax,[first stack element]
mov bx,[second stack element]
add ax,bx
ret 4

main:
push 5
push 10 ; the order is different, but let's forget about that now
call x
; ax contains the result

The variables "a" and "b" are "pushed" to temporary memory (which now has 4 bytes less free space, if int = 16 bits). The function is called, and
then it returns with the stack cleared and ax containing the return value. Note that the above is a big oversimplification of what the
assembly code actually looks like, but it will do for now.
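
To make the stack bookkeeping concrete, here is a toy model in C. It is only a sketch under stated assumptions: the ToyStack type and the toy_* functions are invented for this illustration, the return address is a dummy value, and a real program would of course rely on the CPU's own PUSH, CALL and RET.

```c
#include <stdint.h>

/* Toy 16-bit stack: plain memory plus a stack pointer (a byte offset).
   All names here are hypothetical, made up for this illustration. */
typedef struct {
    uint16_t mem[32];  /* 64 bytes of ordinary memory used as the stack */
    uint16_t sp;       /* offset of the current top of the stack */
} ToyStack;

void toy_init(ToyStack *s) { s->sp = (uint16_t)sizeof s->mem; }

void toy_push(ToyStack *s, uint16_t value) {
    s->sp = (uint16_t)(s->sp - 2);   /* PUSH: SP decreases ... */
    s->mem[s->sp / 2] = value;       /* ... then the value is stored */
}

uint16_t toy_pop(ToyStack *s) {
    uint16_t value = s->mem[s->sp / 2];
    s->sp = (uint16_t)(s->sp + 2);   /* POP: SP increases */
    return value;
}

/* Mimics: push a / push b / call x, where x ends with "ret 4".
   Returns what AX would hold after the call. */
uint16_t toy_call_x(ToyStack *s, uint16_t a, uint16_t b) {
    toy_push(s, a);
    toy_push(s, b);
    toy_push(s, 0x1234);                     /* CALL pushes the return address (dummy) */
    uint16_t ax = s->mem[(s->sp + 4) / 2];   /* first pushed value */
    uint16_t bx = s->mem[(s->sp + 2) / 2];   /* second pushed value */
    ax = (uint16_t)(ax + bx);
    (void)toy_pop(s);                        /* RET pops the return address ... */
    s->sp = (uint16_t)(s->sp + 4);           /* ... and "ret 4" discards the 4 bytes of arguments */
    return ax;
}
```

After toy_call_x returns, sp is back where it started, which is what "the stack cleared" means above.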

Registers
In addition to memory, each CPU has some auxiliary places to store data, called registers. What registers are available depends
on the current running mode. Some registers have special meanings, some are for generic purposes.

Interrupts
An interrupt is a piece of code that interrupts other running code. For the moment, just assume it's a function that can run while you
are inside another function. There are interrupts that are automatically generated by the CPU, and interrupts that are "called" by
software. The way they work depends on the running mode, and there can be a maximum of 256 interrupts (numbered 0 to 255).

Exceptions
An exception is an interrupt triggered by either the CPU (for example, when a divide by zero occurs in your C++ code, the int 00
handler is executed), or by using the API (via the throw keyword, for example), which generates a software interrupt. At the
low level we are discussing, there is no difference between exceptions and interrupts.

Now that we have an idea of the basics, let's proceed to CPU modes.

Real Mode

Architecture
Real mode is the oldest mode. DOS runs in it. Windows 3.0 also runs in it when started with the /r switch. Everything is 16 bit. It is
the weakest mode of operation, but not the simplest one. Memory is addressed by a 20 bit controller, making it possible to access
up to 1MB of memory. Available memory over this limit is useless in real mode.

Segmentation
Memory is not accessed as an array, but in segments. Each pointer is described by a 16 bit segment, which is a memory address
divided by 16, and a 16 bit offset, which describes how far past the start of the segment we will go. So we will see some simple (in hex) examples:

0000:0000 -> memory address 0


0000:0010 -> memory address 16 (hex 10)
0001:0002 -> memory address 18. Segment 1*16 + offset 2
0010:0034 -> 0x10*16 + 0x34
0011:0024 -> 0x11*16 + 0x24, same pointer as above
FFFF:000F -> Maximum available address (0xFFFFF). An offset of 0010h or more results in wrapping around
zero.

We can see that segments can overlap. Specifying the 0FFFFh segment and an offset of 0010h or larger results in wrapping. A segment's
maximum capacity is 64KB. Although we can go up to a FFFF segment, only the lower 640KB were available for DOS applications,
because the upper segments (0xA000 and above) were reserved for video memory and the BIOS.
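
The segment:offset arithmetic can be sketched in a few lines of C. This is a minimal sketch, assuming a CPU with only 20 address lines; the function name real_mode_address is made up for this example.

```c
#include <stdint.h>

/* Physical address of seg:off on a CPU with 20 address lines (8086/8088).
   real_mode_address is a hypothetical helper name. */
uint32_t real_mode_address(uint16_t seg, uint16_t off) {
    uint32_t linear = ((uint32_t)seg << 4) + off;  /* segment * 16 + offset */
    return linear & 0xFFFFFu;                      /* only 20 bits exist: wrap past 1MB */
}
```

For example, 0010:0034 and 0011:0024 both give 0x134, and FFFF:0010 wraps around to address 0.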

All segments have read/write/execute access from anywhere (that is, any program can read/write or execute code within any
segment). Any application can read from or write to any part of memory, including the part in which the OS resides. That is why a
real mode OS is a single tasking OS and if one app crashes, you have to reboot.

Registers

Real mode registers are 16 bits, and they include:

Four general purpose registers: AX, BX, CX, DX. The upper 8 bit part of them can be accessed as AH, BH, CH, DH and the
lower part as AL, BL, CL, DL.
A register to hold the offset of the currently executing code: IP.
Four registers to be used as pointers: SI, DI, BP, SP. SP points to the end of the available stack memory. Each time we push
something to the stack, SP decreases. On POP, SP increases. These registers have no 8 bit splits.
Four registers to contain segments: CS, always holding the segment of the currently executing code, DS, ES and SS. SS
holds the segment of the stack memory, DS holds the segment of the data, and ES is an auxiliary register.

So the code is always executing at CS:IP, and stack is pointed by SS:SP.

The 386 CPU adds more registers, also accessible in real mode:

32 bit extensions to the non segment registers: EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP, EIP.
Two more auxiliary segment registers, GS and FS.
Control registers CR0, CR2 and CR3 (CR1 is reserved, and CR4 was added by later processors).
6 debug registers, DR0, DR1, DR2, DR3, DR6, DR7, used for hardware breakpoints.

DS is the default data segment, unless another segment is specified or BP (or ESP) is used as the base:

mov ax,[100] ; gets value from DS:100
mov ax,[si] ; gets value from DS:SI
mov ax,[es:si] ; from ES:SI
mov ax,[di] ; from DS:DI. ES:DI is used only by string instructions (MOVS, STOS, ...).
mov ax,[bp] ; from SS:BP. When BP (or ESP) is the base, SS is the default.

ESI, EDI, EBP and ESP can be used as pointers. If their high bits are not zero, then an exception occurs (unless you are in Unreal
mode, discussed below).

COM and EXE files


A COM file is a memory map, fitting in one segment. The first 256 bytes contain the PSP, a data structure containing information,
and the rest of the segment contains all code, data, and stack memory for the program. CS = DS = ES = SS. SP is set to 0xFFFE to
point to the end of the segment. Execution starts from CS:IP = 0x100 (after the PSP).

An EXE file might have multiple segments, so an EXE can be more than 64KB. DS and ES initially point to the PSP. When an EXE
is loaded, "relocations" are resolved. A relocation is a position within the executable that the assembler leaves as empty, to be filled
with a segment value which would only be known at run time.

Interrupts
All the functions that DOS and BIOS provide are available through real mode software interrupts. In real mode, the first 1024 bytes
of RAM (Starting at 0000:0000) contain a set of 256 segment:offset pointers to each interrupt. In 286+ this location can be changed
by the LIDT command, which points to a 6 byte array:

Bytes 0-1 contain the full length of the IDT, maximum 1KB => 256 entries.
Bytes 2-5 contain the physical address of the first entry of the IDT, in memory.
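
As a sketch of how the default real mode table at 0000:0000 is laid out, the following C function looks up the handler of a given interrupt. It assumes the table is available as a plain byte array; the FarPtr type and the ivt_lookup name are invented for this example. Each entry is 4 bytes, with the offset stored first and the segment second.

```c
#include <stdint.h>

/* A real mode far pointer. FarPtr and ivt_lookup are hypothetical names. */
typedef struct {
    uint16_t segment;
    uint16_t offset;
} FarPtr;

/* 'ivt' stands for the first 1024 bytes of RAM (256 entries of 4 bytes). */
FarPtr ivt_lookup(const uint8_t *ivt, uint8_t vector) {
    const uint8_t *entry = ivt + (uint32_t)vector * 4;   /* entry n lives at n * 4 */
    FarPtr p;
    p.offset  = (uint16_t)(entry[0] | (entry[1] << 8));  /* low word: offset */
    p.segment = (uint16_t)(entry[2] | (entry[3] << 8));  /* high word: segment */
    return p;
}
```

With this layout, the handler for int 21h is read from bytes 0x84 to 0x87.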

Some interrupts are automatically issued by the processor when some event occurs. In real mode, the most significant are:

Interrupt 0, called on divide by zero.


Interrupt 1, called when using a debugger for single step.
Interrupt 3, called on breakpoints.
Interrupt 6, called on invalid opcode.
Interrupt 9, called on key press.

Software interrupts provide various services to real mode apps. The most important interrupts are:

0x10, BIOS display functions


0x13, BIOS disk functions
0x14, BIOS serial port functions
0x16, BIOS keyboard functions
0x17, BIOS parallel port functions
0x21, DOS functions (files, input, output, application, configuration etc)
0x2F, TSR functions
0x31, DPMI functions
0x33, Mouse functions

Using the excellent Ralf Brown's Interrupt List, you can learn about every interrupt in the world.

Models
Because of the segmented memory, different sets of programming models were created, which mostly resulted in incompatibilities
between compilers and libraries. C pointers were described as near or far, depending on whether they included a segment or not:

The tiny model. Everything has to be included in a single segment (COM file). Pointers are near.
The small model. One segment for the code, one for the data. All pointers are near.
The medium model. One data segment, multiple code segments. Code pointers far, data pointers near.
The compact model. One code segment, multiple data segments. Code pointers near, data pointers far.
The large model. Multiple code and data segments, code and data pointers far. Single data structures still limited to 64KB.
The huge model. Multiple code and data segments, all pointers far.

Benefits
The only benefit in real mode is that you have DOS and BIOS functions available as software interrupts. Therefore, all techniques
used by DOS extenders (which allowed applications to run in protected mode) involved temporarily switching to real mode to call
DOS.

Here is a quick hello world in tiny model:

org 0x100 ; code starts at offset 100h
use16 ; use 16-bit code
mov ax,0900h
mov dx,Msg
int 21h
mov ax,4c00h
int 21h
Msg db "Hello World!$"

This very simple program calls two DOS functions. The first is function 9 (ah register) which accepts a pointer of the string to be
written to the screen in DS:DX (DS already has the segment, it's a com file). The second is function 4C, which terminates the
program.

Here is the same application in EXE format:

FORMAT MZ ; DOS 16-bit EXE format
ENTRY CODE16:Main ; Specify Entry point (i.e. the start address)
STACK STACK16:stackend ; Specify the stack segment and the initial SP

SEGMENT CODE16_2 USE16 ; Declare a 16-bit segment

ShowMsg:
mov ax,DATA16
mov ds,ax ; Load DS with our "default data segment"
mov ax,0900h
mov dx,Msg
int 21h ; Call a DOS function: AX = 0900h (Show Message),
; DS:DX = address of a buffer, int 21h = show message
retf ; FAR return; we were called from
; another segment so we must pop IP and CS.

SEGMENT CODE16 USE16 ; Declare a 16-bit segment
ORG 0 ; Says that the offset of the first opcode
; of this segment must be 0.

Main:
call CODE16_2:ShowMsg ; Far call to a procedure in another segment.
; CS/IP are pushed to the stack.
mov ax,4c00h ; Call a DOS function: AX = 4c00h (Exit), int 21h = exit
int 21h

SEGMENT DATA16 USE16

Msg db "Hello World!$"

SEGMENT STACK16 USE16

stackdata dw 1024 dup (0) ; use 2048 bytes as stack. When program is initialized,
; SS and SP are automatically set.
stackend: ; initial SP; the stack grows down from here

How does the assembler know the actual value of the data16, code16, code16_2, and stack16 segments? It doesn't.
What it does is to put placeholder values, and then create entries in the EXE file (known as "relocations") so the loader, once it copies the
code to memory, writes the true values of the segments to the specified addresses. And because this relocation map is stored in the EXE
header, COM files, which have no header, cannot have multiple segments even if they sum to less than 64KB in total.
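
What the loader does with the relocation entries can be sketched as follows. This is a simplified model, not the DOS loader itself: the Reloc type and apply_relocations are names made up here, the image is a plain byte buffer, and it assumes (as in the MZ format) that each placeholder holds the segment's position relative to the start of the image, so the loader only has to add the segment the program was loaded at.

```c
#include <stddef.h>
#include <stdint.h>

/* One relocation entry: where, inside the loaded image, a 16-bit segment
   value must be patched. Reloc and apply_relocations are hypothetical names. */
typedef struct {
    uint16_t offset;
    uint16_t segment;
} Reloc;

/* Adds the load segment to every placeholder. No bounds checking is done:
   the caller must make sure each entry points inside 'image'. */
void apply_relocations(uint8_t *image, const Reloc *relocs, size_t count,
                       uint16_t load_seg) {
    for (size_t i = 0; i < count; i++) {
        size_t pos = ((size_t)relocs[i].segment << 4) + relocs[i].offset;
        uint16_t value = (uint16_t)(image[pos] | (image[pos + 1] << 8));
        value = (uint16_t)(value + load_seg);      /* image-relative -> real segment */
        image[pos]     = (uint8_t)(value & 0xFF);  /* store back, little endian */
        image[pos + 1] = (uint8_t)(value >> 8);
    }
}
```

In the EXE above, the word behind mov ax,DATA16 is one such patched location.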

This program calls a function ShowMsg in another segment via a far call, which uses a DOS function (09h, INT 21h) to
display text.

Problems
Any program can overwrite any other program, so no multitasking capability
Up to 1MB memory only, and the upper 384K were used by BIOS, so only 640K available.
Mixing far and near pointers between applications and libraries led to incompatibilities and, usually, crashes.

If something wrong happens, the PC has to reboot.

Expanded Memory
To cope with the 640KB limitation, an additional compatible memory, called expanded memory or EMS memory, was created. This
was not a processor feature, but rather a set of hardware (ISA card) extensions which included a driver to perform bank switching,
i.e. replace portions of the installed memory with memory from that card. It offered up to 32MB more, but it was mapped to one of the
high segments (A000, B000, C000, D000, E000 or F000), which means that this extra memory could not all be visible
simultaneously. The expansion card came with a driver which had to be installed in config.sys and, using the LIM EMS protocol,
offered the services via interrupt 67h.

Detecting EMS, by testing existence of a device called EMMXXXX0:

EMSName db 'EMMXXXX0',0
mov dx,EMSName ; device driver name
mov ax,3D00h ; open device-access/file sharing mode
int 21h
jc NotThere
mov bx,ax ; put handle in proper place
mov ax,4407h ; IOCTL - get output status
int 21h
jc NotThere
cmp al,0FFh
jne NotThere
mov ah,3Eh ; close device
int 21h
jmp ItIsThere

Allocating EMS

Interrupt 0x67, AH = 0x43, BX = # of pages (1 page = 16KB)

Detect segment to be used

Interrupt 0x67, AH = 0x41

Save previous EMS map

Interrupt 0x67, AH = 0x47

Map our allocated memory

Interrupt 0x67, AH = 0x44

Restore previous EMS map

Interrupt 0x67, AH = 0x48

Release EMS

Interrupt 0x67, AH = 0x45

Various other functions are provided by int 0x67.

A20 line
We saw that the maximum address is FFFF:000F, because increasing the offset results in wrapping. That is true because the 8088
CPU has only 20 bits of addressing. However, the 286+ added a 21st line (known as the A20 line) and, when it is enabled, FFFF:0010 to
FFFF:FFFF can be used without wrapping (almost 64KB more). This memory (known as the High Memory Area, HMA) is then
accessible from real mode and it can be used by HIMEM.SYS to load parts of DOS in it and therefore make more low memory
available for applications.
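
The effect of the A20 line on address calculation can be sketched like this. It is an illustrative model only; address_with_a20 and hma_size are names invented for this example.

```c
#include <stdint.h>

/* Address that reaches the memory bus for seg:off, depending on the A20 gate.
   address_with_a20 is a hypothetical helper name. */
uint32_t address_with_a20(uint16_t seg, uint16_t off, int a20_enabled) {
    uint32_t linear = ((uint32_t)seg << 4) + off;  /* at most 0x10FFEF */
    if (!a20_enabled)
        linear &= 0xFFFFFu;                        /* bit 20 forced to 0: wraps like an 8088 */
    return linear;
}

/* Size of the High Memory Area: everything from FFFF:0010 to FFFF:FFFF. */
uint32_t hma_size(void) {
    return address_with_a20(0xFFFF, 0xFFFF, 1)
         - address_with_a20(0xFFFF, 0x0010, 1) + 1;
}
```

hma_size() evaluates to 65520 bytes, which is the "almost 64KB" mentioned above.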

Enabling or disabling A20 manually requires us to communicate with the keyboard controller:

WaitKBC:
mov cx,0ffffh
A20L:
in al,64h
test al,2
loopnz A20L
ret

ChangeA20:
call WaitKBC
mov al,0d1h
out 64h,al
call WaitKBC
mov al,0dfh ; use 0dfh to enable and 0ddh to disable.
out 60h,al
ret

Segmented Protected Mode

Architecture
Protected mode solves the real mode problems. In particular:

Up to 16 MB (286) and up to 4GB (386+) are directly accessible.


Memory access is checked, protections and protection levels are available.
If something wrong happens, the problem can be isolated and the rest of the applications are not affected.
There is 16-bit protected mode (286+) or 32-bit protected mode(386+)

DOS never ran in protected mode. Windows 3.0 ran in 16-bit segmented protected mode when started with the /s switch. Windows
95+, Linux and the rest of the 32-bit OSes run in flat protected mode, but before looking at flat mode we will immerse ourselves in the complex
mechanisms that protected mode has. Flat mode greatly simplifies many complex things in normal segmented protected mode.

Protected mode introduces "rings", that is, levels of authorization. There are four rings (Ring 0, 1, 2 and 3), of which Ring 0 is
the most privileged and Ring 3 is the least privileged. Code running in a less privileged ring cannot access (without OS
supervision) code in a more privileged ring.

Memory
A segment in memory is no longer fixed in place, nor does it have a fixed 64KB size. A protected mode segment can have any size, from 1
byte to 4GB. Each segment has its own limitations (read, write, execute access) and its own protection ring.

Registers
The same set of registers that exists in real mode is available. Also, every 32-bit general purpose register can be used as an index, for example mov
ax,[ebx] will work.

Global Descriptor Table


The Global Descriptor Table (GDT) is a set of entries that describes all segments for the CPU. Each entry is 8 bytes long and has
the following format:

Bits Meaning
0-15 Limit low 16 bits
16-31 Base low 16 bits
32-39 Base medium 8 bits
40 Ac
41 RW
42 DC
43 Ex
44 S
45-46 Priv
47 Pr
48-51 Limit upper 4 bits
52-53 Reserved (0)
54 Sz
55 Gr
56-63 Base upper 8 bits

The base is a 32-bit value that indicates the physical memory that this segment starts at.
The limit is a 20-bit value indicating the length of the segment, depending on the Gr bit. If the Gr bit is 1, then the limit is
counted in 4096-byte units instead of bytes (actual limit = limit value * 4096 + 4095).
The Ex flag is 1, to indicate a code segment, or 0, to indicate a data segment.
The DC flag has different meaning, depending on the Ex flag:

For code segment (Ex = 1), if DC is 0 then the segment is non conforming. A non conforming segment can only be
called from a segment with the same privilege level. If DC is 1 then the segment is conforming and can also be
called from less privileged segments. For example, a ring 2 conforming segment can be called from a ring 3
segment.
For data segment (Ex = 0), if DC is 0 then the data segment expands up, else it expands down. In an expand-down
segment, the valid offsets lie above the limit instead of below it, so the segment grows towards lower addresses. This flag was
created so a stack segment could be easily expanded, but it is not used today.

The RW flag has different meaning, depending on the Ex flag:

For code segment (Ex = 1), if 0, then the segment is not readable. If 1, then the code segment is readable.
For data segment (Ex = 0), if 0, segment is read only, else read-write.

Note that a code segment is not writable. However, because segment base addresses can overlap, you can create a
writable data segment with the same base address and limit of a code segment.

The Priv bits indicate the ring of the segment (00 to 11).


The Ac bit indicates access. The CPU sets this bit each time the segment is accessed, so the OS gets an idea how frequent
is the access to the segment, so it knows if it can cache it to disk or not.
The S bit must be 1 for code and data segments, and 0 for system segments (see below).
The Pr bit can be set to 1 to indicate that the segment is present in memory. If the OS caches this segment to the disk, then
it sets Pr to 0. Any attempt to access the removed segment causes an exception. The OS catches this exception, and
reloads the segment to memory, setting Pr to 1 again.
The Sz bit can have two values:

0, in which case the default for opcodes is 16-bit. The segment can still execute 32-bit commands (386+) by putting
the 0x66 (operand size) or 0x67 (address size) prefix to them.
1 (386+), in which case the default for opcodes is 32-bit. The segment can still execute 16-bit commands by putting
the 0x66 (operand size) or 0x67 (address size) prefix to them.
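
The bit layout above can be turned into a small C helper that packs one descriptor. This is a sketch under stated assumptions: gdt_entry and effective_limit are names invented here, "access" stands for bits 40-47 (Ac, RW, DC, Ex, S, Priv, Pr) and "flags" for bits 52-55 (Sz is bit 2 and Gr is bit 3 of that nibble).

```c
#include <stdint.h>

/* Packs one 8-byte GDT entry. gdt_entry is a hypothetical helper name.
   access: bits 40-47 of the entry; flags: bits 52-55 (low nibble used). */
uint64_t gdt_entry(uint32_t base, uint32_t limit, uint8_t access, uint8_t flags) {
    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);              /* bits 0-15:  limit low 16 bits */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;       /* bits 16-39: base low 16 + medium 8 bits */
    d |= (uint64_t)access << 40;                   /* bits 40-47: Ac, RW, DC, Ex, S, Priv, Pr */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;   /* bits 48-51: limit upper 4 bits */
    d |= (uint64_t)(flags & 0xFu) << 52;           /* bits 52-55: reserved, Sz, Gr */
    d |= (uint64_t)(base >> 24) << 56;             /* bits 56-63: base upper 8 bits */
    return d;
}

/* Last valid offset of a segment, taking the Gr bit into account. */
uint32_t effective_limit(uint32_t limit20, int gr) {
    return gr ? ((limit20 << 12) | 0xFFFu) : limit20;
}
```

For instance, gdt_entry(0, 0xFFFFF, 0x9A, 0xC) describes a present, ring 0, readable 32-bit code segment covering the whole 4GB, and yields 0x00CF9A000000FFFF.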

In real mode, the segment registers (CS, DS, ES, SS, FS, GS) specify a real mode segment. You can put anything in them,
no matter where it points, and you can read, write and execute from that segment. In protected mode, these registers are
loaded with selectors. The selectors are indices to the GDT and have the following format:

Bits Meaning
0-1 RPL. Requested protection level, must be equal to or lower than the segment PL.
2 0 to take the entry from GDT, 1 from the LDT (see below)
3-15 0-based index to the table.
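
Decoding a selector is plain bit slicing. The sketch below assumes the usual layout (RPL in bits 0-1, table bit in bit 2, index in bits 3-15); the Selector type and decode_selector are names made up for this example.

```c
#include <stdint.h>

/* The three fields of a protected mode selector.
   Selector and decode_selector are hypothetical names. */
typedef struct {
    unsigned index;    /* bits 3-15: 0-based index into the table */
    unsigned use_ldt;  /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;      /* bits 0-1: requested privilege level */
} Selector;

Selector decode_selector(uint16_t sel) {
    Selector s;
    s.rpl     = sel & 0x3u;
    s.use_ldt = (sel >> 2) & 0x1u;
    s.index   = sel >> 3;
    return s;
}
```

So the selector 0x08 means "GDT entry 1, ring 0" and 0x1B means "GDT entry 3, ring 3".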

In protected mode, you can't just load arbitrary values into the segment registers like in real mode. You must put valid values or you
will get an exception. The one special case is the first entry in the GDT, which is always set to 0. The CPU does not read information from
entry 0 and thus it is considered a "dummy" entry. This allows the programmer to put the 0 value to a segment register (DS, ES, FS,
GS) without causing an exception.

The GDT is loaded to the CPU by executing the LGDT command, which points to a 6-byte array:

Bytes 0-1 contain the length of the GDT in bytes minus 1, maximum 64KB => 8192 entries.
Bytes 2-5 contain the physical address of the first entry of the GDT, in memory.
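
The 6-byte array can be built as shown in this sketch. The name make_table_pointer is invented for the example, and it assumes 8-byte entries and at least one entry; note that the CPU expects the table length minus one in the first two bytes.

```c
#include <stdint.h>

/* Fills the 6-byte operand used by LGDT (and, in protected mode, LIDT).
   make_table_pointer is a hypothetical helper name. 'entries' must be
   between 1 and 8192; each entry is assumed to be 8 bytes long. */
void make_table_pointer(uint8_t out[6], uint32_t base, uint16_t entries) {
    uint16_t limit = (uint16_t)(entries * 8u - 1u);  /* length in bytes, minus 1 */
    out[0] = (uint8_t)(limit & 0xFF);                /* bytes 0-1: limit, little endian */
    out[1] = (uint8_t)(limit >> 8);
    out[2] = (uint8_t)(base & 0xFF);                 /* bytes 2-5: address of the first entry */
    out[3] = (uint8_t)((base >> 8) & 0xFF);
    out[4] = (uint8_t)((base >> 16) & 0xFF);
    out[5] = (uint8_t)(base >> 24);
}
```

A GDT with 3 entries therefore gets a limit of 23 (0x17).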

Interrupts
Each entry in the interrupt table is now 8 bytes long, with the following structure:

struc IDT_STR ofs0_15,sel,zero,flags,ofs16_31
{
.ofs0_15 dw ofs0_15
.sel dw sel
.zero db zero
.flags db flags ; bits 0-3 gate type, bit 4 zero, bits 5-6 DPL, bit 7 P
.ofs16_31 dw ofs16_31
}

Each interrupt also has a protection level. The LIDT command has the same functionality as in real mode, pointing to a 6 byte
array (containing the size and the physical location of the first entry).
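
For illustration, here is how one 8-byte gate could be packed in C. This is a sketch, assuming the flags byte holds the gate type in bits 0-3, the DPL in bits 5-6 and the present bit in bit 7; idt_gate is a name made up for the example.

```c
#include <stdint.h>

/* Packs one protected mode IDT entry (present bit always set).
   idt_gate is a hypothetical helper name.
   type: 0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate. */
uint64_t idt_gate(uint32_t offset, uint16_t selector, uint8_t type, uint8_t dpl) {
    uint8_t flags = (uint8_t)(0x80u | ((dpl & 3u) << 5) | (type & 0xFu));
    uint64_t d = 0;
    d |= (uint64_t)(offset & 0xFFFFu);   /* bits 0-15:  handler offset, low word */
    d |= (uint64_t)selector << 16;       /* bits 16-31: code segment selector */
                                         /* bits 32-39: always zero */
    d |= (uint64_t)flags << 40;          /* bits 40-47: type, DPL, P */
    d |= (uint64_t)(offset >> 16) << 48; /* bits 48-63: handler offset, high word */
    return d;
}
```

A ring 0 interrupt gate for a handler at 0x12345678 in the code segment selected by 0x08 packs to 0x12348E0000085678.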

After the LIDT command is executed, real mode interrupts no longer work, so a real mode debugger is useless.

Local Descriptor Table


The Local Descriptor Table (LDT) is a way for each application, in multitasking scenarios, to have a private set of segments, loaded
with the LLDT assembly instruction. The LDT bit in the selector specifies if the segment loaded is from the GDT or from the LDT.

System Segments in the GDT


When the S bit in the GDT is 0, this indicates a system-related segment. In this case, GDT entries describe the following kinds of system
segments:

Task Segments
Call Gates
Interrupt Gates
Trap Gates (same as interrupt gates, with the exception that when a trap occurs, interrupts are still enabled)

Bits 40-43 in a GDT entry have the following meaning:

0000 - Reserved
0001 - Available 16-bit TSS
0010 - Local Descriptor Table (LDT)
0011 - Busy 16-bit TSS
0100 - 16-bit Call Gate
0101 - Task Gate
0110 - 16-bit Interrupt Gate
0111 - 16-bit Trap Gate
1000 - Reserved
1001 - Available 32-bit TSS
1010 - Reserved
1011 - Busy 32-bit TSS
1100 - 32-bit Call Gate
1101 - Reserved
1110 - 32-bit Interrupt Gate
1111 - 32-bit Trap Gate

Call Gates
Call gates are a mechanism to switch from a low privilege code to a higher one, used for user-level code to call system-level code.
You specify a 1100 type entry in the GDT with the following format:


struct CALLGATE
{
unsigned short offs0_15;
unsigned short selector;
unsigned char argnum:5; // number of arguments to copy to the new stack from the current stack
unsigned char r:3; // Reserved
unsigned char type:5; // 01100 (S bit = 0, type = 1100)
unsigned char dpl:2; // DPL of this gate
unsigned char P:1; // Present bit
unsigned short offs16_31;
};

Using CALL FAR with the selector of this callgate (the offset is ignored) will switch to the gate and execute the higher level privilege
commands. If argnum specifies parameters to be copied, the system pushes SS and ESP to the new stack, copies the parameters, and then pushes CS and EIP.
Using RETF will return from the gate call.

Call gates are slow mechanisms to transit between rings in the CPU.

TSS Descriptors, Task Gates and Hardware Multitasking


Having the ability to hold Task Segments in the GDT and Local Descriptor Tables, CPUs provide the ability for task switching. The
Task State Segment is where the CPU saves information about a local task (the current registers). Executing a far JMP or a CALL
(offsets are ignored like in call gates) with a selector pointing to a GDT TSS will "switch" to that task, restoring saved registers. The
TSS descriptor is used to specify the base address and limit of the TSS to be used to load the new CPU state from. The CPU has
a register named Task Register which tells which TSS will receive the old CPU state. When the TR register is loaded with an
LTR instruction the CPU looks at the GDT entry (specified with LTR) and loads the visible part of TR with the selector, and the
hidden part with the base and limit of the GDT entry. When the CPU state is saved the hidden part of TR is used.

In addition to the far call and jmp, a context switch can be triggered by using a Task Gate Descriptor. Unlike TSS Descriptors,
task-gate descriptors can be in the GDT, LDT or IDT (so you can force a task switch when an interrupt occurs).

Entering protected mode


The steps to follow are:

Enable A20
Set the GDT
Set the IDT (if you need interrupts in protected mode)
Enter protected mode with the MSW or the CR0 register.

You use the MSW register (in 286), or, in 386+ CR0:

; 386+
mov eax,cr0
or eax,1
mov cr0,eax

; 286
smsw ax
or al,1
lmsw ax

After that, you must execute a far jump to a protected mode code segment in order to load CS and flush any instructions already prefetched in real mode. If this
code segment is a 16-bit code segment, you must do:

db 0eah ; Opcode for far jump
dw StartPM ; Offset to start, 16-bit
dw xx ; A selector value in the GDT, with the Sz bit off.

If this code segment is a 32-bit code segment, you must do:

db 66h ; Prefix for 32-bit
db 0eah ; Opcode for far jump
dd StartPM ; Offset to start, 32-bit
dw xx ; A selector in the GDT, with the Sz bit on.

Also you must set up the stack and other registers:

mov ax, data_selector
mov ds,ax
mov ax, stack_selector
mov ss,ax
mov esp,1000h ; assuming that the limit of the stack segment
; selected by stack_selector is 1000h bytes.
sti
...

Exiting protected mode

cli
mov eax,cr0
and eax,0fffffffeh ; clear only bit 0 (the protected mode flag)
mov cr0,eax
mov ax,data16
mov ds,ax
mov ax,stack16
mov ss,ax
mov sp,1000h ; assuming that stack16 is 1000h bytes in length
mov bx,RealMemoryInterruptTableSavedWithSidt
lidt [bx]
sti
; (Real mode debugger works here) ...

On the 286, you cannot get back to real mode this way, because LMSW cannot clear the protected mode flag. The only way out is a processor reset,
which keeps the memory intact. Before forcing this reset, you register a routine to be executed after the reset with the following code:

MOV ax,40h
MOV es,ax
MOV di,67h
MOV al,8fh
OUT 70h,al
MOV ax,ShutdownProc
STOSW
MOV ax,cs
STOSW
MOV al,0ah
OUT 71h,al
MOV al,8dh
OUT 70h,al

In 386+, normal exit back to the real mode can be done.

Problems

While you can access all the memory directly, there is still a lot of segmentation and slow task switching or slow movement between
rings.


Flat Protected Mode

Paging
Paging is the method to redirect a memory address to another address. The requested address is called linear address and the
target address is called physical address. When a linear address is the same as a physical address, we say that we are in a "see
through" area.

To accomplish paging, two tables are used: the page directory and the page table.

The Page Directory is an array of 1024 32-bit entries with the following format:

P,R,U,W,D,A,N,S,G,AA,Addr

P - Page is present in memory. This flag allows the OS to move pages out to disk, clear P, and reload them when a
page fault is generated when software attempts to access the page.
R - Page is Read Write if set, else Read only. This restriction applies only to ring 3 unless the WP bit in CR0 is set.
U - If unset, only ring 0 can access this page.
W - If set, write-through is enabled.
D - If set, the page will not be cached. The CPU caches the page tables in its Translation Lookaside Buffer (TLB).
A - Set by the CPU when the page is accessed (but never cleared automatically, that is up to the OS).
N - Set to 0.
S - Set to 0. If Page Size Extensions (PSE) are enabled, S can be 1, in which case the page size is 4MB instead, and the
pages must be 4MB aligned. This mode is introduced to avoid lots of small pages, at the expense of more memory wasted if
the needed memory is somewhat larger than 4MB. Fortunately, modes can be mixed.
G - Set to 0.
Addr - The upper 20 bits (the lower 12 are ignored because it must be 4096-aligned) of the physical address of the Page Table that this Page
Directory entry points to.

The Page Table is an array of 1024 32-bit entries with a similar format:

P,R,U,W,C,A,D,N,G,AA,Addr

The C bit is the same as the previous D bit.
The D bit marks dirty pages (pages that have been written to) for the OS.
The G flag, if set, keeps the page's TLB entry from being flushed when CR3 is reloaded (global pages must be enabled in CR4).

To enable paging:

Load CR3 with the address of the first entry in the Page Directory (must be 4096-aligned).
Set CR0 bit 31. This requires protected mode, with the exception of LOADALL (see below).

Once paging is enabled, the CPU caches translations in the TLB. Reloading CR3 flushes this cache. The 486 and later also have an INVLPG
instruction to invalidate the cached entry of one particular page, not the entire TLB.
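
The translation described above can be sketched in C. This is only an illustration of how a 32-bit linear address is split and how an entry is composed; the function and constant names are made up for this example and do not come from the project.

```c
#include <stdint.h>

/* Flag bits shared by Page Directory and Page Table entries (low 12 bits). */
enum {
    PG_PRESENT  = 1 << 0,   /* P */
    PG_WRITABLE = 1 << 1,   /* R */
    PG_USER     = 1 << 2    /* U */
};

/* Index into the Page Directory: linear address bits 22-31. */
uint32_t pd_index(uint32_t linear) { return linear >> 22; }

/* Index into the Page Table: linear address bits 12-21. */
uint32_t pt_index(uint32_t linear) { return (linear >> 12) & 0x3FFu; }

/* Offset inside the 4KB page: linear address bits 0-11. */
uint32_t page_offset(uint32_t linear) { return linear & 0xFFFu; }

/* Build an entry from a 4096-aligned physical address and flag bits. */
uint32_t make_entry(uint32_t phys, uint32_t flags)
{
    return (phys & 0xFFFFF000u) | (flags & 0xFFFu);
}

/* Physical address an entry points to (its upper 20 bits). */
uint32_t entry_addr(uint32_t entry) { return entry & 0xFFFFF000u; }
```

For example, linear address 0x00403025 selects Page Directory entry 1, Page Table entry 3 and offset 0x025 inside the page.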

Architecture

The segmented protected mode is very complex. Using paging, protected mode can be "flat", enabling the following:

All processes get a 4GB virtual address space. Protection is done at the paging level. All segments are 4GB, all segment
selectors always point to the same segment.
Programming is way simpler since only "near" pointers are needed.
The OS can map shared libraries (residing once in physical memory) to multiple virtual destinations per application.
The application only sees memory paged to its own virtual address space, so processes are protected by hardware.

In addition, all modern OSes now use only 2 of the 4 protection rings, ring 0 for their kernel and ring 3 for all the user applications.
Call gates are no longer used.

SYSENTER/SYSEXIT

To make transitions between user mode (ring 3) and kernel mode (ring 0) faster, a method other than call gates had to be
implemented. The SYSENTER/SYSEXIT instructions are the current way to switch from ring 3 to ring 0 and back. You use WRMSR to set
the ring 0 values for CS (MSR 0x174), ESP (MSR 0x175) and EIP (MSR 0x176). For SYSEXIT, ECX must hold the ring 3 stack pointer and EDX
must hold the ring 3 EIP. The value stored for CS must be the selector of the first of 4 consecutive GDT entries: the first is the ring 0 code, the
second is the ring 0 data, the third is the ring 3 code and the fourth is the ring 3 data. This layout is fixed, so in order to use
SYSENTER your GDT must contain these entries in this order.

These opcodes only support switching between ring 3 and ring 0, but they are much faster. They are used today instead of the way
slower call gates.
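
As a rough illustration of the fixed selector layout, the following C sketch derives the four selectors from the value written to the CS MSR (0x174). The struct and function names are hypothetical; the arithmetic (consecutive 8-byte entries, privilege level 3 for the SYSEXIT selectors) follows the rule described above.

```c
#include <stdint.h>

/* Selectors the CPU derives from the value stored in MSR 0x174. */
struct sysenter_selectors {
    uint16_t ring0_cs;  /* loaded by SYSENTER */
    uint16_t ring0_ss;  /* loaded by SYSENTER */
    uint16_t ring3_cs;  /* loaded by SYSEXIT, requested privilege level 3 */
    uint16_t ring3_ss;  /* loaded by SYSEXIT, requested privilege level 3 */
};

struct sysenter_selectors sysenter_layout(uint16_t msr_cs)
{
    struct sysenter_selectors s;
    uint16_t base = msr_cs & 0xFFFCu;           /* drop the RPL bits */
    s.ring0_cs = base;                          /* 1st entry */
    s.ring0_ss = (uint16_t)(base + 8);          /* 2nd entry */
    s.ring3_cs = (uint16_t)((base + 16) | 3);   /* 3rd entry */
    s.ring3_ss = (uint16_t)((base + 24) | 3);   /* 4th entry */
    return s;
}
```

With 0x08 stored in the MSR, the four selectors come out as 0x08, 0x10, 0x1B and 0x23.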

Software multitasking

Task gates are no longer used by today's operating systems. Instead, they apply software multitasking to switch between
processes:

A "scheduler" runs, driven by a timer interrupt.
It switches stack and EIP based on thread and process priorities.

Because a software scheduler saves only what is necessary for task switching, it is faster than the segmented mode hardware
switching.

Protected Mode Facts

Unreal mode

Because protected mode cannot call DOS or BIOS interrupts, it is generally not very useful to DOS applications. However, a 'bug' in
the 386+ processor turned out to be a feature called unreal mode. Unreal mode is a method to access the entire 4GB of
memory from real mode. This trick is undocumented, however a large number of applications are using it. The trick is based on the
fact that a segment register can be loaded in protected mode with a 4GB data segment (set in the GDT), and when the CPU goes back to real mode the
register's "invisible part" remains intact and still has a 4GB limit.

To use unreal mode, you must:

Enable A20.
Enter protected mode.
Load a segment register (ES or FS or GS) with a 4GB data segment.
Return to real mode.

After returning from protected mode, you can easily do:

; assuming FS was loaded with a 4GB data segment in Protected Mode
mov ax,0
mov fs,ax ; in real mode this changes only the base, the 4GB limit stays
mov edi,1048576 ; point above 1MB
mov byte [fs:edi],0 ; Set a byte above 1MB.

286 lacks this capability because to exit protected mode, the CPU has to be reset, so all registers are destroyed (but see LOADALL
below).
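
To see why the limit matters, here is a small C sketch of the real mode address calculation. It is a simplified model with made-up function names: the base is the segment value times 16, and an access only succeeds if the offset does not exceed the limit held in the invisible part of the segment register.

```c
#include <stdint.h>

/* Linear address formed in real mode: segment * 16 plus the offset. */
uint32_t real_mode_linear(uint16_t segment, uint32_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

/* Does the offset pass the segment limit check? */
int offset_within_limit(uint32_t offset, uint32_t limit)
{
    return offset <= limit;
}
```

With the normal real mode limit of 0xFFFF, the offset 1048576 used above is rejected; with the 4GB limit left over from protected mode it is accepted and reaches linear address 0x100000.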

Huge real mode


The above unreal mode theory can be applied to CS as well, making it possible to execute code at a position over 1MB when EIP >
0xFFFF. However when calling an interrupt, the upper 16 bits of EIP are not pushed to the stack, so on return you will not return
where you were. Therefore, huge real mode was never widely used.

LOADALL
At that time, a now non-existent and mostly undocumented instruction existed, LOADALL (0xF 0x5 in 286, 0xF 0x7 in 386).
LOADALL was used, as the name implies, to load all the registers (including the GDTR and IDTR) from one table in memory. In 286
LOADALL (which was not accessible from 386), this table was fixed at memory address 0x800, whereas 386 LOADALL reads
the buffer pointed to by real mode ES:EDI. Because the CPU does not check in any way if any of the values loaded by LOADALL
are valid, LOADALL was used by many tools at the time, including HIMEM.SYS, for various infamous actions:

To access the entire memory from real mode without entering protected mode and unreal mode.
To run real mode code with paging.
To run normal 16-bit code inside protected mode without VM86 (which was not there in 286). This was done by trapping
each memory access (which would lead to GPF because all the segments were marked non-present) and emulating the
desired result by using another LOADALL. Of course this was too slow, but it led to the creation of the VM86 mode in 386,
where LOADALL eventually faded out.

LOADALL cannot switch the 286 back to real mode, but using LOADALL removes the need to enter protected mode altogether.

LOADALL 286 itself was mentioned in the manuals and was partially documented; by contrast, LOADALL 386 was heavily obscure,
probably to induce the programmers to take advantage of the new VM86 mode.

HIMEM.SYS

Protected mode is complex and, without a debugger available, it is prone to lots of unsolvable crashes. To help the programmers,
Microsoft created a driver that was able to manage protected mode from a normal 16-bit DOS application, allowing it to access high
memory. At that time, extended memory was mostly, if not totally, used to cache data from the disk, especially by big apps. HIMEM
puts the CPU in unreal mode (or it uses LOADALL in 286) and provides a simple interface to the applications that want more
memory without messing with the protected mode details. By enabling the A20 line, HIMEM allowed a portion of
DOS to reside in the high memory area when config.sys had a DOS=HIGH directive.

Detect HIMEM.SYS

Interrupt 0x2F, AX = 0x4300

Return HIMEM.SYS function pointer

Interrupt 0x2F, AX = 0x4310

All the following functions are provided from the function at the returned ES:BX from the above interrupt.

Detect/Enable/Disable A20

AH = 0x7 (detect), 0x3 (enable), 0x4 (disable)

Allocate HMA

AH = 0x1

Free HMA

AH = 0x2

Allocate extended memory

AH = 0x9

Free extended memory

AH = 0xA

Copy real/protected memory from/to real/protected memory

AH = 0xB

Lock/Unlock protected mode memory

AH = 0xC (Lock), 0xD (Unlock)

HIMEM.SYS moves memory in order to defragment it. Locking memory is useful when you will access the memory directly, within
protected mode. Actually, because HIMEM puts the CPU in unreal mode, you can use the very same returned pointers directly.

VM86 Mode
Many of the existing applications were real-mode at the time protected mode was introduced. Even today, many (mostly games) are
played under Windows. To force these applications (which think they own the machine) to cooperate, a special mode had to be
created.

VM86 mode is enabled by a special flag in the EFlags register (VM, bit 17). It provides a normal 16-bit DOS memory map of 640KB which is forwarded
via paging to the actual memory - this makes it possible to run multiple DOS applications at the same time without any
risk of one application overwriting another. EMM386.EXE puts the processor in that state. The OS monitors the process
step by step, making sure that it won't execute something illegal. Normally, you also want to map all your
other critical structures (GDT, IDT etc) above 1MB so they are not visible to any VM86 process.

To trigger VM86 mode, you can use PUSHFD and IRET:

mov ebp,esp
push dword [ebp+4]
push dword [ebp+8]
pushfd
or dword [esp], (1 << 17) ; set the VM flag (EFlags bit 17)
push dword [ebp+12] ; cs
push dword [ebp+16] ; eip
iret

Once the VM flag is set, you can load a normal "segment" to a segment register. Interrupt calls by DOS applications are caught by
the OS and emulated through it - if possible. Also, some instructions are ignored, for example, if you do a CLI, the interrupts are not
actually disabled. The OS sees that you prefer to not be interrupted and acts accordingly, but interrupts are still there.

All VM86 code executes in PL 3, the lowest privilege level. Ins/Outs to ports are also captured and emulated if possible. The
interesting thing about VM86 is that there are two interrupt tables, one for the real and one for the protected mode. But only
protected mode interrupts are executed.

VM86 was removed from 64-bit mode, so a 64-bit OS cannot execute 16-bit DOS code anymore. In order to execute such code,
you need an emulator such as DosBox.

Many applications were also written to take advantage of the expanded memory, but the modern standard was the protected mode.
EMM386 puts the CPU in VM86 mode and maps via paging memory over 1MB to real mode segments (over 0xA0000), so an
application that would like to use expanded memory can use it via EMM386.EXE, which provides an LIM EMS int 0x67 interface. In
addition, EMM386 allowed the "devicehigh" command in CONFIG.SYS and the "loadhigh" command in AUTOEXEC.BAT, allowing drivers and applications to get loaded to these
high segments if possible.

Physical Address Extensions (PAE)


PAE is the ability of x86 to use 36 address bits instead of 32. This increases the available memory from 4GB to 64GB. The 32-bit
applications still see only a 4GB address space, but the OS can map (via paging) memory from the high area to the lower 4GB
address space. This extension was added to x86 to cope with the (nowadays not enough) limit of 4GB, before 64-bit CPUs came to
the foreground.

Enabling PAE (CR4 bit 5) means that now you have 3 paging levels: in addition to the Page Directory and the Page Table, you now
have the PDPT, the Page Directory Pointer Table, which has four 64-bit entries. Each of the PDPT entries points to a Page Directory of
4KB (like in normal paging). Each entry in the new Page Directory is now 64 bits long (so there are 512 entries) and points to a
Page Table of 4KB (like in normal paging), and each entry in the new Page Table is also 64 bits long,
so there are 512 entries. One Page Directory therefore maps only 1GB (512 * 512 * 4KB), a quarter of the original mapping, and that is why 4 PDPT entries are
supported. The first entry maps the first GB, the 2nd the 2nd GB, the 3rd the 3rd GB and finally, the 4th entry maps the 4th GB.

But now the "S" bit in the PDT has a different meaning: If not set, it means that the page entry is 4KB but if set, it means that this
entry does not point to a PT entry, but it describes itself a 2MB page. So you can have different levels of paging traversal depending
on the S bit.

There is a new flag in the Page Directory entry as well, the NX bit (Bit 63) which, if set, prevents code execution in that page.

This system allows the OS to handle memory over 4GB, but since the address space is still 4GB, each process is still limited to
4GB. The memory can be up to 64GB but a process cannot see the entire memory.

Direct Memory Access drivers however have a problem, because they don't use paged memory. If working in 32 bits, the driver has
to manage the paging tables itself in order to be able to manipulate memory over 4GB and this could mean incompatibilities with
the operating system, unless a safe DMA API was exposed to the driver. For this reason, PAE quickly faded out in favor of 64-bit
operating systems, in which PAE paging nevertheless remains required.

DPMI

For DOS applications, unreal mode was not enough; eventually full 32-bit capability had to be made available. DPMI (DOS
Protected Mode Interface) was a driver that provided a (relatively complex) interface to applications wishing to run in 32 bit protected
mode. DOS extenders, based on DPMI, like DOS4GW and DOS32A were created to support applications (mostly games) that
wanted to run in 32 bit while still having access to DOS interrupts. DPMI catches the interrupt call, switches to real mode, executes
the interrupt and goes back to protected mode. DPMI even allows multitasking and multiple "virtual" 32 bit machines.

Detect DPMI using interrupt 2F:

Interrupt 0x2F, AX = 0x1687

Example from DJGPP (http://www.delorie.com/djgpp/doc/dpmi/api/2f1687.html):

modesw dd 0 ; far pointer to DPMI host's
; mode switch entry point
mov ax,1687h ; get address of DPMI host's
int 2fh ; mode switch entry point
or ax,ax ; exit if no DPMI host
jnz error
mov word ptr modesw,di ; save far pointer to host's
mov word ptr modesw+2,es ; mode switch entry point
or si,si ; check private data area size
jz @@1 ; jump if no private data area
mov bx,si ; allocate DPMI private area
mov ah,48h ; allocate memory
int 21h ; transfer to DOS
jc error ; jump, allocation failed
mov es,ax ; let ES=segment of data area
@@1: mov ax,0 ; bit 0=0 indicates 16-bit app
call modesw ; switch to protected mode
jc error ; jump if mode switch failed
; else we're in prot. mode now

App terminates via 0x4C int 0x21 (as in real mode). The rest of DPMI functions are provided through int 0x31 and include:

Real mode interrupt capturing (as function 0x25 int 0x21)


Real mode exception trapping
Call DOS interrupts either directly, or through int 0x31 function 3
Real mode callbacks
Sharing memory between DPMI clients
Paging
Setting hardware breakpoints
TSR capabilities

Many good games like The Dig were running under DPMI.

Long Mode

Architecture

Whatever methods were created to overcome the 4GB limit of the x86, they would eventually lead to full 64-bit processors. Having
discussed all the protected mode complexities, we are lucky to observe that the x64 CPU architecture is way simpler. The x64 CPU
has 3 operation modes:

Real mode
Protected mode (called legacy mode)
Long mode, containing two submodes:

Compatibility mode, 32 bit. This allows a 64-bit OS to run 32-bit applications natively.
64-bit mode

To work in Long mode, the programmer must take into consideration the facts below:

Unlike Protected mode, which can run with or without paging, long mode runs only with PAE and paging and in flat mode. All
the segments are flat, from 0 to 0xFFFFFFFFFFFFFFFF and all memory addressing is linear. DS, ES, SS are ignored. The
"flat" mode is the only valid mode in long mode. No segmentation.
You can get into long mode directly from real mode, by enabling protected mode and long mode within one instruction (this
can work because Control Registers are accessible from real mode).
Although in theory any 64-bit value could be used as an address, in practice we don't yet need 2^64 bytes of memory. Therefore,
current implementations only implement 48-bit addressing, which forces all pointers to have bits 47-63 either all 0 or all 1.
This means that you have 2 ranges of valid "canonical" addresses, one from 0 to 0x00007FFF'FFFFFFFF and one from
0xFFFF8000'00000000 through 0xFFFFFFFF'FFFFFFFF, for a total of 256TB of space. Most OSes reserve the upper area for
the kernel, and the lower area for the user space.
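
The canonical address rule can be checked with a few lines of C. This is a minimal sketch assuming the 48-bit implementation described above; the function name is made up for this example.

```c
#include <stdint.h>

/* Returns 1 if bits 47-63 of the address are all 0 or all 1. */
int is_canonical48(uint64_t addr)
{
    uint64_t upper = addr >> 47;            /* bits 47-63, 17 bits in total */
    return upper == 0 || upper == 0x1FFFFu;
}
```

The last address of the lower range and the first address of the upper range both pass the check, while the addresses right next to them in the gap do not.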

Registers

When running in 64-bit mode, the following 64-bit extensions are available:

RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, RIP


8 new 64-bit registers added: R8 to R15. Their lower 32 bits are accessed as R8D - R15D, their lower 16 bits as R8W - R15W and
their lower 8 bits as R8B - R15B.

These registers are only available in 64-bit mode. In all other modes, including compatibility mode, they are not available.

GDT/IDT
Bit 53 of a GDT entry, previously reserved, is now the "L" bit. When it is 1, the Sz bit must be 0, and the entry describes a 64-bit code segment (the
combination L = 1 and Sz = 1 is reserved and will throw an exception if used). The limits are always 0 to 0xFFFFFFFFFFFFFFFF
and the base is always 0.
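
A short C sketch of how these two bits can be read from a raw 8-byte descriptor. The function names and the two sample descriptor values in the note below are illustrative and not taken from the project; bit 54 is the position of the Sz bit in the same byte as L.

```c
#include <stdint.h>

/* L bit: bit 53 of the 8-byte descriptor. */
int desc_l_bit(uint64_t desc)  { return (int)((desc >> 53) & 1); }

/* Sz bit: bit 54 of the 8-byte descriptor. */
int desc_sz_bit(uint64_t desc) { return (int)((desc >> 54) & 1); }

/* A 64-bit code segment has L = 1 and Sz = 0. */
int is_64bit_code(uint64_t desc)
{
    return desc_l_bit(desc) && !desc_sz_bit(desc);
}
```

For instance, the descriptor value 0x00209A0000000000 is recognized as 64-bit code, while a typical flat 32-bit code descriptor such as 0x00CF9A000000FFFF is not.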

If your GDT resides in the lower 4GB of memory, you need not change it after entering long mode. However, if you plan to
call SGDT or LGDT while in long mode, you must now deal with the 10-byte GDTR, which holds 2 bytes for the length of the
GDT and 8 bytes for its linear address.

Any selector you might load to access a 64-bit segment is ignored, and DS, ES, SS are not used at all. All the segments are flat,
and everything is done via paging. However GS and FS can still be used as auxiliary registers and their values are still subject to
verification from the GDT. In 32-bit Windows, FS points to the Thread Information Block; 64-bit Windows uses GS for this purpose.

IDT is similar to the protected mode's, the difference being the fact that each entry is expanded to contain the 64-bit
address of the interrupt handler:

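struc IDT_STR
{
.ofs0_15 dw ofs0_15
.sel dw sel
.zero db zero ; bits 0-2 select an Interrupt Stack Table entry, 0 if unused
.flags db flags ; bit 7 P, bits 5-6 DPL, bit 4 zero, bits 0-3 gate type
.ofs16_31 dw ofs16_31
.ofs32_63 dd ofs32_63
.zero2 dd 0 ; reserved (renamed, a label cannot be defined twice)
}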
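
The same 16-byte gate can be written as a C structure, which makes it easier to see how a 64-bit handler address is split over the three offset fields. This is a sketch; the type and function names are invented for this example, and the flags value 0x8E in the note below (present, DPL 0, interrupt gate) is only a typical choice.

```c
#include <stdint.h>

/* One entry of the long mode IDT. */
struct idt_gate64 {
    uint16_t ofs0_15;   /* handler address bits 0-15 */
    uint16_t sel;       /* code segment selector */
    uint8_t  ist;       /* the "zero" byte: Interrupt Stack Table index or 0 */
    uint8_t  flags;     /* P, DPL, gate type */
    uint16_t ofs16_31;  /* handler address bits 16-31 */
    uint32_t ofs32_63;  /* handler address bits 32-63 */
    uint32_t reserved;  /* must be 0 */
};

struct idt_gate64 make_gate64(uint64_t handler, uint16_t selector, uint8_t flags)
{
    struct idt_gate64 g;
    g.ofs0_15  = (uint16_t)(handler & 0xFFFF);
    g.sel      = selector;
    g.ist      = 0;
    g.flags    = flags;
    g.ofs16_31 = (uint16_t)((handler >> 16) & 0xFFFF);
    g.ofs32_63 = (uint32_t)(handler >> 32);
    g.reserved = 0;
    return g;
}
```

Calling make_gate64(handler, 0x08, 0x8E) would describe an interrupt gate whose handler runs in the code segment with selector 0x08.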

There is no VM86, DPMI or unreal mode in long mode, and the LDT and call gates, although the CPU still supports them in a 64-bit form, are rarely used. Missing VM86 is the reason that 64-bit OSes cannot run 16
bit software without an emulator.


Long Mode Paging

In long mode the paging system adds a new top level structure, the PML4T, which has 512 64-bit entries, each pointing to one
PDPT, and now each PDPT has 512 entries as well (instead of 4 in the x86 mode). So now you can have up to 512 PDPTs, which means
that one PT entry manages 4KB, one PDT entry manages 2MB (4KB * 512 PT entries), one PDPT entry manages 1GB (2MB * 512
PDT entries), and one PML4T entry manages 512GB (1GB * 512 PDPT entries). Since there are 512 PML4T entries, a total of
256TB (512GB * 512 PML4T entries) can be addressed.

This is another reason not to use the entire 64-bit for addressing. Using the entire thing would force us to have 6 levels of paging,
where now four are needed.

Each of the "S" bits in the PDPT/PDT can be 0 to indicate that there is a lower level structure below, or 1 to indicate that the
traversal ends here. If the PDPT S flag is 1, then the page size is 1GB.
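
The four-level traversal and the sizes quoted above can be reproduced with a little C. This is an illustrative sketch with invented function names: each level consumes 9 bits of the address, and the last 12 bits are the offset inside a 4KB page.

```c
#include <stdint.h>

/* Index into each table: 9 bits per level. */
unsigned lm_pml4_index(uint64_t va) { return (unsigned)((va >> 39) & 0x1FF); }
unsigned lm_pdpt_index(uint64_t va) { return (unsigned)((va >> 30) & 0x1FF); }
unsigned lm_pdt_index(uint64_t va)  { return (unsigned)((va >> 21) & 0x1FF); }
unsigned lm_pt_index(uint64_t va)   { return (unsigned)((va >> 12) & 0x1FF); }

/* Bytes managed by one entry: level 1 = PT, 2 = PDT, 3 = PDPT, 4 = PML4T. */
uint64_t lm_bytes_per_entry(int level)
{
    return 4096ULL << (9 * (level - 1));
}
```

The first address of the upper canonical range, 0xFFFF800000000000, falls on PML4T entry 256, which is why the upper half of that table usually belongs to the kernel.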

There is an Intel draft about PML5, a new top level structure which would allow 5 levels of paging, when the CPUs will support 57
bits of addressing.

Entering Long Mode

; Disable paging, assuming that we are in a see-through.
mov eax, cr0 ; Read CR0.
and eax,7FFFFFFFh ; Set PG=0.
mov cr0, eax ; Write CR0.
mov eax, cr4
bts eax, 5
mov cr4, eax ; Set PAE
mov ecx, 0c0000080h ; EFER MSR number.
rdmsr ; Read EFER.
bts eax, 8 ; Set LME=1.
wrmsr ; Write EFER.
; Enable Paging to activate Long Mode. Assuming that CR3
; is loaded with the physical address of the page table.
mov eax, cr0 ; Read CR0.
or eax,80000000h ; Set PG=1.
mov cr0, eax ; Write CR0.

Turn off paging, if enabled. To do that, you must ensure that you are running in a "see through" area.
Set PAE, by setting bit 5 of CR4.
Create the new page tables and load CR3 with them. Because CR3 is still 32-bits before entering Long mode, the page
table must reside in the lower 4GB.
Enable Long mode (note, this does not enter Long mode, it just enables it).
Enable paging. Enabling paging activates and enters Long mode.

Because the rdmsr/wrmsr opcodes are also available in Real mode, you can activate Long mode from Real mode directly by
setting both the PE and PG bits of CR0 simultaneously.

Entering 64-bit

Now you are in compatibility mode. Enter 64-bit mode by jumping to a 64-bit code segment:

; also db 066h if entering from a 16-bit code segment
db 0eah
dd LinearAddressOfStart64
dw code64_idx ; selector of the 64-bit code segment (the name is assumed here)

The initial 64-bit segment must reside in the lower 4GB because compatibility mode does not see 64-bit addresses. Note that you
must use the linear address, because 64-bit segments always start from 0. Note also that if the current compatibility segment is 16-
bit default, you have to use the 066h prefix.

The only thing you have to do in 64-bit mode is to reset the RSP:

mov rsp,STACK64
shl rsp,4
add rsp,stack64_end

SS, DS, ES, are not used in 64-bit mode. That is, if you want to access data in another segment, you cannot load DS with that
segment's selector and access the data. You must specify the linear address of the data. Data and stack are always accessed with
linear addresses. "Flat" mode is not only the default, it is the only one for 64-bit.

Once you are in 64-bit mode, the default operand size for the opcodes (except for jmp/call) is still 32-bit. So a REX prefix
(0x40 to 0x4F) with its W bit set (0x48 to 0x4F) is required to mark a 64-bit operand. Your assembler handles that automatically if it supports a "code64" segment.
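
The REX prefix byte has the bit layout 0100WRXB, which a one-line C function can build. This is just a sketch with a hypothetical function name, meant to show where the 0x40 to 0x4F range comes from.

```c
#include <stdint.h>

/* Build a REX prefix: 0100WRXB. W = 1 selects a 64-bit operand size;
   R, X and B extend the register fields so that R8 - R15 can be encoded. */
uint8_t rex_prefix(int w, int r, int x, int b)
{
    return (uint8_t)(0x40 | (w ? 8 : 0) | (r ? 4 : 0) | (x ? 2 : 0) | (b ? 1 : 0));
}
```

For example, mov eax,ebx is encoded as 89 D8, and putting the prefix 0x48 (only W set) in front of it gives 48 89 D8, which is mov rax,rbx.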

In addition, a 64-bit interrupt table must now be set with a new LIDT instruction, this time taking a 10-byte operand (2 for the
length and 8 for the location).

Returning to Compatibility Mode


To exit 64-bit mode, it is first necessary to return to compatibility mode. Because 0eah is not a valid jump when in 64-bit mode, you
have to use a RETF trick to get back to a compatibility mode segment.

push code32_idx ; The selector of the compatibility code segment
xor rcx,rcx
mov ecx,Back32 ; The address must be a 64-bit address,
; so upper 32-bits of RCX are zero.
push rcx
retf

This gets you back to compatibility mode. 64-bit OSs keep jumping from 64-bit to compatibility mode in order to be able to run both
64-bit and 32-bit applications.

Exiting from Long Mode


You have to set up all the registers again with 32-bit selectors - back to segmentation. Also you must be in a see-through area
because to exit long mode you must deactivate paging. Of course, you can switch immediately to real mode by resetting the PE bit
as well.

; We are now in Compatibility mode again
mov ax,stack32_idx
mov ss,ax
mov esp,stack32_end
mov ax,data32_idx
mov ds,ax
mov es,ax
mov ax,data16_idx
mov gs,ax
mov fs,ax

; Disable Paging to get out of Long Mode
mov eax, cr0 ; Read CR0.
and eax,7fffffffh ; Set PG=0.
mov cr0, eax ; Write CR0.

; Deactivate Long Mode
mov ecx, 0c0000080h ; EFER MSR number.
rdmsr ; Read EFER.
btr eax, 8 ; Set LME=0 (btr always clears the bit, btc would only toggle it).
wrmsr ; Write EFER.

; Back to protected mode

Multiple Cores

A single CPU can execute one instruction at a time. Multitasking on a single processor is generally the fast switching (at the software
level) between different registers/paging for each process running, and this is so fast that it appears that processes run
simultaneously.

A multiple core CPU is similar to having many single CPUs that share the same memory. Everything else (registers, modes, etc)
is specific to each CPU. That means that if we have an 8 core processor, we have to execute the same procedure 8 times to put it
e.g. in long mode. We can have one processor in real mode, another processor in protected mode, another processor in long
mode etc.

In multiple core configurations we are concerned with three things:

How to discover multiple processors and their properties


How to communicate from one CPU to another
How to synchronize access to sensitive data

Discovery

The Advanced Programmable Interrupt Controller (APIC) is described by a set of tables, found in memory, that will provide us the information we
need. First we discover the presence of APIC:

mov eax,1
cpuid
bt edx,9
jc ApicFound

Second, we search for the Advanced Configuration and Power Interface (ACPI) tables in memory. The first ACPI structure, the Root System
Description Pointer (RSDP), resides somewhere in BIOS memory, between physical addresses 0xE0000 and 0xFFFFF, and it has the following layout:

struct RSDPDescriptor
{
char Signature[8];
uint8_t Checksum;
char OEMID[6];
uint8_t Revision;
uint32_t RsdtAddress;

// The following is present if ACPI 2.0
uint32_t Length;
uint64_t XsdtAddress;
uint8_t ExtendedChecksum;
uint8_t reserved[3];
};

The above RSDP Descriptor begins with the signature "RSD PTR " (with a trailing space) which, read as a 64-bit value, is 0x2052545020445352. If this signature
is not found in the memory, then we don't have ACPI and therefore we cannot discover multiple CPU cores this way.

Each descriptor also has a checksum, which is verified with the following algorithm:

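IsChecksumValid: ; In: FS:EDI = start of the table, ECX = its length. Out: EAX = 1 if valid, 0 if not.
PUSH ECX
PUSH EDI
XOR EAX,EAX
.St:
ADD AL,[FS:EDI] ; sum the bytes modulo 256
INC EDI
DEC ECX
JECXZ .End
JMP .St
.End:
TEST AL,AL ; a valid table sums to 0
MOV EAX,0 ; MOV leaves the flags untouched
JNZ .F
MOV EAX,1
.F:
POP EDI
POP ECX
RETF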
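
For comparison, the signature search and the checksum can be sketched in C over a copy of the BIOS memory area. The function names are made up for this example, and two assumptions are made that the text above does not state: the descriptor is looked for on 16-byte boundaries, and only the 20 bytes of the ACPI 1.0 part are summed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sum of all bytes modulo 256; a valid ACPI structure sums to 0. */
int acpi_checksum_ok(const uint8_t *p, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}

/* Scan a memory image for "RSD PTR " on 16-byte boundaries (assumed).
   Returns the offset of a descriptor with a valid checksum, or -1. */
long find_rsdp(const uint8_t *mem, size_t size)
{
    for (size_t off = 0; off + 20 <= size; off += 16) {
        if (memcmp(mem + off, "RSD PTR ", 8) == 0 &&
            acpi_checksum_ok(mem + off, 20))
            return (long)off;
    }
    return -1;
}
```

On real hardware, mem would be a view of the physical range 0xE0000 - 0xFFFFF, and the returned offset would be added to 0xE0000 to get the address of the descriptor.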

In case we succeed in finding an ACPI 2.0 table and its ExtendedChecksum is verified, then we must use the XsdtAddress (which
always points to lower 4GB) to find the other tables. If it is ACPI 1.0 then we use the RsdtAddress.

Having found the address, we use it to locate the first ACPI table. The starting table contains pointers to all the other tables (32-bit, or
64-bit if ACPI 2.0+) after the header. This physical address is above 1MB and hence it is only accessible from protected (or
unreal) mode. There are many ACPI tables but we are only interested in a few of them.

All of them have the following header:

struct ACPISDTHeader
{
char Signature[4];
uint32_t Length; // fixed-width types, unsigned long is 8 bytes on some compilers
uint8_t Revision;
uint8_t Checksum;
char OEMID[6];
char OEMTableID[8];
uint32_t OEMRevision;
uint32_t CreatorID;
uint32_t CreatorRevision;
};

The first table that we will find contains the pointers to all other ACPI tables after this header. The Length member contains the
length of the entire table, including the header.

To find how many processors we have, we find the "MADT" table, a table which has the signature "APIC" in its header. After the
standard header, we have:

At offset 0x24, the Local APIC Address, which we will need later.
At offset 0x2C, the rest of the MADT table contains a sequence of variable length records which enumerate the interrupt
devices. Each record begins with 2 header bytes, one for the type and one for the length. If the type byte is 0, then the
length byte is followed by 6 bytes describing a physical CPU. The first byte is the ACPI Processor ID and the
second byte is the APIC ID of this processor.

Looping through the above table will reveal all the installed processors along with their ACPI and APIC IDs.
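
That loop could look as follows in C. This is a minimal sketch that works on a copy of the MADT in a byte buffer; the function name and parameters are invented for this example, and it only looks at the fields described above.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk the MADT records, which start at offset 0x2C, and store the APIC ID of
   every type 0 (processor) record. Returns the number of processors found,
   which may be larger than max_ids. */
size_t madt_list_cpus(const uint8_t *madt, size_t madt_len,
                      uint8_t *apic_ids, size_t max_ids)
{
    size_t count = 0;
    size_t off = 0x2C;
    while (off + 2 <= madt_len) {
        uint8_t type = madt[off];
        uint8_t len  = madt[off + 1];
        if (len < 2 || off + len > madt_len)
            break;                              /* malformed record, stop */
        if (type == 0 && len >= 8) {
            /* madt[off + 2] is the ACPI Processor ID */
            if (count < max_ids)
                apic_ids[count] = madt[off + 3];   /* APIC ID */
            count++;
        }
        off += len;
    }
    return count;
}
```

Here madt_len would be the Length member of the table header, so the walk stops exactly at the end of the table.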

Initial Startup

A CPU can communicate with another CPU by issuing an "Interprocessor Interrupt" (IPI). To prepare the APIC to manage
interrupts, we have to enable the "Spurious Interrupt Vector Register", indexed at 0xF0:

; Assuming FS is loaded with a linear 4GB segment (unreal mode)
MOV EDI,[LocalApic]
ADD EDI,0x0F0
MOV EDX,[FS:EDI]
OR EDX,0x1FF
MOV [FS:EDI],EDX

After that, we are ready to send IPIs. An IPI (Interprocessor Interrupt) is sent by using the Interrupt Command Register of the
Local APIC. This consists of two 32-bit registers, one at offset 0x300 and one at offset 0x310 (All Local APIC registers are aligned
to 16 bytes):

The register at 0x310 is the one we write first, and it contains the Local APIC ID of the processor we want to send the interrupt
to, at bits 24 - 27 (bits 24 - 31 on processors with 8-bit APIC IDs).
The register at 0x300 has the following structure:

struct R300
{
unsigned int VectorNumber:8; // Starting page for SIPI
unsigned int DestinationMode:3; // 0 normal, 1 low, 2 SMI, 4 NMI, 5 Init, 6 SIPI
unsigned int DestinationModeType:1; // 0 for physical 1 for logical
unsigned int DeliveryStatus:1; // 0 - message delivered, 1 - send pending
unsigned int R1:1;
unsigned int InitDeAssertClear:1;
unsigned int InitDeAssertSet:1;
unsigned int R2:2;
unsigned int DestinationType:2; // 0 normal, 1 send to me, 2 send to all, 3 send to all except me
unsigned int R3:12;
};

Writing to register 0x300 will actually send the IPI (that is why you must write to 0x310 first). Note that if DestinationType is
not 0, the Destination target in the register 0x310 is ignored. Under Windows, IPIs are sent with an IRQL level 29.

As we know, the CPU starts in real mode from the 0xF000:0xFFF0 position, but this is only true for the first CPU. All other CPUs stay
"asleep" until woken up, in a special state called Wait-for-SIPI. The main CPU awakes other CPUs by sending a SIPI (Startup
Inter-Processor Interrupt) which contains the startup address for that CPU. Later on, there are other Inter-processor Interrupts to
communicate between the CPUs.

To awake the processor, we send two special IPIs. The first is the "Init" IPI, DestinationMode 5, which resets the target
CPU and leaves it waiting for a SIPI. The second IPI is the SIPI, DestinationMode 6, which starts the CPU. Because the processor starts in real mode, we have to give it a
real memory address: VectorNumber holds the page number of the starting address, which therefore
must be 4096 aligned and below 1MB.
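
As an illustration, the two register values for each of these IPIs can be computed like this in C. The function names are hypothetical; the bit positions follow the R300 structure above, with InitDeAssertClear (bit 14) set as it is for every IPI that is not an INIT de-assert.

```c
#include <stdint.h>

/* Value for the register at 0x310: target APIC ID in the top byte. */
uint32_t icr_high(uint8_t apic_id) { return (uint32_t)apic_id << 24; }

/* Value for the register at 0x300 for an INIT IPI (DestinationMode 5). */
uint32_t icr_low_init(void) { return (5u << 8) | (1u << 14); }

/* Value for the register at 0x300 for a SIPI (DestinationMode 6).
   VectorNumber is the start address divided by 4096. */
uint32_t icr_low_sipi(uint32_t start_addr)
{
    return (6u << 8) | (1u << 14) | ((start_addr >> 12) & 0xFFu);
}
```

To start the CPU with APIC ID 2 at real mode address 0x8000, one would write 0x02000000 to 0x310 and then 0x4500 to 0x300 for the INIT, followed by the same 0x310 value and 0x4608 for the SIPI.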

Later Communication

Apart from INIT and SIPI, which we saw above, the local APIC can be used to send a normal interrupt, i.e., merely
executing INT XX in the context of the target CPU. We have to take into consideration the following:

If the CPU is in HLT state, the interrupt awakes it, and when the interrupt returns the CPU resumes with the instruction after
the HLT opcode. If interrupts were also disabled with CLI, then we must send an NMI (DestinationMode 4 in the Interrupt Command Register) to wake
the CPU.
If the CPU is in HLT state and we send again an INIT and a SIPI, the CPU starts all over again from real mode.
The interrupt must exist in the target processor. For example, in protected mode, the interrupt must have been defined
in IDT.
The Local APIC is common to all CPUS (memorywise), therefore, we must lock for write access (mutex) before we can
issue the interrupt.
Because the registers cannot be passed from CPU to CPU, we have to write all the registers (that will be used for the
interrupt, if any) in a separated memory area.
The interrupt might fail, so, you have to rely on some inter-cpu communication (via shared memory and mutexes) to verify
the delivery.
Finally, the handler of the interrupt must tell its own Local APIC that there is an "End of Interrupt". In the past this was done for the
legacy interrupt controller with out 020h,al. Now we write the value 0 to the EOI register (LocalApic + 0xB0).

Synchronization

Since the CPUS share the same memory, it is crucial to synchronize write and read accesses to critical parts of it. In Windows of
course we have mutexes ready to be used, but here some extra work has to be done. We can create our own mutex variable as
follows:

Initialization, put a byte to value 0xFF


Lock mutex, decrease its value
Unlock mutex, increase its value unless already 0xFF
Wait for a mutex, but not lock it: A simple loop.

; assuming edi has the address


.Loop1:
CMP byte [edi],0xff
JZ .OutLoop1
pause
JMP .Loop1
.OutLoop1:

Note the pause opcode (encoded as rep nop). This is a hint to the CPU that we are inside a spin loop, which improves
performance and reduces power use while waiting.

Our problem is to wait for a mutex, then grab it when it is free (similar to WaitForSingleObject()). This code is not going to work:

.Loop1:
CMP byte [edi],0xff
JZ .OutLoop1
pause
JMP .Loop1
.OutLoop1:
.MutexIsFree:
DEC byte [edi]

The reason is that, after the JZ instruction has verified that the mutex is free and before the DEC is executed,
another CPU might grab the mutex (race condition).

Fortunately for us, the CPU provides a LOCK CMPXCHG opcode which atomically grabs the lock for us:

.Loop1:
CMP byte [edi],0xff ; spin until the mutex looks free
JZ .OutLoop1
pause
JMP .Loop1
.OutLoop1:
; Lock is free, can we grab it?
MOV BL,0xFE ; value to store if we win
MOV AL,0xFF ; value we expect to find
LOCK CMPXCHG [EDI],BL
JNZ .Loop1 ; Write failed, another CPU got there first
.OutLoop2: ; Lock acquired

We use the CMPXCHG instruction which, along with the LOCK prefix, atomically tests whether [edi] is still 0xFF (the value in AL)
and, if so, writes BL to it and sets the ZF. If another CPU has grabbed the mutex in the meantime, the ZF is cleared, BL is not
written to [edi], and AL receives the current value of [edi].
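
The same instruction can serve the unlock step ("increase its value unless already 0xFF"). The following is only a sketch under the same assumption as above (edi holds the address of the mutex byte); the label names are mine.

```asm
.Unlock:
    mov al, [edi]               ; AL = value we expect to find
.UnlockRetry:
    cmp al, 0xFF
    je .UnlockDone              ; already free, nothing to do
    mov bl, al
    inc bl                      ; BL = value we want to store
    lock cmpxchg [edi], bl      ; store BL only if [edi] still equals AL
    jnz .UnlockRetry            ; lost the race: AL now holds the fresh value
.UnlockDone:
```

The retry loop matters because CMPXCHG reloads AL with the current byte on failure, so each attempt works with an up to date value.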

Virtualization

Virtualization, technically, is a "system" inside the system. It's a clone of the processor running inside the same processor. It is
not very complex to set up and it greatly enhances computing, since you are able to run another OS inside an existing OS.

Each CPU (called host) can run one virtual machine (called guest) at a time. You can configure multiple guests per CPU and
pause/resume each guest, much like multitasking. Of course, if you have 8 CPU cores, you can have 8 guests running
simultaneously.

The lifecycle of VM operations is as follows:

Test if the CPU supports virtualization:

mov eax,1
cpuid
bt ecx,5
jc VMX_Supported
jmp VMX_NotSupported

Check CPU-specific revision from the IA32_VMX_BASIC register:

mov ecx, 0480h
rdmsr

This 64-bit register contains important information for our project:

Bits 0 - 30: 31-bit VMX Revision Number (bit 31 is always 0)
Bits 32 - 44: Number of bytes (up to 4096) which we will need to allocate later.
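
A minimal sketch of extracting both values follows. VmxRevision and VmxRegionSize are hypothetical dword variables; rdmsr returns the high half of the register in EDX and the low half in EAX.

```asm
    mov ecx, 0x480
    rdmsr                       ; EDX:EAX = IA32_VMX_BASIC
    mov [VmxRevision], eax      ; low dword: revision number
    and edx, 0x1FFF             ; bits 32-44 of the MSR are bits 0-12 of EDX
    mov [VmxRegionSize], edx    ; bytes needed per VMXON/VMCS region
```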

Enable VMX operations

mov rax,cr4
bts rax,13
mov cr4,rax

Configure a VMXON structure. This is a 4096-byte-aligned, CPU-specific array and its size must be the number we got from
the IA32_VMX_BASIC register. A VMXON structure contains:

4 bytes which hold the revision number
4 bytes that are used for VMX Abort data (we will check that later).
Execute the VMXON command
For each guest, configure a VMCS. A VMCS is a 4096-byte-aligned, CPU-specific array which we need to allocate for each guest,
and its size must be the number we got from the IA32_VMX_BASIC register. To initialize a VMCS we use the VMCLEAR opcode, and
to load it for configuration we use the VMPTRLD opcode. To read or write into the VMCS we use VMREAD and VMWRITE. A VMCS
contains:

4 bytes which hold the revision number
4 bytes that are used for VMX Abort data (we will check that later),
The rest of the bytes are used by VMCS groups (we will check that later).

Configure the memory available to the guests.
Launch a guest with VMLAUNCH.
Guest returns (exits) to the host on specific conditions.
Host uses VMRESUME to re-enter a guest that has exited. There is no instruction to pause a guest; a guest stays paused for as
long as the host does not resume it.
When the guest terminates, host uses VMXOFF to turn off VMX operations.
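
The instruction sequence of this lifecycle can be sketched as below, for 64-bit mode with CR4 bit 13 already set. Everything named here is an assumption for illustration: VmxonPhys and VmcsPhys are qword variables (declared with dq) holding the physical addresses of two zeroed, 4096-byte-aligned regions that are identity mapped, VmxRevision is a dword holding the low 32 bits of IA32_VMX_BASIC, and .Failed is a hypothetical error label.

```asm
    mov rdi, [VmxonPhys]
    mov eax, [VmxRevision]
    mov [rdi], eax              ; revision number in the first 4 bytes
    vmxon [VmxonPhys]           ; operand is a 64-bit pointer stored in memory
    jbe .Failed                 ; CF or ZF set means the instruction failed

    mov rdi, [VmcsPhys]
    mov eax, [VmxRevision]
    mov [rdi], eax
    vmclear [VmcsPhys]          ; initialize the VMCS
    jbe .Failed
    vmptrld [VmcsPhys]          ; make it the current VMCS
    jbe .Failed

    ; ... fill in the VMCS groups with vmwrite ...

    vmlaunch                    ; on success, execution continues in the guest
    jmp .Failed                 ; reached only if vmlaunch itself failed
```

All VMX instructions report failure through CF (no valid current VMCS) or ZF (an error number is available in the VMCS), which is why a single jbe covers both cases.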

VMCS Groups
The rest of the VMCS (that is, after the first 8 bytes: revision + VMX Abort) is divided into 6 subgroups:

Guest State
Host State
Non-root (execution) controls
VMExit controls
VMEntry controls
VMExit information

Each of the above groups contains important information. We will look at them one by one. To mark a VMCS for further
reading/writing with VMREAD or VMWRITE, you would first initialize its first 4 bytes to the revision (as with the VMXON structure
above), execute a VMCLEAR with its address, and then execute a VMPTRLD with its address.

Appendix H of the 3B Intel Manual has a list of all indices. For example, the index of the RIP of the guest is 0x681e. To write the
value 0 to that field, we would use:

mov rax,0681eh
mov rbx,0
vmwrite rax,rbx
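
Reading a field back works the same way with vmread, whose first operand is the destination. In this small sketch the label .VmreadFailed is hypothetical:

```asm
    mov rax, 0x681E         ; index of the guest RIP
    vmread rbx, rax         ; rbx = current value of the field
    jbe .VmreadFailed       ; CF or ZF set: no current VMCS or unsupported index
```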

Not all features are present in all processors. We must check the VMX MSRs for available features before using them.
Intel's 3B Appendix G contains all these MSRs. To read an MSR, you put its number in ECX and execute the rdmsr opcode. The
result is returned in EDX:EAX (high 32 bits in EDX, low 32 bits in EAX).

IA32_VMX_BASIC (0x480): Basic VMX information including revision, VMCS size, memory types and others.
IA32_VMX_PINBASED_CTLS (0x481): Allowed settings for pin-based VM execution controls.
IA32_VMX_PROCBASED_CTLS (0x482): Allowed settings for processor based VM execution controls.
IA32_VMX_PROCBASED_CTLS2 (0x48B): Allowed settings for secondary processor based VM execution controls.
IA32_VMX_EXIT_CTLS (0x483): Allowed settings for VM Exit controls.
IA32_VMX_ENTRY_CTLS (0x484): Allowed settings for VM Entry controls.
IA32_VMX_MISC (0x485): Allowed settings for miscellaneous data, such as RDTSC options, unrestricted guest
availability, activity state and others.
IA32_VMX_CR0_FIXED0 (0x486) and IA32_VMX_CR0_FIXED1 (0x487): Indicate which bits of CR0 are fixed in VMX
operation. A bit set in FIXED0 must be 1 in CR0; a bit clear in FIXED1 must be 0.
IA32_VMX_CR4_FIXED0 (0x488) and IA32_VMX_CR4_FIXED1 (0x489): Same for CR4.
IA32_VMX_VMCS_ENUM (0x48A): enumerator helper for VMCS.
IA32_VMX_EPT_VPID_CAP (0x48C): provides information for capabilities regarding VPIDs and EPT.
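
As an illustration of how an "allowed settings" MSR is typically consumed, here is a sketch for the pin-based controls. In these MSRs the low dword reports the bits that must be 1 and the high dword the bits that may be 1. DesiredPinCtls is a hypothetical dword variable, and 0x4000 is, to my knowledge, the VMCS index of the pin-based controls; verify it against the index list in the Intel manual.

```asm
    mov ecx, 0x481              ; IA32_VMX_PINBASED_CTLS
    rdmsr                       ; EAX = bits that must be 1, EDX = bits that may be 1
    mov ebx, [DesiredPinCtls]   ; hypothetical: the settings we would like
    or  ebx, eax                ; force the mandatory bits on
    and ebx, edx                ; drop the bits this CPU does not support
    mov eax, 0x4000             ; assumed VMCS index of the pin-based controls
    vmwrite rax, rbx
```

The same or/and pattern applies to the processor-based, exit and entry control MSRs.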

The Host State

This contains the following information (in parentheses, the size in bits):

CR0, CR3, CR4, RSP, RIP (64 each)
CS, SS, DS, ES, FS, GS, TR selectors (16 each)
FS,GS,TR,GDTR,IDTR base addresses (64 each)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
*IA32_PERF_GLOBAL_CTRL (64)
*IA32_PAT (64)
*IA32_EFER (64)
The host state tells the CPU how to return to the host after the guest exits. After a successful VMLAUNCH or
VMRESUME (if the instruction fails, execution continues with the instruction after it), the host is paused until the guest exits.
When the guest exits, the host state is reloaded from this VMCS group.
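
A partial sketch of filling this group from the currently running host follows (64-bit mode, with a current VMCS loaded). The indices are the ones I know from the Intel index list (0x6C00 CR0, 0x6C02 CR3, 0x6C04 CR4, 0x6C14 RSP, 0x6C16 RIP) and should be checked there; VmExitHandler is a hypothetical label where the host continues after every exit.

```asm
    mov rbx, cr0
    mov eax, 0x6C00             ; host CR0
    vmwrite rax, rbx
    mov rbx, cr3
    mov eax, 0x6C02             ; host CR3
    vmwrite rax, rbx
    mov rbx, cr4
    mov eax, 0x6C04             ; host CR4
    vmwrite rax, rbx
    mov rbx, rsp
    mov eax, 0x6C14             ; host RSP: stack to use after the exit
    vmwrite rax, rbx
    lea rbx, [VmExitHandler]    ; hypothetical exit handler
    mov eax, 0x6C16             ; host RIP: where execution resumes after the exit
    vmwrite rax, rbx
```

The selectors, base addresses and MSR fields of the list above are written the same way.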

The Guest State

This contains the following information (in parentheses, the size in bits):

CR0, CR3, CR4, DR7, RSP, RIP, RFLAGS (64 each)

For each of CS, SS, DS, ES, FS, GS, LDTR, TR:

Selector (16)
Base address (64)
Segment limit (32)
Access rights (32)

For GDTR and IDTR:

Base address (64)
Limit (32)

IA32_DEBUGCTL (64)
IA32_SYSENTER_CS (32)
IA32_SYSENTER_ESP (64)
IA32_SYSENTER_EIP (64)
IA32_PERF_GLOBAL_CTRL (64)
IA32_PAT (64)
IA32_EFER (64)
SMBASE (32)
Activity State (32) - 0 Active, 1 Inactive (HLT executed), 2 Shutdown (triple fault occurred), 3 Waiting for startup IPI (SIPI).
Interruptibility state (32) - a state that defines some features that should be blocked in the VM.
Pending debug exceptions (64) - to facilitate hardware breakpoints with DR7.
VMCS Link pointer (64) - reserved, set to 0xFFFFFFFFFFFFFFFF.
VMX Preemption timer value (32)
Page Directory pointer table entries (4x64), pointers to pages.

This group defines how the guest will start. The guest can be started in two modes:

Paged 32 bit protected mode.
Real mode (unrestricted guest), if the CPU supports it.

Starting a guest in protected mode still allows the guest to switch to long mode later. If a guest expects a real mode start but
unrestricted guest is not available, then you can start it in VM86 mode.
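
A few of the guest state fields can be written as in the sketch below (64-bit host, current VMCS loaded). The indices are the ones I know from the Intel index list (0x2800 VMCS link pointer, 0x4826 activity state, 0x6820 RFLAGS) and should be verified there; the values are only plausible starting points, not a complete guest setup.

```asm
    mov eax, 0x2800             ; VMCS link pointer
    mov rbx, -1                 ; 0xFFFFFFFFFFFFFFFF
    vmwrite rax, rbx
    mov eax, 0x4826             ; activity state
    xor ebx, ebx                ; 0 = active
    vmwrite rax, rbx
    mov eax, 0x6820             ; guest RFLAGS
    mov ebx, 2                  ; bit 1 of the flags register is always set
    vmwrite rax, rbx
```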

The Execution Control Fields

These fields configure what the guest is allowed to execute and what it is not. Everything that is not allowed causes a
VMExit. The sections are:

Pin-Based (32b): Interrupts

Processor-Based (2x32b)

Primary: Single Step, TSC, HLT, INVLPG, MWAIT, CR3, CR8, DR0, I/O Bitmaps
Secondary: EPT, Descriptor Table Change, Unrestricted Guest and others

Exception bitmap (32b): One bit for each exception. If the bit is 1, the exception causes a VMExit.
I/O bitmap addresses (2x64b): Control when IN/OUT cause a VMExit.
Time Stamp Counter offset
CR0/CR4 guest/host masks
CR3 Targets
APIC Access
MSR Bitmaps

For example, you can configure it so that an exception is delivered to the host instead of being handled in the guest. Similarly,
you might disallow GDT changes, control register changes etc.
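
For instance, to make every general protection fault (exception 13) of the guest cause a VMExit, one could set the matching bit of the exception bitmap. This sketch assumes 0x4004 as the VMCS index of the exception bitmap (check it against the index list in the Intel manual) and overwrites the whole bitmap for brevity.

```asm
    mov eax, 0x4004             ; assumed VMCS index of the exception bitmap
    mov ebx, 1 shl 13           ; bit 13 = #GP
    vmwrite rax, rbx
```

To keep bits that are already set, read the field with vmread first and OR the new bit in.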

Exit Control Fields

These fields tell the CPU what to load and what to discard in case of a VMExit:

VMExit Controls (32b)
VMExit Controls for MSRs

Entry Control Fields

These fields tell the CPU what to inject into the guest in case of a VMEntry:

VMEntry Controls (32b)
VMEntry Controls for MSRs
VMEntry Controls for event injection

Exit Information Field (Read only)

Basic information

Exit Reason (32)
Exit Qualification (64)
Guest Linear Address (64)
Guest Physical Address (64)

Vectored exit information
Event delivery exits
Instruction execution exits
Error field
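
At the start of the host's exit handler, the basic information can be read as in this sketch. I am assuming the indices 0x4402 (exit reason) and 0x6400 (exit qualification) and the exit reason number 12 for HLT; all three should be checked against the Intel manual. The label .GuestHalted is hypothetical.

```asm
    mov eax, 0x4402         ; exit reason
    vmread rbx, rax
    and ebx, 0xFFFF         ; bits 0-15 hold the basic exit reason
    mov eax, 0x6400         ; exit qualification
    vmread rcx, rax
    cmp ebx, 12             ; 12 = the guest executed HLT
    je .GuestHalted
```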

EPT

EPT (Extended Page Tables) is a mechanism that translates guest physical addresses to host physical addresses. Its tables are
walked in the same way as in the long mode paging mechanism.
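
To make the similarity concrete, here is how a guest physical address would be split into the four table indices, exactly as a linear address is split in long mode. The sketch assumes a 4-level EPT with 4 KB pages and the guest physical address in rax.

```asm
    mov rbx, rax
    shr rbx, 39
    and ebx, 0x1FF          ; index into the top level table (bits 39-47)
    mov rcx, rax
    shr rcx, 30
    and ecx, 0x1FF          ; index into the second level (bits 30-38)
    mov rdx, rax
    shr rdx, 21
    and edx, 0x1FF          ; index into the third level (bits 21-29)
    mov rsi, rax
    shr rsi, 12
    and esi, 0x1FF          ; index into the last level (bits 12-20)
```

The low 12 bits of the address remain the offset inside the 4 KB page.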

Manual Exits

A guest that knows it is a guest might want to deliberately exchange information with its host. For this reason, the VMCALL
instruction is provided to manually trigger an exit.
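
A sketch of both sides follows. The request code, the GuestRequest dword variable and the labels are hypothetical. The indices 0x4402 (exit reason), 0x440C (length of the instruction that caused the exit) and 0x681E (guest RIP), as well as the exit reason number 18 for VMCALL, are the values I know from the Intel manual and should be verified there.

```asm
; guest side
    mov eax, 0x1234             ; hypothetical request code agreed with the host
    vmcall

; host side, at the start of the exit handler (64-bit mode)
    mov [GuestRequest], eax     ; general registers still hold the guest's values
    mov eax, 0x4402             ; exit reason
    vmread rbx, rax
    cmp bx, 18                  ; 18 = VMCALL
    jne .OtherExit
    mov eax, 0x440C             ; length of the exiting instruction
    vmread rcx, rax
    mov eax, 0x681E             ; guest RIP
    vmread rbx, rax
    add rbx, rcx                ; skip the vmcall, otherwise the guest repeats it
    vmwrite rax, rbx
    vmresume
```

The general purpose registers are not part of the VMCS, so the host has to save and restore them itself around the exit.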

DMMI

DPMI works, but a long mode driver is also needed. Therefore I have decided to create a TSR service, included in the github
project. I've called it DOS Multicore Mode Interface. It is a driver which helps you develop 32 and 64 bit applications for DOS,
using int 0xF0. This interrupt is accessible from real, protected and long mode. Put the function number in AH.

To check for existence, check the vector for INT 0xF0. It should not be pointing to 0 or to an IRET, and ES:BX+2 should point to
a dword 'dmmi'.
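
From real mode this check might look like the sketch below. It assumes that ES:BX is the vector as returned by DOS function 0x35 of INT 0x21, and that the signature is stored with 'd' as its first byte; the label .NoDmmi is hypothetical.

```asm
    mov ax, 0x35F0              ; DOS: get the vector of INT 0xF0
    int 0x21                    ; returns the vector in ES:BX
    mov ax, es
    or  ax, bx
    jz  .NoDmmi                 ; the vector points to 0
    cmp byte [es:bx], 0xCF      ; 0xCF is the IRET opcode
    je  .NoDmmi
    cmp dword [es:bx+2], 'dmmi' ; assumed byte order of the signature
    jne .NoDmmi
```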

Int 0xF0 provides the following functions to all modes (real, protected, long):
AH = 0, verify existence. Return values: AX = 0xFACE if the driver exists, DL = total CPUs.
AH = 1, begin thread. BL is the CPU index (1 to max-1). The function creates a thread, depending on AL:

0, begin (un)real mode thread. ES:DX = new thread seg:ofs. The thread is run with FS capable of unreal mode
addressing, must use RETF to return.
1, begin 32 bit protected mode thread. EDX is the linear address of the thread. The thread must return with RETF.
2, begin 64 bit long mode thread. EDX holds the linear address of the code to start in 64-bit long mode. The thread
must terminate with RET.
3, begin virtualized thread. BH contains the virtualization mode (currently only mode 2 = protected mode
virtualization is supported), and EDX the virtualized linear stack. The thread must return with RETF or VMCALL.

AH = 4, execute real mode interrupt. AL is the interrupt number, BP holds the AX value and BX,CX,DX,SI,DI are
passed to the interrupt. DS and ES are loaded from the high 16 bits of ESI and EDI.
AH = 5, mutex functions.
AL = 0 => initialize mutex at ES:DI (real), EDI linear (protected), RDI linear (long).
AL = 1 => Lock mutex
AL = 2 => Unlock mutex
AL = 3 => Wait for mutex
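
Putting functions 0 and 1 together, a real mode client could start a thread on the second CPU roughly as follows. This is a sketch only: it assumes a .COM style program where the thread code lives in the current code segment, Thread1 and the other labels are hypothetical, and no assumption is made about which registers the driver preserves.

```asm
    mov ah, 0                   ; verify existence
    int 0xF0
    cmp ax, 0xFACE
    jne .NoDmmi
    cmp dl, 2                   ; DL = total CPUs
    jb  .SingleCpu

    push cs
    pop es
    mov dx, Thread1             ; ES:DX = seg:ofs of the new thread
    mov bl, 1                   ; CPU index 1
    mov ax, 0x0100              ; AH = 1 begin thread, AL = 0 (un)real mode
    int 0xF0
    ; ... the main thread continues here ...

Thread1:
    ; ... work executed by CPU 1 ...
    retf                        ; real mode threads return with RETF
```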

Now, if you have more than one CPU, your DOS applications/games can directly access the entire 64-bit address space and all
your CPUs, while still being able to call DOS directly. In order to avoid calling int 0xF0 directly from assembly and to make the
driver compatible with higher level languages, an INT 0x21 redirection handler is installed. If you call INT 0x21 from the
main thread, INT 0x21 is executed directly. If you call INT 0x21 from a protected or long mode thread, then INT
0xF0 function AX = 0x0421 is executed automatically.

The project
The full github project includes many functions discussed in this article. It's arranged with 4 filters: 16 bit code, 32 bit code, data,
DMMI client and configuration files.

The fact that you made it to this end means that you are truly decisive. Have fun and good luck!

References
http://www.fysnet.net/emsinfo.htm, EMS info
http://www.ctyme.com/rbrown.htm, Ralf Brown Interrupt List
http://bochs.sourceforge.net, Bochs
https://github.com/Himmele/My-Blog-Repository/blob/master/Operating%20Systems/Build%20Your%20Own%20OS/Protected%20Mode%20Tutorial.txt,
Till Gerken PM Tutorial
https://wiki.osdev.org/Context_Switching, Task Switching
http://www.sudleyplace.com/dpmione/dpmispec1.0.pdf, DPMI specification
http://www.delorie.com/djgpp/doc/dpmi/, DJGPP DPMI examples
http://www.sudleyplace.com/swat/, 386SWAT protected mode debugger
http://dos32a.narechk.net/index_en.html, DOS32A DPMI extender

License
This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author


Michael Chourdakis
Engineer
Greece

I'm working in C++, PHP, Java, Windows, iOS and Android.

I have a PhD in Digital Signal Processing and Artificial Intelligence and I specialize in Pro Audio and AI applications.

My home page: http://www.michaelchourdakis.com
