Introduction to x86 Architecture
Gadi Haber
Software & Services Group
Early Intel microprocessors
• Intel 8080 (1974)
– 64K addressable RAM
– 8-bit registers
– CP/M operating system
• Intel 8086/8088 (1978)
– IBM-PC used 8088
– 1 MB addressable RAM
– 16-bit registers
– 16-bit data bus (8-bit for 8088)
– separate floating-point unit (8087)
– 5, 6, 8, 10 MHz
– 29K transistors
– used in low-cost microcontrollers now
The IBM-AT
• Intel 80286 (1982)
– 16 MB addressable RAM
– Protected memory
– several times faster than 8086
– introduced IDE bus architecture
– 80287 floating point unit
– Up to 20MHz
– 134K transistors
Intel IA-32 Family
• Intel386 (1985)
– 4 GB addressable RAM
– 32-bit registers
– paging (virtual memory)
– Up to 33MHz
• Intel486 (1989)
– instruction pipelining
– Integrated FPU
– 8K cache
• Pentium (1993)
– Superscalar (two parallel pipelines)
Intel P6 Family
• Pentium Pro (1995)
– advanced optimization techniques in microcode
– More pipeline stages
– On-board L2 cache
• Pentium II (1997)
– MMX (multimedia) instruction set
– Up to 450MHz
• Pentium III (1999)
– Streaming SIMD Extensions (SSE) instructions
– Up to 1+GHz
• Pentium 4 (2000)
– NetBurst micro-architecture (the successor to P6), tuned for multimedia
– Up to 3.8GHz
• Pentium D (2005, Dual core)
IA32 Processors
• Totally Dominate Computer Market
• Evolutionary Design
– Starting in 1978 with 8086
– Added more features as time goes on
– Still support old features, although obsolete
• Complex Instruction Set Computer (CISC)
– Many different instructions with many different formats
• But, only small subset encountered with Linux programs
– Hard to match performance of Reduced Instruction Set Computers (RISC)
– But, Intel has done just that!
CPU pipe stages in in-order vs. out-of-order CPUs
In-order CPU pipe:
pipe1: fetch -> decode -> exe -> writeback
pipe2: fetch -> decode -> exe -> writeback
pipe3: fetch -> decode -> exe -> writeback
Out-of-order CPU (exe is out-of-order, writeback is in-order):
pipe1: fetch -> decode -> exe ------------------------------------> writeback
pipe2: fetch -> decode ----------> exe (depends on prev instr) -> writeback
pipe3: fetch -> decode -> exe ------------------------------------> writeback
Writeback is done at the same clock cycle for ALL instructions in the same window
Copyright © 2014, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Optimization Notice
Branch prediction and speculation
In out-of-order CPUs
• Waiting for a conditional branch to resolve is very expensive in performance
• To solve this:
• The Branch Unit predicts the branch direction based on history statistics
• Then, the predicted path is executed SPECULATIVELY
• If the branch prediction was correct → perform writeback
• If the branch prediction was incorrect (branch misprediction) → flush the speculative data, flush the pipe, and start again from the right IP
• Eliminating write-after-write dependencies using physical micro-arch registers:
• EAX = EBX → mapped to R378 = EBX
• EAX = EBX + ECX → mapped to R365 = EBX + ECX → in writeback: R365 is mapped to EAX
X86 Protection levels
x86 CPU modes
• Real-address Mode - “This mode implements the programming environment of the
Intel 8086 processor with extensions (such as the ability to switch to protected or
system management mode). The processor is placed in real-address mode
following power-up or a reset.”
– DOS ran in real mode.
– No virtual memory, no privilege rings, 16-bit mode
• Protected Mode - “This mode is the native state of the processor. Among the
capabilities of protected mode is the ability to directly execute ‘Real-address mode’
8086 software in a protected, multi-tasking environment. This feature is called
virtual-8086 mode, although it is not actually a processor mode. Virtual-8086 mode
is actually a protected mode attribute that can be enabled for any task.”
– Virtual-8086 exists only for backwards compatibility; note that Intel does not consider it a mode of its own.
– Protected mode adds support for virtual memory and privilege rings.
– Modern 32-bit OSes operate in protected mode; 64-bit OSes use the IA-32e (long) mode that builds on it
• System Management Mode - “This mode provides an operating system or
executive with a transparent mechanism for implementing platform-specific
functions such as power management and system security. The processor enters
SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received
from the advanced programmable interrupt controller (APIC).”
32bit mode - Segmentation
• In 32bit mode registers can contain up to 32 bits
• “Segmentation provides a mechanism for dividing the processor’s addressable memory space (called the linear address space) into smaller protected address spaces called segments.”
• In 32bit mode there are 6 special segment registers (CS, DS, SS, ES, FS, GS)
64bit Mode
• No segmentation in practice (segment bases are treated as zero, except for FS and GS)
• Used by 64-bit applications under a 64-bit OS
• Architectural support for up to 64 bits of linear address
• Physical address support of up to 52 bits
• Can use flat address space with a single code, data and stack space
• A new RIP-relative data addressing mode.
Assembly Syntax
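There are two syntaxes for x86 assembly
• Intel Syntax
<instruction> <dst>, <src1>, <src2>
e.g. mov eax, ebx (copies ebx into eax)
• AT&T syntax, used by the GNU Assembler (GAS)
<instruction> <src1>, <src2>, <dst>
e.g. mov %ebx, %eax (copies ebx into eax)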
64bit Registers Layout
[Figure: 64bit registers layout, with the 32bit part marked]
MOV and Addressing modes
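• x86 does not have separate load and store instructions. It has mov, which copies data between registers, from memory to a register (load), and from a register to memory (store)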
Application Memory Types
1. Static Data
• Allocated by the compiler during compile time.
• User can allocate static data using the “static” reserved C/C++ word for data of a known fixed size.
• The actual memory allocation is done by the OS loader - when the executable is loaded into memory a special data
section (created by the compiler) is mapped into memory
2. Code:
• Considered as Static Data that is Read Only (R/O) and Executable.
• Allocated by the compiler and mapped into R/O/Exec memory at load time
3. Heap
• Allocated at runtime with the malloc() C library function, for data whose size is known only at runtime; malloc() obtains the memory from the OS through system calls such as brk or mmap.
• The C library maintains the heap and releases the data when the program calls free()
4. Shared Memory
• Memory formed at runtime by the OS using the shmget/shmat/mmap system calls.
• Used for sharing data between processes.
5. Stack
• Dynamic data reserved for automatic local variables and arguments of a function.
• User can allocate stack data by simply declaring automatic variables inside the scope of a function.
• The Stack is maintained by the compiler, which generates the code for allocating and de-allocating the local variables by using the special registers rsp and rbp
6. Read-Only Data
• Constants can be placed by the compiler in a separate r/o section in the binary file or embedded with the code
• User creates r/o data by using constant values or constant strings in the program.
Calling convention for x86
The 64-bit x86 calling convention that is implemented by the compiler for C on Linux (System V AMD64 ABI) has the following rules:
• The return value is stored in RAX.
• The registers RDI, RSI, RDX, RCX, R8, and R9 are used, in that order, for integer and memory address arguments.
• For system calls, R10 is used instead of RCX.
• Additional arguments are passed on the stack.
• The arguments in C are assigned to the registers from left to right.
• The caller cleans the stack after the function call returns; the callee does not remove the arguments from the stack.
Call Stack frame (64bit)
Consider the following C source code (compiled without optimization flags):
int callee(int i, int j, int k) {
    return i+j+k;
}
int caller(void) {
    return callee(1, 2, 3) + 5;
}
The C code is compiled into the following code:
0000000000000000 <callee>:
    push %rbp              // save rbp
    mov %rsp,%rbp          // rbp <- rsp
    mov %edi,-0x4(%rbp)    // edi (arg 1, i) -> stack location
    mov %esi,-0x8(%rbp)    // esi (arg 2, j) -> stack location
    mov %edx,-0xc(%rbp)    // edx (arg 3, k) -> stack location
    mov -0x4(%rbp),%eax
    add -0x8(%rbp),%eax
    add -0xc(%rbp),%eax    // eax holds the ret value
    pop %rbp               // restore rbp
    retq
0000000000000020 <caller>:
    push %rbp              // save rbp
    mov %rsp,%rbp          // rbp <- rsp
    mov $0x1,%edi          // pass argument 1 via edi
    mov $0x2,%esi          // pass argument 2 via esi
    mov $0x3,%edx          // pass argument 3 via edx
    callq 0 <callee>
    add $0x5,%eax          // eax holds ret value of callee, plus 5
    pop %rbp               // restore rbp
    retq
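Fundamental Data Types
• Byte: bits 7-0, at address N
• Word: bits 15-0; High Byte (bits 15-8) at address N+1, Low Byte (bits 7-0) at address N
• Doubleword: bits 31-0; High Word (bits 31-16) at address N+2, Low Word (bits 15-0) at address N
• Quadword: bits 63-0; High Doubleword (bits 63-32) at address N+4, Low Doubleword (bits 31-0) at address N
• Double Quadword: bits 127-0; High Quadword (bits 127-64) at address N+8, Low Quadword (bits 63-0) at address N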
Memory Layout
Little Endian - LSB in lower addresses
Address  Contents
FH       D7H
EH       12H
DH       7AH
CH       FEH
BH       06H
AH       36H
9H       1FH
8H       A4H
7H       23H
6H       0BH
5H       45H
4H       67H
3H       74H
2H       CBH
1H       31H
0H       12H
• Byte at Address 9H contains 1FH
• Word at Address BH contains FE06H
• Word at Address 6H contains 230BH
• Word at Address 2H contains 74CBH
• Word at Address 1H contains CB31H
• Doubleword at Address AH contains 7AFE0636H
• Quadword at Address 6H contains 7AFE06361FA4230BH
• Double Quadword at Address 0H contains D7127AFE06361FA4230B456774CB3112H
Memory Layout – Encoding Example
• Assume we want to encode a direct jump to itself, e.g., at address 0x700: jmp 0x700
• The relative direct jump to the same location is in fact jmp -5 (offset -5): the instruction is 5 bytes long and the offset is counted from the address of the next instruction (0x705)
• Byte representation: e9 fb ff ff ff
• [Table: all jump instruction opcodes in x86]
• Encoding the instruction in C:
option1:
unsigned char mem[5];
mem[0] = 0xe9; mem[1] = 0xfb;
mem[2] = 0xff; mem[3] = 0xff;
mem[4] = 0xff;
option2 (64-bit integer stored in little-endian order):
uint64_t mem = 0x000000fffffffbe9;
Numeric Data Types - Integers
• Byte Unsigned Integer: bits 7-0, range 0 to 255
• Word Unsigned Integer: bits 15-0, range 0 to 65,535
• Doubleword Unsigned Integer: bits 31-0, range 0 to 4,294,967,295
• Quadword Unsigned Integer: bits 63-0, range 0 to 2^64 - 1
• Byte Signed Integer: sign in bit 7, range -128 to +127
• Word Signed Integer: sign in bit 15, range -32,768 to +32,767
• Doubleword Signed Integer: sign in bit 31, range -2^31 to +2^31 - 1
• Quadword Signed Integer: sign in bit 63, range -2^63 to +2^63 - 1
Numeric Data Types – Floating Point
• Single Precision Floating Point: sign in bit 31, exponent in bits 30-23, fraction in bits 22-0
Approximate Normalized Range: 1.18 × 10^-38 to 3.40 × 10^38
• Double Precision Floating Point: sign in bit 63, exponent in bits 62-52, fraction in bits 51-0
Approximate Normalized Range: 2.23 × 10^-308 to 1.79 × 10^308
• Double Extended Precision Floating Point (80 bits)
Approximate Normalized Range: 3.37 × 10^-4932 to 1.18 × 10^4932
Assembly snapshot
1e8803a4: 48 8b 7c 24 00 mov 0x00(%rsp),%rdi
1e8803a9: 48 8b 74 24 08 mov 0x08(%rsp),%rsi
1e8803ae: 48 8b 6c 24 10 mov 0x10(%rsp),%rbp
1e8803b3: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
1e8803b8: 48 8b 54 24 28 mov 0x28(%rsp),%rdx
1e8803bd: 48 8b 4c 24 30 mov 0x30(%rsp),%rcx
1e8803c2: 48 8b 44 24 38 mov 0x38(%rsp),%rax
1e8803ef: 48 8b 64 24 18 mov 0x18(%rsp),%rsp
1e8803f4: ff 25 12 ff ff ff jmpq *-238(%rip) # 1e88030c <arch_state+0x30c> // RIP-relative addressing instruction