
Advanced Dynamic Tool Building

The document outlines a tutorial on building dynamic instrumentation tools with DynamoRIO. The tutorial agenda includes an introduction to DynamoRIO's history and internals, examples of DynamoRIO's use, and its API. DynamoRIO is a dynamic binary instrumentation platform that allows efficient, transparent, and customizable instrumentation of applications at runtime. It uses techniques like a software code cache and basic block linking to achieve near-native performance during instrumentation.


Building Dynamic Instrumentation Tools with DynamoRIO

Saman Amarasinghe
Derek Bruening
Qin Zhao
Tutorial Outline

• 1:30-1:40 Welcome + DynamoRIO History
• 1:40-2:40 DynamoRIO Internals
• 2:40-3:00 Examples, Part 1
• 3:00-3:15 Break
• 3:15-4:15 DynamoRIO API
• 4:15-5:15 Examples, Part 2
• 5:15-5:30 Feedback

DynamoRIO Tutorial at CGO 24 April 2010


DynamoRIO History

• Dynamo
– HP Labs: PA-RISC late 1990s
– x86 Dynamo: 2000
• RIO → DynamoRIO
– MIT: 2001-2004
• Prior releases
– 0.9.1: Jun 2002 (PLDI tutorial)
– 0.9.2: Oct 2002 (ASPLOS tutorial)
– 0.9.3: Mar 2003 (CGO tutorial)
– 0.9.4: Feb 2005

– 0.9.5: Apr 2008 (CGO tutorial)
– 1.0 (0.9.6): Sep 2008 (GoVirtual.org launch)
• Determina
– 2003-2007
– Security company
• VMware
– Acquired Determina (and DynamoRIO) in 2007
• Open-source BSD license
– Feb 2009: 1.3.1 release
– Dec 2009: 1.5.0 release
– Apr 2010: 2.0.0 release

DynamoRIO Internals
Typical Modern Application: IIS

Runtime Interposition Layer

Design Goals

• Efficient
– Near-native performance
• Transparent
– Match native behavior
• Comprehensive
– Control every instruction, in any application
• Customizable
– Adapt to satisfy disparate tool needs

Challenges of Real-World Apps

• Multiple threads
– Synchronization
• Application introspection
– Reading of return address
• Transparency corner cases are the norm
– Example: access beyond top of stack
• Scalability
– Must adapt to varying code sizes, thread counts, etc.
• Dynamically generated code
– Performance challenges

Internals Outline

• Efficient
– Software code cache overview
– Thread-shared code cache
– Cache capacity limits
– Data structures
• Transparent
• Comprehensive
• Customizable

Direct Code Modification

e9 37 6f 48 92 jmp <callout>

Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

Debugger Trap Too Expensive

cc int3 (breakpoint)

Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

Variable-Length Instruction Complications

e9 37 6f 48 92 jmp <callout>

Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

Entry Point Complications

e9 37 6f 48 92 jmp <callout>

Kernel32!TerminateProcess:
7d4d1028 7c 05 jl 7d4d102f
7d4d102a 33 c0 xor %eax,%eax
7d4d102c 40 inc %eax
7d4d102d eb 08 jmp 7d4d1037
7d4d102f 50 push %eax
7d4d1030 e8 ed 7c 00 00 call 7d4d8d22

Direct Code Modification: Too Limited

• Not transparent
– Cannot write jump atomically if crosses cache line
– Even if write is atomic, not safe if overwrites part of next
instruction
– Jump may span code entry point
• Too limited
– Not safe w/o suspending all threads and knowing all entry points
– Limited to inserting callouts
• Code displaced by jump is a mini code cache
– All the same consistency challenges of larger cache
• Inter-operation issues with other hooks
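
The hazards listed above can be checked mechanically. The following is a minimal sketch, not DynamoRIO code: the helper names and the 64-byte cache line are assumptions, and the target address used in the usage note is simply what the displayed bytes e9 37 6f 48 92 decode to if the jump is assumed to sit at 7d4d1028.

```c
#include <stdbool.h>
#include <stdint.h>

#define JMP_REL32_LEN 5u /* opcode e9 plus a 4-byte displacement */
#define CACHE_LINE 64u   /* assumed cache-line size in bytes */

/* Encode "jmp rel32" located at hook_addr so that it lands on target. */
void encode_jmp_rel32(uint32_t hook_addr, uint32_t target, uint8_t out[5])
{
    /* The displacement is measured from the end of the jump instruction. */
    uint32_t rel = target - (hook_addr + JMP_REL32_LEN);
    out[0] = 0xe9;
    out[1] = (uint8_t)(rel & 0xff);
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
}

/* True if the 5 jump bytes straddle a cache-line boundary, in which case
 * they cannot be written with one atomic store. */
bool jmp_crosses_cache_line(uint32_t hook_addr)
{
    return hook_addr / CACHE_LINE != (hook_addr + JMP_REL32_LEN - 1) / CACHE_LINE;
}

/* True if a known code entry point falls strictly inside the overwritten
 * bytes, so a branch to it would land in the middle of the jump. */
bool jmp_spans_entry_point(uint32_t hook_addr, uint32_t entry)
{
    return entry > hook_addr && entry < hook_addr + JMP_REL32_LEN;
}
```

With the listing from the slides, a jump at 7d4d1028 does not cross a 64-byte line, but it overwrites jl, xor, and inc, so any branch to 7d4d102a or 7d4d102c would land inside it.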

We Need Indirection

• Avoid transparency and granularity limitations of directly modifying application code
• Allow arbitrary modifications at unrestricted points in code stream
• Allow systematic, fine-grained modifications to code stream
• Guarantee that all code is observed

Basic Interpreter

START → fetch → decode → execute

Slowdown: ~300x

Improvement #1: Interpreter + Basic Block Cache

[Diagram: START, basic block builder, dispatch, context switch, and BASIC BLOCK CACHE holding non-control-flow instructions]

Non-control-flow instructions executed from software code cache

Slowdown: 300x → 25x
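
As a rough illustration of the basic block cache, here is a toy model, not DynamoRIO's implementation: blocks are integer tags, and toy_runtime_t, toy_run, and the succ table are hypothetical names. Each block is built once, but every block still returns to dispatch after it runs.

```c
#include <stdbool.h>

#define MAX_BLOCKS 16

typedef struct {
    const int *succ;         /* succ[tag] is the next block's tag, or -1 to stop */
    bool cached[MAX_BLOCKS]; /* true once a block has been copied into the cache */
    unsigned builds;         /* times the basic block builder ran */
    unsigned dispatches;     /* times control returned to dispatch */
    unsigned executed;       /* blocks executed from the cache */
} toy_runtime_t;

/* Run up to max_steps blocks, starting at start_tag. */
void toy_run(toy_runtime_t *rt, int start_tag, unsigned max_steps)
{
    int tag = start_tag;
    while (tag >= 0 && tag < MAX_BLOCKS && rt->executed < max_steps) {
        rt->dispatches++;          /* context switch back to dispatch */
        if (!rt->cached[tag]) {    /* first visit: build the block once */
            rt->cached[tag] = true;
            rt->builds++;
        }
        rt->executed++;            /* "execute" the cached copy */
        tag = rt->succ[tag];       /* the block's exit names the next tag */
    }
}
```

For a three-block loop run for 300 steps, the builder runs 3 times while dispatch runs 300 times, which is the cost that linking removes.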

Improvement #1: Interpreter + Basic Block Cache

[Diagram: the same flow, with application blocks B, C, and D shown at the basic block builder and blocks A, B, D in the basic block cache]

Example Basic Block Fragment

Original application code:
    add %eax, %ecx
    cmp $4, %eax
    jle $0x40106f

Code cache fragment:
    frag7:  add %eax, %ecx
            cmp $4, %eax
            jle <stub0>
            jmp <stub1>
    stub0:  mov %eax, eax-slot
            mov &dstub0, %eax
            jmp context_switch
    stub1:  mov %eax, eax-slot
            mov &dstub1, %eax
            jmp context_switch

Stub data:
    dstub0  target: 0x40106f
    dstub1  target: fall-thru

Improvement #2: Linking Direct Branches

[Diagram: START, basic block builder, dispatch, context switch, and BASIC BLOCK CACHE holding non-control-flow instructions]

Direct branch to existing block can bypass dispatch

Slowdown: 300x → 25x → 3x

Improvement #2: Linking Direct Branches

[Diagram: the same flow, with application blocks B, C, and D shown at the basic block builder and blocks A, B, D in the basic block cache, now connected by direct links]

Direct Linking

Original application code:
    add %eax, %ecx
    cmp $4, %eax
    jle $0x40106f

Code cache fragment (taken exit now linked to frag8):
    frag7:  add %eax, %ecx
            cmp $4, %eax
            jle <frag8>
            jmp <stub1>
    stub0:  mov %eax, eax-slot
            mov &dstub0, %eax
            jmp context_switch
    stub1:  mov %eax, eax-slot
            mov &dstub1, %eax
            jmp context_switch

Stub data:
    dstub0  target: 0x40106f
    dstub1  target: fall-thru
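
Direct linking can be sketched in the same toy style. This is an assumed simplification, not DynamoRIO's data structures: fragment_t, linked_cache_t, and linked_run are hypothetical names, and each fragment has a single direct exit. The first time an exit is taken it goes through dispatch, which then patches the exit to point at the target fragment.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_FRAGS 16

typedef struct fragment {
    bool built;
    int target_tag;        /* application target of the fragment's direct exit */
    struct fragment *link; /* non-NULL once the exit is patched to its target */
} fragment_t;

typedef struct {
    fragment_t frag[MAX_FRAGS];
    unsigned dispatches;   /* trips through the exit stub and dispatch */
} linked_cache_t;

/* Run `steps` blocks starting at `start`; succ[tag] gives each block's
 * direct-branch target. All tags must lie in [0, MAX_FRAGS). */
void linked_run(linked_cache_t *c, const int *succ, int start, unsigned steps)
{
    fragment_t *cur = NULL;
    int tag = start;
    for (unsigned i = 0; i < steps; i++) {
        fragment_t *next;
        if (cur != NULL && cur->link != NULL) {
            next = cur->link;      /* linked exit: fragment-to-fragment jump */
        } else {
            c->dispatches++;       /* unlinked exit: stub, context switch, dispatch */
            next = &c->frag[tag];
            if (!next->built) {
                next->built = true;
                next->target_tag = succ[tag];
            }
            if (cur != NULL)
                cur->link = next;  /* patch the exit so later runs skip dispatch */
        }
        cur = next;
        tag = cur->target_tag;
    }
}
```

For the same three-block loop run for 300 steps, dispatch is entered only 4 times: once per block to build it, and once more to link the loop's back edge.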

Improvement #3: Linking Indirect Branches

[Diagram: START, basic block builder, dispatch, context switch, and BASIC BLOCK CACHE holding non-control-flow instructions and an indirect branch lookup]

Application address mapped to code cache

Slowdown: 300x → 25x → 3x → 1.2x

Indirect Branch Transformation

Original application code:
    ret

Code cache fragment:
    frag8:      mov %ecx, ecx-slot
                pop %ecx
                jmp <ib_lookup>

    ib_lookup:  ...
                ...
                ...
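
The body of ib_lookup is elided on the slide. One plausible shape for such a lookup is a small open-addressing hashtable from application address to code cache address. The sketch below is an assumption for illustration only: the table size, the hash, and all names (ibl_table_t, ibl_add, ibl_lookup) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define IBL_SIZE 64u /* assumed table size; must be a power of two */

typedef struct {
    uintptr_t app_pc;   /* application address; 0 marks an empty slot */
    uintptr_t cache_pc; /* address of the matching fragment in the code cache */
} ibl_entry_t;

typedef struct {
    ibl_entry_t slot[IBL_SIZE];
} ibl_table_t;

size_t ibl_hash(uintptr_t app_pc)
{
    return (size_t)(app_pc & (IBL_SIZE - 1));
}

/* Insert or update a mapping; returns 0 on success, -1 if the table is full. */
int ibl_add(ibl_table_t *t, uintptr_t app_pc, uintptr_t cache_pc)
{
    size_t i = ibl_hash(app_pc);
    for (size_t probes = 0; probes < IBL_SIZE; probes++) {
        ibl_entry_t *e = &t->slot[i];
        if (e->app_pc == 0 || e->app_pc == app_pc) {
            e->app_pc = app_pc;
            e->cache_pc = cache_pc;
            return 0;
        }
        i = (i + 1) & (IBL_SIZE - 1); /* linear probing */
    }
    return -1;
}

/* Return the code cache address for app_pc, or 0 on a miss (in which case
 * a real system would fall back to dispatch). */
uintptr_t ibl_lookup(const ibl_table_t *t, uintptr_t app_pc)
{
    size_t i = ibl_hash(app_pc);
    for (size_t probes = 0; probes < IBL_SIZE; probes++) {
        const ibl_entry_t *e = &t->slot[i];
        if (e->app_pc == app_pc)
            return e->cache_pc;
        if (e->app_pc == 0)
            return 0;
        i = (i + 1) & (IBL_SIZE - 1);
    }
    return 0;
}
```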

Improvement #4: Trace Building
[Figure: a basic block cache holding blocks A through L, next to a trace cache in which sequences of those blocks are laid out as traces]

• Traces reduce branching, improve layout and locality, and facilitate optimizations across blocks
– We avoid indirect branch lookup
• Next Executing Tail (NET) trace building scheme [Duesterwald 2000]
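
A minimal sketch of the NET idea follows, under stated assumptions: block tags stand in for addresses (so a branch to an equal or lower tag counts as backward), the hot threshold of 50 is arbitrary, only one trace is recorded, and net_state_t and net_observe are hypothetical names. Targets of backward branches are treated as trace heads with counters; once a head becomes hot, the blocks that execute next are recorded as the trace until another backward branch is taken.

```c
#include <stdbool.h>

#define NET_HOT_THRESHOLD 50 /* assumed value */
#define MAX_TAGS 16
#define MAX_TRACE_LEN 8

typedef struct {
    unsigned counter[MAX_TAGS]; /* execution counts for trace heads */
    int trace[MAX_TRACE_LEN];   /* tags of the recorded trace, in order */
    int trace_len;
    bool recording;
    bool done;
} net_state_t;

/* Call each time block `tag` is about to run after block `prev`
 * (prev is -1 at program start). Tags must lie in [0, MAX_TAGS). */
void net_observe(net_state_t *s, int prev, int tag)
{
    bool backward = prev >= 0 && tag <= prev; /* backward branch target */
    if (s->done)
        return;
    if (s->recording) {
        /* The trace ends at the next backward branch or at the size cap. */
        if (backward || s->trace_len == MAX_TRACE_LEN) {
            s->recording = false;
            s->done = true;
            return;
        }
        s->trace[s->trace_len++] = tag;
        return;
    }
    if (backward && ++s->counter[tag] >= NET_HOT_THRESHOLD) {
        s->recording = true; /* head is hot: record the next executing tail */
        s->trace_len = 0;
        s->trace[s->trace_len++] = tag;
    }
}
```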

Incremental NET Trace Building
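[Figure: three snapshots of incremental NET trace building. The basic block cache holds G, J, and K; the trace in the trace cache grows from G, to G K, to G K J]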

Improvement #4: Trace Building

[Diagram: START, basic block builder, trace selector, dispatch, and context switch; BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup) and TRACE CACHE (non-control-flow instructions, "indirect branch stays on trace?" check)]

Slowdown: 300x → 25x → 3x → 1.2x → 1.1x

Base Performance

[Figure: base performance chart covering SPEC CPU2000, server, and desktop benchmarks]


Sources of Overhead

• Extra instructions
– Indirect branch target comparisons
– Indirect branch hashtable lookups
• Extra data cache pressure
– Indirect branch hashtable
• Branch mispredictions
– ret becomes jmp*
• Application code modification

Time Breakdown for SPECINT

[Diagram: START, basic block builder, trace selector, dispatch (< 1%), and context switch, above the BASIC BLOCK CACHE (non-control-flow instructions, indirect branch lookup) and the TRACE CACHE (non-control-flow instructions, "indirect branch stays on trace?" check)]

Time shares as labeled left to right beneath the caches: ~0%, ~4%, ~94%, ~2%
Not An Ordinary Application

• An application executing in DynamoRIO’s code cache looks different from what the underlying hardware has been tuned for
• The hardware expects:
– Little or no dynamic code modification
• Writes to code are expensive
– call and ret instructions
• Return Stack Buffer predictor

Performance Counter Data

Threading Model
Running Program

Thread1 Thread2 Thread3 … ThreadN

Code Caching Runtime System

Thread1 Thread2 Thread3 … ThreadN

Operating System

Thread1 Thread2 Thread3 … ThreadN

Code Space
Running Program

Thread Thread Thread Thread

Thread-Private Code Caches Thread-Shared Code Cache

Thread Thread Thread Thread

Operating System

Thread1 Thread2 Thread1 Thread2

Thread-Private versus Thread-Shared

• Thread-private
– Less synchronization needed
– Absolute addressing for thread-local storage
– Thread-specific optimization and instrumentation
• Thread-shared
– Scales to many-threaded apps

Database and Web Server Suite

Benchmark   Server                                   Processes
ab low      IIS low isolation                        inetinfo.exe
ab med      IIS medium isolation                     inetinfo.exe, dllhost.exe
guest low   IIS low isolation, SQL Server 2000       inetinfo.exe, sqlservr.exe
guest med   IIS medium isolation, SQL Server 2000    inetinfo.exe, dllhost.exe, sqlservr.exe

Memory Impact

[Figure: memory impact chart for the ab med, guest low, and guest med benchmarks]

Performance Impact

Scalability Limit

Added Memory Breakdown

Code Expansion

[Pie chart of code cache contents]
original code: 66%
exit stubs: 19%
net jumps: 8%
indirect branch target handling: 7%

Cache Capacity Challenges

• How to set an upper limit on the cache size
– Different applications have different working sets and different total code sizes
• Which fragments to evict when that limit is reached
– Without expensive profiling or extensive fragmentation

Adaptive Sizing Algorithm

• Enlarge cache if warranted by percentage of new fragments that are regenerated
• Target working set of application: don’t enlarge for once-only code
• Low-overhead, incremental, and reactive
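
The sizing rule can be sketched as a periodic check on how many newly added fragments had been evicted earlier. Everything concrete here is an assumption chosen for illustration: the check period of 100 fragments, the 10% threshold, the doubling step, and the names cache_sizer_t and sizer_on_new_fragment.

```c
#include <stdbool.h>

#define CHECK_PERIOD 100 /* assumed: re-evaluate after every 100 new fragments */
#define REGEN_PERCENT 10 /* assumed: grow if at least 10% were regenerated */

typedef struct {
    unsigned capacity;    /* current cache limit, in fragments */
    unsigned added;       /* new fragments since the last check */
    unsigned regenerated; /* of those, how many had been evicted before */
} cache_sizer_t;

/* Record one newly added fragment; returns true if the cache was enlarged. */
bool sizer_on_new_fragment(cache_sizer_t *s, bool was_evicted_before)
{
    bool grow;
    s->added++;
    if (was_evicted_before)
        s->regenerated++;
    if (s->added < CHECK_PERIOD)
        return false;
    /* Many regenerated fragments mean the working set does not fit. */
    grow = s->regenerated * 100 >= s->added * REGEN_PERCENT;
    if (grow)
        s->capacity *= 2;
    s->added = 0;
    s->regenerated = 0;
    return grow;
}
```

Code that runs once and is never regenerated leaves the limit unchanged, which matches the goal of tracking the working set and not the total code size.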

Cache Capacity Settings

• Thread-private:
– Working set size matching is on by default
– Client may see blocks or traces being deleted in the absence of
any cache consistency event
– Can disable capacity management via
• -no_finite_bb_cache
• -no_finite_trace_cache
• Thread-shared:
– Set to infinite size by default
– Can enable capacity management via
• -finite_shared_bb_cache
• -finite_shared_trace_cache

Two Modes of Code Cache Operation

• Fine-grained scheme
– Supports individual code fragment unlink and removal
– Separate data structure per code fragment and each of its exits,
memory regions spanned, and incoming links
• Coarse-grained scheme
– No individual code fragment control
– Permanent intra-cache links
– No per-fragment data structures at all
– Treat entire cache as a unit for consistency

Data Structures

• Fine-grained scheme
– Data structures are highly tuned and compact
• Coarse-grained scheme
– There are no data structures
– Savings on applications with large amounts of code are typically
15%-25% of committed memory and 5%-15% of working set

Status in Current Release

• Fine-grained scheme
– Current default
• Coarse-grained scheme
– Select with -opt_memory runtime option
– Possible performance hit on certain benchmarks
– In the future will be the default option
– Required for persisted and process-shared caches

Adaptive Level of Granularity

• Start with coarse-grain caches
– Plus freezing and sharing/persisting
• Switch to fine-grain for individual modules or sub-regions of
modules after significant consistency events, to avoid
expensive entire-module flushes
– Support simultaneous fine-grain fragments within coarse-grain
regions for corner cases
• Match amount of bookkeeping to amount of code change
– Majority of application code does not need fine-grain

Many Varieties of Code Caches

• Coarse-grained versus fine-grained
• Thread-shared versus thread-private
• Basic blocks versus traces

Internals Outline

• Efficient
• Transparent
– Rules of transparency
– Cache consistency
– Synchronization
• Comprehensive
• Customizable

Transparency

• Do not want to interfere with the semantics of the program
• Dangerous to make any assumptions about:
– Register usage
– Calling conventions
– Stack layout
– Memory/heap usage
– I/O and other system call use

Painful, But Necessary

• Difficult and costly to handle corner cases
• Many applications will not notice…
• …but some will!
– Microsoft Office: Visual Basic generated code, stack convention
violations
– COM, Star Office, MMC: trampolines
– Adobe Premiere: self-modifying code
– VirtualDub: UPX-packed executable
– etc.

Rule 1: Avoid Resource Conflicts

• DynamoRIO system code executes at arbitrary points during application execution
• If DynamoRIO uses the same library routine as the
application, it may call that routine in the middle of the same
routine being called by the application
• Most library routines are not re-entrant!
– Many are thread-safe, but that does not help us

Rule 1: Avoid Resource Conflicts

[Figure: two panels, labeled Linux and Windows]
Rule 2: If It’s Not Broken, Don’t Change It

• Threads
• Executable on disk
• Application data
– Including the stack!

Example Transparency Violation

[Figure: chart covering SPEC CPU2000, server, and desktop benchmarks, with ten entries labeled "Error"]
Rule 3: If You Change It, Emulate Original Behavior’s Visible Effects

• Application addresses
• Address space
• Error transparency
• Code cache consistency

Code Change Mechanisms

[Diagram: for each architecture, an I-cache and a D-cache that each hold A, B, C, D]

RISC:
    Store B
    Flush B
    Jump B
x86:
    Store B
    Jump B

How Often Does Code Change?

• Not just modification of code!
• Removal of code
– Shared library unloading
• Replacement of code
– JIT region re-use
– Trampoline on stack

Code Change Events
             Memory        Generated       Modified
             Unmappings    Code Regions    Code Regions
SPECFP       112           0               0
SPECINT      29            0               0
SPECJVM      7             3373            4591
Excel        144           21              20
Photoshop    1168          40              0
Powerpoint   367           28              33
Word         345           20              6

Detecting Code Removal

• Example: shared library being unloaded
• Requires explicit request by application to operating system
• Detect by monitoring system calls (munmap, NtUnmapViewOfSection)

Detecting Code Modification

• On x86, no explicit app request required, as the x86 icache is kept consistent in hardware, so any memory write could modify code!

[Diagram: I-cache and D-cache, each holding A, B, C, D; "Store B" is followed directly by "Jump B"]

Page Protection Plus Instrumentation

• Invariant: application code copied to code cache must be read-only
– If writable, hide read-only status from application
• Some code cannot or should not be made read-only
– Self-modifying code
– Windows stack
– Code on a page with frequently written data
• Use per-fragment instrumentation to ensure code is not stale
on entry and to catch self-modification

Adaptive Consistency Algorithm

• Use page protection by default
– Most code regions are always read-only
• Subdivide written-to regions to reduce flushing cost of write-execute cycle
– Large read-only regions, small written-to regions
• Switch to instrumentation if write-execute cycle repeats too often (or on same page)
– Switch back to page protection if writes decrease

Bruening et al. “Maintaining Consistency and Bounding Capacity of Software Code Caches” CGO’05

Synchronization Transparency

• Application thread management should not interfere with the runtime system, and vice versa
– Cannot allow the app to suspend a thread holding a runtime
system lock
– Runtime system cannot use app locks

Code Cache Invariant

• App thread suspension requires safe spots where no runtime system locks are held
• Time spent in the code cache can be unbounded
→ Our invariant: no runtime system lock can be held while
executing in the code cache

Internals Outline

• Efficient
• Transparent
• Comprehensive
• Customizable

Above the Operating System

Kernel-Mediated Control Transfers
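[Diagram: timeline across user mode and kernel mode. When a message is pending, the kernel saves the user context and user mode runs the message handler; when no message is pending, the context is restored. Annotation: the majority of executed code in a typical Windows application.]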

Intercepting Linux Signals
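[Diagram: timeline across user mode and kernel mode. DynamoRIO registers its own signal handler. When a signal is pending, the kernel saves the user context and control reaches the DynamoRIO handler and the application's signal handler; when no signal is pending, the context is restored.]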

Windows Messages
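[Diagram: timeline across user mode and kernel mode. When a message is pending, the kernel saves the user context and user mode runs the dispatcher and then the message handler; when no message is pending, the context is restored.]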

Intercepting Windows Messages
[Diagram: the same timeline, with the annotation "modify shared library memory image" pointing at the dispatcher, followed by the dispatcher and the message handler.]

Must Monitor System Calls

• To maintain control:
– Calls that affect the flow of control: register signal handler,
create thread, set thread context, etc.
• To maintain transparency:
– Queries of modified state app should not see
• To maintain cache consistency:
– Calls that affect the address space
• To support cache eviction:
– Interruptible system calls must be redirected

Operating System Dependencies

• System calls and their numbers
– Monitor application’s usage, as well as for our own resource
management
– Windows changes the numbers each major release
• Details of kernel-mediated control flow
– Must emulate how kernel delivers events
• Initial injection
– Once in, follow child processes

Clients

• The engine exports an API for building a client
• System details abstracted away: client focuses on manipulating the code stream

Client Events

[Diagram: the DynamoRIO flow (START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE, TRACE CACHE) with three "client" boxes attached]

Examples: Part I
DynamoRIO Examples Part I Outline

• Common Steps of writing a DynamoRIO client

• Dynamic Instruction Counting Example

Common Steps

• Step 1: Register Events
– DR_EXPORT void dr_init(client_id_t id)

Register Function                 Events
dr_register_bb_event              Basic Block Building
dr_register_thread_init_event     Thread Initialization
dr_register_exit_event            Process Exit

• Step 2: Implementation
– Initialization
– Finalization
– Instrumentation
• Step 3: Optimization
– Optimize the instrumentation to improve the performance

A Simplified View of DynamoRIO

START basic block builder

dispatch

context switch

BASIC BLOCK
CACHE

Step 1: Register Events

#include "dr_api.h"

uint num_dyn_instrs;

static void event_init(void);
static void event_exit(void);
static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *ilist,
                                         bool for_trace, bool translating);

DR_EXPORT void dr_init(client_id_t id) {
    /* register events */
    dr_register_bb_event(event_basic_block);
    dr_register_exit_event(event_exit);
    /* process initialization event */
    event_init();
}

Step 2: Implementation (I)

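/* helpers defined in Step 2 (II) */
static int ilist_num_instrs(instrlist_t *ilist);
static void insert_count_code(void *drcontext, instrlist_t *ilist, int num_instrs);

static void event_init(void) {
    num_dyn_instrs = 0;
}

static void event_exit(void) {
    dr_printf("Total number of instructions executed: %u\n", num_dyn_instrs);
}

static dr_emit_flags_t event_basic_block(void *drcontext, void *tag, instrlist_t *ilist,
                                         bool for_trace, bool translating) {
    int num_instrs;
    /* count the instructions in this block once, when the block is built */
    num_instrs = ilist_num_instrs(ilist);
    /* insert code that adds that count to the total each time the block runs */
    insert_count_code(drcontext, ilist, num_instrs);
    return DR_EMIT_DEFAULT;
}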

Step 2: Implementation (II)

static int ilist_num_instrs(instrlist_t *ilist) {
    instr_t *instr;
    int num_instrs = 0;
    /* iterate over instruction list to count number of instructions */
    for (instr = instrlist_first(ilist); instr != NULL; instr = instr_get_next(instr))
        num_instrs++;
    return num_instrs;
}

/* called from the code cache every time the instrumented block executes */
static void do_ins_count(int num_instrs) { num_dyn_instrs += num_instrs; }

static void insert_count_code(void *drcontext, instrlist_t *ilist, int num_instrs) {
    /* clean call: application state is saved and restored around the callee */
    dr_insert_clean_call(drcontext, ilist, instrlist_first(ilist),
                         (void *)do_ins_count, false /* do not save fp state */, 1,
                         OPND_CREATE_INT32(num_instrs));
}

Instrumented Basic Block

# switch stack
# switch aflags and errno
# save all registers
# call do_ins_count
push $0x00000003
call $0xb7ef73e4 (do_ins_count)
# restore registers
# switch aflags and errno back
# switch stack back
# application code
add $0x0000e574 %ebx -> %ebx
test %al $0x08
jz $0xb80e8a98
Step 3: Optimization (I): counter update inlining

static void insert_count_code(void *drcontext, instrlist_t *ilist, int num_instrs) {
    instr_t *instr, *where;
    opnd_t opnd1, opnd2;

    where = instrlist_first(ilist);
    /* save aflags */
    dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    /* num_dyn_instrs += num_instrs (the counter is a 32-bit uint, hence OPSZ_4) */
    opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_4);
    opnd2 = OPND_CREATE_INT32(num_instrs);
    instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
    /* meta instruction: instrumentation, not application code */
    instrlist_meta_preinsert(ilist, where, instr);
    /* restore aflags */
    dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
}

Instrumented Basic Block

mov %eax -> %fs:0x0c
lahf -> %ah
seto -> %al
add $0x00000003, 0xb7d25030
add $0x7f %al -> %al
sahf %ah
mov %fs:0x0c -> %eax
# application code
add $0x0000e574 %ebx -> %ebx
test %al $0x08
jz $0xb7f14a98

Step 3: Optimization (II): aflags stealing

static void insert_count_code(void *drcontext, instrlist_t *ilist, int num_instrs) {
    instr_t *instr, *where;
    opnd_t opnd1, opnd2;
    bool save_aflags;

    where = instrlist_first(ilist);
    /* aflags_analysis is not shown on the slides; it reports whether the
     * application's arithmetic flags are live at the insertion point */
    save_aflags = aflags_analysis(ilist);
    /* save aflags */
    if (save_aflags)
        dr_save_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
    /* num_dyn_instrs += num_instrs (the counter is a 32-bit uint, hence OPSZ_4) */
    opnd1 = OPND_CREATE_ABSMEM(&num_dyn_instrs, OPSZ_4);
    opnd2 = OPND_CREATE_INT32(num_instrs);
    instr = INSTR_CREATE_add(drcontext, opnd1, opnd2);
    instrlist_meta_preinsert(ilist, where, instr);
    /* restore aflags */
    if (save_aflags)
        dr_restore_arith_flags(drcontext, ilist, where, SPILL_SLOT_1);
}

Instrumented Basic Block

add $0x00000003, 0xb7d25030
# application code
add $0x0000e574 %ebx -> %ebx
test %al $0x08
jz $0xb7f14a98
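
The slides do not show the flags analysis itself. One way it might work is a forward scan of the block: the flags are dead if the application overwrites all of them before reading any. The sketch below runs on a toy instruction model, not on DynamoRIO's instr_t; toy_instr_t and aflags_live_at_entry are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy instruction model: only its effect on the arithmetic flags matters. */
typedef struct {
    bool reads_aflags;      /* e.g. jz, adc */
    bool writes_all_aflags; /* e.g. add, test */
} toy_instr_t;

/* Returns true if the application's flags may be live at block entry, in
 * which case instrumentation that clobbers them must save and restore them. */
bool aflags_live_at_entry(const toy_instr_t *instrs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (instrs[i].reads_aflags)
            return true;  /* the app reads flags we would have clobbered */
        if (instrs[i].writes_all_aflags)
            return false; /* the app overwrites every flag before any read */
    }
    return true; /* nothing known past the block end: be conservative */
}
```

For the block above (add, test, jz), the first application instruction writes all arithmetic flags, so the scan reports them dead and the save and restore can be dropped.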

Step 3: Optimization (III): more optimizations

• Using lea (load effective address) instead of add
lea [%reg, num_instr] -> %reg
• Register liveness analysis
– Using dead register to avoid register save/restore for lea
• Global aflags/registers analysis
– Analyze aflags/registers liveness over CFG
• Trace Optimization
– Trace: single-entry multi-exit
– Update counters only at trace exits

Other Issues

• Data race on counter update in multithreaded programs
– Global lock for every update
– Atomic update (lock prefixed add)
• LOCK(instr);
– Thread private counter
• Thread-private code cache: different variable at different address
• Thread-shared code cache: thread local storage
• 32-bit counter overflow
– 64-bit counter:
• Two instructions on 32-bit architecture: add, adc
– One 32-bit local counter and one 64-bit global counter
• Instrument to update 32-bit local counter
• Update 64-bit global counter using timer interrupt
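
The two overflow strategies can be illustrated in plain C. This is only a sketch with hypothetical type and function names: the first part mimics what an add followed by adc does to a 64-bit counter held as two 32-bit halves, and the second part shows a cheap 32-bit local counter that is periodically folded into a 64-bit total.

```c
#include <stdint.h>

/* A 64-bit counter stored as two 32-bit halves, as on a 32-bit machine. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} counter64_t;

void counter64_add(counter64_t *c, uint32_t n)
{
    uint32_t old = c->lo;
    c->lo += n;             /* add: low half, wraps modulo 2^32 */
    c->hi += (c->lo < old); /* adc $0: propagate the carry into the high half */
}

uint64_t counter64_value(const counter64_t *c)
{
    return ((uint64_t)c->hi << 32) | c->lo;
}

/* Alternative: instrumentation updates a 32-bit local counter, and a
 * periodic event (a timer, on the slide) folds it into a 64-bit total. */
typedef struct {
    uint32_t local;
    uint64_t global;
} split_counter_t;

void split_count(split_counter_t *c, uint32_t n)
{
    c->local += n; /* the only work done on the hot path */
}

void split_flush(split_counter_t *c)
{
    c->global += c->local; /* must run before the local counter can wrap */
    c->local = 0;
}
```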

DynamoRIO API
DynamoRIO API Outline

• Building and Deploying
• Events
• Utilities
• Instruction Manipulation
• State Translation
• Comparison with Pin
• Troubleshooting

Cross-Platform Clients

• DynamoRIO API presents a consistent interface that works across platforms
– Windows versus Linux
– 32-bit versus 64-bit
– Thread-private versus thread-shared
• Same client source code generally works on all combinations
of platforms
• Some exceptions, noted in the documentation

Building a Client

• Include DR API header file
– #include "dr_api.h"
• Set platform defines
– WINDOWS or LINUX
– X86_32 or X86_64
• Export a dr_init function
– DR_EXPORT void dr_init (client_id_t client_id)
• Build a shared library

Auto-Configure Using CMake

add_library(myclient SHARED myclient.c)
find_package(DynamoRIO)
if (NOT DynamoRIO_FOUND)
  message(FATAL_ERROR "DynamoRIO package required to build")
endif(NOT DynamoRIO_FOUND)
configure_DynamoRIO_client(myclient)

CMake

• Build system converted to CMake when open-sourced
– Switch from frozen toolchain to supporting range of tools
• CMake generates build files for native compiler of choice
– Makefiles for UNIX, nmake, etc.
– Visual Studio project files
• http://www.cmake.org/

Library Usage and Transparency

• For best transparency: completely self-contained client
– Imports only from DynamoRIO API
– -nodefaultlibs or /nodefaultlib
• On Windows:
– String and utility routines provided by forwards to ntdll
– Cl.exe /MT static copy of C/C++ libraries
– Custom loader loads private copy of client dependences
• On Linux:
– Use ld -wrap to redirect malloc calls to DR’s heap
– Older distributions shipped suitable static C/C++ lib
– Newer distros: need to build yourself
– Coming soon: custom loader for private copy of libs

DynamoRIO Extensions

• DynamoRIO API is extended via libraries called Extensions
• Both static and shared supported
• Built and packaged with DynamoRIO
• Easy for a client to use
– use_DynamoRIO_extension(myclient drsyms)
• Current Extensions:
– drsyms: symbol lookup (currently Windows-only)
– drcontainers: hashtable
• Coming soon:
– Umbra: shadow memory framework
– Your utility library or framework contribution!

Application Configuration

• File-based scheme
• Per-user local files
– $HOME/.dynamorio/ on Linux
– $USERPROFILE/dynamorio/ on Windows
• Global files
– /etc/dynamorio/ on Linux
– Registry-specified directory on Windows
• Files are lists of var=value
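
Reading one var=value line can be sketched as follows. The slides do not specify the file syntax beyond "lists of var=value", so the handling of line endings and buffer limits here is an assumption, and parse_config_line is a hypothetical helper, not part of DynamoRIO.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Split one "var=value" line into caller-provided buffers.
 * Returns false if there is no '=', the name is empty, or a buffer is too small. */
bool parse_config_line(const char *line, char *var, size_t var_sz,
                       char *val, size_t val_sz)
{
    const char *eq = strchr(line, '=');
    size_t var_len, val_len;
    if (eq == NULL || eq == line)
        return false;
    var_len = (size_t)(eq - line);
    val_len = strcspn(eq + 1, "\r\n"); /* the value ends at the line break */
    if (var_len >= var_sz || val_len >= val_sz)
        return false;
    memcpy(var, line, var_len);
    var[var_len] = '\0';
    memcpy(val, eq + 1, val_len);
    val[val_len] = '\0';
    return true;
}
```

Splitting at the first '=' lets the value itself contain further '=' characters.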

Deploying Clients

• One-step configure-and-run usage model:
– drrun <client> <options> <app cmdline>
– Uses an invisible temporary one-time configuration file
– Overrides any regular config file
• Two-step usage model giving control over children:
– drconfig -reg <appname> <client> <options>
– drinject <app cmdline>
• Systemwide injection:
– drconfig -syswide_on -reg <appname> <client> <options>
– <run app normally>

Deploying Clients On Linux

• drrun and drinject scripts: LD_PRELOAD-based


– Take over after statically-dependent shared libs but before exe
• Suid apps ignore LD_PRELOAD
– Place libdrpreload.so's full path in /etc/ld.so.preload
– Copy libdynamorio.so to /usr/lib
• In the future:
– Attach
– Early injection

Deploying Clients On Windows

• drinject and drrun injection


– Currently after all shared libs are initialized
• From-parent injection
– Early: before any shared libs are loaded
• Systemwide injection via -syswide_on
– Requires administrative privileges
– Launch app normally: no need to run via drinject/drrun
– Moderately early: during user32.dll initialization
• In the future:
– Earliest injection for drrun/drinject and from-parent

Non-Standard Deployment

• Standalone API
– Use DynamoRIO as a library of IA-32/AMD64 manipulation
routines
• Start/Stop API
– Can instrument source code to mark where DynamoRIO should control the application

Runtime Options

• Pass options to drconfig/drrun


• A large number of options; the most relevant are:
– -code_api
– -client <client lib> <client ops> <client id>
– -thread_private
– -tracedump_text and -tracedump_binary
– -prof_pcs

Runtime Options For Debugging

• Notifications:
– -stderr_mask 0xN
– -msgbox_mask 0xN
• Windows:
– -no_hide
• Debug-build-only:
– -loglevel N
– -ignore_assert_list '*'

DynamoRIO API Outline

• Building and Deploying


• Events
• Utilities
• Instruction Manipulation
• State Translation
• Comparison with Pin
• Troubleshooting

Client Events

[Figure: DynamoRIO system overview. Components: START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE (non-control-flow instructions; indirect branch lookup), TRACE CACHE (non-control-flow instructions; indirect branch stays on trace?). Three boxes are labeled "client".]

Client Events: Code Stream

• Client has opportunity to inspect and potentially modify every single application instruction, immediately before it executes
• Entire application code stream
– Basic block creation event: can modify the block
– For comprehensive instrumentation tools
• Or, focus on hot code only
– Trace creation event: can modify the trace
– Custom trace creation: can determine trace end condition
– For optimization and profiling tools

Simplifying Client View

• Several optimizations disabled


– Elision of unconditional branches
– Indirect call to direct call conversion
– Shared cache sizing
– Process-shared and persistent code caches
• Future release will give client control over optimizations

Basic Block Event

static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag,
                  instrlist_t *bb, bool for_trace,
                  bool translating) {
    instr_t *inst;
    for (inst = instrlist_first(bb);
         inst != NULL;
         inst = instr_get_next(inst)) {
        /* … */
    }
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_init(client_id_t id) {
    dr_register_bb_event(event_basic_block);
}

Trace Event

static dr_emit_flags_t
event_trace(void *drcontext, void *tag,
            instrlist_t *trace, bool translating) {
    instr_t *inst;
    for (inst = instrlist_first(trace);
         inst != NULL;
         inst = instr_get_next(inst)) {
        /* … */
    }
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_init(client_id_t id) {
    dr_register_trace_event(event_trace);
}

Client Events: Application Actions

• Application thread creation and deletion


• Application library load and unload
• Application exception (Windows)
– Client chooses whether to deliver or suppress
• Application signal (Linux)
– Client chooses whether to deliver, suppress, bypass the app
handler, or redirect control

Client Events: Application System Calls

• Application pre- and post- system call


– Platform-independent system call parameter access
– Client can modify:
• Return value in post-, or set value and skip syscall in pre-
• Call number
• Params
– Client can invoke an additional system call as the app

Client Events: Bookkeeping

• Initialization and Exit


– Entire process
– Each thread
– Child of fork (Linux-only)
• Basic block and trace deletion during cache management
• Nudge received
– Used for communication into client
• Itimer fired (Linux-only)

Multiple Clients

• It is each client's responsibility to ensure compatibility with other clients
– Instruction stream modifications made by one client are visible to
other clients
• At client registration each client is given a priority
– dr_init() called in priority order (priority 0 called first and thus
registers its callbacks first)
• Event callbacks called in reverse order of registration
– Gives precedence to first registered callback, which is given the
final opportunity to modify the instruction stream or influence
DynamoRIO's operation
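
The ordering rule above can be modeled with a toy registry: clients register in priority order, and callbacks are then invoked last-registered first, so the first registrant sees the instruction stream last. This is a standalone sketch, not DynamoRIO code; registry_t, registry_add, and registry_call_order are hypothetical names.

```c
#include <stddef.h>

#define MAX_CLIENTS 8

/* Toy model: client ids stored in the order their callbacks were registered. */
typedef struct {
    int ids[MAX_CLIENTS];
    size_t count;
} registry_t;

/* Record a registration; returns 1 on success, 0 if the registry is full. */
int registry_add(registry_t *r, int client_id)
{
    if (r->count >= MAX_CLIENTS)
        return 0;
    r->ids[r->count++] = client_id;
    return 1;
}

/* Write the ids in invocation order (reverse of registration) into
 * call_order and return how many were written. */
size_t registry_call_order(const registry_t *r, int *call_order)
{
    size_t i;
    for (i = 0; i < r->count; i++)
        call_order[i] = r->ids[r->count - 1 - i];
    return r->count;
}
```

With clients 0, 1, and 2 registering in that order, the invocation order is 2, 1, 0: the priority-0 client runs last and gets the final say.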

DynamoRIO API: General Utilities

• Transparency support
– Separate memory allocation and I/O
– Alternate stack
• Thread support
– Thread-local memory
– Simple mutexes
– Thread-private code caches, if requested

DynamoRIO API: General Utilities, Cont’d

• Communication
– Nudges: ping from external process
– File creation, reading, and writing
• Sideline support
– Create new client-only thread
– Thread-private itimer (Linux-only)

DynamoRIO API: General Utilities, Cont’d

• Application inspection
– Address space querying
– Module iterator
– Processor feature identification
– Symbol lookup (currently Windows-only)
• Third-party library support
– If transparency is maintained!
– -wrap support on Linux
– ntdll.dll link support on Windows
– Custom loader for private library copy on Windows

DynamoRIO Heap

• Three flavors:
– Thread-private: no synchronization; thread lifetime
– Global: synchronized, process lifetime
– “Non-heap”: for generated code, etc.
– No header on allocated memory: low overhead but must pass
size on free
• Leak checking
– Debug build complains at exit if memory was not deallocated
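
To illustrate the "pass size on free" interface and the exit-time leak check, here is a toy wrapper over malloc. It models only the calling convention and the bookkeeping; it is not how DynamoRIO's heap is implemented, and toy_heap_t, toy_alloc, toy_free, and toy_leak_check are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy heap: the only bookkeeping is a running byte count, which is
 * possible because the caller supplies the size again on free. */
typedef struct {
    size_t outstanding_bytes; /* allocated but not yet freed */
} toy_heap_t;

void *toy_alloc(toy_heap_t *heap, size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        heap->outstanding_bytes += size;
    return p;
}

/* The caller must pass the same size it passed to toy_alloc. */
void toy_free(toy_heap_t *heap, void *p, size_t size)
{
    free(p);
    heap->outstanding_bytes -= size;
}

/* Leak check for process exit: returns the number of bytes never freed. */
size_t toy_leak_check(const toy_heap_t *heap)
{
    return heap->outstanding_bytes;
}
```

A nonzero result from toy_leak_check at exit plays the role of the debug build's complaint about memory that was not deallocated.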

DynamoRIO API: Instruction Representation

• Full IA-32/AMD64 instruction representation


• Instruction creation with auto-implicit-operands
• Operand iteration
• Instruction lists with iteration, insertion, removal
• Decoding at various levels of detail
• Encoding

Instruction Representation

raw bytes          opcode  operands                   eflags
8d 34 01           lea     (%ecx,%eax,1) -> %esi      -
8b 46 0c           mov     0xc(%esi) -> %eax          -
2b 46 1c           sub     0x1c(%esi) %eax -> %eax    WCPAZSO
0f b7 4e 08        movzx   0x8(%esi) -> %ecx          -
c1 e1 07           shl     $0x07 %ecx -> %ecx         WCPAZSO
3b c1              cmp     %eax %ecx                  WCPAZSO
0f 8d a2 0a 00 00  jnl     $0x77f52269                RSO

Instruction Representation

raw bytes          opcode  operands                   eflags
                   lea     (%ecx,%eax,1) -> %edi      -
                   mov     0xc(%edi) -> %eax          -
                   sub     0x1c(%edi) %eax -> %eax    WCPAZSO
                   movzx   0x8(%edi) -> %ecx          -
c1 e1 07           shl     $0x07 %ecx -> %ecx         WCPAZSO
3b c1              cmp     %eax %ecx                  WCPAZSO
0f 8d a2 0a 00 00  jnl     $0x77f52269                RSO

Instruction Creation

• Method 1: use the INSTR_CREATE_opcode macros that fill in implicit operands automatically:
    instr_t *instr = INSTR_CREATE_dec(dcontext,
                                      opnd_create_reg(REG_EDX));
• Method 2: specify opcode + all operands (including implicit operands):
    instr_t *instr = instr_create(dcontext);
    instr_set_opcode(instr, OP_dec);
    instr_set_num_opnds(dcontext, instr, 1, 1);
    instr_set_dst(instr, 0, opnd_create_reg(REG_EDX));
    instr_set_src(instr, 0, opnd_create_reg(REG_EDX));

Linear Control Flow

• Both basic blocks and traces are linear
• Instruction sequences are all single-entrance, multiple-exit
• Greatly simplifies analysis algorithms

64-Bit Versus 32-Bit

• 32-bit build of DynamoRIO only handles 32-bit code


• 64-bit build of DynamoRIO decodes/encodes both 32-bit and
64-bit code
– Current release does not support executing applications that mix
the two
• IR is universal: covers both 32-bit and 64-bit
– Abstracts away underlying mode

64-Bit Thread and Instruction Modes

• When going to or from the IR, the thread mode and instruction
mode determine how instrs are interpreted
• When decoding, current thread’s mode is used
– Default is 64-bit for 64-bit DynamoRIO
– Can be changed with set_x86_mode()
• When encoding, that instruction’s mode is used
– When created, set to mode of current thread
– Can be changed with instr_set_x86_mode()

64-Bit Clients

• Define X86_64 before including header files when building a 64-bit client
• Convenience macros for printf formats, etc. are provided
– E.g.:
• printf("Pointer is "PFX"\n", p);
• Use “X” macros for cross-platform registers
– REG_XAX is REG_EAX when compiled 32-bit, and REG_RAX
when compiled 64-bit

DynamoRIO API: Code Manipulation

• Processor information
• State preservation
– Eflags, arith flags, floating-point state, MMX/SSE state
– Spill slots, TLS
• Clean calls to C code
• Dynamic instrumentation
– Replace code in the code cache
• Branch instrumentation
– Convenience routines

Processor Information

• Processor type
– proc_get_vendor(), proc_get_family(), proc_get_type(),
proc_get_model(), proc_get_stepping(), proc_get_brand_string()
• Processor features
– proc_has_feature(), proc_get_all_feature_bits()
• Cache information
– proc_get_cache_line_size(), proc_is_cache_aligned(),
proc_bump_to_end_of_cache_line(),
proc_get_containing_page()
– proc_get_L1_icache_size(), proc_get_L1_dcache_size(),
proc_get_L2_cache_size(), proc_get_cache_size_str()
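
The cache-line and page helpers listed above come down to power-of-two mask arithmetic. The standalone functions below sketch that arithmetic under the assumption that line and page sizes are powers of two; the function names are hypothetical stand-ins, and the exact semantics of the corresponding proc_* routines should be checked in the DynamoRIO documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* All sizes passed in are assumed to be powers of two. */

/* Nonzero if addr sits exactly on a cache-line boundary. */
int is_cache_aligned(uintptr_t addr, size_t line_size)
{
    return (addr & (line_size - 1)) == 0;
}

/* Round value up to the next multiple of line_size
 * (values already on a boundary are returned unchanged). */
uintptr_t round_up_to_cache_line(uintptr_t value, size_t line_size)
{
    return (value + line_size - 1) & ~(uintptr_t)(line_size - 1);
}

/* Base address of the page that contains addr. */
uintptr_t containing_page(uintptr_t addr, size_t page_size)
{
    return addr & ~(uintptr_t)(page_size - 1);
}
```

For a 64-byte line, 0x1004 is unaligned and rounds up to 0x1040; with 4096-byte pages, address 0x12345 lies in the page starting at 0x12000.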

State Preservation

• Spill slots for registers


– 3 fast slots, 6/14 slower slots
– dr_save_reg(), dr_restore_reg(), and dr_reg_spill_slot_opnd()
– from C code: dr_read_saved_reg(), dr_write_saved_reg()
• Dedicated TLS field for thread-local data
– dr_insert_read_tls_field(), dr_insert_write_tls_field()
– from C code: dr_get_tls_field(), dr_set_tls_field()
• Arithmetic flag preservation
– dr_save_arith_flags(), dr_restore_arith_flags()
• Floating-point/MMX/SSE state
– dr_insert_save_fpstate(), dr_insert_restore_fpstate()

Thread-Local Storage (TLS)

• Absolute addressing
– Thread-private only
• Application stack
– Not reliable or transparent
• Stolen register
– Performance hit
• Segment
– Best solution for thread-shared

Clean Calls

if (instr_is_mbr(instr)) {
    app_pc address = instr_get_app_pc(instr);
    uint opcode = instr_get_opcode(instr);
    instr_t *nxt = instr_get_next(instr);
    dr_insert_clean_call(drcontext, ilist, nxt, (void *) at_mbr,
                         false/*don't need to save fp state*/,
                         2 /* 2 parameters */,
                         /* opcode is 1st parameter */
                         OPND_CREATE_INT32(opcode),
                         /* address is 2nd parameter */
                         OPND_CREATE_INTPTR(address));
}

• Saved interrupted application state can be accessed using dr_get_mcontext() and modified using dr_set_mcontext()

Dynamic Instrumentation

• Thread-shared: flush all code corresponding to application address and then re-instrument when re-executed
– Can flush from clean call, and use dr_redirect_execution() since
cannot return to potentially flushed cache fragment
• Thread-private: can also replace particular fragment (does not
affect other potential copies of the source app code)
– dr_replace_fragment()

Flushing the Cache

• Immediately deleting or replacing individual code cache fragments is available for thread-private caches
– Only removes from that thread’s cache
• Two basic types of thread-shared flush:
– Non-precise: remove all entry points but let target cache code be
invalidated and freed lazily
– Precise/synchronous:
• Suspend the world
• Relocate threads inside the target cache code
• Invalidate and free the target code immediately

Flushing the Cache

• Thread-shared flush API routines:


– dr_unlink_flush_region(): non-precise flush
– dr_flush_region(): synchronous flush
– dr_delay_flush_region():
• No action until a thread exits code cache on its own
• If a completion callback is provided, synchronous once triggered
• Without a callback, non-precise

DynamoRIO API: Translation

• Translation refers to the mapping of a code cache machine state (program counter, registers, and memory) to its corresponding application state
– The program counter always needs to be translated
– Registers and memory may also need to be translated
depending on the transformations applied when copying into the
code cache

Translation Case 1: Fault
[Figure: two panels, each showing a user context and a faulting instr.]

• Exception and signal handlers are passed machine context of the faulting instruction.
• For transparency, that context must be translated from the
code cache to the original code location
• Translated location should be where the application would
have had the fault or where execution should be resumed

Translation Case 2: Relocation

• If one application thread suspends another, or DynamoRIO suspends all threads for a synchronous cache flush:
– Need suspended target thread in a safe spot
– Not always practical to wait for it to arrive at a safe spot (if in a
system call, e.g.)
• DynamoRIO forcibly relocates the thread
– Must translate its state to the proper application state at which to
resume execution

Translation Approaches

• Two approaches to program counter translation:


– Store mappings generated during fragment building
• High memory overhead (> 20% for some applications, because it
prevents internal storage optimizations) even with highly optimized
difference-based encoding. Costly for something rarely used.
– Re-create mapping on-demand from original application code
• Cache consistency guarantees mean the corresponding application
code is unchanged
• Requires idempotent code transformations
• DynamoRIO supports both approaches
– The engine mostly uses the on-demand approach, but stored
mappings are occasionally needed

Instruction Translation Field

• Each instruction contains a translation field


• Holds the application address that the instruction corresponds
to
• Set via instr_set_translation()

Context Translation Via Re-Creation

A1: mov %ebx, %ecx
A2: add %eax, (%ecx)
A3: cmp $4, (%eax)
A4: jle 710349fb

C1: mov %ebx, %ecx      D1: (A1) mov %ebx, %ecx
C2: add %eax, (%ecx)    D2: (A2) add %eax, (%ecx)
C3: cmp $4, (%eax)      D3: (A3) cmp $4, (%eax)
C4: jle <stub0>         D4: (A4) jle <stub0>
C5: jmp <stub1>         D5: (A4) jmp <stub1>
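
The re-creation walk in the example above can be sketched as a toy model: rebuild the fragment as a list of (encoded length, translation) pairs, then walk it until the cache offset of interest is covered. The types, the function name, and the lengths and addresses used in the usage note are hypothetical; DynamoRIO's actual re-creation code is more involved.

```c
#include <stddef.h>
#include <stdint.h>

/* One re-created instruction: how many bytes it occupies in the code
 * cache and which application address it translates to. */
typedef struct {
    size_t cache_len;
    uintptr_t translation;
} xl8_instr_t;

/* Walk the re-created fragment from its start and return the translation
 * of the instruction that contains cache_offset (an offset from the start
 * of the fragment). Returns 0 if the offset is past the fragment's end. */
uintptr_t translate_cache_offset(const xl8_instr_t *frag, size_t n,
                                 size_t cache_offset)
{
    size_t i, start = 0;
    for (i = 0; i < n; i++) {
        if (cache_offset < start + frag[i].cache_len)
            return frag[i].translation;
        start += frag[i].cache_len;
    }
    return 0;
}
```

If D1 through D5 are given made-up lengths of 2, 2, 3, 6, and 5 bytes and A1 through A4 sit at 0x1000, 0x1002, 0x1004, and 0x1007, then a cache offset inside either D4 or D5 translates to A4's address, matching the (A4) annotations on both exits.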

Meta vs. Non-Meta Instructions

• Non-Meta instructions are treated as application instructions


– They must have translations
– Control flow changing instructions are modified to retain
DynamoRIO control and result in cache populating
• Meta instructions are added instrumentation code
– Not treated as part of the application (e.g., calls run natively)
– Cannot fault, so translations not needed
• Meta-may-fault instructions
– Can fault, but should not be “interpreted”: won’t modify app code
– Fault typically deliberate and handled by client
• Xref instr_set_ok_to_mangle() and
instr_set_meta_may_fault()

Client Translation Support

• Instruction lists passed to clients are annotated with translation information
– Read via instr_get_translation()
– Clients are free to delete instructions, change instructions and
their translations, and add new meta and non-meta instructions
(see dr_register_bb_event() for restrictions)
– An idempotent client that restricts itself to deleting app
instructions and adding non-faulting meta instructions can ignore
translation concerns
– DynamoRIO takes care of instructions added by API routines
(dr_insert_clean_call(), etc.)
• Clients can choose between storing or regenerating
translations on a fragment by fragment basis.

Client Regenerated Translations

• Client returns DR_EMIT_DEFAULT from its bb or trace event callback
• Client bb & trace event callbacks are re-called when
translations are needed with translating==true
• Client must exactly duplicate transformations performed when
the block was generated
• Client must set translation field for all added non-meta
instructions and all meta-may-fault instructions
– This is true even if translating==false since DynamoRIO may
decide it needs to store translations anyway

Client Stored Translations

• Client returns DR_EMIT_STORE_TRANSLATIONS from its bb or trace event callback
• Client must set translation field for all added non-meta
instructions and all meta-may-fault instructions
• Client bb or trace hook will not be re-called with
translating==true

Register State Translation

• Translation may be needed at a point where some registers are spilled to memory
– During indirect branch or RIP-relative mangling, e.g.
• DynamoRIO walks fragment up to translation point, tracking
register spills and restores
– Special handling for stack pointer around indirect calls and
returns
• DynamoRIO tracks client spills and restores implicitly added
by API routines
– Clean calls, etc.
– Explicit spill/restore (e.g., dr_save_reg()) client’s responsibility
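
The fragment walk described above can be modeled in a few lines: scan the instructions that precede the translation point and remember, for each register, whether its application value currently lives in a spill slot. This is a toy model with hypothetical types and names, not DynamoRIO's implementation.

```c
#include <stddef.h>

#define NUM_REGS 4
#define NO_SLOT (-1)

typedef enum { TOY_OTHER, TOY_SPILL, TOY_RESTORE } spill_kind_t;

typedef struct {
    spill_kind_t kind;
    int reg;  /* register index, used by TOY_SPILL and TOY_RESTORE */
    int slot; /* spill slot index, used by TOY_SPILL */
} spill_instr_t;

/* Walk instructions [0, fault_idx) and report, per register, the slot
 * holding its application value, or NO_SLOT if the register itself
 * still holds it. */
void track_spills(const spill_instr_t *frag, size_t fault_idx,
                  int slot_of_reg[NUM_REGS])
{
    size_t i;
    int r;
    for (r = 0; r < NUM_REGS; r++)
        slot_of_reg[r] = NO_SLOT;
    for (i = 0; i < fault_idx; i++) {
        if (frag[i].kind == TOY_SPILL)
            slot_of_reg[frag[i].reg] = frag[i].slot;
        else if (frag[i].kind == TOY_RESTORE)
            slot_of_reg[frag[i].reg] = NO_SLOT;
    }
}
```

A translation point that falls between a spill and its restore yields a slot index for that register, so the translated context would take the register's value from the slot rather than from the live register.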

Client Register State Translation

• If a client adds its own register spilling/restoring code or changes register mappings it must register for the restore state event to correct the context
• The same event can also be used to fix up the application’s
view of memory
• DynamoRIO does not internally store this kind of translation
information ahead of time when the fragment is built
– The client must maintain its own data structures

DynamoRIO versus Pin

• Basic interface is fundamentally different


• Pin = insert callout/trampoline only
– Not so different from tools that modify the original code: Dyninst,
Vulcan, Detours
– Uses code cache only for transparency
• DynamoRIO = arbitrary code stream modifications
– Only feasible with a code cache
– Takes full advantage of power of code cache
– General IA-32/AMD64 decode/encode/IR support

DynamoRIO versus Pin

• Pin = insert callout/trampoline only


– Pin tries to inline and optimize
– Client has little control or guarantee over final performance
• DynamoRIO = arbitrary code stream modifications
– Client has full control over all inserted instrumentation
– Result can be significant performance difference
• PiPA Memory Profiler + Cache Simulator:
3.27x speedup w/ DynamoRIO vs 2.6x w/ Pin

Base Performance Comparison (No Tool)

171%

121%

Base Memory Comparison

44MB

15MB
3.5MB
2.6MB

BBCount Pin Tool
static int bbcount;

VOID PIN_FAST_ANALYSIS_CALL docount() { bbcount++; }

VOID Trace(TRACE trace, VOID *v) {
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl)) {
        BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(docount),
                       IARG_FAST_ANALYSIS_CALL, IARG_END);
    }
}

int main(int argc, CHAR *argv[]) {
    PIN_InitSymbols();
    PIN_Init(argc, argv);
    TRACE_AddInstrumentFunction(Trace, 0);
    PIN_StartProgram();
    return 0;
}

BBCount DynamoRIO Tool
static int global_count;

static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating) {
    instr_t *instr, *first = instrlist_first(bb);
    uint flags;
    /* Our inc can go anywhere, so find a spot where flags are dead.
     * Technically this can be unsafe if app reads flags on fault =>
     * stop at instr that can fault, or supply runtime op */
    for (instr = first; instr != NULL; instr = instr_get_next(instr)) {
        flags = instr_get_arith_flags(instr);
        /* OP_inc doesn't write CF but not worth distinguishing */
        if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags))
            break;
    }
    if (instr == NULL)
        dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    instrlist_meta_preinsert(bb, (instr == NULL) ? first : instr,
        INSTR_CREATE_inc(drcontext, OPND_CREATE_ABSMEM((byte *)&global_count, OPSZ_4)));
    if (instr == NULL)
        dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void dr_init(client_id_t id) {
    dr_register_bb_event(event_basic_block);
}

BBCount: Pin Inlining Importance

569%

231%

BBCount Performance Comparison

233%
226%
185%

Obtaining Help

• Read the documentation


– http://dynamorio.org/docs/
• Look at the sample clients
– In the documentation
– In the release package: samples/
• Ask on the DynamoRIO Users discussion forum/mailing list
– http://groups.google.com/group/dynamorio-users

Debugging Clients

• Use the DynamoRIO debug build for asserts


– Often point out the problem
• Use logging
– -loglevel N
– stored in logs/ subdir of DR install dir
• Attach a debugger
– gdb or windbg
– -msgbox_mask 0xN
– -no_hide
– windbg: .reload myclient.dll=0xN
• More tips:
– http://code.google.com/p/dynamorio/wiki/Debugging

Reporting Bugs

• Search the Issue Tracker off http://dynamorio.org first


– http://code.google.com/p/dynamorio/issues/list
• File a new Issue if not found
• Follow conventions on wiki
– http://code.google.com/p/dynamorio/wiki/BugReporting
– CRASH, APP CRASH, HANG, ASSERT
• Example titles:
– CRASH (1.3.1 calc.exe)
vm_area_add_fragment:vmareas.c(4466)
– ASSERT (1.3.0 suite/tests/common/segfault)
study_hashtable:fragment.c:1745 ASSERT_NOT_REACHED

Changes From Prior Releases

• Backward compatible with 1.0 (0.9.6) and above


– Except configuration and deployment scheme and tools:
switched to file-based scheme to support unprivileged and
parallel execution on Windows
• Not backward compatible with 0.9.1-0.9.5

Examples: Part 2
1:30-1:40 Welcome + DynamoRIO History
1:40-2:40 DynamoRIO Internals
2:40-3:00 Examples, Part 1
3:00-3:15 Break
3:15-4:15 DynamoRIO API
4:15-5:15 Examples, Part 2
5:15-5:30 Feedback
More Examples

• Dynamic Optimization
– Strength Reduction
– Software Prefetching
• Profiling
– Memory Reference Trace
– PiPA
• Shadow Memory
– Umbra
• Dr. Memory

Dynamic Optimization Opportunities

• Traditional compiler optimizations


– Compiler has limited view: application assembled at runtime
– Some shipped products are built without optimizations
• Microarchitecture-specific optimizations
– Feature set and relative performance of instructions varies
– Combinatorial blowup if done statically
• Adaptive optimizations
– Need runtime information: prior profiling runs not always
representative
– Execution phase changes during execution

Dynamic Optimization in DynamoRIO

• Traces are natural unit for optimization


– Focus only on hot code
– Cross procedure, file and module boundaries
• Linear control flow
– Single-entry, multi-exit simplifies analysis
• Support for adaptive optimization
– Can replace traces dynamically

DynamoRIO with Trace Optimization

[Figure: DynamoRIO system overview. Components: START, basic block builder, trace selector, dispatch, context switch, BASIC BLOCK CACHE (indirect branch lookup), TRACE CACHE (indirect branch stays on trace?).]

Strength Reduction: inc to add

• Pentium 4
– inc is slower than add 1
– dec is slower than sub 1
• Pentium 3
– inc is faster than add 1
– dec is faster than sub 1
• Microarchitecture-specific optimization best performed dynamically

#include "dr_api.h"

static dr_emit_flags_t event_trace(void *drcontext, void *tag,
                                   instrlist_t *trace, bool translating);
static bool replace_inc_with_add(void *drcontext, instr_t *instr,
                                 instrlist_t *trace);

DR_EXPORT void dr_init(client_id_t id) {
    /* Pentium 4? */
    if (proc_get_family() == FAMILY_PENTIUM_IV)
        dr_register_trace_event(event_trace);
}

static dr_emit_flags_t
event_trace(void *drcontext, void *tag, instrlist_t *trace, bool translating) {
    instr_t *instr, *next_instr;
    int opcode;
    /* Look for inc / dec */
    for (instr = instrlist_first(trace); instr != NULL; instr = next_instr) {
        /* Fetch the successor first: instr may be replaced and destroyed */
        next_instr = instr_get_next(instr);
        opcode = instr_get_opcode(instr);
        if (opcode == OP_inc || opcode == OP_dec)
            replace_inc_with_add(drcontext, instr, trace);
    }
    return DR_EMIT_DEFAULT;
}

static bool
replace_inc_with_add(void *drcontext, instr_t *instr, instrlist_t *trace) {
    instr_t *in;
    uint eflags;
    int opcode = instr_get_opcode(instr);
    bool ok_to_replace = false;
    /* Ensure eflags change ok: inc/dec leave CF untouched but add/sub
     * write it, so CF must be overwritten before anything reads it */
    for (in = instr; in != NULL; in = instr_get_next(in)) {
        eflags = instr_get_arith_flags(in);
        if ((eflags & EFLAGS_READ_CF) != 0) return false;
        if ((eflags & EFLAGS_WRITE_CF) != 0) {
            ok_to_replace = true;
            break;
        }
        if (instr_is_exit_cti(in)) return false;
    }
    if (!ok_to_replace) return false;
    /* Replace with add / sub */
    if (opcode == OP_inc)
        in = INSTR_CREATE_add(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
    else
        in = INSTR_CREATE_sub(drcontext, instr_get_dst(instr, 0), OPND_CREATE_INT8(1));
    instr_set_prefixes(in, instr_get_prefixes(instr));
    instrlist_replace(trace, instr, in);
    instr_destroy(drcontext, instr);
    return true;
}
Strength Reduction Results

• 2% mean speedup

[Figure: bar chart of normalized execution time (y-axis, 0.0 to 1.2) by benchmark (x-axis: ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, har. mean), comparing base and inc2add]

Software Prefetching

• Ubiquitous Memory Introspection (CGO 2007)


– Sampling to select hot traces
– Instrument to collect memory references
– Analyze to discover reference patterns
• Stride
– Insert software prefetching instruction
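
The "analyze to discover reference patterns" step can be illustrated with the simplest possible stride test: a sequence of addresses has a constant stride if every consecutive difference is the same. This is only a sketch of the idea; the slide does not describe the analysis UMI actually performs, and detect_stride is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the constant difference between consecutive addresses, or 0 if
 * there are fewer than three samples, the differences vary, or the
 * addresses do not move at all. */
ptrdiff_t detect_stride(const uintptr_t *addrs, size_t n)
{
    size_t i;
    ptrdiff_t stride;

    if (n < 3)
        return 0;
    stride = (ptrdiff_t)(addrs[1] - addrs[0]);
    for (i = 2; i < n; i++) {
        if ((ptrdiff_t)(addrs[i] - addrs[i - 1]) != stride)
            return 0;
    }
    return stride;
}
```

A nonzero result is the kind of evidence that would justify inserting a prefetch some multiple of the stride ahead of the current reference; how far ahead to prefetch is a separate tuning question not covered here.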

Software Prefetching Results

Memory Reference Trace

• Memtrace
– Profile format
• <r/w, addr>
– Steps
• Thread initialization
– Allocate buffer per thread
• Instrumentation
– Fill the buffer
– Dump to file if buffer is full
• Thread exit
– Delete the buffer
– Optimization
• Inline buffer filling code
• Out-line the clean call invocation code
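
The fill-then-dump cycle in the steps above can be modeled without DynamoRIO: each thread owns a fixed-size buffer of <r/w, addr> records, appends to it, and hands it off when it fills. In this toy version the "dump to file" step just counts records; the names and the tiny capacity are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_CAPACITY 4 /* tiny on purpose so the flush path is easy to hit */

typedef struct {
    int write;      /* 1 for a write, 0 for a read */
    uintptr_t addr; /* referenced address */
} toy_mem_ref_t;

typedef struct {
    toy_mem_ref_t buf[BUF_CAPACITY];
    size_t used;          /* records currently buffered */
    size_t total_flushed; /* records handed off so far */
} toy_thread_buf_t;

/* Stand-in for dumping the buffer to a file: count and reset. */
void toy_flush(toy_thread_buf_t *t)
{
    t->total_flushed += t->used;
    t->used = 0;
}

/* Append one record; flush as soon as the buffer becomes full. */
void toy_record(toy_thread_buf_t *t, int write, uintptr_t addr)
{
    t->buf[t->used].write = write;
    t->buf[t->used].addr = addr;
    t->used++;
    if (t->used == BUF_CAPACITY)
        toy_flush(t);
}
```

In the real client the append is done by inlined instrumentation and only the rare full-buffer case pays for a clean call, which is the point of the two optimizations listed on the slide.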

Register Events

DR_EXPORT void
dr_init(client_id_t id)
{

mutex = dr_mutex_create();
dr_register_exit_event(event_exit);
dr_register_thread_init_event(event_thread_init);
dr_register_thread_exit_event(event_thread_exit);
dr_register_bb_event(event_basic_block);
/* out-lined clean call invocation code */
code_cache_init();

}

Own Code Cache Initialization

static void code_cache_init(void) {


drcontext = dr_get_current_drcontext();
code_cache = dr_nonheap_alloc(PAGE_SIZE, DR_MEMPROT_READ |
DR_MEMPROT_WRITE |
DR_MEMPROT_EXEC);
ilist = instrlist_create(drcontext);
where = INSTR_CREATE_jmp_ind(drcontext, opnd_create_reg(REG_XCX));
instrlist_meta_append(ilist, where);
dr_insert_clean_call(drcontext, ilist, where, (void *)clean_call, false, 0);
end = instrlist_encode(drcontext, ilist, code_cache, false);
DR_ASSERT((end - code_cache) < PAGE_SIZE);
instrlist_clear_and_destroy(drcontext, ilist);
dr_memory_protect(code_cache, PAGE_SIZE, DR_MEMPROT_READ |
DR_MEMPROT_EXEC);
}

Clean Call

static void clean_call() {



    drcontext = dr_get_current_drcontext();
    data = dr_get_tls_field(drcontext);
    mem_ref = (mem_ref_t *)data->buf_base;
    num_refs = (int)((mem_ref_t *)data->buf_ptr - mem_ref);
    for (i = 0; i < num_refs; i++) {
        dr_fprintf(data->log, "%c:"PFX"\n", mem_ref->write ? 'w' : 'r', mem_ref->addr);
        ++mem_ref;
    }
    data->num_refs += num_refs;

}

Thread initialization & exit

void event_thread_init(void *drcontext) {



    data = dr_thread_alloc(drcontext, sizeof(per_thread_t));
    dr_set_tls_field(drcontext, data);
    data->buf_base = dr_thread_alloc(drcontext, MAX_MEM_BUF_SIZE);

    data->num_refs = 0;
}
void event_thread_exit(void *drcontext) {
    data = dr_get_tls_field(drcontext);
    dr_mutex_lock(mutex);
    num_refs += data->num_refs;
    dr_mutex_unlock(mutex);
    dr_thread_free(drcontext, data->buf_base, MAX_MEM_BUF_SIZE);
    dr_thread_free(drcontext, data, sizeof(per_thread_t));
}

Basic Block Instrumentation

static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating) {

    for (instr = instrlist_first(bb); instr != NULL; instr = instr_get_next(instr)) {
        /* Skip instructions with no application address */
        if (instr_get_app_pc(instr) == NULL) continue;
        if (instr_reads_memory(instr))
            for (i = 0; i < instr_num_srcs(instr); i++)
                if (opnd_is_memory_reference(instr_get_src(instr, i)))
                    instrument_mem(drcontext, bb, instr, i, false);
        if (instr_writes_memory(instr))
            for (i = 0; i < instr_num_dsts(instr); i++)
                if (opnd_is_memory_reference(instr_get_dst(instr, i)))
                    instrument_mem(drcontext, bb, instr, i, true);
    }
    return DR_EMIT_DEFAULT;
}

instrument_mem (I)

/* get memory reference address into reg1 */
opnd1 = opnd_create_reg(reg1);
if (opnd_is_base_disp(ref)) {
    /* lea [ref] => reg */
    opnd2 = ref;
    opnd_set_size(&opnd2, OPSZ_lea);
    instr = INSTR_CREATE_lea(drcontext, opnd1, opnd2);
} else if (IF_X64(opnd_is_rel_addr(ref) ||) opnd_is_abs_addr(ref)) {
    /* mov addr => reg */
    opnd2 = OPND_CREATE_INTPTR(opnd_get_addr(ref));
    instr = INSTR_CREATE_mov_imm(drcontext, opnd1, opnd2);
} else {
    instr = NULL;
    DR_ASSERT_MSG(false, "Unhandled instructions");
}
instrlist_meta_preinsert(ilist, where, instr);

instrument_mem (II)

    /* Store the read/write flag in the write field */
    opnd1 = OPND_CREATE_MEM32(reg2, offsetof(mem_ref_t, write));
    opnd2 = OPND_CREATE_INT32(write);
    instr = INSTR_CREATE_mov_imm(drcontext, opnd1, opnd2);
    instrlist_meta_preinsert(ilist, where, instr);
    /* Store address in memory ref */
    opnd1 = OPND_CREATE_MEMPTR(reg2, offsetof(mem_ref_t, addr));
    opnd2 = opnd_create_reg(reg1);
    instr = INSTR_CREATE_mov_st(drcontext, opnd1, opnd2);
    instrlist_meta_preinsert(ilist, where, instr);
    /* Advance the buffer pointer in reg2 by sizeof(mem_ref_t) using lea */
    opnd1 = opnd_create_reg(reg2);
    opnd2 = opnd_create_base_disp(reg2, REG_NULL, 0, sizeof(mem_ref_t), OPSZ_lea);
    instr = INSTR_CREATE_lea(drcontext, opnd1, opnd2);
    instrlist_meta_preinsert(ilist, where, instr);

instrument_mem (III)

    /* jecxz call */
    call = INSTR_CREATE_label(drcontext);
    instrlist_meta_preinsert(ilist, where,
        INSTR_CREATE_jecxz(drcontext, opnd_create_instr(call)));
    /* jump restore to skip clean call */
    restore = INSTR_CREATE_label(drcontext);
    instrlist_meta_preinsert(ilist, where,
        INSTR_CREATE_jmp(drcontext, opnd_create_instr(restore)));
    instrlist_meta_preinsert(ilist, where, call);
    /* mov restore -> REG_XCX */
    instr = INSTR_CREATE_mov_st(drcontext, opnd_create_reg(reg2),
                                opnd_create_instr(restore));
    instrlist_meta_preinsert(ilist, where, instr);
    /* jmp code_cache */
    opnd1 = opnd_create_pc(code_cache);
    instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jmp(drcontext, opnd1));
    /* restore %reg */
    instrlist_meta_preinsert(ilist, where, restore);

PiPA

• Pipelined Profiling and Analysis (CGO 2008)


– Stages (thread/process)
• Profiling
• Reconstruction/extraction
• Analysis
– Profiling
• Runtime Execution Profile (REP)
– Communication
• Double buffer
– Analysis
• Parallel cache simulation

More Examples

• Dynamic Optimization
– Strength Reduction
• Profiling
– Memory Reference Trace
• Shadow Memory
– Umbra
• Dr. Memory

Shadow Memory

• Application
– Store meta-data to track properties of application memory
• Millions of software watchpoints
• Dynamic information flow tracking (taint propagation)
• Race detection
• Memory usage debugging tool (MemCheck/Dr. Memory)
• Issues
– Performance
– Multi-threaded applications
– Flexibility
– Platform dependence
– Development challenges

Umbra (CGO 2010)

• Design
• Implementation
• Optimization
• Download
– http://people.csail.mit.edu/qin_zhao/umbra/

Design

• Address Space
– A collection of fixed-size units
• 4G (64-bit)
• Application, Shadow, Unused
• Translation Table
– Translation from application memory unit to corresponding shadow memory unit

[Figure: address space layout, in address order: App Mem 1, Unused, Shd Mem 1, Unused, Shd Mem 2, Shd Mem 3, App Mem 2, App Mem 3]

addr_shd = addr_app × scale + offset

App Mem                     Shd Mem                     Offset
[0x00000000, 0x10000000)    [0x20000000, 0x30000000)     0x20000000
[0x60000000, 0x70000000)    [0x40000000, 0x50000000)    -0x20000000
[0x80000000, 0x90000000)    [0x50000000, 0x60000000)    -0x30000000
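
The translation step can be sketched in plain C. This is only an illustration, not Umbra's implementation: unit_map_t, translation_table, and app_to_shadow are hypothetical names, the linear search stands in for whatever lookup structure the real tool uses, and each offset is computed as shadow base minus application base from the address ranges listed in the table (with scale = 1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one application unit and its shadow unit. */
typedef struct {
    uint64_t app_base;  /* start of the application memory unit */
    uint64_t size;      /* unit size in bytes */
    int64_t  offset;    /* shadow base minus application base */
} unit_map_t;

/* Entries derived from the address ranges in the table (scale = 1). */
static const unit_map_t translation_table[] = {
    { 0x00000000u, 0x10000000u,  0x20000000 },
    { 0x60000000u, 0x10000000u, -0x20000000 },
    { 0x80000000u, 0x10000000u, -0x30000000 },
};

/* Applies addr_shd = addr_app * scale + offset.
 * Returns 1 and stores the shadow address if addr lies in a known
 * application unit; returns 0 otherwise (e.g. an Unused unit). */
static int app_to_shadow(uint64_t addr, uint64_t scale, uint64_t *shadow_out)
{
    size_t n = sizeof(translation_table) / sizeof(translation_table[0]);
    assert(shadow_out != NULL);
    for (size_t i = 0; i < n; i++) {
        const unit_map_t *u = &translation_table[i];
        if (addr >= u->app_base && addr - u->app_base < u->size) {
            *shadow_out = (uint64_t)((int64_t)(addr * scale) + u->offset);
            return 1;
        }
    }
    return 0;
}
```

With these entries, app_to_shadow(0x60001000, 1, &s) stores 0x40001000 in s, and an address that falls in an Unused unit makes the function return 0.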
Implementation

• Memory Manager
– Monitor and control application memory allocation
• brk, mmap, munmap, mremap
• dr_register_pre_syscall_event
• dr_register_post_syscall_event
– Allocate shadow memory
– Maintain translation table
• Instrumenter
– Instrument every memory reference
• Context save
• Address calculation
• Address translation
• Shadow memory update
• Context restore
Instrument Code Example
Context Save                    mov %ecx -> [ECX_SLOT]
                                mov %edx -> [EDX_SLOT]
                                mov %eax -> [EAX_SLOT]
                                lahf -> %ah
                                seto -> %al
Address Calculation             lea [%ebx, 16] -> %ecx
Address Translation             mov 0 -> %edx
                                … # table lookup code
                                add %ecx, table[%edx].offset -> %ecx
Shadow Memory Update            mov 1 -> [%ecx]
Context Restore                 add %al, 0x7f -> %al
                                sahf
                                mov [ECX_SLOT] -> %ecx
                                mov [EDX_SLOT] -> %edx
                                mov [EAX_SLOT] -> %eax
Application memory reference    mov 0 -> [%ebx, 16]

Optimization

• Translation Optimization
– Thread Local Translation Table
– Memoization Check
– Reference Check

• Instrumentation Optimization
– Context Switch Reduction
– Reference Grouping
– 3-stage Code Layout

Translation Optimization

• Caching

[Figure: memory units App 1, Shd 2, Shd 1, App 2 and the global translation table]

Translation Optimization

• Thread Local Translation Optimization
– Local translation table per thread
– Synchronize with global translation table when necessary
– Avoid lock contention

[Figure: Thread 1 and Thread 2, thread-local translation table, global translation table]

Translation Optimization

• Memoization Cache
– Software cache per thread
– Stores frequently used translation entries
• Stack
• Units found in last table lookup

[Figure: Thread 1 and Thread 2, memoization cache, thread-local translation table, global translation table]
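
A per-thread memoization cache in front of the table lookup might look like the following C sketch. It is an assumption-laden illustration rather than Umbra's code: memo_cache_t, translate, and table_lookup are hypothetical names, the cache holds just the two entries named in the bullets (the stack unit and the unit found by the last lookup), and table_lookup is a stand-in that places every shadow unit exactly one unit above its application unit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNIT_MASK 0xffffffff00000000ULL  /* 4G units on 64-bit */

typedef struct {
    uint64_t tag;     /* base address of the cached unit */
    int64_t  offset;  /* app-to-shadow displacement for that unit */
    int      valid;
} memo_entry_t;

typedef struct {
    memo_entry_t stack;          /* unit holding the thread's stack */
    memo_entry_t last;           /* unit found by the last table lookup */
    unsigned     table_lookups;  /* slow-path counter, for illustration only */
} memo_cache_t;

static void memo_cache_init(memo_cache_t *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Hypothetical stand-in for the thread-local table lookup. */
static int64_t table_lookup(memo_cache_t *c, uint64_t tag)
{
    (void)tag;
    c->table_lookups++;
    return 0x100000000LL;
}

/* Translate an application address, consulting the cache first. */
static uint64_t translate(memo_cache_t *c, uint64_t addr)
{
    uint64_t tag = addr & UNIT_MASK;
    assert(c != NULL);
    if (c->stack.valid && c->stack.tag == tag)
        return addr + (uint64_t)c->stack.offset;
    if (!c->last.valid || c->last.tag != tag) {
        /* miss: fall back to the table and remember the result */
        c->last.offset = table_lookup(c, tag);
        c->last.tag = tag;
        c->last.valid = 1;
    }
    return addr + (uint64_t)c->last.offset;
}
```

Two consecutive references to the same unit trigger only one table lookup; a reference to a different unit evicts the "last" entry and triggers another.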

Translation Optimization

• Reference Cache
– Software cache per static application memory reference
• Last reference unit tag
• Last translation offset

[Figure: Thread 1 and Thread 2, reference cache, memoization cache, thread-local translation table, global translation table]

Instrumentation Optimization

• Context Switch Reduction
– Register liveness analysis
• Reference Grouping
– One translation lookup for multiple references
• Stack local variables
• Different members of the same object
• 3-stage Code Layout
– Inline stub
• Quick inline check code with minimal context switch
– Lean procedure
• Simple assembly procedure with partial context switch
– Callout
• C function with complete context switch

3-stage Code Layout

• Inline stub
– Reference cache check
– Jump to lean procedure if miss

• Lean procedure
– Memoization cache check
– Local table lookup
– Clean call to callout

• Callout
– Global table synchronization
– Local table lookup

Instrumentation Optimization
Inline Stub
    # reference cache check
    lea [ref] -> %r1
    %r1 & 0xffffffff00000000 -> %r1
    cmp %r1, ref.tag
    je .update_shadow_memory
    # jmp-and-link to lean procedure
    mov %r1 -> ref.tag
    mov .update_ref_cache -> [ret_pc]
    jmp lean_procedure
.update_ref_cache
    mov %r1 -> ref.offset
    # shadow memory update
.update_shadow_memory
    lea [ref] -> %r1
    add %r1, ref.offset -> %r1
    mov 1 -> [%r1]

Lean Procedure
    # memoization check
    cmp %r1, cache1.tag
    jne .cache1_miss
    mov cache1.offset -> %r1
    jmp [ret_pc]
.cache1_miss
    cmp %r1, cache2.tag
    jne .cache2_miss
    mov cache2.offset -> %r1
    jmp [ret_pc]
.cache2_miss
    # table lookup
    mov %r1 -> cache2.tag
    mov %r2 -> [R2_SLOT]
    …
    mov [R2_SLOT] -> %r2
    mov %r1 -> cache2.offset
    jmp [ret_pc]

Performance Evaluation

Umbra Client: Shared Memory Detection

static void instrument_update(void *drcontext, umbra_info_t *umbra_info,
                              mem_ref_t *ref, instrlist_t *ilist, instr_t *where) {
    /* test [%reg].tid_map, tid_map */
    opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
    opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map);
    instrlist_meta_preinsert(ilist, where, INSTR_CREATE_test(drcontext, opnd1, opnd2));
    /* jnz where */
    opnd1 = opnd_create_instr(where);
    instrlist_meta_preinsert(ilist, where, INSTR_CREATE_jcc(drcontext, OP_jnz, opnd1));
    /* lock or [%reg].tid_map, tid_map | 1 */
    opnd1 = OPND_CREATE_MEM32(umbra_info->reg, 0);
    opnd2 = OPND_CREATE_INT32(client_tls_data->tid_map | 1);
    instr = INSTR_CREATE_or(drcontext, opnd1, opnd2);
    LOCK(instr);
    instrlist_meta_preinsert(ilist, where, instr);
}
More Examples

• Dynamic Optimization
– Strength Reduction
• Profiling
– Memory Reference Trace
• Shadow Memory
– Umbra
• Dr. Memory

Dr. Memory

• Detects reads of uninitialized memory
• Detects heap errors
– Out-of-bounds accesses (underflow, overflow)
– Access to freed memory
– Invalid frees
– Memory leaks
• Detects other accesses to invalid memory
– Stack tracking
– Thread-local storage slot tracking
• Operates at runtime on unmodified Windows & Linux binaries

Dr. Memory Instrumentation

• Monitor all memory accesses, stack adjustments, and heap allocations
• Shadow each byte of app memory
• Each byte’s shadow stores one of 4 values:
– Unaddressable
– Uninitialized
– Defined at byte level
– Defined at bit level -> escape to extra per-bit shadow values
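
Since each application byte has one of four shadow states, two shadow bits per byte are enough. The C sketch below packs four application bytes into one shadow byte. The packing, the names (shadow_state_t, shadow_get, shadow_set), and the numeric encodings are assumptions chosen for illustration; they are not taken from Dr. Memory's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings; only the set of four states comes from the slide. */
typedef enum {
    SHADOW_DEFINED          = 0,
    SHADOW_UNINITIALIZED    = 1,
    SHADOW_UNADDRESSABLE    = 2,
    SHADOW_DEFINED_BITLEVEL = 3  /* escape: consult separate per-bit shadow */
} shadow_state_t;

/* shadow[] holds 2 bits per application byte, 4 app bytes per shadow byte.
 * app_index is the application byte's index within the shadowed region. */
static shadow_state_t shadow_get(const uint8_t *shadow, size_t app_index)
{
    unsigned shift = (unsigned)(app_index % 4) * 2;
    assert(shadow != NULL);
    return (shadow_state_t)((shadow[app_index / 4] >> shift) & 0x3u);
}

static void shadow_set(uint8_t *shadow, size_t app_index, shadow_state_t s)
{
    unsigned shift = (unsigned)(app_index % 4) * 2;
    uint8_t cleared;
    assert(shadow != NULL);
    /* clear the 2-bit field, then write the new state into it */
    cleared = (uint8_t)(shadow[app_index / 4] & ~(0x3u << shift));
    shadow[app_index / 4] = (uint8_t)(cleared | ((unsigned)s << shift));
}
```

A state of SHADOW_DEFINED_BITLEVEL would send the tool to a second, finer-grained structure; that structure is omitted here.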

Dr. Memory

[Figure: shadow state alongside the stack and the heap. Stack regions: defined, undefined, defined, defined, invalid. Heap regions: redzone = invalid; malloc block = defined and undefined parts; redzone = invalid; freed = invalid]

Partial-Word Defines But Whole-Word Transfers

• Sub-dword variables are moved around as whole dwords
• Cannot raise error when a move reads uninitialized bits
• Must propagate on moves and thus must shadow registers
– Propagate shadow values by mirroring app data flow
• Check system call reads and propagate system call writes
– Otherwise: false negatives (unchecked reads) or false positives (unpropagated writes)
• Raise errors instead of propagating at certain points
– Report errors only on “significant” reads

Shadowing Registers

• Use multiple TLS slots
– dr_raw_tls_calloc()
– Alternative: steal register
• Can read and write w/o spilling
• Bring into spilled register to combine w/ other args
– Defined=0, uninitialized=1
– Combine via bitwise or

Monitoring Stack Changes

• As the stack grows and shrinks, must update the stack shadow: newly allocated bytes become uninitialized, deallocated bytes become unaddressable
• Push, pop, or any write to stack pointer
• Try to distinguish large alloc/dealloc from stack swap
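
A possible shape for the stack-pointer handler is sketched below in C, assuming a downward-growing stack. The threshold used to separate an ordinary allocation from a stack swap is an invented value, and stack_shadow_t and on_stack_pointer_write are hypothetical names; the slide does not say which heuristic Dr. Memory uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_SWAP_THRESHOLD 0x1000  /* assumed cutoff, for illustration only */

enum { SH_UNADDRESSABLE = 'X', SH_UNINITIALIZED = 'U', SH_DEFINED = 'D' };

/* shadow[i] describes the application byte at address base + i. */
typedef struct {
    uint8_t  *shadow;
    uintptr_t base;
    size_t    size;
} stack_shadow_t;

/* Called for any write to the stack pointer (push, pop, sub, mov, ...).
 * Returns 1 if the shadow was updated, 0 if the change was treated
 * as a switch to a different stack. */
static int on_stack_pointer_write(stack_shadow_t *s, uintptr_t old_sp, uintptr_t new_sp)
{
    uintptr_t lo = old_sp < new_sp ? old_sp : new_sp;
    uintptr_t hi = old_sp < new_sp ? new_sp : old_sp;
    assert(s != NULL && s->shadow != NULL);
    if (hi - lo > STACK_SWAP_THRESHOLD)
        return 0;  /* too large to be an alloc/dealloc: assume stack swap */
    assert(lo >= s->base && hi - s->base <= s->size);
    /* stack grows down: a lower sp allocates, a higher sp deallocates */
    memset(s->shadow + (lo - s->base),
           new_sp < old_sp ? SH_UNINITIALIZED : SH_UNADDRESSABLE,
           hi - lo);
    return 1;
}
```

A single fixed threshold misclassifies very large legitimate allocations, which is presumably why the bullet above says "try to distinguish".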

Kernel-Mediated Stack Changes

• Kernel places data on the stack and removes it again
– Windows: APC, callback, and exception
– Linux: signals
• Linux signals as an example:
– intercept sigaltstack changes
– intercept handler registration to instrument handler code
– use DR's signal event to record app xsp at interruption point
– when the signal event is followed by handler entry, check which stack is in use and mark from either the interrupted xsp or the altstack base to the current xsp as defined (ignoring padding)
– record the current xsp in the handler, and use it to undo the marking on sigreturn

Types Of Instrumentation

• Clean call
– Simplest, but expensive in both time and space
• Shared clean call
– Saves space
• Lean procedure
– Shared routine with smaller context switch than full clean call
• Inlined
– Smallest context switch, but should limit to small sequences of instrumentation

Non-Code-Cache Code

• Use dr_nonheap_alloc() to allocate space to store code
• Generate code using DR’s IR and emit to target space
• Mark read-only once emitted via dr_memory_protect()

Jump-and-Link

• Rather than using call+return, avoid stack swap cost by using jump-and-link
– Store return address in a register or TLS slot
– Direct jump to target
– Indirect jump back to source

PRE(bb, inst, INSTR_CREATE_mov_st(drcontext,
    spill_slot_opnd(drcontext, SPILL_SLOT_2),
    opnd_create_instr(appinst)));
PRE(bb, inst, INSTR_CREATE_jmp(drcontext,
    opnd_create_pc(shared_slowpath_region)));
...
PRE(ilist, NULL, INSTR_CREATE_jmp_ind(drcontext,
    spill_slot_opnd(drcontext, SPILL_SLOT_2)));

Inter-Instruction Storage

• Spill slots provided by DR are only guaranteed to be live during a single app instr
– In practice, live until next selfmod instr
• Allocate own TLS for spill slots
– dr_raw_tls_calloc()
• Steal registers across whole bb
– Restore before each app read
– Update spill slot after each app write
– Restore on fault

Using Faults For Faster Common Case Code

• Instead of explicitly checking for rare cases, use faults to handle them and keep the common case code path fast
• The signal event, the exception event, and the extended restore-state event all provide pre- and post-translation contexts and containing-fragment information
• Client can return failure for the extended restore-state event
– For when the client can support re-execution of the faulting cache instr, but not re-starting translation for relocation

Address Space Iteration

• Repeated calls to dr_query_memory_ex()
• Check dr_memory_is_in_client() and dr_memory_is_dr_internal()
• Heap walk
– API on Windows
• Initial structures on Windows
– TEB, TLS, etc.
– PEB, ProcessParameters, etc.

Intercepting Library Routines

• Common task
• Dr. Memory monitors malloc, calloc, realloc, free,
malloc_usable_size, etc.
– Alternative is to replace w/ own copies
• Locating entry point
– Module API
• Pre-hooks are easy
• Post-hooks are hard
– Three techniques, each with its own limitations

Intercepting Library Routines: Technique 1

• CFG analysis at init time
– Statically analyze code from entry point and find return instruction(s)
– Post-hook placed at each return instruction
• Complications:
– Not always easy or even possible to statically analyze
• Hot/cold and other layout optimizations
• Switches or other indirection
• Mixed code/data
– Tailcall from hooked routine A to hooked routine B will skip a return in A
– Longjmp or SEH unwind can skip any post-hook

Intercepting Library Routines: Technique 2

• At call site identify target
– Direct calls/jmp: easy
– Indirect through PLT/IAT: easy
– Indirect through register/unknown memory: not always easy
– Post-hook is placed at post-call-site
– Flush post-call-site if it exists
• Complications:
– Indirect targets that are not “statically” analyzable
– Same call site can target multiple hooked routines
– Tailcall from hooked routine A to hooked routine B will skip the post-call site in A
– Longjmp or SEH unwind can skip any post-hook

Intercepting Library Routines: Technique 3

• Inside callee, obtain return address
– Flush that address if it already exists
– Post-hooks are placed at return address
• Complications:
– Same call site can target multiple hooked routines
• Store which routine was entered, from inside the callee
– Tailcall from hooked routine A to hooked routine B will skip return address of B
• Identify tailcall and store target B; process B and then A at post-A
– Longjmp or SEH unwind can skip any post-hook
• Try to intercept
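
One way to read the tailcall handling above is as a per-thread stack of pending post-hooks keyed by return address. The C sketch below follows that reading; it is an interpretation, not Dr. Memory's code, and hook_stack_t, on_routine_entry, and on_return_address are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 32  /* assumed nesting limit for this sketch */

typedef struct {
    int       routine_id;  /* which hooked routine was entered */
    uintptr_t retaddr;     /* where its post-hook will fire */
} pending_hook_t;

typedef struct {
    pending_hook_t entries[MAX_PENDING];
    size_t         depth;
} hook_stack_t;

/* Pre-hook, run inside the callee: record which routine was entered
 * and the return address at which its post-hook is placed. */
static void on_routine_entry(hook_stack_t *s, int routine_id, uintptr_t retaddr)
{
    assert(s != NULL && s->depth < MAX_PENDING);
    s->entries[s->depth].routine_id = routine_id;
    s->entries[s->depth].retaddr = retaddr;
    s->depth++;
}

/* Run at an instrumented return address: pops every pending routine that
 * returns here, innermost first, and writes their ids to out.
 * Returns how many post-hooks should run. */
static size_t on_return_address(hook_stack_t *s, uintptr_t pc, int *out, size_t max_out)
{
    size_t n = 0;
    assert(s != NULL && out != NULL);
    while (s->depth > 0 && n < max_out &&
           s->entries[s->depth - 1].retaddr == pc) {
        s->depth--;
        out[n++] = s->entries[s->depth].routine_id;
    }
    return n;
}
```

When A tail-calls B, both entries carry A's return address, so reaching that address yields B and then A. Recursion through a single call site would need an extra key such as the stack pointer, and longjmp or SEH unwinds would have to discard stale entries; both are left out of the sketch.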

Modifying Library Routine Parameters

• Use cases for Dr. Memory:
– Add redzone to heap allocations
– Delay frees
• Simply clobber the actual parameter
– Need to know calling convention
– Use dr_safe_write() for robustness

Replacing Library Routines

• Dr. Memory replaces libc routines containing optimized code that raises false positives
– memcpy, strlen, strchr, etc.
• Simplification: arrange for routines to always be entered in a new bb
– Do not request elision or indcall2direct from DR
• Want to interpret replaced routines
– DR treats native execution differently: aborts on fault, etc.
• Replace entire bb with jump to replacement routine

State Across Windows Callbacks

• Per-thread state varies in whether it should be shared or private across callbacks
– Pre-to-post syscall data must be callback-private
• Callback entry:
– Ntdll!KiUserCallbackDispatcher
• Callback exit:
– NtCallbackReturn system call
– Int 0x2b
• Make these DynamoRIO Events? (Issue 241)

Delayed Fragment Deletion

• Due to non-precise flushing we can have a flushed bb made inaccessible but not actually freed for some time
• When keeping state per bb, if a duplicate bb is seen, replace the state and increment a counter ignore_next_delete
• On a deletion event, decrement the counter and ignore the event unless the counter drops below 0
• Can't tell this apart from duplication due to thread-private copies, but this mechanism handles that if the saved info is deterministic and identical for each copy

Callstack Walking

• Use case: error reporting
• Technique:
– Start with xbp as frame pointer (fp)
– Look for <fp,retaddr> pairs where retaddr is inside a module
• Interesting issues:
– When scanning for frame pointer (in frameless func, or at bottom of stack), querying whether in a module dominates performance
– msvcr80!malloc pushes ebx and then ebp, requiring special handling
– When displaying, use retaddr-1 for symbol lookup
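
The frame-pointer walk can be illustrated with a simulated stack in C. Everything here is a simplification under stated assumptions: the stack is an array of words, a frame pointer is an index into it, stack[fp] holds the caller's frame pointer and stack[fp + 1] the return address, and module_t, in_module, and walk_callstack are hypothetical names. The scan for a plausible frame pointer in frameless functions is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t start, end; } module_t;  /* covers [start, end) */

static int in_module(const module_t *mods, size_t nmods, uintptr_t pc)
{
    for (size_t i = 0; i < nmods; i++)
        if (pc >= mods[i].start && pc < mods[i].end)
            return 1;
    return 0;
}

/* Follows <fp, retaddr> pairs while retaddr lies inside a module.
 * Stores retaddr - 1 for each frame, the address to use for symbol lookup,
 * and returns the number of frames found. */
static size_t walk_callstack(const uintptr_t *stack, size_t nwords, size_t fp,
                             const module_t *mods, size_t nmods,
                             uintptr_t *lookup_pcs, size_t max_frames)
{
    size_t count = 0;
    assert(stack != NULL && lookup_pcs != NULL);
    while (count < max_frames && fp + 1 < nwords) {
        uintptr_t retaddr = stack[fp + 1];
        size_t next_fp = (size_t)stack[fp];
        if (!in_module(mods, nmods, retaddr))
            break;
        lookup_pcs[count++] = retaddr - 1;
        if (next_fp <= fp)  /* caller frames must lie further up the stack */
            break;
        fp = next_fp;
    }
    return count;
}
```

Using retaddr - 1 keeps the lookup inside the call instruction, so a call that is the last instruction of a function is not attributed to whatever symbol follows it.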

Client Files Closed By Application

• Some applications close all file descriptors
• Solution:
– Keep table of file descriptors owned by client
– Intercept close system call
– Turn close into nop when called on client descriptors
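
The bookkeeping behind this solution is small. The C sketch below shows only the table and the decision; client_fd_table_t, client_fd_add, client_fd_owned, and should_execute_close are hypothetical names, and the fixed-size array is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIENT_FDS 16  /* assumed capacity for this sketch */

typedef struct {
    int    fds[MAX_CLIENT_FDS];
    size_t count;
} client_fd_table_t;

/* Record a descriptor opened by the client. Returns 0 if the table is full. */
static int client_fd_add(client_fd_table_t *t, int fd)
{
    assert(t != NULL);
    if (t->count == MAX_CLIENT_FDS)
        return 0;
    t->fds[t->count++] = fd;
    return 1;
}

static int client_fd_owned(const client_fd_table_t *t, int fd)
{
    assert(t != NULL);
    for (size_t i = 0; i < t->count; i++)
        if (t->fds[i] == fd)
            return 1;
    return 0;
}

/* Decision for an intercepted close(fd): 1 lets the system call run,
 * 0 means it should be skipped because the descriptor belongs to the client. */
static int should_execute_close(const client_fd_table_t *t, int fd)
{
    return !client_fd_owned(t, fd);
}
```

In a DynamoRIO client this decision would typically be made in the pre-syscall event, where returning false skips the system call.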

Suspending The World

• Use case: Dr. Memory leak check
– GC-like memory scan
• Use dr_suspend_all_other_threads() and dr_resume_all_other_threads()
• Cannot hold locks while suspending

Using Nudges

• Daemon apps do not exit
• Request results mid-run
• Cross-platform
– Signal on Linux
– Remote thread on Windows

Tool Packaging

• DynamoRIO is redistributable, so can include a copy with your tool
• Front end to configure and launch app
– On Linux use a script that execs drrun
– On Windows use drinjectlib.dll

Feedback
• 1:30-1:40 Welcome + DynamoRIO History
• 1:40-2:40 DynamoRIO Internals
• 2:40-3:00 Examples, Part 1
• 3:00-3:15 Break
• 3:15-4:15 DynamoRIO API
• 4:15-5:15 Examples, Part 2
• 5:15-5:30 Feedback

Contributors

• Looking for contributors to DynamoRIO
– Work on major features
• Windows 7 support
• Auto-inline callouts
• Attach to running process
• MacOS port
– Add new Extensions
– Run nightly test suites
– Maintain particular platforms
– Etc.

Future Releases

• Better Linux library support (STL, etc.)
– Custom loader
• Auto-inline callouts
• Persistent and process-shared caches
• Attach to a process
• Symbol table lookup support on Linux
• 64-bit client controlling 32-bit app
• Your suggestion here

Feedback

• Questions for you
– Feedback on what you want to see in API
• Thank you!
