Linux Performance Tuning Logistics: Tutorial Runs From 9 To 5:00pm

This document provides an agenda and overview for a Linux performance tuning tutorial taking place on November 10, 2010. The tutorial will run from 9am to 5pm with breaks scheduled throughout the day. The agenda includes introductions to performance tuning, filesystem and storage tuning, network tuning, NFS performance tuning, memory tuning, and application tuning. Basic performance monitoring tools like free, top, and iostat are also discussed.

Linux Performance Tuning

Wednesday, November 10, 2010

Logistics

 Tutorial runs from 9:00am to 5:00pm
   Morning break at 10:30am
   Lunch at 12:30-1:30pm
   Afternoon break at 3:00-3:30pm
 Feel free to ask me questions
   But I reserve the right to defer some answers until later in the session or to the break/end of the class.
 Please fill out and return the tutorial evaluation form!

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

Introduction to Performance Tuning

 Complex task that requires in-depth understanding of hardware, software, and application
   If it were easy the OS would do it automatically (and the OS does a lot automatically to begin with)
 Goals of Performance Tuning
   Speed up time to do a single large task (time to perform some large matrix calculation)
   Graceful degradation of a web/application server as it is asked to service a larger and larger number of requests

Stress Testing

 What happens when a server is put under a large amount of stress?
   “My web server just got slashdotted!”
 Typically the server behaves well until the load increases beyond a certain critical point; then it breaks down.
   Transaction latencies go through the roof
   The server may cease functioning altogether
 Measure the system when it is functioning normally, and then when it is under stress. What changes?

Finding Bottlenecks

 Careful tuning of memory usage won't matter if the problem is caused by a shortage of disk bandwidth
 Performance measurement tools are hugely important to diagnose what is placing limits on the scalability or performance of your application
   Start with large areas, then narrow down
   Is your application I/O bound? CPU bound? Network bound?

Incremental Tuning

 Use the scientific method
   Establish a baseline
     Define testing parameters which are replicated from test to test.
     Measure the performance given a starting configuration.
   Change one parameter at a time
   Record everything
   Make sure you get the same results when you repeat a test!

Measurement overhead

 Some performance measurement tools may impact your application's behavior
   If you're not familiar with how a particular tool interacts with your workload, don't assume that a tool has zero overhead with your application!
 Enabling application performance metering or debugging may also change its baseline numbers.

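The advice to record everything and to confirm that repeated tests agree can be sketched as a small shell helper. This is only an illustration: the function name repeat_timed is hypothetical, and it assumes GNU date (for the %N nanosecond field), as found on Linux.

```shell
#!/bin/sh
# Run a command several times and log wall-clock time per run, so that
# repeatability can be checked before and after each tuning change.
# Assumes GNU date (for %N nanoseconds).
repeat_timed() {
    runs="$1"; shift
    i=1
    while [ "$i" -le "$runs" ]; do
        start=$(date +%s%N)
        "$@" >/dev/null 2>&1
        end=$(date +%s%N)
        # Elapsed time in milliseconds for this run.
        echo "run=$i elapsed_ms=$(( (end - start) / 1000000 ))"
        i=$(( i + 1 ))
    done
}
```

For example, "repeat_timed 3 fs_mark -s 10240 -n 1000 -d /mnt >> results.log" would append three timings to a (hypothetical) results.log. If the three numbers differ widely, the test setup is not yet repeatable enough to judge a tuning change.
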
A basic performance tuning methodology

 Define your baseline configuration and measure its performance
   [ If appropriate, define a stress test workload, and measure it. ]
 Make a single change to the system configuration. Measure the results of that change and record it.
 Repeat as necessary
 Make sure to test single changes as well as combinations of changes. Sometimes effects are synergistic

Basic Performance Tools

 free
 top
 iostat

The free(1) command

 Basic command which shows memory usage

The top(1) command

 Good general place to start

Questions to ask yourself when looking at top(1) output

 What are the “top” tasks running; should they be there? Are they running? Waiting for disk? How much memory are they taking up?
 How is the CPU time (overall) being spent?
   User time, System time, Niced user time, I/O Wait, Hardware IRQ, Software IRQ, “Stolen” time

The iostat(1) command

 Part of the sysstat package; shows I/O statistics
 Use -k for kilobytes instead of 512-byte sectors

Advanced iostat(1)

 Many more details with the -x option
   rrqm/s, wrqm/s – read/write requests merged per second
   r/s, w/s – read/write requests per second
   rkB/s, wkB/s – number of kilobytes of read/write transfers per second
   avgrq-sz – average request size in 512-byte sectors
   avgqu-sz – average request queue length
   …

Advanced iostat(1), continued

 Still more details revealed with the -x option
   await – average time (in ms) between when a request is issued and when it is completed (time in queue plus time for device to service the request)
   svctm – average service time (in ms) for I/O requests that were issued to the device
   %util – Percentage of CPU time during which the device was servicing requests. (100% means the device is fully saturated)

Example of iostat -xk 1

 Workload “fs_mark -s 10240 -n 1000 -d /mnt”
   Creates 1000 files, each 10k, in /mnt, with an fsync after writing each file
 Result: 33.7 files/second

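As a rough sketch of how the %util column can be watched automatically, the filter below reads "iostat -x" output on standard input and prints devices above a threshold. The function name busy_devices and the default threshold of 90 are assumptions; the %util column is located by its header name because column layouts differ between sysstat versions.

```shell
#!/bin/sh
# Print "device %util" for every device whose %util exceeds a threshold,
# reading `iostat -x` style output on stdin.
busy_devices() {
    threshold="${1:-90}"
    awk -v limit="$threshold" '
        # Header line: remember which column holds %util.
        /%util/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Data line: report the device if it is busier than the limit.
        col && NF >= col && $col + 0 > limit { print $1, $col }
    '
}
```

A possible use is "iostat -xk 1 | busy_devices 90", which would keep printing any device that stays above 90% utilization while the workload runs.
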
Conclusions we can draw from the iostat results

 Utilization: 98.48%
   The system is I/O bound
   Adding memory or speeding up the CPU clock won't help
 Solution – attack the I/O bottleneck
   Add more I/O bandwidth resources (use a faster disk or use a RAID array)
   Or, do less work!

Speeding up fs_mark

 If we mount the (ext4) file system with -o barrier=0, file/sec becomes 358.3
   But this risks fs corruption after a power fail
 Is the fsync() really needed? Without it, file/sec goes up to 17,010.30
   Depends on application requirements
 Better: use -o journal_async_commit
   Using journal checksums, it allows ext4 to safely use only one barrier per fsync() instead of two. (Requires Linux 2.6.32)

Using -o journal_async_commit

 Using “fs_mark -s 10240 -n 1000 -d /mnt” again
 Result: 49.2 files/sec (a 46% improvement over 33.7 files/sec!)

Comparing the two results

 Default mount options: 33.7 files/sec, 2000 barrier ops
 With -o journal_async_commit: 49.2 files/sec, 1000 barrier ops

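The improvement figure can be checked with a one-line calculation. The helper name pct_improvement is hypothetical; the inputs are the 33.7 and 49.2 files/sec figures from the comparison of the two results.

```shell
#!/bin/sh
# Percentage improvement of a new throughput figure over a baseline.
pct_improvement() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

pct_improvement 33.7 49.2    # journal_async_commit vs. default mount options; prints 46.0
```
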
Before we leave fs_mark...

 How does fs_mark fare on other file systems? (files/sec)
   ext2 (no barriers) – 574.9
   ext3 (no barriers) – 348.8 (w/ barriers) – 30.8
   ext4 (no barriers) – 358.3 (w/ barriers) – 49.2
   XFS (no barriers) – 337.3 (w/ barriers) – 29.0
   reiserfs (no barriers) – 210.0 (w/ barriers) – 31.5
 Important note: these numbers are specific to this workload (small files, fsync heavy) and not a general figure of merit for these file systems

Lessons Learned So Far

 Measure, analyze, and then tweak
 Bottleneck analysis is critical
 It is very useful to understand how things work under the covers
 Adding more resources is one way to address a bottleneck
   But so is figuring out ways of doing less work!
   Sometimes you can achieve your goal by working smarter, not harder.

The snap script

 Handy quickie shell script which I and a colleague developed while working on the Advanced Linux Response Team at IBM
   Collects a lot of statistics: iostat, meminfo, slabinfo, sar, etc. in a low impact fashion.
   Collects system configuration information
     Especially useful when I might not have access to the system for security reasons
 Gather information for a day; then analyze for trends or patterns

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

File system and storage tuning

 Choosing the right storage devices
   Hard Drives
   SSD
   RAID
   NFS appliances
 File System Tuning
   General Tips
   File system specific

Hard Drives

 Disks are probably the biggest potential bottleneck in your system
   Punched cards and paper tape having fallen out of favor...
 Critical performance specs you should examine
   Sustained Data Transfer Rate
   Rotational speed: 5400rpm, 7200rpm, 10,000 rpm
   Areal density (max capacity in that product family)
   Seek time (actually 3 numbers: average, track-to-track, full stroke)

Transfer Rates

 The important number is the sustained data transfer rate (aka “disk to buffer” rate)
   Typically around 70-100 MB/s; slower for laptop drives
 Much less important: The I/O transfer rate
   At least for hard drives, whether you are using SATA I's 1.5 Gb/s or SATA II's 3.0 Gb/s won't matter except for rare cases when transferring data out of the track buffer
   SSD's might be a different story, of course...

Short stroking hard drives

 HDD performance is not uniform across the platter
   Up to 100% performance improvements on the “outer edge” of the disk
 Consider partitioning your disk to take this into account!
   If you don't need the full 1TB of space, partitioning your disk to only use the first 100GB or 300GB could speed things up!
 Also – when running benchmarks, use the same partitions for each file system tested.

What about SSD's?

 Advantages of SSD's
   Fast random access reads
   Fails usually when writing, not when reading
   Less susceptible to mechanical shock/vibration
   Most SSD's use less power than HDD's
 Disadvantages of SSD's
   Cost per GB much more expensive
   Limited number of write cycles
   Writes are slower than reads; random writes can be much slower (up to a ½ sec average, 2 sec worst case for 4k random writes for really bad SSD's!)

Getting the right SSD is important

 A really good website that goes into great detail about this is Anand Tech
   http://www.anandtech.com/storage/
 Many of the OEM SSD's included in laptops are not the good SSD's, and you then pay the OEM markup to add insult to injury.

Should you use SSD's?

 For laptops and desktops, absolutely!
 For servers, it depends...
   If you need fast random access reads, yes!
   If you care about power consumption, be careful
     When idle, SSD's save only 0.2 to 0.4 Watts
     When active, SSD's use roughly the same power as a 5400rpm 2.5” drive and save 3W or so compared to high performance 3.5” drives
   For certain workloads, the write endurance problem of SSD's may be a strong concern

PCIe attached flash

 Like SSD's, only more so
   Speed achieved by writing to large numbers of flash chips in parallel
   Potentially 100k to 1M 4k random reads / second
   Synchronous 4k random writes just as slow as SSD's
 Very expensive, but the price is starting to drop
   In some cases, they can be cost effective
   1 server with PCIe attached flash could replace several servers with HDD's/SSD's in some cases

RAID

 Redundant Array of Inexpensive Disks
   RAID 0 – Striping
   RAID 1 – Mirroring
   RAID 5 – 3 or more disks, with a rotating parity stripe
   RAID 6 – 4 or more disks, with two rotating parity stripes
   RAID 10 – Mirroring + striping

RAID tuning considerations

 Adding more spindles improves performance
 RAID 5/6 requires some special care
   Writes smaller than N * stripe size will require a read/modify/write cycle in order to update the parity stripe (where N is the number of non-spare disks)
   If the RAID device is going to be broken up using LVM or partitions, make sure the LV/partition is aligned on a full stripe boundary

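The full-stripe arithmetic above can be sketched in a few lines of shell. The function names are hypothetical, sizes are taken in KiB, and N is used exactly as the slide defines it; the 4-disk, 64 KiB figures in the comments are made-up illustration values, not measurements.

```shell
#!/bin/sh
# full_stripe_kib N CHUNK_KIB: size of a full stripe (N * per-disk stripe size).
full_stripe_kib() {
    echo $(( $1 * $2 ))
}

# is_stripe_aligned OFFSET_KIB N CHUNK_KIB: succeeds if OFFSET_KIB is a
# multiple of the full stripe size, i.e. the partition or LV starts on a
# full stripe boundary.
is_stripe_aligned() {
    [ $(( $1 % ($2 * $3) )) -eq 0 ]
}

full_stripe_kib 4 64    # hypothetical array: N = 4, 64 KiB per disk; prints 256
```

With those hypothetical values, "is_stripe_aligned 1024 4 64" succeeds, while a partition starting at an offset of 32 KiB would fail the check and force read/modify/write cycles on otherwise full-stripe writes.
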
Filesystem Tuning

 Most general purpose file systems work quite well for most workloads
 But some file systems are better for certain specialized workloads
   Reiserfs – small (< 4k) files
   XFS – very big RAID arrays, very large files
 Ext3 is a good general purpose filesystem that many people use by default
 Ext4 will be better at RAID, larger files, while still working well on small-medium sized files

Managing Access-time Updates

 POSIX requires that a file's last access time is updated each time its contents are accessed.
   This means a disk write for every single read
 The mount options noatime and relatime can reduce this overhead.
   The relatime option will only update the atime if mtime or ctime is newer than the last atime.
   Only saves approximately half the writes compared to noatime
 Some applications do depend on atime being updated

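The relatime rule can be written down as a small predicate. This sketch reads the condition as "either mtime or ctime is newer than atime", which is how relatime is commonly described; the function name is hypothetical, and real kernels may apply additional conditions not covered here.

```shell
#!/bin/sh
# relatime_updates ATIME MTIME CTIME (all in seconds since the epoch):
# succeeds if a read would update atime under the simplified relatime rule,
# i.e. when mtime or ctime is newer than the recorded atime.
relatime_updates() {
    [ "$2" -gt "$1" ] || [ "$3" -gt "$1" ]
}
```

With GNU stat, the three timestamps of a file can be fed in directly, for example: relatime_updates $(stat -c '%X %Y %Z' somefile) && echo "next read updates atime" (somefile is a placeholder).
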
Tuning ext3/ext4 journals

 Sometimes increasing the journal size can help; especially if your workload is very metadata-intensive (lots of small files; lots of file creates/deletes/renames)
 Journal data modes
   data=ordered (default) – data is written first before metadata is committed
   data=journal – data is written into the journal
   data=writeback – only metadata is logged; after a crash, uninitialized data can appear in newly allocated data blocks

Using ionice to control read/write priorities

 Like the nice command but affects the priority of read/write requests issued by the process
 Three scheduling classes
   Idle – only if there are no other high priority requests pending
   Best-effort – requests served round-robin (default)
   Real time – highest priority request always gets access
 For best-effort and real time classes, there are 8 priorities, with 0 being the highest priority and 7 the lowest priority

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

Network Tuning

 Before you do anything else... check the basic health of the network
   Speed, duplex, errors
   Tools: ethtool, ifconfig, ping
   Check TCP throughput: ttcp or nttcp
   Look for “weird stuff” using wireshark / tcpdump
 Network is a shared resource
   Who else is using it?
   What are bottlenecks in the network topology?

Latency vs Throughput

 Latency
   When applications need maximum responsiveness
   Lockstep protocols (i.e., no sliding window optimizations)
   RPC-based protocols
 Throughput
   When transferring large data sets
 Very often tuning efforts will trade off latency for throughput or vice versa

Interrupt Coalescing

 This reduces CPU load by amortizing the cost of an interrupt over multiple packets; this allows us to trade off latency for throughput
 “ethtool -C ethX rx-usecs 80 rx-frames 20”
   This will delay a receive interrupt for 80 µs or until 20 packets are received, whichever comes first
 “ethtool -C ethX rx-usecs 0 rx-frames 1”
   This will cause an interrupt to be sent for every packet received
 Different NIC's will have different defaults and may have additional tuning parameters

Enable NIC optimizations

 Some device drivers don't enable these features by default
 You can check using “ethtool -k eth0”
 TCP segment offload
   “ethtool -K eth0 tso on”
 Checksum off-load
   “ethtool -K eth0 tx on rx on”
 Large Receive offload (for throughput)
   “ethtool -K eth0 lro on”

The bandwidth-delay product

 Very important when optimizing for throughput, especially for high speed, long distance links
 Represents the amount of data that can be “in flight” at any particular point in time.
 BDP = 2 * bandwidth * delay
 BDP = bandwidth * Round Trip Time (RTT)
   example:
   (100 Mbits/sec / 8 bits/byte) * 50 ms ping time = 625 kbytes

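The BDP formula is easy to script for other links. A minimal sketch, assuming the bandwidth is given in Mbit/s and the round trip time in milliseconds (the function name bdp_bytes is hypothetical):

```shell
#!/bin/sh
# bdp_bytes MBITS_PER_SEC RTT_MS: bandwidth-delay product in bytes.
bdp_bytes() {
    # Mbit/s -> bytes/s, then multiply by the round trip time in seconds.
    echo $(( $1 * 1000000 / 8 * $2 / 1000 ))
}

bdp_bytes 100 50    # the example above: 100 Mbit/s, 50 ms ping time; prints 625000
```

The RTT can be taken from the average reported by ping to the far end of the link.
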
Why the BDP matters

 TCP has to be able to retransmit any dropped packets; so the kernel has to remember what data has been sent in case it needs to retransmit it.
   TCP Window
 Limits on the size of the TCP window to control kernel memory consumed by the networking stack

Using the BDP

 The BDP in bytes plus some overhead room should be used as [wmax] below when setting these parameters in /etc/sysctl.conf:
   net.core.rmem_max = [wmax]
     Maximum Socket Receive Buffer size
   net.core.wmem_max = [wmax]
     Maximum Socket Send Buffer size
 net.core.rmem_max also known as /proc/sys/net/core/rmem_max
   e.g., set via “echo 2097152 > /proc/sys/net/core/rmem_max”

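Turning a BDP into the two sysctl.conf lines can be sketched as follows. The function name bdp_sysctl is hypothetical, and the 25% default for "some overhead room" is an assumption chosen for illustration, not a recommendation from the slides.

```shell
#!/bin/sh
# bdp_sysctl BDP_BYTES [HEADROOM_PERCENT]: print sysctl.conf lines using
# [wmax] = BDP plus headroom. The 25% default headroom is an assumption.
bdp_sysctl() {
    wmax=$(( $1 + $1 * ${2:-25} / 100 ))
    echo "net.core.rmem_max = $wmax"
    echo "net.core.wmem_max = $wmax"
}

bdp_sysctl 625000    # BDP of the 100 Mbit/s, 50 ms example
```

The output can be reviewed, appended to /etc/sysctl.conf, and loaded with "sysctl -p".
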
Per-socket /etc/sysctl.conf settings

 net.ipv4.tcp_rmem = [wmin] [wstd] [wmax]
   receive buffer sizing in bytes (per socket)
 net.ipv4.tcp_wmem = [wmin] [wstd] [wmax]
   memory reserved for send buffers in bytes (per socket)
 Modern kernels do automatic tuning of the receive and send buffers; and the defaults are better; still if your BDP is very high, you may need to boost [wstd] and [wmax]. Keep [wmin] small for out-of-memory situations.

For large numbers of TCP connections

 net.ipv4.tcp_mem = [pmin] [pdef] [pmax]
   pages allowed to be used by TCP (for all sockets)
 For 32-bit x86 systems, kernel text & data (including TCP buffers) can only be in the low 896MB.
   So on 32-bit x86 systems, do not adjust these numbers, since they are needed to balance memory usage with other Lowmem users.
   If this is a problem, best bet is to switch to a 64-bit x86 system first.

Increase transmit queue length

 The ethernet default of 100 is good for most networks and where we need to balance interactive responsiveness with large transfers
 However, for high speed networks and bulk transfer, this needs to be increased to some value between 1000-50000
   “ifconfig eth0 txqueuelen 2000”
 Tradeoffs: more kernel memory used; interactive response may be impacted.
   Experiment with ttcp to find the smallest value that works for your network/application.

Optimizing for Low Latency TCP

 This can be very painful, because TCP is not really designed for low latency applications.
   TCP is engineered to worry about congestion control on wide-area networks, and to optimize for throughput on large data streams.
 If you are writing your own application from scratch, basing your own protocol on UDP is often a better bet.
   Do you really need a byte-oriented service?
   Do you only need automatic retransmission to deal with lost packets?

Nagle Algorithm

 Goal: To make networking more efficient by batching small writes into a bigger packet
   When the OS gets a small amount of data (a single keystroke in a telnet connection), delay a very small amount of time to see if more bytes will be coming.
 This naturally increases latency!
 Disabling it requires an application-level change
   int on = 1;
   setsockopt (sockfd, SOL_TCP, TCP_NODELAY, &on, sizeof (on));

Delayed Acknowledgements

 On the receiver end, wait a small amount of time before sending a bare acknowledgement to see if there's more data coming (or if the program will send a response upon which you can piggy-back your acknowledgement)
 This can interact with TCP slow-start to cause longer latencies when the send window is initially small.
   After congestion or after the TCP connection has been idle, the send window (maximum bytes of unack'ed data) must be set down to the MSS value

Solving the Delayed Ack problem

 Disable slow-start algorithm on the sender?
   Slow-start is a MUST implement (RFC 2581)
 Disable delayed acknowledgments on the receiver?
   Delayed acknowledgments is a SHOULD (RFC 2581)
   Some OS's have a way of disabling delayed acknowledgments; Linux does not
   There is a hack that works on a per-packet basis, though...

Enabling QUICKACK

 Linux tries to be “clever” and automatically figure out when to disable delayed acknowledgments when it believes the other side is in slow start.
 Hack to force “quickack” mode:
   int on = 1;
   setsockopt (sockfd, SOL_TCP, TCP_QUICKACK, &on, sizeof (on));
 But QUICKACK mode is disabled once the other side is done with slow start. So you have to re-enable it any time the connection is idle for longer than the retransmission time.

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

NFS Performance tuning

 Optimize both your network and your filesystem
   In addition, various client and server specific settings that we'll discuss now
 General hint: use dedicated NFS servers
   NFS file serving uses all parts of your system: CPU time, memory, disk bandwidth, network bandwidth, PCI bus bandwidth
   Trying to run applications on your NFS servers will make both NFS and the apps run slowly

Tuning a NFS Server

 If you only export file system mountpoints, use the no_subtree_check option in /etc/exports
   Subtree checking can burn large amounts of CPU for metadata intensive workloads
 Bump up the number of NFS threads to a large number (it doesn't hurt that much to have too many). Say, 128... (instead of 4 or 8 which is way too little). How to do this is distro-specific:
   /etc/sysconfig/nfs
   /etc/default/nfs-kernel-server

PCI Bus tuning

 NFS serving puts heavy demands on both networking cards and host bus adapters
 If you have a system with multiple PCI buses, put the networking and storage cards on different buses
   Network cards tend to use lots of small DMA transfers, which tends to hog the bus

NFS client tuning

 Make sure you use NFSv3 and not NFSv2
 Make sure you use TCP and not UDP
 Use the largest rsize/wsize that the client/server kernels support
   Modern client/servers can do a megabyte at a time
 Use the hard mount option, and not soft
 Use intr so you can recover if an NFS server is down
 All of these are the default except for intr
   Remove outdated fstab mount options. Just use “rw,intr”

Tuning your network config for NFS

 Tune the network for bulk transfers (throughput)
 Use the largest MTU size you can
   For ethernets, consider using jumbo frames if all of the intervening switches/routers support it

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

Memory Tuning

 Memory tuning problems can often look like other problems
   Unneeded I/O caused by excessive paging/swapping
   Extra CPU time caused by cache/TLB thrashing
   Extra CPU time caused by NUMA-induced memory access latencies
 These subtleties require using more sophisticated performance measurement tools

To measure swapping activity

 The top(1) and free(1) commands will both tell you if any swap space is in use
   To a first approximation, if there is any swap in use, the system can be made faster by adding more RAM.
 To see current swap activity, use the sar(1) program
   First use of a very handy (and rather complicated) system activity recorder program; reading through the man page strongly recommended
   Part of the sysstat package

Using sar to obtain swapping information

 Use “sar -W <interval> [<num. of samples>]”
 Reports number of pages written (swapped out) and read (swapped in) from the page device per second.
 The first output is the average since system was started.

Optimizing swapping

 Use multiple swap devices
 Use fast swap devices
   Fast devices can be given a higher priority
 Add more memory to avoid swapping in the first place

Swapping vs. Paging

 Swap used for anonymous pages
   i.e., pages which are not backed by a file
 Pages which are backed by a file are subject to paging
   If they have been modified, or made dirty, they are “cleaned” by being written to their backing store
   If a page has not been used recently, it is “deactivated” by removing it from processes' page tables
   Clean and inactive pages may be repurposed for other uses on an LRU basis

Optimizing Paging

 Unlike swapping, some amount of paging is normal – and unavoidable
   So we can't just manage the amount of paging to zero, like we can with swapping
   Goal: to minimize amount of paging in the steady-state case
 Key statistics:
   majflt/s – major faults (which result in I/O) / second
   pgsteal/s – pages reclaimed from the page and swap cache / second to satisfy memory demands

Using sar to obtain information about paging

 Use “sar -B <interval> [<num. of samples>]”
 Reports many statistics
   pgpgin/s, pgpgout/s – ignore, not useful/misleading
   fault/s – # of page faults / sec.
   majflt/s – # of page faults that result in I/O / sec.
   pgfree/s – # of pages placed on the free list / sec.
   pgscank/s – # of pages scanned by kswapd / sec.
   pgscand/s – # of pages scanned directly / sec.
   pgsteal/s – # of pages reclaimed from cache / sec.
   %vmeff – pgsteal/s / (pgscank/s + pgscand/s)

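The %vmeff ratio can be recomputed from the other three columns, which is handy when working from logged data. A minimal sketch (the function name vmeff is hypothetical; the result is expressed as a percentage, and a zero scan rate is reported as 0.00 by assumption):

```shell
#!/bin/sh
# vmeff PGSTEAL PGSCANK PGSCAND: page reclaim efficiency as a percentage,
# i.e. pgsteal/s / (pgscank/s + pgscand/s) * 100.
vmeff() {
    awk -v steal="$1" -v scank="$2" -v scand="$3" 'BEGIN {
        scanned = scank + scand
        if (scanned == 0) { print "0.00"; exit }   # nothing scanned, nothing to report
        printf "%.2f\n", steal / scanned * 100
    }'
}
```

For instance, "vmeff 900 800 200" (made-up rates) reports 90.00: of every 1000 pages scanned, 900 were actually reclaimed. A value well below 100 suggests the kernel is scanning many pages it cannot free.
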
Other ways of finding information about memory utilization

 cat /proc/meminfo
   Something especially important on 32-bit x86 kernels: Low Memory vs. High Memory
   Documentation/filesystems/proc.txt
 cat /proc/slabinfo
   Useful for seeing how the kernel is using memory
 ALT-sysrq-m (or 'echo m > /proc/sysrq-trigger')
 Different for different kernel versions and distributions; /proc/slabinfo may not exist if CONFIG_SLUB used and not CONFIG_SLAB

/proc/meminfo

Interesting bits from sysrq-m

 Per-zone statistics

About Memory Caches

 2GHz processor → 2 billion cycles per second
 Memory is much slower
 Solution: use small amounts of fast cache memory
   Typically 32KB very fast Level 1 cache
   Maybe 4-8MB of somewhat slower Level 2 cache
 Can see how much cache you have using dmidecode and x86info
 Not much tuning that can be done except by improving the C/C++ program code

TLB Caches

 The Translation Lookaside Buffer (TLB) cache speeds up translation from a virtual address to a physical address
   Normally requires 2-3 lookups in the page tables
   TLB cache short circuits this lookup process
 The x86info program will show the TLB cache layout
 Hugepages are a way to avoid consuming too many TLB cache entries

Using hugepages

 Build a kernel that avoids using modules
   The core kernel text segment uses huge pages; modules do not
 Modify an application to use hugepages (or configure an application to use it if it already has provision to use hugepages).
   “mount -t hugetlbfs none /hugepages” then mmap pages in /hugepages
   On new qemu/kvm, you can use the option “-mem-path /hugepages”
   Use shmget(2) with the flag SHM_HUGETLB

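Before pointing an application at hugepages it helps to see whether any are available. The sketch below summarizes the hugepage counters from /proc/meminfo-formatted input; the function name hugepage_status and the output format are assumptions made for illustration.

```shell
#!/bin/sh
# hugepage_status: summarize hugepage counters from meminfo-formatted
# input on stdin (field names as they appear in /proc/meminfo).
hugepage_status() {
    awk '
        $1 == "HugePages_Total:" { total = $2 }
        $1 == "HugePages_Free:"  { free = $2 }
        $1 == "Hugepagesize:"    { size_kb = $2 }
        END { printf "total=%d free=%d size_kb=%d pool_mb=%d\n",
                     total, free, size_kb, total * size_kb / 1024 }
    '
}
```

Run it as "hugepage_status < /proc/meminfo". A total of 0 means no hugepages have been set aside yet, so mmap in a hugetlbfs mount or shmget with SHM_HUGETLB would be expected to fail.
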
Configuring hugepages

 On most enterprise distro's this must be done at boot time or shortly after it
   Kernel boot option “hugepages=nnn”
   /etc/sysctl.conf: “vm.nr_hugepages=nnn”
   These pages are reserved for hugepages and can not be used for anything else
 With kernels newer than 2.6.23, things are more flexible
   Kernel boot option “movablecore=nnn[KMG]”
   Memory reserved this way can be used for hugepages and other uses

Agenda

 Introduction to Performance Tuning
 Filesystem and storage tuning
 Network tuning
 NFS performance tuning
 Memory tuning
 Application tuning

Application Tuning

 Access to the source code?
   Open source vs. Proprietary
 Ability/willingness to modify the code?
   Even if it's open source, you might not want to modify the code
 Proprietary programs
   Read the documentation; find the knobs and find the application-level statistics you can gather
   … but there are still some tricks we can do to figure out what is going on when you don't have the source...

A quick aside: Java Performance Tuning

 I'm not a Java programmer.... but I've worked with a lot of Java performance tuning experts
 First thing to consider is Garbage Collection
   The GC is overhead that burns CPU time
   GC can cause unpredictable pauses in the program
   Collecting GC stats: JVM command-line option -verbose:gc
 Sizing the heap
   Larger heap means fewer GC's
   … but more time spent GC'ing when you do

Generational GC

 Observation: objects in Java have a high infant mortality rate
   Temporary objects, etc.
 So put them in separate arenas.
   An object starts in the nursery (aka eden) space. The nursery is GC'ed more frequently.
   Objects which survive a certain number of GC passes get promoted from the nursery to a tenured space (which is GC'ed less frequently)
 Need to configure the size of the nursery and tenured space

Reducing GC's by not creating as much Garbage

 Requires being able to modify the code
 Very often, though, Java programmers can make extra work for the Java Run-time Environment without realizing it
 Two common examples
   Using String and Integer class variables to do calculations (instead of StringBuffer and the primitive int type)
   Using java.util.Map instead of creating a Class

Back to C/C++ applications

 Tools for investigating applications
   strace/ltrace
   valgrind
   gprof
   oprofile
   perf
 Most of these tools work better if you have source access
   But sometimes source is not absolutely required

strace and ltrace

 Useful for seeing what the application is doing
   Especially useful when you don't have source
   System call tracing: strace
   Shared library tracing: ltrace
 Run a new command with tracing:
   strace /bin/ls /usr
 Attach to an already existing process
   ltrace -p 12345

Valgrind

 Used for finding memory leaks and other memory access bugs
 Best used with source access (compiled with -g); but not strictly necessary
 Works by emulating x86 in x86 and adding checks to pointer references and malloc/free calls
   Other architectures supported
 Commercial alternative: purify (uses object code insertion)

C/C++ profiling using gprof

 To use, compile your code using the -pg option
   This will add code to the compiled binary to track each function call and its caller
   In addition the program counter is sampled by the kernel at some regular interval (e.g., 100Hz or 1kHz) to find the “hot spots”
 Demo time!

System profiling using oprofile

 Basic operation very similar to gprof
   Sample the program counter at regular intervals
 Advantages over gprof
   Does not require recompiling application with -pg
   Can profile multiple processes and the kernel all at the same time
 Demo time!

Perf: the next generation

 Originally intended to be a way to access performance counters
 Added the ability to sample kernel tracepoints
 Sampling can be restricted to a process, a process and its children, or the whole system
 With perf record/report/perf annotate performance events can be tied to specific C/C++ lines of code (with source files and object files compiled with -g)
 Demo time!

Userspace Locking

 One other application issue which can be a very big deal: userspace locking
 Rip out fancy multi-level locking (i.e., user-space spinlocks, sched_yield() calls, etc.)
 Just use pthread mutexes, and be happy
   Linux implements pthread mutexes using the futex(2) system call. Avoids kernel context switch except in the contended case
   The fast path really is fast! (So no need for fancy/complex multi-level locking – just rip it out)

Processor Affinity

 Rarely a good idea... but can be used to improve response time for critical tasks
 Set CPU affinity for tasks using taskset(1)
 Set CPU affinity for interrupt handlers using /proc/irq/<nn>/smp_affinity
 Strategies
   Put producer/consumer processes on the same CPU
   Move interrupt handlers to a different CPU
   Use mpstat(1) and /proc/interrupts to get processor-related statistics

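Both taskset(1) and /proc/irq/<nn>/smp_affinity take a hexadecimal CPU bitmask, which is easy to get wrong by hand. A small sketch that builds the mask from a list of CPU numbers (the function name cpu_mask is hypothetical):

```shell
#!/bin/sh
# cpu_mask CPU...: hexadecimal affinity mask with one bit set per listed
# CPU number (CPU 0 is the lowest bit).
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 2    # CPUs 0 and 2; prints 5
```

The result could then be used as in "taskset -p $(cpu_mask 0 2) 12345" for a process with the placeholder PID 12345, or written as root to /proc/irq/<nn>/smp_affinity to steer an interrupt handler.
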
Conclusion

 Performance tuning is fractal
   There's always more to tweak
   “It's more addictive than pistachios!”
   Understanding when to stop
 Great way of learning more up and down the technology stack – from the CPU chip up through to the OS to the application tuning
