Code Tuning
The next problem arises in applications ranging from set representation to coding theory.

Problem Two--Counting Bits. Given a sequence of one million 32-bit words, tell how many one bits are in each word.

The obvious program uses a loop to do 32 shifts and logical ANDs; can you think of a better way?

The principle is the same as before: a table whose Ith entry contains the number of one bits in the binary representation of I. Unfortunately, most computers don't have enough room to store a table with 2^32 entries, and the time to initialize the four billion entries would also be prohibitive. We can get around this by trading a few additional operations for a smaller table. We'll set up a table giving the counts for all eight-bit bytes, and then answer the 32-bit question by summing the answers to four eight-bit questions. The count table is initialized with two loops that have the effect of the following assignment statements (see Problem 2).

    CountTable[0] := 0
    CountTable[1] := 1
    CountTable[2] := 1
    CountTable[3] := 2
    ...

One could then count the one bits in all bytes of word W with a loop, but it is just as easy and probably a little faster to unroll the loop:

    WordCount := CountTable[W and 11111111B]
               + CountTable[(W rshift 8)  and 11111111B]
               + CountTable[(W rshift 16) and 11111111B]
               + CountTable[(W rshift 24) and 11111111B]

(The code isolates the bytes by shifting and then "ANDing" off all but the eight low-order bits; this operation is language- and machine-dependent.) While the original solution would use over a hundred machine instructions, the above approach can usually be implemented in about a dozen instructions; it is not uncommon for this change to result (again) in a speedup of an order of magnitude.
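In a modern language the whole scheme is only a few lines. The following C sketch is mine, not the column's: the names count_table and word_count are invented, and the initialization shown is one loop that has the same effect as the assignments above.

    #include <stdio.h>
    #include <stdint.h>

    static unsigned char count_table[256];  /* one bits in each byte value */

    /* Initialize the table: i has one more bit set than i/2
       exactly when i is odd. */
    static void init_count_table(void)
    {
        for (int i = 1; i < 256; i++)
            count_table[i] = count_table[i / 2] + (i & 1);
    }

    /* Four byte lookups replace 32 shift-and-test steps. */
    static int word_count(uint32_t w)
    {
        return count_table[w         & 0xFF]
             + count_table[(w >> 8)  & 0xFF]
             + count_table[(w >> 16) & 0xFF]
             + count_table[(w >> 24) & 0xFF];
    }

    int main(void)
    {
        init_count_table();
        printf("%d\n", word_count(0xFFFFFFFFu));  /* prints 32 */
        printf("%d\n", word_count(0x00000007u));  /* prints 3 */
        return 0;
    }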
Tuning the Federal Government's COBOL Code

Readers interested in the role of code tuning in large data processing shops should see the General Accounting Office report entitled "Improving COBOL Applications Can Recover Significant Computer Resources" (dated 1 April 1982, it is available from the National Technical Information Service, Springfield, Virginia 22161 with order code PB82-198540). Tuning several application systems within the Department of Housing and Urban Development resulted in the following.

    Percent CPU     Per-Year         Total Cost
    Time Reduced    Dollar Savings   To Optimize
        82            $37,400          $5,500
        30              4,400           2,400
        19              9,000             900
        45             45,000           1,200
         9              7,000           9,000

The optimizations cost a total of $19,000; over the four-year minimum life of the systems, they will save almost $400,000.

In an Army application, one staff-day of optimization reduced the typical elapsed time of a program from 7.5 hours to less than two hours; this change saved $3,500 in its first year and resulted in the output being delivered to the user more reliably.

The report warns that code tuning should not be done haphazardly; other considerations such as correctness and maintainability must be given their rightful high priority. It points out that there is usually a point of diminishing returns in tuning either an individual program or a set of programs on a system: work beyond that point will be very difficult and have little positive impact.

The recommendations in the report include the following: "Heads of Federal agencies should require periodic review of the machine resource consumption of COBOL applications at their installations, and, where feasible, require action to reduce the consumption of the expensive applications."
The final problem is typical of applications that deal with geographic and geometric data.

Problem Three--Computing Spherical Distances. The first part of the input is a set S of 5,000 points on the surface of a globe; each point is represented by its latitude and longitude. After those points are stored in a data structure of our choice, the program reads the second part of the input: a sequence of 20,000 points, each represented by latitude and longitude. For every point in that sequence, the program must tell which point in S is closest to it, where distance is measured as the angle between the rays from the center of the globe to the two points.
Wright's original program represented the set S by an array of latitude and longitude values. The nearest neighbor to each point in the sequence was found by calculating its distance to every point in S using a complicated trigonometric formula involving 10 sine and cosine functions. While the program was simple to code and produced fine maps for small data sets, it required several hours of mainframe time to produce large maps, which was well beyond the budget.

Because I had previously worked on geometric problems, Wright asked me to try my hand at this problem. After spending much of a weekend on it, I developed several fancy algorithms and data structures for solving it. Fortunately (in retrospect), each would have required many hundreds of lines of code, so I didn't try to program any of them. When I described the data structures to Andrew Appel of Carnegie-Mellon University, he had a key insight: rather than approaching the problem at the level of data structures, why not use the simple data structure of keeping the points in an array, but tune the code to reduce the cost of computing the distance between points? Any ideas?
The cost can be greatly reduced by changing the representation of points: rather than using latitudes and longitudes, we'll represent a point's location on the surface of the globe by its x, y and z coordinates. Thus the data structure is an array that holds each point's latitude and longitude (which may be needed for other operations) as well as its three Cartesian coordinates. As each point in the sequence is processed, a few trigonometric functions translate its latitude and longitude into x, y and z coordinates, and we then compute its distance to every point in S. Its distance to a point in S is computed as the sum of the squares of the differences in the three dimensions, which is usually cheaper than computing one trigonometric function, let alone 10. This method computes the correct answer because the angle between two points increases monotonically with the square of their Euclidean distance.
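The following C sketch (mine, not Wright's program; the names Point, to_cartesian, dist2, and nearest are invented) shows the shape of the tuned computation: the trigonometry runs once per point, and the inner loop uses only subtraction, multiplication, and addition.

    #include <math.h>
    #include <stdio.h>

    typedef struct {
        double lat, lon;   /* kept for other operations (radians here) */
        double x, y, z;    /* augmented Cartesian coordinates */
    } Point;

    /* The few trigonometric calls happen once per point. */
    static void to_cartesian(Point *p)
    {
        p->x = cos(p->lat) * cos(p->lon);
        p->y = cos(p->lat) * sin(p->lon);
        p->z = sin(p->lat);
    }

    /* Squared Euclidean distance: no trigonometric functions at all. */
    static double dist2(const Point *a, const Point *b)
    {
        double dx = a->x - b->x, dy = a->y - b->y, dz = a->z - b->z;
        return dx*dx + dy*dy + dz*dz;
    }

    /* Index of the point of s[0..n-1] nearest to q. */
    static int nearest(const Point *q, const Point s[], int n)
    {
        int best = 0;
        double bestd = dist2(q, &s[0]);
        for (int i = 1; i < n; i++) {
            double d = dist2(q, &s[i]);
            if (d < bestd) { bestd = d; best = i; }
        }
        return best;
    }

    int main(void)
    {
        Point s[3] = { {0.0, 0.0}, {0.5, 1.0}, {-0.3, 2.0} };  /* set S */
        Point q = {0.4, 1.1};                                  /* query */
        for (int i = 0; i < 3; i++) to_cartesian(&s[i]);
        to_cartesian(&q);
        printf("nearest: %d\n", nearest(&q, s, 3));  /* prints 1 */
        return 0;
    }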
Although this approach does require additional storage, it yields substantial benefits. When Wright incorporated the change into her program, the run time for complicated maps was reduced from several hours to less than half a minute. In this case, code tuning solved the problem with a couple of dozen lines of code, while algorithmic and data structure changes would have required many hundreds of lines.

Major Surgery--Binary Search
We'll turn from first aid to major surgery on a program that is one of the most subtle examples I know of code tuning. The problem was given in the December 1983 column: perform a binary search in a table of 1,000 integers. As we go through this exercise, keep in mind that code tuning is usually not needed in this program--the binary search algorithm is so efficient that code tuning is often superfluous. Therefore, in the December 1983 column we ignored efficiency entirely (at the level of code tuning) and concentrated on achieving a simple, correct, and maintainable program. Every now and then, though, the tuned version can make a big difference in a system.
The History of Binary Search

Several readers of the September 1983 column pointed out that the performance of its binary search could be improved. David Gries of Cornell and Ed Cohen of Prime Computer, Inc. both observed that the three-way test of that code could be replaced by the two-way test in code like the first new binary search in this column. Denis Wilson of the University of Aberdeen wrote that in the early 1970s he and the late Iain Houstoun were able to remove an increment or decrement from each iteration of the loop in the September column by performing a single extra division.

Knuth describes many techniques for increasing the performance of binary search programs in Section 6.2.1 of his Sorting and Searching, including all the techniques I used in this column. His techniques include changes at the source code level as well as methods for efficient assembly coding. Knuth found that replacing the ternary comparison by a binary comparison was effective in MIX only for very large values of N--he cites a number over 64 billion--while on some machines I studied it was effective for N less than a dozen (see his Exercises 11 and 23 and pp. 127-129 of Writing Efficient Programs).

Even though it is not particularly well suited to implementation on Knuth's MIX computer, the loop-unrolled binary search program has been used quite profitably in many different contexts. I first learned of the code from Guy Steele of Tartan Laboratories; he reports that it was widely used in the ITS operating system on a Digital Equipment Corporation PDP-10 at MIT in the late 1960s (each comparison compiled into two instructions, one-and-a-half of which were executed on the average). I later found that Vic Vyssotsky of AT&T Bell Laboratories had used exactly this code on an IBM 7090 in 1961 to achieve a binary search in which each comparison compiled into three machine instructions.

I'd like to find out more from the readers of Communications about the history of efficient binary search programs. Could you please communicate to me any history or efficiency techniques you know beyond those in Section 6.2.1 of Knuth's third volume?
We'll develop the fast binary search in a sequence of four programs. They are subtle, but there is good reason to follow them closely: the final program is usually two or three times faster than the binary search we saw in the December 1983 column, which was

    L := 1; U := N
    loop
        /* invariant: if T is in X, it is in X[L..U] */
        if L > U then
            P := 0; break
        M := (L+U) div 2
        case
            X[M] < T:  L := M+1
            X[M] = T:  P := M; break
            X[M] > T:  U := M-1
    endloop

Before reading on, can you spot any obvious waste in that code?

Our development of the fast binary search will start with the modified problem of locating the first occurrence of the integer T in the integer array X[1..N] (the above code might return any one of multiple occurrences of T). The main loop of the program is similar to the one above; we'll keep indices L and U into the array that bracket the position of T, but we'll use the invariant relation that X[L] < T, X[U] >= T, and L < U. We'll assume that X[0] < T and that X[N+1] >= T, but the program will never access those extreme elements. Starting with that invariant, the first program is easy to write.

    L := 0; U := N+1
    while L+1 # U do
        /* invariant: X[L] < T and X[U] >= T and L < U */
        M := (L+U) div 2
        if X[M] < T then
            L := M
        else
            U := M
    /* assert L+1 = U and X[L] < T and X[U] >= T */
    P := U
    if P > N or X[P] # T then P := 0

The first statement initializes the invariant. As the loop is repeated, the invariant is maintained by the if statement; it's easy to check that both branches maintain the invariant. Upon termination we know that if T is anywhere in the array, then its first occurrence is in position U (that fact is stated more formally in the assert comment). The final two statements set P to the index of the first occurrence of T in X if it is present, and to zero if it is not present.
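Here is a transcription of that first program into C (my rendering, not code from the column); it keeps the convention that x[0] and x[n+1] conceptually bracket the data but are never accessed.

    #include <stdio.h>

    /* Return the index of the first occurrence of t in the sorted
       array x[1..n], or 0 if t is not present. */
    static int bsearch_first(const int x[], int n, int t)
    {
        int l = 0, u = n + 1;
        while (l + 1 != u) {
            /* invariant: x[l] < t and x[u] >= t (conceptually), l < u */
            int m = (l + u) / 2;
            if (x[m] < t)
                l = m;
            else
                u = m;
        }
        int p = u;
        if (p > n || x[p] != t)
            p = 0;
        return p;
    }

    int main(void)
    {
        /* x[0] is a placeholder; x[11] is reserved but never read */
        int x[12] = {0, 1, 2, 2, 2, 3, 5, 8, 8, 9, 9, 0};
        printf("%d\n", bsearch_first(x, 10, 2));  /* prints 2 */
        printf("%d\n", bsearch_first(x, 10, 4));  /* prints 0 */
        return 0;
    }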
While this binary search solves a more difficult problem than the previous program, it is potentially more efficient: it makes only one comparison of T to an element of X in each iteration of the loop. The previous program sometimes had to test two such outcomes.

The next version of the program uses a different representation of a range: instead of representing L..U by its lower and upper values, we'll represent it by its lower value L and an increment I such that L + I = U. The code will ensure that I is at all times a power of two; this property is easy to keep once we have it, but it is hard to get originally (recall that the size of the array is N = 1000). The main body of the program is therefore preceded by an assignment and if statement to ensure that the range being searched is initially of size 512, the largest power of two less than 1000; thus L and L+I are either 0..512 or 489..1001. Translating the previous program to this new representation of a range yields the following code.

    I := 512
    if X[512] >= T then
        L := 0
    else
        L := 1000+1-512
    while I # 1 do
        /* invariant: X[L] < T and X[L+I] >= T and I = 2^j */
        NextI := I div 2
        if X[L+NextI] < T then
            L := L+NextI; I := NextI
        else
            I := NextI
    /* assert I = 1 and X[L] < T and X[L+1] >= T */
    P := L+1
    if P > 1000 or X[P] # T then P := 0

The correctness proof of this program has exactly the same flow as the proof of the previous program. This code is usually slower than its predecessor, but it opens the door for future speedups.

The next program is a simplification of the above, incorporating some optimizations that a smart compiler might perform. The first if statement is simplified, the variable NextI is removed, and the assignments to NextI are removed from the inner if statement.

    I := 512; L := 0
    if X[512] < T then L := 1000+1-512
    while I # 1 do
        /* invariant: X[L] < T and X[L+I] >= T and I = 2^j */
        I := I div 2
        if X[L+I] < T then
            L := L+I
    /* assert I = 1 and X[L] < T and X[L+1] >= T */
    P := L+1
    if P > 1000 or X[P] # T then P := 0

Although the correctness argument for the code (still) has exactly the same structure, we can now understand its operation on a more intuitive level as well. When the first test fails (that is, when L stays zero), the program computes the bits of P in left-to-right order, most significant bit first.
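Transcribed into C, again with the column's assumption that the table is exactly x[1..1000] (the function name bsearch1000 is mine), the simplified program reads:

    #include <stdio.h>

    /* Power-of-two range search, specialized to n = 1000. */
    static int bsearch1000(const int x[], int t)
    {
        int l = 0, i = 512;
        if (x[512] < t)
            l = 1000 + 1 - 512;   /* l = 489, so l..l+512 covers 489..1001 */
        while (i != 1) {
            i /= 2;
            if (x[l + i] < t)
                l += i;
        }
        int p = l + 1;
        if (p > 1000 || x[p] != t)
            p = 0;
        return p;
    }

    int main(void)
    {
        static int x[1001];
        for (int i = 1; i <= 1000; i++)
            x[i] = 2 * i;                      /* sorted even numbers */
        printf("%d\n", bsearch1000(x, 1000));  /* prints 500 */
        printf("%d\n", bsearch1000(x, 999));   /* prints 0 */
        return 0;
    }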
if X[L+I6]<T then L:=L+16
The above discussion considers w h e t h e r and when to
if X[L+8]<T then L:=L+8
tune code; once we decide to do so, that still leaves the
if X[L+4]<T then L:=L+4
question of how. In my book Writing Efficient Programs I
if X[L+2]<T then L:=L+2
tried to answer that question with a list of general rules
/~ assert X[L]<T and X[L+2]>=T */
for code tuning. All the examples we've seen can be
if X[L+I]<T then L:=L+I
explained in terms of those principles; I'll do that now
/~ assert X[L]<T and X[L+I]>=T ~/
with the names of the rules in italics.
P:=L+I
if P>I000 or X[P]#T then P:=0 Van Wyk's Drawing Program. The general strategy of
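And a C transcription of the unrolled code (mine as well). Reading the nine ifs top to bottom shows the bit-building behavior described earlier: when l stays zero, each test contributes one bit of p, most significant first. Whether it buys the reported factor of two to three on a particular compiler and machine is something to measure, not assume.

    #include <stdio.h>

    /* Fully unrolled binary search, specialized to x[1..1000]. */
    static int bsearch1000_unrolled(const int x[], int t)
    {
        int l = 0;
        if (x[512] < t) l = 1000 + 1 - 512;
        if (x[l + 256] < t) l += 256;
        if (x[l + 128] < t) l += 128;
        if (x[l + 64]  < t) l += 64;
        if (x[l + 32]  < t) l += 32;
        if (x[l + 16]  < t) l += 16;
        if (x[l + 8]   < t) l += 8;
        if (x[l + 4]   < t) l += 4;
        if (x[l + 2]   < t) l += 2;
        if (x[l + 1]   < t) l += 1;
        int p = l + 1;
        if (p > 1000 || x[p] != t)
            p = 0;
        return p;
    }

    int main(void)
    {
        static int x[1001];
        for (int i = 1; i <= 1000; i++)
            x[i] = 2 * i;
        printf("%d\n", bsearch1000_unrolled(x, 1000));  /* prints 500 */
        printf("%d\n", bsearch1000_unrolled(x, 999));   /* prints 0 */
        return 0;
    }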
The correctness of this code is easy to prove by inserting the complete string of assertions like those surrounding the test of X[L+256]; once you do the two-case analysis to see how that if statement behaves, all the other if statements fall into line.

I've compared the binary search of the December 1983 column with this fine-tuned binary search, and the results are consistent across several machines and languages: the tuned search is usually between two and three times faster. Sometimes the results can be even more dramatic: on machines that skip after a successful comparison, smart compilers can produce assembly code for the tuned search that is five times faster than the original code.

This derivation is an idealized account of code tuning at its most extreme. We replaced the obvious binary search program (which doesn't appear to have much fat on it) with a super-lean version that is several times faster. The program verification tools of the December 1983 column played a crucial role in the task. Because we used them, we can believe that the final program is correct; when I first saw the final code presented without verification, I looked upon it as magic for months.

Principles
The most important principle about code tuning is that it should be done rarely. That sweeping generalization is explained by the following considerations.

The Role of Efficiency. Other properties of software are as important as efficiency, if not more so. Don Knuth has observed that premature optimization is the root of much programming evil; it can compromise the correctness, functionality, and maintainability of programs. Save concern for efficiency for when it matters.

Profiling. When efficiency is important, the first step is to profile the system to find out where the processor spends most of its time. This activity usually shows that most of the time is going to a few hot spots and that the rest of the code is almost never executed. Profiling points to the critical areas; for the rest of the program we follow the wise maxim of, "If it ain't broke, don't fix it."

Design Levels. When the run time of the critical modules must be reduced, there are usually several ways to proceed. Sometimes it is fruitful to take the high-level approaches of changing problem definition, algorithms, or data structures. Sometimes it is most effective to work at a lower level and write in assembly language or microcode or even build special-purpose hardware. Before we start to tune code, we should make sure that other approaches don't provide a more effective solution.

The above discussion considers whether and when to tune code; once we decide to do so, that still leaves the question of how. In my book Writing Efficient Programs I tried to answer that question with a list of general rules for code tuning. All the examples we've seen can be explained in terms of those principles; I'll do that now with the names of the rules in italics.

Van Wyk's Drawing Program. The general strategy of Van Wyk's solution was to Exploit Common Cases; his particular exploitation involved caching a list of the most common kind of record.

Problem One--Character Classification. The solution with a table indexed by a character Precomputes A Logical Function.

Problem Two--Counting Bits. The table of byte counts is closely related to the previous solution; it Stores Precomputed Results.

Problem Three--Computing Spherical Distances. Storing Cartesian coordinates along with latitudes and longitudes is an example of Data Structure Augmentation; using the cheaper Euclidean distance rather than the angular distance Exploits An Algebraic Identity.

Binary Search. Combining Tests reduced the number of array comparisons per inner loop from two to one, Exploiting An Algebraic Identity changed representations from a lower and upper bound to a lower bound and an increment, and Loop Unrolling expanded the program to remove all loop overhead.

So far we've looked upon code tuning as a means to reduce run time. One can also tune code for other purposes, such as reducing paging or increasing a cache hit ratio. Perhaps the most common use of code tuning (other than reducing run time) is to reduce the space required by a program. Problem 4 gives a taste of that endeavor, but the big picture of space reduction will have to wait for a column of its own.

Problems
1. The character classification program in the text assumed that the character classes were disjoint. How would you write a routine to test membership in overlapping character classes (such as lower case letters, upper case letters, letters, digits, and alphanumerics)?

2. Write a code fragment that, given N (a power of two), initializes CountTable[0..N-1] as described in the text.

3. How do the various binary search algorithms behave if they are (against specification) applied to unsorted arrays?
4. In the very early days of programming, Fred Brooks (then of Harvard, now of the University of North Carolina) faced the problem of representing a large table on a small computer. He couldn't store the entire table in an array because there was room for only a few bits for each table entry (actually, there was one decimal digit available for each entry--I said that it was in the early days!). His second approach was to use numerical analysis to fit a function through the table. Although that resulted in a function that was quite close to the true table (no entry was more than a couple of units off the true entry) and required an unnoticeably small amount of memory, legal constraints meant that the approximation wasn't good enough. How could Brooks get the required accuracy in the limited space?

5. A common example of code tuning speeds up sequential search in an array by placing the element being searched for in a "sentinel" position at the end of the array. This eliminates the test of whether the current index is within array bounds, and typically reduces the run time of a sequential search program by 20 or 30 percent. How can sentinels decrease the search time in tables represented by linked lists, hash tables, or binary search trees? (The array version is sketched in C after this list of problems.)
The next two problems call for implementing programs to determine when one approach becomes more efficient than another. Please let me know the results of your implementations; they will be reported in a later column.

6. Because sequential search is simpler than binary search, it is usually more efficient for small tables. On the other hand, the logarithmic number of comparisons made by binary search implies that it will be faster than the linear time of sequential search for large tables. The break-even point is a function of how much each program is tuned. How low (or high) can you make that break-even point? What is it on your machine when both programs are equally tuned? Is it a function of the level of tuning?

7. D. B. Lomet of IBM Watson Research Center observes that hashing may solve the 1,000-integer search problem more efficiently than the tuned binary search. Implement a fast hashing program and compare it to the tuned binary search; how do they compare in terms of speed and space?
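As a point of departure for Problem 5, here is the array version of the sentinel trick in C; this sketch is mine, and the linked-list, hash-table, and tree variants are left to the problem.

    #include <stdio.h>

    /* Sequential search with a sentinel: a copy of t just past the
       data removes the bounds test from the inner loop. The caller
       must leave room for x[n]. Returns the index of t, or -1. */
    static int seq_search(int x[], int n, int t)
    {
        x[n] = t;               /* plant the sentinel */
        int i = 0;
        while (x[i] != t)       /* one test per element, not two */
            i++;
        return i < n ? i : -1;
    }

    int main(void)
    {
        int x[6] = {7, 3, 9, 4, 2};  /* x[5] reserved for the sentinel */
        printf("%d\n", seq_search(x, 5, 9));  /* prints 2 */
        printf("%d\n", seq_search(x, 5, 8));  /* prints -1 */
        return 0;
    }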
Solutions to Problems in the December 1983 Column
1. To show that overflow does not occur we add the conditions 1 <= L <= N+1 and 0 <= U <= N to the invariant; we can then bound L+U. The same conditions can be used to show that elements outside the array bounds are never accessed. We can formally define MustBe(L, U) as X[L-1] < T and X[U+1] > T if we define the boundary elements X[0] and X[N+1] as we did in this column.

2, 6. A binary search for the first occurrence of a value and a fast binary search were both given in this column.

3. This problem called for verifying your own programs.

4. The process always removes either zero or two white beans from the coffee can, so it leaves invariant the parity (oddness or evenness) of the number of white beans. Thus the last remaining bean is white if and only if an odd number of white beans were in the can originally. (David Gries attributes this problem to Carel Scholten.)

5. Because the line segments that form the rungs of the ladder are in increasing y-order, the ones that bracket a given point can be found by binary search. The basic comparison in the search tells whether the point is below, on, or above a given line segment.

Further Reading
The only book I know that is devoted entirely to code tuning is my own Writing Efficient Programs, published by Prentice-Hall in 1982. The heart of the book is a 50-page discussion of the efficiency rules I mentioned above; each is stated in general terms and then illustrated by application to small code fragments and by "war stories" of its application in real systems. Other parts of the book discuss the role of efficiency in software systems and the application of the rules to several important subroutines.