Fundamental Data Structures
PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information.
PDF generated at: Mon, 30 Sep 2013 03:32:24 UTC
Contents
Articles
Introduction
Data structure
Analysis of algorithms
Amortized analysis
Accounting method
Potential method
Sequences
Dynamic array
Linked list
Double-ended queue
Circular buffer
Dictionaries
Associative array
Association list
Hash table
Linear probing
Quadratic probing
Double hashing
Cuckoo hashing
Hopscotch hashing
Hash function
Universal hashing
K-independent hashing
Tabulation hashing
Sets
Bit array
Bloom filter
MinHash
Partition refinement
Priority queues
Priority queue
Binary heap
d-ary heap
Binomial heap
Fibonacci heap
Pairing heap
Soft heap
Tree rotation
Treap
AVL tree
Red-black tree
Scapegoat tree
Splay tree
Tango tree
Skip list
B-tree
B+ tree
Radix tree
Suffix tree
Suffix array
Fusion tree
References
Article Sources and Contributors
Article Licenses
License
Introduction
Abstract data type
In computer science, an abstract data type (ADT) is a mathematical model for a certain class of data structures that
have similar behavior; or for certain data types of one or more programming languages that have similar semantics.
An abstract data type is defined indirectly, only by the operations that may be performed on it and by mathematical
constraints on the effects (and possibly cost) of those operations.[1]
For example, an abstract stack could be defined by three operations: push, that inserts some data item onto the
structure, pop, that extracts an item from it (with the constraint that each pop always returns the most recently
pushed item that has not been popped yet), and peek, that allows data on top of the structure to be examined without
removal. When analyzing the efficiency of algorithms that use stacks, one may also specify that all operations take
the same time no matter how many items have been pushed into the stack, and that the stack uses a constant amount
of storage for each element.
Abstract data types are purely theoretical entities, used (among other things) to simplify the description of abstract
algorithms, to classify and evaluate data structures, and to formally describe the type systems of programming
languages. However, an ADT may be implemented by specific data types or data structures, in many ways and in
many programming languages; or described in a formal specification language. ADTs are often implemented as
modules: the module's interface declares procedures that correspond to the ADT operations, sometimes with
comments that describe the constraints. This information hiding strategy allows the implementation of the module to
be changed without disturbing the client programs.
The term abstract data type can also be regarded as a generalised approach of a number of algebraic structures,
such as lattices, groups, and rings.[2] This can be treated as part of the subject area of artificial intelligence. The
notion of abstract data types is related to the concept of data abstraction, important in object-oriented programming
and design by contract methodologies for software development.
Imperative view
In the "imperative" view, which is closer to the philosophy of imperative programming languages, an abstract data
structure is conceived as an entity that is mutable, meaning that it may be in different states at different times.
Some operations may change the state of the ADT; therefore, the order in which operations are evaluated is
important, and the same operation on the same entities may have different effects if executed at different times,
just like the instructions of a computer, or the commands and procedures of an imperative language. To underscore
this view, it is customary to say that the operations are executed or applied, rather than evaluated. The imperative
style is often used when describing abstract algorithms; it is described by Donald E. Knuth in The Art of Computer
Programming.
Typical operations
Some operations that are often specified for ADTs (possibly under other names) are
compare(s,t), that tests whether two structures are equivalent in some sense;
hash(s), that computes some standard hash function from the instance's state;
print(s) or show(s), that produces a human-readable representation of the structure's state.
In imperative-style ADT definitions, one often finds also
create(), that yields a new instance of the ADT;
initialize(s), that prepares a newly created instance s for further operations, or resets it to some "initial
state";
copy(s,t), that puts instance s in a state equivalent to that of t;
clone(t), that performs s ← create(), copy(s,t), and returns s;
free(s) or destroy(s), that reclaims the memory and other resources used by s.
The free operation is not normally relevant or meaningful, since ADTs are theoretical entities that do not "use
memory". However, it may be necessary when one needs to analyze the storage used by an algorithm that uses the
ADT. In that case one needs additional axioms that specify how much memory each ADT instance uses, as a
function of its state, and how much of it is returned to the pool by free.
Examples
Some common ADTs, which have proved useful in a great variety of applications, are
Container
Deque
List
Map
Multimap
Multiset
Priority queue
Queue
Set
Stack
String
Tree
Graph
Each of these ADTs may be defined in many ways and variants, not necessarily equivalent. For example, a stack
ADT may or may not have a count operation that tells how many items have been pushed and not yet popped.
This choice makes a difference not only for its clients but also for the implementation.
Implementation
Implementing an ADT means providing one procedure or function for each abstract operation. The ADT instances
are represented by some concrete data structure that is manipulated by those procedures, according to the ADT's
specifications.
Usually there are many ways to implement the same ADT, using several different concrete data structures. Thus, for
example, an abstract stack can be implemented by a linked list or by an array.
An ADT implementation is often packaged as one or more modules, whose interface contains only the signature
(number and types of the parameters and results) of the operations. The implementation of the module (namely,
the bodies of the procedures and the concrete data structure used) can then be hidden from most clients of the
module. This makes it possible to change the implementation without affecting the clients.
When implementing an ADT, each instance (in imperative-style definitions) or each state (in functional-style
definitions) is usually represented by a handle of some sort.[3]
Modern object-oriented languages, such as C++ and Java, support a form of abstract data types. When a class is used
as a type, it is an abstract type that refers to a hidden representation. In this model an ADT is typically implemented
as a class, and each instance of the ADT is an object of that class. The module's interface typically declares the
constructors as ordinary procedures, and most of the other ADT operations as methods of that class. However, such
an approach does not easily encapsulate multiple representational variants found in an ADT. It also can undermine
the extensibility of object-oriented programs. In a pure object-oriented program that uses interfaces as types, types
refer to behaviors, not representations.
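Imperative-style interface
An imperative-style interface for the stack ADT might be declared in C as follows:
typedef struct stack_Rep stack_Rep;       /* Type: instance representation (an opaque record). */
typedef stack_Rep *stack_T;               /* Type: handle to a stack instance (an opaque pointer). */
typedef void *stack_Item;                 /* Type: value stored in the stack (an arbitrary address). */
stack_T stack_create(void);               /* Create a new stack instance, initially empty. */
void stack_push(stack_T s, stack_Item e); /* Add an item at the top of the stack. */
stack_Item stack_pop(stack_T s);          /* Remove the top item from the stack and return it. */
int stack_empty(stack_T s);               /* Check whether the stack is empty. */
This interface can be used in the following manner:
stack_T t = stack_create();               /* Create a stack instance. */
int foo = 17;                             /* An arbitrary datum. */
stack_push(t, &foo);                      /* Push the address of foo onto the stack. */
void *e = stack_pop(t);                   /* Get the top item and delete it from the stack. */
if (stack_empty(t)) { }                   /* Do something if the stack is empty. */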
This interface can be implemented in many ways. The implementation may be arbitrarily inefficient, since the formal
definition of the ADT, above, does not specify how much space the stack may use, nor how long each operation
should take. It also does not specify whether the stack state t continues to exist after a call s ← pop(t).
In practice the formal definition should specify that the space is proportional to the number of items pushed and not
yet popped; and that every one of the operations above must finish in a constant amount of time, independently of
that number. To comply with these additional specifications, the implementation could use a linked list, or an array
(with dynamic resizing) together with two integers (an item count and the array size).
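As a rough illustration of the linked-list option mentioned above, the following sketch implements the four imperative-style operations so that each runs in constant time and the space used is proportional to the number of items currently on the stack. The node type stack_Node and the choice to abort via assert on allocation failure or on popping an empty stack are assumptions made for this sketch, not part of the ADT definition.

```c
#include <assert.h>
#include <stdlib.h>

typedef void *stack_Item;                 /* value stored in the stack */

/* Hypothetical node type for this sketch: one node per pushed item. */
typedef struct stack_Node {
    stack_Item item;
    struct stack_Node *next;
} stack_Node;

typedef struct stack_Rep { stack_Node *top; } stack_Rep;
typedef stack_Rep *stack_T;               /* handle to a stack instance */

stack_T stack_create(void) {
    stack_T s = malloc(sizeof *s);
    assert(s != NULL);                    /* sketch: no recovery on failure */
    s->top = NULL;
    return s;
}

void stack_push(stack_T s, stack_Item e) {
    stack_Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->item = e;
    n->next = s->top;                     /* new node becomes the top */
    s->top = n;
}

stack_Item stack_pop(stack_T s) {
    stack_Node *n = s->top;
    stack_Item e;
    assert(n != NULL);                    /* popping an empty stack is an error here */
    e = n->item;
    s->top = n->next;
    free(n);                              /* space tracks the items not yet popped */
    return e;
}

int stack_empty(stack_T s) {
    return s->top == NULL;
}
```

Because clients only see the handle stack_T and the four functions, the same client code would keep working if the list were later swapped for a resizable array.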
Functional-style interface
Functional-style ADT definitions are more appropriate for functional programming languages, and vice versa.
However, one can provide a functional style interface even in an imperative language like C. For example:
typedef struct stack_Rep stack_Rep;          /* Type: stack state representation (an opaque record). */
typedef stack_Rep *stack_T;                  /* Type: handle to a stack state (an opaque pointer). */
typedef void *stack_Item;                    /* Type: item (arbitrary address). */
stack_T stack_empty(void);                   /* Returns the empty stack state. */
stack_T stack_push(stack_T s, stack_Item x); /* Adds x at the top of s, returns the resulting state. */
stack_Item stack_top(stack_T s);             /* Returns the item currently at the top of s. */
stack_T stack_pop(stack_T s);                /* Remove the top item from s, returns the resulting state. */
The main problem with this approach is that C lacks garbage collection, which makes this style of programming
impractical; moreover, memory allocation routines in C are often slower than allocation in a typical garbage
collector, so the performance impact of so many allocations is even greater.
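To make the allocation issue concrete, here is a minimal sketch of a functional-style stack in which every state is an immutable node that shares its tail with the state it was built from. The fstack_ prefix is an assumption of this sketch (chosen so the names do not collide with an imperative-style stack in the same program); the operations otherwise mirror the functional-style operations discussed above.

```c
#include <assert.h>
#include <stdlib.h>

typedef void *fstack_Item;
typedef struct fstack_Rep fstack_Rep;
typedef const fstack_Rep *fstack_T;       /* a stack state; never modified once built */

struct fstack_Rep {
    fstack_Item top;                      /* item at the top of this state */
    fstack_T rest;                        /* the state underneath, shared not copied */
};

/* The empty state is represented by a null handle. */
fstack_T fstack_empty(void) {
    return NULL;
}

/* Builds a new state; s itself is left untouched and stays valid. */
fstack_T fstack_push(fstack_T s, fstack_Item x) {
    fstack_Rep *r = malloc(sizeof *r);    /* one allocation per push */
    assert(r != NULL);
    r->top = x;
    r->rest = s;
    return r;
}

fstack_Item fstack_top(fstack_T s) {
    assert(s != NULL);
    return s->top;
}

/* Returns the state underneath; nothing is freed, because other
   states may still share the node. */
fstack_T fstack_pop(fstack_T s) {
    assert(s != NULL);
    return s->rest;
}
```

Note that fstack_pop cannot safely free anything: without garbage collection or reference counting, the program has no way to know when a node is no longer reachable from any live state.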
ADT libraries
Many modern programming languages, such as C++ and Java, come with standard libraries that implement several
common ADTs, such as those listed above.
References
[1] Barbara Liskov, Programming with Abstract Data Types, in Proceedings of the ACM SIGPLAN Symposium on Very High Level Languages,
pp. 50-59, 1974, Santa Monica, California.
[2] , Chapter 7, section 40.
[3] , definition 4.4.
Further reading
Mitchell, John C.; Plotkin, Gordon (July 1988). "Abstract Types Have Existential Type"
(http://theory.stanford.edu/~jcm/papers/mitch-plotkin-88.pdf). ACM Transactions on Programming Languages and Systems 10 (3).
External links
Abstract data type (http://www.nist.gov/dads/HTML/abstractDataType.html) in NIST Dictionary of
Algorithms and Data Structures
Walls and Mirrors, the classic textbook
Data structure
In computer science, a data structure is a particular way of storing and organizing data in a computer so that it
can be used efficiently.[1][2]
Different kinds of data structures are suited to different kinds of applications, and some are highly specialized to
specific tasks. For example, B-trees are particularly well-suited for implementation of databases, while compiler
implementations usually use hash tables to look up identifiers.
[Figure: A hash table]
Data structures provide a means to manage large amounts of data efficiently, such as large databases and internet
indexing services. Usually, efficient data structures are a key to designing efficient algorithms. Some formal design
methods and programming languages emphasize data structures, rather than algorithms, as the key organizing factor in
software design. Storing and retrieving can be carried out on data stored in both main memory and in secondary memory.
Overview
An array stores a number of elements in a specific order. The elements are accessed using an integer index to specify
which element is required (although the elements themselves may be of almost any type). Arrays may be fixed-length or
expandable.
Records (also called tuples or structs) are among the simplest data structures. A record is a value that contains
other values, typically in fixed number and sequence and typically indexed by names. The elements of records are
usually called fields or members.
A hash table (also called a dictionary or map) is a more flexible variation on a record, in which name-value
pairs can be added and deleted freely.
A union type specifies which of a number of permitted primitive types may be stored in its instances, e.g. "float
or long integer". Contrast with a record, which could be defined to contain a float and an integer; whereas, in a
union, there is only one value at a time.
A tagged union (also called a variant, variant record, discriminated union, or disjoint union) contains an
additional field indicating its current type, for enhanced type safety.
A set is an abstract data structure that can store specific values, in no particular order and with no repeated
values. Values themselves are not retrieved from sets; rather, one tests a value for membership to obtain a boolean
"in" or "not in".
Graphs and trees are linked abstract data structures composed of nodes. Each node contains a value and also one
or more pointers to other nodes. Graphs can be used to represent networks, while trees are generally used for
sorting and searching, having their nodes arranged in some relative order based on their values.
An object contains data fields, like a record, and also contains program code fragments for accessing or
modifying those fields. Data structures not containing code, like those above, are called plain old data structures.
Many others are possible, but they tend to be further variations and compounds of the above.
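The difference between a record, a union and a tagged union from the list above can be sketched in C. All type and function names here (point, number, tagged_number, make_float, make_long, as_double) are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Record: a fixed number of named fields, all present at once. */
typedef struct { int x; int y; } point;

/* Union: holds a float or a long integer, but only one at a time. */
typedef union { float f; long i; } number;

/* Tagged union: an extra field records which member is currently valid. */
typedef enum { NUMBER_FLOAT, NUMBER_LONG } number_tag;
typedef struct { number_tag tag; number value; } tagged_number;

static tagged_number make_float(float f) {
    tagged_number n;
    n.tag = NUMBER_FLOAT;
    n.value.f = f;
    return n;
}

static tagged_number make_long(long i) {
    tagged_number n;
    n.tag = NUMBER_LONG;
    n.value.i = i;
    return n;
}

/* Reads the value through the member that the tag says is valid. */
static double as_double(tagged_number n) {
    switch (n.tag) {
    case NUMBER_FLOAT: return n.value.f;
    case NUMBER_LONG:  return (double)n.value.i;
    }
    assert(!"unknown tag");               /* unreachable for well-formed values */
    return 0.0;
}
```

With a plain union the program must remember by other means which member was stored last; the tag makes that check explicit, which is the added type safety the list entry refers to.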
Basic principles
Data structures are generally based on the ability of a computer to fetch and store data at any place in its memory,
specified by an address, a bit string that can be itself stored in memory and manipulated by the program. Thus the
record and array data structures are based on computing the addresses of data items with arithmetic operations; while
the linked data structures are based on storing addresses of data items within the structure itself. Many data structures
use both principles, sometimes combined in non-trivial ways (as in XOR linking).
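The two principles can be contrasted in a short C sketch: an array element is found by computing its address from the base address and the index, while a linked structure is traversed by following addresses stored inside the structure. The names array_get, node and list_get are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Computed address: element i lives at base + i * sizeof(int),
   which is exactly what the pointer arithmetic below evaluates. */
static int array_get(const int *base, size_t i) {
    return *(base + i);
}

/* Stored address: each node holds the address of the next one. */
typedef struct node {
    int value;
    struct node *next;
} node;

/* Follows i stored links from head, then reads the value there. */
static int list_get(const node *head, size_t i) {
    while (i-- > 0) {
        assert(head != NULL);             /* index must be within the list */
        head = head->next;
    }
    assert(head != NULL);
    return head->value;
}
```

Reaching element i takes one address computation in the array but i link traversals in the list, which is why the two kinds of structure suit different access patterns.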
The implementation of a data structure usually requires writing a set of procedures that create and manipulate
instances of that structure. The efficiency of a data structure cannot be analyzed separately from those operations.
This observation motivates the theoretical concept of an abstract data type, a data structure that is defined indirectly
by the operations that may be performed on it, and the mathematical properties of those operations (including their
space and time cost).
Language support
Most assembly languages and some low-level languages, such as BCPL (Basic Combined Programming Language),
lack support for data structures. Many high-level programming languages and some higher-level assembly
languages, such as MASM, on the other hand, have special syntax or other built-in support for certain data
structures, such as vectors (one-dimensional arrays) in the C language or multi-dimensional arrays in Pascal.
Most programming languages feature some sort of library mechanism that allows data structure implementations to
be reused by different programs. Modern languages usually come with standard libraries that implement the most
common data structures. Examples are the C++ Standard Template Library, the Java Collections Framework, and
Microsoft's .NET Framework.
Modern languages also generally support modular programming, the separation between the interface of a library
module and its implementation. Some provide opaque data types that hide implementation details from clients.
Object-oriented programming languages, such as C++, Java and Smalltalk may use classes for this purpose.
Many known data structures have concurrent versions that allow multiple computing threads to access the data
structure simultaneously.
References
[1] Paul E. Black (ed.), entry for data structure in Dictionary of Algorithms and Data Structures. U.S. National Institute of Standards and
Technology. 15 December 2004. Online version (http://www.itl.nist.gov/div897/sqg/dads/HTML/datastructur.html) Accessed May 21,
2009.
[2] Entry data structure in the Encyclopædia Britannica (2009) Online entry
(http://www.britannica.com/EBchecked/topic/152190/data-structure) accessed on May 21, 2009.
Further reading
Peter Brass, Advanced Data Structures, Cambridge University Press, 2008.
Donald Knuth, The Art of Computer Programming, vol. 1. Addison-Wesley, 3rd edition, 1997.
Dinesh Mehta and Sartaj Sahni, Handbook of Data Structures and Applications, Chapman and Hall/CRC Press,
2007.
Niklaus Wirth, Algorithms and Data Structures, Prentice Hall, 1985.
Diane Zak, Introduction to Programming with C++, Cengage Learning Asia Pte Ltd, 2011.
Analysis of algorithms
In computer science, the analysis of algorithms is the determination of the amount of resources (such as time and
storage) necessary to execute them. Most algorithms are designed to work with inputs of arbitrary length. Usually,
the efficiency or running time of an algorithm is stated as a function relating the input length to the number of steps
(time complexity) or storage locations (space complexity).
Algorithm analysis is an important part of a broader computational complexity theory, which provides theoretical
estimates for the resources needed by any algorithm which solves a given computational problem. These estimates
provide an insight into reasonable directions of search for efficient algorithms.
In theoretical analysis of algorithms it is common to estimate their complexity in the asymptotic sense, i.e., to
estimate the complexity function for arbitrarily large input. Big O notation, Big-omega notation and Big-theta
notation are used to this end. For instance, binary search is said to run in a number of steps proportional to the
logarithm of the length of the list being searched, or in O(log(n)), colloquially "in logarithmic time". Usually
asymptotic estimates are used because different implementations of the same algorithm may differ in efficiency.
However the efficiencies of any two "reasonable" implementations of a given algorithm are related by a constant
multiplicative factor called a hidden constant.
Exact (not asymptotic) measures of efficiency can sometimes be computed but they usually require certain
assumptions concerning the particular implementation of the algorithm, called model of computation. A model of
computation may be defined in terms of an abstract computer, e.g., Turing machine, and/or by postulating that
certain operations are executed in unit time. For example, if the sorted list to which we apply binary search has n
elements, and we can guarantee that each lookup of an element in the list can be done in unit time, then at most
$\log_2 n + 1$ time units are needed to return an answer.
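The binary search bound can be checked with a small sketch that counts one unit per element lookup. The function names and the probes counter are assumptions of this sketch; probe_bound computes floor(log2 n) + 1 with integer arithmetic only.

```c
#include <assert.h>
#include <stddef.h>

/* Searches sorted a[0..n-1] for key. Returns its index, or -1 if absent.
   *probes receives the number of element lookups performed. */
static long binary_search(const int *a, size_t n, int key, unsigned *probes) {
    size_t lo = 0, hi = n;                /* candidates lie in [lo, hi) */
    assert(probes != NULL);
    *probes = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        ++*probes;                        /* one unit-time lookup of a[mid] */
        if (a[mid] == key)
            return (long)mid;
        if (a[mid] < key)
            lo = mid + 1;                 /* discard the lower half */
        else
            hi = mid;                     /* discard the upper half */
    }
    return -1;
}

/* floor(log2(n)) + 1 for n >= 1: the number of bits needed to write n. */
static unsigned probe_bound(size_t n) {
    unsigned bits = 0;
    assert(n >= 1);
    while (n > 0) {
        ++bits;
        n >>= 1;
    }
    return bits;
}
```

Each probe at least halves the remaining range, so a list of 1,000 elements never needs more than probe_bound(1000) lookups, whether or not the key is present.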
Cost models
Time efficiency estimates depend on what we define to be a step. For the analysis to correspond usefully to the
actual execution time, the time required to perform a step must be guaranteed to be bounded above by a constant.
One must be careful here; for instance, some analyses count an addition of two numbers as one step. This assumption
may not be warranted in certain contexts. For example, if the numbers involved in a computation may be arbitrarily
large, the time required by a single addition can no longer be assumed to be constant.
Two cost models are generally used:[1]
the uniform cost model, also called uniform-cost measurement (and similar variations), assigns a constant cost
to every machine operation, regardless of the size of the numbers involved
the logarithmic cost model, also called logarithmic-cost measurement (and variations thereof), assigns a cost to
every machine operation proportional to the number of bits involved
The latter is more cumbersome to use, so it's only employed when necessary, for example in the analysis of
arbitrary-precision arithmetic algorithms, like those used in cryptography.
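The two cost models can be contrasted with a small sketch that charges for one addition under each model. The function names are hypothetical, and taking the constant of proportionality in the logarithmic model to be one unit per bit of the longer operand is an assumption made for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits needed to write x (zero is counted as one bit). */
static unsigned bit_length(unsigned long x) {
    unsigned bits = 0;
    while (x > 0) {
        ++bits;
        x >>= 1;
    }
    return bits ? bits : 1;
}

/* Uniform cost model: every addition costs one unit, whatever the operands. */
static unsigned uniform_cost_add(unsigned long a, unsigned long b) {
    (void)a;
    (void)b;
    return 1;
}

/* Logarithmic cost model: the charge grows with the number of bits involved
   (assumed here: one unit per bit of the longer operand). */
static unsigned logarithmic_cost_add(unsigned long a, unsigned long b) {
    unsigned la = bit_length(a), lb = bit_length(b);
    assert(a <= ULONG_MAX - b);           /* sketch only covers sums that fit */
    return la > lb ? la : lb;
}
```

For machine-word operands the two models differ only by a bounded factor, which is why the uniform model is usually adequate; the gap becomes unbounded once operands can grow without limit.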
A key point which is often overlooked is that published lower bounds for problems are often given for a model of
computation that is more restricted than the set of operations that one could use in practice. As a result, there are
algorithms that are faster than what would naively be thought possible.[2]
Run-time analysis
Run-time analysis is a theoretical classification that estimates and anticipates the increase in running time (or
run-time) of an algorithm as its input size (usually denoted as n) increases. Run-time efficiency is a topic of great
interest in computer science: A program can take seconds, hours or even years to finish executing, depending on
which algorithm it implements (see also performance analysis, which is the analysis of an algorithm's run-time in
practice).
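Suppose a program that looks up a specific entry in a sorted list of size n is benchmarked on two machines, Computer A
and Computer B, giving run-times such as the following:
n (list size)    Computer A run-time (ns)    Computer B run-time (ns)
15               7                           100,000
65               32                          150,000
250              125                         200,000
1,000            500                         250,000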
Based on these metrics, it would be easy to jump to the conclusion that Computer A is running an algorithm that is
far superior in efficiency to that of Computer B. However, if the size of the input-list is increased to a sufficient
number, that conclusion is dramatically demonstrated to be in error:
n (list size)    Computer A run-time (ns)    Computer B run-time (ns)
15               7                           100,000
65               32                          150,000
250              125                         200,000
1,000            500                         250,000
...              ...                         ...
1,000,000        500,000                     500,000
4,000,000        2,000,000                   550,000
16,000,000       8,000,000                   600,000
...              ...                         ...
...              ...                         1,375,000 ns, or 1.375 milliseconds
Computer A, running the linear search program, exhibits a linear growth rate. The program's run-time is directly
proportional to its input size. Doubling the input size doubles the run time, quadrupling the input size quadruples the
run-time, and so forth. On the other hand, Computer B, running the binary search program, exhibits a logarithmic
growth rate. Doubling the input size only increases the run time by a constant amount (in this example, 25,000 ns).
Even though Computer A is ostensibly a faster machine, Computer B will inevitably surpass Computer A in run-time
because it's running an algorithm with a much slower growth rate.
Orders of growth
Informally, an algorithm can be said to exhibit a growth rate on the order of a mathematical function if beyond a
certain input size n, the function f(n) times a positive constant provides an upper bound or limit for the run-time of
that algorithm. In other words, for a given input size n greater than some n0 and a constant c, the running time of that
algorithm will never be larger than c f(n). This concept is frequently expressed using Big O notation. For example,
since the run-time of insertion sort grows quadratically as its input size increases, insertion sort can be said to be of
order O(n^2).
Big O notation is a convenient way to express the worst-case scenario for a given algorithm, although it can also be
used to express the average-case; for example, the worst-case scenario for quicksort is O(n^2), but the average-case
run-time is O(n log n).[3]
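The quadratic growth of insertion sort can be observed directly by counting comparisons. In the sketch below the comparison counter is an addition made for illustration; the sorting logic itself is the standard insertion sort.

```c
#include <assert.h>
#include <stddef.h>

/* Sorts a[0..n-1] in ascending order and returns the number of
   element comparisons performed. */
static unsigned long insertion_sort(int *a, size_t n) {
    unsigned long comparisons = 0;
    assert(a != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0) {
            ++comparisons;                /* compare key with a[j - 1] */
            if (a[j - 1] <= key)
                break;                    /* key is already in place */
            a[j] = a[j - 1];              /* shift the larger element right */
            --j;
        }
        a[j] = key;
    }
    return comparisons;
}
```

On input in reverse order, element i is compared with all i elements before it, giving n(n - 1)/2 comparisons in total, which is bounded by c n^2 with c = 1/2. On already sorted input only n - 1 comparisons are made, which is why the quadratic bound describes the worst case rather than every case.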
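Assuming the run-time follows a power rule, $t \approx k n^a$, the exponent $a$ can be estimated from run-times $t_1$
and $t_2$ measured at two problem sizes $n_1$ and $n_2$ as $a = \log(t_2/t_1) / \log(n_2/n_1)$. If the order of growth
indeed follows the power rule, the empirical value of a will stay constant at different ranges, and if not, it will
change, but it can still serve for comparison of any two given algorithms as to their empirical local orders of growth
behaviour. Applied to the above table: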
n (list size)    Computer A run-time (ns)    Local order of growth (n^_)    Computer B run-time (ns)    Local order of growth (n^_)
15               7                                                          100,000
65               32                          1.04                           150,000                     0.28
250              125                         1.01                           200,000                     0.21
1,000            500                         1.00                           250,000                     0.16
...              ...                                                        ...
1,000,000        500,000                     1.00                           500,000                     0.10
4,000,000        2,000,000                   1.00                           550,000                     0.07
16,000,000       8,000,000                   1.00                           600,000                     0.06
...              ...                                                        ...
It is clearly seen that the first algorithm exhibits a linear order of growth indeed following the power rule. The
empirical values for the second one are diminishing rapidly, suggesting it follows another rule of growth and in any
case has much lower local orders of growth (and improving further still), empirically, than the first one.
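As a worked check, assume the local order of growth between two consecutive measurements $(n_1, t_1)$ and $(n_2, t_2)$ is taken as the log-ratio $a = \log(t_2/t_1)/\log(n_2/n_1)$; this assumption reproduces the tabulated values. Using natural logarithms, Computer B's entry between n = 15 and n = 65, and Computer A's entry between n = 250 and n = 1,000, come out as:

```latex
a_B = \frac{\log(150{,}000 / 100{,}000)}{\log(65 / 15)}
    = \frac{\log 1.5}{\log 4.33\ldots}
    \approx \frac{0.4055}{1.4663}
    \approx 0.28
\qquad
a_A = \frac{\log(500 / 125)}{\log(1{,}000 / 250)}
    = \frac{\log 4}{\log 4}
    = 1.00
```

The second value is exactly 1 because Computer A's run-time quadruples when the input quadruples, which is what linear growth means.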
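Consider an algorithm of seven steps in which step 3 runs only conditionally and steps 4 to 6 form a pair of nested
loops: the outer loop (step 4) runs i from 1 to n, the inner loop (step 5) runs j from 1 to i, and step 6 is the inner
loop body.
A given computer will take a discrete amount of time to execute each of the instructions involved with carrying out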
this algorithm. The specific amount of time to carry out a given instruction will vary depending on which instruction
is being executed and which computer is executing it, but on a conventional computer, this amount will be
deterministic.[5] Say that the actions carried out in step 1 are considered to consume time T1, step 2 uses time T2, and
so forth.
In the algorithm above, steps 1, 2 and 7 will only be run once. For a worst-case evaluation, it should be assumed that
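step 3 will be run as well. Thus the total amount of time to run steps 1-3 and step 7 is:
$$T_1 + T_2 + T_3 + T_7$$
The loops in steps 4, 5 and 6 are trickier to evaluate. The outer loop test in step 4 will execute (n + 1) times (note
that an extra step is required to terminate the for loop, hence n + 1 and not n executions), which will consume
T4(n + 1) time. The inner loop, on the other hand, is governed by the value of j, which iterates from 1 to i. On the
first pass through the outer loop, j iterates from 1 to 1: the inner loop makes one pass, so running the inner loop
body (step 6) consumes T6 time, and the inner loop test (step 5) consumes 2T5 time. During the next pass through the
outer loop, j iterates from 1 to 2: the inner loop makes two passes, so running the inner loop body (step 6) consumes
2T6 time, and the inner loop test (step 5) consumes 3T5 time.
Altogether, the total time required to run the inner loop body can be expressed as an arithmetic progression:
$$T_6 + 2T_6 + 3T_6 + \cdots + (n-1)T_6 + nT_6$$
which can be factored[6] as
$$T_6\left[1 + 2 + 3 + \cdots + (n-1) + n\right] = T_6\left[\tfrac{1}{2}(n^2 + n)\right]$$
The total time required to run the inner loop test can be evaluated similarly:
$$2T_5 + 3T_5 + 4T_5 + \cdots + (n-1)T_5 + nT_5 + (n+1)T_5 = T_5 + 2T_5 + 3T_5 + \cdots + nT_5 + (n+1)T_5 - T_5$$
which reduces to
$$T_5\left[\tfrac{1}{2}(n^2 + n)\right] + (n+1)T_5 - T_5 = T_5\left[\tfrac{1}{2}(n^2 + n)\right] + nT_5 = T_5\left[\tfrac{1}{2}(n^2 + 3n)\right]$$
Therefore the total running time for this algorithm is:
$$f(n) = T_1 + T_2 + T_3 + T_7 + (n+1)T_4 + \left[\tfrac{1}{2}(n^2 + n)\right]T_6 + \left[\tfrac{1}{2}(n^2 + 3n)\right]T_5$$
As a rule-of-thumb, one can assume that the highest-order term in any given function dominates its rate of growth
and thus defines its run-time order. In this example, n^2 is the highest-order term, so one can conclude that f(n) =
O(n^2). Formally this can be proven as follows. Since every term is non-negative,
$$f(n) \le (n^2 + n)T_6 + (n^2 + 3n)T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \quad (\text{for } n \ge 0)$$
Let k be a constant greater than or equal to [T1..T7]. Then
$$(n^2 + n)T_6 + (n^2 + 3n)T_5 + (n+1)T_4 + T_1 + T_2 + T_3 + T_7 \le k(n^2 + n) + k(n^2 + 3n) + kn + 5k = 2kn^2 + 5kn + 5k \le 2kn^2 + 5kn^2 + 5kn^2 = 12kn^2 \quad (\text{for } n \ge 1)$$
Therefore $f(n) \le cn^2$ for $n \ge n_0$, with $c = 12k$ and $n_0 = 1$.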
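A more elegant approach to analyzing this algorithm would be to declare that [T1..T7] are all equal to one unit of
time, in a system of units chosen so that one unit is greater than or equal to the actual times for these steps. This
would mean that the algorithm's running time breaks down as follows:[7]
$$4 + \sum_{i=1}^{n} i \le 4 + \sum_{i=1}^{n} n = 4 + n^2 \le 5n^2 \ (\text{for } n \ge 1) = O(n^2)$$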
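The step counts used in this analysis can be checked by instrumenting a pair of nested loops of the shape the text describes: an outer loop over i from 1 to n and an inner loop over j from 1 to i. The struct and function names below are assumptions of this sketch, and the loop tests are written out explicitly so that each evaluation, including the final failing one, is counted.

```c
#include <assert.h>

typedef struct {
    unsigned long outer_tests;            /* evaluations of the outer loop test (step 4) */
    unsigned long inner_tests;            /* evaluations of the inner loop test (step 5) */
    unsigned long body_runs;              /* executions of the inner loop body (step 6) */
} step_counts;

static step_counts count_steps(unsigned long n) {
    step_counts c = {0, 0, 0};
    unsigned long i, j;
    assert(n <= 60000UL);                 /* keeps every count below 2^32 */
    for (i = 1; ; i++) {
        ++c.outer_tests;                  /* the test also runs when it fails */
        if (i > n)
            break;
        for (j = 1; ; j++) {
            ++c.inner_tests;
            if (j > i)
                break;
            ++c.body_runs;
        }
    }
    return c;
}
```

For n = 10 this gives 11 outer tests, 65 inner tests and 55 body executions, matching n + 1, (n^2 + 3n)/2 and (n^2 + n)/2 respectively.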
Relevance
Algorithm analysis is important in practice because the accidental or unintentional use of an inefficient algorithm can
significantly impact system performance. In time-sensitive applications, an algorithm taking too long to run can
render its results outdated or useless. An inefficient algorithm can also end up requiring an uneconomical amount of
computing power or storage in order to run, again rendering it practically useless.
Notes
[1] , section 1.3
[2] Examples of the price of abstraction? (http://cstheory.stackexchange.com/questions/608/examples-of-the-price-of-abstraction),
cstheory.stackexchange.com
[3] The term lg is often used as shorthand for log2
[4] How To Avoid O-Abuse and Bribes (http://rjlipton.wordpress.com/2009/07/24/how-to-avoid-o-abuse-and-bribes/), at the blog "Gödel's
Lost Letter and P=NP" by R. J. Lipton, professor of Computer Science at Georgia Tech, recounting idea by Robert Sedgewick
[5] However, this is not the case with a quantum computer
[6] It can be proven by induction that $1 + 2 + 3 + \cdots + (n-1) + n = \frac{n(n+1)}{2}$
[7] This approach, unlike the above approach, neglects the constant time consumed by the loop tests which terminate their respective loops, but it
is trivial to prove that such omission does not affect the final result
References
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L. & Stein, Clifford (2001). Introduction to
Algorithms. Chapter 1: Foundations (Second ed.). Cambridge, MA: MIT Press and McGraw-Hill. pp. 3-122.
ISBN 0-262-03293-7.
Sedgewick, Robert (1998). Algorithms in C, Parts 1-4: Fundamentals, Data Structures, Sorting, Searching (3rd
ed.). Reading, MA: Addison-Wesley Professional. ISBN 978-0-201-31452-6.
Knuth, Donald. The Art of Computer Programming. Addison-Wesley.
Greene, Daniel A.; Knuth, Donald E. (1982). Mathematics for the Analysis of Algorithms (Second ed.).
Birkhäuser. ISBN 3-7643-3102-X.
Goldreich, Oded (2010). Computational Complexity: A Conceptual Perspective. Cambridge University Press.
ISBN 978-0-521-88473-0.
Amortized analysis
In computer science, amortized analysis is a method of analyzing algorithms that considers the entire sequence of
operations of the program. It allows for the establishment of a worst-case bound for the performance of an algorithm
irrespective of the inputs by looking at all of the operations. This analysis is most commonly discussed using
Big O notation.
At the heart of the method is the idea that while certain operations may be extremely costly in resources, they cannot
occur at a high-enough frequency to weigh down the entire program because the number of less costly operations
will far outnumber the costly ones in the long run, "paying back" the program over a number of iterations. It is
particularly useful because it guarantees worst-case performance rather than making assumptions about the state of
the program.
History
Amortized analysis initially emerged from a method called aggregate analysis, which is now subsumed by amortized
analysis. However, the technique was first formally introduced by Robert Tarjan in his paper Amortized
Computational Complexity, which addressed the need for a more useful form of analysis than the common
probabilistic methods used. Amortization was initially used for very specific types of algorithms, particularly those
involving binary trees and union operations. However, it is now ubiquitous and comes into play when analyzing
many other algorithms as well.
Method
The method requires knowledge of which series of operations are possible. This is most commonly the case with
data structures, which have state that persists between operations. The basic idea is that a worst case operation can
alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.
There are generally three methods for performing amortized analysis: the aggregate method, the accounting method,
and the potential method. All of these give the same answers, and their usage difference is primarily circumstantial
and due to individual preference.
Aggregate analysis determines the upper bound T(n) on the total cost of a sequence of n operations, then
calculates the amortized cost to be T(n) / n.
The accounting method determines the individual cost of each operation, combining its immediate execution time
and its influence on the running time of future operations. Usually, many short-running operations accumulate a
"debt" of unfavorable state in small increments, while rare long-running operations decrease it drastically.
The potential method is like the accounting method, but overcharges operations early to compensate for
undercharges later.
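A standard illustration of aggregate analysis, not taken from the text above, is incrementing a binary counter: a single increment may flip many bits, but the total number of flips over n increments starting from zero stays below 2n, so the amortized cost T(n) / n is less than 2 flips per increment. The counter width and function names below are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>

enum { COUNTER_BITS = 32 };               /* assumed width for this sketch */

/* Adds one to the counter (least significant bit first) and returns
   the number of bits flipped, which is the actual cost of the operation. */
static unsigned increment(unsigned char *bits) {
    unsigned flips = 0;
    size_t i = 0;
    assert(bits != NULL);
    while (i < COUNTER_BITS && bits[i] == 1) {
        bits[i++] = 0;                    /* trailing ones roll over to zero */
        ++flips;
    }
    if (i < COUNTER_BITS) {
        bits[i] = 1;                      /* the first zero becomes one */
        ++flips;
    }
    return flips;
}

/* T(n): total flips over n increments of a counter that starts at zero. */
static unsigned long total_flips(unsigned long n) {
    unsigned char bits[COUNTER_BITS] = {0};
    unsigned long total = 0;
    for (unsigned long k = 0; k < n; k++)
        total += increment(bits);
    return total;
}
```

The increment that takes the counter from 1,023 to 1,024 flips 11 bits, yet total_flips(1024) is only 2,047: bit k flips once every 2^k increments, so expensive operations are necessarily rare.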
Common use
In common usage, an "amortized algorithm" is one that an amortized analysis has shown to perform well.
Online algorithms commonly use amortized analysis.
References
Allan Borodin and Ran El-Yaniv (1998). Online Computation and Competitive Analysis [1]. Cambridge
University Press. pp. 20, 141.
[1] http://www.cs.technion.ac.il/~rani/book.html
Accounting method
In the field of analysis of algorithms in computer science, the accounting method is a method of amortized analysis
based on accounting. The accounting method often gives a more intuitive account of the amortized cost of an
operation than either aggregate analysis or the potential method. Note, however, that this does not guarantee such
analysis will be immediately obvious; often, choosing the correct parameters for the accounting method requires as
much knowledge of the problem and the complexity bounds one is attempting to prove as the other two methods.
The accounting method is most naturally suited for proving an O(1) bound on time. The method as explained here is
for proving such a bound.
The method
A set of elementary operations which will be used in the algorithm is chosen and their costs are arbitrarily set to 1.
The fact that the costs of these operations may differ in reality presents no difficulty in principle. What is important
is that each elementary operation has a constant cost.
Each aggregate operation is assigned a "payment". The payment is intended to cover the cost of elementary
operations needed to complete this particular operation, with some of the payment left over, placed in a pool to be
used later.
The difficulty with problems that require amortized analysis is that, in general, some of the operations will require
greater than constant cost. This means that no constant payment will be enough to cover the worst case cost of an
operation, in and of itself. With proper selection of payment, however, this is no longer a difficulty; the expensive
operations will only occur when there is sufficient payment in the pool to cover their costs.
Examples
A few examples will help to illustrate the use of the accounting method.
Table expansion
It is often necessary to create a table before it is known how much space is needed. One possible strategy is to double
the size of the table when it is full. Here we will use the accounting method to show that the amortized cost of an
insertion operation in such a table is O(1).
Before looking at the procedure in detail, we need some definitions. Let T be a table, E an element to insert, num(T)
the number of elements in T, and size(T) the allocated size of T. We assume the existence of operations
create_table(n), which creates an empty table of size n, for now assumed to be free, and elementary_insert(T,E),
which inserts element E into a table T that already has space allocated, with a cost of 1.
The following pseudocode illustrates the table insertion procedure:
function table_insert(T,E)
    if num(T) = size(T)
        U := create_table(2 × size(T))
        for each F in T
            elementary_insert(U,F)
        T := U
    elementary_insert(T,E)
Without amortized analysis, the best bound we can show for n insert operations is O(n²); this is due to the loop at
line 4 that performs num(T) elementary insertions.
For analysis using the accounting method, we assign a payment of 3 to each table insertion. Although the reason for
this is not clear now, it will become clear during the course of the analysis.
Assume that initially the table is empty with size(T) = m. The first m insertions therefore do not require reallocation
and only have cost 1 (for the elementary insert). Therefore, when num(T) = m, the pool has (3 - 1)m = 2m.
Inserting element m + 1 requires reallocation of the table. Creating the new table on line 3 is free (for now). The loop
on line 4 requires m elementary insertions, for a cost of m. Including the insertion on the last line, the total cost for
this operation is m + 1. After this operation, the pool therefore has 2m + 3 - (m + 1) = m + 2.
Next, we add another m - 1 elements to the table. At this point the pool has m + 2 + 2(m - 1) = 3m. Inserting an
additional element (that is, element 2m + 1) can be seen to have cost 2m + 1 and a payment of 3. After this operation,
the pool has 3m + 3 - (2m + 1) = m + 2. Note that this is the same amount as after inserting element m + 1. In fact,
we can show that this will be the case for any number of reallocations.
It can now be made clear why the payment for an insertion is 3. 1 goes to inserting the element the first time it is
added to the table, 1 goes to moving it the next time the table is expanded, and 1 goes to moving one of the elements
that was already in the table the next time the table is expanded.
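The bookkeeping in this analysis can be replayed mechanically. The Python sketch below is only an illustration under the same assumptions as the text (initial size m, creating a table is free, each elementary insertion costs 1); the function name pool_history is hypothetical.

```python
def pool_history(m, insertions, payment=3):
    """Pool balance after each insertion into a doubling table of initial size m (m >= 1)."""
    size, num, pool = m, 0, 0
    history = []
    for _ in range(insertions):
        cost = 1                  # elementary insert of the new element
        if num == size:           # table full: move every existing element
            cost += num
            size *= 2             # creating the new table is assumed free here
        num += 1
        pool += payment - cost
        history.append(pool)
    return history

h = pool_history(m=4, insertions=40)
print(min(h))                     # the pool never goes negative
print(h[4], h[8], h[16], h[32])   # 6 6 6 6: m + 2 right after every reallocation
```

With m = 4 the pool holds 2m = 8 just before the first reallocation and returns to m + 2 = 6 after each one, matching the figures derived above.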
We initially assumed that creating a table was free. In reality, creating a table of size n may be as expensive as O(n).
Let us say that the cost of creating a table of size n is n. Does this new cost present a difficulty? Not really; it turns
out we use the same method to show the amortized O(1) bounds. All we have to do is change the payment.
When a new table is created, there is an old table with m entries. The new table will be of size 2m. As long as the
entries currently in the table have added enough to the pool to pay for creating the new table, we will be all right.
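We cannot expect the first m/2 entries to help pay for the new table. Those entries already paid for the current table.
We must then rely on the last m/2 entries to pay the cost 2m. This means we must add 2m / (m/2) = 4 to the payment
for each entry, for a total payment of 3 + 4 = 7.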
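A small simulation can check whether a given payment keeps the pool non-negative once table creation is charged. This is a sketch under stated assumptions: creating a table of size n costs n, the initial table of size m is not charged, and the helper name min_pool is hypothetical.

```python
def min_pool(m, insertions, payment):
    """Lowest pool balance reached when creating a table of size n costs n."""
    size, num, pool, lowest = m, 0, 0, 0
    for _ in range(insertions):
        cost = 1                          # elementary insert of the new element
        if num == size:
            cost += 2 * size + num        # create a table of size 2*size, then move num elements
            size *= 2
        num += 1
        pool += payment - cost
        lowest = min(lowest, pool)
    return lowest

print(min_pool(4, 1000, payment=7))       # 0: the pool never goes into debt
print(min_pool(4, 1000, payment=3) < 0)   # True: the old payment no longer suffices
```

The simulation only shows that a payment of 7 is sufficient in this model; it does not claim that 7 is the smallest payment that works.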
References
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 17.2: The accounting method,
pp. 410–412.
Potential method
In computational complexity theory, the potential method is a method used to analyze the amortized time and space
complexity of a data structure, a measure of its performance over sequences of operations that smooths out the cost
of infrequent but expensive operations.
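In the potential method, a potential function Φ is chosen that maps each state S of the data structure to a
non-negative number Φ(S), with the potential of the initial state taken to be zero. If an operation o takes the data
structure from state S_before to state S_after, its amortized time is defined as
$$T_{\mathrm{amortized}}(o) = T_{\mathrm{actual}}(o) + C \cdot \bigl(\Phi(S_{\mathrm{after}}) - \Phi(S_{\mathrm{before}})\bigr),$$
where C is a non-negative constant of proportionality (in units of time) that must remain fixed throughout the
analysis. That is, the amortized time is defined to be the actual time taken by the operation plus C times the
difference in potential caused by the operation.
For a sequence of operations O = o_1, o_2, ..., o_k, with S_0 the initial state and S_i the state after operation o_i,
the total amortized time is an upper bound on the total actual time: $T_{\mathrm{actual}}(O) \le T_{\mathrm{amortized}}(O)$. In more
detail,
$$T_{\mathrm{amortized}}(O) = \sum_{i=1}^{k} \Bigl( T_{\mathrm{actual}}(o_i) + C \cdot \bigl(\Phi(S_i) - \Phi(S_{i-1})\bigr) \Bigr) = T_{\mathrm{actual}}(O) + C \cdot \bigl(\Phi(S_k) - \Phi(S_0)\bigr) \ge T_{\mathrm{actual}}(O),$$
where the sequence of potential function values forms a telescoping series in which all terms other than the initial
and final potential function values cancel in pairs, and where the final inequality arises from the assumptions that
$\Phi(S_k) \ge 0$ and $\Phi(S_0) = 0$. Therefore, amortized time can be used to provide accurate predictions about
the actual time of sequences of operations, even though the amortized time for an individual operation may vary
widely from its actual time.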
Example
A dynamic array is a data structure for maintaining an array of items, allowing both random access to positions
within the array and the ability to increase the array size by one. It is available in Java as the "ArrayList" type and in
Python as the "list" type. A dynamic array may be implemented by a data structure consisting of an array A of items,
of some length N, together with a number n ≤ N representing the positions within the array that have been used so
far. With this structure, random accesses to the dynamic array may be implemented by accessing the same cell of the
internal array A, and when n < N an operation that increases the dynamic array size may be implemented simply by
incrementing n. However, when n = N, it is necessary to resize A, and a common strategy for doing so is to double its
size, replacing A by a new array of length 2n.[1]
This structure may be analyzed using a potential function Φ = 2n − N. Since the resizing strategy always causes A to
be at least half-full, this potential function is always non-negative, as desired. When an increase-size operation does
not lead to a resize operation, Φ increases by 2, a constant. Therefore, the constant actual time of the operation and
the constant increase in potential combine to give a constant amortized time for an operation of this type. However,
when an increase-size operation causes a resize, the potential value of n prior to the resize decreases to zero after the
resize. Allocating a new internal array A and copying all of the values from the old internal array to the new one
takes O(n) actual time, but (with an appropriate choice of the constant of proportionality C) this is entirely cancelled
by the decrease of n in the potential function, leaving again a constant total amortized time for the operation. The
other operations of the data structure (reading and writing array cells without changing the array size) do not cause
the potential function to change and have the same constant amortized time as their actual time.
Therefore, with this choice of resizing strategy and potential function, the potential method shows that all dynamic
array operations take constant amortized time. Combining this with the inequality relating amortized time and actual
time over sequences of operations, this shows that any sequence of n dynamic array operations takes O(n) actual
time in the worst case, despite the fact that some of the individual operations may themselves take a linear amount of
time.
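The argument can be checked numerically. The following Python sketch computes, for each increase-size operation, the actual cost plus the change in the potential 2n − N, taking C = 1. The cost model (one unit per cell written or copied), the empty starting array, and the function name are assumptions made for this illustration.

```python
def amortized_costs(appends):
    """Amortized cost of each append, using Phi = 2n - N and C = 1."""
    n, N = 0, 0                      # used cells and capacity; an empty start is an assumption
    costs = []
    for _ in range(appends):
        phi_before = 2 * n - N
        actual = 1                   # write the new item
        if n == N:                   # full: copy n items into an array of length 2n
            actual += n
            N = max(1, 2 * n)        # max(1, ...) handles the empty initial array
        n += 1
        phi_after = 2 * n - N
        costs.append(actual + (phi_after - phi_before))
    return costs

print(max(amortized_costs(1000)))    # 3
```

Every operation, including those that trigger a resize, has amortized cost at most 3 in this model, because the drop in potential pays for the copying.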
Applications
The potential function method is commonly used to analyze Fibonacci heaps, a form of priority queue in which
removing an item takes logarithmic amortized time, and all other operations take constant amortized time.[2] It may
also be used to analyze splay trees, a self-adjusting form of binary search tree with logarithmic amortized time per
operation.[3]
References
[1] Goodrich and Tamassia, 1.5.2 Analyzing an Extendable Array Implementation, pp. 139–141; Cormen et al., 17.4 Dynamic tables, pp.
416–424.
[2] Cormen et al., Chapter 20, "Fibonacci Heaps", pp. 476–497.
[3] Goodrich and Tamassia, Section 3.4, "Splay Trees", pp. 185–194.
Sequences
Array data type
In computer science, an array type is a data type that is meant to describe a collection of elements (values or
variables), each selected by one or more indices (identifying keys) that can be computed at run time by the program.
Such a collection is usually called an array variable, array value, or simply array.[1] By analogy with the
mathematical concepts of vector and matrix, array types with one and two indices are often called vector type and
matrix type, respectively.
Language support for array types may include certain built-in array data types, some syntactic constructions (array
type constructors) that the programmer may use to define such types and declare array variables, and special notation
for indexing array elements. For example, in the Pascal programming language, the declaration type MyTable
= array [1..4,1..2] of integer, defines a new array data type called MyTable. The declaration var
A: MyTable then defines a variable A of that type, which is an aggregate of eight elements, each being an integer
variable identified by two indices. In the Pascal program, those elements are denoted A[1,1], A[1,2],
A[2,1], ..., A[4,2].[2] Special array types are often defined by the language's standard libraries.
Arrays are distinguished from lists in that arrays allow random access, while lists only allow sequential access.
Dynamic lists are also more common and easier to implement than dynamic arrays. Array types are distinguished
from record types mainly because they allow the element indices to be computed at run time, as in the Pascal
assignment A[I,J] := A[N-I,2*J]. Among other things, this feature allows a single iterative statement to
process arbitrarily many elements of an array variable.
In more theoretical contexts, especially in type theory and in the description of abstract algorithms, the terms "array"
and "array type" sometimes refer to an abstract data type (ADT), also called abstract array, or may refer to an
associative array, a mathematical model with the basic operations and behavior of a typical array type in most
languages: basically, a collection of elements that are selected by indices computed at run-time.
Depending on the language, array types may overlap (or be identified with) other data types that describe aggregates
of values, such as lists and strings. Array types are often implemented by array data structures, but sometimes by
other means, such as hash tables, linked lists, or search trees.
History
Assembly languages and low-level languages like BCPL[3] generally have no syntactic support for arrays.
Because of the importance of array structures for efficient computation, the earliest high-level programming
languages, including FORTRAN (1957), COBOL (1960), and Algol 60 (1960), provided support for
multi-dimensional arrays.
Abstract arrays
An array data structure can be mathematically modeled as an abstract data structure (an abstract array) with two
operations
get(A, I): the data stored in the element of the array A whose indices are the integer tuple I.
set(A,I,V): the array that results by setting the value of that element to V.
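These operations are required to satisfy the axioms[4]
get(set(A, I, V), I) = V
get(set(A, I, V), J) = get(A, J) if I ≠ J
for any array state A, any value V, and any index tuples I, J for which the operations are defined.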
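A minimal Python sketch of this functional view follows. It models an array value as a mapping from index tuples to values; the dict-based representation and the name set_ (chosen to avoid shadowing Python's built-in set) are assumptions of the illustration. The assertions check that a stored value is read back and that other elements are unaffected.

```python
def get(A, I):
    """Value stored at index tuple I."""
    return A[I]

def set_(A, I, V):
    """New array equal to A except that element I holds V; A itself is unchanged."""
    B = dict(A)
    B[I] = V
    return B

A = {(1, 1): 10, (1, 2): 20}
B = set_(A, (1, 1), 99)
assert get(B, (1, 1)) == 99               # the stored value is read back
assert get(B, (1, 2)) == get(A, (1, 2))   # other elements are untouched
assert get(A, (1, 1)) == 10               # the original array value is not modified
```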
Implementations
In order to effectively implement variables of such types as array structures (with indexing done by pointer
arithmetic), many languages restrict the indices to integer data types (or other types that can be interpreted as
integers, such as bytes and enumerated types), and require that all elements have the same data type and storage size.
Most of those languages also restrict each index to a finite interval of integers, that remains fixed throughout the
lifetime of the array variable. In some compiled languages, in fact, the index ranges may have to be known at
compile time.
On the other hand, some programming languages provide more liberal array types, that allow indexing by arbitrary
values, such as floating-point numbers, strings, objects, references, etc. Such index values cannot be restricted to an
interval, much less a fixed interval. So, these languages usually allow arbitrary new elements to be created at any
time. This choice precludes the implementation of array types as array data structures. That is, those languages use
array-like syntax to implement a more general associative array semantics, and must therefore be implemented by a
hash table or some other search data structure.
Language support
Multi-dimensional arrays
The number of indices needed to specify an element is called the dimension, dimensionality, or rank of the array
type. (This nomenclature conflicts with the concept of dimension in linear algebra,[5] where it is the number of
elements. Thus, an array of numbers with 5 rows and 4 columns, hence 20 elements, is said to have dimension 2 in
computing contexts, but represents a matrix with dimension 5-by-4 or 20 in mathematics. Also, the computer science
meaning of "rank" is similar to its meaning in tensor algebra but not to the linear algebra concept of rank of a
matrix.)
Many languages support only one-dimensional arrays. In those languages, a
multi-dimensional array is typically represented by an Iliffe vector, a
one-dimensional array of references to arrays of one dimension less. A
two-dimensional array, in particular, would be implemented as a vector of pointers
to its rows. Thus an element in row i and column j of an array A would be accessed
by double indexing (A[i][j] in typical notation). This way of emulating
multi-dimensional arrays allows the creation of ragged or jagged arrays, where each
row may have a different size or, in general, where the valid range of each index depends on the values of all
preceding indices.
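As a small illustration, a Python list of lists behaves like such a vector of references to rows, so rows of different lengths are possible. The variable names are hypothetical.

```python
# A two-dimensional array emulated as a list of row lists (an Iliffe vector).
rows = [[1, 2, 3],
        [4, 5],
        [6]]

print(rows[1][0])                 # 4: row 1, column 0, reached by double indexing
print([len(r) for r in rows])     # [3, 2, 1]: each row has its own valid column range
```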
This representation for multi-dimensional arrays is quite prevalent in C and C++ software. However, C and C++ will
use a linear indexing formula for multi-dimensional arrays that are declared as such, e.g. by int A[10][20] or
int A[m][n], instead of the traditional int **A.[6]:p.81
Indexing notation
Most programming languages that support arrays support the store and select operations, and have special syntax for
indexing. Early languages used parentheses, e.g. A(i,j), as in FORTRAN; others choose square brackets, e.g.
A[i,j] or A[i][j], as in Algol 60 and Pascal.
Index types
Array data types are most often implemented as array structures: with the indices restricted to integer (or totally
ordered) values, index ranges fixed at array creation time, and multilinear element addressing. This was the case in
most "third generation" languages, and is still the case of most systems programming languages such as Ada, C, and
C++. In some languages, however, array data types have the semantics of associative arrays, with indices of arbitrary
type and dynamic element creation. This is the case in some scripting languages such as Awk and Lua, and of some
array types provided by standard C++ libraries.
Bounds checking
Some languages (like Pascal and Modula) perform bounds checking on every access, raising an exception or
aborting the program when any index is out of its valid range. Compilers may allow these checks to be turned off to
trade safety for speed. Other languages (like FORTRAN and C) trust the programmer and perform no checks. Good
compilers may also analyze the program to determine the range of possible values that the index may have, and this
analysis may lead to bounds-checking elimination.
Index origin
Some languages, such as C, provide only zero-based array types, for which the minimum valid value for any index is
0. This choice is convenient for array implementation and address computations. With a language such as C, a
pointer to the interior of any array can be defined that will symbolically act as a pseudo-array that accommodates
negative indices. This works only because C does not check an index against bounds when used.
Other languages provide only one-based array types, where each index starts at 1; this is the traditional convention in
mathematics for matrices and mathematical sequences. A few languages, such as Pascal, support n-based array
types, whose minimum legal indices are chosen by the programmer. The relative merits of each choice have been the
subject of heated debate. Zero-based indexing has been argued to have a natural advantage over one-based indexing
in avoiding off-by-one or fencepost errors.[7]
See comparison of programming languages (array) for the base indices used by various languages.
Highest index
The relation between numbers appearing in an array declaration and the index of that array's last element also varies
by language. In many languages (such as C), one should specify the number of elements contained in the array;
whereas in others (such as Pascal and Visual Basic .NET) one should specify the numeric value of the index of the
last element. Needless to say, this distinction is immaterial in languages where the indices start at 1.
Array algebra
Some programming languages (including APL, Matlab, and newer versions of Fortran) directly support array
programming, where operations and functions defined for certain data types are implicitly extended to arrays of
elements of those types. Thus one can write A+B to add corresponding elements of two arrays A and B. The
multiplication operation may be merely distributed over corresponding elements of the operands (APL) or may be
interpreted as the matrix product of linear algebra (Matlab).
Slicing
An array slicing operation takes a subset of the elements of an array-typed entity (value or variable) and then
assembles them as another array-typed entity, possibly with other indices. If array types are implemented as array
structures, many useful slicing operations (such as selecting a sub-array, swapping indices, or reversing the direction
of the indices) can be performed very efficiently by manipulating the dope vector of the structure. The possible
slicings depend on the implementation details: for example, FORTRAN allows slicing off one column of a matrix
variable, but not a row, and treating it as a vector; whereas C allows slicing off a row from a matrix, but not a column.
On the other hand, other slicing operations are possible when array types are implemented in other ways.
Resizing
Some languages allow dynamic arrays (also called resizable, growable, or extensible): array variables whose index
ranges may be expanded at any time after creation, without changing the values of their current elements.
For one-dimensional arrays, this facility may be provided as an operation "append(A,x)" that increases the size of
the array A by one and then sets the value of the last element to x. Other array types (such as Pascal strings) provide a
concatenation operator, which can be used together with slicing to achieve that effect and more. In some languages,
assigning a value to an element of an array automatically extends the array, if necessary, to include that element. In
other array types, a slice can be replaced by an array of a different size, with subsequent elements being renumbered
accordingly, as in Python's list assignment "A[5:5] = [10,20,30]", which inserts three new elements (10, 20, and 30)
before element "A[5]". Resizable arrays are conceptually similar to lists, and the two concepts are synonymous in
some languages.
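The slice assignment mentioned above can be tried directly in Python. The starting list is an arbitrary example chosen for this illustration.

```python
A = list(range(8))          # [0, 1, 2, 3, 4, 5, 6, 7]
A[5:5] = [10, 20, 30]       # replace the empty slice just before A[5]
print(A)                    # [0, 1, 2, 3, 4, 10, 20, 30, 5, 6, 7]
A[1:3] = [99]               # replace a two-element slice with a single element
print(A)                    # [0, 99, 3, 4, 10, 20, 30, 5, 6, 7]
```

In both cases the elements after the replaced slice are renumbered automatically.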
An extensible array can be implemented as a fixed-size array, with a counter that records how many elements are
actually in use. The append operation merely increments the counter until the whole array is used, at which point the
append operation may be defined to fail. This is an implementation of a dynamic array with a fixed capacity, as in
the string type of Pascal. Alternatively, the append operation may re-allocate the underlying array with a
larger size, and copy the old elements to the new area.
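The fixed-capacity variant can be sketched in Python as follows. The class name BoundedArray and the choice of OverflowError to signal failure are assumptions of this illustration, not part of any particular language's array type.

```python
class BoundedArray:
    """Extensible array backed by a fixed-size buffer and a counter."""

    def __init__(self, capacity):
        self._cells = [None] * capacity   # storage allocated once, never resized
        self._count = 0                   # number of cells actually in use

    def append(self, x):
        if self._count == len(self._cells):
            raise OverflowError("capacity exhausted")   # append is defined to fail
        self._cells[self._count] = x
        self._count += 1

    def __len__(self):
        return self._count

    def __getitem__(self, i):
        if not 0 <= i < self._count:
            raise IndexError(i)
        return self._cells[i]

b = BoundedArray(2)
b.append("x")
b.append("y")
print(len(b), b[1])    # 2 y
```

A reallocating variant would, instead of raising, allocate a larger buffer and copy the used cells into it.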
References
[1] Robert W. Sebesta (2001) Concepts of Programming Languages. Addison-Wesley. 4th edition (1998), 5th edition (2001), ISBN
0-201-38596-1 ISBN13: 9780201385960
[2] K. Jensen and Niklaus Wirth, PASCAL User Manual and Report. Springer. Paperback edition (2007) 184 pages, ISBN 3-540-06950-X ISBN
978-3540069508
[3] John Mitchell, Concepts of Programming Languages. Cambridge University Press.
[4] Luckham, Suzuki (1979), "Verification of array, record, and pointer operations in Pascal". ACM Transactions on Programming Languages and
Systems 1(2), 226–244.
[5] see the definition of a matrix
[6] Brian W. Kernighan and Dennis M. Ritchie (1988), The C programming Language. Prentice-Hall, 205 pages.
[7] Edsger W. Dijkstra, Why numbering should start at zero (http://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html)
External links
NIST's Dictionary of Algorithms and Data Structures: Array (http://www.nist.gov/dads/HTML/array.html)
History
The first digital computers used machine-language programming to set up and access array structures for data tables,
vector and matrix computations, and for many other purposes. Von Neumann wrote the first array-sorting program
(merge sort) in 1945, during the building of the first stored-program computer.[3]p.159 Array indexing was originally
done by self-modifying code, and later using index registers and indirect addressing. Some mainframes designed in
the 1960s, such as the Burroughs B5000 and its successors, had special instructions for array indexing that included
index-bounds checking.
Assembly languages generally have no special support for arrays, other than what the machine itself provides. The
earliest high-level programming languages, including FORTRAN (1957), COBOL (1960), and ALGOL 60 (1960),
had support for multi-dimensional arrays, and so has C (1972). In C++ (1983), class templates exist for
multi-dimensional arrays whose dimension is fixed at runtime as well as for runtime-flexible arrays.
Applications
Arrays are used to implement mathematical vectors and matrices, as well as other kinds of rectangular tables. Many
databases, small and large, consist of (or include) one-dimensional arrays whose elements are records.
Arrays are used to implement other data structures, such as heaps, hash tables, deques, queues, stacks, strings, and
VLists.
One or more large arrays are sometimes used to emulate in-program dynamic memory allocation, particularly
memory pool allocation. Historically, this has sometimes been the only way to allocate "dynamic memory" portably.
Arrays can be used to determine partial or complete control flow in programs, as a compact alternative to (otherwise
repetitive) multiple IF statements. They are known in this context as control tables and are used in conjunction with
a purpose built interpreter whose control flow is altered according to values contained in the array. The array may
contain subroutine pointers (or relative subroutine numbers that can be acted upon by SWITCH statements) that
direct the path of the execution.
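A control table of this kind might look as follows in Python, where functions stand in for subroutine pointers. The opcodes and routine names are invented for the illustration.

```python
def start():
    return "starting"

def stop():
    return "stopping"

def status():
    return "running"

# Control table: the opcode is the index, the entry is the routine to call.
CONTROL_TABLE = [start, stop, status]

def run(opcodes):
    """Interpret a sequence of opcodes by indexing the control table."""
    return [CONTROL_TABLE[op]() for op in opcodes]

print(run([0, 2, 1]))    # ['starting', 'running', 'stopping']
```

The indexing step replaces a chain of IF statements that would otherwise compare the opcode against each case in turn.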
One-dimensional arrays
A one-dimensional array (or single dimension array) is a type of linear array. Accessing its elements involves a
single subscript which can either represent a row or column index.
As an example consider the C declaration int anArrayName[10];
Syntax : datatype anArrayname[sizeofArray];
In the given example the array can contain 10 elements of any value available to the int type. In C, the array
element indices are 0-9 inclusive in this case. For example, the expressions anArrayName[0] and
anArrayName[9] are the first and last elements respectively.
For a vector with linear addressing, the element with index i is located at the address B + c × i, where B is a fixed
base address and c a fixed constant, sometimes called the address increment or stride.
If the valid element indices begin at 0, the constant B is simply the address of the first element of the array. For this
reason, the C programming language specifies that array indices always begin at 0; and many programmers will call
that element "zeroth" rather than "first".
However, one can choose the index of the first element by an appropriate choice of the base address B. For example,
if the array has five elements, indexed 1 through 5, and the base address B is replaced by B − 30c, then the indices of
those same elements will be 31 to 35. If the numbering does not start at 0, the constant B may not be the address of
any element.
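The addressing rule can be written as a one-line function. In the Python sketch below, the base address 1000 and the 4-byte stride are hypothetical values chosen only to make the arithmetic visible.

```python
def element_address(base, stride, i):
    """Address of element i under linear addressing: B + c * i."""
    return base + stride * i

B, c = 1000, 4                                  # hypothetical base address and element size
zero_based = [element_address(B, c, i) for i in range(0, 5)]
one_based = [element_address(B, c, i) for i in range(1, 6)]
print(zero_based)    # [1000, 1004, 1008, 1012, 1016]: B is the first element
print(one_based)     # [1004, 1008, 1012, 1016, 1020]: B itself holds no element
```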
Multidimensional arrays
For a two-dimensional array, the element with indices i,j would have address B + c × i + d × j, where the coefficients c
and d are the row and column address increments, respectively.
More generally, in a k-dimensional array, the address of an element with indices i1, i2, …, ik is
B + c1 × i1 + c2 × i2 + … + ck × ik.
For example: int a[3][2];
This means that array a has 3 rows and 2 columns, and the array is of integer type. It holds 6 elements, which are
stored linearly, starting with the first row and continuing with the second and third rows. The above array will be
stored as a[0][0], a[0][1], a[1][0], a[1][1], a[2][0], a[2][1].
This formula requires only k multiplications and k additions, for any array that can fit in memory. Moreover, if any
coefficient is a fixed power of 2, the multiplication can be replaced by bit shifting.
The coefficients ck must be chosen so that every valid index tuple maps to the address of a distinct element.
If the minimum legal value for every index is 0, then B is the address of the element whose indices are all zero. As in
the one-dimensional case, the element indices may be changed by changing the base address B. Thus, if a
two-dimensional array has rows and columns indexed from 1 to 10 and 1 to 20, respectively, then replacing B by B +
c1 − 3 c2 will cause them to be renumbered from 0 through 9 and 4 through 23, respectively. Taking advantage of
this feature, some languages (like FORTRAN 77) specify that array indices begin at 1, as in mathematical tradition;
while other languages (like Fortran 90, Pascal and Algol) let the user choose the minimum value for each index.
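The renumbering of a two-dimensional array by a change of base address can be verified with a few lines of Python. The base address and the increments c1 = 80 and c2 = 4 are hypothetical values (20 columns of 4-byte elements).

```python
def address(base, strides, index):
    """B + c1*i1 + ... + ck*ik for an index tuple."""
    return base + sum(c * i for c, i in zip(strides, index))

B, c1, c2 = 5000, 80, 4              # hypothetical base address and increments
new_B = B + c1 - 3 * c2              # shifted base address

# Rows 1..10 become 0..9 and columns 1..20 become 4..23, with no element moving.
same_cells = all(
    address(B, (c1, c2), (i, j)) == address(new_B, (c1, c2), (i - 1, j + 3))
    for i in range(1, 11)
    for j in range(1, 21)
)
print(same_cells)                    # True
```

Lowering an index range by 1 adds one increment to the base, and raising a range by 3 subtracts three increments, so the addresses of the elements themselves are unchanged.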
Dope vectors
The addressing formula is completely defined by the dimension k, the base address B, and the increments c1, c2, …,
ck. It is often useful to pack these parameters into a record called the array's descriptor or stride vector or dope
vector. The size of each element, and the minimum and maximum values allowed for each index may also be
included in the dope vector. The dope vector is a complete handle for the array, and is a convenient way to pass
arrays as arguments to procedures. Many useful array slicing operations (such as selecting a sub-array, swapping
indices, or reversing the direction of the indices) can be performed very efficiently by manipulating the dope vector.
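The following Python sketch shows how a descriptor of this kind allows index swapping and index reversal without copying any data. The DopeVector class, its field names, and its methods are hypothetical and only illustrate the idea; real implementations differ in detail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DopeVector:
    base: int            # offset of the element whose indices are all zero
    shape: tuple         # number of valid values for each index
    strides: tuple       # address increment for each index

    def offset(self, index):
        return self.base + sum(c * i for c, i in zip(self.strides, index))

    def transpose(self):
        """Swap the two indices of a two-dimensional array without touching the data."""
        return DopeVector(self.base, self.shape[::-1], self.strides[::-1])

    def reverse(self, axis):
        """Reverse the direction of one index without touching the data."""
        strides = list(self.strides)
        base = self.base + (self.shape[axis] - 1) * strides[axis]
        strides[axis] = -strides[axis]
        return DopeVector(base, self.shape, tuple(strides))

data = [1, 2, 3, 4, 5, 6]                    # a 2 x 3 array stored row by row
a = DopeVector(base=0, shape=(2, 3), strides=(3, 1))
t = a.transpose()                            # 3 x 2 view of the same storage
r = a.reverse(1)                             # columns in reverse order

print(data[a.offset((1, 0))], data[t.offset((0, 1))], data[r.offset((0, 0))])   # 4 4 3
```

Each operation builds a new small record; the six stored values are never moved.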
Compact layouts
Often the coefficients are chosen so that the elements occupy a contiguous area of memory. However, that is not
necessary. Even if arrays are always created with contiguous elements, some array slicing operations may create
non-contiguous sub-arrays from them.
There are two systematic compact layouts for a two-dimensional array. For example, consider the matrix
1 2 3
4 5 6
7 8 9
In the row-major order layout (adopted by C for statically declared arrays), the elements in each row are stored in
consecutive positions and all of the elements of a row have a lower address than any of the elements of a consecutive
row:
1 2 3 4 5 6 7 8 9
In column-major order (traditionally used by Fortran), the elements in each column are consecutive in memory and
all of the elements of a column have a lower address than any of the elements of a consecutive column:
1 4 7 2 5 8 3 6 9
For arrays with three or more indices, "row major order" puts in consecutive positions any two elements whose index
tuples differ only by one in the last index. "Column major order" is analogous with respect to the first index.
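Both layouts can be produced from a nested list in Python. The sketch below uses the 3-by-3 matrix with rows 1 2 3, 4 5 6 and 7 8 9; the variable names are arbitrary.

```python
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]

# Row-major: the last index varies fastest.
row_major = [x for row in matrix for x in row]
# Column-major: the first index varies fastest.
col_major = [matrix[i][j] for j in range(3) for i in range(3)]

print(row_major)    # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(col_major)    # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```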
In systems which use processor cache or virtual memory, scanning an array is much faster if successive elements are
stored in consecutive positions in memory, rather than sparsely scattered. Many algorithms that use
multidimensional arrays will scan them in a predictable order. A programmer (or a sophisticated compiler) may use
this information to choose between row- or column-major layout for each array. For example, when computing the
product AB of two matrices, it would be best to have A stored in row-major order, and B in column-major order.
Array resizing
Static arrays have a size that is fixed when they are created and consequently do not allow elements to be inserted or
removed. However, by allocating a new array and copying the contents of the old array to it, it is possible to
effectively implement a dynamic version of an array; see dynamic array. If this operation is done infrequently,
insertions at the end of the array require only amortized constant time.
Some array data structures do not reallocate storage, but do store a count of the number of elements of the array in
use, called the count or size. This effectively makes the array a dynamic array with a fixed maximum size or
capacity; Pascal strings are examples of this.
Non-linear formulas
More complicated (non-linear) formulas are occasionally used. For a compact two-dimensional triangular array, for
instance, the addressing formula is a polynomial of degree 2.
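One common packing for a lower-triangular array stores row i after all shorter rows, which gives a quadratic formula. The Python sketch below assumes zero-based indices with 0 ≤ j ≤ i; the function name is hypothetical and other packings are possible.

```python
def tri_index(i, j):
    """Position of element (i, j), with 0 <= j <= i, in a packed lower-triangular array."""
    return i * (i + 1) // 2 + j

# The first four rows occupy positions 0..9 with no gaps.
positions = [tri_index(i, j) for i in range(4) for j in range(i + 1)]
print(positions)    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Rows 0 to i − 1 hold 1 + 2 + ... + i = i(i + 1)/2 elements in total, which is where the degree-2 term comes from.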
Efficiency
Both store and select take (deterministic worst case) constant time. Arrays take linear (O(n)) space in the number of
elements n that they hold.
In an array with element size k and on a machine with a cache line size of B bytes, iterating through an array of n
elements requires the minimum of ceiling(nk/B) cache misses, because its elements occupy contiguous memory
locations. This is roughly a factor of B/k better than the number of cache misses needed to access n elements at
random memory locations. As a consequence, sequential iteration over an array is noticeably faster in practice than
iteration over many other data structures, a property called locality of reference. (This does not mean, however, that
using a perfect hash or trivial hash within the same (local) array will not be even faster, and achievable in constant
time.) Libraries provide low-level optimized facilities for copying ranges of memory (such as memcpy) which can be
used to move contiguous blocks of array elements significantly faster than can be achieved through individual
element access. The speedup of such optimized routines varies by array element size, architecture, and
implementation.
Memory-wise, arrays are compact data structures with no per-element overhead. There may be a per-array overhead,
e.g. to store index bounds, but this is language-dependent. It can also happen that elements stored in an array require
less memory than the same elements stored in individual variables, because several array elements can be stored in a
single word; such arrays are often called packed arrays. An extreme (but commonly used) case is the bit array, where
every bit represents a single element. A single octet can thus hold up to 256 different combinations of up to 8
different conditions, in the most compact form.
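A packed bit array can be sketched in Python on top of a bytearray, with eight elements per octet. The class name BitArray and its method names are hypothetical; this is one possible layout, not a standard library type.

```python
class BitArray:
    """Fixed-size array of booleans packed eight to a byte."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray((size + 7) // 8)  # round up to whole bytes

    def set(self, i, value=True):
        byte, bit = divmod(i, 8)
        if value:
            self.data[byte] |= 1 << bit
        else:
            self.data[byte] &= ~(1 << bit) & 0xFF

    def get(self, i):
        byte, bit = divmod(i, 8)
        return bool((self.data[byte] >> bit) & 1)


bits = BitArray(20)  # 20 flags fit in 3 bytes
bits.set(0)
bits.set(9)
```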
Array accesses with statically predictable access patterns are a major source of data parallelism.
                            Linked list                    Array  Dynamic array   Balanced tree  Random access list
Indexing                    Θ(n)                           Θ(1)   Θ(1)            Θ(log n)       Θ(log n)
Insert/delete at beginning  Θ(1)                           N/A    Θ(n)            Θ(log n)       Θ(1)
Insert/delete at end        Θ(n) if last element unknown;  N/A    Θ(1) amortized  Θ(log n)       Θ(log n) updating
                            Θ(1) if last element known
Insert/delete in middle     search time + Θ(1) [4][5][6]   N/A    Θ(n)            Θ(log n)       Θ(log n) updating
Wasted space (average)      Θ(n)                           0      Θ(n)            Θ(n)           Θ(n)
Growable arrays are similar to arrays but add the ability to insert and delete elements; adding and deleting at the end
is particularly efficient. However, they reserve linear (Θ(n)) additional storage, whereas arrays do not reserve
additional storage.
Associative arrays provide a mechanism for array-like functionality without huge storage overheads when the index
values are sparse. For example, an array that contains values only at indexes 1 and 2 billion may benefit from using
such a structure. Specialized associative arrays with integer keys include Patricia tries, Judy arrays, and van Emde
Boas trees.
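The sparse-index case can be illustrated with Python's built-in dict, used here as a general-purpose associative array rather than one of the specialized integer-keyed structures. The variable name sparse and the stored values are placeholders.

```python
# Only two indexes hold values, so only two entries are stored,
# instead of reserving two billion array slots.
sparse = {}
sparse[1] = "first"
sparse[2_000_000_000] = "last"

# Indexes that were never assigned are simply absent.
missing = sparse.get(12345)
```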
Meaning of dimension
The dimension of an array is the number of indices needed to select an element. Thus, if the array is seen as a
function on a set of possible index combinations, it is the dimension of the space of which its domain is a discrete
subset. Thus a one-dimensional array is a list of data, a two-dimensional array a rectangle of data, a
three-dimensional array a block of data, etc.
This should not be confused with the dimension of the set of all matrices with a given domain, that is, the number of
elements in the array. For example, an array with 5 rows and 4 columns is two-dimensional, but such matrices form a
20-dimensional space. Similarly, a three-dimensional vector can be represented by a one-dimensional array of size
three.
References
[1] David R. Richardson (2002), The Book on Data Structures. iUniverse, 112 pages. ISBN 0-595-24039-9, ISBN 978-0-595-24039-5.
[2] T. Veldhuizen. Arrays in Blitz++. In Proc. of the 2nd Int. Conf. on Scientific Computing in Object-Oriented Parallel Environments
(ISCOPE), LNCS 1505, pages 223-220. Springer, 1998.
[3] Donald Knuth, The Art of Computer Programming, vol. 3. Addison-Wesley
[4] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity
Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[5] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style)
at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44
[6] Number crunching: Why you should never, ever, EVER use linked-list in your code again
(http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com
[7] Counted B-Tree (http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html)
Dynamic array
In computer science, a dynamic array, growable array, resizable
array, dynamic table, mutable array, or array list is a random
access, variable-size list data structure that allows elements to be added
or removed. It is supplied with standard libraries in many modern
mainstream programming languages.
A dynamic array is not the same thing as a dynamically allocated array,
which is an array whose size is fixed when the array is
allocated, although a dynamic array may use such a fixed-size array as
a back end.[1]
In applications where the logical size is bounded, the fixed-size data structure suffices. This may be short-sighted, as
more space may be needed later. A philosophical programmer may prefer to write the code to make every array
capable of resizing from the outset, then return to using fixed-size arrays during program optimization. Resizing the
underlying array is an expensive task, typically involving copying the entire contents of the array.
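A minimal Python sketch of a dynamic array built on a fixed-size backing array follows. The class name DynamicArray, its method names, and the capacity-doubling growth policy are assumptions chosen for illustration; real library implementations differ in growth factor and in how they shrink.

```python
class DynamicArray:
    """Variable-size list backed by a fixed-size array that is replaced when full."""

    def __init__(self):
        self._capacity = 1
        self._size = 0
        self._slots = [None] * self._capacity  # stands in for a fixed-size array

    def __len__(self):
        return self._size

    def __getitem__(self, i):
        if not 0 <= i < self._size:
            raise IndexError(i)
        return self._slots[i]

    def append(self, value):
        if self._size == self._capacity:
            self._resize(2 * self._capacity)  # assumed policy: double when full
        self._slots[self._size] = value
        self._size += 1

    def pop(self):
        if self._size == 0:
            raise IndexError("pop from empty array")
        self._size -= 1
        value = self._slots[self._size]
        self._slots[self._size] = None
        return value

    def _resize(self, new_capacity):
        # The expensive step: every existing element is copied.
        new_slots = [None] * new_capacity
        for i in range(self._size):
            new_slots[i] = self._slots[i]
        self._slots = new_slots
        self._capacity = new_capacity
```

Because the capacity doubles, copies happen only when the size reaches a power of two, which is what keeps appends cheap on average.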
Performance
                            Linked list                    Array  Dynamic array   Balanced tree  Random access list
Indexing                    Θ(n)                           Θ(1)   Θ(1)            Θ(log n)       Θ(log n)
Insert/delete at beginning  Θ(1)                           N/A    Θ(n)            Θ(log n)       Θ(1)
Insert/delete at end        Θ(n) if last element unknown;  N/A    Θ(1) amortized  Θ(log n)       Θ(log n) updating
                            Θ(1) if last element known
Insert/delete in middle     search time + Θ(1) [3][4][5]   N/A    Θ(n)            Θ(log n)       Θ(log n) updating
Wasted space (average)      Θ(n)                           0      Θ(n)            Θ(n)           Θ(n)
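The dynamic array has performance similar to an array, with the addition of new operations to add and remove
elements at the end. Getting or setting the value at a particular index takes constant time; iterating over the
elements in order takes linear time; inserting or deleting an element in the middle takes linear time; and inserting
or deleting an element at the end takes amortized constant time.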
Dynamic arrays benefit from many of the advantages of arrays, including good locality of reference and data cache
utilization, compactness (low memory use), and random access. They usually have only a small fixed additional
overhead for storing information about the size and capacity. This makes dynamic arrays an attractive tool for
building cache-friendly data structures. However, in languages like Python or Java that enforce reference semantics,
the dynamic array generally will not store the actual data, but rather it will store references to the data that resides in
other areas of memory. In this case, accessing items in the array sequentially will actually involve accessing multiple
non-contiguous areas of memory, so many of the advantages of the cache-friendliness of this data structure are lost.
Compared to linked lists, dynamic arrays have faster indexing (constant time versus linear time) and typically faster
iteration due to improved locality of reference; however, dynamic arrays require linear time to insert or delete at an
arbitrary location, since all following elements must be moved, while linked lists can do this in constant time. This
disadvantage is mitigated by the gap buffer and tiered vector variants discussed under Variants below. Also, in a
highly fragmented memory region, it may be expensive or impossible to find contiguous space for a large dynamic
array, whereas linked lists do not require the whole data structure to be stored contiguously.
A balanced tree can store a list while providing all operations of both dynamic arrays and linked lists reasonably
efficiently, but both insertion at the end and iteration over the list are slower than for a dynamic array, in theory and
in practice, due to non-contiguous storage and tree traversal/manipulation overhead.
Variants
Gap buffers are similar to dynamic arrays but allow efficient insertion and deletion operations clustered near the
same arbitrary location. Some deque implementations use array deques, which allow amortized constant time
insertion/removal at both ends, instead of just one end.
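The gap buffer idea can be sketched in Python as follows. The class GapBuffer, its method names, and the doubling of the gap when it fills are assumptions for illustration. Elements before the cursor sit at the front of the backing list, elements after it sit at the back, and the unused gap lies between them.

```python
class GapBuffer:
    """Sequence with cheap insertion and deletion at a movable cursor."""

    def __init__(self, capacity=8):
        self.buf = [None] * capacity
        self.gap_start = 0         # cursor position; gap is buf[gap_start:gap_end]
        self.gap_end = capacity

    def __len__(self):
        return len(self.buf) - (self.gap_end - self.gap_start)

    def move_cursor(self, pos):
        if not 0 <= pos <= len(self):
            raise IndexError(pos)
        while self.gap_start > pos:   # shift the gap left, one element at a time
            self.gap_start -= 1
            self.gap_end -= 1
            self.buf[self.gap_end] = self.buf[self.gap_start]
        while self.gap_start < pos:   # shift the gap right
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def insert(self, item):
        if self.gap_start == self.gap_end:  # gap exhausted: open a new one
            self._grow()
        self.buf[self.gap_start] = item
        self.gap_start += 1

    def delete(self):
        """Remove the element just before the cursor."""
        if self.gap_start == 0:
            raise IndexError("nothing before the cursor")
        self.gap_start -= 1

    def _grow(self):
        old = len(self.buf)
        tail = self.buf[self.gap_end:]
        self.buf = self.buf[:self.gap_start] + [None] * old + tail
        self.gap_end = self.gap_start + old

    def to_list(self):
        return self.buf[:self.gap_start] + self.buf[self.gap_end:]
```

Repeated insertions at the same cursor position touch only the gap; the cost of moving elements is paid when the cursor itself moves.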
Goodrich presented a dynamic array algorithm called Tiered Vectors that provided O(n^(1/2)) performance for order
preserving insertions or deletions from the middle of the array.
Hashed Array Tree (HAT) is a dynamic array algorithm published by Sitarski in 1996. Hashed Array Tree wastes
order n^(1/2) amount of storage space, where n is the number of elements in the array. The algorithm has O(1)
amortized performance when appending a series of objects to the end of a Hashed Array Tree.
In a 1999 paper, Brodnik et al. describe a tiered dynamic array data structure, which wastes only n^(1/2) space for n
elements at any point in time, and they prove a lower bound showing that any dynamic array must waste this much
space if the operations are to remain amortized constant time. Additionally, they present a variant where growing and
shrinking the buffer has not only amortized but worst-case constant time.
Bagwell (2002) presented the VList algorithm, which can be adapted to implement a dynamic array.
Language support
C++'s std::vector is an implementation of dynamic arrays, as are the ArrayList[6] classes supplied with the
Java API and the .NET Framework. The generic List<> class supplied with version 2.0 of the .NET Framework is
also implemented with dynamic arrays. Smalltalk's OrderedCollection is a dynamic array with dynamic start
and end-index, making the removal of the first element also O(1). Python's list datatype implementation is a
dynamic array. Delphi and D implement dynamic arrays at the language's core. Ada's
Ada.Containers.Vectors generic package provides dynamic array implementation for a given subtype. Many
scripting languages such as Perl and Ruby offer dynamic arrays as a built-in primitive data type. Several
cross-platform frameworks provide dynamic array implementations for C: CFArray and CFMutableArray in
Core Foundation; GArray and GPtrArray in GLib.
References
[1] See, for example, the source code of java.util.ArrayList class from OpenJDK 6
(http://hg.openjdk.java.net/jdk6/jdk6/jdk/file/e0e25ac28560/src/share/classes/java/util/ArrayList.java).
[2] List object implementation (http://svn.python.org/projects/python/trunk/Objects/listobject.c) from python.org, retrieved 2011-09-27.
[3] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity
Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[4] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style)
at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44
[5] Number crunching: Why you should never, ever, EVER use linked-list in your code again
(http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com
[6] Javadoc on
External links
NIST Dictionary of Algorithms and Data Structures: Dynamic array (http://www.nist.gov/dads/HTML/
dynamicarray.html)
VPOOL (http://www.bsdua.org/libbsdua.html#vpool) - C language implementation of dynamic array.
CollectionSpy (http://www.collectionspy.com) A Java profiler with explicit support for debugging
ArrayList- and Vector-related issues.
Open Data Structures - Chapter 2 - Array-Based Lists (http://opendatastructures.org/versions/edition-0.1e/
ods-java/2_Array_Based_Lists.html)
Linked list
In computer science, a linked list is a data structure consisting of a group of nodes which together represent a
sequence. In its simplest form, each node is composed of a datum and a reference (in other words, a link) to the
next node in the sequence; more complex variants add additional links. This structure allows for efficient insertion or
removal of elements from any position in the sequence.
A linked list whose nodes contain two fields: an integer value and a link to the next node. The last node is linked to a terminator used to signify the
end of the list.
Linked lists are among the simplest and most common data structures. They can be used to implement several other
common abstract data types, including lists (the abstract data type), stacks, queues, associative arrays, and
S-expressions, though it is not uncommon to implement the other data structures directly without using a list as the
basis of implementation.
The principal benefit of a linked list over a conventional array is that the list elements can easily be inserted or
removed without reallocation or reorganization of the entire structure because the data items need not be stored
contiguously in memory or on disk. Linked lists allow insertion and removal of nodes at any point in the list, and can
do so with a constant number of operations if the link previous to the link being added or removed is maintained
during list traversal.
On the other hand, simple linked lists by themselves do not allow random access to the data, or any form of efficient
indexing. Thus, many basic operations such as obtaining the last node of the list (assuming that the last node is
not maintained as a separate node reference in the list structure), or finding a node that contains a given datum, or
locating the place where a new node should be inserted may require scanning most or all of the list elements.
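The scanning cost described above can be seen in a minimal Python sketch of a singly linked list. The class Node and the helper names last_node and find are assumptions for illustration.

```python
class Node:
    """One list node: a datum and a link to the next node (None marks the end)."""

    def __init__(self, data, next=None):
        self.data = data
        self.next = next


def last_node(head):
    """Walk the whole list to reach its last node; there is no shortcut."""
    if head is None:
        return None
    node = head
    while node.next is not None:
        node = node.next
    return node


def find(head, datum):
    """Linear search: return the first node holding datum, or None."""
    node = head
    while node is not None and node.data != datum:
        node = node.next
    return node


head = Node(12, Node(99, Node(37)))
```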
History
Linked lists were developed in 1955-56 by Allen Newell, Cliff Shaw and Herbert A. Simon at RAND Corporation as
the primary data structure for their Information Processing Language. IPL was used by the authors to develop several
early artificial intelligence programs, including the Logic Theory Machine, the General Problem Solver, and a
computer chess program. Reports on their work appeared in IRE Transactions on Information Theory in 1956, and
several conference proceedings from 1957 to 1959, including Proceedings of the Western Joint Computer
Conference in 1957 and 1958, and Information Processing (Proceedings of the first UNESCO International
Conference on Information Processing) in 1959. The now-classic diagram consisting of blocks representing list
nodes with arrows pointing to successive list nodes appears in "Programming the Logic Theory Machine" by Newell
and Shaw in Proc. WJCC, February 1957. Newell and Simon were recognized with the ACM Turing Award in 1975
for having "made basic contributions to artificial intelligence, the psychology of human cognition, and list
processing". The problem of machine translation for natural language processing led Victor Yngve at Massachusetts
Institute of Technology (MIT) to use linked lists as data structures in his COMIT programming language for
computer research in the field of linguistics. A report on this language entitled "A programming language for
mechanical translation" appeared in Mechanical Translation in 1958.
LISP, standing for list processor, was created by John McCarthy in 1958 while he was at MIT and in 1960 he
published its design in a paper in the Communications of the ACM, entitled "Recursive Functions of Symbolic
Expressions and Their Computation by Machine, Part I". One of LISP's major data structures is the linked list. By
the early 1960s, the utility of both linked lists and languages which use these structures as their primary data
representation was well established. Bert Green of the MIT Lincoln Laboratory published a review article entitled
"Computer languages for symbol manipulation" in IRE Transactions on Human Factors in Electronics in March
1961 which summarized the advantages of the linked list approach. A later review article, "A Comparison of
list-processing computer languages" by Bobrow and Raphael, appeared in Communications of the ACM in April
1964.
Several operating systems developed by Technical Systems Consultants (originally of West Lafayette Indiana, and
later of Chapel Hill, North Carolina) used singly linked lists as file structures. A directory entry pointed to the first
sector of a file, and succeeding portions of the file were located by traversing pointers. Systems using this technique
included Flex (for the Motorola 6800 CPU), mini-Flex (same CPU), and Flex9 (for the Motorola 6809 CPU). A
variant developed by TSC for and marketed by Smoke Signal Broadcasting in California, used doubly linked lists in
the same manner.
The TSS/360 operating system, developed by IBM for the System 360/370 machines, used a doubly linked list for
its file system catalog. The directory structure was similar to Unix, where a directory could contain files and/or
other directories and extend to any depth. A utility flea was created to fix file system problems after a crash, since
modified portions of the file catalog were sometimes in memory when a crash occurred. Problems were detected by
comparing the forward and backward links for consistency. If a forward link was corrupt, then if a backward link to
the infected node was found, the forward link was set to the node with the backward link. A humorous comment in
the source code where this utility was invoked stated "Everyone knows a flea collar gets rid of bugs in cats".
A singly linked list whose nodes contain two fields: an integer value and a link to the next node
A doubly linked list whose nodes contain three fields: an integer value, the link forward to the next node, and the link backward to the previous
node
A technique known as XOR-linking allows a doubly linked list to be implemented using a single link field in each
node. However, this technique requires the ability to do bit operations on addresses, and therefore may not be
available in some high-level languages.
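Python does not expose machine addresses, so the following sketch imitates XOR-linking with array indices standing in for addresses. Index 0 is reserved as the null link, and all names are hypothetical. Each node stores a single field, the XOR of its predecessor's and successor's indices, yet the list can be walked in either direction.

```python
NULL = 0  # index 0 plays the role of the null address


def build_xor_list(values):
    """Store values at indices 1..n; link[i] = index of prev XOR index of next."""
    n = len(values)
    data = [None] + list(values)
    link = [0] * (n + 1)
    for i in range(1, n + 1):
        prev_i = i - 1                     # NULL for the first node
        next_i = i + 1 if i < n else NULL
        link[i] = prev_i ^ next_i
    head = 1 if n else NULL
    tail = n if n else NULL
    return data, link, head, tail


def traverse(data, link, start):
    """Walk from either end: the next index is link[cur] XOR the index we came from."""
    out = []
    prev, cur = NULL, start
    while cur != NULL:
        out.append(data[cur])
        prev, cur = cur, link[cur] ^ prev
    return out


data, link, head, tail = build_xor_list(["a", "b", "c", "d"])
```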
Circular list
In the last node of a list, the link field often contains a null reference, a special value used to indicate the lack of
further nodes. A less common convention is to make it point to the first node of the list; in that case the list is said to
be circular or circularly linked; otherwise it is said to be open or linear.
In the case of a circular doubly linked list, the only change that occurs is that the end, or "tail", of the said list is
linked back to the front, or "head", of the list and vice versa.
Sentinel nodes
In some implementations, an extra sentinel or dummy node may be added before the first data record and/or after
the last one. This convention simplifies and accelerates some list-handling algorithms, by ensuring that all links can
be safely dereferenced and that every list (even one that contains no data elements) always has a "first" and "last"
node.
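One way to see the simplification is a circular doubly linked list with a single sentinel, sketched below in Python. The class and method names are assumptions. Because every node, including the sentinel, always has valid prev and next links, insertion and removal need no checks for an empty list or for the ends.

```python
class DNode:
    def __init__(self, data=None):
        self.data = data
        self.prev = self   # a lone node links to itself
        self.next = self


class SentinelList:
    """Circular doubly linked list; the sentinel sits before the first and after the last node."""

    def __init__(self):
        self.sentinel = DNode()

    def is_empty(self):
        return self.sentinel.next is self.sentinel

    def insert_after(self, node, new_node):
        new_node.prev = node
        new_node.next = node.next
        node.next.prev = new_node
        node.next = new_node

    def push_front(self, data):
        self.insert_after(self.sentinel, DNode(data))

    def push_back(self, data):
        self.insert_after(self.sentinel.prev, DNode(data))

    def remove(self, node):
        # Safe for any data node: both neighbours always exist.
        node.prev.next = node.next
        node.next.prev = node.prev

    def __iter__(self):
        node = self.sentinel.next
        while node is not self.sentinel:
            yield node.data
            node = node.next
```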
Empty lists
An empty list is a list that contains no data records. This is usually the same as saying that it has zero nodes. If
sentinel nodes are being used, the list is usually said to be empty when it has only sentinel nodes.
Hash linking
The link fields need not be physically part of the nodes. If the data records are stored in an array and referenced by
their indices, the link field may be stored in a separate array with the same indices as the data records.
List handles
Since a reference to the first node gives access to the whole list, that reference is often called the address, pointer,
or handle of the list. Algorithms that manipulate linked lists usually get such handles to the input lists and return the
handles to the resulting lists. In fact, in the context of such algorithms, the word "list" often means "list handle". In
some situations, however, it may be convenient to refer to a list by a handle that consists of two links, pointing to its
first and last nodes.
Combining alternatives
The alternatives listed above may be arbitrarily combined in almost every way, so one may have circular doubly
linked lists without sentinels, circular singly linked lists with sentinels, etc.
Tradeoffs
As with most choices in computer programming and design, no method is well suited to all circumstances. A linked
list data structure might work well in one case, but cause problems in another. This is a list of some of the common
tradeoffs involving linked list structures.
                            Linked list                    Array  Dynamic array   Balanced tree  Random access list
Indexing                    Θ(n)                           Θ(1)   Θ(1)            Θ(log n)       Θ(log n)
Insert/delete at beginning  Θ(1)                           N/A    Θ(n)            Θ(log n)       Θ(1)
Insert/delete at end        Θ(n) if last element unknown;  N/A    Θ(1) amortized  Θ(log n)       Θ(log n) updating
                            Θ(1) if last element known
Insert/delete in middle     search time + Θ(1) [1][2][3]   N/A    Θ(n)            Θ(log n)       Θ(log n) updating
Wasted space (average)      Θ(n)                           0      Θ(n)            Θ(n)           Θ(n)
A dynamic array is a data structure that allocates all elements contiguously in memory, and keeps a count of the
current number of elements. If the space reserved for the dynamic array is exceeded, it is reallocated and (possibly)
copied, an expensive operation.
Linked lists have several advantages over dynamic arrays. Insertion or deletion of an element at a specific point of a
list, assuming that we have a pointer to the node (before the one to be removed, or before the insertion point)
already, is a constant-time operation (otherwise without this reference it is O(n)), whereas insertion in a dynamic
array at random locations will require moving half of the elements on average, and all the elements in the worst case.
While one can "delete" an element from an array in constant time by somehow marking its slot as "vacant", this
causes fragmentation that impedes the performance of iteration.
Moreover, arbitrarily many elements may be inserted into a linked list, limited only by the total memory available;
while a dynamic array will eventually fill up its underlying array data structure and will have to reallocate, an
expensive operation and one that may not even be possible if memory is fragmented. The cost of reallocation
can, however, be averaged over insertions, so the cost of an insertion remains amortized O(1). This
helps with appending elements at the array's end, but inserting into (or removing from) middle positions still carries
prohibitive costs due to data moving to maintain contiguity. An array from which many elements are removed may
also have to be resized in order to avoid wasting too much space.
On the other hand, dynamic arrays (as well as fixed-size array data structures) allow constant-time random access,
while linked lists allow only sequential access to elements. Singly linked lists, in fact, can only be traversed in one
direction. This makes linked lists unsuitable for applications where it's useful to look up an element by its index
quickly, such as heapsort. Sequential access on arrays and dynamic arrays is also faster than on linked lists on many
machines, because they have optimal locality of reference and thus make good use of data caching.
Another disadvantage of linked lists is the extra storage needed for references, which often makes them impractical
for lists of small data items such as characters or boolean values, because the storage overhead for the links may
exceed by a factor of two or more the size of the data. In contrast, a dynamic array requires only the space for the
data itself (and a very small amount of control data).[4] It can also be slow, and with a naive allocator, wasteful, to
allocate memory separately for each new element, a problem generally solved using memory pools.
Some hybrid solutions try to combine the advantages of the two representations. Unrolled linked lists store several
elements in each list node, increasing cache performance while decreasing memory overhead for references. CDR
coding does both these as well, by replacing references with the actual data referenced, which extends off the end of
the referencing record.
A good example that highlights the pros and cons of using dynamic arrays vs. linked lists is a
program that solves the Josephus problem. The Josephus problem is an election method that works by having a
group of people stand in a circle. Starting at a predetermined person, you count around the circle n times. Once you
reach the nth person, take them out of the circle and have the members close the circle. Then count around the circle
the same n times and repeat the process, until only one person is left. That person wins the election. This shows the
strengths and weaknesses of a linked list vs. a dynamic array, because if you view the people as connected nodes in a
circular linked list then it shows how easily the linked list is able to delete nodes (as it only has to rearrange the links
to the different nodes). However, the linked list will be poor at finding the next person to remove and will need to
search through the list until it finds that person. A dynamic array, on the other hand, will be poor at deleting nodes
(or elements) as it cannot remove one node without individually shifting all the elements up the list by one.
However, it is exceptionally easy to find the nth person in the circle by directly referencing them by their position in
the array.
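Both approaches to the Josephus problem can be sketched in Python. The function names, the numbering of people from 1, and the choice to start counting at person 1 are assumptions made for this example. The array version finds the next person by index arithmetic but pays for shifting elements on each removal; the linked version removes a person by relinking but must walk node by node to reach that person.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


def josephus_array(n_people, step):
    """Dynamic-array version: O(1) to locate the person, O(n) to delete them."""
    people = list(range(1, n_people + 1))
    idx = 0
    while len(people) > 1:
        idx = (idx + step - 1) % len(people)  # direct index arithmetic
        people.pop(idx)                       # shifts all later elements down
    return people[0]


def josephus_linked(n_people, step):
    """Circular-linked-list version: O(step) to locate the person, O(1) to delete them."""
    head = Node(1)
    prev = head
    for i in range(2, n_people + 1):
        prev.next = Node(i)
        prev = prev.next
    prev.next = head              # close the circle; prev is the node before person 1
    remaining = n_people
    while remaining > 1:
        for _ in range(step - 1):
            prev = prev.next      # walk to the node before the one to remove
        prev.next = prev.next.next  # unlink it by rearranging one link
        remaining -= 1
    return prev.data
```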
The list ranking problem concerns the efficient conversion of a linked list representation into an array. Although
trivial for a conventional computer, solving this problem by a parallel algorithm is complicated and has been the
subject of much research.
A balanced tree has similar memory access patterns and space overhead to a linked list while permitting much more
efficient indexing, taking O(log n) time instead of O(n) for a random access. However, insertion and deletion
operations are more expensive due to the overhead of tree manipulations to maintain balance. Schemes exist for trees
to automatically maintain themselves in a balanced state: AVL trees or red-black trees.
The following code inserts a node after an existing node in a singly linked list. The diagram shows how it works.
Inserting a node before an existing one cannot be done directly; instead, one must keep track of the previous node
and insert a node after it.
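function insertAfter(Node node, Node newNode) // insert newNode after node
    newNode.next := node.next
    node.next := newNode
Inserting at the beginning of the list requires a separate function, because list.firstNode must be updated.
function insertBeginning(List list, Node newNode) // insert newNode before the current first node
    newNode.next := list.firstNode
    list.firstNode := newNode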
Similarly, we have functions for removing the node after a given node, and for removing a node from the beginning
of the list. The diagram demonstrates the former. To find and remove a particular node, one must again keep track of
the previous element.
function removeBeginning(List list) // remove first node
    obsoleteNode := list.firstNode
    list.firstNode := list.firstNode.next // point past deleted node
    destroy obsoleteNode
Notice that removeBeginning() sets list.firstNode to null when removing the last node in the list.
Since we can't iterate backwards, efficient insertBefore or removeBefore operations are not possible.
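The four singly linked list operations discussed in this section can be written in Python roughly as follows. The names mirror the pseudocode but are otherwise assumptions; Python's garbage collector takes the place of an explicit destroy step, and the removal functions return the unlinked node instead.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


class SinglyLinkedList:
    def __init__(self):
        self.first_node = None


def insert_after(node, new_node):
    new_node.next = node.next
    node.next = new_node


def insert_beginning(lst, new_node):
    new_node.next = lst.first_node
    lst.first_node = new_node


def remove_after(node):
    """Unlink and return the node following `node` (None if there is none)."""
    obsolete = node.next
    if obsolete is not None:
        node.next = obsolete.next
    return obsolete


def remove_beginning(lst):
    """Unlink and return the first node; first_node becomes None when the list empties."""
    obsolete = lst.first_node
    if obsolete is not None:
        lst.first_node = obsolete.next
    return obsolete


def to_list(lst):
    out, node = [], lst.first_node
    while node is not None:
        out.append(node.data)
        node = node.next
    return out
```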
Appending one linked list to another can be inefficient unless a reference to the tail is kept as part of the List
structure, because we must traverse the entire first list in order to find the tail, and then append the second list to this.
Thus, if two linearly linked lists are each of length n, list appending has asymptotic time complexity of O(n). In
the Lisp family of languages, list appending is provided by the append procedure.
Many of the special cases of linked list operations can be eliminated by including a dummy element at the front of
the list. This ensures that there are no special cases for the beginning of the list and renders both
insertBeginning() and removeBeginning() unnecessary. In this case, the first useful data in the list will
be found at list.firstNode.next.
function insertAfter(Node node, Node newNode)
    if node = null
        newNode.next := newNode
    else
        newNode.next := node.next
        node.next := newNode
Suppose that "L" is a variable pointing to the last node of a circular linked list (or null if the list is empty). To append
"newNode" to the end of the list, one may do
    insertAfter(L, newNode)
    L := newNode
To insert "newNode" at the beginning of the list, one may do
    insertAfter(L, newNode)
    if L = null
        L := newNode
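A Python rendering of the same idea is sketched below, keeping only a reference to the last node of the circular list. The function names and the convention of returning the new last node are assumptions; note that insert_after must handle the empty list, in which case the new node links to itself.

```python
class Node:
    def __init__(self, data):
        self.data = data
        self.next = None


def insert_after(node, new_node):
    if node is None:
        new_node.next = new_node      # sole node of a circular list
    else:
        new_node.next = node.next
        node.next = new_node


def append(last, new_node):
    """Add new_node at the end; return the new last node."""
    insert_after(last, new_node)
    return new_node


def prepend(last, new_node):
    """Add new_node at the beginning; the last node changes only if the list was empty."""
    insert_after(last, new_node)
    return new_node if last is None else last


def items(last):
    if last is None:
        return []
    out, node = [], last.next         # last.next is the first node
    while True:
        out.append(node.data)
        if node is last:
            return out
        node = node.next
```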
Index          Next  Prev  Name            Balance
0              1     4     Jones, John     123.45
1              -1    0     Smith, Joseph   234.56
2 (listHead)   4     -1    Adams, Adam     0.00
3
4              0     2     Another, Anita  876.54
5
6
7
In the above example, ListHead would be set to 2, the location of the first entry in the list. Notice that entry 3 and
5 through 7 are not part of the list. These cells are available for any additions to the list. By creating a ListFree
integer variable, a free list could be created to keep track of what cells are available. If all entries are in use, the size
of the array would have to be increased or some elements would have to be deleted before new entries could be
stored in the list.
The following code would traverse the list and display names and account balance:
i := listHead
while i ≥ 0 // loop through the list
    print i, Records[i].name, Records[i].balance // print entry
    i := Records[i].next
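The free-list idea mentioned above can be sketched in Python with parallel arrays for data and links. The class name ArrayLinkedList and its methods are hypothetical, and the sketch is singly linked for brevity. Unused cells are themselves chained through the next array, so taking or returning a cell is a constant-time operation, and -1 marks the end of either chain.

```python
class ArrayLinkedList:
    """Singly linked list stored in parallel arrays, with a free list of unused cells."""

    def __init__(self, capacity):
        self.data = [None] * capacity
        # Initially every cell is free: cell i links to i + 1, the last cell to -1.
        self.next = [i + 1 for i in range(capacity)]
        self.next[-1] = -1
        self.list_head = -1   # the list is empty
        self.list_free = 0    # first free cell

    def push_front(self, value):
        if self.list_free < 0:
            raise MemoryError("no free cells; the array would have to grow")
        i = self.list_free
        self.list_free = self.next[i]   # take the cell off the free list
        self.data[i] = value
        self.next[i] = self.list_head
        self.list_head = i
        return i

    def pop_front(self):
        i = self.list_head
        if i < 0:
            raise IndexError("pop from empty list")
        value = self.data[i]
        self.list_head = self.next[i]
        self.data[i] = None
        self.next[i] = self.list_free   # return the cell to the free list
        self.list_free = i
        return value

    def __iter__(self):
        i = self.list_head
        while i >= 0:
            yield self.data[i]
            i = self.next[i]
```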
When faced with a choice, the advantages of this approach include:
• The linked list is relocatable, meaning it can be moved about in memory at will, and it can also be quickly and
  directly serialized for storage on disk or transfer over a network.
• Especially for a small list, array indexes can occupy significantly less space than a full pointer on many
  architectures.
• Locality of reference can be improved by keeping the nodes together in memory and by periodically rearranging
  them, although this can also be done in a general store.
• Naive dynamic memory allocators can produce an excessive amount of overhead storage for each node allocated;
  almost no allocation overhead is incurred per node in this approach.
• Seizing an entry from a pre-allocated array is faster than using dynamic memory allocation for each node, since
  dynamic memory allocation typically requires a search for a free memory block of the desired size.
This approach has one main disadvantage, however: it creates and manages a private memory space for its nodes.
This leads to the following issues:
• It increases complexity of the implementation.
• Growing a large array when it is full may be difficult or impossible, whereas finding space for a new linked list
  node in a large, general memory pool may be easier.
• Adding elements to a dynamic array will occasionally (when it is full) unexpectedly take linear (O(n)) instead of
  constant time (although it's still an amortized constant).
• Using a general memory pool leaves more memory for other data if the list is smaller than expected or if many
  nodes are freed.
For these reasons, this approach is mainly used for languages that do not support dynamic memory allocation. These
disadvantages are also mitigated if the maximum size of the list is known at the time the array is created.
Language support
Many programming languages such as Lisp and Scheme have singly linked lists built in. In many functional
languages, these lists are constructed from nodes, each called a cons or cons cell. The cons has two fields: the car, a
reference to the data for that node, and the cdr, a reference to the next node. Although cons cells can be used to build
other data structures, this is their primary purpose.
In languages that support abstract data types or templates, linked list ADTs or templates are available for building
linked lists. In other languages, linked lists are typically built using references together with records.
aFamily := Families // start at head of families list
while aFamily ≠ null // loop through list of families
    print information about family
    aMember := aFamily.members // get head of list of this family's members
    while aMember ≠ null // loop through list of members
        print information about member
        aMember := aMember.next
    aFamily := aFamily.next
Speeding up search
Finding a specific element in a linked list, even if it is sorted, normally requires O(n) time (linear search). This is one
of the primary disadvantages of linked lists over other data structures. In addition to the variants discussed above,
below are two simple ways to improve search time.
In an unordered list, one simple heuristic for decreasing average search time is the move-to-front heuristic, which
simply moves an element to the beginning of the list once it is found. This scheme, handy for creating simple caches,
ensures that the most recently used items are also the quickest to find again.
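The move-to-front heuristic can be sketched in Python as a search that relinks the found node to the head of a singly linked list. The function name and the convention of returning both the new head and the found node are assumptions for illustration.

```python
class Node:
    def __init__(self, data, next=None):
        self.data = data
        self.next = next


def find_move_to_front(head, key):
    """Search for key; if found, move its node to the front. Returns (new_head, node)."""
    prev, node = None, head
    while node is not None and node.data != key:
        prev, node = node, node.next
    if node is None or prev is None:
        return head, node        # not found, or already at the front
    prev.next = node.next        # unlink from its current position
    node.next = head             # relink at the front
    return node, node


def to_list(head):
    out = []
    while head is not None:
        out.append(head.data)
        head = head.next
    return out
```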
Another common approach is to "index" a linked list using a more efficient external data structure. For example, one
can build a red-black tree or hash table whose elements are references to the linked list nodes. Multiple such indexes
can be built on a single list. The disadvantage is that these indexes may need to be updated each time a node is added
or removed (or at least, before that index is used again).
Notes
[1] Gerald Kruse. CS 240 Lecture Notes (http://www.juniata.edu/faculty/kruse/cs240/syllabus.htm): Linked Lists Plus: Complexity
Trade-offs (http://www.juniata.edu/faculty/kruse/cs240/linkedlist2.htm). Juniata College. Spring 2008.
[2] Day 1 Keynote - Bjarne Stroustrup: C++11 Style (http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style)
at GoingNative 2012 on channel9.msdn.com from minute 45 or foil 44
[3] Number crunching: Why you should never, ever, EVER use linked-list in your code again
(http://kjellkod.wordpress.com/2012/02/25/why-you-should-never-ever-ever-use-linked-list-in-your-code-again/) at kjellkod.wordpress.com
[4] The amount of control data required for a dynamic array is usually of the form K + B·n, where
K is a per-array constant, B is a per-dimension constant, and
n is the number of dimensions. K and B are typically on the order of 10 bytes.
[5] Ford, William and Topp, William Data Structures with C++ using STL Second Edition (2002). Prentice-Hall. ISBN 0-13-085850-1, pp.
466-467
Footnotes
References
Juan, Angel (2006). "Ch20 Data Structures; ID06 - PROGRAMMING with JAVA (slide part of the book "Big
Java", by CayS. Horstmann)" (http://www.uoc.edu/in3/emath/docs/java/ch20.pdf) (PDF). p.3
"Definition of a linked list" (http://nist.gov/dads/HTML/linkedList.html). National Institute of Standards and
Technology. 2004-08-16. Retrieved 2004-12-14.
Antonakos, James L.; Mansfield, Kenneth C., Jr. (1999). Practical Data Structures Using C/C++. Prentice-Hall.
pp. 165-190. ISBN 0-13-280843-9.
Collins, William J. (2005) [2002]. Data Structures and the Java Collections Framework. New York: McGraw
Hill. pp. 239-303. ISBN 0-07-282379-8.
Cormen, Thomas H.; Charles E. Leiserson; Ronald L. Rivest; Clifford Stein (2003). Introduction to Algorithms.
MIT Press. pp. 205-213 & 501-505. ISBN 0-262-03293-7.
Cormen, Thomas H.; Charles E. Leiserson; Ronald L. Rivest; Clifford Stein (2001). "10.2: Linked lists".
Introduction to Algorithms (2nd ed.). MIT Press. pp. 204-209. ISBN 0-262-03293-7.
Green, Bert F. Jr. (1961). "Computer Languages for Symbol Manipulation". IRE Transactions on Human Factors
in Electronics (2): 38. doi: 10.1109/THFE2.1961.4503292 (http://dx.doi.org/10.1109/THFE2.1961.
4503292).
McCarthy, John (1960). "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part
I" (http://www-formal.stanford.edu/jmc/recursive.html). Communications of the ACM 3 (4): 184. doi:
10.1145/367177.367199 (http://dx.doi.org/10.1145/367177.367199).
Knuth, Donald (1997). "2.2.3-2.2.5". Fundamental Algorithms (3rd ed.). Addison-Wesley. pp. 254-298.
ISBN 0-201-89683-4.
Newell, Allen; Shaw, F. C. (1957). "Programming the Logic Theory Machine". Proceedings of the Western Joint
Computer Conference: 230-240.
Parlante, Nick (2001). "Linked list basics" (http://cslibrary.stanford.edu/103/LinkedListBasics.pdf). Stanford
University. Retrieved 2009-09-21.
Sedgewick, Robert (1998). Algorithms in C. Addison Wesley. pp. 90-109. ISBN 0-201-31452-5.
Shaffer, Clifford A. (1998). A Practical Introduction to Data Structures and Algorithm Analysis. New Jersey:
Prentice Hall. pp. 77-102. ISBN 0-13-660911-2.
Wilkes, Maurice Vincent (1964). "An Experiment with a Self-compiling Compiler for a Simple List-Processing
Language". Annual Review in Automatic Programming (Pergamon Press) 4 (1): 1. doi:
10.1016/0066-4138(64)90013-8 (http://dx.doi.org/10.1016/0066-4138(64)90013-8).
Wilkes, Maurice Vincent (1964). "Lists and Why They are Useful". Proceeds of the ACM National Conference,
Philadelphia 1964 (ACM) (P64): F11.
External links
Introduction to Circular Linked (http://scanftree.com/Data_Structure/Circular) from the Scanftree
Description (http://nist.gov/dads/HTML/linkedList.html) from the Dictionary of Algorithms and Data
Structures
Introduction to Linked Lists (http://cslibrary.stanford.edu/103/), Stanford University Computer Science
Library
Linked List Problems (http://cslibrary.stanford.edu/105/), Stanford University Computer Science Library
Open Data Structures - Chapter 3 - Linked Lists (http://opendatastructures.org/versions/edition-0.1e/ods-java/
3_Linked_Lists.html)
Patent for the idea of having nodes which are in several linked lists simultaneously (http://www.google.com/
patents?vid=USPAT7028023) (note that this technique was widely used for many decades before the patent was
granted)
A doubly-linked list whose nodes contain three fields: an integer value, the link to the next node, and the link to the previous node.
The two node links allow traversal of the list in either direction. While adding or removing a node in a doubly-linked
list requires changing more links than the same operations on a singly linked list, the operations are simpler and
potentially more efficient (for nodes other than first nodes) because there is no need to keep track of the previous
node during traversal or no need to traverse the list to find the previous node, so that its link can be modified.
Basic algorithms
Open doubly-linked lists
record DoublyLinkedNode {
    prev // A reference to the previous node
    next // A reference to the next node
    data // Data or a reference to data
}
record DoublyLinkedList {
    DoublyLinkedNode firstNode   // points to first node of list
    DoublyLinkedNode lastNode    // points to last node of list
}
Advanced concepts
Asymmetric doubly-linked list
An asymmetric doubly-linked list is somewhere between the singly-linked list and the regular doubly-linked list. It
shares some features with the singly-linked list (single-direction traversal) and others with the doubly-linked list
(ease of modification).
It is a list where each node's previous link points not to the previous node, but to the link to itself. While this makes
little difference between nodes (it just points to an offset within the previous node), it changes the head of the list: It
allows the first node to modify the firstNode link easily.[1][2]
As long as a node is in a list, its previous link is never null.
Inserting a node
To insert a node before another, we change the link that pointed to the old node, using the prev link; then set the new
node's next link to point to the old node, and change that node's prev link accordingly.
function insertBefore(Node node, Node newNode)
    if node.prev == null
        error "The node is not in a list"
    newNode.prev := node.prev
    atAddress(newNode.prev) := newNode
    newNode.next := node
    node.prev := addressOf(newNode.next)

function insertAfter(Node node, Node newNode)
    newNode.next := node.next
    if newNode.next != null
        newNode.next.prev := addressOf(newNode.next)
    node.next := newNode
    newNode.prev := addressOf(node.next)
Deleting a node
To remove a node, we simply modify the link pointed by prev, regardless of whether the node was the first one of
the list.
function remove(Node node)
    atAddress(node.prev) := node.next
    if node.next != null
        node.next.prev := node.prev
    destroy node
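The pseudocode above might be rendered in C roughly as follows, with the "link to itself" stored as a pointer to a pointer. This is a sketch under assumptions: the names `anode`, `insert_first` and `remove_node` are invented for the example, and nodes are unlinked rather than destroyed so that the caller decides how memory is managed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the pseudocode; names are illustrative. */
struct anode {
    struct anode **prev;   /* address of the link that points to this node */
    struct anode *next;
    int data;
};

/* Insert new_node in front of node; works unchanged when node is first. */
void insert_before(struct anode *node, struct anode *new_node)
{
    assert(node->prev != NULL);        /* node must already be in a list */
    new_node->prev = node->prev;
    *new_node->prev = new_node;        /* predecessor's link, or the list head */
    new_node->next = node;
    node->prev = &new_node->next;
}

/* Push at the front of a list whose head pointer is *first. */
void insert_first(struct anode **first, struct anode *new_node)
{
    new_node->next = *first;
    if (new_node->next != NULL)
        new_node->next->prev = &new_node->next;
    *first = new_node;
    new_node->prev = first;
}

/* Unlink node; no special case is needed for the first node. */
void remove_node(struct anode *node)
{
    *node->prev = node->next;
    if (node->next != NULL)
        node->next->prev = node->prev;
    node->prev = NULL;                 /* mark as "not in a list" */
    node->next = NULL;
}
```

Because `prev` of the first node points at the list's head pointer itself, `remove_node` updates the head without ever being told which list the node belongs to.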
References
[1] http://www.codeofhonor.com/blog/avoiding-game-crashes-related-to-linked-lists
[2] https://github.com/webcoyote/coho/blob/master/Base/List.h
A stack may be implemented to have a bounded capacity. If the stack is full and does not contain enough space to
accept an entity to be pushed, the stack is then considered to be in an overflow state. The pop operation removes an
item from the top of the stack. A pop either reveals previously concealed items or results in an empty stack, but, if
the stack is empty, it goes into underflow state, which means no items are present in stack to be removed.
A stack is a restricted data structure, because only a small number of operations are performed on it. The nature of
the pop and push operations also means that stack elements have a natural order. Elements are removed from the
stack in the reverse order to the order of their addition. Therefore, the lower elements are those that have been on the
stack the longest.[1]
History
The stack was first proposed in 1946, in the computer design of Alan M. Turing (who used the terms "bury" and
"unbury") as a means of calling and returning from subroutines. The Germans Klaus
Samelson and Friedrich L. Bauer proposed the idea in 1955 and filed a patent in 1957. The same concept was
developed, independently, by the Australian Charles Leonard Hamblin in the first half of 1957.[2]
Abstract definition
A stack is a basic computer science data structure and can be defined in an abstract, implementation-free manner, or
it can be generally defined as a linear list of items in which all additions and deletions are restricted to one end,
called the top.
This is a VDM (Vienna Development Method) description of a stack:[3]
Function signatures:
init: -> Stack
push: N x Stack -> Stack
top: Stack -> (N U ERROR)
pop: Stack -> Stack
isempty: Stack -> Boolean
Inessential operations
In many implementations, a stack has more operations than "push" and "pop". An example is "top of stack", or
"peek", which observes the top-most element without removing it from the stack.[4] Since this can be done with a
"pop" and a "push" with the same data, it is not essential. An underflow condition can occur in the "stack top"
operation if the stack is empty, the same as "pop". Implementations often provide a function that simply reports
whether the stack is empty.
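As a small illustration of why "peek" is inessential, the sketch below builds it from "pop" and "push" alone. The type `int_stack`, the bound `CAPACITY` and the use of `assert` for the overflow and underflow checks are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 16   /* arbitrary bound chosen for this sketch */

struct int_stack {
    size_t size;
    int items[CAPACITY];
};

bool is_empty(const struct int_stack *s) { return s->size == 0; }

void push(struct int_stack *s, int x)
{
    assert(s->size < CAPACITY);   /* overflow check */
    s->items[s->size++] = x;
}

int pop(struct int_stack *s)
{
    assert(!is_empty(s));         /* underflow check */
    return s->items[--s->size];
}

/* "Top of stack" expressed purely through pop and push. */
int peek(struct int_stack *s)
{
    int x = pop(s);
    push(s, x);
    return x;
}
```

A real implementation would normally read `items[size - 1]` directly; the detour through pop and push only demonstrates the equivalence.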
Software stacks
Implementation
In most high level languages, a stack can be easily implemented either through an array or a linked list. What
identifies the data structure as a stack in either case is not the implementation but the interface: the user is only
allowed to pop or push items onto the array or linked list, with few other helper operations. The following will
demonstrate both implementations, using C.
Array
The array implementation aims to create an array where the first element (usually at the zero-offset) is the bottom.
That is, array[0] is the first element pushed onto the stack and the last element popped off. The program must
keep track of the size, or the length of the stack. The stack itself can therefore be effectively implemented as a
two-element structure in C:
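#include <stdio.h>
#include <stdlib.h>

#define STACKSIZE 100   /* capacity; the value is an assumption */

typedef struct {
    size_t size;            /* number of items currently stored */
    int items[STACKSIZE];   /* items[0] is the bottom of the stack */
} STACK;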
The push() operation stores values on the stack, including the first value pushed onto an empty stack (one whose
size field is 0). It is responsible for inserting (copying) the value into the ps->items[] array and for incrementing
the element counter (ps->size). In a responsible C implementation, it is also necessary to check whether the array is
already full to prevent an overrun.
void push(STACK *ps, int x)
{
    if (ps->size == STACKSIZE) {
        fputs("Error: stack overflow\n", stderr);
        abort();
    } else
        ps->items[ps->size++] = x;
}
The pop() operation is responsible for removing a value from the stack, and decrementing the value of
ps->size. A responsible C implementation will also need to check that the array is not already empty.
int pop(STACK *ps)
{
    if (ps->size == 0) {
        fputs("Error: stack underflow\n", stderr);
        abort();
    } else
        return ps->items[--ps->size];
}
If we use a dynamic array, then we can implement a stack that can grow or shrink as much as needed. The size of the
stack is simply the size of the dynamic array. A dynamic array is a very efficient implementation of a stack, since
adding items to or removing items from the end of a dynamic array is amortized O(1) time.
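A growable stack of this kind might look as follows in C. This is a sketch, not a definitive implementation: the type `dyn_stack`, the initial capacity of 4 and the doubling policy are assumptions, and the stack never shrinks.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical growable stack; the doubling policy is an assumption. */
struct dyn_stack {
    int *items;
    size_t size;       /* number of stored elements */
    size_t capacity;   /* number of allocated slots */
};

/* Returns 0 on success, -1 if memory could not be obtained. */
int dyn_push(struct dyn_stack *s, int x)
{
    if (s->size == s->capacity) {
        size_t new_cap = s->capacity ? 2 * s->capacity : 4;
        int *p = realloc(s->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;             /* the old buffer is still valid */
        s->items = p;
        s->capacity = new_cap;
    }
    s->items[s->size++] = x;
    return 0;
}

int dyn_pop(struct dyn_stack *s)
{
    assert(s->size > 0);           /* underflow check */
    return s->items[--s->size];
}

void dyn_free(struct dyn_stack *s)
{
    free(s->items);
    s->items = NULL;
    s->size = s->capacity = 0;
}
```

A stack initialised as `{ NULL, 0, 0 }` allocates on its first push. Doubling the capacity means each element is copied only a constant number of times on average, which is where the amortized O(1) bound comes from.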
Linked list
The linked-list implementation is equally simple and straightforward. In fact, a simple singly linked list is sufficient
to implement a stack: it only requires that the head node or element can be removed, or popped, and a node can
only be inserted by becoming the new head node.
Unlike the array implementation, our structure typedef corresponds not to the entire stack structure, but to a single
node:
typedef struct stack {
    int data;
    struct stack *next;
} STACK;
Such a node is identical to a typical singly linked list node, at least to those that are implemented in C.
The push() operation both initializes an empty stack, and adds a new node to a non-empty one. It works by
receiving a data value to push onto the stack, along with a target stack, creating a new node by allocating memory for
it, and then inserting it into a linked list as the new head:
void push(STACK **head, int value)
{
    STACK *node = malloc(sizeof(STACK));   /* create a new node */

    if (node == NULL) {
        fputs("Error: no space available for node\n", stderr);
        abort();
    } else {                               /* initialize node */
        node->data = value;
        node->next = *head;                /* old head, or NULL if the stack was empty */
        *head = node;                      /* the new node becomes the head */
    }
}
A pop() operation removes the head from the linked list, and assigns the pointer to the head to the previous second
node. It checks whether the list is empty before popping from it:
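The listing for this operation is not present in the text here. A minimal sketch consistent with the description and with the `STACK` node type above could look like this; the error message and the use of `abort()` simply mirror the array version and are assumptions.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct stack {
    int data;
    struct stack *next;
} STACK;

/* Removes the head node and returns its value. */
int pop(STACK **head)
{
    if (*head == NULL) {                 /* the stack is empty */
        fputs("Error: stack underflow\n", stderr);
        abort();
    }

    STACK *top = *head;
    int value = top->data;

    *head = top->next;                   /* the second node becomes the head */
    free(top);                           /* nodes are assumed to come from malloc */
    return value;
}
```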
Hardware stacks
A common use of stacks at the architecture level is as a means of allocating and accessing memory.
A typical stack, storing local data and call information for nested procedure calls (not
necessarily nested procedures!). This stack grows downward from its origin. The
stack pointer points to the current topmost datum on the stack. A push operation
decrements the pointer and copies the data to the stack; a pop operation copies data
from the stack and then increments the pointer. Each procedure called in the program
stores procedure return information (in yellow) and local data (in other colors) by
pushing them onto the stack. This type of stack implementation is extremely
common, but it is vulnerable to buffer overflow attacks (see the text).
Swap or exchange: the two topmost items on the stack exchange places.
Rotate (or Roll): the n topmost items are moved on the stack in a rotating fashion. For example, if n=3, items 1, 2,
and 3 on the stack are moved to positions 2, 3, and 1 on the stack, respectively. Many variants of this operation
are possible, with the most common being called left rotate and right rotate.
Stacks are visualized either growing from the bottom up (like real-world stacks), or with the top of the stack in a
fixed position (like a coin holder or a Pez dispenser; note that in the image the top (28) is drawn at the stack
'bottom', since the stack 'top' is where items are pushed or popped from), or growing from left to right, so that
"topmost" becomes "rightmost". This visualization may be independent of the actual structure of the stack in memory.
This means that a right rotate will move the first element to the third position, the second to the first and the
third to the second. Here are two equivalent visualizations of this process:
apple                            banana
banana      ===right rotate==>   cucumber
cucumber                         apple

cucumber                         apple
banana      ===left rotate==>    cucumber
apple                            banana
A stack is usually represented in computers by a block of memory cells, with the "bottom" at a fixed location, and
the stack pointer holding the address of the current "top" cell in the stack. The top and bottom terminology are used
irrespective of whether the stack actually grows towards lower memory addresses or towards higher memory
addresses.
Pushing an item on to the stack adjusts the stack pointer by the size of the item (either decrementing or incrementing,
depending on the direction in which the stack grows in memory), pointing it to the next cell, and copies the new top
item to the stack area. Depending again on the exact implementation, at the end of a push operation, the stack pointer
may point to the next unused location in the stack, or it may point to the topmost item in the stack. If the stack
pointer points to the current topmost item, it will be updated before a new item is pushed onto the stack; if it points
to the next available location in the stack, it will be updated after the new item is pushed onto the stack.
Popping the stack is simply the inverse of pushing. The topmost item in the stack is removed and the stack pointer is
updated, in the opposite order of that used in the push operation.
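The pointer arithmetic described above can be imitated in C with an array standing in for memory. The sketch below assumes one particular convention (the stack grows toward lower indices and the stack pointer designates the topmost item); the names `machine`, `hw_push` and `hw_pop` and the size `CELLS` are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define CELLS 8   /* size of the simulated stack area; arbitrary */

/* Simulated machine state: a memory block and a stack pointer index.
   The stack grows toward lower indices and sp designates the topmost
   item, which is one of the conventions described above. */
struct machine {
    uint32_t mem[CELLS];
    int sp;           /* index of the current top; CELLS means empty */
};

void machine_init(struct machine *m) { m->sp = CELLS; }

void hw_push(struct machine *m, uint32_t value)
{
    assert(m->sp > 0);        /* overflow */
    m->sp -= 1;               /* adjust the pointer first ... */
    m->mem[m->sp] = value;    /* ... then store the new top */
}

uint32_t hw_pop(struct machine *m)
{
    assert(m->sp < CELLS);            /* underflow */
    uint32_t value = m->mem[m->sp];   /* read the top first ... */
    m->sp += 1;                       /* ... then adjust the pointer */
    return value;
}
```

With the other convention (pointer at the next unused cell), the two statements in each function would simply swap order.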
Hardware support
Stack in main memory
Most CPUs have registers that can be used as stack pointers. Processor families like the x86, Z80, 6502, and many
others have special instructions that implicitly use a dedicated (hardware) stack pointer to conserve opcode space.
Some processors, like the PDP-11 and the 68000, also have special addressing modes for implementation of stacks,
typically with a semi-dedicated stack pointer as well (such as A7 in the 68000). However, in most processors, several
different registers may be used as additional stack pointers as needed (whether updated via addressing modes or via
add/sub instructions).
Stack in registers or dedicated memory
The x87 floating point architecture is an example of a set of registers organised as a stack where direct access to
individual registers (relative to the current top) is also possible. As with stack-based machines in general, having the
top-of-stack as an implicit argument allows for a small machine code footprint with a good usage of bus bandwidth
and code caches, but it also prevents some types of optimizations possible on processors permitting random access to
the register file for all (two or three) operands. A stack structure also makes superscalar implementations with
register renaming (for speculative execution) somewhat more complex to implement, although it is still feasible, as
exemplified by modern x87 implementations.
Sun SPARC, AMD Am29000, and Intel i960 are all examples of architectures using register windows within a
register-stack as another strategy to avoid the use of slow main memory for function arguments and return values.
There are also a number of small microprocessors that implement a stack directly in hardware, and some
microcontrollers have a fixed-depth stack that is not directly accessible. Examples are the PIC microcontrollers, the
Computer Cowboys MuP21, the Harris RTX line, and the Novix NC4016. Many stack-based microprocessors were
used to implement the programming language Forth at the microcode level. Stacks were also used as a basis of a
number of mainframes and mini computers. Such machines were called stack machines, the most famous being the
Burroughs B5000.
Applications
Stacks have numerous applications. We see stacks in everyday life, from the books in our library, to the sheaf of
papers that we keep in our printer tray. All of them follow the Last In First Out (LIFO) logic, that is when we add a
book to a pile of books, we add it to the top of the pile, whereas when we remove a book from the pile, we generally
remove it from the top of the pile.
Given below are a few applications of stacks in the world of computers:
            return error
        end if
        n = floor(n / 2)
    end while
    while s is not empty do
        output(s.pop())
    end while
end function
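The surviving lines above appear to be the tail of a routine that repeatedly halves n, pushes something onto a stack s, and finally pops and outputs the stack. Assuming it pushes the remainder of n divided by 2 (which would print n in binary), a C sketch could read as follows; `to_binary` and `MAX_BITS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BITS 64   /* stack capacity; assumes a 64-bit unsigned long long */

/* Writes the binary digits of n into out (NUL-terminated).
   out must have room for MAX_BITS + 1 characters. */
void to_binary(unsigned long long n, char *out)
{
    int stack[MAX_BITS];
    size_t top = 0;
    size_t len = 0;

    if (n == 0)
        out[len++] = '0';

    while (n > 0) {
        assert(top < MAX_BITS);          /* a full stack would be an error */
        stack[top++] = (int)(n % 2);     /* push the low bit */
        n /= 2;                          /* floor(n / 2) */
    }
    while (top > 0)                      /* pop: most significant bit first */
        out[len++] = (char)('0' + stack[--top]);

    out[len] = '\0';
}
```

The stack is what reverses the digits: the remainders are produced least significant first but must be printed most significant first.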
Towers of Hanoi
One of the most interesting applications of stacks can be found in solving a puzzle called Tower of Hanoi. According
to an old Brahmin story, the existence of the universe is calculated in terms of the time taken by a number of monks,
who are working all the time, to move 64 disks from one pole to another. But there are some rules about how this
should be done: only one disk may be moved at a time, a third pole may be used for temporary storage, and a larger
disk may never be placed on top of a smaller one.
The C++ code for this solution can be implemented in two ways:
First implementation (using stacks implicitly by recursion)
#include <stdio.h>

/* Move n disks from tower a to tower b, using tower c as intermediate. */
void TowersofHanoi(int n, int a, int b, int c)
{
    if (n > 0)
    {
        TowersofHanoi(n-1, a, c, b);   //recursion
        printf("> Move top disk from tower %d to tower %d.\n", a, b);
        TowersofHanoi(n-1, c, b, a);   //recursion
    }
}
Second implementation (using stacks explicitly)
// Global variable, tower[1:3] are the three towers.
// arrayStack<int> and showState() are assumed to be defined elsewhere.
arrayStack<int> tower[4];

void moveAndShow(int n, int a, int b, int c);

void TowerofHanoi(int n)
{
    // Preprocessor for moveAndShow.
    for (int d = n; d > 0; d--)     //initialize
        tower[1].push(d);           //add disk d to tower 1
    moveAndShow(n, 1, 2, 3);        /*move n disks from tower 1 to
                                      tower 3 using
                                      tower 2 as intermediate tower*/
}

void moveAndShow(int n, int a, int b, int c)
{
    // Move the top n disks from tower a to tower c showing states.
    // Use tower b for intermediate storage.
    if (n > 0)
    {
        moveAndShow(n-1, a, c, b);  //recursion
        int d = tower[a].top();     // move a disk from top of tower a
        tower[a].pop();             // to top of
        tower[c].push(d);           // tower c
        showState();
        moveAndShow(n-1, b, a, c);  //recursion
    }
}
Input character        Step
Opening bracket        (2.1)
Number                 (2.2)
Operator               (2.3)
Closing bracket        (2.4) [...] push into the stack, Go to step (2.4)
New line character     (2.5)

Result: The evaluation of the fully parenthesized infix expression is printed as follows:
Input String: (((2 * 5) - (1 * 2)) / (11 - 9))

Input Symbol   Stack (from bottom to top)   Operation
(              (
(              ( (
(              ( ( (
2              ( ( ( 2
*              ( ( ( 2 *
5              ( ( ( 2 * 5
)              ( ( 10                       2 * 5 = 10 and push
-              ( ( 10 -
(              ( ( 10 - (
1              ( ( 10 - ( 1
*              ( ( 10 - ( 1 *
2              ( ( 10 - ( 1 * 2
)              ( ( 10 - 2                   1 * 2 = 2 & Push
)              ( 8                          10 - 2 = 8 & Push
/              ( 8 /
(              ( 8 / (
11             ( 8 / ( 11
-              ( 8 / ( 11 -
9              ( 8 / ( 11 - 9
)              ( 8 / 2                      11 - 9 = 2 & Push
)              4                            8 / 2 = 4 & Push
New line       Empty
We do not know what to do if an operator is read as an input character. By implementing the priority rule for
operators, we have a solution to this problem.
The Priority rule: we should perform a comparative priority check if an operator is read, and then push it. If the stack
top contains an operator of priority higher than or equal to the priority of the input operator, then we pop it and print
it. We keep on performing the priority check until the top of stack either contains an operator of lower priority or if it
does not contain an operator.
Data Structure Requirement for this problem: a character stack and an integer stack
Algorithm:
1. Read an input character
2. Actions that will be performed at the end of each input
   Opening parentheses       (2.1)
   Number                    (2.2)
   Operator                  (2.3)
   Closing parentheses       (2.4)
   New line character (\n)   (2.5)
Result: The evaluation of an infix expression that is not fully parenthesized is printed as follows:
Input String: (2 * 5 - 1 * 2) / (11 - 9)
Input Symbol   Character Stack         Integer Stack           Operation performed
               (from bottom to top)    (from bottom to top)
(              (
2              (                       2
*              ( *                     2                       Push as * has higher priority
5              ( *                     2 5
-              ( -                     10                      Since '-' has less priority, we do 2 * 5 = 10
1              ( -                     10 1
*              ( - *                   10 1
2              ( - *                   10 1 2
)              ( -                     10 2
               (                       8
/              /                       8
(              / (                     8
11             / (                     8 11
-              / ( -                   8 11
9              / ( -                   8 11 9
)              /                       8 2
New line                               4
Operator
(2.2)
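Input Symbol   Character Stack            Integer Stack           Operation performed
               (from bottom to top)       (from bottom to top)
/              /
-              / -
*              / - *
2              / - * 2
5              / - * 2 5
*              / - * 2 5 *
1              / - * 2 5 * 1
2              / - * 2 5 * 1 2
-              / - * 2 5 * 1 2 -
11             / - * 2 5 * 1 2 - 11
9              / - * 2 5 * 1 2 - 11 9
\n             / - * 2 5 * 1 2 - 11       9
               / - * 2 5 * 1 2 -          9 11
               / - * 2 5 * 1 2            2                       11 - 9 = 2
               / - * 2 5 * 1              2 2
               / - * 2 5 *                2 2 1
               / - * 2 5                  2 2                     1 * 2 = 2
               / - * 2                    2 2 5
               / - *                      2 2 5 2
               / -                        2 2 10                  5 * 2 = 10
               /                          2 8                     10 - 2 = 8
               Stack is empty             4                       8 / 2 = 4
               Stack is empty                                     Print 4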
Input   Operation      Stack
1       Push operand   1
2       Push operand   2, 1
4       Push operand   4, 2, 1
*       Multiply       8, 1
+       Add            9
3       Push operand   3, 9
+       Add            12
The final result, 12, lies on the top of the stack at the end of the calculation.
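The evaluation traced above can be sketched in C as follows. The function name `eval_postfix`, the bound `MAX_DEPTH`, the space-separated input format and the use of `assert` for malformed input are assumptions made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

#define MAX_DEPTH 64   /* arbitrary operand-stack bound for this sketch */

/* Evaluates a space-separated postfix expression of non-negative
   integers and the operators + - * /. Malformed input trips an assert. */
long eval_postfix(const char *expr)
{
    long stack[MAX_DEPTH];
    int top = 0;

    for (const char *p = expr; *p != '\0'; ) {
        if (isspace((unsigned char)*p)) {
            p++;
        } else if (isdigit((unsigned char)*p)) {
            char *end;
            long v = strtol(p, &end, 10);
            assert(top < MAX_DEPTH);
            stack[top++] = v;                 /* push operand */
            p = end;
        } else {
            assert(top >= 2);                 /* an operator needs two operands */
            long rhs = stack[--top];
            long lhs = stack[--top];
            long r = 0;
            switch (*p) {
            case '+': r = lhs + rhs; break;
            case '-': r = lhs - rhs; break;
            case '*': r = lhs * rhs; break;
            case '/': assert(rhs != 0); r = lhs / rhs; break;
            default:  assert(!"unknown operator");
            }
            stack[top++] = r;                 /* push the result */
            p++;
        }
    }
    assert(top == 1);                         /* the result is the sole entry */
    return stack[0];
}
```

Calling `eval_postfix("1 2 4 * + 3 +")` follows the same pushes and reductions as the table and returns 12.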
Example in C
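#include <stdio.h>

#define CAPACITY 100

int main(void)
{
    int a[CAPACITY], i = 0, x;

    printf("To pop enter -1\n");
    for (;;)
    {
        printf("Push ");
        if (scanf("%d", &x) != 1)   /* stop on end of input or non-numeric input */
            break;
        if (x == -1)
        {
            if (i == 0)
                printf("Underflow\n");
            else
                printf("pop = %d\n", a[--i]);
        }
        else if (i == CAPACITY)     /* the array is full */
            printf("Overflow\n");
        else
            a[i++] = x;
    }
    return 0;
}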
{ADT of STACK}
{dictionary}
const
   mark = '.';
var
   data : stack;
   f : text;
   cc : char;
   ccInt, cc1, cc2 : integer;
{functions}
IsOperand (cc : char) : boolean;           {JUST Prototype}
   {return TRUE if cc is operand}
ChrToInt (cc : char) : integer;            {JUST Prototype}
   {change char to integer}
Operator (cc1, cc2 : integer) : integer;   {JUST Prototype}
   {operate two operands}
{algorithms}
begin
   assign (f, cc);
   reset (f);
   read (f, cc); {first elmt}
   if (cc = mark) then
      begin
         writeln ('empty archives !');
      end
   else
      begin
         repeat
            if (IsOperand (cc)) then
               begin
                  ccInt := ChrToInt (cc);
Input character           Step
Opening parentheses       (2.1)
Number                    (2.2)
Operator                  (2.3)
Closing parentheses       (2.4)
New line character (\n)   (2.5) STOP
Therefore, the final output after conversion of an infix expression to a postfix expression is as follows:
Input      Stack (after op)   Output on monitor         Operation
(          (
(          ( (
(          ( ( (
8                             8                         (2.2) Print it
+          ( ( ( +            8
1                             8 1                       (2.2) Print it
)          ( ( (              8 1 +                     (2.4) Pop from the stack: Since popped element is '+' print it
           ( (                8 1 +                     (2.4) Pop from the stack: Since popped element is '(' we ignore it and read next character
-          ( ( -
(          ( ( - (
7                             8 1 + 7                   (2.2) Print it
-          ( ( - ( -
4                             8 1 + 7 4                 (2.2) Print it
)          ( ( - (            8 1 + 7 4 -               (2.4) Pop from the stack: Since popped element is '-' print it
           ( ( -                                        (2.4) Pop from the stack: Since popped element is '(' we ignore it and read next character
)          ( (                8 1 + 7 4 - -             (2.4) Pop from the stack: Since popped element is '-' print it
           (                                            (2.4) Pop from the stack: Since popped element is '(' we ignore it and read next character
/          ( /
(          ( / (
11                            8 1 + 7 4 - - 11          (2.2) Print it
-          ( / ( -
9                             8 1 + 7 4 - - 11 9        (2.2) Print it
)          ( / (              8 1 + 7 4 - - 11 9 -      (2.4) Pop from the stack: Since popped element is '-' print it
           ( /                                          (2.4) Pop from the stack: Since popped element is '(' we ignore it and read next character
)          (                  8 1 + 7 4 - - 11 9 - /    (2.4) Pop from the stack: Since popped element is '/' print it
           Stack is empty                               (2.4) Pop from the stack: Since popped element is '(' we ignore it and read next character
New line                                                (2.5) STOP
character
The car 4 is moved to output track. No other cars can be moved to output
track at this time.
The next car 8 is moved to holding track H1.
Car 5 is output from the input track. Car 6 is moved to the output track from H2, and so are car 7 from H3, car 8
from H1 and car 9 from H3.
Backtracking
Another important application of stacks is backtracking. Consider a simple example of finding the correct path in a
maze. There are a series of points, from the starting point to the destination. We start from one point. To reach the
final destination, there are several paths. Suppose we choose a random path. After following a certain path, we
realise that the path we have chosen is wrong. So we need to find a way by which we can return to the beginning of
that path. This can be done with the use of stacks. With the help of stacks, we remember the point where we have
reached. This is done by pushing that point into the stack. In case we end up on the wrong path, we can pop the last
point from the stack and thus return to the last point and continue our quest to find the right path. This is called
backtracking.
Quicksort
Sorting means arranging the list of elements in a particular order. In case of numbers, it could be in ascending order,
or in the case of letters, alphabetic order.
Quicksort is an algorithm of the divide and conquer type. In this method, to sort a set of numbers, we reduce it to
two smaller sets, and then sort these smaller sets.
This can be explained with the help of the following example:
Suppose A is a list of the following numbers:
In the reduction step, we find the final position of one of the numbers. In this case, let us assume that we have to find
the final position of 48, which is the first number in the list.
To accomplish this, we adopt the following method. Begin with the last number, and move from right to left.
Compare each number with 48. If the number is smaller than 48, we stop at that number and swap it with 48.
In our case, the number is 24. Hence, we swap 24 and 48.
The numbers 96 and 72 to the right of 48, are greater than 48. Now beginning with 24, scan the numbers in the
opposite direction, that is from left to right. Compare every number with 48 until you find a number that is greater
than 48.
In this case, it is 60. Therefore we swap 48 and 60.
Note that the numbers 12, 24 and 36 to the left of 48 are all smaller than 48. Now, start scanning numbers from 60,
in the right to left direction. As soon as you find a smaller number, swap it with 48.
In this case, it is 44. Swap it with 48. The final result is:
Now, beginning with 44, scan the list from left to right, until you find a number greater than 48.
Such a number is 84. Swap it with 48. The final result is:
Now, beginning with 84, traverse the list from right to left, until you reach a number less than 48. We do not find
such a number before reaching 48. This means that all the numbers in the list have been scanned and compared with
48. Also, we notice that all numbers less than 48 are to the left of it, and all numbers greater than 48, are to its right.
The final partitions look as follows:
Therefore, 48 has been placed in its proper position and now our task is reduced to sorting the two partitions. This
above step of creating partitions can be repeated with every partition containing 2 or more elements. As we can
process only a single partition at a time, we should be able to keep track of the other partitions, for future processing.
This is done by using two stacks called LOWERBOUND and UPPERBOUND, to temporarily store these partitions.
The addresses of the first and last elements of the partitions are pushed into the LOWERBOUND and
UPPERBOUND stacks respectively. Now, the above reduction step is applied to a partition only after its
boundary values are popped from the stacks.
We can understand this from the following example:
Take the above list A with 12 elements. The algorithm starts by pushing the boundary values of A, that is 1 and 12
into the LOWERBOUND and UPPERBOUND stacks respectively. Therefore the stacks look as follows:
LOWERBOUND: 1
UPPERBOUND: 12
To perform the reduction step, the values of the stack top are popped from the stack. Therefore, both the stacks
become empty.
LOWERBOUND: {empty}
UPPERBOUND: {empty}
Now, the reduction step causes 48 to be fixed to the 5th position and creates two partitions, one from position 1 to 4
and the other from position 6 to 12. Hence, the values 1 and 6 are pushed into the LOWERBOUND stack and 4 and
12 are pushed into the UPPERBOUND stack:
LOWERBOUND: 1, 6
UPPERBOUND: 4, 12
For applying the reduction step again, the values at the stack top are popped. Therefore, the values 6 and 12 are
popped. Therefore the stacks look like:
LOWERBOUND: 1
UPPERBOUND: 4
The reduction step is now applied to the second partition, that is from the 6th to 12th element.
After the reduction step, 98 is fixed in the 11th position. So, the second partition has only one element. Therefore, we
push the upper and lower boundary values of the first partition onto the stack. So, the stacks are as follows:
LOWERBOUND: 1, 6
UPPERBOUND: 4, 10
The processing proceeds in the following way and ends when the stacks do not contain any upper and lower bounds
of the partition to be processed, and the list gets sorted.
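The procedure described above can be sketched in C as follows. The sketch makes several assumptions: indices are 0-based rather than 1-based, the two stacks are plain arrays named `lowerbound` and `upperbound`, partitions with fewer than two elements are never pushed, and the helper names `reduce` and `swap_int` are invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

static void swap_int(int *x, int *y)
{
    int t = *x;
    *x = *y;
    *y = t;
}

/* Reduction step: moves the value a[lo] to its final position within
   a[lo..hi] by alternating right-to-left and left-to-right scans, and
   returns that position. */
static int reduce(int a[], int lo, int hi)
{
    int loc = lo, left = lo, right = hi;

    for (;;) {
        while (right > loc && a[right] >= a[loc])
            right--;                         /* look for a smaller value */
        if (right == loc)
            return loc;
        swap_int(&a[loc], &a[right]);
        loc = right;

        while (left < loc && a[left] <= a[loc])
            left++;                          /* look for a greater value */
        if (left == loc)
            return loc;
        swap_int(&a[loc], &a[left]);
        loc = left;
    }
}

/* Sorts a[0..n-1]; returns 0 on success, -1 if allocation fails. */
int quicksort(int a[], int n)
{
    if (n < 2)
        return 0;

    /* Pending partitions are disjoint and hold at least two elements,
       so n slots per stack are more than enough. */
    int *lowerbound = malloc((size_t)n * sizeof *lowerbound);
    int *upperbound = malloc((size_t)n * sizeof *upperbound);
    if (lowerbound == NULL || upperbound == NULL) {
        free(lowerbound);
        free(upperbound);
        return -1;
    }

    int top = 0;
    lowerbound[top] = 0;
    upperbound[top] = n - 1;
    top++;

    while (top > 0) {
        top--;                               /* pop one partition */
        int lo = lowerbound[top], hi = upperbound[top];
        int loc = reduce(a, lo, hi);

        if (loc - lo >= 2) {                 /* left part still needs sorting */
            lowerbound[top] = lo;
            upperbound[top] = loc - 1;
            top++;
        }
        if (hi - loc >= 2) {                 /* right part still needs sorting */
            assert(top < n);
            lowerbound[top] = loc + 1;
            upperbound[top] = hi;
            top++;
        }
    }

    free(lowerbound);
    free(upperbound);
    return 0;
}
```

As in the walkthrough, the right-hand partition is pushed last and is therefore processed first.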
An element once popped from the stack N is never pushed back into it. Therefore,
the running time of all the statements in the while loop is $O(n)$.
The running time of all the steps in the algorithm is calculated by adding the time taken by all these steps. The run
time of each step is $O(n)$. Hence the running time complexity of this algorithm is $O(n)$.
Security
Some computing environments use stacks in ways that may make them vulnerable to security breaches and attacks.
Programmers working in such environments must take special care to avoid the pitfalls of these implementations.
For example, some programming languages use a common stack to store both data local to a called procedure and
the linking information that allows the procedure to return to its caller. This means that the program moves data into
and out of the same stack that contains critical return addresses for the procedure calls. If data is moved to the wrong
location on the stack, or an oversized data item is moved to a stack location that is not large enough to contain it,
return information for procedure calls may be corrupted, causing the program to fail.
Malicious parties may attempt a stack smashing attack that takes advantage of this type of implementation by
providing oversized data input to a program that does not check the length of input. Such a program may copy the
data in its entirety to a location on the stack, and in so doing it may change the return addresses for procedures that
have called it. An attacker can experiment to find a specific type of data that can be provided to such a program such
that the return address of the current procedure is reset to point to an area within the stack itself (and within the data
provided by the attacker), which in turn contains instructions that carry out unauthorized operations.
This type of attack is a variation on the buffer overflow attack and is an extremely frequent source of security
breaches in software, mainly because some of the most popular compilers use a shared stack for both data and
procedure calls, and do not verify the length of data items. Frequently programmers do not write code to verify the
size of data items, either, and when an oversized or undersized data item is copied to the stack, a security breach may
occur.
References
[1]
[2]
[3]
[4]
[5]
[6]
[7]
Further reading
Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third
Edition. Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238-243.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues,
pp. 200-204.
External links
Queue implementation
Theoretically, one characteristic of a queue is that it does not have a specific capacity. Regardless of how many
elements are already contained, a new element can always be added. It can also be empty, at which point removing
an element will be impossible until a new element has been added again.
Fixed-length arrays are limited in capacity, but items never need to be copied towards the head of the queue.
Treating the array as a closed circle and letting the head and tail drift around that circle makes it unnecessary
to ever move items stored in the array. If n is the size of the array, then computing indices modulo n turns the
array into a circle. This is still the conceptually simplest way to construct a queue in a high-level language. It
does slow things down a little, because the array indices must be compared to zero and to the array size, a cost
comparable to the bounds check that some languages perform on every array access. It is nevertheless the
method of choice for a quick implementation, or for any high-level language that does not have pointer syntax.
The array size must be declared ahead of time, but some implementations simply double the declared array size
when overflow occurs. Most modern languages with objects or pointers can implement, or come with libraries
for, dynamic lists. Such data structures may have no fixed capacity limit besides memory constraints. Queue
overflow results from trying to add an element to a full queue, and queue underflow happens when trying to
remove an element from an empty queue.
A bounded queue is a queue limited to a fixed number of items.
There are several efficient implementations of FIFO queues. An efficient implementation is one that can perform the
operations, enqueuing and dequeuing, in O(1) time.
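The circular-array technique described above, together with doubling on overflow, can be sketched in C as follows. This is a minimal sketch under stated assumptions, not a reference implementation: the Queue type and the queue_* function names are hypothetical, and elements are plain int values.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int   *items;
    size_t capacity;  /* n, the size of the array */
    size_t head;      /* index of the oldest element */
    size_t count;     /* number of stored elements */
} Queue;

int queue_init(Queue *q, size_t capacity) {
    assert(capacity > 0);
    q->items = malloc(capacity * sizeof *q->items);
    q->capacity = capacity;
    q->head = 0;
    q->count = 0;
    return q->items ? 0 : -1;
}

void queue_free(Queue *q) { free(q->items); q->items = NULL; }

/* Doubles the array, unrolling the circle so the oldest item lands at 0. */
static int queue_grow(Queue *q) {
    size_t new_cap = q->capacity * 2;
    int *bigger = malloc(new_cap * sizeof *bigger);
    if (!bigger) return -1;
    for (size_t i = 0; i < q->count; i++)
        bigger[i] = q->items[(q->head + i) % q->capacity];
    free(q->items);
    q->items = bigger;
    q->capacity = new_cap;
    q->head = 0;
    return 0;
}

/* Enqueue: amortized O(1); the tail index is computed modulo n. */
int queue_enqueue(Queue *q, int value) {
    if (q->count == q->capacity && queue_grow(q) != 0)
        return -1;
    q->items[(q->head + q->count) % q->capacity] = value;
    q->count++;
    return 0;
}

/* Dequeue: O(1); returns -1 on underflow (empty queue). */
int queue_dequeue(Queue *q, int *out) {
    if (q->count == 0) return -1;
    *out = q->items[q->head];
    q->head = (q->head + 1) % q->capacity;
    q->count--;
    return 0;
}
```

No stored item is moved during ordinary enqueue and dequeue operations; items are copied only when the array is doubled, which is what keeps enqueuing at amortized constant time.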
References
General
Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition.
Addison-Wesley, 1997. ISBN 0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 10.1: Stacks and queues,
pp. 200–204.
William Ford, William Topp. Data Structures with C++ and STL, Second Edition. Prentice Hall, 2002. ISBN
0-13-085850-1. Chapter 8: Queues and Priority Queues, pp. 386–390.
Adam Drozdek. Data Structures and Algorithms in C++, Third Edition. Thomson Course Technology, 2005.
ISBN 0-534-49182-0. Chapter 4: Stacks and Queues, pp. 137–169.
Citations
[1] http://download.oracle.com/javase/7/docs/api/java/util/Queue.html
[2] http://download.oracle.com/javase/7/docs/api/java/util/LinkedList.html
[3] http://download.oracle.com/javase/7/docs/api/java/util/ArrayDeque.html
[4] http://www.php.net/manual/en/class.splqueue.php
External links
Queues with algo and 'c' programe (http://scanftree.com/Data_Structure/Queues)
STL Quick Reference (http://www.halpernwightsoftware.com/stdlib-scratch/quickref.html#containers14)
VBScript implementation of stack, queue, deque, and Red-Black Tree (http://www.ludvikjerabek.com/downloads.html)
Paul E. Black, Bounded queue (http://www.nist.gov/dads/HTML/boundedqueue.html) at the NIST Dictionary
of Algorithms and Data Structures.
Double-ended queue
In computer science, a double-ended queue (dequeue, often abbreviated to deque, pronounced deck) is an abstract
data type that generalizes a queue, for which elements can be added to or removed from either the front (head) or
back (tail).[1] It is also often called a head-tail linked list, though properly this refers to a specific data structure
implementation (see below).
Naming conventions
Deque is sometimes written dequeue, but this use is generally deprecated in technical literature or technical writing
because dequeue is also a verb meaning "to remove from a queue". Nevertheless, several libraries and some writers,
such as Aho, Hopcroft, and Ullman in their textbook Data Structures and Algorithms, spell it dequeue. John
Mitchell, author of Concepts in Programming Languages, also uses this terminology.
Operations
The basic operations on a deque are enqueue and dequeue on either end. Also generally implemented are peek
operations, which return the value at that end without dequeuing it.
Names vary between languages; major implementations include:
operation               | common name(s) | Ada           | C++        | Java       | Perl       | PHP           | Python     | Ruby    | JavaScript
insert element at back  | inject, snoc   | Append        | push_back  | offerLast  | push       | array_push    | append     | push    | push
insert element at front | push, cons     | Prepend       | push_front | offerFirst | unshift    | array_unshift | appendleft | unshift | unshift
remove last element     | eject          | Delete_Last   | pop_back   | pollLast   | pop        | array_pop     | pop        | pop     | pop
remove first element    | pop            | Delete_First  | pop_front  | pollFirst  | shift      | array_shift   | popleft    | shift   | shift
examine last element    |                | Last_Element  | back       | peekLast   | $array[-1] | end           | <obj>[-1]  | last    | <obj>[<obj>.length - 1]
examine first element   |                | First_Element | front      | peekFirst  | $array[0]  | reset         | <obj>[0]   | first   | <obj>[0]
Implementations
There are at least two common ways to efficiently implement a deque: with a modified dynamic array or with a
doubly linked list.
The dynamic array approach uses a variant of a dynamic array that can grow from both ends, sometimes called array
deques. These array deques have all the properties of a dynamic array, such as constant-time random access, good
locality of reference, and inefficient insertion/removal in the middle, with the addition of amortized constant-time
insertion/removal at both ends, instead of just one end. Three common implementations include:
Storing deque contents in a circular buffer, and only resizing when the buffer becomes full. This decreases the
frequency of resizings.
Allocating deque contents from the center of the underlying array, and resizing the underlying array when either
end is reached. This approach may require more frequent resizings and waste more space, particularly when
elements are only inserted at one end.
Storing contents in multiple smaller arrays, allocating additional arrays at the beginning or end as needed.
Indexing is implemented by keeping a dynamic array containing pointers to each of the smaller arrays.
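The first of these approaches, a deque stored in a circular buffer that is resized only when full, might look like the following in C. This is an illustrative sketch under stated assumptions: the Deque type and the deque_* names are hypothetical, elements are int values, and the buffer doubles when it fills.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int   *items;
    size_t capacity;
    size_t front;   /* index of the first element */
    size_t count;   /* number of stored elements */
} Deque;

int deque_init(Deque *d, size_t capacity) {
    assert(capacity > 0);
    d->items = malloc(capacity * sizeof *d->items);
    d->capacity = capacity;
    d->front = 0;
    d->count = 0;
    return d->items ? 0 : -1;
}

void deque_free(Deque *d) { free(d->items); d->items = NULL; }

/* Resizes only when the buffer is full; the contents are copied in
   logical order so that the first element ends up at index 0. */
static int deque_reserve(Deque *d) {
    if (d->count < d->capacity) return 0;
    size_t new_cap = d->capacity * 2;
    int *bigger = malloc(new_cap * sizeof *bigger);
    if (!bigger) return -1;
    for (size_t i = 0; i < d->count; i++)
        bigger[i] = d->items[(d->front + i) % d->capacity];
    free(d->items);
    d->items = bigger;
    d->capacity = new_cap;
    d->front = 0;
    return 0;
}

int deque_push_back(Deque *d, int v) {
    if (deque_reserve(d) != 0) return -1;
    d->items[(d->front + d->count) % d->capacity] = v;
    d->count++;
    return 0;
}

int deque_push_front(Deque *d, int v) {
    if (deque_reserve(d) != 0) return -1;
    d->front = (d->front + d->capacity - 1) % d->capacity; /* step back, wrapping */
    d->items[d->front] = v;
    d->count++;
    return 0;
}

int deque_pop_back(Deque *d, int *out) {
    if (d->count == 0) return -1;
    d->count--;
    *out = d->items[(d->front + d->count) % d->capacity];
    return 0;
}

int deque_pop_front(Deque *d, int *out) {
    if (d->count == 0) return -1;
    *out = d->items[d->front];
    d->front = (d->front + 1) % d->capacity;
    d->count--;
    return 0;
}

/* Constant-time random access: element i, counted from the front. */
int deque_get(const Deque *d, size_t i) {
    assert(i < d->count);
    return d->items[(d->front + i) % d->capacity];
}
```

Insertion and removal at either end touch a single slot, and deque_get shows how the array-based layout keeps constant-time random access.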
Language support
Ada's container library provides the generic packages Ada.Containers.Vectors and
Ada.Containers.Doubly_Linked_Lists, for the dynamic array and linked list implementations,
respectively.
C++'s Standard Template Library provides the class templates std::deque and std::list, for the multiple
array and linked list implementations, respectively.
As of Java 6, Java's Collections Framework provides a new Deque [2] interface that provides the functionality of
insertion and removal at both ends. It is implemented by classes such as ArrayDeque [3] (also new in Java 6) and
LinkedList [2], providing the dynamic array and linked list implementations, respectively. However, the
ArrayDeque, contrary to its name, does not support random access.
Python 2.4 introduced the collections module with support for deque objects.
As of PHP 5.3, PHP's SPL extension contains the 'SplDoublyLinkedList' class, which can be used to implement
deque data structures. Previously, a deque had to be built using the array functions array_shift, array_unshift,
array_pop and array_push instead.
GHC's Data.Sequence [3] module implements an efficient, functional deque structure in Haskell. The implementation
uses 2-3 finger trees annotated with sizes. There are other (fast) ways to implement purely functional (and thus
also persistent) double-ended queues, most relying heavily on lazy evaluation.[4][5] Kaplan and Tarjan were the
first to implement optimal confluently persistent catenable deques.[6] Their implementation was strictly purely
functional in the sense that it did not use lazy evaluation. Okasaki simplified the data structure by using lazy
evaluation with a bootstrapped data structure, at the cost of degrading the performance bounds from worst-case to
amortized. Kaplan, Okasaki, and Tarjan produced a simpler, non-bootstrapped, amortized version that can be
implemented either using lazy evaluation or, more efficiently, using mutation in a broader but still restricted
fashion. Mihaescu and Tarjan created a simpler (but still highly complex) strictly purely functional implementation
of catenable deques, and also a much simpler implementation of strictly purely functional non-catenable deques,
both of which have optimal worst-case bounds.
Complexity
In a doubly linked list implementation and assuming no allocation/deallocation overhead, the time complexity of
all deque operations is O(1). Additionally, the time complexity of insertion or deletion in the middle, given an
iterator, is O(1); however, the time complexity of random access by index is O(n).
In a growing array, the amortized time complexity of all deque operations is O(1). Additionally, the time
complexity of random access by index is O(1); but the time complexity of insertion or deletion in the middle is
O(n).
Applications
One example where a deque can be used is the A-Steal job scheduling algorithm.[7] This algorithm implements task
scheduling for several processors. A separate deque with threads to be executed is maintained for each processor. To
execute the next thread, the processor gets the first element from the deque (using the "remove first element" deque
operation). If the current thread forks, it is put back to the front of the deque ("insert element at front") and a new
thread is executed. When one of the processors finishes execution of its own threads (i.e. its deque is empty), it can
"steal" a thread from another processor: it gets the last element from the deque of another processor ("remove last
element") and executes it.
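The deque operations used by the scheduler described above can be sketched in C as a single-threaded simulation. This is only an illustration of which end each operation uses: the TaskDeque type, the td_* and next_task names, the fixed capacity, and the rule of scanning the other processors in index order to pick a victim are all assumptions of this sketch, and a real scheduler would also need synchronization between processors.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16

/* Tiny fixed-capacity deque of task ids, stored in a circular array. */
typedef struct {
    int    tasks[MAX_TASKS];
    size_t front;   /* index of the first task */
    size_t count;   /* number of queued tasks */
} TaskDeque;

void td_init(TaskDeque *d) { d->front = 0; d->count = 0; }

/* "insert element at front": used when a running thread forks. */
void td_push_front(TaskDeque *d, int id) {
    assert(d->count < MAX_TASKS);
    d->front = (d->front + MAX_TASKS - 1) % MAX_TASKS;
    d->tasks[d->front] = id;
    d->count++;
}

/* "remove first element": the owner takes its next task here. */
int td_pop_front(TaskDeque *d) {
    assert(d->count > 0);
    int id = d->tasks[d->front];
    d->front = (d->front + 1) % MAX_TASKS;
    d->count--;
    return id;
}

/* "remove last element": a thief steals from the opposite end. */
int td_pop_back(TaskDeque *d) {
    assert(d->count > 0);
    d->count--;
    return d->tasks[(d->front + d->count) % MAX_TASKS];
}

/* Returns the next task for processor `self`, stealing from another
   processor's deque if its own is empty; -1 if no work is left. */
int next_task(TaskDeque *deques, size_t nproc, size_t self) {
    if (deques[self].count > 0)
        return td_pop_front(&deques[self]);
    for (size_t i = 1; i < nproc; i++) {
        TaskDeque *victim = &deques[(self + i) % nproc];
        if (victim->count > 0)
            return td_pop_back(victim);
    }
    return -1;
}
```

Because the owner works at the front and thieves take from the back, the two kinds of access touch opposite ends of the same deque.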
References
[1] Donald Knuth. The Art of Computer Programming, Volume 1: Fundamental Algorithms, Third Edition. Addison-Wesley, 1997. ISBN
0-201-89683-4. Section 2.2.1: Stacks, Queues, and Deques, pp. 238–243.
[2] http://download.oracle.com/javase/7/docs/api/java/util/Deque.html
[3] http://www.haskell.org/ghc/docs/latest/html/libraries/containers/Data-Sequence.html
[4] C. Okasaki, "Purely Functional Data Structures", September 1996. www.cs.cmu.edu/~rwh/theses/okasaki.pdf
[5] Adam L. Buchsbaum and Robert E. Tarjan. Confluently persistent deques via data structural bootstrapping. Journal of Algorithms,
18(3):513–547, May 1995. (pp. 58, 101, 125)
[6] Haim Kaplan and Robert E. Tarjan. Purely functional representations of catenable sorted lists. In ACM Symposium on Theory of Computing,
pages 202–211, May 1996. (pp. 4, 82, 84, 124)
[7] See p.22.
External links
SGI STL Documentation: deque<T, Alloc> (http://www.sgi.com/tech/stl/Deque.html)
Code Project: An In-Depth Study of the STL Deque Container (http://www.codeproject.com/KB/stl/vector_vs_deque.aspx)
Diagram of a typical STL deque implementation (http://pages.cpsc.ucalgary.ca/~kremer/STL/1024x768/deque.html)
Deque implementation in C (http://www.martinbroadhurst.com/articles/deque.html)
VBScript implementation of stack, queue, deque, and Red-Black Tree (http://www.ludvikjerabek.com/downloads.html)
Multiple implementations of non-catenable deques in Haskell (https://code.google.com/p/deques/source/browse/)
Circular buffer
A circular buffer, cyclic buffer or ring buffer is a data structure
that uses a single, fixed-size buffer as if it were connected
end-to-end. This structure lends itself easily to buffering data
streams.
Uses
The useful property of a circular buffer is that it does not need to
have its elements shuffled around when one is consumed. (If a
non-circular buffer were used then it would be necessary to shift
all elements when one is consumed.) In other words, the circular
buffer is well suited as a FIFO buffer while a standard,
non-circular buffer is well suited as a LIFO buffer.
Circular buffering makes a good implementation strategy for a
queue that has a fixed maximum size. Should a maximum size be
adopted for a queue, then a circular buffer is an ideal
implementation; all queue operations are constant time. However,
expanding a circular buffer requires shifting memory, which is
comparatively costly. For arbitrarily expanding queues, a linked
list approach may be preferred instead.
How it works
A circular buffer first starts empty and of some predefined length.
For example, consider a 7-element buffer.
Assume that a 1 is written into the middle of the buffer (the exact starting location does not matter in a circular buffer).
Then assume that two more elements, 2 and 3, are added; they are appended after the 1.
If two elements are then removed from the buffer, the oldest values inside the buffer are removed. The two elements
removed, in this case, are 1 and 2, leaving the buffer with just a 3.
Suppose that six more elements, 4 through 9, are then written, so that all 7 slots are occupied and the buffer is full.
A consequence of the circular buffer is that when it is full and a subsequent write is performed, it starts
overwriting the oldest data. In this case, two more elements, A and B, are added and they overwrite the 3 and 4.
Alternatively, the routines that manage the buffer could prevent overwriting the data and return an error or raise an
exception. Whether or not data is overwritten is up to the semantics of the buffer routines or the application using the
circular buffer.
Finally, if two elements are now removed, what is returned is not 3 and 4 but 5 and 6, because A and B
overwrote the 3 and the 4. This leaves the buffer holding 7, 8, 9, A and B.
Alternatively, a fixed-length buffer with two integers to keep track of indices can be used in languages that do not
have pointers.
Taking a couple of examples from above. (While there are numerous ways to label the pointers and exact semantics
can vary, this is one way to do it.)
This image shows a partially full buffer:
This image shows a full buffer with two elements having been overwritten:
Note that in the second case, each time an element is overwritten the start pointer is incremented as
well.
Difficulties
Full / Empty Buffer Distinction
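A small disadvantage of relying on pointers or relative indices of the start and end of the data is that, when the
buffer is entirely full, both pointers point to the same element, which is exactly the situation that arises when the
buffer is empty. There are several ways to resolve this ambiguity.
Always Keep One Slot Open
This design always leaves one slot unallocated. If both pointers refer to the same slot, the buffer is empty. If the
end (write) pointer refers to the slot preceding the one referred to by the start (read) pointer, the buffer is full.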
The advantage is:
The solution is simple and robust.
The disadvantages are:
One slot is lost, so it is a bad compromise when the buffer size is small or the slot is big or is implemented in
hardware.
The full test requires a modulo operation
Example implementation, 'C' language
/* Circular buffer example, keeps one slot open */
#include <stdio.h>
#include <stdlib.h>

/* Opaque buffer element type. This would be defined by the application. */
typedef struct { int value; } ElemType;

/* Circular buffer object */
typedef struct {
    int       size;   /* maximum number of elements           */
    int       start;  /* index of oldest element              */
    int       end;    /* index at which to write new element  */
    ElemType *elems;  /* vector of elements                   */
} CircularBuffer;

void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size + 1; /* include empty elem */
    cb->start = 0;
    cb->end   = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

void cbFree(CircularBuffer *cb) {
    free(cb->elems); /* OK if null */
}

int cbIsFull(CircularBuffer *cb) {
    return (cb->end + 1) % cb->size == cb->start;
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->end == cb->start;
}

/* Write an element, overwriting oldest element if buffer is full. App can
   choose to avoid the overwrite by checking cbIsFull(). */
void cbWrite(CircularBuffer *cb, ElemType *elem) {
    cb->elems[cb->end] = *elem;
    cb->end = (cb->end + 1) % cb->size;
    if (cb->end == cb->start)
        cb->start = (cb->start + 1) % cb->size; /* full, overwrite */
}

/* Read oldest element. App must ensure !cbIsEmpty() first. */
void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start];
    cb->start = (cb->start + 1) % cb->size;
}

int main(int argc, char **argv) {
    CircularBuffer cb;
    ElemType elem = {0};
    int testBufferSize = 10; /* arbitrary size */

    cbInit(&cb, testBufferSize);

    /* Fill buffer with test elements 3 times */
    for (elem.value = 0; elem.value < 3 * testBufferSize; ++elem.value)
        cbWrite(&cb, &elem);

    /* Remove and print all elements */
    while (!cbIsEmpty(&cb)) {
        cbRead(&cb, &elem);
        printf("%d\n", elem.value);
    }

    cbFree(&cb);
    return 0;
}
Use a Fill Count
This approach replaces the end pointer with a counter that tracks the number of readable items in the buffer. This
unambiguously indicates when the buffer is empty or full and allows use of all buffer slots.
The performance impact should be negligible, since this approach adds the costs of maintaining the counter and
computing the tail slot on writes but eliminates the need to maintain the end pointer and simplifies the fullness test.
The advantage is:
The test for full/empty is simple
The disadvantages are:
You need modulo for read and write
Read and write operations must share the counter field, so synchronization is required in multi-threaded situations.
Note: When using semaphores in a Producer-consumer model, the semaphores act as a fill count.
Differences from previous example
/* This approach replaces the CircularBuffer 'end' field with the
   'count' field and changes these functions: */
void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size;
    cb->start = 0;
    cb->count = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

int cbIsFull(CircularBuffer *cb) {
    return cb->count == cb->size;
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->count == 0;
}

void cbWrite(CircularBuffer *cb, ElemType *elem) {
    int end = (cb->start + cb->count) % cb->size; /* tail slot computed on the fly */
    cb->elems[end] = *elem;
    if (cb->count == cb->size)
        cb->start = (cb->start + 1) % cb->size; /* full, overwrite */
    else
        ++cb->count;
}

/* Reading must also decrement the fill count, otherwise the buffer
   would never become empty again. */
void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start];
    cb->start = (cb->start + 1) % cb->size;
    --cb->count;
}
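Mirroring
Another solution adds one extra most-significant bit (msb) to each of the start and end pointers. The bit is inverted
every time the pointer wraps around, so it records the parity of the number of wraps, as if the pointer alternated
between the buffer and a mirrored copy of it.
When the pointers (including the extra msb bit) are equal, the buffer is empty, while if the
pointers differ only by the extra msb bit, the buffer is full.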
The advantages are:
The test for full/empty is simple
No modulo operation is needed
The source and sink of data can implement independent policies for dealing with a full buffer and overrun while
adhering to the rule that only the source of data modifies the write count and only the sink of data modifies the
read count. This can result in elegant and robust circular buffer implementations even in multi-threaded
environments.
The disadvantage is:
You need one more bit for read and write pointer
Differences from Always Keep One Slot Open example
/* This approach adds one bit to end and start pointers */

/* Circular buffer object */
typedef struct {
    int       size;   /* maximum number of elements           */
    int       start;  /* index of oldest element              */
    int       end;    /* index at which to write new element  */
    int       s_msb;
    int       e_msb;
    ElemType *elems;  /* vector of elements                   */
} CircularBuffer;

void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size;
    cb->start = 0;
    cb->end   = 0;
    cb->s_msb = 0;
    cb->e_msb = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

int cbIsFull(CircularBuffer *cb) {
    return cb->end == cb->start && cb->e_msb != cb->s_msb;
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->end == cb->start && cb->e_msb == cb->s_msb;
}

void cbIncr(CircularBuffer *cb, int *p, int *msb) {
    *p = *p + 1;
    if (*p == cb->size) {
        *msb ^= 1;
        *p = 0;
    }
}

void cbWrite(CircularBuffer *cb, ElemType *elem) {
    cb->elems[cb->end] = *elem;
    if (cbIsFull(cb)) /* full, overwrite moves start pointer */
        cbIncr(cb, &cb->start, &cb->s_msb);
    cbIncr(cb, &cb->end, &cb->e_msb);
}

void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start];
    cbIncr(cb, &cb->start, &cb->s_msb);
}
If the size is a power of two, the implementation is simpler and the separate msb variables are no longer necessary,
removing the disadvantage:
Differences from Always Keep One Slot Open example
/* This approach adds one bit to end and start pointers */

/* Circular buffer object */
typedef struct {
    int       size;   /* maximum number of elements           */
    int       start;  /* index of oldest element              */
    int       end;    /* index at which to write new element  */
    ElemType *elems;  /* vector of elements                   */
} CircularBuffer;

void cbInit(CircularBuffer *cb, int size) {
    cb->size  = size;
    cb->start = 0;
    cb->end   = 0;
    cb->elems = (ElemType *)calloc(cb->size, sizeof(ElemType));
}

void cbPrint(CircularBuffer *cb) {
    printf("size=0x%x, start=%d, end=%d\n", cb->size, cb->start, cb->end);
}

int cbIsFull(CircularBuffer *cb) {
    /* This inverts the most significant bit of start before comparison */
    return cb->end == (cb->start ^ cb->size);
}

int cbIsEmpty(CircularBuffer *cb) {
    return cb->end == cb->start;
}

int cbIncr(CircularBuffer *cb, int p) {
    /* start and end pointers incrementation is done modulo 2*size */
    return (p + 1) & (2 * cb->size - 1);
}

void cbWrite(CircularBuffer *cb, ElemType *elem) {
    cb->elems[cb->end & (cb->size - 1)] = *elem;
    if (cbIsFull(cb)) /* full, overwrite moves start pointer */
        cb->start = cbIncr(cb, cb->start);
    cb->end = cbIncr(cb, cb->end);
}

void cbRead(CircularBuffer *cb, ElemType *elem) {
    *elem = cb->elems[cb->start & (cb->size - 1)];
    cb->start = cbIncr(cb, cb->start);
}
Read / Write Counts
Another solution is to keep counts of the number of items written to and read from the circular buffer. Both counts
are stored in unsigned integer variables with numerical limits larger than the number of items that can be stored and
are allowed to wrap freely.
The unsigned difference (write_count - read_count) always yields the number of items placed in the buffer and not
yet retrieved. This can indicate that the buffer is empty, partially full, completely full (without waste of a storage
location) or in a state of overrun.
The advantage is:
The source and sink of data can implement independent policies for dealing with a full buffer and overrun while
adhering to the rule that only the source of data modifies the write count and only the sink of data modifies the
read count. This can result in elegant and robust circular buffer implementations even in multi-threaded
environments.
The disadvantage is:
You need two additional variables.
Absolute indices
It is possible to optimize the previous solution by using indices instead of pointers: the indices store the read and
write counts themselves rather than offsets from the start of the buffer. The separate variables of the previous
solution are then unnecessary, and relative indices are obtained on the fly by taking the counts modulo the buffer's
length.
The advantage is:
No extra variables are needed.
The disadvantages are:
Every access needs an additional modulo operation.
If counter wrap is possible, complex logic can be needed if the buffer's length is not a divisor of the counter's
capacity.
On binary computers, both of these disadvantages disappear if the buffer's length is a power of two, at the cost of a
constraint on possible buffer lengths.
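A minimal C sketch of absolute indices follows, assuming a buffer length that is a power of two. The Ring type, the ring_* names and the deliberately small 8-bit counters are hypothetical choices made so that counter wraparound is easy to observe; this sketch rejects writes to a full buffer rather than overwriting.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8u   /* a power of two that divides 256, the range of uint8_t */

typedef struct {
    int     slots[RING_SIZE];
    uint8_t write_count;   /* total items written; wraps at 256 */
    uint8_t read_count;    /* total items read; wraps at 256 */
} Ring;

void ring_init(Ring *r) { r->write_count = 0; r->read_count = 0; }

/* Items written but not yet read. The cast reduces the difference
   modulo 256, so the result stays correct after the counters wrap. */
unsigned ring_used(const Ring *r) {
    return (uint8_t)(r->write_count - r->read_count);
}

int ring_put(Ring *r, int v) {
    if (ring_used(r) == RING_SIZE) return -1;          /* full */
    r->slots[r->write_count & (RING_SIZE - 1)] = v;    /* mask replaces modulo */
    r->write_count++;
    return 0;
}

int ring_get(Ring *r, int *out) {
    if (ring_used(r) == 0) return -1;                  /* empty */
    *out = r->slots[r->read_count & (RING_SIZE - 1)];
    r->read_count++;
    return 0;
}
```

Only ring_put modifies write_count and only ring_get modifies read_count, and every slot of the buffer is usable because full and empty are distinguished by the difference of the counts rather than by comparing positions.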
Record last operation
Another solution is to keep a flag indicating whether the most recent operation was a read or a write. If the two
pointers are equal, then the flag will show whether the buffer is full or empty: if the most recent operation was a
write, the buffer must be full, and conversely if it was a read, it must be empty.
The advantages are:
Only a single bit needs to be stored (which may be particularly useful if the algorithm is implemented in
hardware)
The test for full/empty is simple
The disadvantages are:
You need an extra variable.
Read and write operations must share the flag, so synchronization is probably required in multi-threaded situations.
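The flag-based test can be sketched in C as follows. The FlagBuffer type, the fb_* names and the capacity of 4 are hypothetical, and this sketch rejects writes to a full buffer instead of overwriting the oldest element.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 4

typedef struct {
    int    slots[CAP];
    size_t start;           /* read index */
    size_t end;             /* write index */
    bool   last_was_write;  /* the single extra bit */
} FlagBuffer;

void fb_init(FlagBuffer *b) {
    b->start = 0;
    b->end = 0;
    b->last_was_write = false;
}

/* Equal indices are ambiguous; the flag resolves the ambiguity. */
bool fb_is_empty(const FlagBuffer *b) {
    return b->start == b->end && !b->last_was_write;
}

bool fb_is_full(const FlagBuffer *b) {
    return b->start == b->end && b->last_was_write;
}

int fb_write(FlagBuffer *b, int v) {
    if (fb_is_full(b)) return -1;
    b->slots[b->end] = v;
    b->end = (b->end + 1) % CAP;
    b->last_was_write = true;
    return 0;
}

int fb_read(FlagBuffer *b, int *out) {
    if (fb_is_empty(b)) return -1;
    *out = b->slots[b->start];
    b->start = (b->start + 1) % CAP;
    b->last_was_write = false;
    return 0;
}
```

Both fb_write and fb_read store to last_was_write, which is the shared state that makes synchronization necessary when the reader and writer run in different threads.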
Chunked Buffer
Handling different chunks of data in the same circular buffer is much more complex. The writer not only writes
elements to the buffer, it also assigns these elements to chunks [citation needed].
The reader should not only be able to read from the buffer, it should also be informed about the chunk borders.
Example: The writer reads data from small files and writes them into the same circular buffer. The reader
reads the data, but needs to know when and which file starts at a given position.
Optimization
A circular-buffer implementation may be optimized by mapping the underlying buffer to two contiguous regions of
virtual memory. (Naturally, the underlying buffer's length must then equal some multiple of the system's page size.)
Reading from and writing to the circular buffer may then be carried out with greater efficiency by means of direct
memory access; those accesses which fall beyond the end of the first virtual-memory region will automatically wrap
around to the beginning of the underlying buffer. When the read offset is advanced into the second virtual-memory
region, both offsets, read and write, are decremented by the length of the underlying buffer.[1]
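The double mapping described above can be sketched on a POSIX system as follows. This is a hedged, self-contained illustration rather than the listing that accompanies this section: the MirrorRing type and the mirror_ring_* names are hypothetical, and backing the buffer with a file obtained from tmpfile() is an assumption made to keep the sketch short.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS, fileno and ftruncate on glibc */
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    unsigned char *base;  /* start of the doubled virtual region */
    size_t         size;  /* length of the underlying buffer in bytes */
} MirrorRing;

/* Maps `pages` pages of a temporary file twice, back to back.
   Returns 0 on success, -1 on failure. */
int mirror_ring_create(MirrorRing *r, size_t pages) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || pages == 0) return -1;
    r->size = (size_t)page * pages;

    FILE *tmp = tmpfile();                 /* backing file, removed on close */
    if (!tmp) return -1;
    int fd = fileno(tmp);
    if (ftruncate(fd, (off_t)r->size) != 0) { fclose(tmp); return -1; }

    /* Reserve 2 * size bytes of contiguous address space. */
    void *region = mmap(NULL, 2 * r->size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { fclose(tmp); return -1; }
    r->base = region;

    /* Map the same file over the first and the second half. */
    void *lo = mmap(r->base, r->size, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_SHARED, fd, 0);
    void *hi = mmap(r->base + r->size, r->size, PROT_READ | PROT_WRITE,
                    MAP_FIXED | MAP_SHARED, fd, 0);
    fclose(tmp);                           /* the mappings keep the file alive */
    if (lo == MAP_FAILED || hi == MAP_FAILED) {
        munmap(region, 2 * r->size);
        return -1;
    }
    return 0;
}

void mirror_ring_destroy(MirrorRing *r) {
    munmap(r->base, 2 * r->size);
}
```

After a successful call, a byte written at base[size + k] is also visible at base[k], so a read or write that runs past the end of the first region needs no explicit wraparound handling.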
  address =
    mmap (buffer->address, buffer->count_bytes, PROT_READ | PROT_WRITE,
          MAP_FIXED | MAP_SHARED, file_descriptor, 0);

  if (address != buffer->address)
    report_exceptional_condition ();

  /* Cast to char * before adding an offset: arithmetic on a void
     pointer is a constraint violation in ISO C. */
  address = mmap ((char *) buffer->address + buffer->count_bytes,
                  buffer->count_bytes, PROT_READ | PROT_WRITE,
                  MAP_FIXED | MAP_SHARED, file_descriptor, 0);

  if (address != (char *) buffer->address + buffer->count_bytes)
    report_exceptional_condition ();

  status = close (file_descriptor);
  if (status)
    report_exceptional_condition ();
}

void
ring_buffer_free (struct ring_buffer *buffer)
{
  int status;

  status = munmap (buffer->address, buffer->count_bytes << 1);
  if (status)
    report_exceptional_condition ();
}

void *
ring_buffer_write_address (struct ring_buffer *buffer)
{
  /*** void pointer arithmetic is a constraint violation. ***/
  return (char *) buffer->address + buffer->write_offset_bytes;
}

void
ring_buffer_write_advance (struct ring_buffer *buffer,
                           unsigned long count_bytes)
{
  buffer->write_offset_bytes += count_bytes;
}

void *
ring_buffer_read_address (struct ring_buffer *buffer)
{
  return (char *) buffer->address + buffer->read_offset_bytes;
}

void
ring_buffer_read_advance (struct ring_buffer *buffer,
                          unsigned long count_bytes)
{
  buffer->read_offset_bytes += count_bytes;

  if (buffer->read_offset_bytes >= buffer->count_bytes)
    {
      buffer->read_offset_bytes -= buffer->count_bytes;
      buffer->write_offset_bytes -= buffer->count_bytes;
    }
}

unsigned long
ring_buffer_count_bytes (struct ring_buffer *buffer)
{
  return buffer->write_offset_bytes - buffer->read_offset_bytes;
}

unsigned long
ring_buffer_count_free_bytes (struct ring_buffer *buffer)
{
  return buffer->count_bytes - ring_buffer_count_bytes (buffer);
}

void
ring_buffer_clear (struct ring_buffer *buffer)
{
  buffer->write_offset_bytes = 0;
  buffer->read_offset_bytes = 0;
}

/* Note that the initial anonymous mmap() can be avoided: after the
   initial mmap() for descriptor fd, you can try mmap() with a hinted
   address of (buffer->address + buffer->count_bytes) and, if that
   fails, another one with a hinted address of
   (buffer->address - buffer->count_bytes).
   Make sure MAP_FIXED is not used in that case, as under certain
   situations it could end with a segfault.
   The advantage of this approach is that it avoids the requirement to
   map twice the amount you need initially (especially useful if, for
   example, you want to use hugetlbfs and the allowed amount is
   limited), and in the context of gcc/glibc you can avoid certain
   feature macros (MAP_ANONYMOUS usually requires one of _BSD_SOURCE,
   _SVID_SOURCE or _GNU_SOURCE). */
Variants
Perhaps the most common version of the circular buffer uses 8-bit bytes as elements.
Some implementations of the circular buffer use fixed-length elements that are bigger than 8-bit bytes, such as 16-bit
integers for audio buffers or 53-byte ATM cells for telecom buffers. Each item is contiguous and has the correct
data alignment, so software reading and writing these values can be faster than software that handles non-contiguous
and non-aligned values.
Ping-pong buffering can be considered a very specialized circular buffer with exactly two large fixed-length
elements.
The Bip Buffer is very similar to a circular buffer, except it always returns contiguous blocks (which can be variable
length).
External links
[1] Simon Cooke. "The Bip Buffer - The Circular Buffer with a Twist"
(http://www.codeproject.com/Articles/3479/The-Bip-Buffer-The-Circular-Buffer-with-a-Twist). 2003.
Dictionaries
Associative array
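In computer science, an associative array, map, symbol table, or dictionary is an abstract data type composed of a
collection of (key, value) pairs, such that each possible key appears at most once in the collection.
Operations associated with this data type allow the addition of pairs to the collection, the removal of pairs from the
collection, the modification of the values of existing pairs, and the lookup of the value associated with a particular
key.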
The dictionary problem is the task of designing a data structure that implements an associative array. A standard
solution to the dictionary problem is a hash table; in some cases it is also possible to solve the problem using directly
addressed arrays, binary search trees, or other more specialized structures.
Many programming languages include associative arrays as primitive data types, and they are available in software
libraries for many others. Content-addressable memory is a form of direct hardware-level support for associative
arrays.
Associative arrays have many applications including such fundamental programming patterns as memoization and
the decorator pattern.[1]
Operations
In an associative array, the association between a key and a value is often known as a "binding", and the same word
"binding" may also be used to refer to the process of creating a new association.
The operations that are usually defined for an associative array are:
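Add or insert: add a new (key, value) pair to the collection, binding the new key to its new value. The
arguments to this operation are the key and the value.
Reassign: replace the value in one of the (key, value) pairs that are already in the collection, binding an old
key to a new value. As with an insertion, the arguments to this operation are the key and the value.
Remove or delete: remove a (key, value) pair from the collection, unbinding a given key from its value. The
argument to this operation is the key.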
Lookup: find the value (if any) that is bound to a given key. The argument to this operation is the key, and the
value is returned from the operation. If no value is found, some associative array implementations raise an
exception.
In addition, associative arrays may also include other operations such as determining the number of bindings or
constructing an iterator to loop over all the bindings. Usually, for such an operation, the order in which the bindings
are returned may be arbitrary.
A multimap generalizes an associative array by allowing multiple values to be associated with a single key.[2] A
bidirectional map is a related abstract data type in which the bindings operate in both directions: each value must be
associated with a unique key, and a second lookup operation takes a value as argument and looks up the key
associated with that value.
Example
Suppose that the set of loans made by a library is to be represented in a data structure. Each book in a library may be
checked out only by a single library patron at a time. However, a single patron may be able to check out multiple
books. Therefore, the information about which books are checked out to which patrons may be represented by an
associative array, in which the books are the keys and the patrons are the values. For instance (using notation from
Python, or JSON (JavaScript Object Notation), in which a binding is represented by placing a colon between the key
and the value), the current checkouts may be represented by an associative array
{
"Great Expectations": "John",
"Pride and Prejudice": "Alice",
"Wuthering Heights": "Alice"
}
A lookup operation with the key "Great Expectations" in this array would return the name of the person who checked
out that book, John. If John returns his book, that would cause a deletion operation in the associative array, and if Pat
checks out another book, that would cause an insertion operation, leading to a different state:
{
"Pride and Prejudice": "Alice",
"The Brothers Karamazov": "Pat",
"Wuthering Heights": "Alice"
}
In this new state, the same lookup as before, with the key "Great Expectations", would raise an exception, because
this key is no longer present in the array.
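The two states above can be reproduced with Python's built-in dict type. This is only a small sketch of the three basic operations; the variable names loans, borrower, and lookup_failed are chosen here for illustration.

```python
# Current checkouts: books are the keys, patrons are the values.
loans = {
    "Great Expectations": "John",
    "Pride and Prejudice": "Alice",
    "Wuthering Heights": "Alice",
}

# Lookup: returns the patron bound to the key.
borrower = loans["Great Expectations"]

# Deletion: John returns his book.
del loans["Great Expectations"]

# Insertion: Pat checks out another book.
loans["The Brothers Karamazov"] = "Pat"

# The same lookup as before now raises KeyError, because the key is gone.
try:
    loans["Great Expectations"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```

Python is one of the implementations that raise an exception for a missing key; loans.get("Great Expectations") would return None instead.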
Implementation
For dictionaries with very small numbers of bindings, it may make sense to implement the dictionary using an
association list, a linked list of bindings. With this implementation, the time to perform the basic dictionary
operations is linear in the total number of bindings; however, it is easy to implement and the constant factors in its
running time are small.
Another very simple implementation technique, usable when the keys are restricted to a narrow range of integers, is
direct addressing into an array: the value for a given key k is stored at the array cell A[k], or if there is no binding for
k then the cell stores a special sentinel value that indicates the absence of a binding. As well as being simple, this
technique is fast: each dictionary operation takes constant time. However, the space requirement for this structure is
the size of the entire keyspace, making it impractical unless the keyspace is small.
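A minimal sketch of direct addressing follows, assuming keys are integers in the range 0 to keyspace_size - 1. The class name DirectAddressTable and the sentinel ABSENT are illustrative names, not part of any library.

```python
ABSENT = object()  # sentinel marking "no binding" (any unique object works)

class DirectAddressTable:
    """Dictionary for integer keys in range(keyspace_size)."""

    def __init__(self, keyspace_size):
        # Space is proportional to the whole keyspace, not to the bindings.
        self.cells = [ABSENT] * keyspace_size

    def insert(self, key, value):
        self.cells[key] = value          # constant time

    def delete(self, key):
        self.cells[key] = ABSENT         # constant time

    def lookup(self, key):
        value = self.cells[key]          # constant time
        if value is ABSENT:
            raise KeyError(key)
        return value

table = DirectAddressTable(10)
table.insert(3, "three")
```

The sketch does not validate keys; a production version would reject keys outside the allowed range.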
The most frequently used general purpose implementation of an associative array is with a hash table: an array of
bindings, together with a hash function that maps each possible key into an array index. The basic idea of a hash
table is that the binding for a given key is stored at the position given by applying the hash function to that key, and
that lookup operations are performed by looking at that cell of the array and using the binding found there. However,
hash table based dictionaries must be prepared to handle collisions that occur when two keys are mapped by the hash
function to the same index, and many different collision resolution strategies have been developed for dealing with
this situation, often based either on open addressing (looking at a sequence of hash table indices instead of a single
index, until finding either the given key or an empty cell) or on hash chaining (storing a small association list instead
of a single binding in each hash table cell).
Dictionaries may also be stored in binary search trees or in data structures specialized to a particular type of keys
such as radix trees, tries, Judy arrays, or van Emde Boas trees, but these implementation methods are less efficient
than hash tables as well as placing greater restrictions on the types of data that they can handle. The advantages of
these alternative structures come from their ability to handle operations beyond the basic ones of an associative
array, such as finding the binding whose key is the closest to a queried key, when the query is not itself present in the
set of bindings.
Language support
Associative arrays can be implemented in any programming language as a package and many language systems
provide them as part of their standard library. In some languages, they are not only built into the standard system, but
have special syntax, often using array-like subscripting.
Built-in syntactic support for associative arrays was introduced by SNOBOL4, under the name "table". MUMPS
made multi-dimensional associative arrays, optionally persistent, its key data structure. SETL supported them as one
possible implementation of sets and maps. Most modern scripting languages, starting with AWK and including Perl,
Tcl, JavaScript, Python, Ruby, and Lua, support associative arrays as a primary container type. In many more
languages, they are available as library functions without special syntax.
In Smalltalk, Objective-C, .NET, Python, and REALbasic they are called dictionaries; in Perl and Ruby they are
called hashes; in C++, Java, Go, Clojure, Scala, OCaml, and Haskell they are called maps (see map (C++),
unordered_map (C++), and Map [3]); in Common Lisp and Windows PowerShell, they are called hash tables (since
both typically use this implementation). In PHP, all arrays can be associative, except that the keys are limited to
integers and strings. In JavaScript (see also JSON), all objects behave as associative arrays. In Lua, they are called
tables, and are used as the primitive building block for all data structures. In Visual FoxPro, they are called
Collections.
References
[1] , pp. 597–599.
[2] , pp. 389–397.
[3] http://download.oracle.com/javase/7/docs/api/java/util/Map.html
External links
NIST's Dictionary of Algorithms and Data Structures: Associative Array (http://www.nist.gov/dads/HTML/
assocarray.html)
Association list
Type: associative array
Time complexity in big O notation
          Average   Worst case
Space     O(n)      O(n)
Search    O(n)      O(n)
Insert    O(1)      O(1)
Delete    O(n)      O(n)
In computer programming and particularly in Lisp, an association list, often referred to as an alist, is a linked list in
which each list element (or node) comprises a key and a value. The association list is said to associate the value with
the key. In order to find the value associated with a given key, each element of the list is searched in turn, starting at
the head, until the key is found. Duplicate keys that appear later in the list are ignored. It is a simple way of
implementing an associative array.
The disadvantage of association lists is that the time to search is O(n), where n is the length of the list. Moreover, unless
the list is regularly pruned to remove elements with duplicate keys, multiple values associated with the same key will
increase the size of the list, and thus the time to search, without providing any compensatory advantage. One
advantage is that a new element can be added to the list at its head, which can be done in constant time. For quite
small values of n it is more efficient in terms of time and space than more sophisticated strategies such as hash tables
and trees.
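The behavior described above can be sketched in a few lines of Python. Here each node is modeled as a tuple (key, value, rest) and the empty list as None; this representation and the function names alist_insert and alist_lookup are assumptions made for the example.

```python
def alist_insert(alist, key, value):
    """Add a binding at the head in constant time; returns the new list."""
    return (key, value, alist)        # node = (key, value, rest of list)

def alist_lookup(alist, key):
    """Scan from the head; the first match wins, so duplicates later
    in the list are ignored."""
    node = alist
    while node is not None:
        k, v, rest = node
        if k == key:
            return v
        node = rest
    raise KeyError(key)

alist = None                          # the empty list
alist = alist_insert(alist, "x", 1)
alist = alist_insert(alist, "y", 2)
alist = alist_insert(alist, "x", 3)   # shadows the older binding for "x"
```

After these insertions the list holds three nodes but only two distinct keys, which illustrates why unpruned duplicates lengthen searches without adding information.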
In the early development of Lisp, association lists were used to resolve references to free variables in procedures.
Many programming languages, including Lisp, Scheme, OCaml, and Haskell have functions for handling association
lists in their standard library.
Hash table
Invented: 1953
Time complexity in big O notation
          Average   Worst case
Space     O(n)      O(n)
Search    O(1)      O(n)
Insert    O(1)      O(n)
Delete    O(1)      O(n)
Hashing
The idea of hashing is to distribute the entries (key/value pairs) across an array of buckets. Given a key, the
algorithm computes an index that suggests where the entry can be found:
index = f(key, array_size)
Often this is done in two steps:
hash = hashfunc(key)
index = hash % array_size
In this method, the hash is independent of the array size, and it is then reduced to an index (a number between 0 and
array_size − 1) using the modulus operator (%).
In the case that the array size is a power of two, the remainder operation is reduced to masking, which improves
speed, but can increase problems with a poor hash function.
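The two-step scheme and the power-of-two shortcut can be sketched as follows. Python's built-in hash stands in for hashfunc here, and the function names bucket_index and bucket_index_pow2 are illustrative.

```python
def bucket_index(key, array_size):
    """Two-step scheme: hash the key, then reduce the hash to an index."""
    h = hash(key)                 # independent of the array size
    return h % array_size         # in Python, always in range(array_size)

def bucket_index_pow2(key, array_size):
    """Same reduction by masking; valid only when array_size is a power of two."""
    assert array_size > 0 and array_size & (array_size - 1) == 0
    return hash(key) & (array_size - 1)
```

Masking keeps only the low-order bits of the hash, which is why a hash function whose low bits are poorly distributed causes more trouble with power-of-two table sizes.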
Key statistics
A critical statistic for a hash table is called the load factor. This is simply the number of entries divided by the
number of buckets, that is, n/k where n is the number of entries and k is the number of buckets.
If the load factor is kept reasonable, the hash table should perform well, provided the hashing is good. If the load
factor grows too large, the hash table will become slow, or it may fail to work (depending on the method used). The
expected constant time property of a hash table assumes that the load factor is kept below some bound. For a fixed
number of buckets, the time for a lookup grows with the number of entries and so does not achieve the desired
constant time.
Second to that, one can examine the variance of number of entries per bucket. For example, two tables both have
1000 entries and 1000 buckets; one has exactly one entry in each bucket, the other has all entries in the same bucket.
Clearly the hashing is not working in the second one.
A low load factor is not especially beneficial. As load factor approaches 0, the proportion of unused areas in the hash
table increases, but there is not necessarily any reduction in search cost. This results in wasted memory.
Collision resolution
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys. For
example, if 2,500 keys are hashed into a million buckets, even with a perfectly uniform random distribution,
according to the birthday problem there is a 95% chance of at least two of the keys being hashed to the same slot.
Therefore, most hash table implementations have some collision resolution strategy to handle such events. Some
common strategies are described below. All these methods require that the keys (or pointers to them) be stored in the
table, together with the associated values.
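The birthday-problem figure quoted above can be checked numerically. The sketch below assumes each key is hashed independently and uniformly at random; the function name collision_probability is chosen for illustration.

```python
def collision_probability(num_keys, num_buckets):
    """Probability that at least two of num_keys uniformly random
    hashes land in the same one of num_buckets buckets."""
    p_all_distinct = 1.0
    for i in range(num_keys):
        # The (i+1)-th key must avoid the i buckets already taken.
        p_all_distinct *= (num_buckets - i) / num_buckets
    return 1.0 - p_all_distinct

p = collision_probability(2500, 1_000_000)
```

For 2,500 keys and a million buckets, p comes out slightly above 0.95, in line with the figure in the text, even though the load factor is only 0.0025.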
Separate chaining
In the method known as separate chaining, each bucket is independent, and has some sort of list of entries with the
same index. The time for hash table operations is the time to find the bucket (which is constant) plus the time for
the list operation. (The technique is also called open hashing or closed addressing.)
In a good hash table, each bucket has zero or one entries, and sometimes two or three, but rarely more than that.
Therefore, structures that are efficient in time and space for these cases are preferred. Structures that are efficient
for a fairly large number of entries are not needed or desirable. If these cases happen often, the hashing is not
working well, and this needs to be fixed.
Figure: Hash collision resolved by separate chaining.
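A minimal sketch of separate chaining is given here, under the assumption that each bucket is a Python list of (key, value) pairs rather than a linked list. The class name ChainedHashTable is illustrative, and the table never resizes.

```python
class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]  # one list per bucket

    def _bucket(self, key):
        # Finding the bucket takes constant time.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key: extend the chain

    def lookup(self, key):
        for k, v in self._bucket(key):    # list operation on one chain only
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)

t = ChainedHashTable(4)
for key, value in [(1, "a"), (5, "b"), (9, "c")]:
    t.insert(key, value)                  # 1, 5 and 9 all map to bucket 1
```

With four buckets the integer keys 1, 5, and 9 collide, so all three entries end up in the same chain and a lookup walks that chain only.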
Figure: Hash collision by separate chaining with head records in the bucket array.
Open addressing
In another strategy, called open addressing, all entry records are stored in the bucket array itself. When a new entry
has to be inserted, the buckets are examined, starting with the hashed-to slot and proceeding in some probe
sequence, until an unoccupied slot is found. When searching for an entry, the buckets are scanned in the same
sequence, until either the target record is found, or an unused array slot is found, which indicates that there is no
such key in the table. The name "open addressing" refers to the fact that the location ("address") of the item is not
determined by its hash value. (This method is also called closed hashing; it should not be confused with "open
hashing" or "closed addressing" that usually mean separate chaining.)
Figure: Hash collision resolved by open addressing with linear probing (interval=1). Note that "Ted Baker" has a
unique hash, but nevertheless collided with "Sandra Dee", that had previously collided with "John Smith".
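The insertion and search procedures can be sketched with linear probing (interval 1) as the probe sequence. The class name LinearProbingTable and the sentinel EMPTY are illustrative; deletion is deliberately left out, since removing an entry from an open-addressed table needs extra care to keep later probes working.

```python
EMPTY = object()      # marks a slot that has never been used

class LinearProbingTable:
    def __init__(self, num_slots=8):
        self.slots = [EMPTY] * num_slots

    def _probe(self, key):
        """Yield slot indices in probe order, starting at the hashed-to slot."""
        n = len(self.slots)
        start = hash(key) % n
        for i in range(n):
            yield (start + i) % n

    def insert(self, key, value):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY or slot[0] == key:
                self.slots[idx] = (key, value)
                return
        raise RuntimeError("table is full")

    def lookup(self, key):
        for idx in self._probe(key):
            slot = self.slots[idx]
            if slot is EMPTY:
                break                 # unused slot: key is not in the table
            if slot[0] == key:
                return slot[1]
        raise KeyError(key)

t = LinearProbingTable(8)
for key, value in [(1, "a"), (9, "b"), (17, "c"), (2, "d")]:
    t.insert(key, value)
```

Keys 1, 9, and 17 all hash to slot 1 and occupy slots 1 to 3. Key 2 hashes to slot 2 without sharing a hash with any other key, yet it is pushed to slot 4, the same effect the figure caption describes for "Ted Baker".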
Double hashing, in which the interval between probes is computed by another hash function
A drawback of all these open addressing schemes is that the number of stored entries cannot exceed the number of
slots in the bucket array. In fact, even with good hash functions, their performance dramatically degrades when the
load factor grows beyond 0.7 or so. Thus a more aggressive resize scheme is needed. Separate chaining works
correctly with any load factor, although performance is likely to be reasonable if it is kept below 2 or so. For many
applications, these restrictions mandate the use of dynamic resizing, with its attendant costs.
Open addressing schemes also put more stringent requirements on the hash function: besides distributing the keys
more uniformly over the buckets, the function must also minimize the clustering of hash values that are consecutive
in the probe order. Using separate chaining, the only concern is that too many objects map to the same hash value;
whether they are adjacent or nearby is completely irrelevant.
Open addressing only saves memory if the entries are small (less than four times the size of a pointer) and the load
factor is not too small. If the load factor is close to zero (that is, there are far more buckets than stored entries), open
addressing is wasteful even if each entry is just two words.
Open addressing avoids the time overhead of allocating each new entry record, and can be implemented even in the
absence of a memory allocator. It also avoids the extra indirection required to access the first entry of each bucket
(that is, usually the only one). It also has better locality of reference, particularly with linear probing. With small
record sizes, these factors can yield better performance than chaining, particularly for lookups. Hash tables with
open addressing are also easier to serialize, because they do not use pointers.
Figure: This graph compares the average number of cache misses required to look up elements in tables with
chaining and linear probing. As the table passes the 80%-full mark, linear probing's performance drastically degrades.
Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes within the table itself.
Like open addressing, it achieves space usage and (somewhat diminished) cache advantages over chaining. Like
chaining, it does not exhibit clustering effects; in fact, the table can be efficiently filled to a high density. Unlike
chaining, it cannot have more elements than table slots.
Cuckoo hashing
Another alternative open-addressing solution is cuckoo hashing, which ensures constant lookup time in the worst
case, and constant amortized time for insertions and deletions. It uses two or more hash functions, which means any
key/value pair could be in two or more locations. For lookup, the first hash function is used; if the key/value is not
found, then the second hash function is used, and so on. If a collision happens during insertion, then the key is
re-hashed with the second hash function to map it to another bucket. If all hash functions are used and there is still a
collision, then the key it collided with is removed to make space for the new key, and the old key is re-hashed with
one of the other hash functions, which maps it to another bucket. If that location also results in a collision, then the
process repeats until there is no collision or the process traverses all the buckets, at which point the table is resized.
By combining multiple hash functions with multiple cells per bucket, very high space utilisation can be achieved.
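The following is a simplified sketch of cuckoo hashing with two hash functions and one entry per slot. The two candidate positions are derived from a single call to Python's hash, which is an assumption made to keep the example short: three or more keys with identical hash values could never be placed, so a real implementation would use independent hash functions. The class name CuckooHashTable is illustrative.

```python
class CuckooHashTable:
    def __init__(self, num_slots=8):
        self.slots = [None] * num_slots          # each slot: None or (key, value)

    def _positions(self, key):
        n = len(self.slots)
        h = hash(key)
        return (h % n, (h // n) % n)             # two candidate slots per key

    def lookup(self, key):
        for pos in self._positions(key):         # at most two probes
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def insert(self, key, value):
        for pos in self._positions(key):         # overwrite if already present
            entry = self.slots[pos]
            if entry is not None and entry[0] == key:
                self.slots[pos] = (key, value)
                return
        for pos in self._positions(key):         # use a free candidate slot
            if self.slots[pos] is None:
                self.slots[pos] = (key, value)
                return
        entry = (key, value)
        pos = self._positions(key)[0]
        for _ in range(len(self.slots)):         # bounded number of evictions
            if self.slots[pos] is None:
                self.slots[pos] = entry
                return
            # Evict the occupant and send it to its other candidate slot.
            entry, self.slots[pos] = self.slots[pos], entry
            first, second = self._positions(entry[0])
            pos = second if pos == first else first
        self._resize(entry)                      # too many evictions: grow

    def _resize(self, pending):
        old = [e for e in self.slots if e is not None] + [pending]
        self.slots = [None] * (2 * len(self.slots))
        for k, v in old:
            self.insert(k, v)

t = CuckooHashTable(4)
for k in range(20):
    t.insert(k, k * k)
```

Every stored entry always sits in one of its two candidate slots, which is what keeps lookup at a constant number of probes in the worst case.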
2-choice hashing
2-choice hashing employs 2 different hash functions, h1(x) and h2(x), for the hash table. Both hash functions are used
to compute two table locations. When an object is inserted in the table, then it is placed in the table location that
contains fewer objects (with the default being the h1(x) table location if there is equality in bucket size). 2-choice
hashing employs the principle of the power of two choices.
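A short sketch of the placement rule follows, assuming buckets are Python lists and that distinct keys are inserted (no overwrite handling). The two positions derived from Python's hash are stand-ins for h1(x) and h2(x); the class name TwoChoiceTable is illustrative.

```python
class TwoChoiceTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _choices(self, key):
        n = len(self.buckets)
        h = hash(key)
        return h % n, (h // n) % n       # stand-ins for h1(x) and h2(x)

    def insert(self, key, value):
        i1, i2 = self._choices(key)
        # Place the entry in the emptier bucket; ties go to the h1 bucket.
        target = i2 if len(self.buckets[i2]) < len(self.buckets[i1]) else i1
        self.buckets[target].append((key, value))

    def lookup(self, key):
        for i in self._choices(key):     # the entry may be in either bucket
            for k, v in self.buckets[i]:
                if k == key:
                    return v
        raise KeyError(key)

t = TwoChoiceTable(4)
for key, value in [(0, "a"), (4, "b"), (8, "c")]:
    t.insert(key, value)
```

Keys 0, 4, and 8 share the same first choice (bucket 0), but 4 and 8 are diverted to their emptier second choices, so no bucket holds more than one entry.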
Hopscotch hashing
Another alternative open-addressing solution is hopscotch hashing, which combines the approaches of cuckoo
hashing and linear probing, yet seems in general to avoid their limitations. In particular it works well even when the
load factor grows beyond 0.9. The algorithm is well suited for implementing a resizable concurrent hash table.
The hopscotch hashing algorithm works by defining a neighborhood of buckets near the original hashed bucket,
where a given entry is always found. Thus, search is limited to the number of entries in this neighborhood, which is
logarithmic in the worst case, constant on average, and with proper alignment of the neighborhood typically requires
one cache miss. When inserting an entry, one first attempts to add it to a bucket in the neighborhood. However, if all
buckets in this neighborhood are occupied, the algorithm traverses buckets in sequence until an open slot (an
unoccupied bucket) is found (as in linear probing). At that point, since the empty bucket is outside the neighborhood,
items are repeatedly displaced in a sequence of hops. (This is similar to cuckoo hashing, but with the difference that
in this case the empty slot is being moved into the neighborhood, instead of items being moved out with the hope of
eventually finding an empty slot.) Each hop brings the open slot closer to the original neighborhood, without
invalidating the neighborhood property of any of the buckets along the way. In the end, the open slot has been
moved into the neighborhood, and the entry being inserted can be added to it.
Dynamic resizing
To keep the load factor under a certain limit, e.g. under 3/4, many table implementations expand the table when
items are inserted. For example, in Java's HashMap class the default load factor threshold for table expansion is
0.75.
Since buckets are usually implemented on top of a dynamic array and any constant proportion for resizing greater
than 1 will keep the load factor under the desired limit, the exact choice of the constant is determined by the same
space-time tradeoff as for dynamic arrays.
Resizing is accompanied by a full or incremental table rehash whereby existing items are mapped to new bucket
locations.
To limit the proportion of memory wasted due to empty buckets, some implementations also shrink the size of the
table (followed by a rehash) when items are deleted. From the point of view of space-time tradeoffs, this operation is
similar to the deallocation in dynamic arrays.
Incremental resizing
Some hash table implementations, notably in real-time systems, cannot pay the price of enlarging the hash table all at
once, because it may interrupt time-critical operations. If one cannot avoid dynamic resizing, a solution is to perform
the resizing gradually:
During the resize, allocate the new hash table, but keep the old table unchanged.
In each lookup or delete operation, check both tables.
Perform insertion operations only in the new table.
At each insertion also move r elements from the old table to the new table.
When all elements are removed from the old table, deallocate it.
To ensure that the old table is completely copied over before the new table itself needs to be enlarged, it is necessary
to increase the size of the table by a factor of at least (r + 1)/r during resizing.
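The steps above can be sketched as follows. To keep the focus on the migration protocol, plain Python dicts stand in for the old and new bucket arrays, and capacity is only a nominal size; the class name IncrementalTable, the growth factor of 2, and these stand-ins are all assumptions of the example.

```python
class IncrementalTable:
    """Gradual resizing; plain dicts stand in for the two bucket arrays."""

    def __init__(self, capacity=4, r=2):
        self.r = r                      # elements migrated per insertion
        self.capacity = capacity        # nominal size of the new table
        self.new = {}
        self.old = {}                   # empty unless a resize is in progress

    def _migrate(self):
        for _ in range(min(self.r, len(self.old))):
            key, value = self.old.popitem()
            self.new[key] = value

    def insert(self, key, value):
        if not self.old and len(self.new) >= self.capacity:
            # Start a resize: keep the old table, allocate a bigger new one.
            self.old, self.new = self.new, {}
            self.capacity *= 2          # factor 2 >= (r + 1)/r for any r >= 1
        self.old.pop(key, None)         # avoid a stale copy in the old table
        self.new[key] = value           # insertions go only to the new table
        self._migrate()                 # move r elements per insertion

    def lookup(self, key):
        if key in self.new:             # check both tables
            return self.new[key]
        return self.old[key]            # raises KeyError if absent from both

    def delete(self, key):
        if key in self.new:
            del self.new[key]
        else:
            del self.old[key]

t = IncrementalTable(capacity=4, r=1)
for k in range(5):
    t.insert(k, 2 * k)
```

After the fifth insertion a resize is in progress: three entries are still in the old table and two are in the new one, yet every key remains reachable through lookup.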
Monotonic keys
If it is known that key values will always increase (or decrease) monotonically, then a variation of consistent hashing
can be achieved by keeping a list of the single most recent key value at each hash table resize operation. Upon
lookup, keys that fall in the ranges defined by these list entries are directed to the appropriate hash function (and
indeed hash table), both of which can be different for each range. Since it is common to grow the overall number of
entries by doubling, there will only be O(lg(N)) ranges to check, and binary search time for the redirection would be
O(lg(lg(N))). As with consistent hashing, this approach guarantees that any key's hash, once issued, will never
change, even when the hash table is later grown.
Other solutions
Linear hashing is a hash table algorithm that permits incremental hash table expansion. It is implemented using a
single hash table, but with two possible look-up functions.
Another way to decrease the cost of table resizing is to choose a hash function in such a way that the hashes of most
values do not change when the table is resized. This approach, called consistent hashing, is prevalent in disk-based
and distributed hashes, where rehashing is prohibitively costly.
Performance analysis
In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible
choice of hash function, a table of size k with open addressing has no collisions and holds up to k elements, with a
single comparison for successful lookup, and a table of size k with chaining and n keys has the minimum max(0, n-k)
collisions and O(1 + n/k) comparisons for lookup. For the worst choice of hash function, every insertion causes a
collision, and hash tables degenerate to linear search, with Ω(n) amortized comparisons per insertion and up to n
comparisons for a successful lookup.
Adding rehashing to this model is straightforward. As in a dynamic array, geometric resizing by a factor of b implies
that only n/b^i keys are inserted i or more times, so that the total number of insertions is bounded above by bn/(b-1),
which is O(n). By using rehashing to maintain n < k, tables using both chaining and open addressing can have
unlimited elements and perform successful lookup in a single comparison for the best choice of hash function.
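The bound bn/(b-1) is the sum of a geometric series. As a brief worked step, summing the terms n/b^i over i = 0, 1, 2, ... (with the i = 0 term read here as the initial insertion of each of the n keys, an interpretation assumed for this sketch) gives:

```latex
\sum_{i \ge 0} \frac{n}{b^{i}}
  \;=\; n \cdot \frac{1}{1 - 1/b}
  \;=\; \frac{bn}{b-1}
```

For doubling (b = 2) the sum is 2n, and for b = 1.5 it is 3n; in either case the total is linear in n, so the amortized cost per key is constant.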
In more realistic models, the hash function is a random variable over a probability distribution of hash functions, and
performance is computed on average over the choice of hash function. When this distribution is uniform, the
assumption is called "simple uniform hashing" and it can be shown that hashing with chaining requires Θ(1 + n/k)
comparisons on average for an unsuccessful lookup, and hashing with open addressing requires Θ(1/(1 - n/k)).[4]
Both these bounds are constant, if we maintain n/k < c using table resizing, where c is a fixed constant less than 1.
Features
Advantages
The main advantage of hash tables over other table data structures is speed. This advantage is more apparent when
the number of entries is large. Hash tables are particularly efficient when the maximum number of entries can be
predicted in advance, so that the bucket array can be allocated once with the optimum size and never resized.
If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may
reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data
structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect (see below).
In this case the keys need not be stored in the table.
Drawbacks
Although operations on a hash table take constant time on average, the cost of a good hash function can be
significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree. Thus hash tables
are not effective when the number of entries is very small. (However, in some cases the high cost of computing the
hash function can be mitigated by saving the hash value together with the key.)
For certain string processing applications, such as spell-checking, hash tables may be less efficient than tries, finite
automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash
table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.
The entries stored in a hash table can be enumerated efficiently (at constant cost per entry), but only in some
pseudo-random order. Therefore, there is no efficient way to locate an entry whose key is nearest to a given key.
Listing all n entries in some specific order generally requires a separate sorting step, whose cost is proportional to
log(n) per entry. In comparison, ordered search trees have lookup and insertion cost proportional to log(n), but allow
finding the nearest key at about the same cost, and ordered enumeration of all entries at constant cost per entry.
If the keys are not stored (because the hash function is collision-free), there may be no easy way to enumerate the
keys that are present in the table at any given moment.
Although the average cost per operation is constant and fairly small, the cost of a single operation may be quite high.
In particular, if the hash table uses dynamic resizing, an insertion or deletion operation may occasionally take time
proportional to the number of entries. This may be a serious drawback in real-time or interactive applications.
Hash tables in general exhibit poor locality of reference; that is, the data to be accessed is distributed seemingly at
random in memory. Because hash tables cause access patterns that jump around, this can trigger microprocessor
cache misses that cause long delays. Compact data structures such as arrays searched with linear search may be
faster, if the table is relatively small and keys are integers or other short strings. According to Moore's Law, cache
sizes are growing exponentially and so what is considered "small" may be increasing. The optimal performance point
varies from system to system.
Hash tables become quite inefficient when there are many collisions. While extremely uneven hash distributions are
extremely unlikely to arise by chance, a malicious adversary with knowledge of the hash function may be able to
supply information to a hash table that creates worst-case behavior by causing excessive collisions, resulting in very poor
performance, e.g. a denial of service attack.[5] In critical applications, universal hashing can be used; alternatively, a
data structure with better worst-case guarantees may be preferable.[6]
Uses
Associative arrays
Hash tables are commonly used to implement many types of in-memory tables. They are used to implement
associative arrays (arrays whose indices are arbitrary strings or other complicated objects), especially in interpreted
programming languages like AWK, Perl, and PHP.
When storing a new item into a multimap and a hash collision occurs, the multimap unconditionally stores both
items.
When storing a new item into a typical associative array and a hash collision occurs, but the actual keys themselves
are different, the associative array likewise stores both items. However, if the key of the new item exactly matches
the key of an old item, the associative array typically erases the old item and overwrites it with the new item, so
every item in the table has a unique key.
Database indexing
Hash tables may also be used as disk-based data structures and database indices (such as in dbm) although B-trees
are more popular in these applications.
Caches
Hash tables can be used to implement caches, auxiliary data tables that are used to speed up the access to data that is
primarily stored in slower media. In this application, hash collisions can be handled by discarding one of the two
colliding entries, usually erasing the old item that is currently stored in the table and overwriting it with the new
item, so every item in the table has a unique hash value.
Sets
Besides recovering the entry that has a given key, many hash table implementations can also tell whether such an
entry exists or not.
Those structures can therefore be used to implement a set data structure, which merely records whether a given key
belongs to a specified set of keys. In this case, the structure can be simplified by eliminating all parts that have to do
with the entry values. Hashing can be used to implement both static and dynamic sets.
Object representation
Several dynamic languages, such as Perl, Python, JavaScript, and Ruby, use hash tables to implement objects. In this
representation, the keys are the names of the members and methods of the object, and the values are pointers to the
corresponding member or method.
Implementations
In programming languages
Many programming languages provide hash table functionality, either as built-in associative arrays or as standard
library modules. In C++11, for example, the unordered_map class provides hash tables for keys and values of
arbitrary type.
In PHP 5, the Zend 2 engine uses one of the hash functions from Daniel J. Bernstein to generate the hash values used
in managing the mappings of data pointers stored in a hash table. In the PHP source code, it is labelled as DJBX33A
(Daniel J. Bernstein, Times 33 with Addition).
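The "times 33 with addition" scheme named above can be sketched in a few lines. The starting value 5381 and the truncation to 32 bits are assumptions of this sketch (the word size in particular depends on the platform), and the function name djbx33a is illustrative rather than taken from the PHP source.

```python
def djbx33a(data: bytes) -> int:
    """Bernstein-style "times 33 with addition" hash, truncated to 32 bits."""
    h = 5381                                  # conventional starting value
    for byte in data:
        h = (h * 33 + byte) & 0xFFFFFFFF      # multiply by 33, add the byte
    return h

index = djbx33a(b"example") % 64              # reduce the hash to a bucket index
```

Multiplying by 33 can be computed as a shift by five bits plus one addition, which is part of why this family of hash functions is cheap to evaluate.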
Python's built-in hash table implementation, in the form of the dict type, as well as Perl's hash type (%) are highly
optimized as they are used internally to implement namespaces.
In the .NET Framework, support for hash tables is provided via the non-generic Hashtable and generic
Dictionary classes, which store key-value pairs, and the generic HashSet class, which stores only values.
Independent packages
SparseHash [7] (formerly Google SparseHash): an extremely memory-efficient hash_map implementation, with
only 2 bits/entry of overhead. The SparseHash library has several C++ hash map implementations with different
performance characteristics, including one that optimizes for memory use and another that optimizes for speed.
SunriseDD [8]: an open source C library for hash table storage of arbitrary data objects with lock-free lookups,
built-in reference counting and guaranteed order iteration. The library can participate in external reference
counting systems or use its own built-in reference counting. It comes with a variety of hash functions and allows
the use of runtime supplied hash functions via callback mechanism. Source code is well documented.
uthash [9]: an easy-to-use hash table for C structures.
History
The idea of hashing arose independently in different places. In January 1953, H. P. Luhn wrote an internal IBM
memorandum that used hashing with chaining. G. N. Amdahl, E. M. Boehme, N. Rochester, and Arthur Samuel
implemented a program using hashing at about the same time. Open addressing with linear probing (relatively prime
stepping) is credited to Amdahl, but Ershov (in Russia) had the same idea.
References
[1] Charles E. Leiserson, Amortized Algorithms, Table Doubling, Potential Method (http://videolectures.net/mit6046jf05_leiserson_lec13/).
Lecture 13, course MIT 6.046J/18.410J Introduction to Algorithms, Fall 2005.
[2] Thomas Wang (1997), Prime Double Hash Table (http://www.concentric.net/~Ttwang/tech/primehash.htm). Retrieved April 27, 2012.
[3] Erik Demaine, Jeff Lind. 6.897: Advanced Data Structures. MIT Computer Science and Artificial Intelligence Laboratory. Spring 2003.
http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L2/lecture2.pdf
[4] Doug Dunham. CS 4521 Lecture Notes (http://www.duluth.umn.edu/~ddunham/cs4521s09/notes/ch11.txt). University of Minnesota
Duluth. Theorems 11.2, 11.6. Last modified April 21, 2009.
[5] Alexander Klink and Julian Wälde, Efficient Denial of Service Attacks on Web Application Platforms
(http://events.ccc.de/congress/2011/Fahrplan/attachments/2007_28C3_Effective_DoS_on_web_application_platforms.pdf),
December 28, 2011, 28th Chaos Communication Congress. Berlin, Germany.
[6] Crosby and Wallach, Denial of Service via Algorithmic Complexity Attacks
(http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf).
[7] http://code.google.com/p/sparsehash/
[8] http://www.sunrisetel.net/software/devtools/sunrise-data-dictionary.shtml
[9] http://uthash.sourceforge.net/
Further reading
Tamassia, Roberto; Michael T. Goodrich (2006). "Chapter Nine: Maps and Dictionaries". Data structures and
algorithms in Java: [updated for Java 5.0] (4th ed.). Hoboken, N.J.: Wiley. pp. 369–418. ISBN 0-471-73884-0.
McKenzie, B. J.; R. Harries, T. Bell (Feb 1990). "Selecting a hashing algorithm". Software: Practice & Experience
20 (2): 209–224.
External links
A Hash Function for Hash Table Lookup (http://www.burtleburtle.net/bob/hash/doobs.html) by Bob Jenkins.
Hash Tables (http://www.sparknotes.com/cs/searching/hashtables/summary.html) by
SparkNotes, explanation using C
Hash functions (http://www.azillionmonkeys.com/qed/hash.html) by Paul Hsieh
Design of Compact and Efficient Hash Tables for Java (http://blog.griddynamics.com/2011/03/
ultimate-sets-and-maps-for-java-part-i.html) (link not working)
Libhashish (http://libhashish.sourceforge.net/) hash library
NIST entry on hash tables (http://www.nist.gov/dads/HTML/hashtab.html)
Open addressing hash table removal algorithm from ICI programming language, ici_set_unassign in set.c (http://
ici.cvs.sourceforge.net/ici/ici/set.c?view=markup) (and other occurrences, with permission).
A basic explanation of how the hash table works by Reliable Software (http://www.relisoft.com/book/lang/
pointer/8hash.html)
Lecture on Hash Tables (http://compgeom.cs.uiuc.edu/~jeffe/teaching/373/notes/06-hashing.pdf)
Hash-tables in C (http://task3.cc/308/hash-maps-with-linear-probing-and-separate-chaining/), two simple and
clear examples of hash tables implementation in C with linear probing and chaining
Open Data Structures - Chapter 5 - Hash Tables (http://opendatastructures.org/versions/edition-0.1e/ods-java/
5_Hash_Tables.html)
MIT's Introduction to Algorithms: Hashing 1 (http://ocw.mit.edu/courses/
electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/
video-lectures/lecture-7-hashing-hash-functions/) MIT OCW lecture Video
MIT's Introduction to Algorithms: Hashing 2 (http://ocw.mit.edu/courses/
electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/
video-lectures/lecture-8-universal-hashing-perfect-hashing/) MIT OCW lecture Video
How to sort a HashMap (Java) and keep the duplicate entries (http://www.lampos.net/sort-hashmap)
Linear probing
Linear probing is a scheme in computer programming for resolving hash collisions of values of hash functions by
sequentially searching the hash table for a free location. This is accomplished using two values - one as a starting
value and one as an interval between successive values in modular arithmetic. The second value, which is the same
for all keys and known as the stepsize, is repeatedly added to the starting value until a free space is found, or the
entire table is traversed. (In order to traverse the entire table the stepsize should be relatively prime to the arraysize,
which is why the array size is often chosen to be a prime number.)
newLocation = (startingValue + stepSize) % arraySize
This algorithm, which is used in open-addressed hash tables, provides good memory caching (if stepsize is equal to
one), through good locality of reference, but also results in clustering, an unfortunately high probability that where
there has been one collision there will be more. The performance of linear probing is also more sensitive to input
distribution when compared to double hashing, where the stepsize is determined by another hash function applied to
the value instead of a fixed stepsize as in linear probing.
Given an ordinary hash function H(x), a linear probing function H(x, i) would be:
    H(x, i) = (H(x) + i) mod n
Here H(x) is the starting value, n is the size of the hash table, and i is the probe number (i = 0, 1, 2, ...), so the
stepsize is 1 in this case.
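The probing rule above can be sketched as a small open-addressed table. This is an illustrative sketch, not a reference implementation: the class name LinearProbingTable, its methods, the stepsize of 1, and the use of Python's built-in hash() as H(x) are all assumptions made for the example, and deletion is deliberately left out.

```python
class LinearProbingTable:
    """Open-addressed table probing with H(x, i) = (H(x) + i) mod n."""

    def __init__(self, size=11):
        self.size = size
        self.slots = [None] * size  # each slot is None or a (key, value) pair

    def _probe(self, key):
        # Yield slot indices starting at H(x) and stepping by 1, wrapping around.
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def put(self, key, value):
        # Store the pair in the first free slot (or overwrite the same key).
        for index in self._probe(key):
            entry = self.slots[index]
            if entry is None or entry[0] == key:
                self.slots[index] = (key, value)
                return index
        raise RuntimeError("table is full")

    def get(self, key):
        # Follow the same probe sequence; an empty slot ends the search.
        for index in self._probe(key):
            entry = self.slots[index]
            if entry is None:
                break
            if entry[0] == key:
                return entry[1]
        raise KeyError(key)
```

With a table of size 11, the integer keys 5, 16 and 27 all start at slot 5, so they end up in the adjacent slots 5, 6 and 7, which is the clustering effect described above.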
References
External links
How Caching Affects Hashing (http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf) by
Gregory L. Heileman and Wenbin Luo 2005.
Open Data Structures - Section 5.2 - LinearHashTable: Linear Probing (http://opendatastructures.org/versions/
edition-0.1e/ods-java/5_2_LinearHashTable_Linear_.html)
Quadratic probing
Quadratic probing is an open addressing scheme in computer programming for resolving collisions in hash
tables, when an incoming data's hash value indicates it should be stored in an already-occupied slot or bucket.
Quadratic probing operates by taking the original hash index and adding successive values of an arbitrary quadratic
polynomial until an open slot is found.
For a given hash value H, the indices generated by linear probing are as follows:
    H + 1, H + 2, H + 3, H + 4, ..., H + k
This method results in primary clustering, and as the cluster grows larger, the search for those items hashing within
the cluster becomes less efficient.
An example sequence using quadratic probing is:
    H + 1^2, H + 2^2, H + 3^2, H + 4^2, ..., H + k^2
Quadratic probing can be a more efficient algorithm in a closed hash table, since it better avoids the clustering
problem that can occur with linear probing, although it is not immune. It also provides good memory caching
because it preserves some locality of reference; however, linear probing has greater locality and, thus, better cache
performance.
Quadratic probing is used in the Berkeley Fast File System to allocate free blocks. The allocation routine chooses a
new cylinder-group when the current is nearly full using quadratic probing, because of the speed it shows in finding
unused cylinder-groups.
Quadratic function
Let h(k) be a hash function that maps an element k to an integer in [0, m-1], where m is the size of the table. Let the
ith probe position for a value k be given by the function
    h(k, i) = (h(k) + c1·i + c2·i^2) mod m
where c2 ≠ 0. If c2 = 0, then h(k,i) degrades to a linear probe. For a given hash table, the values of c1 and c2 remain
constant.
Examples:
If h(k, i) = (h(k) + i + i^2) mod m, then the probe sequence will be h(k), h(k) + 2, h(k) + 6, ...
For m = 2^n, a good choice for the constants is c1 = c2 = 1/2, as the values of h(k,i) for i in [0, m-1] are all distinct.
This leads to a probe sequence of h(k), h(k) + 1, h(k) + 3, h(k) + 6, ... where the values increase by 1, 2, 3, ...
For prime m > 2, most choices of c1 and c2 will make h(k,i) distinct for i in [0, (m-1)/2]. Such choices include c1 =
c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1. Because there are only about m/2 distinct probes for a given element, it is
difficult to guarantee that insertions will succeed when the load factor is > 1/2.
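The two cases above can be checked numerically. The sketch below assumes two hypothetical helper functions: triangular_probes uses c1 = c2 = 1/2 (offsets 0, 1, 3, 6, ...), and square_probes uses c1 = 0, c2 = 1. The argument home stands for h(k).

```python
def triangular_probes(home, m):
    """Probe sequence (home + (i + i*i)/2) mod m, i.e. c1 = c2 = 1/2."""
    # i + i*i is always even, so integer division is exact.
    return [(home + (i + i * i) // 2) % m for i in range(m)]


def square_probes(home, m):
    """Probe sequence (home + i*i) mod m, i.e. c1 = 0 and c2 = 1."""
    return [(home + i * i) % m for i in range(m)]


# Power-of-two table: the triangular sequence visits every slot exactly once.
print(triangular_probes(3, 8))          # [3, 4, 6, 1, 5, 2, 0, 7]

# Prime table: only (m + 1) / 2 distinct slots are ever reached.
print(sorted(set(square_probes(3, 7))))  # [0, 3, 4, 5]
```

For m = 7 the second sequence reaches only 4 of the 7 slots, which is why insertions cannot be guaranteed once the load factor exceeds 1/2.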
Limitations
For linear probing it is a bad idea to let the hash table get nearly full, because performance is degraded as the hash
table gets filled. In the case of quadratic probing, the situation is even more drastic. With the exception of the
triangular number case for a power-of-two-sized hash table, there is no guarantee of finding an empty cell once the
table gets more than half full, or even before the table gets half full if the table size is not prime. This is because at
most half of the table can be used as alternative locations to resolve collisions. If the hash table size is b (a prime
greater than 3), it can be proven that the first b / 2 alternative locations including the initial location h(k) are all
distinct and unique. Suppose two of the alternative locations are given by h(k) + x^2 (mod b) and h(k) + y^2 (mod b),
where 0 ≤ x, y ≤ (b / 2). If these two locations point to the same key space but x ≠ y, then the following would
have to be true:
    h(k) + x^2 ≡ h(k) + y^2 (mod b)
    x^2 ≡ y^2 (mod b)
    x^2 - y^2 ≡ 0 (mod b)
    (x - y)(x + y) ≡ 0 (mod b)
As b (the table size) is a prime greater than 3, either (x - y) or (x + y) has to be divisible by b. Since x and y are
different and both lie between 0 and b / 2, (x - y) is nonzero and smaller than b in absolute value, so it is not
divisible by b. Likewise, since 0 ≤ x, y ≤ (b / 2) and x ≠ y, (x + y) is positive and smaller than b, so it is not
divisible by b either.
Thus, by contradiction, it can be said that the first (b / 2) alternative locations after h(k) are unique. So an empty key
space can always be found as long as at most (b / 2) locations are filled, i.e., the hash table is not more than half full.
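The limitation can be observed directly on a tiny table. The following sketch assumes a hypothetical helper insert_quadratic that stores integer keys, uses key mod b as h(k), and probes with offsets i^2; it is meant only to show an insertion failing while empty slots remain.

```python
def insert_quadratic(table, key):
    """Insert key using probes (h(k) + i*i) mod b; return the slot or None."""
    b = len(table)
    home = key % b
    for i in range(b):
        index = (home + i * i) % b
        if table[index] is None:
            table[index] = key
            return index
    return None  # every reachable slot is occupied


table = [None] * 7  # b = 7, a prime
# All five keys hash to slot 0, so they compete for the same probe sequence.
placed = [insert_quadratic(table, key) for key in (0, 7, 14, 21, 28)]
print(placed)             # [0, 1, 4, 2, None]
print(table.count(None))  # 3
```

The fifth insertion fails although three slots are still free: the table is more than half full, and the probe sequence for home slot 0 only ever reaches slots 0, 1, 2 and 4.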
References
External links
Tutorial/quadratic probing (http://research.cs.vt.edu/AVresearch/hashing/quadratic.php)
Double hashing
Double hashing is a computer programming technique used in hash tables to resolve hash collisions, cases when
two different values to be searched for produce the same hash key. It is a popular collision-resolution technique in
open-addressed hash tables. Double hashing is implemented in many popular libraries.
Like linear probing, it uses one hash value as a starting point and then repeatedly steps forward an interval until the
desired value is located, an empty location is reached, or the entire table has been searched; but this interval is
decided using a second, independent hash function (hence the name double hashing). Unlike linear probing and
quadratic probing, the interval depends on the data, so that even values mapping to the same location have different
bucket sequences; this minimizes repeated collisions and the effects of clustering.
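Given two randomly, uniformly, and independently selected hash functions h1 and h2, the ith location in the
bucket sequence for value k in a hash table T is:
    h(i, k) = (h1(k) + i·h2(k)) mod |T|
Generally, h1 and h2 are selected from a set of universal hash functions.
Let n be the number of elements stored in T; then T's load factor is α = n/|T|.
Double hashing approximates uniform open address hashing. That is, start by randomly, uniformly and
independently selecting two universal hash functions h1 and h2 to build a double hashing table T.
All elements are put in T by double hashing using h1 and h2. Given a key k, the (i+1)-st hash location is
computed by:
    h(i, k) = (h1(k) + i·h2(k)) mod |T|
Let T have a fixed load factor α with 0 < α < 1. Bradford and Katehakis[1] showed that the expected number of probes
for an unsuccessful search in T, still using these initially chosen hash functions, is 1/(1 - α) regardless of the
distribution of the inputs. More precisely, these two uniformly, randomly and independently chosen hash functions
are chosen from a set of universal hash functions where pair-wise independence suffices.
Previous results include: Guibas and Szemerédi[2] showed that 1/(1 - α) holds for unsuccessful searches at
sufficiently small load factors. Also, Lueker and Molodowitch[3] showed this held assuming ideal randomized
functions. Schmidt and Siegel[4] showed this with k-wise independent and uniform functions (for k = c log n and a
suitable constant c).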
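If the secondary hash function, written h2 here, evaluates to zero for some key k, the resulting sequence will always
remain at the initial hash value. One possible solution is to change the secondary hash function to:
    h2(k) = (k mod (|T| - 1)) + 1
where |T| is the table size. This ensures that the secondary hash function will always be nonzero.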
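As a minimal sketch of the resulting probe sequence, the hypothetical function below assumes integer keys, a primary hash h1(k) = k mod |T|, and a secondary hash h2(k) = (k mod (|T| - 1)) + 1 so that the step is never zero. These particular choices are assumptions for illustration; a production table would normally draw both functions from a universal family.

```python
def double_hash_probes(key, table_size):
    """Return the probe sequence (h1(k) + i * h2(k)) mod |T| for i = 0 .. |T| - 1."""
    h1 = key % table_size
    h2 = (key % (table_size - 1)) + 1  # always in [1, table_size - 1]
    return [(h1 + i * h2) % table_size for i in range(table_size)]


# Keys 22 and 33 share the same starting slot but follow different sequences.
print(double_hash_probes(22, 11)[:4])  # [0, 3, 6, 9]
print(double_hash_probes(33, 11)[:4])  # [0, 4, 8, 1]
```

Because the table size 11 is prime, every nonzero step is relatively prime to it, so each sequence visits all 11 slots before repeating.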
Notes
[1] P. G. Bradford and M. Katehakis: A Probabilistic Study on Combinatorial Expanders and Hashing, SIAM Journal on Computing 2007
(37:1), 83-111. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.91.2647
[2] L. Guibas and E. Szemerédi: The Analysis of Double Hashing, Journal of Computer and System Sciences, 1978, 16, 226-274.
[3] G. S. Lueker and M. Molodowitch: More Analysis of Double Hashing, Combinatorica, 1993, 13(1), 83-96.
[4] J. P. Schmidt and A. Siegel: Double Hashing is Computable and Randomizable with Universal Hash Functions, manuscript.
External links
How Caching Affects Hashing (http://www.siam.org/meetings/alenex05/papers/13gheileman.pdf) by
Gregory L. Heileman and Wenbin Luo 2005.
Hash Table Animation (http://www.cs.pitt.edu/~kirk/cs1501/animations/Hashing.html)
Cuckoo hashing
Cuckoo hashing is a scheme in computer programming for resolving hash
collisions of values of hash functions in a table, with worst-case constant
lookup time. The name derives from the behavior of some species of
cuckoo, where the cuckoo chick pushes the other eggs or young out of the
nest when it hatches; analogously, inserting a new key into a cuckoo hashing
table may push an older key to a different location in the table.
History
Cuckoo hashing was first described by Rasmus Pagh and Flemming Friche
Rodler in 2001.
Theory
The basic idea is to use two hash functions instead of only one. This
provides two possible locations in the hash table for each key. In one of the
commonly used variants of the algorithm, the hash table is split into two
smaller tables of equal size, and each hash function provides an index into
one of these two tables.
When a new key is inserted, a greedy algorithm is used: The new key is
inserted in one of its two possible locations, "kicking out", that is,
displacing, any key that might already reside in this location. This displaced
key is then inserted in its alternative location, again kicking out any key that
might reside there, until a vacant position is found, or the procedure enters
an infinite loop. In the latter case, the hash table is rebuilt in-place using
new hash functions:
There is no need to allocate new tables for the rehashing: We may
simply run through the tables to delete and perform the usual insertion
procedure on all keys found not to be at their intended position in the
table.
Pagh & Rodler, "Cuckoo Hashing"
Lookup requires inspection of just two locations in the hash table, which
takes constant time in the worst case (see Big O notation). This is in contrast to many other hash table algorithms,
which may not have a constant worst-case bound on the time to do a lookup.
It can also be shown that insertions succeed in expected constant time, even considering the possibility of having to
rebuild the table, as long as the number of keys is kept below half of the capacity of the hash table, i.e., the load
factor is below 50%. One method of proving this uses the theory of random graphs: one may form an undirected
graph called the "Cuckoo Graph" that has a vertex for each hash table location, and an edge for each hashed value,
with the endpoints of the edge being the two possible locations of the value. Then, the greedy insertion algorithm for
adding a set of values to a cuckoo hash table succeeds if and only if the Cuckoo Graph for this set of values is a
pseudoforest, a graph with at most one cycle in each of its connected components, as any vertex-induced subgraph
with more edges than vertices corresponds to a set of keys for which there are an insufficient number of slots in the
hash table. This property is true with high probability for a random graph in which the number of edges is less than
half the number of vertices.
Example
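The following hash functions are given:
    h(k) = k mod 11
    h'(k) = floor(k / 11) mod 11

    k      h(k)   h'(k)
    20     9      1
    50     6      4
    53     9      4
    75     9      6
    100    1      9
    67     1      6
    105    6      9
    3      3      0
    36     3      3
    39     6      3

Columns in the following two tables show the state of the hash tables over time as the elements are inserted. A dash
marks an empty slot.
1. table for h(k)
    key inserted    20   50   53   75  100   67  105    3   36   39
    slot 0           -    -    -    -    -    -    -    -    -    -
    slot 1           -    -    -    -  100   67   67   67   67  100
    slot 2           -    -    -    -    -    -    -    -    -    -
    slot 3           -    -    -    -    -    -    -    3   36   36
    slot 4           -    -    -    -    -    -    -    -    -    -
    slot 5           -    -    -    -    -    -    -    -    -    -
    slot 6           -   50   50   50   50   50  105  105  105   50
    slot 7           -    -    -    -    -    -    -    -    -    -
    slot 8           -    -    -    -    -    -    -    -    -    -
    slot 9          20   20   53   75   75   75   53   53   53   75
    slot 10          -    -    -    -    -    -    -    -    -    -
2. table for h'(k)
    key inserted    20   50   53   75  100   67  105    3   36   39
    slot 0           -    -    -    -    -    -    -    -    3    3
    slot 1           -    -   20   20   20   20   20   20   20   20
    slot 2           -    -    -    -    -    -    -    -    -    -
    slot 3           -    -    -    -    -    -    -    -    -   39
    slot 4           -    -    -   53   53   53   50   50   50   53
    slot 5           -    -    -    -    -    -    -    -    -    -
    slot 6           -    -    -    -    -    -   75   75   75   67
    slot 7           -    -    -    -    -    -    -    -    -    -
    slot 8           -    -    -    -    -    -    -    -    -    -
    slot 9           -    -    -    -    -  100  100  100  100  105
    slot 10          -    -    -    -    -    -    -    -    -    -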
Cycle
If you now wish to insert the element 6, then you get into a cycle. In the last row of the table we find the same initial
situation as at the beginning again.
    h(6) = 6 mod 11 = 6
    h'(6) = floor(6 / 11) mod 11 = 0

    considered key   table 1                  table 2
                     old value   new value    old value   new value
    6                50          6            53          50
    53               75          53           67          75
    67               100         67           105         100
    105              6           105          3           6
    3                36          3            39          36
    39               105         39           100         105
    100              67          100          75          67
    75               53          75           50          53
    50               39          50           36          39
    36               3           36           6           3
    6                50          6            53          50
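The example and the cycle can be replayed with a short sketch. It assumes two tables of 11 slots with the hash functions h(k) = k mod 11 and h'(k) = floor(k / 11) mod 11 for integer keys. The class CuckooTable, the max_kicks limit used to detect a probable cycle, and the choice to restore the tables on failure instead of rehashing are all assumptions of this sketch, not part of the original algorithm description.

```python
class CuckooTable:
    """Two-table cuckoo hashing for integer keys (illustrative sketch)."""

    def __init__(self, size=11, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks  # give up after this many rounds of displacement
        self.t1 = [None] * size
        self.t2 = [None] * size

    def h1(self, k):
        return k % self.size

    def h2(self, k):
        return (k // self.size) % self.size

    def lookup(self, k):
        # Only two locations ever need to be inspected.
        return self.t1[self.h1(k)] == k or self.t2[self.h2(k)] == k

    def insert(self, k):
        if self.lookup(k):
            return True
        saved = (self.t1[:], self.t2[:])
        for _ in range(self.max_kicks):
            i = self.h1(k)
            k, self.t1[i] = self.t1[i], k  # place k, pick up the old occupant
            if k is None:
                return True
            j = self.h2(k)
            k, self.t2[j] = self.t2[j], k  # displaced key goes to the second table
            if k is None:
                return True
        # Probable cycle: a real table would rebuild with new hash functions.
        self.t1, self.t2 = saved
        return False
```

Inserting 20, 50, 53, 75, 100, 67, 105, 3, 36 and 39 in that order succeeds, while a following insert(6) returns False because the displacement chain loops back to its starting configuration.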
References
A cool and practical alternative to traditional hash tables (http://www.ru.is/faculty/ulfar/CuckooHash.pdf),
U. Erlingsson, M. Manasse, F. Mcsherry, 2006.
Cuckoo Hashing for Undergraduates, 2006 (http://www.it-c.dk/people/pagh/papers/cuckoo-undergrad.pdf),
R. Pagh, 2006.
Cuckoo Hashing, Theory and Practice (http://mybiasedcoin.blogspot.com/2007/06/
cuckoo-hashing-theory-and-practice-part.html) (Part 1, Part 2 (http://mybiasedcoin.blogspot.com/2007/06/
cuckoo-hashing-theory-and-practice-part_15.html) and Part 3 (http://mybiasedcoin.blogspot.com/2007/06/
cuckoo-hashing-theory-and-practice-part_19.html)), Michael Mitzenmacher, 2007.
Naor, Moni; Segev, Gil; Wieder, Udi (2008). "History-Independent Cuckoo Hashing" (http://www.wisdom.
weizmann.ac.il/~naor/PAPERS/cuckoo_hi_abs.html). International Colloquium on Automata, Languages and
Programming (ICALP). Reykjavik, Iceland. Retrieved 2008-07-21.
External links
Hopscotch hashing
Hopscotch hashing is a scheme in computer programming for resolving hash collisions of values of hash functions
in a table using open addressing. It is also well suited for implementing a concurrent hash table. Hopscotch hashing
was introduced by Maurice Herlihy, Nir Shavit and Moran Tzafrir in 2008. The name is derived from the sequence
of hops that characterize the table's insertion algorithm.
The algorithm uses a single array of n buckets. For each bucket, its neighborhood is a small collection of nearby
consecutive buckets (i.e. ones with indexes close to the original hashed bucket). The desired property of the
neighborhood is that the cost of finding an item in the buckets of the neighborhood is close to the cost of finding it in
the bucket itself (for example, by having buckets in the neighborhood fall within the same cache line). The size of the
neighborhood must be sufficient to accommodate a logarithmic number of items in the worst case (i.e. it must
accommodate log(n) items), but only a constant number on average. If some bucket's neighborhood is filled, the
table is resized.
Hopscotch hashing. Here, H is 4. Gray entries are occupied. In part (a), the item x is
added with a hash value of 6. A linear probe finds that entry 13 is empty. Because 13 is
more than 4 entries away from 6, the algorithm looks for an earlier entry to swap with 13.
The first place to look in is H-1 = 3 entries before, at entry 10. That entry's hop
information bit-map indicates that d, the item at entry 11, can be displaced to 13. After
displacing d, Entry 11 is still too far from entry 6, so the algorithm examines entry 8. The
hop information bit-map indicates that item c at entry 9 can be moved to entry 11. Finally,
a is moved to entry 9. Part (b) shows the table state just after adding x.
The idea is that hopscotch hashing "moves the empty slot towards the desired bucket". This distinguishes it from
linear probing which leaves the empty slot where it was found, possibly far away from the original bucket, or from
cuckoo hashing that, in order to create a free bucket, moves an item out of one of the desired buckets in the target
arrays, and only then tries to find the displaced item a new place.
To remove an item from the table, one simply removes it from the table entry. If the neighborhood buckets are cache
aligned, then one could apply a reorganization operation in which items are moved into the now vacant location in
order to improve alignment.
One advantage of hopscotch hashing is that it provides good performance at very high table load factors, even ones
exceeding 0.9. Part of this efficiency is due to using a linear probe only to find an empty slot during insertion, not for
every lookup as in the original linear probing hash table algorithm. Another advantage is that one can use any hash
function, in particular simple ones that are close-to-universal.
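The insertion idea, moving the empty slot towards the desired bucket, can be sketched in a simplified form. The class HopscotchSet below is hypothetical and makes several assumptions: it stores bare keys, recomputes each item's home bucket with Python's hash() instead of keeping hop-information bit-maps, is single-threaded, and reports failure where a real table would resize.

```python
class HopscotchSet:
    """Simplified hopscotch hashing: every key stays within `hop` slots of its home."""

    def __init__(self, size=16, hop=4):
        self.size, self.hop = size, hop
        self.slots = [None] * size

    def _home(self, key):
        return hash(key) % self.size

    def _dist(self, start, index):
        # Forward distance from start to index, wrapping around the array.
        return (index - start) % self.size

    def contains(self, key):
        home = self._home(key)
        return any(self.slots[(home + d) % self.size] == key for d in range(self.hop))

    def add(self, key):
        if self.contains(key):
            return True
        home = self._home(key)
        # Linear probe, used only to locate the nearest empty slot.
        for d in range(self.size):
            if self.slots[(home + d) % self.size] is None:
                break
        else:
            return False  # table is full
        empty = (home + d) % self.size
        # Hop the empty slot backwards until it lies inside the neighborhood.
        while self._dist(home, empty) >= self.hop:
            for back in range(self.hop - 1, 0, -1):
                cand = (empty - back) % self.size
                item = self.slots[cand]
                # The item may move only if `empty` is still within its own neighborhood.
                if item is not None and self._dist(self._home(item), empty) < self.hop:
                    self.slots[empty], self.slots[cand] = item, None
                    empty = cand
                    break
            else:
                return False  # nothing can be moved: a real table would resize
        self.slots[empty] = key
        return True
```

For example, with 16 buckets and a neighborhood of 4, adding 0, 1, 2, 3 and then 16 (which also hashes to bucket 0) moves key 1 from slot 1 to slot 4 so that 16 can be stored in slot 1, inside its neighborhood.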
References
External links
libhash - a C hopscotch hashing implementation (https://code.google.com/p/libhhash/wiki/Intro)
Hash function
A hash function is any algorithm that maps data of
variable length to data of a fixed length. The values
returned by a hash function are called hash values, hash
codes, hash sums, checksums or simply hashes.
Description
Hash functions are primarily used to generate
fixed-length output data that acts as a shortened reference
to the original data. This is useful when the output data is
too cumbersome to use in its entirety.
A hash function that maps names to integers from 0 to 15. There is a collision between keys "John Smith" and "Sandra Dee".
One practical use is a data structure called a hash table, where the data is stored associatively. Searching for a
person's name in a list is slow, but the hashed value can be used to store a reference to the original data and
retrieve it in constant time (barring collisions). Another use is in
cryptography, the science of encoding and safeguarding data. It is easy to generate hash values from input data and
easy to verify that the data matches the hash, but hard to 'fake' a hash value to hide malicious data. This is the
principle behind the Pretty Good Privacy algorithm for data validation.
Hash functions are also used to accelerate table lookup or data comparison tasks such as finding items in a database,
detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.
A hash function should be referentially transparent (stable), i.e., if called twice on input that is "equal" (for example,
strings that consist of the same sequence of characters), it should give the same result. There is a construct in many
programming languages that allows the user to override equality and hash functions for an object: if two objects are
equal, their hash codes must be the same. This is crucial to finding an element in a hash table quickly, because two of
the same element would both hash to the same slot.
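In Python, the construct mentioned above is the pair of special methods __eq__ and __hash__. The class Name below is a hypothetical example showing the rule that equal objects must have equal hash codes.

```python
class Name:
    """A value object whose equality and hash are both based on its text."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return isinstance(other, Name) and self.text == other.text

    def __hash__(self):
        # Derived from the same field as __eq__, so equal objects hash equally.
        return hash(self.text)


phone_book = {Name("John Smith"): "521-1234"}
# A separately constructed but equal key finds the same entry.
print(phone_book[Name("John Smith")])
```

If __hash__ were based on something other than the compared field, the second lookup could land in a different slot and miss the entry.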
Hash functions are destructive, akin to lossy compression, as the original data is lost when hashed. Unlike
compression algorithms, where something resembling the original data can be decompressed from compressed data,
the goal of a hash value is to uniquely identify a reference to the object so that it can be retrieved in its entirety.
Unfortunately, all hash functions that map a larger set of data to a smaller set of data cause collisions. Such hash
functions try to map the keys to the hash values as evenly as possible because collisions become more frequent as
hash tables fill up. Thus, single-digit hash values are frequently restricted to 80% of the size of the table. Depending
on the algorithm used, other properties may be required as well, such as double hashing and linear probing. Although
the idea was conceived in the 1950s, the design of good hash functions is still a topic of active research.
Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization
functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent,
each has its own uses and requirements and is designed and optimized differently. The HashKeeper database
maintained by the American National Drug Intelligence Center, for instance, is more aptly described as a catalog of
file fingerprints than of hash values.
Hash tables
Hash functions are primarily used in hash tables, to quickly locate a data record (e.g., a dictionary definition) given
its search key (the headword). Specifically, the hash function is used to map the search key to an index; the index
gives the place in the hash table where the corresponding record should be stored. Hash tables, in turn, are used to
implement associative arrays and dynamic sets.
Typically, the domain of a hash function (the set of possible keys) is larger than its range (the number of different
table indexes), and so it will map several different keys to the same index. Therefore, each slot of a hash table is
associated with (implicitly or explicitly) a set of records, rather than a single record. For this reason, each slot of a
hash table is often called a bucket, and hash values are also called bucket indices.
Thus, the hash function only hints at the record's location: it tells where one should start looking for it. Still, in a
half-full table, a good hash function will typically narrow the search down to only one or two entries.
Caches
Hash functions are also used to build caches for large data sets stored in slow media. A cache is generally simpler
than a hashed search table, since any collision can be resolved by discarding or writing back the older of the two
colliding items. This is also used in file comparison.
Bloom filters
Hash functions are an essential ingredient of the Bloom filter, a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set.
Geometric hashing
This principle is widely used in computer graphics, computational geometry and many other disciplines, to solve
many proximity problems in the plane or in three-dimensional space, such as finding closest pairs in a set of points,
similar shapes in a list of shapes, similar images in an image database, and so on. In these applications, the set of all
inputs is some sort of metric space, and the hashing function can be interpreted as a partition of that space into a grid
of cells. The table is often an array with two or more indices (called a grid file, grid index, bucket grid, and similar
names), and the hash function returns an index tuple. This special case of hashing is known as geometric hashing or
the grid method. Geometric hashing is also used in telecommunications (usually under the name vector quantization)
to encode and compress multi-dimensional signals.
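A minimal sketch of the grid method for points in the plane follows. The function names cell_of, build_grid and neighbors, and the square cell size, are assumptions made for illustration; the hash function here simply returns the index tuple of the cell containing a point.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Hash a 2-D point to the index tuple of the grid cell containing it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def build_grid(points, cell_size):
    """Group points into buckets keyed by grid cell."""
    grid = defaultdict(list)
    for p in points:
        grid[cell_of(p, cell_size)].append(p)
    return grid


def neighbors(grid, point, cell_size):
    """Collect candidates from the point's cell and the eight surrounding cells."""
    cx, cy = cell_of(point, cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            found.extend(grid.get((cx + dx, cy + dy), []))
    return found
```

A proximity query then only has to compare the query point with the few candidates returned by neighbors, rather than with every point in the set.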
Properties
Good hash functions, in the original sense of the term, are usually required to satisfy certain properties listed below.
Note that different requirements apply to the other related concepts (cryptographic hash functions, checksums, etc.).
Determinism
A hash procedure must be deterministic, meaning that for a given input value it must always generate the same hash
value. In other words, it must be a function of the data to be hashed, in the mathematical sense of the term. This
requirement excludes hash functions that depend on external variable parameters, such as pseudo-random number
generators or the time of day. It also excludes functions that depend on the memory address of the object being
hashed, because that address may change during execution (as may happen on systems that use certain methods of
garbage collection), although sometimes rehashing of the item is possible.
Uniformity
A good hash function should map the expected inputs as evenly as possible over its output range. That is, every hash
value in the output range should be generated with roughly the same probability. The reason for this last requirement
is that the cost of hashing-based methods goes up sharply as the number of collisions (pairs of inputs that are
mapped to the same hash value) increases. Basically, if some hash values are more likely to occur than others, a
larger fraction of the lookup operations will have to search through a larger set of colliding table entries.
Note that this criterion only requires the value to be uniformly distributed, not random in any sense. A good
randomizing function is (barring computational efficiency concerns) generally a good choice as a hash function, but
the converse need not be true.
Hash tables often contain only a small subset of the valid inputs. For instance, a club membership list may contain
only a hundred or so member names, out of the very large set of all possible names. In these cases, the uniformity
criterion should hold for almost all typical subsets of entries that may be found in the table, not just for the global set
of all possible entries.
In other words, if a typical set of m records is hashed to n table slots, the probability of a bucket receiving many
more than m/n records should be vanishingly small. In particular, if m is less than n, very few buckets should have
more than one or two records. (In an ideal "perfect hash function", no bucket should have more than one record; but
a small number of collisions is virtually inevitable, even if n is much larger than m; see the birthday paradox).
When testing a hash function, the uniformity of the distribution of hash values can be evaluated by the chi-squared
test.
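The statistic behind such a test can be computed in a few lines. The function chi_squared below is a hypothetical helper that compares observed bucket counts with the uniform expectation m/n; interpreting the value against a chi-squared distribution with n - 1 degrees of freedom is left out of this sketch.

```python
def chi_squared(keys, n_buckets, hash_fn=hash):
    """Chi-squared statistic of bucket counts against a uniform distribution."""
    counts = [0] * n_buckets
    for key in keys:
        counts[hash_fn(key) % n_buckets] += 1
    expected = len(keys) / n_buckets  # m / n records per bucket
    return sum((c - expected) ** 2 / expected for c in counts)


keys = list(range(1000))
print(chi_squared(keys, 10, hash_fn=lambda k: k))  # 0.0: perfectly even
print(chi_squared(keys, 10, hash_fn=lambda k: 0))  # 9000.0: everything in one bucket
```

Values close to the number of buckets are what a uniform hash typically produces; much larger values indicate that some buckets are overloaded.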
Variable range
In many applications, the range of hash values may be different for each run of the program, or may change along
the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function
which takes two parameters: the input data z, and the number n of allowed hash values.
A common solution is to compute a fixed hash function with a very large range (say, 0 to 2^32 - 1), divide the result
by n, and use the division's remainder. If n is itself a power of 2, this can be done by bit masking and bit shifting.
When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution
between 0 and n - 1, for any value of n that may occur in the application. Depending on the function, the remainder
may be uniform only for certain values of n, e.g. odd or prime numbers.
We can allow the table size n to not be a power of 2 and still not have to perform any remainder or division
operation, as these computations are sometimes costly. For example, let n be significantly less than 2^b. Consider a
pseudorandom number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b - 1]. A hash function
uniform on the interval [0, n - 1] is n·P(key)/2^b. We can replace the division by a (possibly faster) right bit shift:
(n·P(key)) >> b.
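The three range reductions described above can be placed side by side. In this sketch b = 32, and wide_hash is a hypothetical stand-in for the fixed hash function with a large range (a multiplicative mix chosen only for illustration); the reduction functions are the part that mirrors the text.

```python
B = 32
MASK = (1 << B) - 1


def wide_hash(key):
    """Hypothetical fixed-range hash into [0, 2**32 - 1] for integer keys."""
    return (key * 2654435761) & MASK


def reduce_mod(h, n):
    """Remainder of division by n; works for any n."""
    return h % n


def reduce_mask(h, n):
    """Bit masking; valid only when n is a power of 2."""
    assert n & (n - 1) == 0, "n must be a power of 2"
    return h & (n - 1)


def reduce_shift(h, n):
    """Multiply-and-shift: (n * h) >> b maps [0, 2**b - 1] onto [0, n - 1]."""
    return (n * h) >> B


h = wide_hash(12345)
print(reduce_mod(h, 16) == reduce_mask(h, 16))  # True
print(0 <= reduce_shift(h, 10) < 10)            # True
```

The mask and the remainder agree whenever n is a power of 2, while the shift-based reduction uses the high bits of the hash and works for any n below 2^b.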
Several algorithms that preserve the uniformity property but require time proportional to n to compute the value of
H(z,n) have been invented.
Data normalization
In some applications, the input data may contain features that are irrelevant for comparison purposes. For example,
when looking up a personal name, it may be desirable to ignore the distinction between upper and lower case letters.
For such data, one must use a hash function that is compatible with the data equivalence criterion being used: that is,
any two inputs that are considered equivalent must yield the same hash value. This can be accomplished by
normalizing the input before hashing it, as by upper-casing all letters.
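A short sketch of normalization before hashing follows. The helpers normalized_key and name_hash are hypothetical, and upper-casing is used as the only normalization step, as in the example in the text.

```python
def normalized_key(name):
    """Normalize a personal name so that letter case is ignored."""
    return name.upper()


def name_hash(name, n_buckets):
    """Hash the normalized form, so equivalent names share a bucket."""
    return hash(normalized_key(name)) % n_buckets


# Both spellings are equivalent under the criterion and hash identically.
print(name_hash("John Smith", 16) == name_hash("JOHN SMITH", 16))  # True
```

The same normalization must also be applied when comparing keys, otherwise two equivalent names could share a bucket but still fail the equality check.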
Continuity
A hash function that is used to search for similar (as opposed to equivalent) data must be as continuous as possible;
two inputs that differ by a little should be mapped to equal or nearly equal hash values.
Note that continuity is usually considered a fatal flaw for checksums, cryptographic hash functions, and other related
concepts. Continuity is desirable for hash functions only in some applications, such as hash tables used in nearest
neighbor search.
Perfect hashing
A hash function that is injective (that is, maps each valid input to a different hash value) is said to be perfect.
With such a function one can directly locate the desired entry in a hash table, without any additional searching.
Rolling hash
In some applications, such as substring search, one must compute a hash function h for every k-character substring of
a given n-character string t, where k is a fixed integer and n is greater than k. The straightforward solution, which is
to extract every such substring s of t and compute h(s) separately, requires a number of operations proportional to k·n.
However, with the proper choice of h, one can use the technique of rolling hash to compute all those hashes with an
effort proportional to k + n.
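One possible choice of h is a polynomial hash, which can be updated in constant time when the window slides by one character. The sketch below is an illustration under assumptions: the function names, the base 256 and the modulus 1000003 are arbitrary choices, and k is assumed to be at least 1.

```python
def direct_hash(s, base=256, mod=1_000_003):
    """Polynomial hash of s computed from scratch (k operations)."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def rolling_hashes(t, k, base=256, mod=1_000_003):
    """Hashes of all k-character substrings of t in time proportional to k + n."""
    if k > len(t):
        return []
    h = direct_hash(t[:k], base, mod)
    hashes = [h]
    top = pow(base, k - 1, mod)  # weight of the character leaving the window
    for i in range(k, len(t)):
        # Remove the leading character, shift, and append the new character.
        h = ((h - ord(t[i - k]) * top) * base + ord(t[i])) % mod
        hashes.append(h)
    return hashes


text = "abracadabra"
hashes = rolling_hashes(text, 4)
print(len(hashes))              # 8
print(hashes[0] == hashes[7])   # True: both windows are "abra"
```

Each update costs a constant number of arithmetic operations, so the whole pass is proportional to k + n rather than k·n.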
Universal hashing
A universal hashing scheme is a randomized algorithm that selects a hashing function h among a family of such
functions, in such a way that the probability of a collision of any two distinct keys is 1/n, where n is the number of
distinct hash values desired, independently of the two keys. Universal hashing ensures (in a probabilistic sense) that
the hash function application will behave as well as if it were using a random function, for any distribution of the
input data. It will however have more collisions than perfect hashing, and may require more operations than a
special-purpose hash function.
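A commonly used family of this kind, given here as an assumption and not taken from the text above, is h(x) = ((a·x + b) mod p) mod n for integer keys below a prime p, with a and b drawn at random. For this family the collision probability of two distinct keys is at most 1/n. The function random_hash and the constant P are hypothetical names for the sketch.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1, assumed larger than every key


def random_hash(n, rng=random):
    """Pick h(x) = ((a*x + b) mod P) mod n at random from the family."""
    a = rng.randrange(1, P)  # a must be nonzero
    b = rng.randrange(0, P)

    def h(x):
        return ((a * x + b) % P) % n

    return h


rng = random.Random(2013)
h = random_hash(10, rng)
print(all(0 <= h(x) < 10 for x in range(1000)))  # True
```

The randomness lies in the choice of the function, not in its evaluation: once a and b are fixed, h is deterministic and can be used like any other hash function.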
radically for strings such as "Aaaaaaaaaa" and "Aaaaaaaaab".
Locality-sensitive hashing
Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional
data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high
probability (the number of buckets being much smaller than the universe of possible input items). This is different
from the conventional hash functions, such as those used in cryptography, as in this case the goal is to maximize the
probability of "collision" of similar items rather than to avoid collisions.
One example of LSH is the MinHash algorithm used for finding similar documents (such as web-pages):
Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define hmin(S) to be
the member x of S with the minimum value of h(x). Then hmin(A) = hmin(B) exactly when the minimum hash value of
the union A ∪ B lies in the intersection A ∩ B. Therefore,
Pr[hmin(A) = hmin(B)] = J(A,B), where J is the Jaccard index.
In other words, if r is a random variable that is one when hmin(A) = hmin(B) and zero otherwise, then r is an unbiased
estimator of J(A,B), although it has too high a variance to be useful on its own. The idea of the MinHash scheme is to
reduce the variance by averaging together several variables constructed in the same way.
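The averaging step can be sketched as follows. The seeded hash h built from hashlib.blake2b, the function names, and the default of 400 hash functions are assumptions made for illustration; any family of hash functions that behaves like independent random functions would serve.

```python
import hashlib


def h(x, seed):
    """Seeded 64-bit hash of x, standing in for one random hash function."""
    data = f"{seed}:{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def h_min(s, seed):
    """The member of s with the minimum hash value under the given seed."""
    return min(s, key=lambda x: h(x, seed))


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)


def minhash_estimate(a, b, num_hashes=400):
    """Average the indicator [h_min(A) == h_min(B)] over many hash functions."""
    agree = sum(h_min(a, seed) == h_min(b, seed) for seed in range(num_hashes))
    return agree / num_hashes


A = set(range(60))
B = set(range(30, 90))
print(jaccard(A, B))           # exact value, 1/3
print(minhash_estimate(A, B))  # an estimate that should be close to 1/3
```

Each individual comparison is the zero-or-one variable r from the text; averaging 400 of them reduces the standard deviation of the estimate to roughly 0.02 for these sets.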
References
[1] "Robust Audio Hashing for Content Identification by Jaap Haitsma, Ton Kalker and Job Oostveen" (http://citeseer.ist.psu.edu/rd/
11787382,504088,1,0.25,Download/http://citeseer.ist.psu.edu/cache/papers/cs/25861/http:zSzzSzwww.extra.research.philips.
comzSznatlabzSzdownloadzSzaudiofpzSzcbmi01audiohashv1.0.pdf/haitsma01robust.pdf)
[2] Bret Mulvey, Evaluation of CRC32 for Hash Tables (http://home.comcast.net/~bretm/hash/8.html), in Hash Functions
(http://home.comcast.net/~bretm/hash/). Accessed April 10, 2009.
[3] Bret Mulvey, Evaluation of SHA-1 for Hash Tables (http://home.comcast.net/~bretm/hash/9.html), in Hash Functions
(http://home.comcast.net/~bretm/hash/). Accessed April 10, 2009.
[4] Performance in Practice of String Hashing Functions (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.7520)
External links
General purpose hash function algorithms (C/C++/Pascal/Java/Python/Ruby) (http://www.partow.net/
programming/hashfunctions/index.html)
Hash Functions and Block Ciphers by Bob Jenkins (http://burtleburtle.net/bob/hash/index.html)
The Goulburn Hashing Function (http://www.webcitation.org/query?url=http://www.geocities.com/
drone115b/Goulburn06.pdf&date=2009-10-25+21:06:51) (PDF) by Mayur Patel
Hash Function Construction for Textual and Geometrical Data Retrieval (http://herakles.zcu.cz/~skala/PUBL/
PUBL_2010/2010_WSEAS-Corfu_Hash-final.pdf) Latest Trends on Computers, Vol. 2, pp. 483–489, CSCC
conference, Corfu, 2010
References
[1] Fredman, M. L., Komlós, J., and Szemerédi, E. 1984. Storing a Sparse Table with O(1) Worst Case Access Time. J. ACM 31, 3 (Jun. 1984),
538-544 http://portal.acm.org/citation.cfm?id=1884#
Further reading
Richard J. Cichelli. Minimal Perfect Hash Functions Made Simple, Communications of the ACM, Vol. 23,
Number 1, January 1980.
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 11.5: Perfect hashing,
pp. 245–249.
Fabiano C. Botelho, Rasmus Pagh and Nivio Ziviani. "Perfect Hashing for Data Management Applications"
(http://arxiv.org/pdf/cs/0702159).
Fabiano C. Botelho and Nivio Ziviani. "External perfect hashing for very large key sets" (http://homepages.dcc.
ufmg.br/~nivio/papers/cikm07.pdf). 16th ACM Conference on Information and Knowledge Management
(CIKM07), Lisbon, Portugal, November 2007.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Monotone minimal perfect hashing:
Searching a sorted table with O(1) accesses" (http://vigna.dsi.unimi.it/ftp/papers/
MonotoneMinimalPerfectHashing.pdf). In Proceedings of the 20th Annual ACM-SIAM Symposium On Discrete
Mathematics (SODA), New York, 2009. ACM Press.
Djamal Belazzougui, Paolo Boldi, Rasmus Pagh, and Sebastiano Vigna. "Theory and practise of monotone
minimal perfect hashing" (http://www.siam.org/proceedings/alenex/2009/alx09_013_belazzouguid.pdf). In
Proceedings of the Tenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 2009.
Douglas C. Schmidt, GPERF: A Perfect Hash Function Generator (http://www.cs.wustl.edu/~schmidt/PDF/
gperf.pdf), C++ Report, SIGS, Vol. 10, No. 10, November/December, 1998.
External links
Universal hashing
Using universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random
from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low
number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known
(for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous
uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.
Introduction
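Assume we want to map keys from some universe U into m bins (labelled [m] = {0, ..., m − 1}). The
algorithm will have to handle some data set S ⊆ U of |S| = n keys, which is not known in advance. Usually, the
goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic
hash function cannot offer any guarantee in an adversarial setting if the size of U is greater than m·n, since the
adversary may choose S to be precisely the preimage of a bin. This means that all data keys land in the same bin, making
hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data
turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash
function.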
The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions
H = {h : U → [m]} is called a universal family if, for all x, y ∈ U with x ≠ y,
Pr[h(x) = h(y)] ≤ 1/m, where the probability is over h drawn randomly from H.
In other words, any two keys of the universe collide with probability at most 1/m when the hash function h is
drawn randomly from H. This is exactly the probability of collision we would expect if the hash function assigned
truly random hash codes to every key. Sometimes, the definition is relaxed to allow collision probability O(1/m).
This concept was introduced by Carter and Wegman in 1977, and has found numerous applications in computer
science. If we have an upper bound of ε < 1 on the collision probability, we say that we have
ε-almost universality.
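Many, but not all, universal families have the following stronger uniform difference property:
for all x, y ∈ U with x ≠ y, when h is drawn randomly from the family H, the difference h(x) − h(y) mod m
is uniformly distributed in [m].
Note that the definition of universality is only concerned with whether h(x) − h(y) = 0, which counts collisions. The
uniform difference property is stronger.
(Similarly, a universal family can be XOR universal if, for all x, y ∈ U with x ≠ y, the value h(x) ⊕ h(y) mod m is
uniformly distributed in [m], where ⊕ is the bitwise exclusive or operation. This is only possible if m is a power
of two.)
An even stronger condition is pairwise independence: we have this property when, for all x, y ∈ U with x ≠ y, the
probability that x and y hash to any pair of hash values z1, z2 is as if they were perfectly random:
Pr[h(x) = z1 and h(y) = z2] = 1/m². Pairwise independence is sometimes called strong universality.
For some applications (such as hash tables), it is important that the least significant bits of the hash values are
also universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with
m = 2^L, then the family made of the functions h mod 2^L' for all h ∈ H is also strongly universal
for all L' ≤ L. Unfortunately, the same is not true of (merely) universal families. For example the family made of the
identity function h(x) = x is clearly universal, but the family made of the function h(x) = x mod 2^L' fails to
be universal.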
Mathematical guarantees
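For any fixed set S of n keys, using a universal family guarantees the following properties.
1. For any fixed x in S, the expected number of keys in the bin h(x) is about n/m (more precisely, at most
1 + (n − 1)/m). When implementing hash tables
by chaining, this number is proportional to the expected running time of an operation involving the key x (for
example a query, insertion or deletion).
2. The expected number of pairs of keys x, y in S with x ≠ y that collide (h(x) = h(y)) is bounded above
by n(n − 1)/(2m), which is of order O(n²/m). When the number of bins, m, is O(n), the expected
number of collisions is O(n). When hashing into n² bins, there are no collisions at all with probability at least
a half.
3. The expected number of keys in bins with at least t keys in them is bounded above by
2n/(t − 2(n/m) + 1). Thus, if the capacity of each bin is capped to three times the average size (
t = 3n/m), the total number of keys in overflowing bins is at most O(m). This only holds with a hash
family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by
O(1/m), this result is no longer true.
As the above guarantees hold for any fixed set S, they hold if the data set is chosen by an adversary. However, the
adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the
adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the
same as deterministic hashing.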
The second and third guarantees are typically used in conjunction with rehashing. For instance, a randomized
algorithm may be prepared to handle some O(n) number of collisions. If it observes too many collisions, it chooses
another random h from the family and repeats. Universality guarantees that the number of repetitions is a geometric
random variable.
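The rehashing loop can be sketched as follows for the special case where no collisions are tolerated and the table has n² bins. This is an illustrative sketch under stated assumptions: the family is the prime-modulus family ((a·x + b) mod p) mod m with p = 2^61 − 1, keys are assumed to be distinct non-negative integers below p, and the names `random_hash` and `collision_free_hash` are hypothetical.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; assumed larger than every key


def random_hash(m, rng):
    """Draw one function from the family ((a*x + b) mod P) mod m."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % m


def collision_free_hash(keys, rng=None):
    """Redraw the hash function until the distinct keys occupy distinct bins."""
    rng = rng or random.Random(0)
    m = max(1, len(keys) ** 2)  # n^2 bins: each draw succeeds with probability >= 1/2
    attempts = 0
    while True:
        attempts += 1
        h = random_hash(m, rng)
        if len({h(x) for x in keys}) == len(keys):
            return h, m, attempts
```

Because each draw succeeds with probability at least one half, the number of attempts is dominated by a geometric random variable with mean at most 2.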
Constructions
Since any computer data can be represented as one or more machine words, one generally needs hash functions for
three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length
vectors ("strings").
Hashing integers
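This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication,
addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be U = {0, ..., u − 1}.
The original proposal of Carter and Wegman was to pick a prime p ≥ u and define
h_{a,b}(x) = ((a·x + b) mod p) mod m
where a, b are randomly chosen integers modulo p with a ≠ 0. Technically, adding b is not needed for universality
(but it does make the hash function 2-independent).
To see that H = {h_{a,b}} is a universal family, fix two keys x ≠ y and write r = (a·x + b) mod p and
s = (a·y + b) mod p. A collision h(x) = h(y) means that r ≡ s (mod m). Since r − s ≡ a·(x − y) (mod p), and
x − y is nonzero and a ≠ 0, we have r ≠ s. Solving for a and b shows that every pair (r, s) with r ≠ s arises
from exactly one choice of (a, b), so (r, s) is uniformly distributed over the p(p − 1) such pairs. For each r there
are at most ⌊(p − 1)/m⌋ possible values of s ≠ r with s ≡ r (mod m). Thus the collision probability is at most
⌊(p − 1)/m⌋ / (p − 1),
which is never more than 1/m and tends to 1/m for large p.
Another way to view the family is via the notion of statistical distance. Since x − y is nonzero and a is uniformly
distributed in {1, ..., p − 1}, it follows that a·(x − y) modulo p is also uniformly distributed in {1, ..., p − 1}.
The distribution of (h(x) − h(y)) mod m is thus almost uniform, up to a difference in probability of ±1/p between
the samples. As a result, the statistical distance to a uniform family is O(m/p), which becomes negligible when p is
much larger than m.
Modular arithmetic can be avoided with the multiply-shift scheme. The scheme assumes the number of bins is a power
of two, m = 2^M. Let w be the number of bits in a machine word. The hash functions are parametrised over odd positive
integers a < 2^w:
h_a(x) = (a·x mod 2^w) div 2^(w − M).
This scheme is only 2/m-almost-universal: for any x ≠ y, Pr[h_a(x) = h_a(y)] ≤ 2/m.
To understand the behavior of the hash function, notice that, if a·x mod 2^w and a·y mod 2^w have the same
highest-order M bits, then a·(x − y) mod 2^w has either all 1's or all 0's as its highest order M bits (depending on
whether a·x mod 2^w or a·y mod 2^w is larger). Assume that the least significant set bit of x − y appears on
position w − c. Since a is a random odd integer and odd integers have inverses in the ring of integers modulo 2^w,
it follows that a·(x − y) mod 2^w is uniformly distributed among w-bit integers with the least significant set bit on
position w − c. If c > M, the highest order M bits are uniformly random, and the probability that these bits are all
0's or all 1's is therefore at most 2/2^M = 2/m. On the other hand, if c < M, then the highest order M bits contain
both 0's and 1's, so it is certain that h(x) ≠ h(y). Finally, if c = M then bit w − M is 1 and h_a(x) = h_a(y) if and
only if bits w − 1, ..., w − M + 1 are also 1, which happens with probability 1/2^(M − 1) = 2/m.
This analysis is tight, as can be shown with the example x = 2^(w − M − 2) and y = 3x. To obtain a truly 'universal'
hash function, one can use the multiply-add-shift scheme
h_{a,b}(x) = ((a·x + b) mod 2^w) div 2^(w − M)
which can be implemented in C-like programming languages by
(unsigned) (a*x+b) >> (w-M)
where a is a random odd positive integer with a < 2^w and b is a random non-negative integer with b < 2^(w − M).
With these choices of a and b, Pr[h_{a,b}(x) = h_{a,b}(y)] ≤ 1/m for all x ≢ y (mod 2^w).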
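The two integer schemes discussed in this section, the prime-modulus family of Carter and Wegman and the multiply-shift family, can be sketched as plain functions. The function names and the parameter-drawing helpers are hypothetical; Python integers are unbounded, so the reduction modulo 2^w is written explicitly as a bit mask rather than relying on machine-word overflow.

```python
import random


def cw_hash(a, b, p, m, x):
    """Carter-Wegman style hash: ((a*x + b) mod p) mod m, with p prime and a != 0."""
    return ((a * x + b) % p) % m


def multiply_shift_hash(a, w, M, x):
    """Multiply-shift hash: keep the top M bits of (a*x mod 2^w), with a odd."""
    return ((a * x) & ((1 << w) - 1)) >> (w - M)


def random_cw_params(p, rng):
    """Draw (a, b) with 1 <= a < p and 0 <= b < p."""
    return rng.randrange(1, p), rng.randrange(p)


def random_odd_multiplier(w, rng):
    """Draw a random odd integer below 2^w."""
    return rng.randrange(1 << w) | 1
```

For small parameters the collision bounds can be checked exhaustively by enumerating every function in the family, which is a convenient sanity test for an implementation.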
Hashing vectors
This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector
x̄ = (x_0, ..., x_{k−1}) of k machine words (integers of w bits each). If H is a universal family with the uniform
difference property, the following family (dating back to Carter and Wegman) also has the uniform difference
property (and hence is universal):
h(x̄) = (h_0(x_0) + h_1(x_1) + ... + h_{k−1}(x_{k−1})) mod m, where each h_i ∈ H is chosen independently at random.
If m is a power of two, one may replace summation by exclusive or.
In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift hash family.
Initialize the hash function with a vector ā = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. Then if
the number of bins is m = 2^M for M ≤ w:
h_ā(x̄) = ((x_0·a_0 + x_1·a_1 + ... + x_{k−1}·a_{k−1}) mod 2^(2w)) div 2^(2w − M).
It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.
Initialize the hash function with a vector ā = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. The
following hash family is universal:[2]
h_ā(x̄) = ((sum over i = 0, ..., ⌈k/2⌉ − 1 of (x_{2i} + a_{2i})·(x_{2i+1} + a_{2i+1})) mod 2^(2w)) div 2^(2w − M).
If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit
integers). The algorithm will then use ⌈k/2⌉ multiplications, where k was the number of half-words in the vector.
Thus, the algorithm runs at a "rate" of one multiplication per word of input.
The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant,
the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based
universal hashing schemes.
Strong universality at high speed is also possible. Initialize the hash function with a vector ā = (a_0, ..., a_k) of
random integers on 2w bits. Compute
h_ā(x̄) = ((a_0 + a_1·x_0 + a_2·x_1 + ... + a_k·x_{k−1}) mod 2^(2w)) div 2^w.
The result is strongly universal on w bits. Experimentally, it was found to run at 0.2 CPU cycle per byte on recent
processors.
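A minimal sketch of the multiply-shift instantiation for fixed-length vectors is given below. It assumes words of w bits, multipliers of 2w bits and m = 2^M bins with M ≤ w; the name `make_vector_hash` is hypothetical, and the 2w-bit "double-precision" arithmetic is emulated with Python integers and an explicit mask.

```python
import random


def make_vector_hash(k, w, M, rng=None):
    """Return h(xs) = ((sum_i xs[i] * a[i]) mod 2^(2w)) div 2^(2w - M)."""
    rng = rng or random.Random()
    # One random odd 2w-bit multiplier per vector position.
    a = [rng.randrange(1 << (2 * w)) | 1 for _ in range(k)]
    mask = (1 << (2 * w)) - 1

    def h(xs):
        if len(xs) != k:
            raise ValueError("expected a vector of exactly k words")
        total = sum(x * ai for x, ai in zip(xs, a)) & mask
        return total >> (2 * w - M)

    return h
```

For example, `make_vector_hash(4, 32, 10)` hashes vectors of four 32-bit words into 1024 bins.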
Hashing strings
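This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small
number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper
bound). The space required is the maximal length of the string, but the time to evaluate h(s) is just the length of
s. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function
without affecting universality. Note that if zeroes are allowed in the string, then it might be best to append a
fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.
Now assume we want to hash x̄ = (x_0, ..., x_ℓ), where a good bound on ℓ is not known a priori. A universal family
treats the string x̄ as the coefficients of a polynomial modulo a large prime. If x_i ∈ [u], let p ≥ max{u, m} be a
prime and define:
h_a(x̄) = h_int((x_0 + x_1·a + x_2·a² + ... + x_ℓ·a^ℓ) mod p), where a ∈ [p] is chosen uniformly at random and h_int
is chosen randomly from a universal family mapping the integer domain [p] to [m].
Consider two strings x̄, ȳ and let ℓ be the length of the longer one; for the analysis, the shorter string is
conceptually padded with zeros up to length ℓ. A collision before applying h_int implies that a is a root of the
polynomial with coefficients x̄ − ȳ. This polynomial has at most ℓ roots modulo p, so the collision probability is at
most ℓ/p. The probability of collision through the random h_int brings the total collision probability to
1/m + ℓ/p. Thus, if the prime p is sufficiently large compared to the length of strings hashed, the family is very
close to universal (in statistical distance).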
To mitigate the computational penalty of modular arithmetic, two tricks are used in practice:
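1. One chooses the prime p to be close to a power of two, such as a Mersenne prime. This allows arithmetic
modulo p to be implemented without division (using faster operations like addition and shifts). For instance,
p = 2^61 − 1 is a Mersenne prime that fits in a 64-bit word.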
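The polynomial scheme and the Mersenne-prime trick can be sketched together. The sketch assumes p = 2^61 − 1 and characters below 2^32; `mod_p61`, `poly_eval` and `make_string_hash` are hypothetical names, and the outer function h_int is taken to be a prime-modulus hash ((c·v + d) mod p) mod m chosen for illustration.

```python
import random

P61 = (1 << 61) - 1  # Mersenne prime 2^61 - 1


def mod_p61(z):
    """Reduce 0 <= z < 2^122 modulo 2^61 - 1 using only shifts, masks and adds."""
    z = (z & P61) + (z >> 61)   # uses 2^61 ≡ 1 (mod P61)
    z = (z & P61) + (z >> 61)   # second fold brings z into [0, P61]
    return z - P61 if z >= P61 else z


def poly_eval(xs, a):
    """Evaluate (x_0 + x_1*a + ... + x_l*a^l) mod P61 by Horner's rule."""
    acc = 0
    for x in reversed(xs):
        acc = mod_p61(acc * a + x)  # acc, a < 2^61 and x < 2^32 keep this below 2^122
    return acc


def make_string_hash(m, rng=None):
    """Return h(xs) = h_int(poly_eval(xs, a)) with randomly drawn a, c, d."""
    rng = rng or random.Random()
    a = rng.randrange(P61)
    c, d = rng.randrange(1, P61), rng.randrange(P61)
    return lambda xs: ((c * poly_eval(xs, a) + d) % P61) % m
```

As noted in the text, strings that differ only by trailing zero characters evaluate to the same polynomial, so callers that allow zeros may want to append a non-zero terminator first.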
References
[1] , section 5.3
[2] , Equation 1
Further reading
Knuth, Donald Ervin (1998). The Art of Computer Programming, Vol. III: Sorting and Searching (2nd ed.).
Reading, Mass.; London: Addison-Wesley. ISBN 0-201-89685-0.
External links
Open Data Structures - Section 5.1.1 - Multiplicative Hashing (http://opendatastructures.org/versions/
edition-0.1e/ods-java/5_1_ChainedHashTable_Hashin.html#SECTION00811000000000000000)
K-independent hashing
A family of hash functions is said to be k-independent or k-universal if selecting a hash function at random
from the family guarantees that the hash codes of any designated k keys are independent random variables (see
precise mathematical definitions below). Such families allow good average case performance in randomized
algorithms or data structures, even if the input data is chosen by an adversary. The trade-offs between the degree of
independence and the efficiency of evaluating the hash function are well studied, and many k-independent families
have been proposed.
Introduction
The goal of hashing is usually to map keys from some large domain (universe) U into a smaller range, such as m
bins (labelled [m] = {0, ..., m − 1}). In the analysis of randomized algorithms and data structures, it is often
desirable for the hash codes of various keys to "behave randomly". For instance, if the hash code of each key were an
independent random choice in [m], the number of keys per bin could be analyzed using the Chernoff bound. A
deterministic hash function cannot offer any such guarantee in an adversarial setting, as the adversary may choose
the keys to be precisely the preimage of a bin. Furthermore, a deterministic hash function does not allow for
rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so
one would like to change the hash function.
The solution to these problems is to pick a function randomly from a large family of hash functions. The randomness
in choosing the hash function can be used to guarantee some desired random behavior of the hash codes of any keys
of interest. The first definition along these lines was universal hashing, which guarantees a low collision probability
for any two designated keys. The concept of k-independent hashing, introduced by Wegman and Carter in 1981,
strengthens the guarantees of random behavior to families of k designated keys, and adds a guarantee on the
uniform distribution of hash codes.
Mathematical Definitions
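The strictest definition, introduced by Wegman and Carter under the name "strongly universal hash family", is the
following. A family of hash functions H = {h : U → [m]} is k-independent if for any k distinct keys
(x_1, ..., x_k) ∈ U^k and any k hash codes (not necessarily distinct) (y_1, ..., y_k) ∈ [m]^k, we have:
Pr[h(x_1) = y_1 and ... and h(x_k) = y_k] = m^(−k), where the probability is over h drawn randomly from H.
This definition is equivalent to the following two conditions:
1. for any fixed x ∈ U, as h is drawn randomly from H, h(x) is uniformly distributed in [m].
2. for any fixed, distinct keys x_1, ..., x_k ∈ U, as h is drawn randomly from H, h(x_1), ..., h(x_k) are
independent random variables.
Often it is inconvenient to achieve the perfect joint probability of m^(−k) due to rounding issues. One may then
define a (μ, k)-independent family to satisfy, for all distinct (x_1, ..., x_k) ∈ U^k and all (y_1, ..., y_k) ∈ [m]^k,
Pr[h(x_1) = y_1 and ... and h(x_k) = y_k] ≤ μ/m^k.
Observe that, even if μ is close to 1, the h(x_i) are no longer independent random variables, which is often a problem
in the analysis of randomized algorithms. Therefore, a more common alternative to dealing with rounding issues is to
prove that the hash family is close in statistical distance to a k-independent family, which allows black-box use of
the independence properties.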
References
Further reading
Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p.221.
ISBN0-521-47465-5.
Tabulation hashing
In computer science, tabulation hashing is a method for constructing universal families of hash functions by
combining table lookup with exclusive or operations. It is simple and fast enough to be usable in practice, and has
theoretical properties that (in contrast to some other universal hashing methods) make it usable with linear probing,
cuckoo hashing, and the MinHash technique for estimating the size of set intersections. The first instance of
tabulation hashing is Zobrist hashing (1969). It was later rediscovered by Carter & Wegman (1979) and studied in
more detail by Pătraşcu & Thorup (2011).
Method
Let p denote the number of bits in a key to be hashed, and q denote the number of bits desired in an output hash
function. Let r be a number smaller than p, and let t be the smallest integer that is at least as large as p/r. For
instance, if r=8, then an r-bit number is a byte, and t is the number of bytes per key.
The key idea of tabulation hashing is to view a key as a vector of t r-bit numbers, use a lookup table filled with
random values to compute a hash value for each of the r-bit numbers representing a given key, and combine these
values with the bitwise binary exclusive or operation. The choice of t and r should be made in such a way that this
table is not too large; e.g., so that it fits into the computer's cache memory.
The initialization phase of the algorithm creates a two-dimensional array T of dimensions 2^r by t, and fills the array
with random q-bit numbers. Once the array T is initialized, it can be used to compute the hash value h(x) of any given key
x. To do so, partition x into r-bit values, where x_0 consists of the low order r bits of x, x_1 consists of the next r bits,
etc. (E.g., again, with r=8, x_i is just the ith byte of x). Then, use these values as indices into T and combine them
with the exclusive or operation:
h(x) = T[x_0,0] ⊕ T[x_1,1] ⊕ T[x_2,2] ⊕ ...
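The method can be sketched directly from this description. The class name `TabulationHash` and its parameter defaults are hypothetical; the table is indexed as T[value][position] to mirror the T[x_i, i] notation above.

```python
import random


class TabulationHash:
    """Simple tabulation hashing of p-bit keys to q-bit values using r-bit chunks."""

    def __init__(self, p=32, q=32, r=8, rng=None):
        rng = rng or random.Random()
        self.r = r
        self.t = -(-p // r)  # ceil(p / r): number of r-bit chunks per key
        # T has 2^r rows and t columns, filled with random q-bit numbers.
        self.T = [[rng.getrandbits(q) for _ in range(self.t)] for _ in range(1 << r)]

    def __call__(self, x):
        mask = (1 << self.r) - 1
        h = 0
        for i in range(self.t):
            chunk = (x >> (i * self.r)) & mask  # x_i: the i-th r-bit chunk of x
            h ^= self.T[chunk][i]
        return h
```

With the defaults (p = 32, r = 8) the table holds 4 × 256 entries, small enough to stay in cache on typical hardware.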
Universality
Carter & Wegman (1979) define a randomized scheme for generating hash functions to be universal if, for any two
distinct keys, the probability that they collide (that is, they are mapped to the same value as each other) is at most
1/m, where m is the number of possible hash values. They defined a stronger property in the subsequent paper Wegman &
Carter (1981): a randomized scheme for generating hash functions is k-independent if, for every k-tuple of distinct keys, and
each possible k-tuple of values, the probability that those keys are mapped to those values is 1/m^k. 2-independent
hashing schemes are automatically universal, and any universal hashing scheme can be converted into a
2-independent scheme by storing a random number x in the initialization phase of the algorithm and adding x to each
hash value, so universality is essentially the same as 2-independence, but k-independence for larger values of k is a
stronger property, held by fewer hashing algorithms.
As Pătraşcu & Thorup (2011) observe, tabulation hashing is 3-independent but not 4-independent. For any single key
x, T[x_0,0] is equally likely to take on any hash value, and the exclusive or of T[x_0,0] with the remaining table values
does not change this property. For any two keys x and y, x is equally likely to be mapped to any hash value as before,
and there is at least one position i where x_i ≠ y_i; the table value T[y_i,i] is used in the calculation of h(y) but not in the
calculation of h(x), so even after the value of h(x) has been determined, h(y) is equally likely to be any valid hash
value. Similarly, for any three keys x, y, and z, at least one of the three keys has a position i where its value z_i differs
from the other two, so that even after the values of h(x) and h(y) are determined, h(z) is equally likely to be any valid
hash value.
However, this reasoning breaks down for four keys because there are sets of keys w, x, y, and z where none of the
four has a byte value that it does not share with at least one of the other keys. For instance, if the keys have two bytes
each, and w, x, y, and z are the four keys that have either zero or one as their byte values, then each byte value in
each position is shared by exactly two of the four keys. For these four keys, the hash values computed by tabulation
hashing will always satisfy the equation h(w) ⊕ h(x) ⊕ h(y) ⊕ h(z) = 0, whereas for a 4-independent hashing scheme
the same equation would only be satisfied with probability 1/m. Therefore, tabulation hashing is not 4-independent.
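The four-key counterexample can be checked with a short, self-contained sketch. It assumes two-byte keys (t = 2, r = 8); `make_tabulation_hash` and `xor_of_four` are hypothetical helper names.

```python
import random


def make_tabulation_hash(t, r, q, rng):
    """Build a simple tabulation hash over t chunks of r bits, with q-bit outputs."""
    T = [[rng.getrandbits(q) for _ in range(t)] for _ in range(1 << r)]
    mask = (1 << r) - 1

    def h(x):
        out = 0
        for i in range(t):
            out ^= T[(x >> (r * i)) & mask][i]
        return out

    return h


def xor_of_four(h):
    """XOR the hashes of the four two-byte keys whose bytes are all 0 or 1."""
    w, x, y, z = 0x0000, 0x0001, 0x0100, 0x0101
    return h(w) ^ h(x) ^ h(y) ^ h(z)
```

Whatever random table is drawn, `xor_of_four` returns 0, because each of the four table entries involved is used by exactly two of the keys and cancels under exclusive or.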
Siegel (2004) uses the same idea of using exclusive or operations to combine random values from a table, with a
more complicated algorithm based on expander graphs for transforming the key bits into table indices, to define
hashing schemes that are k-independent for any constant or even logarithmic value of k. However, the number of
table lookups needed to compute each hash value using Siegel's variation of tabulation hashing, while constant, is
still too large to be practical, and the use of expanders in Siegel's technique also makes it not fully constructive.
One limitation of tabulation hashing is that it assumes that the input keys have a fixed number of bits. Lemire (2012)
has studied variations of tabulation hashing that can be applied to variable-length strings, and shown that they can be
universal (2-independent) but not 3-independent.
Application
Because tabulation hashing is a universal hashing scheme, it can be used in any hashing-based algorithm in which
universality is sufficient. For instance, in hash chaining, the expected time per operation is proportional to the sum of
collision probabilities, which is the same for any universal scheme as it would be for truly random hash functions,
and is constant whenever the load factor of the hash table is constant. Therefore, tabulation hashing can be used to
compute hash functions for hash chaining with a theoretical guarantee of constant expected time per operation.
However, universal hashing is not strong enough to guarantee the performance of some other hashing algorithms.
For instance, for linear probing, 5-independent hash functions are strong enough to guarantee constant time
operation, but there are 4-independent hash functions that fail.[1] Nevertheless, despite only being 3-independent,
tabulation hashing provides the same constant-time guarantee for linear probing.
Cuckoo hashing, another technique for implementing hash tables, guarantees constant time per lookup (regardless of
the hash function). Insertions into a cuckoo hash table may fail, causing the entire table to be rebuilt, but such
failures are sufficiently unlikely that the expected time per insertion (using either a truly random hash function or a
hash function with logarithmic independence) is constant. With tabulation hashing, on the other hand, the best bound
known on the failure probability is higher, high enough that insertions cannot be guaranteed to take constant
expected time. Nevertheless, tabulation hashing is adequate to ensure the linear-expected-time construction of a
cuckoo hash table for a static set of keys that does not change as the table is used.
Algorithms such as Karp-Rabin require the efficient computation of hash values for all consecutive sequences of
characters. Rolling hash functions are typically used for these problems. Tabulation hashing is used to construct
families of strongly universal functions (for example, hashing by cyclic polynomials).
Notes
[1] For the sufficiency of 5-independent hashing for linear probing, see Pagh, Pagh & Ružić (2009). For examples of weaker hashing schemes that fail, see Pătraşcu & Thorup (2010).
References
Carter, J. Lawrence; Wegman, Mark N. (1979), "Universal classes of hash functions", Journal of Computer and
System Sciences 18 (2): 143–154, doi: 10.1016/0022-0000(79)90044-8 (http://dx.doi.org/10.1016/
0022-0000(79)90044-8), MR 532173 (http://www.ams.org/mathscinet-getitem?mr=532173).
Lemire, Daniel (2012), "The universality of iterated hashing over variable-length strings", Discrete Applied
Mathematics 160: 604–617, arXiv: 1008.1715 (http://arxiv.org/abs/1008.1715), doi:
10.1016/j.dam.2011.11.009 (http://dx.doi.org/10.1016/j.dam.2011.11.009).
Pagh, Anna; Pagh, Rasmus; Ružić, Milan (2009), "Linear probing with constant independence", SIAM Journal on
Computing 39 (3): 1107–1120, doi: 10.1137/070702278 (http://dx.doi.org/10.1137/070702278), MR
2538852 (http://www.ams.org/mathscinet-getitem?mr=2538852).
Pătraşcu, Mihai; Thorup, Mikkel (2010), "On the k-independence required by linear probing and minwise
independence" (http://people.csail.mit.edu/mip/papers/kwise-lb/kwise-lb.pdf), Automata, Languages and
Programming, 37th International Colloquium, ICALP 2010, Bordeaux, France, July 6-10, 2010, Proceedings,
Part I, Lecture Notes in Computer Science 6198, Springer, pp. 715–726, doi: 10.1007/978-3-642-14165-2_60
(http://dx.doi.org/10.1007/978-3-642-14165-2_60).
Pătraşcu, Mihai; Thorup, Mikkel (2011), "The power of simple tabulation hashing", Proceedings of the 43rd
annual ACM Symposium on Theory of Computing (STOC '11), pp. 1–10, arXiv: 1011.5200 (http://arxiv.org/
abs/1011.5200), doi: 10.1145/1993636.1993638 (http://dx.doi.org/10.1145/1993636.1993638).
Siegel, Alan (2004), "On universal classes of extremely random constant-time hash functions", SIAM Journal on
Computing 33 (3): 505–543 (electronic), doi: 10.1137/S0097539701386216 (http://dx.doi.org/10.1137/
S0097539701386216), MR 2066640 (http://www.ams.org/mathscinet-getitem?mr=2066640).
Wegman, Mark N.; Carter, J. Lawrence (1981), "New hash functions and their use in authentication and set
equality", Journal of Computer and System Sciences 22 (3): 265–279, doi: 10.1016/0022-0000(81)90033-7
(http://dx.doi.org/10.1016/0022-0000(81)90033-7), MR 633535 (http://www.ams.org/
mathscinet-getitem?mr=633535).
A cryptographic hash function (specifically, SHA-1) at work. Note that even small
changes in the source input (here in the word "over") drastically change the resulting
output, by the so-called avalanche effect.
Properties
Most cryptographic hash functions are designed to take a string of any length as input and produce a fixed-length
hash value.
A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. As a minimum, it
must have the following properties:
Pre-image resistance
Given a hash h it should be difficult to find any message m such that h = hash(m). This concept is related to
that of one-way function. Functions that lack this property are vulnerable to preimage attacks.
Second pre-image resistance
Given an input m1 it should be difficult to find another input m2 such that m1 ≠ m2 and hash(m1) = hash(m2).
Functions that lack this property are vulnerable to second-preimage attacks.
Collision resistance
It should be difficult to find two different messages m1 and m2 such that hash(m1) = hash(m2). Such a pair is
called a cryptographic hash collision. This property is sometimes referred to as strong collision resistance. It
requires a hash value at least twice as long as that required for preimage-resistance; otherwise collisions may
be found by a birthday attack.
Degree of difficulty
In cryptographic practice, "difficult" generally means "almost certainly beyond the reach of any adversary who must
be prevented from breaking the system for as long as the security of the system is deemed important". The meaning
of the term is therefore somewhat dependent on the application, since the effort that a malicious agent may put into
the task is usually proportional to their expected gain. However, since the needed effort usually grows very quickly
with the digest length, even a thousand-fold advantage in processing power can be neutralized by adding a few dozen
bits to the latter.
In some theoretical analyses "difficult" has a specific mathematical meaning, such as "not solvable in asymptotic
polynomial time". Such interpretations of difficulty are important in the study of provably secure cryptographic hash
functions but do not usually have a strong connection to practical security. For example, an exponential time
algorithm can sometimes still be fast enough to make a feasible attack. Conversely, a polynomial time algorithm
(e.g., one that requires n^20 steps for n-digit keys) may be too slow for any practical use.
Illustration
An illustration of the potential use of a cryptographic hash is as follows: Alice poses a tough math problem to Bob
and claims she has solved it. Bob would like to try it himself, but would still like to be sure that Alice is not bluffing.
Therefore, Alice writes down her solution, computes its hash and tells Bob the hash value (whilst keeping the
solution secret). Then, when Bob comes up with the solution himself a few days later, Alice can prove that she had
the solution earlier by revealing it and having Bob hash it and check that it matches the hash value given to him
before. (This is an example of a simple commitment scheme; in actual practice, Alice and Bob will often be
computer programs, and the secret would be something less easily spoofed than a claimed puzzle solution).
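The Alice and Bob exchange can be sketched with Python's standard `hashlib`. The function names `commit` and `verify` are hypothetical, SHA-256 is chosen only for illustration, and the random nonce is an addition not mentioned in the text: it is assumed here so that a short or guessable solution cannot be recovered by hashing candidate answers.

```python
import hashlib
import hmac
import secrets


def commit(solution: bytes):
    """Alice: return (digest, nonce). She publishes the digest and keeps the rest secret."""
    nonce = secrets.token_bytes(16)  # assumption: blinds low-entropy solutions
    digest = hashlib.sha256(nonce + solution).hexdigest()
    return digest, nonce


def verify(digest: str, nonce: bytes, solution: bytes) -> bool:
    """Bob: check that the revealed nonce and solution match the earlier digest."""
    candidate = hashlib.sha256(nonce + solution).hexdigest()
    return hmac.compare_digest(candidate, digest)
```

Alice sends `digest` first; later she reveals `nonce` and `solution`, and Bob calls `verify`. The binding property rests on second pre-image and collision resistance of the hash.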
Applications
Verifying the integrity of files or messages
An important application of secure hashes is verification of message integrity. Determining whether any changes
have been made to a message (or a file), for example, can be accomplished by comparing message digests calculated
before, and after, transmission (or any other event).
For this reason, most digital signature algorithms only confirm the authenticity of a hashed digest of the message to
be "signed". Verifying the authenticity of a hashed digest of the message is considered proof that the message itself
is authentic.
MD5 or SHA1 hashes are sometimes posted along with files on websites or forums to allow verification of integrity.
This practice is not secure because of the chain of trust problem: posted hashes cannot be trusted any more than the
files themselves, unless these websites are authenticated by HTTPS, but in this case the hashes are redundant.
Password verification
A related application is password verification. Storing all user passwords as cleartext can result in a massive security
breach if the password file is compromised. One way to reduce this danger is to only store the hash digest of each
password. To authenticate a user, the password presented by the user is hashed and compared with the stored hash.
(Note that this approach prevents the original passwords from being retrieved if forgotten or lost, and they have to be
replaced with new ones.) The password is often concatenated with a random, non-secret salt value before the hash
function is applied. The salt is stored with the password hash. Because users have different salts, it is not feasible to
store tables of precomputed hash values for common passwords. Key stretching functions, such as PBKDF2, Bcrypt
or Scrypt, typically use repeated invocations of a cryptographic hash to increase the time required to perform brute
force attacks on stored password digests.
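Salted, stretched password storage can be sketched with the standard library's `hashlib.pbkdf2_hmac`. The function names, the 16-byte salt and the iteration count are illustrative assumptions, not recommendations taken from the text.

```python
import hashlib
import hmac
import os


def hash_password(password: str, iterations: int = 100_000):
    """Return (salt, iterations, digest) to store in place of the cleartext password."""
    salt = os.urandom(16)  # random, non-secret, stored next to the digest
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, digest


def check_password(password: str, salt: bytes, iterations: int, digest: bytes) -> bool:
    """Re-derive the digest from the presented password and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, digest)
```

Because every user gets a fresh salt, two users with the same password end up with different stored digests, which is what defeats precomputed tables.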
In 2013 a long-term competition was announced to choose a new, standard algorithm for password hashing.
Merkle–Damgård construction
A hash function must be able to process an arbitrary-length message into a fixed-length output. This can be
achieved by breaking the input up into a series of equal-sized blocks, and operating on them in sequence using a
one-way compression function. The compression function can either be specially designed for hashing or be
built from a block cipher. A hash function built with the Merkle–Damgård construction is as
resistant to collisions as is its compression function; any collision for the full hash function can be traced back to a
collision in the compression function.
Figure: The Merkle–Damgård hash construction.
The last block processed should also be unambiguously length padded; this is crucial to the security of this
construction. This construction is called the Merkle–Damgård construction. Most widely used hash functions,
including SHA-1 and MD5, take this form.
The construction has certain inherent flaws, including length-extension and generate-and-paste attacks, and cannot
be parallelized. As a result, many entrants in the current NIST hash function competition are built on different,
sometimes novel, constructions.
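The block-by-block structure can be illustrated with a toy sketch. It is not a real hash function: the compression function is a stand-in built from truncated SHA-256 purely for demonstration, and the block size, state size and padding layout (a 0x80 byte, zero fill, then a 64-bit big-endian bit length) are assumptions modelled loosely on common designs.

```python
import hashlib

BLOCK = 64   # bytes per message block (assumed)
STATE = 16   # bytes of chaining state (assumed)


def compress(state: bytes, block: bytes) -> bytes:
    """Stand-in one-way compression function: (state, block) -> new state."""
    return hashlib.sha256(state + block).digest()[:STATE]


def md_pad(message: bytes) -> bytes:
    """Unambiguous length padding: 0x80, zeros, then the 64-bit message bit length."""
    padded = message + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % BLOCK)
    return padded + (len(message) * 8).to_bytes(8, "big")


def md_hash(message: bytes, iv: bytes = bytes(STATE)) -> bytes:
    """Feed the padded blocks through the compression function in sequence."""
    state = iv
    data = md_pad(message)
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state
```

The final state is the digest. Since the output is the full chaining state, anyone who knows a digest can continue hashing from it, which is the length-extension weakness mentioned above.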
Theoretical weaknesses of SHA-1 exist as well,[6][7] suggesting that it may be practical to break within years. New
applications can avoid these problems by using more advanced members of the SHA family, such as SHA-2, or
using techniques such as randomized hashing[8][9] that do not require collision resistance.
However, to ensure the long-term robustness of applications that use hash functions, there was a competition to
design a replacement for SHA-2. On October 2, 2012, Keccak was selected as the winner of the NIST hash function
competition. A version of this algorithm is expected to become a FIPS standard in 2014 under the name SHA-3.[10]
Some of the following algorithms are used often in cryptography; consult the article for each specific algorithm for
more information on the status of each algorithm. Note that this list does not include candidates in the current NIST
hash function competition. For additional hash functions see the box at the bottom of the page.
Algorithm | Output size (bits) | Internal state size [11] | Block size | Length size | Word size | Rounds | Collision | Second preimage | Preimage
GOST | 256 | 256 | 256 | 256 | 32 | 256 | Yes (2^105) [13] | Yes (2^192) [13] | Yes (2^192) [13]
HAVAL | 256/224/192/160/128 | 256 | 1,024 | 64 | 32 | 160/128/96 | Yes [14] | No | No
MD2 | 128 | 384 | 128 | – | 32 | 864 | Yes (2^63.3) [15] | Yes (2^73) [16] | Yes (2^73) [16]
MD4 | 128 | 128 | 512 | 64 | 32 | 48 | Yes (3) [17] | Yes (2^64) [18] | Yes (2^78.4) [18]
MD5 | 128 | 128 | 512 | 64 | 32 | 64 | Yes (2^20.96) [19] | Yes (2^123.4) [20] | Yes (2^123.4) [20]
PANAMA | 256 | 8,736 | 256 | – | 32 | – | Yes | No | No
RadioGatún | Up to 608/1,216 (19 words) | 58 words | 3 words | – | 1–64 | – | With flaws (2^352 or 2^704) [21] | No | No
RIPEMD | 128 | 128 | 512 | 64 | 32 | 48 | Yes (2^18) [15] | No | No
RIPEMD-128/256 | 128/256 | 128/256 | 512 | 64 | 32 | 64 | No | No | No
RIPEMD-160 | 160 | 160 | 512 | 64 | 32 | 80 | Yes (2^51:48) [22] | No | No
RIPEMD-320 | 320 | 320 | 512 | 64 | 32 | 80 | No | No | No
SHA-0 | 160 | 160 | 512 | 64 | 32 | 80 | Yes (2^33.6) [23] | No | No
SHA-1 | 160 | 160 | 512 | 64 | 32 | 80 | Yes (2^51) [24] | No | No
SHA-256/224 | 256/224 | 256 | 512 | 64 | 32 | 64 | Theoretical (2^28.5:24) [25][26] | Theoretical (2^248.4:42) [18] | Theoretical (2^248.4:42) [18]
SHA-512/384 | 512/384 | 512 | 1,024 | 128 | 64 | 80 | Theoretical (2^32.5:24) [26] | Theoretical (2^494.6:42) [18] | Theoretical (2^494.6:42) [18]
SHA-3 [27] | 224/256/384/512 | 1600 | 1600-2*bits | – | 64 | 24 | No | No | No
SHA-3-224 | 224 | 1600 | 1152 | – | 64 | 24 | No | No | No
SHA-3-256 | 256 | 1600 | 1088 | – | 64 | 24 | No | No | No
SHA-3-384 | 384 | 1600 | 832 | – | 64 | 24 | No | No | No
SHA-3-512 | 512 | 1600 | 576 | – | 64 | 24 | No | No | No
Tiger(2)-192/160/128 | 192/160/128 | 192 | 512 | 64 | 64 | 24 | Yes (2^62:19) [28] | Yes (2^184.3) [18] | Yes (2^184.3) [18]
WHIRLPOOL | 512 | 512 | 512 | 256 | – | 10 | Yes (2^120:4.5) [29] | No | No
Notes
[1] Note that any two messages that collide the concatenated function also collide each component function, by the nature of concatenation. For
example, if concat(sha1(message1), md5(message1)) == concat(sha1(message2), md5(message2)) then sha1(message1) == sha1(message2)
and md5(message1)==md5(message2). The concatenated function could have other problems that the strongest hash lacks -- for example, it
might leak information about the message when the strongest component does not, or it might be detectably nonrandom when the strongest
component is not -- but it can't be less collision-resistant.
[2] More generally, if an attack can produce a collision in one hash function's internal state, attacking the combined construction is only as
difficult as a birthday attack against the other function(s). For the detailed argument, see the Joux and Finney references that follow.
[3] Antoine Joux. Multicollisions in Iterated Hash Functions. Application to Cascaded Constructions. LNCS 3152/2004, pages 306-316. Full text (http://www.springerlink.com/index/DWWVMQJU0N0A3UGJ.pdf).
[4] http://article.gmane.org/gmane.comp.encryption.general/5154
[5] Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger, MD5 considered harmful today: Creating a rogue CA certificate (http://www.win.tue.nl/hashclash/rogue-ca/), accessed March 29, 2009
[6] Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu, Finding Collisions in the Full SHA-1 (http://people.csail.mit.edu/yiqun/SHA1AttackProceedingVersion.pdf)
[7] Bruce Schneier, Cryptanalysis of SHA-1 (http://www.schneier.com/blog/archives/2005/02/cryptanalysis_o.html) (summarizes Wang et al. results and their implications)
[8] Shai Halevi, Hugo Krawczyk, Update on Randomized Hashing (http://csrc.nist.gov/groups/ST/hash/documents/HALEVI_UpdateonRandomizedHashing0824.pdf)
[9] Shai Halevi and Hugo Krawczyk, Randomized Hashing and Digital Signatures (http://www.ee.technion.ac.il/~hugo/rhash/)
[10] NIST.gov - Computer Security Division - Computer Security Resource Center (http://csrc.nist.gov/groups/ST/hash/sha-3/timeline_fips.html)
[11] The internal state here means the "internal hash sum" after each compression of a data block. Most hash algorithms also internally use some additional variables, such as the length of the data compressed so far, since that is needed for the length padding at the end. See the Merkle-Damgård construction for details.
[12] When omitted, rounds are full number.
[13] http://www.springerlink.com/content/2514122231284103/
[14] There isn't a unique second preimage attack against this hash. However, the second preimage challenge reduces to the ordinary preimage attack by simply constructing a hash of the given message.
[15] http://www.springerlink.com/content/n5vrtdha97a2udkx/
[16] http://eprint.iacr.org/2008/089.pdf
[17] http://www.springerlink.com/content/v6526284mu858v37/
[18] http://eprint.iacr.org/2010/016.pdf
[19] http://eprint.iacr.org/2009/223.pdf
[20] http://springerlink.com/content/d7pm142n58853467/
[21] http://eprint.iacr.org/2008/515
[22] http://www.springerlink.com/content/3540l03h1w31n6w7
[23] http://www.springerlink.com/content/3810jp9730369045/
[24] http://eprint.iacr.org/2008/469.pdf
[25] There is no known attack against the full version of this hash function; however, there is an attack against this hashing scheme when the number of rounds is reduced.
[26] http://eprint.iacr.org/2008/270.pdf
[27] Although the underlying algorithm Keccak has arbitrary hash lengths, NIST specified 224, 256, 384 and 512 bits output as valid modes for SHA-3.
[28] http://www.springerlink.com/content/u762587644802p38/
[29] https://www.cosic.esat.kuleuven.be/fse2009/slides/2402_1150_Schlaeffer.pdf
References
External links
Christof Paar, Jan Pelzl, "Hash Functions" (http://wiki.crypto.rub.de/Buch/movies.php), Chapter 11 of
"Understanding Cryptography, A Textbook for Students and Practitioners". (companion web site contains online
cryptography course that covers hash functions), Springer, 2009.
"The ECRYPT Hash Function Website" (http://ehash.iaik.tugraz.at/wiki/The_eHash_Main_Page)
"Series of mini-lectures about cryptographic hash functions" (http://www.guardtime.com/educational-series-on-hashes/) by A. Buldas, 2011.
"Cryptographic Hash-Function Basics: Definitions, Implications, and Separations for Preimage Resistance, Second-Preimage Resistance, and Collision Resistance" (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.3.6200) by P. Rogaway, T. Shrimpton, 2004
Sets
Set (abstract data type)
In computer science, a set is an abstract data structure that can store certain values, without any particular order, and
no repeated values. It is a computer implementation of the mathematical concept of a finite set. Unlike most other
collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a
set.
Some set data structures are designed for static or frozen sets that do not change after they are constructed. Static
sets allow only query operations on their elements such as checking whether a given value is in the set, or
enumerating the values in some arbitrary order. Other variants, called dynamic or mutable sets, also allow the
insertion and deletion of elements from the set.
An abstract data structure is a collection, or aggregate, of data. The data may be booleans, numbers, characters, or
other data structures. If one considers the structure yielded by packaging[1] or indexing,[2] there are four basic data
structures:
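1. unpackaged, unindexed: bunch
2. packaged, unindexed: set
3. unpackaged, indexed: string (sequence)
4. packaged, indexed: list (array)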
In this view, the contents of a set are a bunch, and isolated data items are elementary bunches (elements). Whereas
sets contain elements, bunches consist of elements.
Further structuring may be achieved by considering the multiplicity of elements (sets become multisets, bunches
become hyperbunches) or their homogeneity (a record is a set of fields, not necessarily all of the same type).
Implementations
A set can be implemented in many ways. For example, one can use a list, ignoring the order of the elements and
taking care to avoid repeated values. Sets are often implemented using various flavors of trees, tries, or hash tables.
A set can be seen, and implemented, as a (partial) associative array, in which the value of each key-value pair has the
unit type.
Type theory
In type theory, sets are generally identified with their indicator function: accordingly, a set of values of type $A$ may
be denoted by $2^{A}$ or $\mathcal{P}(A)$. (Subtypes and subsets may be modeled by refinement types, and quotient sets may be
replaced by setoids.)
Operations
Core set-theoretical operations
One may define the operations of the algebra of sets, such as union(S,T), intersection(S,T), difference(S,T), and the subset test subset(S,T).
Static sets
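Typical operations that may be provided by a static set structure S include testing whether a given value is an element of S and enumerating the values of S in some arbitrary order.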
Dynamic sets
Dynamic set structures typically add:
create(): creates a new, initially empty set structure.
create_with_capacity(n): creates a new set structure, initially empty but capable of holding up to n
elements.
add(S,x): adds the element x to S, if it is not present already.
remove(S, x): removes the element x from S, if it is present.
capacity(S): returns the maximum number of values that S can hold.
Some set structures may allow only some of these operations. The cost of each operation will depend on the
implementation, and possibly also on the particular values stored in the set, and the order in which they are inserted.
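The dynamic operations listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name BoundedSet, the choice of raising OverflowError when the set is full, and the use of Python's built-in set for storage are not part of the abstract definition.

```python
class BoundedSet:
    """Toy dynamic set with a fixed capacity (hypothetical name)."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._items = set()          # the built-in hash set does the real work

    def add(self, x):
        """Add x if it is absent; refuse to grow beyond the capacity."""
        if x not in self._items and len(self._items) >= self._capacity:
            raise OverflowError("set is full")
        self._items.add(x)

    def remove(self, x):
        """Remove x if it is present; do nothing otherwise."""
        self._items.discard(x)

    def capacity(self):
        return self._capacity

    def __contains__(self, x):
        return x in self._items

    def __len__(self):
        return len(self._items)


def create_with_capacity(n):
    return BoundedSet(n)


s = create_with_capacity(2)
s.add("a")
s.add("a")        # already present: no effect
s.add("b")
s.remove("zzz")   # absent: no effect
```

Whether a full set should raise an error, grow, or silently ignore the insertion is a design choice that the abstract operations leave open.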
Additional operations
There are many other operations that can (in principle) be defined in terms of the above, such as pop, clear, and fold.
Other operations can be defined for sets with elements of a special type:
sum(S): returns the sum of all elements of S for some definition of "sum". For example, over integers or reals, it
may be defined as fold(0, add, S).
nearest(S,x): returns the element of S that is closest in value to x (by some metric).
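As a rough sketch of the two operations above, sum can be written as a fold and nearest as a minimum over a distance function. The helper names fold, set_sum, and nearest are illustrative, and absolute difference is assumed as the metric (ties go to the smaller element).

```python
from functools import reduce
import operator


def fold(initial, func, s):
    """Combine all elements of s with func, starting from initial."""
    return reduce(func, s, initial)


def set_sum(s):
    # sum(S) defined as fold(0, add, S)
    return fold(0, operator.add, s)


def nearest(s, x):
    """Element of s closest to x under absolute difference."""
    return min(s, key=lambda e: (abs(e - x), e))


values = {3, 10, 42}
```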
Implementations
Sets can be implemented using various data structures, which provide different time and space trade-offs for various
operations. Some implementations are designed to improve the efficiency of very specialized operations, such as
nearest or union. Implementations described as "general use" typically strive to optimize the element_of,
add, and delete operations.
As sets can be interpreted as a kind of map (by the indicator function), sets are commonly implemented in the same
way as maps (associative arrays), namely, a self-balancing binary search tree for sorted sets (which has O(log n) for
most operations), or a hash table for unsorted sets (which has O(1) average-case, but O(n) worst-case, for most
operations). A sorted linear hash table may be used to provide deterministically ordered sets.
Other popular methods include arrays. In particular, a subset of the integers 1..n can be implemented efficiently as an
n-bit bit array, which also supports very efficient union and intersection operations. A Bloom map implements a set
probabilistically, using a very compact representation but risking a small chance of false positives on queries.
The Boolean set operations can be implemented in terms of more elementary operations (pop, clear, and add),
but specialized algorithms may yield lower asymptotic time bounds. If sets are implemented as sorted lists, for
example, the naive algorithm for union(S,T) will take time proportional to the length m of S times the length n
of T, whereas a variant of the list merging algorithm will do the job in time proportional to m+n. Moreover, there are
specialized set data structures (such as the union-find data structure) that are optimized for one or more of these
operations, at the expense of others.
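The merge-based union mentioned above might look as follows. This is a minimal sketch that assumes both inputs are sorted lists without duplicates; the function name union_sorted is illustrative.

```python
def union_sorted(s, t):
    """Union of two sorted, duplicate-free lists in O(m + n) time."""
    result = []
    i = j = 0
    while i < len(s) and j < len(t):
        if s[i] < t[j]:
            result.append(s[i])
            i += 1
        elif t[j] < s[i]:
            result.append(t[j])
            j += 1
        else:                      # equal elements: keep a single copy
            result.append(s[i])
            i += 1
            j += 1
    result.extend(s[i:])           # at most one of these two tails is non-empty
    result.extend(t[j:])
    return result
```

Each step advances at least one of the two indices, which is where the m+n bound comes from.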
Language support
One of the earliest languages to support sets was Pascal; many languages now include them, whether in the core
language or in a standard library.
Java offers the Set [3] interface to support sets (with the HashSet [4] class implementing it using a hash table),
and the SortedSet [5] sub-interface to support sorted sets (with the TreeSet [6] class implementing it using
a binary search tree).
Apple's Foundation framework (part of Cocoa) provides the Objective-C classes NSSet [7], NSMutableSet [8],
NSCountedSet [9], NSOrderedSet [10], and NSMutableOrderedSet [11]. The CoreFoundation
APIs provide the CFSet [12] and CFMutableSet [13] types for use in C.
Python has built-in set and frozenset types [14] since 2.4, and since Python 3.0 and 2.7, supports
non-empty set literals using a curly-bracket syntax, e.g.: {x, y, z}.
The .NET Framework provides the generic HashSet [15] and SortedSet [16] classes that implement the
generic ISet [17] interface.
Smalltalk's class library includes Set and IdentitySet, using equality and identity for inclusion test
respectively. Many dialects provide variations for compressed storage (NumberSet, CharacterSet), for
ordering (OrderedSet, SortedSet, etc.) or for weak references (WeakIdentitySet).
Ruby's standard library includes a set [18] module which contains Set and SortedSet classes that
implement sets using hash tables, the latter allowing iteration in sorted order.
OCaml's standard library contains a Set module, which implements a functional set data structure using binary
search trees.
The GHC implementation of Haskell provides a Data.Set [19] module, which implements a functional set data
structure using binary search trees.
The Tcl Tcllib package provides a set module which implements a set data structure based upon Tcl lists.
As noted in the previous section, in languages which do not directly support sets but do support associative arrays,
sets can be emulated using associative arrays, by using the elements as keys, and using a dummy value as the values,
which are ignored.
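The emulation described above can be sketched with a Python dictionary standing in for the associative array. Python has native sets, so this is purely illustrative; the helper names add, remove, and contains are assumptions.

```python
seen = {}                      # associative array used as a set


def add(table, x):
    table[x] = None            # dummy value, never read


def remove(table, x):
    table.pop(x, None)         # no error if x is absent


def contains(table, x):
    return x in table          # membership is a key lookup


for word in ["pear", "apple", "pear"]:
    add(seen, word)
remove(seen, "kiwi")           # absent: no effect
```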
In C++
In C++, the Standard Template Library (STL) provides the set template class, which implements a sorted set using
a binary search tree; SGI's STL also provides the hash_set template class, which implements a set using a hash
table.
In sets, the elements themselves are the keys, in contrast to sequenced containers, where elements are accessed using
their (relative or absolute) position. Set elements must have a strict weak ordering.
Multiset
A generalization of the notion of a set is that of a multiset or bag, which is similar to a set but allows repeated
("equal") values (duplicates). This is used in two distinct senses: either equal values are considered identical, and are
simply counted, or equal values are considered equivalent, and are stored as distinct items. For example, given a list
of people (by name) and ages (in years), one could construct a multiset of ages, which simply counts the number of
people of a given age. Alternatively, one can construct a multiset of people, where two people are considered
equivalent if their ages are the same (but may be different people and have different names), in which case each pair
(name, age) must be stored, and selecting on a given age gives all the people of a given age.
Formally, it is possible for objects in computer science to be considered "equal" under some equivalence relation but
still distinct under another relation. Some types of multiset implementations will store distinct equal objects as
separate items in the data structure; while others will collapse it down to one version (the first one encountered) and
keep a positive integer count of the multiplicity of the element.
As with sets, multisets can naturally be implemented using hash tables or trees, which yield different performance
characteristics.
The set of all bags over type T is given by the expression bag T. If by multiset one considers equal items identical
and simply counts them, then a multiset can be interpreted as a function from the input domain to the non-negative
integers (natural numbers), generalizing the identification of a set with its indicator function. In some cases a
multiset in this counting sense may be generalized to allow negative values, as in Python.
C++'s Standard Template Library implements both sorted and unsorted multisets. It provides the multiset
class for the sorted multiset, as a kind of associative container, which implements this multiset using a
self-balancing binary search tree. It provides the unordered_multiset class for the unsorted multiset, as a
kind of unordered associative container, which implements this multiset using a hash table. The unsorted
multiset is standard as of C++11; previously SGI's STL provided the hash_multiset class, which was copied
and eventually standardized.
For Java, third-party libraries provide multiset functionality:
Apache Commons Collections provides the Bag [20] and SortedBag interfaces, with implementing classes
like HashBag and TreeBag.
Google Guava provides the Multiset [21] interface, with implementing classes like HashMultiset and
TreeMultiset.
Apple provides the NSCountedSet [9] class as part of Cocoa, and the CFBag [22] and CFMutableBag [23]
types as part of CoreFoundation.
Python's standard library includes collections.Counter [24], which is similar to a multiset.
Smalltalk includes the Bag class, which can be instantiated to use either identity or equality as predicate for
inclusion test.
Where a multiset data structure is not available, a workaround is to use a regular set but override the equality
predicate of its items to always return "not equal" on distinct objects (however, such a set will still not be able to store
multiple occurrences of the same object), or to use an associative array mapping the values to their integer multiplicities
(this will not be able to distinguish between equal elements at all).
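The two senses of "multiset" discussed in this section can be contrasted with the people-and-ages example. The sketch below uses collections.Counter for the counting sense and a dictionary of lists for the sense in which equivalent items are stored separately; the sample data and variable names are made up for illustration.

```python
from collections import Counter, defaultdict

people = [("Ann", 30), ("Bob", 25), ("Cy", 30)]

# Sense 1: equal values are identical and merely counted.
age_counts = Counter(age for _, age in people)

# Sense 2: equal values are equivalent but stored as distinct items,
# so selecting on an age returns every (name, age) pair with that age.
people_by_age = defaultdict(list)
for name, age in people:
    people_by_age[age].append((name, age))
```

The first form is the associative-array workaround mentioned above: it cannot tell Ann and Cy apart once they are counted.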
Multisets in SQL
In relational databases, a table can be a (mathematical) set or a multiset, depending on the presence of uniqueness
constraints on some columns (which turn those columns into a candidate key).
SQL allows the selection of rows from a relational table: this operation will in general yield a multiset, unless the
keyword DISTINCT is used to force the rows to be all different, or the selection includes the primary (or a
candidate) key.
In ANSI SQL the MULTISET keyword can be used to transform a subquery into a collection expression:
SELECT expression1, expression2... FROM table_name...
is a general select that can be used as subquery expression of another more general query, while
MULTISET(SELECT expression1, expression2... FROM table_name...)
transforms the subquery into a collection expression that can be used in another query, or in assignment to a column
of appropriate collection type.
References
[1] "Packaging" consists in supplying a container for an aggregation of objects in order to turn them into a single object. Consider a function call:
without packaging, a function can be called to act upon a bunch only by passing each bunch element as a separate argument, which
complicates the function's signature considerably (and is just not possible in some programming languages). By packaging the bunch's
elements into a set, the function may now be called upon a single, elementary argument: the set object (the bunch's package).
[2] Indexing is possible when the elements being considered are totally ordered. Being without order, the elements of a multiset (for example) do
not have lesser/greater or preceding/succeeding relationships: they can only be compared in absolute terms (same/different).
[3] http://download.oracle.com/javase/7/docs/api/java/util/Set.html
[4] http://download.oracle.com/javase/7/docs/api/java/util/HashSet.html
[5] http://download.oracle.com/javase/7/docs/api/java/util/SortedSet.html
[6] http://download.oracle.com/javase/7/docs/api/java/util/TreeSet.html
[7] http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSSet_Class/
[8] http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSMutableSet_Class/
[9] http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSCountedSet_Class/
[10] http://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSOrderedSet_Class/Reference/Reference.html
[11] https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSMutableOrderedSet_Class/Reference/Reference.html
[12] http://developer.apple.com/documentation/CoreFoundation/Reference/CFSetRef/
[13] http://developer.apple.com/documentation/CoreFoundation/Reference/CFMutableSetRef/
[14] http://docs.python.org/library/stdtypes.html#set-types-set-frozenset
[15] http://msdn.microsoft.com/en-us/library/bb359438.aspx
[16] http://msdn.microsoft.com/en-us/library/dd412070.aspx
[17] http://msdn.microsoft.com/en-us/library/dd412081.aspx
[18] http://ruby-doc.org/stdlib/libdoc/set/rdoc/index.html
[19] http://hackage.haskell.org/packages/archive/containers/0.2.0.1/doc/html/Data-Set.html
[20] http://commons.apache.org/collections/api-release/org/apache/commons/collections/Bag.html
[21] http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/Multiset.html
[22] http://developer.apple.com/documentation/CoreFoundation/Reference/CFBagRef/
[23] http://developer.apple.com/documentation/CoreFoundation/Reference/CFMutableBagRef/
[24] http://docs.python.org/library/collections.html#collections.Counter
Bit array
A bit array (also known as bitmap, bitset, bit string, or bit vector) is an array data structure that compactly stores
bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level parallelism
in hardware to perform operations quickly. A typical bit array stores kw bits, where w is the number of bits in the
unit of storage, such as a byte or word, and k is some nonnegative integer. If w does not divide the number of bits to
be stored, some space is wasted due to internal fragmentation.
Definition
A bit array is a mapping from some domain (almost always a range of integers) to values in the set {0, 1}. The
values can be interpreted as dark/light, absent/present, locked/unlocked, valid/invalid, et cetera. The point is that
there are only two possible values, so they can be stored in one bit. The array can be viewed as a subset of the
domain (e.g. {0, 1, 2, ..., n-1}), where a 1 bit indicates a number in the set and a 0 bit a number not in the set. This
set data structure uses about n/w words of space, where w is the number of bits in each machine word. Whether the
least significant bit or the most significant bit indicates the smallest-index number is largely irrelevant, but the
former tends to be preferred.
Basic operations
Although most machines are not able to address individual bits in memory, nor have instructions to manipulate
single bits, each bit in a word can be singled out and manipulated using bitwise operations. In particular:
OR can be used to set a bit to one: 11101010 OR 00000100 = 11101110
AND can be used to set a bit to zero: 11101010 AND 11111101 = 11101000
AND together with zero-testing can be used to determine if a bit is set:
11101010 AND 00000001 = 00000000 = 0
11101010 AND 00000010 = 00000010 ≠ 0
XOR can be used to invert or toggle a bit:
11101010 XOR 00000100 = 11101110
11101110 XOR 00000100 = 11101010
NOT can be used to invert all bits.
NOT 10110010 = 01001101
To obtain the bit mask needed for these operations, we can use a bit shift operator to shift the number 1 to the left by
the appropriate number of places, as well as bitwise negation if necessary.
Given two bit arrays of the same size representing sets, we can compute their union, intersection, and set-theoretic
difference using n/w simple bit operations each (2n/w for difference), as well as the complement of either:
for i from 0 to n/w-1
    complement_a[i] := not a[i]
    union[i]        := a[i] or b[i]
    intersection[i] := a[i] and b[i]
    difference[i]   := a[i] and (not b[i])
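The same word-by-word loop can be written in Python, with a list of integers standing in for the array of words. The 8-bit word size W is an arbitrary assumption for the example, and the explicit WORD_MASK is needed only because Python integers are unbounded.

```python
W = 8                                   # assumed word size in bits
WORD_MASK = (1 << W) - 1                # keeps results within one word


def union(a, b):
    return [x | y for x, y in zip(a, b)]


def intersection(a, b):
    return [x & y for x, y in zip(a, b)]


def difference(a, b):
    # two operations per word: a negation and an AND
    return [x & ~y & WORD_MASK for x, y in zip(a, b)]


def complement(a):
    return [~x & WORD_MASK for x in a]


a = [0b00001111, 0b10000000]
b = [0b00111100, 0b10000001]
```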
If we wish to iterate through the bits of a bit array, we can do this efficiently using a doubly nested loop that loops
through each word, one at a time. Only n/w memory accesses are required:
for i from 0 to n/w-1
    index := 0    // if needed
    word := a[i]
    for b from 0 to w-1
        value := word and 1 ≠ 0
        word := word shift right 1
        // do something with value
        index := index + 1    // if needed
Both of these code samples exhibit ideal locality of reference and therefore benefit greatly from a data cache.
If a cache line is k words, only about n/(wk) cache misses will occur.
Sorting
Sorting a bit array is trivial to do in O(n) time using counting sort: we count the number of ones k, fill
the last k/w (rounded down) words with ones, set only the low k mod w bits of the next word, and set the rest to zero.
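The counting-sort procedure can be sketched as follows. The sketch assumes that the most significant bit of each word holds the smallest index, so that a partially filled word keeps its ones in the low bits as described; the function name sort_bits and the 8-bit default word size are illustrative.

```python
def sort_bits(words, w=8):
    """Counting sort of a bit array: all 0 bits first, then all 1 bits."""
    k = sum(bin(word).count("1") for word in words)   # number of ones
    full, rest = divmod(k, w)
    n = len(words)
    result = [0] * n                                  # everything else is zero
    for i in range(n - full, n):                      # last k // w words
        result[i] = (1 << w) - 1
    if rest:
        result[n - full - 1] = (1 << rest) - 1        # low k mod w bits
    return result
```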
Inversion
Vertical flipping of a one-bit-per-pixel image, and some FFT algorithms, require flipping the bits of individual words
(so b31 b30 ... b0 becomes b0 ... b30 b31). When this operation is not available on the processor, it is
still possible to proceed by successive passes, in this example on 32 bits:
exchange two 16-bit halfwords
exchange bytes by pairs (0xddccbbaa -> 0xccddaabb)
...
swap bits by pairs
swap bits (b31 b30 ... b1 b0 -> b30 b31 ... b0 b1)
The last operation can be written ((x&0x55555555)<<1) | ((x&0xaaaaaaaa)>>1).
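Put together, the successive passes give a complete 32-bit reversal. The sketch below follows the steps listed above; the intermediate nibble pass (hidden behind the "..." in the list) is filled in on the assumption that the pattern of halving group sizes continues, and the function name reverse_bits32 is illustrative.

```python
def reverse_bits32(x):
    """Reverse the 32 bits of x by swapping progressively smaller groups."""
    x &= 0xFFFFFFFF
    x = ((x << 16) | (x >> 16)) & 0xFFFFFFFF                 # 16-bit halfwords
    x = ((x & 0x00FF00FF) << 8) | ((x & 0xFF00FF00) >> 8)    # bytes by pairs
    x = ((x & 0x0F0F0F0F) << 4) | ((x & 0xF0F0F0F0) >> 4)    # nibbles (assumed step)
    x = ((x & 0x33333333) << 2) | ((x & 0xCCCCCCCC) >> 2)    # bits by pairs
    x = ((x & 0x55555555) << 1) | ((x & 0xAAAAAAAA) >> 1)    # single bits
    return x
```

Five passes suffice for 32 bits because each pass halves the group size.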
The log base 2 operation (see find first set) can also be extended to a bit array in a straightforward manner.
Compression
A bit array is the densest storage for "random" bits, that is, where each bit is equally likely to be 0 or 1, and each one
is independent. But most data is not random, so it may be possible to store it more compactly. For example, the data
of a typical fax image is not random and can be compressed. Run-length encoding is commonly used to compress
these long streams. However, by compressing bit arrays too aggressively we run the risk of losing the benefits due to
bit-level parallelism (vectorization). Thus, instead of compressing bit arrays as streams of bits, we might compress
them as streams of bytes or words (see Bitmap index (compression)).
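As a rough illustration of run-length encoding on a bit stream, the sketch below stores each run as a (bit, length) pair. The function names and the sample row are made up; practical schemes, including the word-aligned ones alluded to above, use more compact encodings.

```python
from itertools import groupby


def rle_encode(bits):
    """Run-length encode a sequence of 0/1 values as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original sequence."""
    return [bit for bit, length in runs for _ in range(length)]


row = [0] * 12 + [1] * 3 + [0] * 9     # a mostly blank scan line
```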
The specific compression technique and implementation details can affect performance. Thus, it might be helpful in
practice to benchmark the various implementations [1].
Examples: compressedbitset [2], javaewah [3], CONCISE [4], EWAHBoolArray [5], csharpewah [6], sparsebitmap [7].
Applications
Because of their compactness, bit arrays have a number of applications in areas where space or efficiency is at a
premium. Most commonly, they are used to represent a simple group of boolean flags or an ordered sequence of
boolean values.
Bit arrays are used for priority queues, where the bit at index k is set if and only if k is in the queue; this data
structure is used, for example, by the Linux kernel, and benefits strongly from a find-first-zero operation in
hardware.
Bit arrays can be used for the allocation of memory pages, inodes, disk sectors, etc. In such cases, the term bitmap
may be used. However, this term is frequently used to refer to raster images, which may use multiple bits per pixel.
Another application of bit arrays is the Bloom filter, a probabilistic set data structure that can store large sets in a
small space in exchange for a small probability of error. It is also possible to build probabilistic hash tables based on
bit arrays that accept either false positives or false negatives.
Bit arrays and the operations on them are also important for constructing succinct data structures, which use close to
the minimum possible space. In this context, operations like finding the nth 1 bit or counting the number of 1 bits up
to a certain position become important.
Bit arrays are also a useful abstraction for examining streams of compressed data, which often contain elements that
occupy portions of bytes or are not byte-aligned. For example, the compressed Huffman coding representation of a
single 8-bit character can be anywhere from 1 to 255 bits long.
In information retrieval, bit arrays are a good representation for the posting lists of very frequent terms. If we
compute the gaps between adjacent values in a list of strictly increasing integers and encode them using unary
coding, the result is a bit array with a 1 bit in the nth position if and only if n is in the list. The implied probability of
a gap of n is $1/2^{n}$. This is also the special case of Golomb coding where the parameter M is 1; this parameter is only
normally selected when $-\log(2-p)/\log(1-p) \le 1$, or roughly when the term occurs in at least 38% of documents.
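The equivalence between unary-coded gaps and a bit array can be checked with a small sketch. It assumes one common unary convention, in which a gap g is written as g - 1 zeros followed by a single 1, and positions counted from 1; the function name and the sample posting list are illustrative.

```python
def unary_gaps_to_bits(postings):
    """Encode a strictly increasing list of positive integers by unary-coded gaps."""
    bits = []
    previous = 0
    for doc_id in postings:
        gap = doc_id - previous
        bits.extend([0] * (gap - 1) + [1])   # g - 1 zeros, then a 1
        previous = doc_id
    return bits


bits = unary_gaps_to_bits([1, 3, 4, 8])
```

Reading off the positions of the 1 bits recovers the original list, which is the sense in which the encoded stream is itself a bit array.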
Language support
The C programming language's bitfields, pseudo-objects found in structs with size equal to some number of bits, are
in fact small bit arrays; they are limited in that they cannot span words. Although they give a convenient syntax, the
bits are still accessed using bitwise operators on most machines, and they can only be defined statically (like C's
static arrays, their sizes are fixed at compile-time). It is also a common idiom for C programmers to use words as
small bit arrays and access bits of them using bit operators. A widely available header file included in the X11
system, xtrapbits.h, is a portable way for systems to define bit field manipulation of arrays of bits. A more
explanatory description of aforementioned approach can be found in the comp.lang.c faq [8].
In C++, although individual bools typically occupy the same space as a byte or an integer, the STL type
vector<bool> is a partial template specialization in which bits are packed as a space efficiency optimization.
Since bytes (and not bits) are the smallest addressable unit in C++, the [] operator does not return a reference to an
element, but instead returns a proxy reference. This might seem a minor point, but it means that vector<bool> is
not a standard STL container, which is why the use of vector<bool> is generally discouraged. Another unique
STL class, bitset, creates a vector of bits fixed at a particular size at compile-time, and in its interface and syntax
more resembles the idiomatic use of words as bit sets by C programmers. It also has some additional power, such as
the ability to efficiently count the number of bits that are set. The Boost C++ Libraries provide a
dynamic_bitset class whose size is specified at run-time.
The D programming language provides bit arrays in both of its competing standard libraries. In Phobos, they are
provided in std.bitmanip, and in Tango, they are provided in tango.core.BitArray. As in C++, the []
operator does not return a reference, since individual bits are not directly addressable on most hardware, but instead
returns a bool.
In Java, the class BitSet [9] creates a bit array that is then manipulated with functions named after bitwise
operators familiar to C programmers. Unlike the bitset in C++, the Java BitSet does not have a "size" state (it
has an effectively infinite size, initialized with 0 bits); a bit can be set or tested at any index. In addition, there is a
class EnumSet [10], which represents a Set of values of an enumerated type internally as a bit vector, as a safer
alternative to bitfields.
The .NET Framework supplies a BitArray collection class. It stores boolean values, supports random access and
bitwise operators, can be iterated over, and its Length property can be changed to grow or truncate it.
Although Standard ML has no support for bit arrays, Standard ML of New Jersey has an extension, the BitArray
structure, in its SML/NJ Library. It is not fixed in size and supports set operations and bit operations, including,
unusually, shift operations.
Haskell likewise currently lacks standard support for bitwise operations, but both GHC and Hugs provide a
Data.Bits module with assorted bitwise functions and operators, including shift and rotate operations. An
"unboxed" array over boolean values may be used to model a bit array, although this lacks support from the former
module.
In Perl, strings can be used as expandable bit arrays. They can be manipulated using the usual bitwise operators (~
| & ^),[11] and individual bits can be tested and set using the vec function.[12]
In Ruby, you can access (but not set) a bit of an integer (Fixnum or Bignum) using the bracket operator ([]), as if
it were an array of bits.
Apple's Core Foundation library contains CFBitVector [13] and CFMutableBitVector [14] structures.
PL/I supports arrays of bit strings of arbitrary length, which may be either fixed-length or varying. The array
elements may be aligned (each element begins on a byte or word boundary) or unaligned (elements
immediately follow each other with no padding).
Hardware description languages such as VHDL, Verilog, and SystemVerilog natively support bit vectors as these are
used to model storage elements like flip-flops, hardware busses and hardware signals in general. In hardware
verification languages such as OpenVera, e and SystemVerilog, bit vectors are used to sample values from the
hardware models, and to represent data that is transferred to hardware during simulations.
References
[1] https://github.com/lemire/simplebitmapbenchmark
[2] http://code.google.com/p/compressedbitset/
[3] http://code.google.com/p/javaewah/
[4] http://ricerca.mat.uniroma3.it/users/colanton/concise.html
[5] http://github.com/lemire/EWAHBoolArray
[6] http://code.google.com/p/csharpewah/
[7] http://code.google.com/p/sparsebitmap/
[8] http://c-faq.com/misc/bitsets.html
[9] http://download.oracle.com/javase/7/docs/api/java/util/BitSet.html
[10] http://download.oracle.com/javase/7/docs/api/java/util/EnumSet.html
[11] http://perldoc.perl.org/perlop.html#Bitwise-String-Operators
[12] http://perldoc.perl.org/functions/vec.html
[13] http://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFBitVectorRef/Reference/reference.html
[14] http://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFMutableBitVectorRef/Reference/reference.html#//apple_ref/doc/uid/20001500
Bloom filter
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is
used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not;
i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but
not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the
larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would require an impracticably
large hash area in memory if "conventional" error-free hashing techniques were applied. He gave the example of a
hyphenation algorithm for a dictionary of 500,000 words, of which 90% could be hyphenated by following simple
rules but all the remaining 50,000 words required expensive disk access to retrieve their specific patterns. With
unlimited core memory, an error-free hash could be used to eliminate all the unnecessary disk access. But if core
memory was insufficient, a smaller hash area could be used to eliminate most of the unnecessary access. For
example, a hash area only 15% of the error-free size would still eliminate 85% of the disk accesses (Bloom (1970)).
More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the
size or number of elements in the set (Bonomi et al. (2006)).
Algorithm description
An empty Bloom filter is a bit array
of m bits, all set to 0. There must also
be k different hash functions defined,
each of which maps or hashes some set
element to one of the m array positions
with a uniform random distribution.
To add an element, feed it to each of
the k hash functions to get k array
positions. Set the bits at all these
positions to 1.
An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the
positions in the bit array that each set element is mapped to. The element w is not in the
set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure,
m=18 and k=3.
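The add operation and the membership test illustrated in the figure can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the class name `BloomFilter`, the method names, and the use of salted SHA-256 digests to stand in for k independent hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        # One byte per bit keeps the code readable; a real filter would pack bits.
        self.bits = bytearray(m)

    def _positions(self, item: str):
        # Assumption: k "different" hash functions are simulated by salting
        # SHA-256 with the index of the hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Set the bits at all k positions to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # A single 0 bit proves the item was never added;
        # all 1s only means "possibly in the set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=18, k=3)
for element in ("x", "y", "z"):
    bf.add(element)

print(all(bf.might_contain(e) for e in ("x", "y", "z")))  # True: no false negatives
print(bf.might_contain("w"))  # usually False, but a false positive is possible
```

With only 18 bits the filter is deliberately tiny, so a query for an element that was never added can occasionally report a false positive.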
been assumed.
Now test membership of an element that is not in the set. Each of the k array positions computed by the hash
functions is 1 with a probability as above. The probability of all of them being 1, which would cause the algorithm to
erroneously claim that the element is in the set, is often given as
$$\left(1 - \left[1 - \frac{1}{m}\right]^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k.$$
This is not strictly correct, as it assumes independence for the probabilities of each bit being set. However, assuming
it is a close approximation, the probability of false positives decreases as m (the number of bits in the
array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k
(the number of hash functions) that minimizes the probability is
$$k = \frac{m}{n} \ln 2,$$
which gives
$$2^{-k} \approx 0.6185^{m/n}.$$
The required number of bits m, given n (the number of inserted elements) and a desired false positive probability p
(and assuming the optimal value of k is used) can be computed by substituting the optimal value of k in the
probability expression above:
$$m = -\frac{n \ln p}{(\ln 2)^2}.$$
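As a worked example, the sizing calculations can be sketched in Python. The sketch assumes the usual approximations p ≈ (1 − e^(−kn/m))^k, k = (m/n) ln 2 and m = −n ln p / (ln 2)²; the function names are illustrative and not taken from any library.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Approximate false positive probability (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m: int, n: int) -> float:
    """Real-valued number of hash functions k = (m/n) ln 2."""
    return (m / n) * math.log(2)


def required_bits(n: int, p: float) -> float:
    """Number of bits m = -n ln p / (ln 2)^2 for n elements at target rate p."""
    return -n * math.log(p) / (math.log(2) ** 2)


n, p = 1000, 0.01
m = math.ceil(required_bits(n, p))
k = round(optimal_k(m, n))  # k must be an integer in practice

print(m / n)                         # about 9.6 bits per element
print(k)                             # 7
print(false_positive_rate(m, n, k))  # close to the 1% target
```

Because k has to be rounded to an integer, the achieved rate is close to, but not exactly, the requested p.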
This means that for a given false positive probability p, the length of a Bloom filter m is proportional to the number
of elements being filtered n. While the above formula is asymptotic (i.e. applicable as m, n → ∞), the agreement with
finite values of m, n is also quite good; the false positive probability for a finite Bloom filter with m bits, n elements,
and k hash functions is at most
$$\left(1 - e^{-k(n+0.5)/(m-1)}\right)^k.$$
So we can use the asymptotic formula if we pay a penalty for at most half an extra element and at most one fewer bit.
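The number of items in a Bloom filter can be approximated as
$$X^* = -\frac{N \ln\left[1 - \frac{X}{N}\right]}{k},$$
where $X^*$ is an estimate of the number of items in the filter, $N$ is the length of the filter, $k$ is the number of hash
functions per item, and $X$ is the number of bits set to one.
For two Bloom filters of length $N$ in which $A$ and $B$ bits respectively are set to one, the individual counts can be
estimated in the same way as
$$A^* = -\frac{N \ln\left[1 - \frac{A}{N}\right]}{k}$$
and
$$B^* = -\frac{N \ln\left[1 - \frac{B}{N}\right]}{k}.$$
The size of their union can be estimated as
$$A^* \cup B^* = -\frac{N \ln\left[1 - \frac{A \cup B}{N}\right]}{k},$$
where $A \cup B$ is the number of bits set to one in either of the two Bloom filters. The intersection can then be
estimated as
$$A^* \cap B^* = A^* + B^* - A^* \cup B^*,$$
using the three formulas together.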
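The estimates above can be sketched in Python. The sketch assumes the estimate −N ln(1 − X/N) / k for a filter of length N with X bits set and k hash functions per item; the function names and the representation of a filter as a list of 0/1 values are choices made only for this example.

```python
import math


def estimate_count(bits_set: int, length: int, k: int) -> float:
    """Estimate the number of inserted items from the number of 1 bits.

    Note: this raises ValueError when every bit is set (bits_set == length),
    because the estimate is unbounded for a saturated filter.
    """
    return -length * math.log(1.0 - bits_set / length) / k


def estimate_sizes(a, b, k):
    """Return estimates (|A|, |B|, |A union B|, |A intersect B|) for two filters.

    a and b are equal-length sequences of 0/1 values built with the same
    k hash functions.
    """
    n = len(a)
    a_est = estimate_count(sum(a), n, k)
    b_est = estimate_count(sum(b), n, k)
    union_bits = sum(x | y for x, y in zip(a, b))  # bits set in either filter
    union_est = estimate_count(union_bits, n, k)
    return a_est, b_est, union_est, a_est + b_est - union_est


filter_a = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
filter_b = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
print(estimate_sizes(filter_a, filter_b, k=2))
```

The intersection estimate is obtained by inclusion-exclusion, so its error combines the errors of the three other estimates.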
Interesting properties
Unlike a standard hash table, a Bloom filter of a fixed size can represent a set with an arbitrarily large number of
elements; adding an element never fails due to the data structure "filling up." However, the false positive rate
increases steadily as elements are added until all bits in the filter are set to 1, at which point all queries yield a
positive result.
Union and intersection of Bloom filters with the same size and set of hash functions can be implemented with
bitwise OR and AND operations, respectively. The union operation on Bloom filters is lossless in the sense that
the resulting Bloom filter is the same as the Bloom filter created from scratch using the union of the two sets. The
intersect operation satisfies a weaker property: the false positive probability in the resulting Bloom filter is at
most the false-positive probability in one of the constituent Bloom filters, but may be larger than the false positive
probability in the Bloom filter created from scratch using the intersection of the two sets. There are also more
accurate estimates of intersection and union that are not biased in this way.[citation needed]
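The behaviour of bitwise OR and AND on two filters can be illustrated with a short Python sketch. The filter size `M`, hash count `K`, the salted SHA-256 hashing, and the helper names are assumptions for the example; each filter is stored as a Python integer used as a bit mask.

```python
import hashlib

M, K = 64, 3  # illustrative filter size (bits) and number of hash functions


def positions(item: str):
    """K array positions for an item (salted SHA-256 stands in for K hash functions)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()[:8], "big") % M
        for i in range(K)
    ]


def make_filter(items) -> int:
    """Build a Bloom filter for the given items as an integer bit mask."""
    bits = 0
    for item in items:
        for pos in positions(item):
            bits |= 1 << pos
    return bits


a = {"apple", "banana", "cherry"}
b = {"cherry", "date"}
fa, fb = make_filter(a), make_filter(b)

# Union: OR of the filters equals the filter built from the union of the sets.
print((fa | fb) == make_filter(a | b))  # True

# Intersection: AND contains every bit of the filter built from the intersection,
# but may contain extra bits, so its false positive rate can be higher.
exact = make_filter(a & b)
print((fa & fb) & exact == exact)  # True
```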
Some kinds of superimposed code can be seen as a Bloom filter implemented with physical edge-notched cards.
Examples
Google BigTable and Apache Cassandra use Bloom filters to reduce the disk lookups for non-existent rows or
columns. Avoiding costly disk lookups considerably increases the performance of a database query operation.
The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against
a local Bloom filter, and a full check of the URL is performed only upon a hit.[1]
The Squid Web Proxy Cache uses Bloom filters for cache digests [2].
Bitcoin uses Bloom filters to verify payments without running a full network node.[3][4]
The Venti archival storage system uses Bloom filters to detect previously stored data.[5]
The SPIN model checker uses Bloom filters to track the reachable state space for large verification problems.[6]
The Cascading analytics framework uses Bloom filters to speed up asymmetric joins, where one of the joined data
sets is significantly larger than the other (often called Bloom join in the database literature).[7]
Alternatives
Classic Bloom filters use $1.44 \log_2(1/\epsilon)$ bits of space per inserted key, where $\epsilon$ is the false positive rate of the
Bloom filter. However, the space that is strictly necessary for any data structure playing the same role as a Bloom
filter is only $\log_2(1/\epsilon)$ per key (Pagh, Pagh & Rao 2005). Hence Bloom filters use 44% more space than a
hypothetical equivalent optimal data structure. The number of hash functions used to achieve a given false positive
rate $\epsilon$ is proportional to $\log(1/\epsilon)$, which is not optimal, as it has been proved that an optimal data structure would need
only a constant number of hash functions independent of the false positive rate.
Stern & Dill (1996) describe a probabilistic structure based on hash tables, hash compaction, which Dillinger &
Manolios (2004b) identify as significantly more accurate than a Bloom filter when each is configured optimally.
Dillinger and Manolios, however, point out that the reasonable accuracy of any given Bloom filter over a wide range
of numbers of additions makes it attractive for probabilistic enumeration of state spaces of unknown size. Hash
compaction is, therefore, attractive when the number of additions can be predicted accurately; however, despite
being very fast in software, hash compaction is poorly suited for hardware because of worst-case linear access time.
Putze, Sanders & Singler (2007) have studied some variants of Bloom filters that are either faster or use less space
than classic Bloom filters. The basic idea of the fast variant is to locate the k hash values associated with each key
into one or two blocks having the same size as the processor's memory cache blocks (usually 64 bytes). This will
presumably improve performance by reducing the number of potential memory cache misses. The proposed variants
have, however, the drawback of using about 32% more space than classic Bloom filters.
The space-efficient variant relies on using a single hash function that generates for each key a value in the range
$[0, n/\epsilon]$, where $\epsilon$ is the requested false positive rate. The sequence of values is then sorted and compressed using
Golomb coding (or some other compression technique) to occupy a space close to $n \log_2(1/\epsilon)$ bits. To query the
Bloom filter for a given key, it will suffice to check if its corresponding value is stored in the Bloom filter.
Decompressing the whole Bloom filter for each query would make this variant totally unusable. To overcome this
problem, the sequence of values is divided into small blocks of equal size that are compressed separately. At query
time only half a block will need to be decompressed on average. Because of decompression overhead, this variant
may be slower than classic Bloom filters, but this may be compensated by the fact that only a single hash function needs to
be computed.
Another alternative to the classic Bloom filter is one based on space-efficient variants of cuckoo hashing. In this case,
once the hash table is constructed, the keys stored in the hash table are replaced with short signatures of the keys.
Those signatures are strings of bits computed using a hash function applied to the keys.
Data synchronization
Bloom filters can be used for approximate data synchronization as in Byers et al. (2004). Counting Bloom filters can
be used to approximate the number of differences between two sets and this approach is described in Agarwal &
Trachtenberg (2006).
Bloomier filters
Chazelle et al. (2004) designed a generalization of Bloom filters that could associate a value with each element that
had been inserted, implementing an associative array. Like Bloom filters, these structures achieve a small space
overhead by accepting a small probability of false positives. In the case of "Bloomier filters", a false positive is
defined as returning a result when the key is not in the map. The map will never return the wrong value for a key that
is in the map.
Compact approximators
Boldi & Vigna (2005) proposed a lattice-based generalization of Bloom filters. A compact approximator associates
to each key an element of a lattice (the standard Bloom filters being the case of the Boolean two-element lattice).
Instead of a bit array, they have an array of lattice elements. When adding a new association between a key and an
element of the lattice, they compute the maximum of the current contents of the k array locations associated to the
key with the lattice element. When reading the value associated to a key, they compute the minimum of the values
found in the k locations associated to the key. The resulting value approximates from above the original value.
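A compact approximator can be sketched in Python for one simple lattice. The sketch assumes the lattice of non-negative integers ordered by ≤, with 0 as the bottom element, `max` as join and `min` as meet; the class and method names and the salted SHA-256 hashing are illustrative assumptions, not the construction from the cited paper.

```python
import hashlib


class CompactApproximator:
    """Array of lattice elements indexed by k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.cells = [0] * m  # every cell starts at the bottom element, here 0

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def put(self, key: str, value: int) -> None:
        # Each of the k cells becomes the maximum of its contents and the value.
        for pos in self._positions(key):
            self.cells[pos] = max(self.cells[pos], value)

    def get(self, key: str) -> int:
        # The minimum over the k cells is never below the stored value.
        return min(self.cells[pos] for pos in self._positions(key))


approx = CompactApproximator(m=64, k=3)
approx.put("a", 5)
approx.put("b", 2)
print(approx.get("a") >= 5)  # True: the result approximates from above
print(approx.get("b") >= 2)  # True
```

With the two-element lattice {0, 1} this reduces to an ordinary Bloom filter, where `put(key, 1)` adds a key and `get(key) == 1` is the membership test.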
By using attenuated Bloom filters consisting of multiple layers, services at more than one hop distance can be
discovered while avoiding saturation of the Bloom filter by attenuating (shifting out) bits set by sources further
away.
Notes
[1]
[2]
[3]
[4]
[5]
[6]
[7]
References
Koucheryavy, Y.; Giambene, G.; Staehle, D.; Barcelo-Arroyo, F.; Braun, T.; Siris, V. (2009), "Traffic and QoS
Management in Wireless Multimedia Networks", COST 290 Final Report (USA): 111
Kubiatowicz, J.; Bindel, D.; Czerwinski, Y.; Geels, S.; Eaton, D.; Gummadi, R.; Rhea, S.; Weatherspoon, H. et al.
(2000), "Oceanstore: An architecture for global-scale persistent storage" (http://ftp.csd.uwo.ca/courses/
CS9843b/papers/OceanStore.pdf), ACM SIGPLAN Notices (USA): 190–201
Agarwal, Sachin; Trachtenberg, Ari (2006), "Approximating the number of differences between remote sets"
(http://www.deutsche-telekom-laboratories.de/~agarwals/publications/itw2006.pdf), IEEE Information
Theory Workshop (Punta del Este, Uruguay): 217, doi: 10.1109/ITW.2006.1633815 (http://dx.doi.org/10.
1109/ITW.2006.1633815), ISBN 1-4244-0035-X
Ahmadi, Mahmood; Wong, Stephan (2007), "A Cache Architecture for Counting Bloom Filters" (http://www.
ieeexplore.ieee.org/xpls/abs_all.jsp?isnumber=4444031&arnumber=4444089&count=113&index=57), 15th
international Conference on Networks (ICON-2007), p. 218, doi: 10.1109/ICON.2007.4444089 (http://dx.doi.
org/10.1109/ICON.2007.4444089), ISBN 978-1-4244-1229-7
Almeida, Paulo; Baquero, Carlos; Preguica, Nuno; Hutchison, David (2007), "Scalable Bloom Filters" (http://
gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf), Information Processing Letters 101 (6): 255–261, doi:
10.1016/j.ipl.2006.10.007 (http://dx.doi.org/10.1016/j.ipl.2006.10.007)
Byers, John W.; Considine, Jeffrey; Mitzenmacher, Michael; Rost, Stanislav (2004), "Informed content delivery
across adaptive overlay networks", IEEE/ACM Transactions on Networking 12 (5): 767, doi:
10.1109/TNET.2004.836103 (http://dx.doi.org/10.1109/TNET.2004.836103)
Bloom, Burton H. (1970), "Space/Time Trade-offs in Hash Coding with Allowable Errors" (https://dl.acm.org/
citation.cfm?doid=362686.362692), Communications of the ACM 13 (7): 422–426, doi: 10.1145/362686.362692
(http://dx.doi.org/10.1145/362686.362692)
Boldi, Paolo; Vigna, Sebastiano (2005), "Mutable strings in Java: design, implementation and lightweight
text-search algorithms", Science of Computer Programming 54 (1): 3–23, doi: 10.1016/j.scico.2004.05.003
(http://dx.doi.org/10.1016/j.scico.2004.05.003)
Bonomi, Flavio; Mitzenmacher, Michael; Panigrahy, Rina; Singh, Sushil; Varghese, George (2006), "An
Improved Construction for Counting Bloom Filters" (http://theory.stanford.edu/~rinap/papers/esa2006b.pdf),
Algorithms – ESA 2006, 14th Annual European Symposium, Lecture Notes in Computer Science 4168,
pp. 684–695, doi: 10.1007/11841036_61 (http://dx.doi.org/10.1007/11841036_61), ISBN 978-3-540-38875-3
Broder, Andrei; Mitzenmacher, Michael (2005), "Network Applications of Bloom Filters: A Survey" (http://
www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf), Internet Mathematics 1 (4): 485–509, doi:
10.1080/15427951.2004.10129096 (http://dx.doi.org/10.1080/15427951.2004.10129096)
Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson; Wallach, Deborah; Burrows, Mike; Chandra,
Tushar; Fikes, Andrew et al. (2006), "Bigtable: A Distributed Storage System for Structured Data" (http://
research.google.com/archive/bigtable.html), Seventh Symposium on Operating System Design and
Implementation
Charles, Denis; Chellapilla, Kumar (2008), "Bloomier Filters: A second look", The Computing Research
Repository (CoRR), arXiv: 0807.0928 (http://arxiv.org/abs/0807.0928)
Chazelle, Bernard; Kilian, Joe; Rubinfeld, Ronitt; Tal, Ayellet (2004), "The Bloomier filter: an efficient data
structure for static support lookup tables" (http://www.ee.technion.ac.il/~ayellet/Ps/nelson.pdf),
Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 30–39
Cohen, Saar; Matias, Yossi (2003), "Spectral Bloom Filters" (http://www.sigmod.org/sigmod03/eproceedings/
papers/r09p02.pdf), Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data,
pp. 241–252, doi: 10.1145/872757.872787 (http://dx.doi.org/10.1145/872757.872787), ISBN 158113634X
Deng, Fan; Rafiei, Davood (2006), "Approximately Detecting Duplicates for Streaming Data using Stable Bloom
Filters" (http://webdocs.cs.ualberta.ca/~drafiei/papers/DupDet06Sigmod.pdf), Proceedings of the ACM
SIGMOD Conference, pp. 25–36
Dharmapurikar, Sarang; Song, Haoyu; Turner, Jonathan; Lockwood, John (2006), "Fast packet classification
using Bloom filters" (http://www.arl.wustl.edu/~sarang/ancs6819-dharmapurikar.pdf), Proceedings of the
2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pp. 61–70, doi:
10.1145/1185347.1185356 (http://dx.doi.org/10.1145/1185347.1185356), ISBN 1595935800
Dietzfelbinger, Martin; Pagh, Rasmus (2008), "Succinct Data Structures for Retrieval and Approximate
Membership", The Computing Research Repository (CoRR), arXiv: 0803.3693 (http://arxiv.org/abs/0803.
3693)
Swamidass, S. Joshua; Baldi, Pierre (2007), "Mathematical correction for fingerprint similarity measures to
improve chemical retrieval", Journal of chemical information and modeling (ACS Publications) 47 (3): 952–964
Dillinger, Peter C.; Manolios, Panagiotis (2004a), "Fast and Accurate Bitstate Verification for SPIN" (http://
www.ccs.neu.edu/home/pete/research/spin-3spin.html), Proceedings of the 11th International Spin
Workshop on Model Checking Software, Springer-Verlag, Lecture Notes in Computer Science 2989
Dillinger, Peter C.; Manolios, Panagiotis (2004b), "Bloom Filters in Probabilistic Verification" (http://www.ccs.
neu.edu/home/pete/research/bloom-filters-verification.html), Proceedings of the 5th International Conference
on Formal Methods in Computer-Aided Design, Springer-Verlag, Lecture Notes in Computer Science 3312
Donnet, Benoit; Baynat, Bruno; Friedman, Timur (2006), "Retouched Bloom Filters: Allowing Networked
Applications to Flexibly Trade Off False Positives Against False Negatives" (http://www.adetti.iscte.pt/
events/CONEXT06/Conext06_Proceedings/papers/13.html), CoNEXT 06 – 2nd Conference on Future
Networking Technologies
Eppstein, David; Goodrich, Michael T. (2007), "Space-efficient straggler identification in round-trip data streams
via Newton's identities and invertible Bloom filters", Algorithms and Data Structures, 10th International
Workshop, WADS 2007, Springer-Verlag, Lecture Notes in Computer Science 4619, pp. 637–648, arXiv:
0704.3313 (http://arxiv.org/abs/0704.3313)
Fan, Li; Cao, Pei; Almeida, Jussara; Broder, Andrei (2000), "Summary Cache: A Scalable Wide-Area Web Cache
Sharing Protocol", IEEE/ACM Transactions on Networking 8 (3): 281–293, doi: 10.1109/90.851975 (http://dx.
doi.org/10.1109/90.851975). A preliminary version appeared at SIGCOMM '98.
Goel, Ashish; Gupta, Pankaj (2010), "Small subset queries and bloom filters using ternary associative memories,
with applications", ACM Sigmetrics 2010 38: 143, doi: 10.1145/1811099.1811056 (http://dx.doi.org/10.1145/
1811099.1811056)
Kirsch, Adam; Mitzenmacher, Michael (2006), "Less Hashing, Same Performance: Building a Better Bloom
Filter" (http://www.eecs.harvard.edu/~kirsch/pubs/bbbf/esa06.pdf), in Azar, Yossi; Erlebach, Thomas,
Algorithms – ESA 2006, 14th Annual European Symposium, Lecture Notes in Computer Science 4168,
Springer-Verlag, pp. 456–467, doi: 10.1007/11841036 (http://dx.doi.org/10.1007/11841036), ISBN 978-3-540-38875-3
Mortensen, Christian Worm; Pagh, Rasmus; Pătrașcu, Mihai (2005), "On dynamic range reporting in one
dimension", Proceedings of the Thirty-seventh Annual ACM Symposium on Theory of Computing, pp. 104–111,
doi: 10.1145/1060590.1060606 (http://dx.doi.org/10.1145/1060590.1060606), ISBN 1581139608
Pagh, Anna; Pagh, Rasmus; Rao, S. Srinivasa (2005), "An optimal Bloom filter replacement" (http://www.it-c.
dk/people/pagh/papers/bloom.pdf), Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete
Algorithms, pp. 823–829
Porat, Ely (2008), "An Optimal Bloom Filter Replacement Based on Matrix Solving", The Computing Research
Repository (CoRR), arXiv: 0804.1845 (http://arxiv.org/abs/0804.1845)
Putze, F.; Sanders, P.; Singler, J. (2007), "Cache-, Hash- and Space-Efficient Bloom Filters" (http://algo2.iti.
uni-karlsruhe.de/singler/publications/cacheefficientbloomfilters-wea2007.pdf), in Demetrescu, Camil,
Experimental Algorithms, 6th International Workshop, WEA 2007, Lecture Notes in Computer Science 4525,
Springer-Verlag, pp. 108–121, doi: 10.1007/978-3-540-72845-0 (http://dx.doi.org/10.1007/978-3-540-72845-0),
ISBN 978-3-540-72844-3
Sethumadhavan, Simha; Desikan, Rajagopalan; Burger, Doug; Moore, Charles R.; Keckler, Stephen W. (2003),
"Scalable hardware memory disambiguation for high ILP processors" (http://www.cs.utexas.edu/users/simha/
publications/lsq.pdf), 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003,
MICRO-36, pp. 399–410, doi: 10.1109/MICRO.2003.1253244 (http://dx.doi.org/10.1109/MICRO.2003.
1253244), ISBN 0-7695-2043-X
Shanmugasundaram, Kulesh; Brönnimann, Hervé; Memon, Nasir (2004), "Payload attribution via hierarchical
Bloom filters", Proceedings of the 11th ACM Conference on Computer and Communications Security, pp. 31–41,
doi: 10.1145/1030083.1030089 (http://dx.doi.org/10.1145/1030083.1030089), ISBN 1581139616
Starobinski, David; Trachtenberg, Ari; Agarwal, Sachin (2003), "Efficient PDA Synchronization", IEEE
Transactions on Mobile Computing 2 (1): 40, doi: 10.1109/TMC.2003.1195150 (http://dx.doi.org/10.1109/
TMC.2003.1195150)
Stern, Ulrich; Dill, David L. (1996), "A New Scheme for Memory-Efficient Probabilistic Verification",
Proceedings of Formal Description Techniques for Distributed Systems and Communication Protocols, and
Protocol Specification, Testing, and Verification: IFIP TC6/WG6.1 Joint International Conference, Chapman &
Hall, IFIP Conference Proceedings, pp. 333–348, CiteSeerX: 10.1.1.47.4101 (http://citeseerx.ist.psu.edu/
viewdoc/summary?doi=10.1.1.47.4101)
Haghighat, Mohammad Hashem; Tavakoli, Mehdi; Kharrazi, Mehdi (2013), "Payload Attribution via Character
Dependent Multi-Bloom Filters" (http://dx.doi.org/10.1109/TIFS.2013.2252341), Transaction on
Information Forensics and Security, IEEE 99, doi: 10.1109/TIFS.2013.2252341 (http://dx.doi.org/10.1109/
TIFS.2013.2252341)
Mitzenmacher, M.; E. Upfal (2005), Probability and computing: Randomized algorithms and probabilistic
analysis (http://books.google.de/books?hl=de&lr=&id=0bAYl6d7hvkC&oi=fnd&pg=PR13&
dq=mitzenmacher&ots=onO_txF9vU&sig=mOuue3kUaWB4QDwJgZj7R4NjQPo), Cambridge University
Press, pp. 107–112
Mullin, James K. (1990), "Optimal semijoins for distributed database systems", Software Engineering, IEEE
Transactions on 16 (5): 558–560
External links
Why Bloom filters work the way they do (Michael Nielsen, 2012) (http://www.michaelnielsen.org/ddi/
why-bloom-filters-work-the-way-they-do/)
Table of false-positive rates for different configurations (http://www.cs.wisc.edu/~cao/papers/
summary-cache/node8.html) from a University of Wisconsin–Madison website
Interactive Processing demonstration (http://tr.ashcan.org/2008/12/bloomers.html) from ashcan.org
"More Optimal Bloom Filters," Ely Porat (Nov/2007) Google TechTalk video (http://www.youtube.com/
watch?v=947gWqwkhu0) on YouTube
"Using Bloom Filters" (http://www.perl.com/pub/2004/04/08/bloom_filters.html) Detailed Bloom Filter
explanation using Perl
"A Garden Variety of Bloom Filters (http://matthias.vallentin.net/blog/2011/06/
a-garden-variety-of-bloom-filters/) - Explanation and Analysis of Bloom filter variants
Implementations
Implementation in C (http://en.literateprograms.org/Bloom_filter_(C)) from literateprograms.org
Implementation in C++ and Object Pascal (http://www.partow.net/programming/hashfunctions/index.html)
from partow.net
MinHash
In computer science, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a
technique for quickly estimating how similar two sets are. The scheme was invented by Andrei Broder (1997), and
initially used in the AltaVista search engine to detect duplicate web pages and eliminate them from search results. It
has also been applied in large-scale clustering problems, such as clustering documents by the similarity of their sets
of words.
The Jaccard similarity coefficient of two sets A and B is defined as
J(A,B) = |A ∩ B| / |A ∪ B|.
It is a number between 0 and 1; it is 0 when the two sets are disjoint, 1 when they are equal, and strictly between 0
and 1 otherwise. It is a commonly used indicator of the similarity between two sets: two sets are more similar when
their Jaccard index is closer to 1, and more dissimilar when their Jaccard index is closer to 0.
Let h be a hash function that maps the members of A and B to distinct integers, and for any set S define h_min(S) to be
the member x of S with the minimum value of h(x). Then h_min(A) = h_min(B) exactly when the minimum hash value of
the union A ∪ B lies in the intersection A ∩ B. Therefore,
Pr[h_min(A) = h_min(B)] = J(A,B).
In other words, if r is a random variable that is one when h_min(A) = h_min(B) and zero otherwise, then r is an unbiased
estimator of J(A,B), although it has too high a variance to be useful on its own. The idea of the MinHash scheme is to
reduce the variance by averaging together several variables constructed in the same way.
Algorithm
Variant with many hash functions
The simplest version of the MinHash scheme uses k different hash functions, where k is a fixed integer parameter, and
represents each set S by the k values of h_min(S) for these k functions.
To estimate J(A,B) using this version of the scheme, let y be the number of hash functions for which h_min(A) =
h_min(B), and use y/k as the estimate. This estimate is the average of k different 0-1 random variables, each of which is
one when h_min(A) = h_min(B) and zero otherwise, and each of which is an unbiased estimator of J(A,B). Therefore,
their average is also an unbiased estimator, and by standard Chernoff bounds for sums of 0-1 random variables, its
expected error is O(1/√k).
Therefore, for any constant ε > 0 there is a constant k = O(1/ε²) such that the expected error of the estimate is at
most ε. For example, 400 hashes would be required to estimate J(A,B) with an expected error less than or equal to
.05.
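The many-hash-function variant can be sketched in Python. The sketch assumes that k hash functions can be simulated by salting SHA-256 with a seed; the function names `h`, `signature` and `estimate_jaccard` are illustrative, and the two example sets are made up for the demonstration.

```python
import hashlib


def h(seed: int, item: str) -> int:
    """Hash function number `seed` (assumption: salted SHA-256 stands in for it)."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(items, k: int):
    """For each of the k hash functions, the member of the set with minimum hash value."""
    return [min(items, key=lambda x: h(seed, x)) for seed in range(k)]


def estimate_jaccard(sig_a, sig_b) -> float:
    """y/k, where y counts the hash functions on which the two minima agree."""
    y = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return y / len(sig_a)


A = {f"w{i}" for i in range(0, 150)}
B = {f"w{i}" for i in range(50, 200)}

exact = len(A & B) / len(A | B)  # 100 / 200 = 0.5
estimate = estimate_jaccard(signature(A, 400), signature(B, 400))
print(exact, estimate)  # the estimate should be close to 0.5
```

Computing a signature costs one hash evaluation per element per hash function, which is why this variant takes time proportional to the set size times k.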
Specifically, let A and B be any two sets, and let h_(k)(S) denote the subset of the k members of S that have the
smallest values of h. Then X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B) is a set of k elements of A ∪
B, and if h is a random function then any subset of k elements is equally likely to be chosen; that is, X is a simple
random sample of A ∪ B. The subset Y = X ∩ h_(k)(A) ∩ h_(k)(B) is the set of members of X that belong to the
intersection A ∩ B. Therefore, |Y|/k is an unbiased estimator of J(A,B). The difference between this estimator and the
estimator produced by multiple hash functions is that Y always has exactly k members, whereas the multiple hash
functions may lead to a smaller number of sampled elements due to the possibility that two different hash functions
may have the same minima. However, when k is small relative to the sizes of the sets, this difference is negligible.
By standard Chernoff bounds for sampling without replacement, this estimator has expected error O(1/√k), matching
the performance of the multiple-hash-function scheme.
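The single-hash-function variant can be sketched in Python as follows. The sketch assumes SHA-256 as the single hash function h and that each set has at least k members; `bottom_k` plays the role of h_(k), and all names are illustrative.

```python
import hashlib
import heapq


def h(item: str) -> int:
    """The single hash function (assumption: SHA-256 truncated to 64 bits)."""
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")


def bottom_k(items, k: int) -> set:
    """h_(k)(S): the k members of S with the smallest hash values."""
    return set(heapq.nsmallest(k, items, key=h))


def estimate_jaccard(sig_a: set, sig_b: set, k: int) -> float:
    x = bottom_k(sig_a | sig_b, k)  # X = h_(k)(h_(k)(A) ∪ h_(k)(B)) = h_(k)(A ∪ B)
    y = x & sig_a & sig_b           # Y = X ∩ h_(k)(A) ∩ h_(k)(B)
    return len(y) / k


A = {f"w{i}" for i in range(0, 150)}
B = {f"w{i}" for i in range(50, 200)}
k = 100

sig_a, sig_b = bottom_k(A, k), bottom_k(B, k)
print(estimate_jaccard(sig_a, sig_b, k))  # should be near the exact value 0.5
```

Only the two k-element signatures are needed at comparison time, so the original sets do not have to be kept.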
Time analysis
The estimator |Y|/k can be computed in time O(k) from the two signatures of the given sets, in either variant of the
scheme. Therefore, when ε and k are constants, the time to compute the estimated similarity from the signatures is
also constant. The signature of each set can be computed in linear time in the size of the set, so when many pairwise
similarities need to be estimated this method can lead to substantial savings in running time compared to doing a
full comparison of the members of each set. Specifically, for set size n the many-hash variant takes O(n k) time. The
single-hash variant is generally faster, requiring O(n log k) time to maintain the sorted list of minima.[citation needed]
different permutations, and therefore that it needs Ω(n) bits to specify a single permutation, still infeasibly large.
Because of this impracticality, two variant notions of min-wise independence have been introduced: restricted
min-wise independent permutation families, and approximate min-wise independent families. Restricted min-wise
independence is the min-wise independence property restricted to certain sets of cardinality at most k. Approximate
min-wise independence has at most a fixed probability of varying from full independence.
Applications
The original applications for MinHash involved clustering and eliminating near-duplicates among web documents,
represented as sets of the words occurring in those documents. Similar techniques have also been used for clustering
and near-duplicate elimination for other types of data, such as images: in the case of image data, an image can be
represented as a set of smaller subimages cropped from it, or as sets of more complex image feature descriptions.[1]
In data mining, Cohen et al. (2001) use MinHash as a tool for association rule learning. Given a database in which
each entry has multiple attributes (viewed as a 0-1 matrix with a row per database entry and a column per attribute)
they use MinHash-based approximations to the Jaccard index to identify candidate pairs of attributes that frequently
co-occur, and then compute the exact value of the index for only those pairs to determine the ones whose frequencies
of co-occurrence are below a given strict threshold.
Related topics
The MinHash scheme may be seen as an instance of locality sensitive hashing, a collection of techniques for using
hash functions to map large sets of objects down to smaller hash values in such a way that, when two objects have a
small distance from each other, their hash values are likely to be the same. In this instance, the signature of a set may
be seen as its hash value. Other locality sensitive hashing techniques exist for Hamming distance between sets and
cosine distance between vectors; locality sensitive hashing has important applications in nearest neighbor search
algorithms. For large distributed systems, and in particular MapReduce, there exist modified versions of MinHash to
help compute similarities with no dependence on the point dimension.
External links
Mining of Massive Datasets, Ch. 3. Finding similar Items [2]
References
[1] ; .
[2] http://infolab.stanford.edu/~ullman/mmds.html
[3] http://moultano.wordpress.com/article/simple-simhashing-3kbzhsxyg4467-6/
[4] http://blogs.msdn.com/b/spt/archive/2008/06/10/set-similarity-and-min-hash.aspx
[5] http://blogs.msdn.com/b/spt/archive/2008/06/11/locality-sensitive-hashing-lsh-and-min-hash.aspx
[6] http://mymagnadata.wordpress.com/2011/01/04/minhash-java-implementation/
[7] https://code.google.com/p/google-all-pairs-similarity-search/
[8] http://reference.wolfram.com/mathematica/guide/DistanceAndSimilarityMeasures.html
[9] https://code.google.com/p/py-nilsimsa/source/browse/trunk/nilsimsa/__init__.py
[10] http://matpalm.com/resemblance/simhash/
above.
Suppose you have a collection of lists and each node of each list contains an object, the name of the list to which it
belongs, and the number of elements in that list. Also assume that the sum of the number of elements in all lists is
$n$ (i.e. there are $n$ elements overall). We wish to be able to merge any two of these lists, and update all of their nodes
so that they still contain the name of the list to which they belong. The rule for merging the lists $A$ and $B$ is that if
$A$ is larger than $B$ then merge the elements of $B$ into $A$ and update the elements that used to belong to $B$, and
vice versa.
Choose an arbitrary element of list $L$, say $x$. We wish to count how many times in the worst case $x$ will need to
have the name of the list to which it belongs updated. The element $x$ will only have its name updated when the list it
belongs to is merged with another list of the same size or of greater size. Each time that happens, the size of the list
to which $x$ belongs at least doubles. So finally, the question is "how many times can a number double before it is
the size of $n$?" (then the list containing $x$ will contain all $n$ elements). The answer is exactly $\log_2 n$. So for
any given element of any given list in the structure described, it will need to be updated $\log_2 n$ times in the worst
case. Therefore updating a list of $n$ elements stored in this way takes $O(n \log n)$ time in the worst case.
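The counting argument can be checked with a small Python sketch of the list-based structure. Dictionaries stand in for the linked lists here, and the class name `ListSets` and its fields are assumptions made for the illustration; the essential rule is that the smaller list is always relabelled.

```python
class ListSets:
    """Each element stores the name of its list; a merge relabels the smaller list."""

    def __init__(self, elements) -> None:
        elements = list(elements)
        self.name = {x: x for x in elements}       # element -> name of its list
        self.members = {x: [x] for x in elements}  # list name -> its elements
        self.updates = {x: 0 for x in elements}    # relabel count per element

    def find(self, x):
        return self.name[x]  # constant time: the name is stored with the element

    def merge(self, x, y) -> None:
        a, b = self.name[x], self.name[y]
        if a == b:
            return
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a  # make a the larger (or equal-sized) list
        for e in self.members[b]:
            self.name[e] = a
            self.updates[e] += 1
        self.members[a].extend(self.members.pop(b))


n = 16
sets = ListSets(range(n))
step = 1
while step < n:  # merge equal-sized lists pairwise, the worst case for relabelling
    for i in range(0, n, 2 * step):
        sets.merge(i, i + step)
    step *= 2

print(max(sets.updates.values()))  # at most log2(16) = 4 relabellings per element
```

Each relabelling of an element at least doubles the size of the list it belongs to, which is why the per-element count cannot exceed log2 n.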
Disjoint-set forests
Disjoint-set forests are data structures where each set is represented by a tree data structure, in which each node
holds a reference to its parent node (see spaghetti stack). They were first described by Bernard A. Galler and Michael
J. Fischer in 1964,[1] although their precise analysis took years.
In a disjoint-set forest, the representative of each set is the root of that set's tree. Find follows parent nodes until it
reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of
implementing these might be:
function MakeSet(x)
    x.parent := x
function Find(x)
    if x.parent == x
        return x
    else
        return Find(x.parent)
function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    xRoot.parent := yRoot
In this naive form, this approach is no better than the linked-list approach, because the tree it creates can be highly
unbalanced; however, it can be enhanced in two ways.
The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree, rather than vice
versa. Since it is the depth of the tree that affects the running time, the tree with smaller depth gets added under the
root of the deeper tree, which only increases the depth if the depths were equal. In the context of this algorithm, the
term rank is used instead of depth since it stops being equal to the depth if path compression (described below) is
also used. One-element trees are defined to have a rank of zero, and whenever two trees of the same rank r are
united, the rank of the result is r+1. Just applying this technique alone yields a worst-case running time of
O(log n) per MakeSet, Union, or Find operation. Pseudocode for the improved MakeSet and Union:
function MakeSet(x)
    x.parent := x
    x.rank := 0
function Union(x, y)
    xRoot := Find(x)
    yRoot := Find(y)
    if xRoot == yRoot
        return
    // x and y are not already in same set. Merge them.
    if xRoot.rank < yRoot.rank
        xRoot.parent := yRoot
    else if xRoot.rank > yRoot.rank
        yRoot.parent := xRoot
    else
        yRoot.parent := xRoot
        xRoot.rank := xRoot.rank + 1
The second improvement, called path compression, is a way of flattening the structure of the tree whenever Find is
used on it. The idea is that each node visited on the way to a root node may as well be attached directly to the root
node; they all share the same representative. To effect this, as Find recursively traverses up the tree, it changes
each node's parent reference to point to the root that it found. The resulting tree is much flatter, speeding up future
operations not only on these elements but on those referencing them, directly or indirectly. Here is the improved
Find:
function Find(x)
    if x.parent != x
        x.parent := Find(x.parent)
    return x.parent
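These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)),
where α(n) is the inverse of the function n = f(x) = A(x, x), and A is the extremely fast-growing Ackermann
function. Since α(n) is the inverse of this function, α(n) is less than 5 for all remotely practical
values of n. Thus, the amortized running time per operation is effectively a small constant.
In fact, this is asymptotically optimal: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by
any disjoint-set data structure per operation on average.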
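The two heuristics can be combined in a few lines. The following is a minimal Python sketch, not a reference implementation: the class name DisjointSet and the dictionary-based storage are assumptions made for illustration, and Find is written iteratively (two passes) rather than recursively so that long parent chains do not hit Python's recursion limit.

```python
class DisjointSet:
    """Disjoint-set forest with union by rank and path compression."""

    def __init__(self):
        self.parent = {}   # element -> parent element
        self.rank = {}     # root element -> rank (upper bound on depth)

    def make_set(self, x):
        # A new element is the root of a one-element tree of rank 0.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0

    def find(self, x):
        # First pass: follow parent references up to the root.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: attach every visited node directly to the root.
        while self.parent[x] != root:
            next_node = self.parent[x]
            self.parent[x] = root
            x = next_node
        return root

    def union(self, x, y):
        """Merge the sets of x and y; return False if already merged."""
        x_root, y_root = self.find(x), self.find(y)
        if x_root == y_root:
            return False
        # Attach the tree of smaller rank under the root of larger rank.
        if self.rank[x_root] < self.rank[y_root]:
            x_root, y_root = y_root, x_root
        self.parent[y_root] = x_root
        if self.rank[x_root] == self.rank[y_root]:
            self.rank[x_root] += 1
        return True
```

After make_set has been called on every element, find(a) == find(b) tests whether a and b are currently in the same set.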
Applications
Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of
an undirected graph. This model can then be used to determine whether two vertices belong to the same component,
or whether adding an edge between them would result in a cycle. The Union-Find algorithm is used in
high-performance implementations of Unification.
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components [2]
functionality. It is also used for implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
Note that the implementation as disjoint-set forests doesn't allow deletion of edges, even without path compression
or the rank heuristic.
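As an illustration of the cycle test, here is a short Python sketch of Kruskal's algorithm. The function name kruskal and the (weight, u, v) edge format are assumptions chosen for this example; for brevity the embedded union-find uses path halving (a one-pass variant of path compression) and omits the rank heuristic.

```python
def kruskal(num_vertices, edges):
    """Return (total_weight, chosen_edges) of a minimum spanning forest.

    edges is a list of (weight, u, v) tuples; vertices are numbered
    0 .. num_vertices - 1.
    """
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total, chosen = 0, []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # u and v lie in different components, so the edge
            # cannot close a cycle: keep it and merge the components.
            parent[root_u] = root_v
            total += weight
            chosen.append((u, v))
    return total, chosen
```

An edge whose endpoints already share a representative is skipped, which is exactly the "would adding this edge create a cycle?" query described above.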
History
While the ideas used in disjoint-set forests have long been familiar, Robert Tarjan was the first to prove the upper
bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function, in 1975. Until this
time the best bound on the time per operation, proven by Hopcroft and Ullman, was O(log* n), the iterated logarithm
of n, another slowly growing function (but not quite as slow as the inverse Ackermann function).
Tarjan and Van Leeuwen also developed one-pass Find algorithms that are more efficient in practice while retaining
the same worst-case complexity.
In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data
structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using
the proof assistant Coq.
References
[1] Bernard A. Galler and Michael J. Fischer (1964). The paper originating disjoint-set forests.
[2] http://www.boost.org/libs/graph/doc/incremental_components.html
External links
C++ implementation (http://www.boost.org/libs/disjoint_sets/disjoint_sets.html), part of the Boost C++
libraries
A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE
Trans. Pattern Anal. Mach. Intell. 26(11): 1452-1458 (2004)
(http://www.lix.polytechnique.fr/~nielsen/Srmjava.java)
Java applet: A Graphical Union-Find Implementation (http://www.cs.unm.edu/~rlpm/499/uf.html), by Rory
L. P. McGuire
Wait-free Parallel Algorithms for the Union-Find Problem (http://citeseer.ist.psu.edu/anderson94waitfree.
html), a 1994 paper by Richard J. Anderson and Heather Woll describing a parallelized version of Union-Find
that never needs to block
Python implementation (http://code.activestate.com/recipes/215912-union-find-data-structure/)
Visual explanation and C# code (http://www.mathblog.dk/disjoint-set-data-structure/)
Partition refinement
In the design of algorithms, partition refinement is a technique for representing a partition of a set as a data
structure that allows the partition to be refined by splitting its sets into a larger number of smaller sets. In that
sense it is dual to the union-find data structure, which also maintains a partition into disjoint sets but in which the
operations merge pairs of sets together. More specifically, a partition refinement algorithm maintains a family of
disjoint sets Si; at the start of the algorithm, this is just a single set containing all the elements in the data
structure. At each step of the algorithm, a set X is presented to the algorithm, and each set Si that contains members
of X is replaced by two sets, the intersection Si ∩ X and the difference Si \ X. Partition refinement forms a key
component of several efficient algorithms on graphs and finite automata.
Data structure
A partition refinement algorithm may be implemented by maintaining an object for each set that stores a collection
of its elements, in a form such as a doubly linked list that allows for rapid deletion, and an object for each element
that points to the set containing it. Alternatively, element identifiers may be stored in an array, ordered by the sets
they belong to, and sets may be represented by start and end indices into this array. With either of these
representations, each set should also have an instance variable that may point to a second set into which it is being
split.
To perform a refinement operation, loop through the elements of X. For each element x, find the set Si containing x,
and check whether a second set for Si has already been formed. If not, create the second set and add Si to a list L of
the sets that are split by the operation. Then, regardless of whether a new second set was formed, remove x from Si
and add it to the second set. In the representation in which all elements are stored in a single array, moving x from
one set to another may be performed by swapping x with the final element of Si and then decrementing the end index
of Si and the start index of the new set. Finally, after all elements of X have been processed in this way, loop through
L, separating each current set Si from the second set that has been split from it, and report both of these sets as newly
formed sets from the refinement operation.
The time to perform the refinement operations in this way is O(|X|), independent of the number of elements or the
total number of sets in the data structure.
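The refinement step can be sketched in Python as follows. This is an illustrative sketch rather than the linked-list or array layout described above: it assumes hash-based sets and dictionaries (so the O(|X|) bound holds in expectation), and the names PartitionRefinement, refine and sets are hypothetical. A set that lies entirely inside X is not reported as split.

```python
class PartitionRefinement:
    """Partition of a fixed universe that can only be refined."""

    def __init__(self, elements):
        self._sets = {0: set(elements)}               # set id -> members
        self._owner = {e: 0 for e in self._sets[0]}   # element -> set id
        self._next_id = 1

    def refine(self, x):
        """Split each set S touched by x into S - x and S & x.

        Returns a list of (id of S - x, id of S & x) pairs for the
        sets that were actually split into two non-empty parts.
        """
        second = {}                    # id of S -> id of its second set
        for e in set(x):
            sid = self._owner.get(e)
            if sid is None:
                continue               # element outside the universe
            if sid not in second:      # first hit on S: open a second set
                second[sid] = self._next_id
                self._sets[self._next_id] = set()
                self._next_id += 1
            new_id = second[sid]
            self._sets[sid].remove(e)  # move e from S to its second set
            self._sets[new_id].add(e)
            self._owner[e] = new_id
        splits = []
        for sid, new_id in second.items():
            if self._sets[sid]:
                splits.append((sid, new_id))
            else:
                del self._sets[sid]    # S was contained in x: no split
        return splits

    def sets(self):
        return [set(s) for s in self._sets.values()]
```

Only elements of X are touched, and the final loop visits at most one entry per element of X, which mirrors the list L in the description above.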
Applications
Possibly the first application of partition refinement was in an algorithm by Hopcroft (1971) for DFA minimization.
In this problem, one is given as input a deterministic finite automaton, and must find an equivalent automaton with
as few states as possible. The algorithm maintains a partition of the states of the input automaton into subsets, with
the property that any two states in different subsets must be mapped to different states of the output automaton;
initially, there are two subsets, one containing all the accepting states and one containing the remaining states. At
each step one of the subsets Si and one of the input symbols x of the automaton are chosen, and the subsets of states
are refined into states for which a transition labeled x would lead to Si, and states for which an x-transition would
lead somewhere else. When a set Si that has already been chosen is split by a refinement, only one of the two
resulting sets (the smaller of the two) needs to be chosen again; in this way, each state participates in the sets X for
O(s log n) refinement steps and the overall algorithm takes time O(ns log n), where n is the number of initial states
and s is the size of the alphabet.
Partition refinement was applied by Sethi (1976) in an efficient implementation of the Coffman-Graham algorithm
for parallel scheduling. Sethi showed that it could be used to construct a lexicographically ordered topological sort of
a given directed acyclic graph in linear time; this lexicographic topological ordering is one of the key steps of the
Coffman-Graham algorithm. In this application, the elements of the disjoint sets are vertices of the input graph and
the sets X used to refine the partition are sets of neighbors of vertices. Since the total number of neighbors of all
vertices is just the number of edges in the graph, the algorithm takes time linear in the number of edges, its input
size.
Partition refinement also forms a key step in lexicographic breadth-first search, a graph search algorithm with
applications in the recognition of chordal graphs and several other important classes of graphs. Again, the disjoint set
elements are vertices and the sets X represent sets of neighbors, so the algorithm takes linear time.
Priority queues
Priority queue
In computer science, a priority queue is an abstract data type which is like a regular queue or stack data structure,
but where additionally each element has a "priority" associated with it. In a priority queue, an element with high
priority is served before an element with low priority. If two elements have the same priority, they are served
according to their order in the queue.
stack: elements are pulled in last-in first-out order (e.g. a stack of papers)
queue: elements are pulled in first-in first-out order (e.g. a line in a cafeteria)
It is a common misconception that a priority queue is a heap. A priority queue is an abstract concept like "a list" or
"a map"; just as a list can be implemented with a linked list or an array, a priority queue can be implemented with a
heap or a variety of other methods.
A priority queue must at least support the following operations:
insert_with_priority: add an element to the queue with an associated priority
pull_highest_priority_element: remove the element from the queue that has the highest priority, and
return it
This is also known as "pop_element(Off)", "get_maximum_element" or "get_front(most)_element".
Some conventions reverse the order of priorities, considering lower values to be higher priority, so this may
also be known as "get_minimum_element", and is often referred to as "get-min" in the literature.
This may instead be specified as separate "peek_at_highest_priority_element" and "delete_element" functions,
which can be combined to produce "pull_highest_priority_element".
In addition, peek (in this context often called find-max or find-min), which returns the highest-priority element but
does not modify the queue, is very frequently implemented, and nearly always executes in O(1) time. This operation
and its O(1) performance are crucial to many applications of priority queues.
More advanced implementations may support more complicated operations, such as pull_lowest_priority_element,
inspecting the first few highest- or lowest-priority elements, clearing the queue, clearing subsets of the queue,
performing a batch insert, merging two or more queues into one, incrementing priority of any element, etc.
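The minimal interface can be sketched on top of Python's heapq module. This is one possible sketch, not a canonical implementation: the class name and the insertion counter are assumptions for illustration, priorities are assumed to be numbers (they are negated because heapq provides a min-heap), and the counter makes elements of equal priority come out in insertion order.

```python
import heapq
import itertools


class PriorityQueue:
    """Max-priority queue; equal priorities are served in FIFO order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # insertion sequence number

    def insert_with_priority(self, item, priority):
        # Negate the priority so the largest one sits at the heap root;
        # the counter breaks ties and keeps items from being compared.
        heapq.heappush(self._heap, (-priority, next(self._counter), item))

    def peek(self):
        # The root of the heap is always at index 0: O(1).
        return self._heap[0][2]

    def pull_highest_priority_element(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```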
Similarity to queues
One can imagine a priority queue as a modified queue, but when one would get the next element off the queue, the
highest-priority element is retrieved first.
Stacks and queues may be modeled as particular kinds of priority queues. In a stack, the priority of each inserted
element is monotonically increasing; thus, the last element inserted is always the first retrieved. In a queue, the
priority of each inserted element is monotonically decreasing; thus, the first element inserted is always the first
retrieved.
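A small Python sketch makes the correspondence concrete. The class names PQStack and PQQueue are hypothetical. Because heapq is a min-heap, "highest priority" is encoded here as "smallest key": the stack uses a decreasing key (an increasing priority), the queue an increasing key (a decreasing priority).

```python
import heapq


class PQStack:
    """Stack built on a priority queue: newest element has top priority."""

    def __init__(self):
        self._heap, self._count = [], 0

    def push(self, item):
        self._count += 1
        # Priority grows with every push, so the key shrinks.
        heapq.heappush(self._heap, (-self._count, item))

    def pop(self):
        return heapq.heappop(self._heap)[1]


class PQQueue:
    """Queue built on a priority queue: oldest element has top priority."""

    def __init__(self):
        self._heap, self._count = [], 0

    def enqueue(self, item):
        self._count += 1
        # Priority shrinks with every insertion, so the key grows.
        heapq.heappush(self._heap, (self._count, item))

    def dequeue(self):
        return heapq.heappop(self._heap)[1]
```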
Implementation
Naive implementations
There are a variety of simple, usually inefficient, ways to implement a priority queue. They provide an analogy to
help one understand what a priority queue is. For instance, one can keep all the elements in an unsorted list.
Whenever the highest-priority element is requested, search through all elements for the one with the highest priority.
(In big O notation: O(1) insertion time, O(n) pull time due to search.)
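The unsorted-list approach might look like the following Python sketch (the class name UnsortedListPQ is an assumption for illustration). Insertion appends in O(1); pulling scans the whole list, and ties go to the element inserted first.

```python
class UnsortedListPQ:
    """Naive priority queue: O(1) insert, O(n) pull."""

    def __init__(self):
        self._items = []

    def insert_with_priority(self, item, priority):
        self._items.append((priority, item))          # O(1)

    def pull_highest_priority_element(self):
        # Linear search for the position of the highest priority;
        # max() returns the first maximum, so ties are served FIFO.
        best = max(range(len(self._items)),
                   key=lambda i: self._items[i][0])
        return self._items.pop(best)[1]
```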
Usual implementation
To improve performance, priority queues typically use a heap as their backbone, giving O(log n) performance for
inserts and removals, and O(n) to build initially. Alternatively, when a self-balancing binary search tree is used,
insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n
log n) time; this is typical where one might already have access to these data structures, such as with third-party or
standard libraries.
Note that from a computational-complexity standpoint, priority queues are congruent to sorting algorithms. See the
next section for how efficient sorting algorithms can create efficient priority queues.
There are several specialized heap data structures that either supply additional operations or outperform these
approaches. The binary heap uses O(log n) time for both operations, but also allows queries of the element of highest
priority without removing it in constant time. Binomial heaps add several more operations, but require O(log n) time
for requests. Fibonacci heaps can insert elements, query the highest priority element, and increase an element's
priority in amortized constant time[1] though deletions are still O(log n). Brodal queues can do this in worst-case
constant time.
While relying on a heap is a common way to implement priority queues, for integer data, faster implementations
exist. This can even apply to data-types that have a finite range, such as floats:
When the set of keys is {1, 2, ..., C}, a van Emde Boas tree would support the minimum, maximum, insert, delete,
search, extract-min, extract-max, predecessor and successor operations in O(log log C) time, but has a space
cost for small queues of about O(2^(m/2)), where m is the number of bits in the priority value.[2]
The Fusion tree algorithm by Fredman and Willard implements the minimum operation in O(1) time and insert
and extract-min operations in O(√(log n)) time.[3]
For applications that do many "peek" operations for every "extract-min" operation, the time complexity for peek
actions can be reduced to O(1) in all tree and heap implementations by caching the highest priority element after
every insertion and removal. For insertion, this adds at most a constant cost, since the newly inserted element is
compared only to the previously cached minimum element. For deletion, this at most adds an additional "peek" cost,
which is typically cheaper than the deletion cost, so overall time complexity is not significantly impacted.
Libraries
A priority queue is often considered to be a "container data structure".
The Standard Template Library (STL), and the C++ 1998 standard, specifies priority_queue as one of the
STL container adaptor class templates. It implements a max-priority-queue. Unlike actual STL containers, it does not
allow iteration of its elements (it strictly adheres to its abstract data type definition). STL also has utility functions
for manipulating another random-access container as a binary max-heap. The Boost (C++ libraries) also have an
implementation in the library heap.
Python's heapq [6] module implements a binary min-heap on top of a list.
Java's library contains a PriorityQueue [7] class, which implements a min-priority-queue.
Go's library contains a container/heap [8] package, which provides heap operations for any type that implements
its interface.
The Standard PHP Library extension contains the class SplPriorityQueue [9].
Apple's Core Foundation framework contains a CFBinaryHeap [10] structure, which implements a min-heap.
Applications
Bandwidth management
Priority queuing can be used to manage limited resources such as bandwidth on a transmission line from a network
router. In the event of outgoing traffic queuing due to insufficient bandwidth, all other queues can be halted to send
the traffic from the highest priority queue upon arrival. This ensures that the prioritized traffic (such as real-time
traffic, e.g. an RTP stream of a VoIP connection) is forwarded with the least delay and the least likelihood of being
rejected due to a queue reaching its maximum capacity. All other traffic can be handled when the highest priority
queue is empty. Another approach used is to send disproportionately more traffic from higher priority queues.
Many modern protocols for Local Area Networks also include the concept of Priority Queues at the Media Access
Control (MAC) sub-layer to ensure that high-priority applications (such as VoIP or IPTV) experience lower latency
than other applications which can be served with Best effort service. Examples include IEEE 802.11e (an
amendment to IEEE 802.11 which provides Quality of Service) and ITU-T G.hn (a standard for high-speed local
area networks using existing home wiring such as power lines, phone lines and coaxial cables).
Usually a limitation (policer) is set to limit the bandwidth that traffic from the highest priority queue can take, in
order to prevent high priority packets from choking off all other traffic. This limit is usually never reached due to
high level control instances such as the Cisco Callmanager, which can be programmed to inhibit calls which would
exceed the programmed bandwidth limit.
Dijkstra's algorithm
When the graph is stored in the form of an adjacency list or matrix, a priority queue can be used to extract the
minimum efficiently when implementing Dijkstra's algorithm, although one also needs the ability to alter the priority
of a particular vertex in the priority queue efficiently.
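The following Python sketch shows one common way to use a binary heap here. Python's heapq has no operation for altering a priority, so this sketch substitutes "lazy deletion": a vertex is pushed again with its improved distance and outdated entries are skipped when popped. The function name and the adjacency-list format (a dict mapping each vertex to a list of (neighbor, weight) pairs with non-negative weights) are assumptions for illustration.

```python
import heapq


def dijkstra(graph, source):
    """Return a dict of shortest distances from source.

    graph maps each vertex to a list of (neighbor, weight) pairs;
    weights must be non-negative.
    """
    infinity = float("inf")
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)        # extract the minimum
        if d > dist.get(u, infinity):
            continue                      # stale entry, already improved
        for v, w in graph.get(u, ()):
            candidate = d + w
            if candidate < dist.get(v, infinity):
                dist[v] = candidate
                # Instead of decreasing v's key, push a fresh entry.
                heapq.heappush(heap, (candidate, v))
    return dist
```

With lazy deletion the heap may hold up to one entry per edge, so the running time is O(E log E) rather than the bound obtainable with a true decrease-key operation.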
Huffman coding
Huffman coding requires one to repeatedly obtain the two lowest-frequency trees. A priority queue makes this
efficient.
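A minimal Python sketch of this use follows. It assumes that symbols are not tuples (for example, single characters), because tuples are used here to represent internal tree nodes; the function name huffman_codes and the tie-breaking counter are also choices made for this illustration.

```python
import heapq
import itertools


def huffman_codes(freq):
    """Map each symbol in freq (symbol -> frequency) to a bit string."""
    tie = itertools.count()   # unique tie-breaker; trees are never compared
    heap = [(f, next(tie), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        # Repeatedly take the two lowest-frequency trees and join them.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):       # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                             # leaf: a symbol
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes
```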
References
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and
McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 20: Fibonacci Heaps, pp. 476-497. Third edition p. 518.
[2] P. van Emde Boas. Preserving order in a forest in less than logarithmic time. In Proceedings of the 16th Annual Symposium on Foundations
of Computer Science, pages 75-84. IEEE Computer Society, 1975.
[3] Michael L. Fredman and Dan E. Willard. Surpassing the information theoretic bound with fusion trees. Journal of Computer and System
Sciences, 48(3):533-551, 1994.
[4] Mikkel Thorup. 2007. Equivalence between priority queues and sorting. J. ACM 54, 6, Article 28 (December 2007).
DOI=10.1145/1314690.1314692 (http://doi.acm.org/10.1145/1314690.1314692)
[5] http://courses.csail.mit.edu/6.851/spring07/scribe/lec17.pdf
[6] http://docs.python.org/library/heapq.html
[7] http://download.oracle.com/javase/7/docs/api/java/util/PriorityQueue.html
[8] http://golang.org/pkg/container/heap/
[9] http://us2.php.net/manual/en/class.splpriorityqueue.php
[10] http://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFBinaryHeapRef/Reference/reference.html
Further reading
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Section 6.5: Priority queues,
pp. 138-142.
Variants
2-3 heap
Beap
Binary heap
Binomial heap
Brodal queue
d-ary heap
Fibonacci heap
Leftist heap
Pairing heap
Skew heap
Soft heap
Weak heap
Leaf heap
Radix heap
Randomized meldable heap
Operation     Binary      Binomial    Fibonacci   Pairing     Brodal [2]  RP
find-min      Θ(1)        Θ(1)        Θ(1)        Θ(1)        Θ(1)        Θ(1)
delete-min    Θ(log n)    Θ(log n)    O(log n)*   O(log n)*   O(log n)    O(log n)*
insert        Θ(log n)    O(log n)    Θ(1)        Θ(1)*       Θ(1)        Θ(1)
(*) Amortized time.
Applications
The heap data structure has many applications.
Heapsort: One of the best sorting methods being in-place and with no quadratic worst-case scenarios.
Selection algorithms: A heap allows access to the min or max element in constant time, and other selections (such
as median or kth-element) can be done in sub-linear time on data that is in a heap.
Graph algorithms: By using heaps as internal traversal data structures, run time will be reduced by polynomial
order. Examples of such problems are Prim's minimal spanning tree algorithm and Dijkstra's shortest path
problem.
Full and almost full binary heaps may be represented in a very space-efficient way using an array alone. The first (or
last) element will contain the root. The next two elements of the array contain its children. The next four contain the
four children of the two child nodes, etc. Thus the children of the node at position n would be at positions 2n and
2n + 1 in a one-based array, or 2n + 1 and 2n + 2 in a zero-based array.
Implementations
The C++ Standard Template Library provides the make_heap, push_heap and pop_heap algorithms for
heaps (usually implemented as binary heaps), which operate on arbitrary random access iterators. It treats the
iterators as a reference to an array, and uses the array-to-heap conversion. It also provides the container adaptor
priority_queue, which wraps these facilities in a container-like class. However, there is no standard support
for the decrease/increase-key operation.
The Boost C++ libraries include a heaps library. Unlike the STL it supports decrease and increase operations, and
supports additional types of heap: specifically, it supports d-ary, binomial, Fibonacci, pairing and skew heaps.
The Java 2 platform (since version 1.5) provides the binary heap implementation with class
java.util.PriorityQueue<E> [3] in Java Collections Framework.
Python has a heapq [6] module that implements a priority queue using a binary heap.
PHP has both max-heap (SplMaxHeap) and min-heap (SplMinHeap) as of version 5.3 in the Standard PHP
Library.
Perl has implementations of binary, binomial, and Fibonacci heaps in the Heap [4] distribution available on
CPAN.
The Go library contains a heap [8] package with heap algorithms that operate on an arbitrary type that satisfies a
given interface.
Apple's Core Foundation library contains a CFBinaryHeap [5] structure.
References
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest (1990): Introduction to algorithms. MIT Press / McGraw-Hill.
[2] http://www.cs.au.dk/~gerth/papers/soda96.pdf
[3] http://docs.oracle.com/javase/6/docs/api/java/util/PriorityQueue.html
[4] https://metacpan.org/module/Heap
[5] https://developer.apple.com/library/mac/#documentation/CoreFoundation/Reference/CFBinaryHeapRef/Reference/reference.html
External links
Heap (http://mathworld.wolfram.com/Heap.html) at Wolfram MathWorld
Binary heap
Binary Heap
Type: Tree
Time complexity in big O notation
          Average     Worst case
Space     O(n)        O(n)
Insert                O(log n)
Delete    O(log n)    O(log n)
Heap operations
Both the insert and remove operations modify the heap to conform to the shape property first, by adding or removing
from the end of the heap. Then the heap property is restored by traversing up or down the heap. Both operations take
O(log n) time.
Insert
To add an element to a heap we must perform an up-heap operation (also known as bubble-up, percolate-up, sift-up,
trickle up, heapify-up, or cascade-up), by following this algorithm:
1. Add the element to the bottom level of the heap.
2. Compare the added element with its parent; if they are in the correct order, stop.
3. If not, swap the element with its parent and return to the previous step.
The number of operations required is dependent on the number of levels the new element must rise to satisfy the
heap property, thus the insertion operation has a time complexity of O(log n).
As an example, say we have a max-heap
and we want to add the number 15 to the heap. We first place the 15 in the position marked by the X. However, the
heap property is violated since 15 is greater than 8, so we need to swap the 15 and the 8. So, we have the heap
looking as follows after the first swap:
However the heap property is still violated since 15 is greater than 11, so we need to swap again:
which is a valid max-heap. There is no need to check the children after this. Before we placed 15 on X, the heap was
valid, meaning 11 is greater than 5. If 15 is greater than 11, and 11 is greater than 5, then 15 must be greater than 5,
because of the transitive relation.
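The up-heap steps can be sketched in Python for a max-heap stored in a 0-based list, where the parent of index i is (i - 1) // 2. The function name heap_insert is an assumption for illustration, and the usage note assumes the example heap above is laid out as [11, 5, 8, 3, 4], since the figures are not reproduced here.

```python
def heap_insert(heap, value):
    """Insert value into a max-heap stored in a 0-based list."""
    heap.append(value)                  # step 1: add at the bottom level
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[parent] >= heap[i]:     # step 2: correct order, stop
            break
        # Step 3: swap with the parent and continue from there.
        heap[parent], heap[i] = heap[i], heap[parent]
        i = parent
```

Under that assumed layout, inserting 15 first swaps it with 8 and then with 11, giving [15, 5, 11, 3, 4, 8].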
Delete
The procedure for deleting the root from the heap (effectively extracting the maximum element in a max-heap or the
minimum element in a min-heap) and restoring the properties is called down-heap (also known as bubble-down,
percolate-down, sift-down, trickle down, heapify-down, cascade-down and extract-min/max).
1. Replace the root of the heap with the last element on the last level.
2. Compare the new root with its children; if they are in the correct order, stop.
3. If not, swap the element with one of its children and return to the previous step. (Swap with its smaller child in a
min-heap and its larger child in a max-heap.)
So, if we have the same max-heap as before, we remove the 11 and replace it with the 4.
Now the heap property is violated since 8 is greater than 4. In this case, swapping the two elements, 4 and 8, is
enough to restore the heap property and we need not swap elements further.
The downward-moving node is swapped with the larger of its children in a max-heap (in a min-heap it would be
swapped with its smaller child), until it satisfies the heap property in its new position. This functionality is achieved
by the Max-Heapify function as defined below in pseudocode for an array-backed heap A of length heap_length[A].
Note that "A" is indexed starting at 1, not 0 as is common in many real programming languages.
Max-Heapify (A, i):
    left ← 2i
    right ← 2i + 1
    largest ← i
    if left ≤ heap_length[A] and A[left] > A[largest] then:
        largest ← left
    if right ≤ heap_length[A] and A[right] > A[largest] then:
        largest ← right
    if largest ≠ i then:
        swap A[i] ↔ A[largest]
        Max-Heapify(A, largest)
For the above algorithm to correctly re-heapify the array, the node at index i and its two direct children must violate
the heap property. If they do not, the algorithm will fall through with no change to the array. The down-heap
operation (without the preceding swap) can also be used to modify the value of the root, even when an element is not
being deleted.
In the worst case, the new root has to be swapped with its child on each level until it reaches the bottom level of the
heap, meaning that the delete operation has a time complexity relative to the height of the tree, or O(log n).
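Deleting the root can be sketched in Python as follows, again for a max-heap in a 0-based list (children of index i at 2i + 1 and 2i + 2). The function name heap_extract_max is hypothetical, and the down-heap loop is written iteratively instead of recursively.

```python
def heap_extract_max(heap):
    """Remove and return the largest element of a 0-based max-heap."""
    if not heap:
        raise IndexError("extract from an empty heap")
    top = heap[0]
    last = heap.pop()              # last element on the last level
    if heap:
        heap[0] = last             # step 1: it replaces the root
        i, n = 0, len(heap)
        while True:
            left, right, largest = 2 * i + 1, 2 * i + 2, i
            if left < n and heap[left] > heap[largest]:
                largest = left
            if right < n and heap[right] > heap[largest]:
                largest = right
            if largest == i:       # step 2: correct order, stop
                break
            # Step 3: swap with the larger child and continue below.
            heap[i], heap[largest] = heap[largest], heap[i]
            i = largest
    return top
```

Starting from the list [11, 5, 8, 3, 4], the call returns 11 and leaves [8, 5, 4, 3]: the 4 moves to the root and is swapped once, with 8.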
Building a heap
A heap could be built by successive insertions. This approach requires O(n log n) time because each insertion
takes O(log n) time and there are n elements. However this is not the optimal method. The optimal method starts
by arbitrarily putting the elements on a binary tree, respecting the shape property (the tree could be represented by an
array, see below). Then starting from the lowest level and moving upwards, shift the root of each subtree downward
as in the deletion algorithm until the heap property is restored. More specifically if all the subtrees starting at some
height h (measured from the bottom) have already been "heapified", the trees at height h + 1 can be heapified by
sending their root down along the path of maximum valued children when building a max-heap, or minimum valued
children when building a min-heap. This process takes O(h) operations (swaps) per node. In this method most of
the heapification takes place in the lower levels. Since the height of the heap is ⌊log2 n⌋, the number of nodes at
height h is at most ⌈n / 2^(h+1)⌉. Therefore, the cost of heapifying all subtrees is:
$$\sum_{h=0}^{\lfloor \log_2 n \rfloor} \left\lceil \frac{n}{2^{h+1}} \right\rceil O(h) = O\left(n \sum_{h=0}^{\lfloor \log_2 n \rfloor} \frac{h}{2^{h}}\right) \le O\left(n \sum_{h=0}^{\infty} \frac{h}{2^{h}}\right) = O(n)$$
This uses the fact that the given infinite series h / 2^h converges to 2.
The exact value of the above (the worst-case number of comparisons during the heap construction) is known to be
equal to:
2n − 2 s2(n) − e2(n),
where s2(n) is the sum of all digits of the binary representation of n and e2(n) is the exponent of 2 in the prime
factorization of n.
The Build-Max-Heap function that follows converts an array A which stores a complete binary tree with n nodes to
a max-heap by repeatedly using Max-Heapify in a bottom up manner. It is based on the observation that the array
elements indexed by floor(n/2) + 1, floor(n/2) + 2, ..., n are all leaves for the tree, thus each is a one-element heap.
Build-Max-Heap runs Max-Heapify on each of the remaining tree nodes.
Build-Max-Heap (A):
    heap_length[A] ← length[A]
    for i ← floor(length[A]/2) downto 1 do
        Max-Heapify(A, i)
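A runnable counterpart of the pseudocode, as a Python sketch on a 0-based list: with 0-based indexing the last non-leaf node sits at index n // 2 - 1, so the loop starts there rather than at floor(n/2). The function names mirror the pseudocode, but the iterative form of max_heapify and its explicit length parameter n are choices made for this sketch.

```python
def max_heapify(a, i, n):
    """Sift a[i] down within a[0:n] until the max-heap property holds."""
    while True:
        left, right, largest = 2 * i + 1, 2 * i + 2, i
        if left < n and a[left] > a[largest]:
            largest = left
        if right < n and a[right] > a[largest]:
            largest = right
        if largest == i:
            return
        a[i], a[largest] = a[largest], a[i]
        i = largest


def build_max_heap(a):
    """Rearrange the list a in place into a max-heap in O(n) time."""
    n = len(a)
    # Indices n // 2 .. n - 1 are leaves, i.e. one-element heaps already.
    for i in range(n // 2 - 1, -1, -1):
        max_heapify(a, i, n)
```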
Heap implementation
Heaps are commonly implemented with an array. Any binary tree can be stored in an array, but because a heap
is always an almost complete binary tree, it can be stored compactly. No space is required for pointers; instead,
the parent and children of each node can be found by arithmetic on array indices. These properties make this
heap implementation a simple example of an implicit data structure or Ahnentafel list. Details depend on the
root position, which in turn may depend on constraints of a programming language used for
implementation, or programmer preference. Specifically, sometimes the root is placed at index 1, sacrificing space in
order to simplify arithmetic. The peek operation (find-min or find-max) simply returns the value of the root, and is
thus O(1).
[Figure: A small complete binary tree stored in an array.]
[Figure: Comparison between a binary heap and an array implementation.]
Let n be the number of elements in the heap and i be an arbitrary valid index of the array storing the heap. If the tree
root is at index 0, with valid indices 0 through n − 1, then each element a at index i has
children at indices 2i + 1 and 2i + 2
its parent at index ⌊(i − 1) / 2⌋, where ⌊ ⌋ is the floor function.
Alternatively, if the tree root is at index 1, with valid indices 1 through n, then each element a at index i has
children at indices 2i and 2i + 1
its parent at index ⌊i / 2⌋.
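Both conventions can be written as two small Python helpers. The function names and the root parameter (0 or 1) are assumptions for illustration; Python's // operator is floor division, which matches the floor function used above.

```python
def children(i, root=0):
    """Indices of the two children of node i (root index 0 or 1)."""
    if root == 0:
        return 2 * i + 1, 2 * i + 2
    return 2 * i, 2 * i + 1


def parent(i, root=0):
    """Index of the parent of node i, or None for the root itself."""
    if i == root:
        return None
    return (i - 1) // 2 if root == 0 else i // 2
```

In either convention, parent applied to a child index returned by children(i) gives back i.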
This implementation is used in the heapsort algorithm, where it allows the space in the input array to be reused to
store the heap (i.e. the algorithm is done in-place). The implementation is also useful for use as a Priority queue
where use of a dynamic array allows insertion of an unbounded number of items.
The upheap/downheap operations can then be stated in terms of an array as follows: suppose that the heap property
holds for the indices b, b+1, ..., e. The sift-down function extends the heap property to b−1, b, b+1, ..., e. Only index
i = b−1 can violate the heap property. Let j be the index of the largest child of a[i] (for a max-heap, or the smallest
child for a min-heap) within the range b, ..., e. (If no such index exists because 2i > e then the heap property holds
for the newly extended range and nothing needs to be done.) By swapping the values a[i] and a[j] the heap property
for position i is established. At this point, the only problem is that the heap property might not hold for index j. The
sift-down function is applied tail-recursively to index j until the heap property is established for all elements.
The sift-down function is fast. In each step it only needs two comparisons and one swap. The index value where it is
working doubles in each iteration, so that at most log2 e steps are required.
For big heaps and using virtual memory, storing elements in an array according to the above scheme is inefficient:
(almost) every level is in a different page. B-heaps are binary heaps that keep subtrees in a single page, reducing the
number of pages accessed by up to a factor of ten.[2]
The operation of merging two binary heaps takes Θ(n) for equal-sized heaps. With an array implementation, the
best approach is simply to concatenate the two heap arrays and build a heap from the result.[3] A heap on n elements can
be merged with a heap on k elements using O(log n log k) key comparisons, or, in case of a pointer-based
implementation, in O(log n log k) time.[4] An algorithm for splitting a heap on n elements into two heaps on k and
n − k elements, respectively, based on a new view of heaps as ordered collections of subheaps, was presented by Sack
and Strothotte.[5] The algorithm requires O(log n * log n) comparisons. The view also presents a new and conceptually
simple algorithm for merging heaps. When merging is a common task, a different heap implementation is recommended,
such as binomial heaps, which can be merged in O(log n).
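The concatenate-and-rebuild approach for array-based heaps can be sketched with Python's standard heapq module, which implements a binary min-heap on a list. The function name `merge_heaps` is an assumption for this example.

```python
import heapq


def merge_heaps(h1, h2):
    """Merge two array-based min-heaps by concatenating and re-heapifying.

    heapq.heapify runs in linear time, so the whole merge is linear in
    the combined size. The inputs are left unmodified.
    """
    merged = h1 + h2
    heapq.heapify(merged)
    return merged
```

For example, merging the heaps [1, 3, 5] and [2, 4] yields a heap whose successive heappop calls return 1, 2, 3, 4, 5.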
Additionally, a binary heap can be implemented with a traditional binary tree data structure, but there is an issue with
finding the adjacent element on the last level of the binary heap when adding an element. This element can be
determined algorithmically or by adding extra data to the nodes, called "threading" the tree: instead of merely
storing references to the children, we store the inorder successor of the node as well.
It is possible to modify the heap structure to allow extraction of both the smallest and largest element in O(log n)
time. To do this, the rows alternate between min heap and max heap. The algorithms are roughly the same, but, in
each step, one must consider the alternating rows with alternating comparisons. The performance is roughly the same
as a normal single direction heap. This idea can be generalised to a min-max-median heap.
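Mathematical proof
The claim is that, with zero-based indexing, the children of node i are stored at indices 2i + 1 and 2i + 2.
From the figure in "Heap Implementation" section, it can be seen that any node can store its children only after its
right siblings and its left siblings' children have been stored. This fact will be used for derivation.
Total number of elements from root to any given level l is 2^(l+1) − 1, where l starts at zero.
Suppose the node i is at level L.
So, the total number of nodes from root to previous level would be 2^L − 1.
Total number of nodes stored in the array till the index i is i + 1 (counting i too).
So, total number of siblings on the left of i is (i + 1) − 1 − (2^L − 1) = i − 2^L + 1.
Hence, the total number of children of these siblings is 2(i − 2^L + 1).
The number of elements at level L is 2^L, so the total number of siblings to the right of i is
2^L − 1 − (i − 2^L + 1) = 2^(L+1) − i − 2.
So, the index of the first child of node i would be
i + 1 + (2^(L+1) − i − 2) + 2(i − 2^L + 1) = 2i + 1.
[Proved]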
Intuitive proof
Although the mathematical approach proves this without doubt, the simplicity of the resulting equation suggests that
there should be a simpler way to arrive at this conclusion.
For this, two facts should be noted.
Children for node i will be found at the very first empty slot.
Second is that, all nodes previous to node i, right up to the root, will have exactly two children. This is
necessary to maintain the shape of the heap.
Now since all nodes have two children (as per the second fact) so all memory slots taken by the children will be
2((i + 1) − 1) = 2i. We add one since i starts at zero. Then we subtract one since node i doesn't yet have
any children.
This means all filled memory slots have been accounted for except one: the root node. Root is child to none. So
finally, the count of all filled memory slots is 2i + 1.
So, by fact one and since our indexing starts at zero, 2i + 1 itself gives the index of the first child of i.
Notes
[1] Min-heaps are used for priority queues because the higher priority is usually represented with lower numeric value, so it's min-heap which
pushes high-priority items to the front of a queue.
[2] Poul-Henning Kamp. "You're Doing It Wrong" (http://queue.acm.org/detail.cfm?id=1814327). ACM Queue. June 11, 2010.
[3] Chris L. Kuszmaul. "binary heap" (http://nist.gov/dads/HTML/binaryheap.html). Dictionary of Algorithms and Data Structures, Paul E.
Black, ed., U.S. National Institute of Standards and Technology. 16 November 2009.
[4] J.-R. Sack and T. Strothotte, "An Algorithm for Merging Heaps" (http://www.springerlink.com/content/k24440h5076w013q/), Acta
Informatica 22, 171–186 (1985).
[5] J.-R. Sack and T. Strothotte, "A characterization of heaps and its applications"
(http://www.sciencedirect.com/science/article/pii/089054019090026E), Information and Computation, Volume 86, Issue 1, May 1990, pages 69–86.
References
External links
Binary Heap Applet (http://people.ksp.sk/~kuko/bak/index.html) by Kubo Kovac
Using Binary Heaps in A* Pathfinding (http://www.policyalmanac.org/games/binaryHeaps.htm)
Open Data Structures - Section 10.1 - BinaryHeap: An Implicit Binary Tree (http://opendatastructures.org/versions/edition-0.1e/ods-java/10_1_BinaryHeap_Implicit_Bi.html)
d-ary heap
The d-ary heap or d-heap is a priority queue data structure, a generalization of the binary heap in which the nodes
have d children instead of 2. Thus, a binary heap is a 2-heap. According to Tarjan and Jensen et al., d-ary heaps were
invented by Donald B. Johnson in 1975.
This data structure allows decrease priority operations to be performed more quickly than binary heaps, at the
expense of slower delete minimum operations. This tradeoff leads to better running times for algorithms such as
Dijkstra's algorithm in which decrease priority operations are more common than delete min operations.
Additionally, d-ary heaps have better memory cache behavior than a binary heap, allowing them to run more quickly
in practice despite having a theoretically larger worst-case running time. Like binary heaps, d-ary heaps are an
in-place data structure that uses no additional storage beyond that needed to store the array of items in the heap.
Data structure
The d-ary heap consists of an array of n items, each of which has a priority associated with it. These items may be
viewed as the nodes in a complete d-ary tree, listed in breadth first traversal order: the item at position 0 of the array
forms the root of the tree, the items at positions 1 through d are its children, the next d² items are its grandchildren, etc.
Thus, the parent of the item at position i (for any i > 0) is the item at position floor((i − 1)/d) and its children are the
items at positions di + 1 through di + d. According to the heap property, in a min-heap, each item has a priority that
is at least as large as its parent; in a max-heap, each item has a priority that is no larger than its parent.
The minimum priority item in a min-heap (or the maximum priority item in a max-heap) may always be found at
position 0 of the array. To remove this item from the priority queue, the last item x in the array is moved into its
place, and the length of the array is decreased by one. Then, while item x and its children do not satisfy the heap
property, item x is swapped with one of its children (the one with the smallest priority in a min-heap, or the one with
the largest priority in a max-heap), moving it downward in the tree and later in the array, until eventually the heap
property is satisfied. The same downward swapping procedure may be used to increase the priority of an item in a
min-heap, or to decrease the priority of an item in a max-heap.
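The array layout and the two swapping procedures can be sketched as a small Python class. This is an illustrative min-heap only; the class name `DaryMinHeap` and its method names are assumptions for this example, and priority changes of existing items are not supported.

```python
class DaryMinHeap:
    """Array-based d-ary min-heap with the root at position 0."""

    def __init__(self, d=4):
        if d < 2:
            raise ValueError("d must be at least 2")
        self.d = d
        self.a = []

    def __len__(self):
        return len(self.a)

    def push(self, x):
        """Append x, then swap it upward while it is smaller than its parent."""
        a = self.a
        a.append(x)
        i = len(a) - 1
        while i > 0:
            p = (i - 1) // self.d          # parent position
            if a[p] <= a[i]:
                break
            a[p], a[i] = a[i], a[p]
            i = p

    def pop(self):
        """Remove and return the minimum: move the last item to the root
        and swap it downward with its smallest child."""
        a = self.a
        if not a:
            raise IndexError("pop from an empty heap")
        top = a[0]
        last = a.pop()
        if a:
            a[0] = last
            self._sift_down(0)
        return top

    def _sift_down(self, i):
        a, d, n = self.a, self.d, len(self.a)
        while True:
            first = d * i + 1              # children are d*i + 1 .. d*i + d
            if first >= n:
                return
            j = min(range(first, min(first + d, n)), key=a.__getitem__)
            if a[i] <= a[j]:
                return
            a[i], a[j] = a[j], a[i]
            i = j
```

Pushing a sequence of numbers and popping until the heap is empty returns them in ascending order for any d ≥ 2.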
Analysis
In a d-ary heap with n items in it, both the upward-swapping procedure and the downward-swapping procedure may
perform as many as log_d n = log n / log d swaps. In the upward-swapping procedure, each swap involves a single
comparison of an item with its parent, and takes constant time. Therefore, the time to insert a new item into the heap,
to decrease the priority of an item in a min-heap, or to increase the priority of an item in a max-heap, is O(log n / log
d). In the downward-swapping procedure, each swap involves d comparisons and takes O(d) time: it takes d − 1
comparisons to determine the minimum or maximum of the children and then one more comparison against the
parent to determine whether a swap is needed. Therefore, the time to delete the root item, to increase the priority of
an item in a min-heap, or to decrease the priority of an item in a max-heap, is O(d log n / log d).
When creating a d-ary heap from a set of n items, most of the items are in positions that will eventually hold leaves
of the d-ary tree, and no downward swapping is performed for those items. At most n/d + 1 items are non-leaves, and
may be swapped downwards at least once, at a cost of O(d) time to find the child to swap them with. At most n/d² +
1 nodes may be swapped downward two times, incurring an additional O(d) cost for the second swap beyond the
cost already counted in the first term, etc. Therefore, the total amount of time to create a heap in this way is
Σ_{i=1}^{log_d n} (n/d^i + 1) · O(d) = O(n).
The exact value of the above (the worst-case number of comparisons during the construction of d-ary heap) is known
to be equal to:
,
where sd(n) is the sum of all digits of the standard base-d representation of n and ed(n) is the exponent of d in the
factorization of n. This reduces to
,
for d = 2, and to
,
for d = 3.
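The linear construction bound follows from the per-level costs described above: at most n/d^i + 1 items are swapped downward an i-th time, each time at cost O(d). A short expansion of that sum, assuming 2 ≤ d ≤ n, can be written as:

```latex
\sum_{i=1}^{\lceil \log_d n \rceil} \left( \frac{n}{d^{i}} + 1 \right) O(d)
  = O\!\left( d\,n \sum_{i \ge 1} d^{-i} \;+\; d \log_d n \right)
  = O\!\left( \frac{d}{d-1}\, n \;+\; d \log_d n \right)
  = O(n)
```

The last step uses d/(d − 1) ≤ 2 and the fact that d log_d n = O(n) whenever 2 ≤ d ≤ n.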
The space usage of the d-ary heap, with insert and delete-min operations, is linear, as it uses no extra storage other
than an array containing a list of the items in the heap. If changes to the priorities of existing items need to be
supported, then one must also maintain pointers from the items to their positions in the heap, which again uses only
linear storage.
Applications
Dijkstra's algorithm for shortest paths in graphs and Prim's algorithm for minimum spanning trees both use a
min-heap in which there are n delete-min operations and as many as m decrease-priority operations, where n is the
number of vertices in the graph and m is the number of edges. By using a d-ary heap with d = m/n, the total times for
these two types of operations may be balanced against each other, leading to a total time of O(m log_{m/n} n) for the
algorithm, an improvement over the O(m log n) running time of binary heap versions of these algorithms whenever
the number of edges is significantly larger than the number of vertices. An alternative priority queue data structure,
the Fibonacci heap, gives an even better theoretical running time of O(m + n log n), but in practice d-ary heaps are
generally at least as fast, and often faster, than Fibonacci heaps for this application.
4-heaps may perform better than binary heaps in practice, even for delete-min operations. Additionally, a d-ary heap
typically runs much faster than a binary heap for heap sizes that exceed the size of the computer's cache memory: A
binary heap typically requires more cache misses and virtual memory page faults than a d-ary heap, each one taking
far more time than the extra work incurred by the additional comparisons a d-ary heap makes compared to a binary
heap.
References
External links
C++ implementation of generalized heap with D-Heap support (https://github.com/valyala/gheap)
Binomial heap
In computer science, a binomial heap is a heap similar to a binary heap but also supports quick merging of two
heaps. This is achieved by using a special tree structure. It is important as an implementation of the mergeable heap
abstract data type (also called meldable heap), which is a priority queue supporting merge operation.
Binomial tree
A binomial heap is implemented as a collection of binomial trees (compare with a binary heap, which has a shape of
a single binary tree). A binomial tree is defined recursively:
A binomial tree of order 0 is a single node
A binomial tree of order k has a root node whose children are roots of binomial trees of orders k−1, k−2, ..., 2, 1, 0
(in this order).
Binomial trees of order 0 to 3: Each tree has a root node with subtrees of all lower ordered binomial trees, which have been highlighted.
For example, the order 3 binomial tree is connected to an order 2, 1, and 0 (highlighted as blue, green and red respectively) binomial tree.
A binomial tree of order n has C(n, d) nodes at depth d, where C(n, d) is the binomial coefficient "n choose d".
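The recursive definition of a binomial tree can be checked with a short Python sketch. Here a node is represented simply by the list of its child subtrees, a representation chosen for this example only; the sketch builds trees of small order and counts nodes per depth.

```python
from math import comb


def binomial_tree(k):
    """Binomial tree of order k: the root's children are binomial trees
    of orders k-1, k-2, ..., 1, 0 (in this order). A leaf is []."""
    return [binomial_tree(j) for j in range(k - 1, -1, -1)]


def depth_counts(tree, depth=0, counts=None):
    """Return a dict mapping each depth to the number of nodes at that depth."""
    if counts is None:
        counts = {}
    counts[depth] = counts.get(depth, 0) + 1
    for child in tree:
        depth_counts(child, depth + 1, counts)
    return counts
```

For every order k tried, the total node count is 2^k and the count at depth d equals comb(k, d), which is where the name "binomial" comes from.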
Implementation
Because no operation requires random access to the root nodes of the binomial trees, the roots of the binomial trees
can be stored in a linked list, ordered by increasing order of the tree.
Merge
As mentioned above, the simplest and most important operation is the merging of two binomial trees of the same
order within two binomial heaps. Due to the structure of binomial trees, they can be merged trivially. As their root
node is the smallest element within the tree, by comparing the two keys, the smaller of them is the minimum key,
and becomes the new root node. Then the other tree becomes a subtree of the combined tree. This operation is basic to
the complete merging of two binomial heaps.
function mergeTree(p, q)
    if p.root.key <= q.root.key
        return p.addSubTree(q)
    else
        return q.addSubTree(p)
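A runnable counterpart of mergeTree might look like the following Python sketch. The class `Tree`, its fields, and the function name `merge_tree` are assumptions for illustration; children are appended to a list, so the newest (largest) subtree is stored last.

```python
class Tree:
    """Root node of a binomial tree."""

    def __init__(self, key):
        self.key = key
        self.order = 0
        self.children = []


def merge_tree(p, q):
    """Link two binomial trees of equal order into one tree of order + 1.
    The root with the smaller key becomes the new root."""
    if p.order != q.order:
        raise ValueError("only trees of the same order can be merged")
    if q.key < p.key:
        p, q = q, p
    p.children.append(q)    # q becomes the largest subtree of p
    p.order += 1
    return p
```

Linking costs one key comparison and one pointer update, independent of the size of the trees.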
The operation of merging two heaps is perhaps the most
interesting and can be used as a subroutine in most other
operations. The lists of roots of both heaps are traversed
simultaneously, similarly as in the merge algorithm.
If only one of the heaps contains a tree of order j, this tree is
moved to the merged heap. If both heaps contain a tree of order j,
the two trees are merged to one tree of order j+1 so that the
minimum-heap property is satisfied. Note that it may later be
necessary to merge this tree with some other tree of order j+1
present in one of the heaps. In the course of the algorithm, we need
to examine at most three trees of any order (two from the two
heaps we merge and one composed of two smaller trees).
function merge(p, q)
    while not (p.end() and q.end())
        tree = mergeTree(p.currentTree(), q.currentTree())
        if not heap.currentTree().empty()
            tree = mergeTree(tree, heap.currentTree())
        heap.addTree(tree)
        heap.next(); p.next(); q.next()
Insert
Inserting a new element to a heap can be
done by simply creating a new heap
containing only this element and then
merging it with the original heap. Due to the
merge, insert takes O(log n) time; however, it
has an amortized time of O(1) (i.e.
constant).
Find minimum
To find the minimum element of the heap,
find the minimum among the roots of the
binomial trees. This can again be done
easily in O(log n) time, as there are just
O(log n) trees and hence roots to examine.
This shows the merger of two binomial heaps. This is accomplished by merging
two binomial trees of the same order one by one. If the resulting merged tree has
the same order as one binomial tree in one of the two heaps, then those two are
merged again.
Delete minimum
To delete the minimum element from the heap, first find this element, remove it from its binomial tree, and obtain a
list of its subtrees. Then transform this list of subtrees into a separate binomial heap by reordering them from
smallest to largest order. Then merge this heap with the original heap. Since each tree has at most log n children,
creating this new heap is O(log n). Merging heaps is O(log n), so the entire delete minimum operation is O(log n).
function deleteMin(heap)
    min = heap.trees().first()
    for each current in heap.trees()
        if current.root < min.root then min = current
    for each tree in min.subTrees()
        tmp.addTree(tree)
    heap.removeTree(min)
    merge(heap, tmp)
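The merge, insert, find-minimum and delete-minimum operations can be put together in a compact Python sketch. It is an illustration under simplifying assumptions: a heap is a plain Python list of tree roots in increasing order, the merge is done like binary addition with a dictionary of "slots" per order instead of the simultaneous list traversal described above, and all names are hypothetical.

```python
class Node:
    def __init__(self, key):
        self.key, self.order, self.children = key, 0, []


def link(p, q):
    """Combine two trees of equal order; the smaller key stays on top."""
    if q.key < p.key:
        p, q = q, p
    p.children.append(q)
    p.order += 1
    return p


def merge(h1, h2):
    """Merge two heaps (lists of trees, at most one tree per order)."""
    slots = {}
    for tree in h1 + h2:
        while tree.order in slots:           # carry, as in binary addition
            tree = link(tree, slots.pop(tree.order))
        slots[tree.order] = tree
    return [slots[k] for k in sorted(slots)]


def insert(heap, key):
    """Insert by merging with a one-element heap."""
    return merge(heap, [Node(key)])


def find_min(heap):
    """The minimum is the smallest of the tree roots."""
    return min(heap, key=lambda t: t.key).key


def delete_min(heap):
    """Remove the minimum root and merge its subtrees back into the heap.
    Returns the removed key and the new heap."""
    smallest = min(heap, key=lambda t: t.key)
    rest = [t for t in heap if t is not smallest]
    return smallest.key, merge(rest, smallest.children)
```

After inserting seven keys the heap holds trees of orders 0, 1 and 2, mirroring the binary representation 111 of the number 7.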
Decrease key
After decreasing the key of an element, it may become smaller than the key of its parent, violating the
minimum-heap property. If this is the case, exchange the element with its parent, and possibly also with its
grandparent, and so on, until the minimum-heap property is no longer violated. Each binomial tree has height at most
log n, so this takes O(log n) time.
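The upward exchange used by decrease key can be sketched in Python as follows. The sketch assumes each node stores a pointer to its parent and exchanges the keys of the two nodes rather than relinking them, which is one possible choice (external handles to nodes would then need updating); the names are hypothetical.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key = key
        self.parent = parent


def decrease_key(node, new_key):
    """Lower node's key, then exchange keys with ancestors until the
    min-heap property holds again. Returns the node where the key ends up."""
    if new_key > node.key:
        raise ValueError("new key is larger than the current key")
    node.key = new_key
    while node.parent is not None and node.key < node.parent.key:
        node.key, node.parent.key = node.parent.key, node.key
        node = node.parent
    return node
```

The loop climbs at most the height of the tree, which is where the O(log n) bound comes from.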
Delete
To delete an element from the heap, decrease its key to negative infinity (that is, some value lower than any element
in the heap) and then delete the minimum in the heap.
Performance
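All of the following operations work in O(log n) time on a binomial heap with n elements:
Insert a new element into the heap
Find the element with the minimum key
Delete the element with the minimum key from the heap
Decrease the key of a given element
Delete a given element from the heap
Merge two given heaps into one heap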
Applications
Discrete event simulation
Priority queues
References
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms,
Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 19: Binomial Heaps,
pp. 455–475.
Vuillemin, J. (1978). A data structure for manipulating priority queues. Communications of the ACM 21,
309–314.
Fibonacci heap
In computer science, a Fibonacci heap is a heap data structure consisting of a collection of trees. It has a better
amortized running time than a binomial heap. Fibonacci heaps were developed by Michael L. Fredman and Robert
E. Tarjan in 1984 and first published in a scientific journal in 1987. The name of Fibonacci heap comes from
Fibonacci numbers which are used in the running time analysis.
Find-minimum is O(1) amortized time.[1] Operations insert, decrease key, and merge (union) work in constant
amortized time. Operations delete and delete minimum work in O(log n) amortized time. This means that starting
from an empty data structure, any sequence of a operations from the first group and b operations from the second
group would take O(a + b log n) time. In a binomial heap such a sequence of operations would take O((a + b) log n)
time. A Fibonacci heap is thus better than a binomial heap when b is asymptotically smaller than a.
Using Fibonacci heaps for priority queues improves the asymptotic running time of important algorithms, such as
Dijkstra's algorithm for computing the shortest path between two nodes in a graph.
Structure
A Fibonacci heap is a collection of trees satisfying the
minimum-heap property, that is, the key of a child is
always greater than or equal to the key of the parent.
This implies that the minimum key is always at the root
of one of the trees. Compared with binomial heaps, the
structure of a Fibonacci heap is more flexible. The trees
do not have a prescribed shape and in the extreme case
the heap can have every element in a separate tree. This
flexibility allows some operations to be executed in a
"lazy" manner, postponing the work for later
operations. For example merging heaps is done simply
by concatenating the two lists of trees, and operation
decrease key sometimes cuts a node from its parent and
forms a new tree.
As a result of a relaxed structure, some operations can take a long time while others are done very quickly. For the
amortized running time analysis we use the potential method, in that we pretend that very fast operations take a little
bit longer than they actually do. This additional time is then later combined and subtracted from the actual running
time of slow operations. The amount of time saved for later use is measured at any given moment by a potential
function. The potential of a Fibonacci heap is given by
Potential = t + 2m
where t is the number of trees in the Fibonacci heap, and m is the number of marked nodes. A node is marked if at
least one of its children was cut since this node was made a child of another node (all roots are unmarked).
Thus, the root of each tree in a heap has one unit of time stored. This unit of time can be used later to link this tree
with another tree at amortized time 0. Also, each marked node has two units of time stored. One can be used to cut
the node from its parent. If this happens, the node becomes a root and the second unit of time will remain stored in it
as in any other root.
Implementation of operations
To allow fast deletion and concatenation, the roots of all trees are linked using a circular, doubly linked list. The
children of each node are also linked using such a list. For each node, we maintain its number of children and
whether the node is marked. Moreover we maintain a pointer to the root containing the minimum key.
Operation find minimum is now trivial because we keep the pointer to the node containing it. It does not change the
potential of the heap, therefore both actual and amortized cost is constant. As mentioned above, merge is
implemented simply by concatenating the lists of tree roots of the two heaps. This can be done in constant time and
the potential does not change, leading again to constant amortized time.
Operation insert works by creating a new heap with one element and doing merge. This takes constant time, and the
potential increases by one, because the number of trees increases. The amortized cost is thus still constant.
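The constant-time find minimum, merge and insert operations can be sketched in Python with a circular, doubly linked root list. This is only the "lazy" part of a Fibonacci heap: child lists, marks, and extract minimum are left out, and the class and method names are assumptions for this example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self      # a one-element circular list


class LazyHeap:
    def __init__(self):
        self.min = None                    # pointer to the minimum root
        self.n = 0

    def find_min(self):
        if self.min is None:
            raise IndexError("heap is empty")
        return self.min.key

    def merge(self, other):
        """Splice other's root list into this one in O(1); other is emptied."""
        if other.min is None:
            return
        if self.min is None:
            self.min = other.min
        else:
            a, b = self.min, other.min
            a_next, b_next = a.right, b.right
            a.right, b_next.left = b_next, a   # a -> rest of other's list
            b.right, a_next.left = a_next, b   # b -> rest of this list
            if b.key < a.key:
                self.min = b
        self.n += other.n
        other.min, other.n = None, 0

    def insert(self, key):
        """Create a one-element heap and merge it in."""
        single = LazyHeap()
        single.min, single.n = Node(key), 1
        self.merge(single)

    def roots(self):
        """Yield the keys in the root list, starting at the minimum."""
        node = self.min
        if node is None:
            return
        while True:
            yield node.key
            node = node.right
            if node is self.min:
                break
```

No comparisons beyond the one against the current minimum are made, so every element stays in its own tree until a later operation does the postponed work.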
Operation extract minimum (same as delete minimum) operates in three
phases. First we take the root containing the minimum element and
remove it. Its children will become roots of new trees. If the number of
children was d, it takes time O(d) to process all new roots and the
potential increases by d − 1. Therefore the amortized running time of this
phase is O(d) = O(log n).
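The bound O(d) = O(log n) rests on the fact that every node of degree d in a heap of n elements is the root of a
subtree of size at least F_{d+2}, where F_k is the kth Fibonacci number. Since F_{d+2} ≥ φ^d for all integers d ≥ 0,
where φ = (1 + √5)/2 is the golden ratio, we have n ≥ F_{d+2} ≥ φ^d, and taking the logarithm to base φ of both sides
gives d ≤ log_φ n, as required. The size bound is proved as follows.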
Consider any node x somewhere in the heap (x need not be the root of one of the main trees). Define size(x) to be the
size of the tree rooted at x (the number of descendants of x, including x itself). We prove by induction on the height
of x (the length of a longest simple path from x to a descendant leaf), that size(x) ≥ F_{d+2}, where d is the degree of x.
Base case: If x has height 0, then d = 0, and size(x) = 1 = F_2.
Inductive case: Suppose x has positive height and degree d > 0. Let y_1, y_2, ..., y_d be the children of x, indexed in order
of the times they were most recently made children of x (y_1 being the earliest and y_d the latest), and let c_1, c_2, ..., c_d
be their respective degrees. We claim that c_i ≥ i − 2 for each i with 2 ≤ i ≤ d: Just before y_i was made a child of x,
y_1, ..., y_{i−1} were already children of x, and so x had degree at least i − 1 at that time. Since trees are combined only
when the degrees of their roots are equal, it must have been that y_i also had degree at least i − 1 at the time it became a
child of x. From that time to the present, y_i can only have lost at most one child (as guaranteed by the marking
process), and so its current degree c_i is at least i − 2. This proves the claim.
Since the heights of all the y_i are strictly less than that of x, we can apply the inductive hypothesis to them to get
size(y_i) ≥ F_{c_i+2} ≥ F_{(i−2)+2} = F_i. The nodes x and y_1 each contribute at least 1 to size(x), and so we have
size(x) ≥ 2 + Σ_{i=2}^{d} size(y_i) ≥ 2 + Σ_{i=2}^{d} F_i = 1 + Σ_{i=0}^{d} F_i.
A routine induction proves that 1 + Σ_{i=0}^{d} F_i = F_{d+2} for any d ≥ 0, which gives the desired lower bound on
size(x).
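The two Fibonacci facts that the degree bound relies on, the summation identity 1 + F_0 + F_1 + ... + F_d = F_{d+2} and the inequality F_{d+2} ≥ φ^d with φ the golden ratio, can be checked numerically for small d. The following Python sketch is a sanity check, not a proof; the function names are assumptions for this example.

```python
from math import sqrt

PHI = (1 + sqrt(5)) / 2     # the golden ratio


def fib(k):
    """k-th Fibonacci number with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a


def check_degree_bound_facts(d_max):
    """Return True if both facts hold for every d in 0..d_max."""
    for d in range(d_max + 1):
        if 1 + sum(fib(i) for i in range(d + 1)) != fib(d + 2):
            return False
        if fib(d + 2) < PHI ** d:
            return False
    return True
```

Because the comparison with PHI ** d uses floating point, the check is only meaningful for moderate values of d.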
Worst case
Although the total running time of a sequence of operations starting with an empty structure is bounded by the
bounds given above, some (very few) operations in the sequence can take very long to complete (in particular delete
and delete minimum have linear running time in the worst case). For this reason Fibonacci heaps and other amortized
data structures may not be appropriate for real-time systems. It is possible to create a data structure which has the
same worst case performance as the Fibonacci heap has amortized performance. However the resulting structure (a
Brodal queue) is, in the words of the creator, "quite complicated" and "[not] applicable in practice."
Effect
Pairing heap
insert(data,key)
Adds data to
the queue,
tagged with
key
O(1)
O(n)
O(log n)
O(log
n)
O(log n)
O(1)
O(1)
O(1)
findMin() ->
key,data
Returns
O(n)
key,data
corresponding
to min-value
key
O(1)
O(log n) or
O(1) (**)
O(1)
O(log n)
O(1)
O(1)
O(1)
deleteMin()
O(1)
O(log n)
O(log
n)
O(log n)
O(log n)*
delete(node)
O(1)
O(log n)
O(log
n)
O(log n)
O(log n)*
decreaseKey(node)
Decreases
O(1)
the key of a
node, given a
pointer to the
node being
modified
O(1)
O(n)
O(log n)
O(log
n)
O(log n)
O(1)*
O(1)
O(m +
n)
O(m
log(n+m))
O(m +
n)
O(log m
+ log n)
O(1)
O(1)
O(1)
(*)Amortized time
(**)With trivial modification to store an additional pointer to the minimum element
References
[1] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and
McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 20: Fibonacci Heaps, pp. 476–497. Third edition p. 518.
External links
Java applet simulation of a Fibonacci heap (http://www.cs.yorku.ca/~aaw/Jason/FibonacciHeapAnimation.html)
MATLAB implementation of Fibonacci heap (http://www.mathworks.com/matlabcentral/fileexchange/30072-fibonacci-heap)
De-recursived and memory efficient C implementation of Fibonacci heap (http://www.labri.fr/perso/pelegrin/code/#fibonacci)
(free/libre software, CeCILL-B license (http://www.cecill.info/licences/Licence_CeCILL-B_V1-en.html))
C++ template Fibonacci heap, with demonstration (http://ideone.com/9jYnv)
Ruby implementation of the Fibonacci heap (with tests) (http://github.com/evansenter/f_heap)
Pseudocode of the Fibonacci heap algorithm (http://www.cs.princeton.edu/~wayne/cs423/fibonacci/FibonacciHeapAlgorithm.html)
Efficient C++ Fibonacci heap (http://stackoverflow.com/q/504823/194609)
Various Java Implementations for Fibonacci heap (http://stackoverflow.com/q/6273833/194609)
Pairing heap
A pairing heap is a type of heap data structure with relatively simple implementation and excellent practical
amortized performance. However, it has proven very difficult to determine the precise asymptotic running time of
pairing heaps.
Pairing heaps are heap ordered multiway trees. Describing the various heap operations is relatively simple (in the
following we assume a min-heap):
find-min: simply return the top element of the heap.
merge: compare the two root elements, the smaller remains the root of the result, the larger element and its
subtree is appended as a child of this root.
insert: create a new heap for the inserted element and merge into the original heap.
decrease-key (optional): remove the subtree rooted at the key to be decreased then merge it with the heap.
delete-min: remove the root and merge its subtrees. Various strategies are employed.
The amortized time per delete-min is O(log n). The operations find-min, merge, and insert take O(1) amortized time
and decrease-key takes O(2^(2√(log log n))) amortized time. Fredman proved that the amortized time per decrease-key
is at least Ω(log log n).
Although this is worse than other priority queue algorithms such as Fibonacci heaps, which perform decrease-key in
O(1) amortized time, the performance in practice is excellent. Stasko and Vitter and Moret and Shapiro conducted
experiments on pairing heaps and other heap data structures. They concluded that the pairing heap is as fast as, and
often faster than, other efficient data structures like the binary heaps.
Implementation
A pairing heap is either an empty heap, or a pair consisting of a root element and a possibly empty list of pairing
heaps. The heap ordering property requires that all the root elements of the subheaps in the list are not smaller than
the root element of the heap. The following description assumes a purely functional heap that does not support the
decrease-key operation.
type PairingHeap[Elem] = Empty | Heap(elem: Elem, subheaps: List[PairingHeap[Elem]])
Operations
find-min
The function find-min simply returns the root element of the heap:
function find-min(heap)
    if heap == Empty
        error
    else
        return heap.elem
merge
Merging with an empty heap returns the other heap, otherwise a new heap is returned that has the minimum of the
two root elements as its root element and just adds the heap with the larger root to the list of subheaps:
function merge(heap1, heap2)
    if heap1 == Empty
        return heap2
    elsif heap2 == Empty
        return heap1
    elsif heap1.elem < heap2.elem
        return Heap(heap1.elem, heap2 :: heap1.subheaps)
    else
        return Heap(heap2.elem, heap1 :: heap2.subheaps)
insert
The easiest way to insert an element into a heap is to merge the heap with a new heap containing just this element
and an empty list of subheaps:
function insert(elem, heap)
    return merge(Heap(elem, []), heap)
delete-min
The only non-trivial fundamental operation is the deletion of the minimum element from the heap. The standard
strategy first merges the subheaps in pairs (this is the step that gave this datastructure its name) from left to right and
then merges the resulting list of heaps from right to left:
function delete-min(heap)
    if heap == Empty
        error
    else
        return merge-pairs(heap.subheaps)
This uses the auxiliary function merge-pairs:
function merge-pairs(l)
    if length(l) == 0
        return Empty
    elsif length(l) == 1
        return l[0]
    else
        return merge(merge(l[0], l[1]), merge-pairs(l[2.. ]))
That this does indeed implement the described two-pass left-to-right then right-to-left merging strategy can be seen
from this reduction:
   merge-pairs([H1, H2, H3, H4, H5, H6, H7])
=> merge(merge(H1, H2), merge-pairs([H3, H4, H5, H6, H7]))
     # merge H1 and H2 to H12, then the rest of the list
=> merge(H12, merge(merge(H3, H4), merge-pairs([H5, H6, H7])))
     # merge H3 and H4 to H34, then the rest of the list
=> merge(H12, merge(H34, merge(merge(H5, H6), merge-pairs([H7]))))
     # merge H5 and H6 to H56, then the rest of the list
=> merge(H12, merge(H34, merge(H56, H7)))
     # switch direction, merge the last two resulting heaps, giving H567
=> merge(H12, merge(H34, H567))
     # merge the last two resulting heaps, giving H34567
=> merge(H12, H34567)
     # finally, merge the first merged pair with the result of merging the rest
=> H1234567
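The pseudocode above translates almost line for line into Python. In this sketch the empty heap is represented by None and the list of subheaps by a cons list (None, or a pair of a heap and the rest of the list) so that merge stays constant-time; both representation choices and the snake_case names are assumptions for this example.

```python
from typing import Any, NamedTuple


class Heap(NamedTuple):
    elem: Any
    subheaps: Any          # cons list: None or (heap, rest_of_list)


def find_min(heap):
    if heap is None:
        raise IndexError("find_min on an empty heap")
    return heap.elem


def merge(heap1, heap2):
    if heap1 is None:
        return heap2
    if heap2 is None:
        return heap1
    if heap1.elem < heap2.elem:
        return Heap(heap1.elem, (heap2, heap1.subheaps))
    return Heap(heap2.elem, (heap1, heap2.subheaps))


def insert(elem, heap):
    return merge(Heap(elem, None), heap)


def merge_pairs(l):
    """Pair up subheaps left to right, then merge the results right to left."""
    if l is None:
        return None
    first, rest = l
    if rest is None:
        return first
    second, tail = rest
    return merge(merge(first, second), merge_pairs(tail))


def delete_min(heap):
    if heap is None:
        raise IndexError("delete_min on an empty heap")
    return merge_pairs(heap.subheaps)
```

Repeatedly calling find_min and delete_min on a heap built with insert returns the elements in ascending order. Because merge_pairs is recursive, a root with a very long list of subheaps could exceed Python's default recursion limit; an iterative two-pass version would avoid that.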
References
External links
Louis Wasserman discusses pairing heaps and their implementation in Haskell in The Monad Reader, Issue 16
(http://themonadreader.files.wordpress.com/2010/05/issue16.pdf) (pp. 37–52).
pairing heaps (http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c13/pairing.htm), Sartaj Sahni
Amr Elmasry (2009), "Pairing Heaps with O(log log n) decrease Cost"
(http://www.siam.org/proceedings/soda/2009/SODA09_052_elmasrya.pdf), Proceedings of the Twentieth Annual ACM-SIAM
Symposium on Discrete Algorithms (SODA '09) (New York): 471–476
heaps library (http://www.swi-prolog.org/pldoc/doc/swi/library/heaps.pl) in SWI-Prolog, uses pairing heaps
Open source implementation of pairing heaps in Erlang (https://gist.github.com/1248317)
Operations
A double-ended priority queue features the following operations:
isEmpty()
Checks if DEPQ is empty and returns true if empty.
size()
Returns the total number of elements present in the DEPQ.
getMin()
Returns the element having least priority.
getMax()
Returns the element having highest priority.
put(x)
Inserts the element x in the DEPQ.
removeMin()
Implementation
Double-ended priority queues can be built from balanced binary search trees (where the minimum and maximum
elements are the leftmost and rightmost leaves, respectively), or using specialized data structures like min-max heap
and pairing heap.
Generic methods of arriving at double-ended priority queues from normal priority queues are:[2]
Removing the max element: Perform removemax() on the max heap and remove(node value) on the min heap,
where node value is the value in the corresponding node in the min heap.
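The idea of keeping every element in both a min heap and a max heap can be sketched with Python's heapq module. Note the swapped-in technique: instead of the direct remove(node value) described above, this sketch uses lazy deletion, where an element removed from one heap is only marked as dead and skipped later in the other heap. Numeric priorities are assumed (the max heap stores negated values), and the class and method names are hypothetical.

```python
import heapq
from itertools import count


class DualHeapDEPQ:
    def __init__(self):
        self._min = []          # entries (priority, id)
        self._max = []          # entries (-priority, id)
        self._alive = set()     # ids of elements still in the queue
        self._ids = count()

    def __len__(self):
        return len(self._alive)

    def is_empty(self):
        return not self._alive

    def put(self, x):
        i = next(self._ids)
        self._alive.add(i)
        heapq.heappush(self._min, (x, i))
        heapq.heappush(self._max, (-x, i))

    def _prune(self, heap):
        # Drop entries whose element was already removed via the other heap.
        while heap and heap[0][1] not in self._alive:
            heapq.heappop(heap)

    def get_min(self):
        self._prune(self._min)
        if not self._min:
            raise IndexError("DEPQ is empty")
        return self._min[0][0]

    def get_max(self):
        self._prune(self._max)
        if not self._max:
            raise IndexError("DEPQ is empty")
        return -self._max[0][0]

    def remove_min(self):
        x = self.get_min()
        _, i = heapq.heappop(self._min)
        self._alive.discard(i)
        return x

    def remove_max(self):
        x = self.get_max()
        _, i = heapq.heappop(self._max)
        self._alive.discard(i)
        return x
```

The trade-off of lazy deletion is that stale entries stay in memory until they reach the top of their heap, so the space used can exceed the number of live elements.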
Total correspondence
Half the elements are in the min PQ and the other
half in the max PQ. Each element in the min PQ
has a one to one correspondence with an element
in max PQ. If the number of elements in the DEPQ
is odd, one of the elements is retained in a buffer.
Priority of every element in the min PQ will be less
than or equal to the corresponding element in the
max PQ.
Leaf correspondence
In this method only the leaf elements of the min
and max PQ form corresponding one to one pairs.
It is not necessary for non-leaf elements to be in a
one to one correspondence pair.
Interval heaps
Apart from the above mentioned correspondence
methods, DEPQ's can be obtained efficiently using
interval heaps.[3] An interval heap is like an
embedded min-max heap in which each node
contains two elements. It is a complete binary tree
in which:
The left element is less than or equal to the right
element.
Both the elements define a closed interval.
Interval represented by any node except the root
is a sub-interval of the parent node.
Elements on the left hand side define a min heap.
Elements on the right hand side define a max heap.
heap and if the element falls to the right of the parent interval, it is considered in the max heap. Further, it is
compared successively and moved from the last node to the root until all the conditions for interval heap are
satisfied. If the element lies within the interval of the parent node itself, the process is stopped then and there itself
and moving of elements does not take place.
The time required for inserting an element depends on the number of movements required to meet all the conditions
and is O(log n).
Deleting an element
Min element: In an interval heap, the minimum element is the element on the left hand side of the root node. This
element is removed and returned. To fill in the vacancy created on the left hand side of the root node, an element
from the last node is removed and reinserted into the root node. This element is then compared successively with
all the left hand elements of the descending nodes and the process stops when all the conditions for an interval
heap are satisfied. If the left hand side element in a node becomes greater than the right hand side element at
any stage, the two elements are swapped and then further comparisons are done. Finally, the root node will again
contain the minimum element on the left hand side.
Max element: In an interval heap, the maximum element is the element on the right hand side of the root node.
This element is removed and returned. To fill in the vacancy created on the right hand side of the root node, an
element from the last node is removed and reinserted into the root node. Further comparisons are carried out on a
similar basis as discussed above. Finally, the root node will again contain the max element on the right hand side.
Thus, with interval heaps, both the minimum and maximum elements can be removed efficiently traversing from
root to leaf. Thus, a DEPQ can be obtained from an interval heap where the elements of the interval heap are the
priorities of elements in the DEPQ.
Time Complexity
Interval Heaps
When DEPQs are implemented using interval heaps consisting of n elements, the time complexities for the various
functions are given in the table below.
Operation       Time Complexity
init( )         O(n)
isEmpty( )      O(1)
getmin( )       O(1)
getmax( )       O(1)
size( )         O(1)
insert(x)       O(log n)
removeMin( )    O(log n)
removeMax( )    O(log n)
Pairing heaps
When DEPQs are implemented using heaps or pairing heaps consisting of n elements, the time complexities for the
various functions are given in the table below. For pairing heaps, the bounds are amortized.
Operation       Time Complexity
isEmpty( )      O(1)
getmin( )       O(1)
getmax( )       O(1)
insert(x)       O(log n)
removeMax( )    O(log n)
removeMin( )    O(log n)
Applications
External sorting
One example application of the double-ended priority queue is external sorting. In an external sort, there are more
elements than can be held in the computer's memory. The elements to be sorted are initially on a disk and the sorted
sequence is to be left on the disk. The external quick sort is implemented using the DEPQ as follows:
1. Read in as many elements as will fit into an internal DEPQ. The elements in the DEPQ will eventually be the
middle group (pivot) of elements.
2. Read in the remaining elements. If the next element is ≤ the smallest element in the DEPQ, output this next
element as part of the left group. If the next element is ≥ the largest element in the DEPQ, output this next
element as part of the right group. Otherwise, remove either the max or min element from the DEPQ (the choice
may be made randomly or alternately); if the max element is removed, output it as part of the right group;
otherwise, output the removed element as part of the left group; insert the newly input element into the DEPQ.
3. Output the elements in the DEPQ, in sorted order, as the middle group.
4. Sort the left and right groups recursively.
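The four steps above can be sketched in Python. This is only an in-memory illustration under stated assumptions: the "disk" is an ordinary list, the `capacity` parameter stands in for the amount of internal memory, and `SortedListDEPQ` is a hypothetical stand-in for a real DEPQ (it keeps a sorted list, so insertion costs O(n) rather than the O(log n) of an interval heap).

```python
import bisect


class SortedListDEPQ:
    """Toy DEPQ backed by a sorted list (hypothetical helper, for illustration only)."""

    def __init__(self):
        self._items = []

    def __len__(self):
        return len(self._items)

    def insert(self, x):
        bisect.insort(self._items, x)

    def get_min(self):
        return self._items[0]

    def get_max(self):
        return self._items[-1]

    def remove_min(self):
        return self._items.pop(0)

    def remove_max(self):
        return self._items.pop()


def external_quick_sort(items, capacity):
    """Sort `items`, pretending that only `capacity` elements fit in memory."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    items = list(items)
    if len(items) <= 1:
        return items

    depq = SortedListDEPQ()
    left, right = [], []   # stand-ins for the groups written back to disk
    take_max = True        # alternate between removing the max and the min

    for x in items:
        if len(depq) < capacity:
            depq.insert(x)                         # step 1: fill the DEPQ
        elif x <= depq.get_min():
            left.append(x)                         # step 2: goes to the left group
        elif x >= depq.get_max():
            right.append(x)                        # step 2: goes to the right group
        else:
            if take_max:
                right.append(depq.remove_max())
            else:
                left.append(depq.remove_min())
            take_max = not take_max
            depq.insert(x)

    middle = []                                    # step 3: drain in sorted order
    while len(depq) > 0:
        middle.append(depq.remove_min())

    # step 4: sort the left and right groups recursively
    return (external_quick_sort(left, capacity)
            + middle
            + external_quick_sort(right, capacity))
```

Because the middle group always holds at least one element, both recursive calls work on strictly smaller inputs, so the sketch terminates for any capacity of at least 1.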
References
[1] Data Structures, Algorithms, & Applications in Java: Double-Ended Priority Queues
(http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c13/double.htm), Sartaj Sahni, 1999.
[2] Fundamentals of Data Structures in C++ - Ellis Horowitz, Sartaj Sahni and Dinesh Mehta
[3] http://www.mhhe.com/engcs/compsci/sahni/enrich/c9/interval.pdf
Soft heap
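In computer science, a soft heap is a variant on the simple heap data structure that has constant amortized time for 5
types of operations. This is achieved by carefully "corrupting" (increasing) the keys of at most a certain number of
values in the heap. The constant time operations are:
create(S): Create a new soft heap
insert(S, x): Insert an element into a soft heap
meld(S, S'): Combine the contents of two soft heaps into one, destroying both
delete(S, x): Delete an element from a soft heap
findmin(S): Get the element with minimum key in the soft heap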
Other heaps such as Fibonacci heaps achieve most of these bounds without any corruption, but cannot provide a
constant-time bound on the critical delete operation. The amount of corruption can be controlled by the choice of a
parameter ε, but the lower this is set, the more time insertions require (O(log 1/ε) for an error rate of ε).
More precisely, the guarantee offered by the soft heap is the following: for a fixed value ε between 0 and 1/2, at any
point in time there will be at most ε·n corrupted keys in the heap, where n is the number of elements inserted so far.
Note that this does not guarantee that only a fixed percentage of the keys currently in the heap are corrupted: in an
unlucky sequence of insertions and deletions, it can happen that all elements in the heap will have corrupted keys.
Similarly, we have no guarantee that in a sequence of elements extracted from the heap with findmin and delete, only
a fixed percentage will have corrupted keys: in an unlucky scenario only corrupted elements are extracted from the
heap.
The soft heap was designed by Bernard Chazelle in 2000. The term "corruption" in the structure is the result of what
Chazelle called "carpooling" in a soft heap. Each node in the soft heap contains a linked-list of keys and one
common key. The common key is an upper bound on the values of the keys in the linked-list. Once a key is added to
the linked-list, it is considered corrupted because its value is never again relevant in any of the soft heap operations:
only the common keys are compared. This is what makes soft heaps "soft"; you can't be sure whether or not any
particular value you put into it will be corrupted. The purpose of these corruptions is effectively to lower the
information entropy of the data, enabling the data structure to break through information-theoretic barriers regarding
heaps.
Applications
Surprisingly, despite its limitations and its unpredictable nature, soft heaps are useful in the design of deterministic
algorithms. They were used to achieve the best complexity to date for finding a minimum spanning tree. They can
also be used to easily build an optimal selection algorithm, as well as near-sorting algorithms, which are algorithms
that place every element near its final position, a situation in which insertion sort is fast.
One of the simplest examples is the selection algorithm. Say we want to find the kth smallest of a group of n numbers.
First, we choose an error rate of 1/3; that is, at most about 33% of the keys we insert will be corrupted. Now, we
insert all n elements into the heap. We call the original values the "correct" keys, and the values stored in the heap
the "stored" keys. At this point, at most n/3 keys are corrupted; that is, for at most n/3 keys the "stored" key is
larger than the "correct" key, and for all the others the stored key equals the correct key.
Next, we delete the minimum element from the heap n/3 times (this is done according to the "stored" key). As the
total number of insertions we have made so far is still n, there are still at most n/3 corrupted keys in the heap.
Accordingly, at least 2n/3 − n/3 = n/3 of the keys remaining in the heap are not corrupted.
Let L be the element with the largest correct key among the elements we removed. The stored key of L is possibly
larger than its correct key (if L was corrupted), and even this larger value is smaller than all the stored keys of the
remaining elements in the heap (as we were removing minimums). Therefore, the correct key of L is smaller than the
remaining n/3 uncorrupted elements in the soft heap. Thus, L divides the elements somewhere between 33%/66%
and 66%/33%. We then partition the set about L using the partition algorithm from quicksort and apply the same
algorithm again to either the set of numbers less than L or the set of numbers greater than L, neither of which can
exceed 2n/3 elements. Since each insertion and deletion requires O(1) amortized time, the total deterministic time is
T(n) = T(2n/3) + O(n). Using case 3 of the master theorem (with ε=1 and c=2/3), we know that T(n) = Θ(n).
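As a cross-check on the master theorem step, the recurrence can also be unrolled directly. The sketch below assumes the O(n) term is bounded by d·n for some constant d (a symbol introduced here for illustration), which turns the running time into a geometric series:

```latex
T(n) \;\le\; T\!\left(\tfrac{2n}{3}\right) + d\,n
     \;\le\; d\,n \sum_{i=0}^{\infty} \left(\tfrac{2}{3}\right)^{i}
     \;=\; \frac{d\,n}{1 - 2/3}
     \;=\; 3\,d\,n
```

This gives T(n) = O(n). Since every call inserts all n of its elements into the heap, T(n) is also at least linear, which agrees with the Θ(n) bound.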
The final algorithm looks like this:
function softHeapSelect(a[1..n], k)
    if k = 1 then return minimum(a[1..n])
    create(S)
    for i from 1 to n
        insert(S, a[i])
    for i from 1 to n/3
        x := findmin(S)
        delete(S, x)
    xIndex := partition(a, x) // Returns new index of pivot x
    if k < xIndex
        return softHeapSelect(a[1..xIndex-1], k)
    else
        return softHeapSelect(a[xIndex..n], k-xIndex+1)
References
Chazelle, B. 2000. The soft heap: an approximate priority queue with optimal error rate
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.5.9705). J. ACM 47, 6 (Nov. 2000), 1012-1027.
Kaplan, H. and Zwick, U. 2009. A simpler implementation and analysis of Chazelle's soft heaps
(http://www.siam.org/proceedings/soda/2009/SODA09_053_kaplanh.pdf). In Proceedings of the Nineteenth Annual
ACM-SIAM Symposium on Discrete Algorithms (New York, New York, January 4-6, 2009). Society for Industrial and
Applied Mathematics, Philadelphia, PA, 477-485.
Class: Search algorithm
Data structure: Array
Worst case performance: O(log n)
Best case performance: O(1)
Average case performance: O(log n)
In computer science, a binary search or half-interval search algorithm finds the position of a specified input value
(the search "key") within an array sorted by key value. In each step, the algorithm compares the search key value
with the key value of the middle element of the array. If the keys match, then a matching element has been found and
its index, or position, is returned. Otherwise, if the search key is less than the middle element's key, then the
algorithm repeats its action on the sub-array to the left of the middle element or, if the search key is greater, on the
sub-array to the right. If the remaining array to be searched is empty, then the key cannot be found in the array and a
special "not found" indication is returned.
A binary search halves the number of items to check with each iteration, so locating an item (or determining its
absence) takes logarithmic time. A binary search is a dichotomic divide and conquer search algorithm.
Overview
Searching a sorted collection is a common task. A dictionary is a sorted list of word definitions. Given a word, one
can find its definition. A telephone book is a sorted list of people's names, addresses, and telephone numbers.
Knowing someone's name allows one to quickly find their telephone number and address.
If the list to be searched contains more than a few items (a dozen, say) a binary search will require far fewer
comparisons than a linear search, but it imposes the requirement that the list be sorted. Similarly, a hash search can
be faster than a binary search but imposes still greater requirements. If the contents of the array are modified between
searches, maintaining these requirements may even take more time than the searches. And if it is known that some
items will be searched for much more often than others, and it can be arranged that these items are at the start of the
list, then a linear search may be the best.
Examples
Example: L = 1 3 4 6 8 9 11. X = 4.
1. Compare X to 6. It's smaller. Repeat with L = 1 3 4.
2. Compare X to 3. It's bigger. Repeat with L = 4.
3. Compare X to 4. It's equal. We're done, we found X.
This is called binary search: with each comparison, the length of the list we are looking in is cut in half.
Therefore, the total number of iterations cannot be greater than ⌊log2 N⌋ + 1.
Word lists
People typically use a mixture of the binary search and interpolation search algorithms when searching a telephone
book: after the initial guess, we exploit the fact that the entries are sorted and can rapidly find the required entry.
For example, when searching for Smith, if Rogers and Thomas have been found, one can flip to a page about halfway
between the previous guesses. If this shows Samson, it can be concluded that Smith is somewhere between the
Samson and Thomas pages, so these can be divided.
Algorithm
Recursive
A straightforward implementation of binary search is recursive. The initial call uses the indices of the entire array to
be searched. The procedure then calculates an index midway between the two indices, determines which of the two
subarrays to search, and then does a recursive call to search that subarray. Each of the calls is tail recursive, so a
compiler need not make a new stack frame for each call. The variables imin and imax are the lowest and highest
inclusive indices that are searched.
int binary_search(int A[], int key, int imin, int imax)
{
    // test if array is empty
    if (imax < imin)
        // set is empty, so return value showing not found
        return KEY_NOT_FOUND;
    else
    {
        // calculate midpoint to cut set in half
        int imid = midpoint(imin, imax);
        // three-way comparison
        if (A[imid] > key)
            // key is in lower subset
            return binary_search(A, key, imin, imid-1);
        else if (A[imid] < key)
            // key is in upper subset
            return binary_search(A, key, imid+1, imax);
        else
            // key has been found
            return imid;
    }
}
It is invoked with initial imin and imax values of 0 and N-1 for a zero based array of length N.
The number type "int" shown in the code has an influence on how the midpoint calculation can be implemented
correctly. With unlimited numbers, the midpoint can be calculated as "(imin + imax) / 2". In practical
programming, however, the calculation is often performed with numbers of a limited range, and then the
intermediate result "(imin + imax)" might overflow. With limited numbers, the midpoint can be calculated
correctly as "imin + ((imax - imin) / 2)".
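The difference between the two midpoint formulas can be illustrated in Python. Python integers do not overflow, so this sketch simulates a 32-bit signed type with a hypothetical helper `to_int32` that wraps values in two's-complement fashion. The wraparound is an assumption for illustration: in C, signed overflow is undefined behaviour, although wrapping is what is commonly observed.

```python
def to_int32(x):
    """Wrap an integer into the signed 32-bit range (simulated overflow)."""
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x


def naive_midpoint_int32(imin, imax):
    # (imin + imax) is computed first, so the intermediate sum can wrap.
    # Note: Python's // floors, whereas C truncates toward zero; the two
    # only differ for odd negative sums.
    return to_int32(imin + imax) // 2


def safe_midpoint(imin, imax):
    # (imax - imin) never exceeds the range when 0 <= imin <= imax.
    return imin + (imax - imin) // 2


imin, imax = 1_500_000_000, 2_000_000_000
print(naive_midpoint_int32(imin, imax))  # negative: the sum wrapped around
print(safe_midpoint(imin, imax))         # lies between imin and imax
```

Both index values fit comfortably in 32 bits, yet only the second formula yields a midpoint inside the search range.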
Iterative
The binary search algorithm can also be expressed iteratively with two index limits that progressively narrow the
search range.
int binary_search(int A[], int key, int imin, int imax)
{
    // continue searching while [imin,imax] is not empty
    while (imax >= imin)
    {
        // calculate the midpoint for roughly equal partition
        int imid = midpoint(imin, imax);
        // determine which subarray to search
        if (A[imid] < key)
            // change min index to search upper subarray
            imin = imid + 1;
        else if (A[imid] > key)
            // change max index to search lower subarray
            imax = imid - 1;
        else
            // key found at index imid
            return imid;
    }
    // key not found
    return KEY_NOT_FOUND;
}
Performance
With each test that fails to find a match at the probed position, the search is continued with one or other of the two
sub-intervals, each at most half the size. More precisely, if the number of items, N, is odd then both sub-intervals will
contain (N−1)/2 elements, while if N is even then the two sub-intervals contain N/2−1 and N/2 elements.
If the original number of items is N then after the first iteration there will be at most N/2 items remaining, then at
most N/4 items, at most N/8 items, and so on. In the worst case, when the value is not in the list, the algorithm must
continue iterating until the span has been made empty; this will have taken at most ⌊log2(N)⌋+1 iterations, where the
⌊ ⌋ notation denotes the floor function that rounds its argument down to an integer. This worst case analysis is tight:
for any N there exists a query that takes exactly ⌊log2(N)⌋+1 iterations. When compared to linear search, whose
worst-case behaviour is N iterations, we see that binary search is substantially faster as N grows large. For example,
to search a list of one million items takes as many as one million iterations with linear search, but never more than
twenty iterations with binary search. However, a binary search can only be performed if the list is in sorted order.
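The worst-case bound can be checked empirically with a small Python sketch. The function names are made up for this illustration; the searched array is assumed to be the integers 0 to n−1, so that the element at index i is simply i.

```python
def count_probes(n, key):
    """Number of loop iterations binary search needs on the array 0..n-1."""
    lo, hi, probes = 0, n - 1, 0
    while lo <= hi:
        probes += 1
        mid = lo + (hi - lo) // 2   # the value stored at index mid is mid
        if mid < key:
            lo = mid + 1
        elif mid > key:
            hi = mid - 1
        else:
            return probes
    return probes


def worst_case_probes(n):
    """Maximum over all present keys and the absent keys -1 and n."""
    return max(count_probes(n, key) for key in range(-1, n + 1))


# For a positive integer n, n.bit_length() equals floor(log2(n)) + 1.
for n in (1, 2, 7, 8, 1000):
    assert worst_case_probes(n) == n.bit_length()
```

For one million items, `(1_000_000).bit_length()` is 20, in line with the figure of twenty iterations quoted above.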
Average performance
log2(N)−1 is the expected number of probes in an average successful search, and the worst case is log2(N), just one
more probe.[citation needed] If the list is empty, no probes at all are made. Thus binary search is a logarithmic
algorithm and executes in O(log N) time. In most cases it is considerably faster than a linear search. It can be
implemented using iteration, or recursion. In some languages it is more elegantly expressed recursively; however, in
some C-based languages tail recursion is not eliminated and the recursive version requires more stack space.
Binary search can interact poorly with the memory hierarchy (i.e. caching), because of its random-access nature. For
in-memory searching, if the span to be searched is small, a linear search may have superior performance simply
because it exhibits better locality of reference. For external searching, care must be taken or each of the first several
probes will lead to a disk seek. A common method is to abandon binary searching for linear searching as soon as the
size of the remaining span falls below a small value such as 8 or 16 or even more in recent computers. The exact
value depends entirely on the machine running the algorithm.
Notice that for multiple searches with a fixed value for N, then (with the appropriate regard for integer division), the
first iteration always selects the middle element at N/2, and the second always selects either N/4 or 3N/4, and so on.
Thus if the array's key values are in some sort of slow storage (on a disc file, in virtual memory, not in the cpu's
on-chip memory), keeping those three keys in a local array for a special preliminary search will avoid accessing
widely separated memory. Escalating to seven or fifteen such values will allow further levels at not much cost in
storage. On the other hand, if the searches are frequent and not separated by much other activity, the computer's
various storage control features will more or less automatically promote frequently accessed elements into faster
storage.
When multiple binary searches are to be performed for the same key in related lists, fractional cascading can be used
to speed up successive searches after the first one.
Even though in theory binary search is almost always faster than linear search, in practice, even on medium-sized
arrays (around 100 items or fewer), binary search may not be worth using. On larger arrays, it only makes
sense to use binary search if the number of searches is large enough, because the initial time to sort the array is
comparable to that of many linear searches.
Variations
There are many variations of binary search, and they are easily confused. Also, using a binary search within a sorting
method is debatable.
Search domain
There is no particular requirement that the array being searched has the bounds 1 to N. It is possible to search a
specified range, elements first to last instead of 1 to N. All that is necessary is that the initialization of the bounds be
L := first−1 and R := last+1, then all proceeds as before.
The elements of the list are not necessarily all unique. If one searches for a value that occurs multiple times in the
list, the index returned will be of the first-encountered equal element, and this will not necessarily be that of the first,
last, or middle element of the run of equal-key elements but will depend on the positions of the values. Modifying
the list even in seemingly unrelated ways such as adding elements elsewhere in the list may change the result.
If the location of the first and/or last equal element needs to be determined, this can be done efficiently with a variant
of the binary search algorithms which perform only one inequality test per iteration.
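A minimal Python sketch of such a variant follows. It is one possible formulation, not necessarily the one the text has in mind: each loop iteration makes a single inequality test, and equality is checked once at the end. The function names and the −1 "not found" value are choices made for this illustration.

```python
def first_index(a, key):
    """Index of the first element equal to key in sorted list a, or -1."""
    lo, hi = 0, len(a)              # search the half-open range [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if a[mid] < key:            # the only comparison per iteration
            lo = mid + 1
        else:
            hi = mid
    return lo if lo < len(a) and a[lo] == key else -1


def last_index(a, key):
    """Index of the last element equal to key in sorted list a, or -1."""
    lo, hi = 0, len(a)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if a[mid] <= key:           # the only comparison per iteration
            lo = mid + 1
        else:
            hi = mid
    return lo - 1 if lo > 0 and a[lo - 1] == key else -1
```

With a = [1, 3, 4, 4, 4, 9], the run of 4s is delimited by first_index(a, 4) and last_index(a, 4), regardless of where other elements are added to the list.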
Noisy search
Several algorithms closely related to or extending binary search exist. For instance, noisy binary search solves the
same class of problems as regular binary search, with the added complexity that any given test can return a false value
at random. (Usually, the number of such erroneous results is bounded in some way, either in the form of an average
error rate, or in the total number of errors allowed per element in the search space.) Optimal algorithms for several
classes of noisy binary search problems have been known since the late seventies, and more recently, optimal
algorithms for noisy binary search in quantum computers (where several elements can be tested at the same time)
have been discovered.
Implementation issues
"Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly
tricky." (Donald Knuth)
When Jon Bentley assigned it as a problem in a course for professional programmers, he found that an astounding
ninety percent failed to code a binary search correctly after several hours of working on it, and another study shows
that accurate code for it is only found in five out of twenty textbooks.[1] Furthermore, Bentley's own implementation
of binary search, published in his 1986 book Programming Pearls, contains an error that remained undetected for
over twenty years.
Arithmetic
In a practical implementation, the variables used to represent the indices will often be of finite size, hence only
capable of representing a finite range of values. For example, 32-bit unsigned integers can only hold values from 0 to
4294967295. 32-bit signed integers can only hold values from -2147483648 to 2147483647. If the binary search
algorithm is to operate on large arrays, this has two implications:
The values first − 1 and last + 1 must both be representable within the finite bounds of the chosen
integer type. Therefore, continuing the 32-bit unsigned example, the largest value that last may take is
+4294967294, not +4294967295. A problem exists even for the "inclusive" form of the method, as if x >
A(4294967295).Key, then on the final iteration the algorithm will attempt to store 4294967296 into L and
fail. Equivalent issues apply to the lower limit, where first − 1 could become negative as when the first
element of the array is at index zero.
If the midpoint of the span is calculated as p := (L + R)/2, then the value (L + R) will exceed the
number range if last is greater than (for unsigned) 4294967295/2 or (for signed) 2147483647/2 and the search
wanders toward the upper end of the search space. This can be avoided by performing the calculation as p :=
(R - L)/2 + L. For example, this bug existed in the Java SDK's Arrays.binarySearch() from version 1.2 to 5.0
and was fixed in 6.0.
Language support
Many standard libraries provide a way to do a binary search:
C provides algorithm function bsearch in its standard library.
C++'s STL provides algorithm functions binary_search, lower_bound and upper_bound.
Java offers a set of overloaded binarySearch() static methods in the classes Arrays [2] and
Collections [3] in the standard java.util package for performing binary searches on Java arrays and on
Lists, respectively. They must be arrays of primitives, or the arrays or Lists must be of a type that implements
the Comparable interface, or you must specify a custom Comparator object.
Microsoft's .NET Framework 2.0 offers static generic versions of the binary search algorithm in its collection base
classes. An example would be System.Array's method BinarySearch<T>(T[] array, T value).
Python provides the bisect [4] module.
COBOL can perform binary search on internal tables using the SEARCH ALL statement.
Perl can perform a generic binary search using the CPAN module Search::Binary.
Go's sort standard library package contains functions Search, SearchInts, SearchFloat64s, and
SearchStrings, which implement general binary search, as well as specific implementations for searching
slices of integers, floating-point numbers, and strings, respectively.
For Objective-C, the Cocoa framework provides the NSArray
-indexOfObject:inSortedRange:options:usingComparator: [5] method in Mac OS X 10.6+. Apple's Core
Foundation C framework also contains a CFArrayBSearchValues() [6] function.
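As an illustration of one of these library facilities, the following sketch uses Python's bisect module. The wrapper `index_of` and its −1 "not found" convention are assumptions made for this example; `bisect_left`, `bisect_right` and `insort` are functions of the standard module.

```python
import bisect


def index_of(sorted_list, key):
    """Return an index of key in sorted_list, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_list, key)   # leftmost insertion point
    if i < len(sorted_list) and sorted_list[i] == key:
        return i
    return -1


values = [1, 3, 4, 6, 8, 9, 11]
position = index_of(values, 4)      # found at index 2
missing = index_of(values, 5)       # not present

# insort keeps the list sorted while inserting, so later searches stay valid.
bisect.insort(values, 5)
```

bisect_left and bisect_right return insertion points rather than a found/not-found result, so the equality check in the wrapper is what turns them into a lookup.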
References
[1] cited at
[2] http://download.oracle.com/javase/7/docs/api/java/util/Arrays.html
[3] http://download.oracle.com/javase/7/docs/api/java/util/Collections.html
[4] http://docs.python.org/library/bisect.html
[5] http://developer.apple.com/library/mac/documentation/Cocoa/Reference/Foundation/Classes/NSArray_Class/NSArray.html#//apple_ref/occ/instm/NSArray/indexOfObject:inSortedRange:options:usingComparator:
[6] http://developer.apple.com/library/mac/documentation/CoreFoundation/Reference/CFArrayRef/Reference/reference.html#//apple_ref/c/func/CFArrayBSearchValues
Other sources
Kruse, Robert L.: "Data Structures and Program Design in C++", Prentice-Hall, 1999, ISBN 0-13-768995-0, page
280.
van Gasteren, Netty; Feijen, Wim (1995). "The Binary Search Revisited" (http://www.mathmeth.com/wf/files/
wf2xx/wf214.pdf) (PDF). AvG127/WF214. (investigates the foundations of the binary search, debunking the
myth that it applies only to sorted arrays)
External links
NIST Dictionary of Algorithms and Data Structures: binary search (http://www.nist.gov/dads/HTML/
binarySearch.html)
Binary search implemented in 12 languages (http://www.codecodex.com/wiki/Binary_search)
Type: Tree
Time complexity in big O notation
          Average    Worst case
Space     O(n)       O(n)
The major advantage of binary search trees over other data structures is that the related sorting algorithms and search
algorithms such as in-order traversal can be very efficient.
Binary search trees are a fundamental data structure used to construct more abstract data structures such as sets,
multisets, and associative arrays.
Binary-search-tree property
Let x be a node in a binary search tree. If y is a node in the left subtree of x, then y.key < x.key. If y is a node in the
right subtree of x, then y.key > x.key.
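The property can be turned into a short checker. The sketch below assumes a minimal hypothetical `Node` class with `key`, `left` and `right` attributes, and uses the strict inequalities exactly as stated above. It passes down the open interval of keys that each subtree is allowed to contain, because comparing a node only with its immediate children is not sufficient.

```python
class Node:
    """Minimal tree node (hypothetical, for illustration only)."""

    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def is_bst(node, low=None, high=None):
    """True if every key in the subtree lies strictly between low and high
    and the binary-search-tree property holds at every node."""
    if node is None:
        return True
    if low is not None and node.key <= low:
        return False
    if high is not None and node.key >= high:
        return False
    # Keys in the left subtree must be < node.key; keys in the right, > node.key.
    return (is_bst(node.left, low, node.key)
            and is_bst(node.right, node.key, high))


valid = Node(8, Node(3, Node(1), Node(6)), Node(10))
invalid = Node(8, Node(3, None, Node(9)), Node(10))   # 9 sits in the left subtree of 8
```

Here is_bst(valid) holds, while is_bst(invalid) does not: 9 is a legal right child of 3 locally, but it violates the property with respect to the root.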
Operations
Operations on a binary search tree require comparisons between nodes. These comparisons are made with calls to a
comparator, which is a subroutine that computes the total order (linear order) on any two keys. This comparator can
be explicitly or implicitly defined, depending on the language in which the BST is implemented.
Searching
Searching a binary search tree for a specific key can be a recursive or iterative process.
We begin by examining the root node. If the tree is null, the key we are searching for does not exist in the tree.
Otherwise, if the key equals that of the root, the search is successful. If the key is less than the root, search the left
subtree. Similarly, if it is greater than the root, search the right subtree. This process is repeated until the key is found
or the remaining subtree is null. If the searched key is not found before a null subtree is reached, then the item must
not be present in the tree. This is easily expressed as a recursive algorithm:
algorithm Find-recursive(key, node): // call initially with node = root
    if node = Nil or node.key = key then
        return node
    else if key < node.key then
        return Find-recursive(key, node.left)
    else
        return Find-recursive(key, node.right)
The same algorithm can be implemented iteratively:
algorithm Find(key, root):
    current-node := root
    while current-node is not Nil do
        if current-node.key = key then
            return current-node
        else if key < current-node.key then
            current-node := current-node.left
        else
            current-node := current-node.right
    return Nil // key not found
Because in the worst case this algorithm must search from the root of the tree to the leaf farthest from the root, the
search operation takes time proportional to the tree's height (see tree terminology). On average, trees with n nodes
have O(log n) height, but in the worst case have O(n) height, when the unbalanced tree resembles a linked list
(degenerate tree).
Insertion
Insertion begins as a search would begin; if the key is not equal to that of the root, we search the left or right subtrees
as before. Eventually, we will reach an external node and add the new key-value pair (here encoded as a record
'newNode') as its right or left child, depending on the node's key. In other words, we examine the root and
recursively insert the new node to the left subtree if its key is less than that of the root, or the right subtree if its key
is greater than or equal to the root.
Here's how a typical binary search tree insertion might be performed in a non-empty tree in C++:
// Node is assumed to have members key, leftChild and rightChild,
// and a constructor taking the key.
void insert(Node* node, int value) {
    if (value < node->key) {
        if (node->leftChild == nullptr)
            node->leftChild = new Node(value);
        else
            insert(node->leftChild, value);
    } else {
        if (node->rightChild == nullptr)
            node->rightChild = new Node(value);
        else
            insert(node->rightChild, value);
    }
}
The above destructive procedural variant modifies the tree in place. It uses only constant heap space (and the
iterative version uses constant stack space as well), but the prior version of the tree is lost. Alternatively, as in the
following Python example, we can reconstruct all ancestors of the inserted node; any reference to the original tree
root remains valid, making the tree a persistent data structure:
def binary_tree_insert(node, key, value):
    if node is None:
        return TreeNode(None, key, value, None)
    if key == node.key:
        return TreeNode(node.left, key, value, node.right)
    if key < node.key:
        return TreeNode(binary_tree_insert(node.left, key, value),
                        node.key, node.value, node.right)
    else:
        return TreeNode(node.left, node.key, node.value,
                        binary_tree_insert(node.right, key, value))
The part that is rebuilt uses O(log n) space in the average case and O(n) in the worst case (see big-O notation).
In either version, this operation requires time proportional to the height of the tree in the worst case, which is O(log
n) time in the average case over all trees, but O(n) time in the worst case.
Another way to explain insertion is that in order to insert a new node in the tree, its key is first compared with that of
the root. If its key is less than the root's, it is then compared with the key of the root's left child. If its key is greater, it
is compared with the root's right child. This process continues, until the new node is compared with a leaf node, and
then it is added as this node's right or left child, depending on its key.
There are other ways of inserting nodes into a binary tree, but this is the only way of inserting nodes at the leaves
and at the same time preserving the BST structure.
Deletion
There are three possible cases to consider:
Deleting a leaf (node with no children): Deleting a leaf is easy, as we can simply remove it from the tree.
Deleting a node with one child: Remove the node and replace it with its child.
Deleting a node with two children: Call the node to be deleted N. Do not delete N. Instead, choose either its
in-order successor node or its in-order predecessor node, R. Replace the value of N with the value of R, then
delete R.
As with all binary trees, a node's in-order successor is the left-most child of its right subtree, and a node's in-order
predecessor is the right-most child of its left subtree. In either case, this node will have zero or one children. Delete it
according to one of the two simpler cases above.
Deleting a node with two children from a binary search tree. First the rightmost node in the left subtree, the inorder predecessor 6, is identified. Its
value is copied into the node being deleted. The inorder predecessor can then be easily deleted because it has at most one child. The same method
works symmetrically using the inorder successor labelled 9.
Consistently using the in-order successor or the in-order predecessor for every instance of the two-child case can
lead to an unbalanced tree, so some implementations select one or the other at different times.
Runtime analysis: Although this operation does not always traverse the tree down to a leaf, this is always a
possibility; thus in the worst case it requires time proportional to the height of the tree. It does not require more even
when the node has two children, since it still follows a single path and does not visit any node twice.
def find_min(self):
    # Gets minimum node (leftmost node) in a subtree
    current_node = self
    while current_node.left_child:
        current_node = current_node.left_child
    return current_node

def replace_node_in_parent(self, new_value=None):
    if self.parent:
        if self == self.parent.left_child:
            self.parent.left_child = new_value
        else:
            self.parent.right_child = new_value
    if new_value:
        new_value.parent = self.parent

def binary_tree_delete(self, key):
    if key < self.key:
        self.left_child.binary_tree_delete(key)
    elif key > self.key:
        self.right_child.binary_tree_delete(key)
    else:  # delete the key here
        if self.left_child and self.right_child:  # if both children are present
            successor = self.right_child.find_min()
            self.key = successor.key
            successor.binary_tree_delete(successor.key)
        elif self.left_child:  # if the node has only a *left* child
            self.replace_node_in_parent(self.left_child)
        elif self.right_child:  # if the node has only a *right* child
            self.replace_node_in_parent(self.right_child)
        else:  # this node has no children
            self.replace_node_in_parent(None)
Traversal
Once the binary search tree has been created, its elements can be retrieved in-order by recursively traversing the left
subtree of the root node, accessing the node itself, then recursively traversing the right subtree of the node,
continuing this pattern with each node in the tree as it's recursively accessed. As with all binary trees, one may
conduct a pre-order traversal or a post-order traversal, but neither are likely to be useful for binary search trees. An
in-order traversal of a binary search tree will always result in a sorted list of node items (numbers, strings or other
comparable items).
The code for in-order traversal in Python is given below. It will call callback for every node in the tree.
def traverse_binary_tree(node, callback):
    if node is None:
        return
    traverse_binary_tree(node.leftChild, callback)
    callback(node.value)
    traverse_binary_tree(node.rightChild, callback)
Traversal requires Ω(n) time, since it must visit every node. The algorithm above runs in O(n) time, so it is
asymptotically optimal.
Sort
A binary search tree can be used to implement a simple but efficient sorting algorithm. Similar to heapsort, we insert
all the values we wish to sort into a new ordered data structure, in this case a binary search tree, and then traverse
it in order, building our result:
def build_binary_tree(values):
    tree = None
    for v in values:
        tree = binary_tree_insert(tree, v)
    return tree

def get_inorder_traversal(root):
    '''
    Returns a list containing all the values in the tree, starting at
    *root*.
    Traverses the tree in-order (leftChild, root, rightChild).
    '''
    result = []
    traverse_binary_tree(root, lambda element: result.append(element))
    return result
The worst-case time of build_binary_tree is $O(n^2)$: if you feed it a sorted list of values, it chains them
into a linked list with no left subtrees. For example, build_binary_tree([1, 2, 3, 4, 5]) yields the
tree (1 (2 (3 (4 (5))))).
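The degenerate case can be observed directly. The following is a minimal sketch, not the code used above: the Node class and the insert, height and build helpers are hypothetical names introduced only for this illustration, and insertion is done without any rebalancing.

```python
class Node:
    """Hypothetical minimal BST node, introduced only for this sketch."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key iteratively (no rebalancing) and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right


def height(root):
    """Number of nodes on the longest root-to-leaf path (0 for an empty tree)."""
    best, stack = 0, [(root, 1)]
    while stack:
        node, depth = stack.pop()
        if node is None:
            continue
        best = max(best, depth)
        stack.append((node.left, depth + 1))
        stack.append((node.right, depth + 1))
    return best


def build(values):
    """Insert the values one at a time, in the given order."""
    root = None
    for v in values:
        root = insert(root, v)
    return root


sorted_height = height(build([1, 2, 3, 4, 5]))   # every node is a right child
mixed_height = height(build([3, 1, 4, 2, 5]))    # same keys, different order
```

With sorted input every insertion walks the whole chain built so far, which is where the quadratic total cost comes from; the same five keys in a different order give a much shallower tree.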
There are several schemes for overcoming this flaw with simple binary trees; the most common is the self-balancing
binary search tree. If this same procedure is done using such a tree, the overall worst-case time is O(n log n), which is
asymptotically optimal for a comparison sort. In practice, the poor cache performance and added overhead in time
and space for a tree-based sort (particularly for node allocation) make it inferior to other asymptotically optimal sorts
such as heapsort for static list sorting. On the other hand, it is one of the most efficient methods of incremental
sorting, adding items to a list over time while keeping the list sorted at all times.
Types
There are many types of binary search trees. AVL trees and red-black trees are both forms of self-balancing binary
search trees. A splay tree is a binary search tree that automatically moves frequently accessed elements nearer to the
root. In a treap (tree heap), each node also holds a (randomly chosen) priority and the parent node has higher priority
than its children. Tango trees are trees optimized for fast searches.
Two other terms used to describe binary search trees are complete and degenerate.
A complete tree is a tree with n levels, where for each level d <= n - 1, the number of existing nodes at level d is
equal to $2^d$. This means all possible nodes exist at these levels. An additional requirement for a complete binary tree
is that for the nth level, while every node does not have to exist, the nodes that do exist must fill from left to right.
A degenerate tree is a tree where each parent node has only one associated child node. As a consequence, in a
performance measurement the tree will essentially behave like a linked list data structure.
Performance comparisons
D. A. Heger (2004) presented a performance comparison of binary search trees. Treap was found to have the best
average performance, while red-black tree was found to have the smallest amount of performance variations.
Further reading
Paul E. Black, Binary Search Tree (http://www.nist.gov/dads/HTML/binarySearchTree.html) at the NIST
Dictionary of Algorithms and Data Structures.
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "12: Binary search trees,
15.5: Optimal binary search trees". Introduction to Algorithms (2nd ed.). MIT Press & McGraw-Hill.
pp. 253–272, 356–363. ISBN 0-262-03293-7.
Jarc, Duane J. (3 December 2005). "Binary Tree Traversals" (http://nova.umuc.edu/~jarc/idsv/lesson1.html).
Interactive Data Structure Visualizations. University of Maryland.
Knuth, Donald (1997). "6.2.2: Binary Tree Searching". The Art of Computer Programming. 3: "Sorting and
Searching" (3rd ed.). Addison-Wesley. pp. 426–458. ISBN 0-201-89685-0.
Long, Sean. "Binary Search Tree" (http://employees.oneonta.edu/zhangs/PowerPointPlatform/resources/
samples/binarysearchtree.ppt) (PPT). Data Structures and Algorithms Visualization-A PowerPoint Slides Based
Approach. SUNY Oneonta.
Parlante, Nick (2001). "Binary Trees" (http://cslibrary.stanford.edu/110/BinaryTrees.html). CS Education
Library. Stanford University.
External links
Literate implementations of binary search trees in various languages (http://en.literateprograms.org/
Category:Binary_search_tree) on LiteratePrograms
Goleta, Maksim (27 November 2007). "Goletas.Collections" (http://goletas.com/csharp-collections/).
goletas.com. Includes an iterative C# implementation of AVL trees.
Jansens, Dana. "Persistent Binary Search Trees" (http://cg.scs.carleton.ca/~dana/pbst). Computational
Geometry Lab, School of Computer Science, Carleton University. C implementation using GLib.
Kovac, Kubo. "Binary Search Trees" (http://people.ksp.sk/~kuko/bak/) (Java applet). Korešpondenčný
seminár z programovania.
Madru, Justin (18 August 2009). "Binary Search Tree" (http://jdserver.homelinux.org/wiki/
Binary_Search_Tree). JDServer. C++ implementation.
Tarreau, Willy (2011). "Elastic Binary Trees (ebtree)" (http://1wt.eu/articles/ebtree/). 1wt.eu.
Binary Search Tree Example in Python (http://code.activestate.com/recipes/286239/)
"References to Pointers (C++)" (http://msdn.microsoft.com/en-us/library/1sf8shae(v=vs.80).aspx). MSDN.
Microsoft. 2005. Gives an example binary tree implementation.
Igushev, Eduard. "Binary Search Tree C++ implementation" (http://igushev.com/implementations/
binary-search-tree-cpp/).
Stromberg, Daniel. "Python Search Tree Empirical Performance Comparison" (http://stromberg.dnsalias.org/
~strombrg/python-tree-and-heap-comparison/).
Random binary tree
In computer science and probability theory, a random binary tree refers to a binary tree selected at random from
some probability distribution on binary trees. Two different distributions are commonly used: binary trees formed by
inserting nodes one at a time according to a random permutation, and binary trees chosen from a uniform discrete
distribution in which all distinct trees are equally likely. It is also possible to form other distributions, for instance by
repeated splitting. Adding and removing nodes directly in a random binary tree will in general disrupt its random
structure, but the treap and related randomized binary search tree data structures use the principle of binary trees
formed from a random permutation in order to maintain a balanced binary search tree dynamically as nodes are
inserted and deleted.
For random trees that are not necessarily binary, see random tree.
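As an informal illustration of the first distribution, the sketch below computes the height of the binary search tree that results from inserting a random permutation. The function name bst_height, the fixed seed and the sample sizes are arbitrary choices made for this example, not part of the source.

```python
import random


def bst_height(perm):
    """Height (in nodes) of the BST obtained by inserting perm in order.

    The first key becomes the root; the remaining keys are split into those
    that go left and those that go right, keeping their insertion order.
    """
    if not perm:
        return 0
    root = perm[0]
    smaller = [x for x in perm[1:] if x < root]
    larger = [x for x in perm[1:] if x > root]
    return 1 + max(bst_height(smaller), bst_height(larger))


rng = random.Random(0)        # fixed seed so that the run is repeatable
n = 500
keys = list(range(n))
heights = []
for _ in range(20):
    rng.shuffle(keys)         # a fresh random permutation of the same keys
    heights.append(bst_height(keys))
average_height = sum(heights) / len(heights)
```

The observed heights stay far below n, in line with the logarithmic behaviour described for trees built from random permutations, whereas a sorted insertion order gives height n.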
where $\beta$ is the unique number in the range $0 < \beta < 1$ satisfying the equation $2\beta e^{1-\beta} = 1$.[2]
Notes
[1]
[2]
[3]
[4]
[5]
[6]
[7]
; ; , p. 75.
; ; ; , pp. 9199; .
;.
, p. 15.
. That it is at most logarithmic is trivial, because the Strahler number of every tree is bounded by the logarithm of the number of its nodes.
, p. 63.
, p. 70.
References
Aldous, David (1996), "Probability distributions on cladograms", in Aldous, David; Pemantle, Robin, Random
Discrete Structures, The IMA Volumes in Mathematics and its Applications 76, Springer-Verlag, pp. 1–18.
Devroye, Luc (1986), "A note on the height of binary search trees", Journal of the ACM 33 (3): 489–498, doi:
10.1145/5925.5930 (http://dx.doi.org/10.1145/5925.5930).
Devroye, Luc; Kruszewski, Paul (1995), "A note on the Horton-Strahler number for random trees", Information
Processing Letters 56 (2): 95–99, doi: 10.1016/0020-0190(95)00114-R (http://dx.doi.org/10.1016/
0020-0190(95)00114-R).
Devroye, Luc; Kruszewski, Paul (1996), "The botanical beauty of random binary trees", in Brandenburg, Franz J.,
Graph Drawing: 3rd Int. Symp., GD'95, Passau, Germany, September 20-22, 1995, Lecture Notes in Computer
Science 1027, Springer-Verlag, pp. 166–177, doi: 10.1007/BFb0021801 (http://dx.doi.org/10.1007/
BFb0021801), ISBN 3-540-60723-4.
Drmota, Michael (2009), Random Trees: An Interplay between Combinatorics and Probability, Springer-Verlag,
ISBN 978-3-211-75355-2.
Flajolet, P.; Raoult, J. C.; Vuillemin, J. (1979), "The number of registers required for evaluating arithmetic
expressions", Theoretical Computer Science 9 (1): 99–125, doi: 10.1016/0304-3975(79)90009-4 (http://dx.doi.
org/10.1016/0304-3975(79)90009-4).
Hibbard, T. (1962), "Some combinatorial properties of certain trees with applications to searching and sorting",
Journal of the ACM 9 (1): 13–28, doi: 10.1145/321105.321108 (http://dx.doi.org/10.1145/321105.321108).
Knuth, Donald E. (1973), "6.2.2 Binary Tree Searching", The Art of Computer Programming III,
Addison-Wesley, pp. 422–451.
Knuth, Donald E. (2005), "Draft of Section 7.2.1.6: Generating All Trees" (http://www-cs-faculty.stanford.
edu/~knuth/fasc4a.ps.gz), The Art of Computer Programming IV.
Mahmoud, Hosam M. (1992), Evolution of Random Search Trees, John Wiley & Sons.
Martinez, Conrado; Roura, Salvador (1998), "Randomized binary search trees" (http://citeseer.ist.psu.edu/
article/martinez97randomized.html), Journal of the ACM (ACM Press) 45 (2): 288–323, doi:
10.1145/274787.274812 (http://dx.doi.org/10.1145/274787.274812).
Pittel, B. (1985), "Asymptotical growth of a class of random trees", Annals of Probability 13 (2): 414–427, doi:
10.1214/aop/1176993000 (http://dx.doi.org/10.1214/aop/1176993000).
Reed, Bruce (2003), "The height of a random binary search tree", Journal of the ACM 50 (3): 306–332, doi:
10.1145/765568.765571 (http://dx.doi.org/10.1145/765568.765571).
Robson, J. M. (1979), "The height of binary search trees", Australian Computer Journal 11: 151–153.
Seidel, Raimund; Aragon, Cecilia R. (1996), "Randomized Search Trees" (http://citeseer.ist.psu.edu/
seidel96randomized.html), Algorithmica 16 (4/5): 464–497, doi: 10.1007/s004539900061 (http://dx.doi.org/
10.1007/s004539900061).
External links
Open Data Structures - Chapter 7 - Random Binary Search Trees (http://opendatastructures.org/versions/
edition-0.1e/ods-java/7_Random_Binary_Search_Tree.html)
Tree rotation
In discrete mathematics, tree rotation is an operation on a binary tree that changes the structure without interfering
with the order of the elements. A tree rotation moves one node up in the tree and one node down. It is used to change
the shape of the tree, and in particular to decrease its height by moving smaller subtrees down and larger subtrees
up, resulting in improved performance of many tree operations.
Figure: Generic tree rotations.
Descriptions differ on how the direction of a rotation is defined. Some say that the direction is the side toward which
the tree nodes are shifted, whilst others say that it is the side of the child that takes the root's place (the opposite
of the former). This article uses the first convention: the direction is the side to which the nodes are shifted.
Illustration
The right rotation operation as shown in the image above is performed with Q as the root and hence is a right
rotation on, or rooted at, Q. This operation results in a rotation of the tree in the clockwise direction. The inverse
operation is the left rotation, which results in a movement in a counter-clockwise direction (the left rotation shown
above is rooted at P). The key to understanding how a rotation functions is to understand its constraints. In particular
the order of the leaves of the tree (when read left to right for example) cannot change (another way to think of it is
that the order that the leaves would be visited in an in-order traversal must be the same after the operation as before).
Another constraint is the main property of a binary search tree, namely that the right child is greater than the parent
and the left child is less than the parent. Notice that the right child of a left child of the root of a sub-tree (for
example node B in the diagram for the tree rooted at Q) can become the left child of the root, that itself becomes the
right child of the "new" root in the rotated sub-tree, without violating either of those constraints. As you can see in
the diagram, the order of the leaves doesn't change. The opposite operation also preserves the order and is the second
kind of rotation.
Assuming this is a binary search tree, as stated above, the elements must be interpreted as variables that can be
compared to each other. The alphabetic characters above are used as placeholders for these variables.
Detailed illustration
When a subtree is rotated, the side toward which it is rotated increases its height by one node while the opposite
side decreases its height by one node. This makes tree rotations useful for rebalancing a tree.
The following terminology is used: Root is the parent node of the subtrees to rotate, Pivot is the node which will
become the new parent node, RS is the side toward which the rotation is made, and OS is the opposite side. In
the above diagram for the root Q, the RS is C and the OS is P. The pseudocode for the rotation is:
Pivot = Root.OS
Root.OS = Pivot.RS
Pivot.RS = Root
Root = Pivot
This is a constant time operation.
The programmer must also make sure that the root's parent points to the pivot after the rotation. Also, the
programmer should note that this operation may result in a new root for the entire tree and take care to update
pointers accordingly.
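The pseudocode and the remark about the root's parent can be combined in a short sketch. It assumes a hypothetical Node class with left, right and parent links (none of these names come from the text above) and shows only the right rotation; the left rotation is the mirror image.

```python
class Node:
    """Hypothetical node with child and parent links, for illustration only."""
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right
        self.parent = None
        for child in (left, right):
            if child is not None:
                child.parent = self


def rotate_right(root):
    """Rotate right at root and return the pivot, the new subtree root."""
    pivot = root.left                   # Pivot = Root.OS
    if pivot is None:
        raise ValueError("a right rotation needs a left child")
    root.left = pivot.right             # Root.OS = Pivot.RS
    if root.left is not None:
        root.left.parent = root
    pivot.right = root                  # Pivot.RS = Root
    # Make the old parent of root (if any) point to the pivot.
    pivot.parent = root.parent
    if pivot.parent is not None:
        if pivot.parent.left is root:
            pivot.parent.left = pivot
        else:
            pivot.parent.right = pivot
    root.parent = pivot
    return pivot


def inorder(node):
    """Keys of the subtree in in-order sequence."""
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)


# The tree ((A, P, B), Q, C) from the diagram, hung below a made-up parent "Z".
q = Node("Q", Node("P", Node("A"), Node("B")), Node("C"))
top = Node("Z", left=q)
before = inorder(top)
new_root = rotate_right(q)
after = inorder(top)
```

When the rotated node has no parent, the caller has to store the returned pivot as the new root of the whole tree.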
Inorder Invariance
A tree rotation leaves the inorder traversal of the binary tree invariant. This implies that the order of the elements is
not affected when a rotation is performed in any part of the tree. Here are the inorder traversals of the trees shown
above:
Left tree: ((A, P, B), Q, C)
Right tree: (A, P, (B, Q, C))
Computing one from the other is very simple. The following is example Python code that performs that computation:
def right_rotation(treenode):
    left, Q, C = treenode
    A, P, B = left
    return (A, P, (B, Q, C))
Another way of looking at it is:
Right Rotation of node Q:
Let P be Q's left child.
Set P to be the new root.
Set Q's left child to be P's right child.
Set P's right child to be Q.
Rotation distance
The rotation distance between any two binary trees with the same
number of nodes is the minimum number of rotations needed to transform one into the other. With this distance, the
set of n-node binary trees becomes a metric space: the distance is symmetric, positive when given two different trees,
and satisfies the triangle inequality.
It is an open problem whether there exists a polynomial time algorithm for calculating rotation distance.
Daniel Sleator, Robert Tarjan and William Thurston showed that the rotation distance between any two n-node trees
(for n ≥ 11) is at most 2n − 6, and that infinitely many pairs of trees are this far apart.
External links
Java applets demonstrating tree rotations (http://www.cs.queensu.ca/home/jstewart/applets/bst/bst-rotation.
html)
The AVL Tree Rotations Tutorial (http://fortheloot.com/public/AVLTreeTutorial.rtf) (RTF) by John Hargrove
Self-balancing binary search tree
Figure: The same tree after being height-balanced; the average path effort decreased to 3.00 node accesses.
Overview
Most operations on a binary search tree (BST) take time directly proportional to the height of the tree, so it is
desirable to keep the height small. A binary tree with height h can contain at most
$2^0 + 2^1 + \cdots + 2^h = 2^{h+1} - 1$ nodes. It follows that for a tree with n nodes and height h:
$n \le 2^{h+1} - 1$
which implies $h \ge \lfloor \log_2 n \rfloor$. In other words, the minimum height of a tree with n nodes is
$\lfloor \log_2 n \rfloor$.
Figure: Tree rotations are very common internal operations on self-balancing binary trees to keep perfect or
near-to-perfect balance.
However, the simplest algorithms for BST item insertion may yield a tree with height n in rather common situations.
For example, when the items are inserted in sorted key order, the tree degenerates into a linked list with n nodes. The
difference in performance between the two situations may be enormous: for n=1,000,000, for example, the
minimum height is $\lfloor \log_2 1{,}000{,}000 \rfloor = 19$.
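The bound can be checked numerically. This is a small sketch with hypothetical helper names (min_height, max_nodes); height is counted in edges here, so a single node has height 0.

```python
def max_nodes(h):
    """Largest number of nodes in a binary tree of height h (in edges)."""
    return 2 ** (h + 1) - 1


def min_height(n):
    """Smallest possible height (in edges) of a binary tree with n >= 1 nodes."""
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)).
    return n.bit_length() - 1


n = 1_000_000
balanced = min_height(n)     # height of the best possible tree
degenerate = n - 1           # height of a chain of n nodes, counted in edges
```

A tree of height 18 holds at most 524,287 nodes, which is fewer than 1,000,000, while height 19 allows up to 1,048,575; this is why 19 is the minimum.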
If the data items are known ahead of time, the height can be kept small, in the average sense, by adding values in a
random order, resulting in a random binary search tree. However, there are many situations (such as online
algorithms) where this randomization is not viable.
Self-balancing binary trees solve this problem by performing transformations on the tree (such as tree rotations) at
key times, in order to keep the height proportional to log2(n). Although a certain overhead is involved, it may be
justified in the long run by ensuring fast execution of later operations.
Maintaining the height always at its minimum value $\lfloor \log_2 n \rfloor$ is not always viable; it can be proven that
any insertion algorithm which did so would have an excessive overhead. Therefore, most self-balanced
BST algorithms keep the height within a constant factor of this lower bound.
In the asymptotic ("Big-O") sense, a self-balancing BST structure containing n items allows the lookup, insertion,
and removal of an item in O(log n) worst-case time, and ordered enumeration of all items in O(n) time. For some
implementations these are per-operation time bounds, while for others they are amortized bounds over a sequence of
operations. These times are asymptotically optimal among all data structures that manipulate the key only through
comparisons.
Implementations
Popular data structures implementing this type of tree include:
2-3 tree
AA tree
AVL tree
Red-black tree
Scapegoat tree
Splay tree
Treap
Applications
Self-balancing binary search trees can be used in a natural way to construct and maintain ordered lists, such as
priority queues. They can also be used for associative arrays; key-value pairs are simply inserted with an ordering
based on the key alone. In this capacity, self-balancing BSTs have a number of advantages and disadvantages over
their main competitor, hash tables. One advantage of self-balancing BSTs is that they allow fast (indeed,
asymptotically optimal) enumeration of the items in key order, which hash tables do not provide. One disadvantage
is that their lookup algorithms get more complicated when there may be multiple items with the same key.
Self-balancing BSTs have better worst-case lookup performance than hash tables (O(log n) compared to O(n)), but
have worse average-case performance (O(log n) compared to O(1)).
Self-balancing BSTs can be used to implement any algorithm that requires mutable ordered lists, to achieve optimal
worst-case asymptotic performance. For example, if binary tree sort is implemented with a self-balanced BST, we
have a very simple-to-describe yet asymptotically optimal O(n log n) sorting algorithm. Similarly, many algorithms
in computational geometry exploit variations on self-balancing BSTs to solve problems such as the line segment
intersection problem and the point location problem efficiently. (For average-case performance, however,
self-balanced BSTs may be less efficient than other solutions. Binary tree sort, in particular, is likely to be slower
than merge sort, quicksort, or heapsort, because of the tree-balancing overhead as well as cache access patterns.)
Self-balancing BSTs are flexible data structures, in that it's easy to extend them to efficiently record additional
information or perform new operations. For example, one can record the number of nodes in each subtree having a
certain property, allowing one to count the number of nodes in a certain key range with that property in O(log n)
time. These extensions can be used, for example, to optimize database queries or other list-processing algorithms.
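One such extension can be sketched by storing the size of each subtree in its root. The code below is an illustration under simplifying assumptions: the names Node, insert, count_less and count_in_range are made up for this example, keys are distinct, and rebalancing is left out (a self-balancing tree would also have to update size during its rotations).

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1          # number of nodes in the subtree rooted here


def size(node):
    return node.size if node is not None else 0


def insert(node, key):
    """Plain (unbalanced) insertion that keeps the size fields up to date."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    node.size = 1 + size(node.left) + size(node.right)
    return node


def count_less(node, key):
    """Number of stored keys strictly smaller than key.

    Follows a single root-to-leaf path, so the cost is proportional to the
    height of the tree.
    """
    count = 0
    while node is not None:
        if key <= node.key:
            node = node.left
        else:
            # node and its whole left subtree are smaller than key
            count += 1 + size(node.left)
            node = node.right
    return count


def count_in_range(root, lo, hi):
    """Number of stored keys k with lo <= k < hi."""
    return count_less(root, hi) - count_less(root, lo)


root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)
```

Each query walks two root-to-leaf paths, so on a balanced tree it takes O(log n) time regardless of how many keys fall inside the range.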
References
[1] Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Second Edition. Addison-Wesley, 1998. ISBN
0-201-89685-0. Section 6.2.3: Balanced Trees, pp. 458–481.
External links
Dictionary of Algorithms and Data Structures: Height-balanced binary search tree (http://www.nist.gov/dads/
HTML/heightBalancedTree.html)
GNU libavl (http://adtinfo.org/), a LGPL-licensed library of binary tree implementations in C, with
documentation
Treap
             Average      Worst case
Space        O(n)         O(n)
Search       O(log n)     amortized O(log n)
Insert       O(log n)     amortized O(log n)
Delete       O(log n)     amortized O(log n)
In computer science, the treap and the randomized binary search tree are two closely related forms of binary
search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys.
After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same
probability distribution as a random binary tree; in particular, with high probability its height is proportional to the
logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to
perform.
The treap was first described by Cecilia R. Aragon and Raimund
Seidel in 1989; its name is a portmanteau of tree and heap. It is a
Cartesian tree in which each key is given a (randomly chosen) numeric
priority. As with any binary search tree, the inorder traversal order of
the nodes is the same as the sorted order of the keys. The structure of
the tree is determined by the requirement that it be heap-ordered: that
is, the priority number for any non-leaf node must be greater than or
equal to the priority of its children. Thus, as with Cartesian trees more
generally, the root node is the maximum-priority node, and its left and
right subtrees are formed in the same manner from the subsequences of
the sorted order to the left and right of that node.
Operations
Specifically, the treap supports the following operations:
To search for a given key value, apply a standard binary search algorithm in a binary search tree, ignoring the
priorities.
To insert a new key x into the treap, generate a random priority y for x. Binary search for x in the tree, and create a
new node at the leaf position where the binary search determines a node for x should exist. Then, as long as x is
not the root of the tree and has a larger priority number than its parent z, perform a tree rotation that reverses the
parent-child relation between x and z.
To delete a node x from the treap, if x is a leaf of the tree, simply remove it. If x has a single child z, remove x
from the tree and make z be the child of the parent of x (or make z the root of the tree if x had no parent). Finally,
if x has two children, swap its position in the tree with the position of its immediate successor z in the sorted
order, resulting in one of the previous cases. In this final case, the swap may violate the heap-ordering property
for z, so additional rotations may need to be performed to restore this property.
To split a treap into two smaller treaps, those smaller than key x, and those larger than key x, insert x into the treap
with maximum priority, larger than the priority of any node in the treap. After this insertion, x will be the root
node of the treap, all values less than x will be found in the left subtreap, and all values greater than x will be
found in the right subtreap. This costs as much as a single insertion into the treap.
To merge two treaps that are the product of a former split, one can safely assume that the greatest value in the first
treap is less than the smallest value in the second treap. Insert a value x, such that x is larger than this max-value
in the first treap, and smaller than the min-value in the second treap, and assign it the minimum priority. After
insertion it will be a leaf node, and can easily be deleted. The result is one treap merged from the two original
treaps. This is effectively "undoing" a split, and costs the same.
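The insertion operation described above can be sketched as follows. This is one possible recursive formulation rather than a reference implementation: the names TreapNode and treap_insert are hypothetical, priorities are floats drawn from a random number generator, and duplicate keys are simply ignored.

```python
import random


class TreapNode:
    def __init__(self, key, priority):
        self.key = key
        self.priority = priority
        self.left = None
        self.right = None


def rotate_right(node):
    pivot = node.left
    node.left = pivot.right
    pivot.right = node
    return pivot


def rotate_left(node):
    pivot = node.right
    node.right = pivot.left
    pivot.left = node
    return pivot


def treap_insert(node, key, rng=random):
    """Insert key below node and return the new root of that subtree."""
    if node is None:
        return TreapNode(key, rng.random())       # new leaf, random priority
    if key < node.key:
        node.left = treap_insert(node.left, key, rng)
        if node.left.priority > node.priority:    # heap order violated
            node = rotate_right(node)
    elif key > node.key:
        node.right = treap_insert(node.right, key, rng)
        if node.right.priority > node.priority:
            node = rotate_left(node)
    return node


def inorder(node):
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)


def is_heap_ordered(node):
    """True if no child has a larger priority than its parent."""
    if node is None:
        return True
    for child in (node.left, node.right):
        if child is not None and child.priority > node.priority:
            return False
    return is_heap_ordered(node.left) and is_heap_ordered(node.right)


rng = random.Random(42)           # fixed seed, chosen arbitrarily
root = None
for key in range(100):            # sorted input, the bad case for a plain BST
    root = treap_insert(root, key, rng)
```

The rotations happen as the recursion unwinds, which corresponds to moving the new node upward for as long as its priority exceeds that of its parent.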
Comparison
The information stored per node in the randomized binary tree is simpler than in a treap (a small integer rather than a
high-precision random number), but it makes a greater number of calls to the random number generator (O(log n)
calls per insertion or deletion rather than one call per insertion) and the insertion procedure is slightly more
complicated due to the need to update the numbers of descendants per node. A minor technical difference is that, in a
treap, there is a small probability of a collision (two keys getting the same priority), and in both cases there will be
statistical differences between a true random number generator and the pseudo-random number generator typically
used on digital computers. However, in any case the differences between the theoretical model of perfect random
choices used to design the algorithm and the capabilities of actual random number generators are vanishingly small.
Although the treap and the randomized binary search tree both have the same random distribution of tree shapes after
each update, the history of modifications to the trees performed by these two data structures over a sequence of
insertion and deletion operations may be different. For instance, in a treap, if the three numbers 1, 2, and 3 are
inserted in the order 1, 3, 2, and then the number 2 is deleted, the remaining two nodes will have the same
parent-child relationship that they did prior to the insertion of the middle number. In a randomized binary search
tree, the tree after the deletion is equally likely to be either of the two possible trees on its two nodes, independently
of what the tree looked like prior to the insertion of the middle number.
AVL tree
Type         Tree
Invented     1962
             Average      Worst case
Space        O(n)         O(n)
Search       O(log n)     O(log n)
Insert       O(log n)     O(log n)
Delete       O(log n)     O(log n)
Operations
Basic operations of an AVL tree involve carrying out the same actions as would be carried out on an unbalanced
binary search tree, but modifications are followed by zero or more operations called tree rotations, which help to
restore the height balance of the subtrees.
Figure: Tree rotations.
Searching
Once a node has been found in a balanced tree, the next or previous nodes can be explored in amortized constant
time. Some instances of exploring these "nearby" nodes require traversing up to 2×log(n) links (particularly when
moving from the rightmost leaf of the root's left subtree to the leftmost leaf of the root's right subtree; in the example
AVL tree, moving from node 14 to the next but one node 19 takes 4 steps). However, exploring all n nodes of the tree
in this manner would use each link exactly twice: one traversal to enter the subtree rooted at that node, another to
leave that node's subtree after having explored it. And since there are n−1 links in any tree, the amortized cost is
found to be 2×(n−1)/n, or approximately 2.
Insertion
After inserting a node, it is necessary to check each of the node's ancestors for consistency with the rules of AVL.
The balance factor is calculated as follows: balanceFactor = height(left-subtree) - height(right-subtree). For each
node checked, if the balance factor remains −1, 0, or +1 then no rotations are necessary. However, if the balance
factor becomes less than −1 or greater than +1, the subtree rooted at this node is unbalanced. If insertions are
performed serially, after each insertion, at most one of the following cases needs to be resolved to restore the entire
tree to the rules of AVL.
Suppose inserting one element causes P's balance factor to
go out of range. It must be that insertion caused the height
of one of P's child nodes to increase by 1 (but not the other).
Without loss of generality, assume that the height of L, P's
left, was increased. The following procedure can restore
balance at P:
if (balance_factor(L) < 0) {
    // In the illustration to the right,
    // this is the first step in the left-right case.
    rotate_left(L);
}
// This brings us to the left-left case.
rotate_right(P);
The right-left and right-right cases are analogous. The names of the cases refer to the portion of the tree that is
reduced in height.
In order to restore the balance factors of all nodes, first observe that all nodes requiring correction lie along the path
used during the initial insertion. If the above procedure is applied to nodes along this path, starting from the bottom
(i.e. the node furthest away from the root), then every node in the tree will again have a balance factor of -1, 0, or 1.
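The procedure above, together with its mirror image, can be sketched in Python. This is an illustrative sketch under assumptions: each node caches the height of its subtree, the names Node, update, rebalance and insert are hypothetical, and the balance factor is height(left) − height(right) as defined earlier.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1            # height (in nodes) of the subtree rooted here


def height(node):
    return node.height if node is not None else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def balance_factor(node):
    return height(node.left) - height(node.right)


def rotate_right(p):
    l = p.left
    p.left, l.right = l.right, p
    update(p)                      # p is now the lower node, so update it first
    update(l)
    return l


def rotate_left(p):
    r = p.right
    p.right, r.left = r.left, p
    update(p)
    update(r)
    return r


def rebalance(p):
    """Restore the AVL rule at p, assuming both subtrees already satisfy it."""
    update(p)
    if balance_factor(p) > 1:                  # left subtree too tall
        if balance_factor(p.left) < 0:         # left-right case
            p.left = rotate_left(p.left)
        return rotate_right(p)                 # left-left case
    if balance_factor(p) < -1:                 # right subtree too tall
        if balance_factor(p.right) > 0:        # right-left case
            p.right = rotate_right(p.right)
        return rotate_left(p)                  # right-right case
    return p


def insert(node, key):
    """Recursive insertion; nodes on the path are rebalanced bottom-up."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return rebalance(node)


root = None
for key in range(1, 8):            # sorted input 1..7
    root = insert(root, key)
```

Because rebalance runs as each recursive call returns, the nodes on the insertion path are visited starting from the one furthest from the root, as the text requires.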
Deletion
Let node X be the node with the value we need to delete, and let node Y be a node in the tree we need to find to take
node X's place, and let node Z be the actual node we take out of the tree.
Steps to consider when deleting a node in an AVL tree are the following:
1. If node X is a leaf or has only one child, skip to step 5. (node Z will be node X)
2. Otherwise, determine node Y by finding the largest node in node X's left sub tree (in-order predecessor) or the
smallest in its right sub tree (in-order successor).
3. Replace node X with node Y (remember, tree structure doesn't change here, only the values). In this step, node X
is essentially deleted when its internal values were overwritten with node Y's.
4. Choose node Z to be the old node Y.
5. Attach node Z's subtree to its parent (if it has a subtree). If node Z's parent is null, update root. (node Z is
currently root)
6. Delete node Z.
7. Retrace the path back up the tree (starting with node Z's parent) to the root, adjusting the balance factors as
needed.
As with all binary trees, a node's in-order successor is the left-most child of its right subtree, and a node's in-order
predecessor is the right-most child of its left subtree. In either case, this node will have zero or one children. Delete it
according to one of the two simpler cases above.
Deleting a node with two children from a binary search tree using the inorder predecessor (rightmost node in the left subtree, labelled 6).
In addition to the balancing described above for insertions, if the balance factor for the tree is 2 and that of the left
subtree is 0, a right rotation must be performed on P. The mirror of this case is also necessary.
The retracing can stop if the balance factor becomes −1 or +1, indicating that the height of that subtree has remained
unchanged. If the balance factor becomes 0 then the height of the subtree has decreased by one and the retracing
needs to continue. If the balance factor becomes −2 or +2 then the subtree is unbalanced and needs to be rotated to
fix it. If the rotation leaves the subtree's balance factor at 0 then the retracing towards the root must continue since
the height of this subtree has decreased by one. This is in contrast to an insertion, where a rotation resulting in a
balance factor of 0 indicates that the subtree's height has remained unchanged.
The time required is O(log n) for lookup, plus a maximum of O(log n) rotations on the way back to the root, so the
operation can be completed in O(log n) time.
AVL trees are more rigidly balanced than red-black trees, leading to slower insertion and removal but faster
retrieval.
References
[1] Robert Sedgewick, Algorithms, Addison-Wesley, 1983, ISBN 0-201-06672-6, page 199, chapter 15: Balanced Trees.
[2] English translation by Myron J. Ricci in Soviet Math. Doklady, 3:1259-1263, 1962.
[3] AVL trees are not weight-balanced? (meaning: AVL trees are not $\mu$-balanced?)
(http://cs.stackexchange.com/questions/421/avl-trees-are-not-weight-balanced)
Thereby: A binary tree is called $\mu$-balanced, with $0 \le \mu \le \tfrac{1}{2}$, if for every node $N$, the inequality
$\tfrac{1}{2} - \mu \le \frac{|N_l|}{|N| + 1} \le \tfrac{1}{2} + \mu$
holds and $\mu$ is minimal with this property. $|N|$ is the number of nodes below the tree with $N$ as root
(including the root) and $N_l$ is the left child node of $N$.
[4] In fact, each AVL tree can be colored red-black.
[5] Proof of asymptotic bounds
Further reading
Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition.
Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 458-475 of section 6.2.3: Balanced Trees.
External links
xdg library (https://github.com/vilkov/libxdg/wiki) by Dmitriy Vilkov: Serializable straight C-implementation
could easily be taken from this library under GNU-LGPL and AFL v2.0 licenses.
Description from the Dictionary of Algorithms and Data Structures (http://www.nist.gov/dads/HTML/avltree.html)
Python Implementation (http://github.com/pgrafov/python-avl-tree/)
Single C header file by Ian Piumarta (http://piumarta.com/software/tree/)
AVL Tree Demonstration (http://www.strille.net/works/media_technology_projects/avl-tree_2001/)
AVL tree applet, all the operations (http://webdiis.unizar.es/asignaturas/EDA/AVLTree/avltree.html)
Fast and efficient implementation of AVL Trees (http://github.com/fbuihuu/libtree)
PHP Implementation (https://github.com/mondrake/Rbppavl)
C++ implementation which can be used as an array (http://www.codeproject.com/Articles/12347/AVL-Binary-Tree-for-C)
Self balancing AVL tree with Concat and Split operations (http://code.google.com/p/self-balancing-avl-tree/)
Red-black tree
Type: Tree
Invented: 1972
Time complexity in big O notation
         Average     Worst case
Space    O(n)        O(n)
Search   O(log n)    O(log n)
Insert   O(log n)    O(log n)
Delete   O(log n)    O(log n)
A red-black tree is a type of self-balancing binary search tree, a data structure used in computer science.
The self-balancing is provided by painting each node with one of two colors (these are typically called 'red' and
'black', hence the name of the trees) in such a way that the resulting painted tree satisfies certain properties that don't
allow it to become significantly unbalanced. When the tree is modified, the new tree is subsequently rearranged and
repainted to restore the coloring properties. The properties are designed in such a way that this rearranging and
recoloring can be performed efficiently.
The balancing of the tree is not perfect, but it is good enough to guarantee searching in O(log n) time,
where n is the total number of elements in the tree. The insertion and deletion operations, along with the tree
rearrangement and recoloring, are also performed in O(log n) time.
Tracking the color of each node requires only 1 bit of information per node because there are only two colors. The
tree does not contain any other data specific to its being a red-black tree, so its memory footprint is almost identical
to that of a classic (uncolored) binary search tree. In many cases the additional bit of information can be stored at no
additional memory cost.
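One common way to store the color bit at no extra cost is to keep it in the lowest bit of a pointer that is known to be aligned. The sketch below illustrates the idea; the `rb_node` layout and all function names are assumptions for this example, and the technique relies on node addresses being at least 2-byte aligned and on pointers converting to and from `uintptr_t` unchanged.

```c
#include <assert.h>
#include <stdint.h>

enum rb_color { RB_RED = 0, RB_BLACK = 1 };

/* Hypothetical node layout: the parent pointer and the color share one word. */
struct rb_node {
    uintptr_t parent_and_color;   /* bit 0 = color, remaining bits = parent address */
    struct rb_node *left, *right;
    int key;
};

struct rb_node *rb_parent(const struct rb_node *n)
{
    return (struct rb_node *)(n->parent_and_color & ~(uintptr_t)1);
}

enum rb_color rb_color_of(const struct rb_node *n)
{
    return (enum rb_color)(n->parent_and_color & 1);
}

void rb_set_parent(struct rb_node *n, struct rb_node *p)
{
    assert(((uintptr_t)p & 1) == 0);   /* assumption: nodes are at least 2-byte aligned */
    n->parent_and_color = (uintptr_t)p | (n->parent_and_color & 1);
}

void rb_set_color(struct rb_node *n, enum rb_color c)
{
    n->parent_and_color = (n->parent_and_color & ~(uintptr_t)1) | (uintptr_t)c;
}
```

The node is then exactly as large as an uncolored node with a parent pointer, at the price of a mask operation on every parent or color access.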
History
The original data structure was invented in 1972 by Rudolf Bayer and named "symmetric binary B-tree," but
acquired its modern name in a paper in 1978 by Leonidas J. Guibas and Robert Sedgewick entitled "A Dichromatic
Framework for Balanced Trees". The color "red" was chosen because it was the best-looking color produced by the
color laser printer available to the authors while working at Xerox PARC.
Terminology
A red-black tree is a special type of binary tree, used in computer science to organize pieces of comparable data,
such as text fragments or numbers.
The leaf nodes of red-black trees do not contain data. These leaves need not be explicit in computer memory (a null
child pointer can encode the fact that this child is a leaf), but it simplifies some algorithms for operating on
red-black trees if the leaves really are explicit nodes. To save memory, sometimes a single sentinel node performs
the role of all leaf nodes; all references from internal nodes to leaf nodes then point to the sentinel node.
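The sentinel representation might look as follows in C. This is a sketch under assumptions: the `node` layout, the `nil_sentinel` object, the `NIL` macro and the `init_node` helper are names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

struct node {
    enum color color;
    int key;
    struct node *left, *right, *parent;
};

/* A single shared black node stands in for every leaf of the tree. */
struct node nil_sentinel = { BLACK, 0, NULL, NULL, NULL };
#define NIL (&nil_sentinel)

int is_leaf(const struct node *n)
{
    return n == NIL;
}

/* Set up a new red interior node whose two children are both the sentinel leaf. */
void init_node(struct node *n, int key)
{
    assert(n != NULL);
    n->color = RED;
    n->key = key;
    n->left = n->right = NIL;
    n->parent = NULL;
}
```

Because every leaf is the same object, code can read `n->left->color` without first testing for a null pointer, and only one extra node is allocated for the whole tree.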
Red-black trees, like all binary search trees, allow efficient in-order traversal (that is, in the order Left-Root-Right)
of their elements. The search time results from the traversal from root to leaf, and therefore a balanced tree of n
nodes, having the least possible tree height, results in O(log n) search time.
Properties
In addition to the requirements imposed on a binary search tree, the following hold for red-black trees:
1. A node is either red or black.
2. The root is black. (This rule is sometimes omitted. Since the root can always be changed from red to black, but
not necessarily vice versa, this rule has little effect on analysis.)
3. All leaves (NIL) are black. (All leaves are the same color as the root.)
4. Every red node must have two black child nodes.
5. Every path from a given node to any of its descendant leaves contains the same number of black nodes.
These constraints enforce a critical property of red-black trees: that the path from the root to the furthest leaf is no
more than twice as long as the path from the root to the nearest leaf. The result is that the tree is roughly
height-balanced. Since operations such as inserting, deleting, and finding values require worst-case time proportional
to the height of the tree, this theoretical upper bound on the height allows red-black trees to be efficient in the worst
case, unlike ordinary binary search trees.
To see why this is guaranteed, it suffices to consider the effect of properties 4 and 5 together. For a red-black tree T,
let B be the number of black nodes in property 5. Let the shortest possible path from the root of T to any leaf consist
of B black nodes. Longer possible paths may be constructed by inserting red nodes. However, property 4 makes it
impossible to insert more than one consecutive red node. Therefore the longest possible path consists of 2B nodes,
alternating black and red.
The shortest possible path has all black nodes, and the longest possible path alternates between red and black nodes.
Since all maximal paths have the same number of black nodes, by property 5, this shows that no path is more than
twice as long as any other path.
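Properties 2, 4 and 5 can be checked mechanically. The following checker is a sketch rather than part of the original article: the `node` layout and function names are assumptions, and null pointers are taken to be the black leaves of property 3.

```c
#include <stddef.h>

enum color { RED, BLACK };

struct node {
    enum color color;
    struct node *left, *right;
};

/* NULL pointers play the role of the black leaves (property 3). */
static int is_red(const struct node *n)
{
    return n != NULL && n->color == RED;
}

/*
 * Returns the number of black nodes on every path from n down to a leaf
 * (the leaf is counted, and n itself is counted if it is black), or -1 if
 * property 4 or property 5 is violated somewhere in the subtree rooted at n.
 */
int black_height(const struct node *n)
{
    if (n == NULL)
        return 1;                        /* a leaf is one black node */
    if (is_red(n) && (is_red(n->left) || is_red(n->right)))
        return -1;                       /* property 4: red node with a red child */
    int lh = black_height(n->left);
    int rh = black_height(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;                       /* property 5: unequal black counts */
    return lh + (n->color == BLACK ? 1 : 0);
}

int is_red_black(const struct node *root)
{
    if (is_red(root))
        return 0;                        /* property 2: the root must be black */
    return black_height(root) > 0;
}
```

The checker visits each node once, so it runs in O(n) time. It does not verify the binary search tree ordering of the keys.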
Analogy to B-trees of order 4
One way to see this equivalence is to "move up" the red nodes in a graphical representation of the red-black tree, so
that they align horizontally with their parent black node, creating together a horizontal cluster. In the B-tree, or in
the modified graphical representation of the red-black tree, all leaf nodes are at the same depth.
The red-black tree is then structurally equivalent to a B-tree of order 4, with a minimum fill factor of 33% of values
per cluster with a maximum capacity of 3 values.
This B-tree type is still more general than a red-black tree though, as it allows ambiguity in a red-black tree
conversion: multiple red-black trees can be produced from an equivalent B-tree of order 4. If a B-tree cluster
contains only 1 value, it is the minimum, black, and has two child pointers. If a cluster contains 3 values, then the
central value will be black and each value stored on its sides will be red. If the cluster contains two values, however,
either one can become the black node in the red-black tree (and the other one will be red).
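The grouping of a black node with its red children into one cluster can be illustrated with a short C function. The `node` layout and the name `cluster_size` are hypothetical, and null pointers are assumed to stand for the black leaves.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

struct node {
    enum color color;
    int key;
    struct node *left, *right;
};

/*
 * Number of values in the order-4 B-tree cluster headed by black node b:
 * b itself plus each of its red children (a red node has no red children,
 * so the cluster never extends further down).
 */
int cluster_size(const struct node *b)
{
    assert(b != NULL && b->color == BLACK);
    int size = 1;
    if (b->left != NULL && b->left->color == RED)
        size++;
    if (b->right != NULL && b->right->color == RED)
        size++;
    return size;
}
```

The result is always 1, 2 or 3, matching the maximum capacity of 3 values per cluster; a result of 2 is the ambiguous case in which either value could have been the black one.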
So the order-4 B-tree does not maintain which of the values contained in each cluster is the black root node for the
whole cluster and the parent of the other values in the same cluster. Despite this, the operations on red-black trees
are more economical in time because the vector of values does not have to be maintained. Maintaining it may be costly
if values are stored directly in each node rather than being stored by reference. B-tree nodes, however, are more
economical in space because the color attribute does not need to be stored for each node. Instead, one has to know
which slot in the cluster vector is used. If values are stored by reference, e.g. objects, null references can be used
and so the cluster can be represented by a vector containing 3 slots for value pointers plus 4 slots for child
references in the tree. In that case, the B-tree can be more compact in memory, improving data locality.
The same analogy can be made with B-trees of larger orders, which can be structurally equivalent to a colored binary
tree: more colors are simply needed. Suppose that blue is added. If the blue-red-black tree is defined like a red-black
tree, but with the additional constraints that no two successive nodes in the hierarchy are blue and that all blue nodes
are children of a red node, then it becomes equivalent to a B-tree whose clusters have at most 7 values in the
following colors: blue, red, blue, black, blue, red, blue. (For each cluster, there will be at most 1 black node, 2 red
nodes, and 4 blue nodes.)
For moderate volumes of values, insertions and deletions in a colored binary tree are faster compared to B-trees
because colored trees don't attempt to maximize the fill factor of each horizontal cluster of nodes (only the minimum
fill factor is guaranteed in colored binary trees, limiting the number of splits or junctions of clusters). B-trees will be
faster for performing rotations (because rotations will frequently occur within the same cluster rather than with
multiple separate nodes in a colored binary tree). However, for storing large volumes, B-trees will be much faster as
they will be more compact by grouping several children in the same cluster where they can be accessed locally.
All optimizations possible in B-trees to increase the average fill factors of clusters are possible in the equivalent
multicolored binary tree. Notably, maximizing the average fill factor in a structurally equivalent B-tree is the same as
reducing the total height of the multicolored tree, by increasing the number of non-black nodes. The worst case
occurs when all nodes in a colored binary tree are black, the best case occurs when only a third of them are black
(and the other two thirds are red nodes).
Notes
[1] Using Knuth's definition of order: the maximum number of children
The AVL tree is another structure supporting O(log n) search, insertion, and removal. It is more rigidly balanced
than red-black trees, leading to slower insertion and removal but faster retrieval. This makes it attractive for data
structures that may be built once and loaded without reconstruction, such as language dictionaries (or program
dictionaries, such as the opcodes of an assembler or interpreter).
Red-black trees are also particularly valuable in functional programming, where they are one of the most common
persistent data structures, used to construct associative arrays and sets which can retain previous versions after
mutations. The persistent version of red-black trees requires O(log n) space for each insertion or deletion, in addition
to time.
For every 2-4 tree, there are corresponding red-black trees with data elements in the same order. The insertion and
deletion operations on 2-4 trees are also equivalent to color-flipping and rotations in red-black trees. This makes 2-4
trees an important tool for understanding the logic behind red-black trees, and this is why many introductory
algorithm texts introduce 2-4 trees just before red-black trees, even though 2-4 trees are not often used in practice.
In 2008, Sedgewick introduced a simpler version of the red-black tree called the left-leaning red-black tree[1] by
eliminating a previously unspecified degree of freedom in the implementation. The LLRB maintains an additional
invariant that all red links must lean left except during inserts and deletes. Red-black trees can be made isometric to
either 2-3 trees,[2] or 2-4 trees, for any sequence of operations. The 2-4 tree isometry was described in 1978 by
Sedgewick. With 2-4 trees, the isometry is resolved by a "color flip," corresponding to a split,
in which the red color of two children nodes leaves the children and moves to the parent node. The tango tree, a type
of tree optimized for fast searches, usually uses red-black trees as part of its data structure.
Operations
Read-only operations on a red-black tree require no modification from those used for binary search trees, because
every red-black tree is a special case of a simple binary search tree. However, the immediate result of an insertion or
removal may violate the properties of a red-black tree. Restoring the red-black properties requires a small number
(O(log n) or amortized O(1)) of color changes (which are very quick in practice) and no more than three tree
rotations (two for insertion). Although insert and delete operations are complicated, their times remain O(log n).
Insertion
Insertion begins by adding the node as any binary search tree insertion does and by coloring it red. Whereas in the
binary search tree, we always add a leaf, in the red-black tree leaves contain no information, so instead we add a red
interior node, with two black leaves, in place of an existing black leaf.
What happens next depends on the color of other nearby nodes. The term uncle node will be used to refer to the
sibling of a node's parent, as in human family trees. Note that:
property 3 (all leaves are black) always holds.
property 4 (both children of every red node are black) is threatened only by adding a red node, repainting a black
node red, or a rotation.
property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is threatened
only by adding a black node, repainting a red node black (or vice versa), or a rotation.
Note: The label N will be used to denote the current node (colored red). At the beginning, this is the new node
being inserted, but the entire procedure may also be applied recursively to other nodes (see case 3). P will
denote N's parent node, G will denote N's grandparent, and U will denote N's uncle. Note that in between some
cases, the roles and labels of the nodes are exchanged, but in each case, every label continues to represent the
same node it represented at the beginning of the case. Any color shown in the diagram is either assumed in its
case or implied by those assumptions. A numbered triangle represents a subtree of unspecified depth. A black
circle atop the triangle designates a black root node, otherwise the root node's color is unspecified.
Each case will be demonstrated with example C code. The uncle and grandparent nodes can be found by these
functions:
struct node *grandparent(struct node *n)
{
    if ((n != NULL) && (n->parent != NULL))
        return n->parent->parent;
    else
        return NULL;
}
struct node *uncle(struct node *n)
{
    struct node *g = grandparent(n);
    if (g == NULL)
        return NULL; // No grandparent means no uncle
    if (n->parent == g->left)
        return g->right;
    else
        return g->left;
}
Case 1: The current node N is at the root of the tree. In this case, it is repainted black to satisfy property 2 (the root is
black). Since this adds one black node to every path at once, property 5 (all paths from any given node to its leaf
nodes contain the same number of black nodes) is not violated.
void insert_case1(struct node *n)
{
    if (n->parent == NULL)
        n->color = BLACK;
    else
        insert_case2(n);
}
Case 2: The current node's parent P is black, so property 4 (both children of every red node are black) is not
invalidated. In this case, the tree is still valid. Property 5 (all paths from any given node to its leaf nodes contain the
same number of black nodes) is not threatened, because the current node N has two black leaf children, but because
N is red, the paths through each of its children have the same number of black nodes as the path through the leaf it
replaced, which was black, and so this property remains satisfied.
void insert_case2(struct node *n)
{
    if (n->parent->color == BLACK)
        return; /* Tree is still valid */
    else
        insert_case3(n);
}
Note: In the following cases it can be assumed that N has a grandparent node G, because its parent P is red,
and if it were the root, it would be black. Thus, N also has an uncle node U, although it may be a leaf in cases
4 and 5.
Case 3: If both the parent P and the uncle U are red, then both of them
can be repainted black and the grandparent G becomes red (to maintain
property 5 (all paths from any given node to its leaf nodes contain the
same number of black nodes)). Now, the current red node N has a
black parent. Since any path through the parent or uncle must pass
through the grandparent, the number of black nodes on these paths has
not changed. However, the grandparent G may now violate properties
2 (The root is black) or 4 (Both children of every red node are black) (property 4 possibly being violated since G
may have a red parent). To fix this, the entire procedure is recursively performed on G from case 1. Note that this is
a tail-recursive call, so it could be rewritten as a loop; since this is the only loop, and any rotations occur after this
loop, this proves that a constant number of rotations occur.
void insert_case3(struct node *n)
{
    struct node *u = uncle(n), *g;
    if ((u != NULL) && (u->color == RED)) {
        n->parent->color = BLACK;
        u->color = BLACK;
        g = grandparent(n);
        g->color = RED;
        insert_case1(g);
    } else {
        insert_case4(n);
    }
}
Note: In the remaining cases, it is assumed that the parent node P is the left child of its parent. If it is the right
child, left and right should be reversed throughout cases 4 and 5. The code samples take care of this.
Case 4: The parent P is red but the uncle U is black; also, the current
node N is the right child of P, and P in turn is the left child of its parent
G. In this case, a left rotation on P that switches the roles of the current
node N and its parent P can be performed; then, the former parent node
P is dealt with using case 5 (relabeling N and P) because property 4
(both children of every red node are black) is still violated. The
rotation causes some paths (those in the sub-tree labelled "1") to pass
through the node N where they did not before. It also causes some paths (those in the sub-tree labelled "3") not to
pass through the node P where they did before. However, both of these nodes are red, so property 5 (all paths from
any given node to its leaf nodes contain the same number of black nodes) is not violated by the rotation. After this
case has been completed, property 4 (both children of every red node are black) is still violated, but now we can
resolve this by continuing to case 5.
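void insert_case4(struct node *n)
{
    struct node *g = grandparent(n);
    if ((n == n->parent->right) && (n->parent == g->left)) {
        rotate_left(n->parent);
        n = n->left; /* continue with the former parent P, now the lower red node */
    } else if ((n == n->parent->left) && (n->parent == g->right)) {
        rotate_right(n->parent);
        n = n->right;
    }
    insert_case5(n);
}
Case 5: The parent P is red but the uncle U is black, the current node N is the left child of P, and P is the left
child of its parent G. In this case, a right rotation on G is performed; the result is a tree where the former parent P
is now the parent of both the current node N and the former grandparent G. G is known to be black, since its former
child P could not have been red otherwise. The colors of P and G are then switched, so the resulting tree satisfies
property 4 (both children of every red node are black). Property 5 (all paths from any given node to its leaf nodes
contain the same number of black nodes) also remains satisfied, since all paths that went through any of these three
nodes went through G before, and now they all go through P. In each case, this is the only black node of the three.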
void insert_case5(struct node *n)
{
    struct node *g = grandparent(n);
    n->parent->color = BLACK;
    g->color = RED;
    if (n == n->parent->left)
        rotate_right(g);
    else
        rotate_left(g);
}
Note that inserting is actually in-place, since all the calls above use tail recursion.
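Because every call in the chain is a tail call, the five cases can be folded into a single loop. The sketch below is one way to do so and is not the article's code: it adds the rotation helpers, which the listings above call but do not define, and it passes a pointer to the root (the `root` parameter and the name `rb_insert` are assumptions) so that a rotation at the root can update it. Null pointers stand for black leaves.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };
struct node {
    int key;
    enum color color;
    struct node *left, *right, *parent;
};

/* Left rotation at x: x's right child y moves up into x's place. */
static void rotate_left(struct node **root, struct node *x)
{
    struct node *y = x->right;
    x->right = y->left;
    if (y->left != NULL)
        y->left->parent = x;
    y->parent = x->parent;
    if (x->parent == NULL)
        *root = y;
    else if (x == x->parent->left)
        x->parent->left = y;
    else
        x->parent->right = y;
    y->left = x;
    x->parent = y;
}

/* Right rotation at x: mirror image of rotate_left. */
static void rotate_right(struct node **root, struct node *x)
{
    struct node *y = x->left;
    x->left = y->right;
    if (y->right != NULL)
        y->right->parent = x;
    y->parent = x->parent;
    if (x->parent == NULL)
        *root = y;
    else if (x == x->parent->left)
        x->parent->left = y;
    else
        x->parent->right = y;
    y->right = x;
    x->parent = y;
}

static int is_red(const struct node *n)
{
    return n != NULL && n->color == RED;
}

/* Insert caller-allocated node n (key already set) and restore the properties. */
void rb_insert(struct node **root, struct node *n)
{
    struct node *p = NULL, *cur = *root;
    assert(root != NULL && n != NULL);
    while (cur != NULL) {                 /* ordinary binary search tree descent */
        p = cur;
        cur = (n->key < cur->key) ? cur->left : cur->right;
    }
    n->left = n->right = NULL;
    n->parent = p;
    n->color = RED;
    if (p == NULL)
        *root = n;
    else if (n->key < p->key)
        p->left = n;
    else
        p->right = n;
    while (is_red(n->parent)) {           /* a black parent is case 2: done */
        struct node *par = n->parent;
        struct node *g = par->parent;     /* exists: a red parent is never the root */
        struct node *u = (par == g->left) ? g->right : g->left;
        if (is_red(u)) {                  /* case 3: recolor, then move up to G */
            par->color = BLACK;
            u->color = BLACK;
            g->color = RED;
            n = g;
            continue;
        }
        if (par == g->left && n == par->right) {          /* case 4 */
            rotate_left(root, par);
            n = par;
        } else if (par == g->right && n == par->left) {   /* case 4, mirrored */
            rotate_right(root, par);
            n = par;
        }
        par = n->parent;
        par->color = BLACK;               /* case 5 */
        g->color = RED;
        if (n == par->left)
            rotate_right(root, g);
        else
            rotate_left(root, g);
        break;
    }
    (*root)->color = BLACK;               /* case 1 */
}
```

Only case 3 repeats the loop, and it performs no rotations, so at most two rotations happen per insertion, as stated above.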
Removal
In a regular binary search tree when deleting a node with two non-leaf children, we find either the maximum element
in its left subtree (which is the in-order predecessor) or the minimum element in its right subtree (which is the
in-order successor) and move its value into the node being deleted (as shown here). We then delete the node we
copied the value from, which must have fewer than two non-leaf children. (Non-leaf children, rather than all
children, are specified here because unlike normal binary search trees, red-black trees can have leaf nodes anywhere,
so that all nodes are either internal nodes with two children or leaf nodes with, by definition, zero children. In effect,
internal nodes having two leaf children in a red-black tree are like the leaf nodes in a regular binary search tree.)
Because merely copying a value does not violate any red-black properties, this reduces to the problem of deleting a
node with at most one non-leaf child. Once we have solved that problem, the solution applies equally to the case
where the node we originally want to delete has at most one non-leaf child as to the case just considered where it has
two non-leaf children.
Therefore, for the remainder of this discussion we address the deletion of a node with at most one non-leaf child. We
use the label M to denote the node to be deleted; C will denote a selected child of M, which we will also call "its
child". If M does have a non-leaf child, call that its child, C; otherwise, choose either leaf as its child, C.
If M is a red node, we simply replace it with its child C, which must be black by property 4. (This can only occur
when M has two leaf children, because if the red node M had a black non-leaf child on one side but just a leaf child
on the other side, then the count of black nodes on both sides would be different, thus the tree would violate property
5.) All paths through the deleted node will simply pass through one fewer red node, and both the deleted node's
parent and child must be black, so property 3 (all leaves are black) and property 4 (both children of every red node
are black) still hold.
Another simple case is when M is black and C is red. Simply removing a black node could break Properties 4 (Both
children of every red node are black) and 5 (All paths from any given node to its leaf nodes contain the same
number of black nodes), but if we repaint C black, both of these properties are preserved.
The complex case is when both M and C are black. (This can only occur when deleting a black node which has two
leaf children, because if the black node M had a black non-leaf child on one side but just a leaf child on the other
side, then the count of black nodes on both sides would be different, thus the tree would have been an invalid
red-black tree by violation of property 5.) We begin by replacing M with its child C. We will call (or label, that is,
relabel) this child (in its new position) N, and its sibling (its new parent's other child) S. (S was previously the
sibling of M.) In the diagrams below, we will also use P for N's new parent (M's old parent), SL for S's left child, and
SR for S's right child. (S cannot be a leaf because if M and C were black, then P's one subtree which included M
counted two black-height and thus P's other subtree which includes S must also count two black-height, which
cannot be the case if S is a leaf node.)
Note: In between some cases, we exchange the roles and labels of the nodes, but in each case, every label
continues to represent the same node it represented at the beginning of the case. Any color shown in the
diagram is either assumed in its case or implied by those assumptions. White represents an unknown color
(either red or black).
We will find the sibling using this function:
struct node *sibling(struct node *n)
{
    if (n == n->parent->left)
        return n->parent->right;
    else
        return n->parent->left;
}
Note: In order that the tree remains well-defined, we need that every null leaf remains a leaf after all
transformations (that it will not have any children). If the node we are deleting has a non-leaf (non-null) child
N, it is easy to see that the property is satisfied. If, on the other hand, N would be a null leaf, it can be verified
from the diagrams (or code) for all the cases that the property is satisfied as well.
We can perform the steps outlined above with the following code, where the function replace_node substitutes
child into n's place in the tree. For convenience, code in this section will assume that null leaves are represented
by actual node objects rather than NULL (the code in the Insertion section works with either representation).
void delete_one_child(struct node *n)
{
    /*
     * Precondition: n has at most one non-null child.
     */
    struct node *child = is_leaf(n->right) ? n->left : n->right;
    replace_node(n, child);
    if (n->color == BLACK) {
        if (child->color == RED)
            child->color = BLACK;
        else
            delete_case1(child);
    }
    free(n);
}
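The helpers is_leaf and replace_node are used above without being defined. One possible shape for them is sketched here; this is an assumption, not the article's code. In particular, the `leaf` flag is hypothetical, and this replace_node takes an extra `root` argument (absent from the call above) so that replacing the root node can be handled.

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

struct node {
    enum color color;
    int leaf;                     /* hypothetical flag: 1 for a data-less leaf node */
    struct node *left, *right, *parent;
};

/* Leaves are real node objects here, as the deletion code assumes. */
int is_leaf(const struct node *n)
{
    return n->leaf;
}

/* Put child into the place n occupies in the tree; n itself is left untouched. */
void replace_node(struct node **root, struct node *n, struct node *child)
{
    assert(n != NULL && child != NULL);
    child->parent = n->parent;
    if (n->parent == NULL)
        *root = child;            /* n was the root */
    else if (n == n->parent->left)
        n->parent->left = child;
    else
        n->parent->right = child;
}
```

Since n's own fields are not modified, its color can still be inspected after the replacement, which is what delete_one_child does before freeing it.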
Note: If N is a null leaf and we do not want to represent null leaves as actual node objects, we can modify the
algorithm by first calling delete_case1() on its parent (the node that we delete, n in the code above) and
deleting it afterwards. We can do this because the parent is black, so it behaves in the same way as a null leaf
(and is sometimes called a 'phantom' leaf). And we can safely delete it at the end as n will remain a leaf after
all operations, as shown above.
If both N and its original parent are black, then deleting this original parent causes paths which proceed through N to
have one fewer black node than paths that do not. As this violates property 5 (all paths from any given node to its
leaf nodes contain the same number of black nodes), the tree must be rebalanced. There are several cases to consider:
Case 1: N is the new root. In this case, we are done. We removed one black node from every path, and the new root
is black, so the properties are preserved.
void delete_case1(struct node *n)
{
    if (n->parent != NULL)
        delete_case2(n);
}
Note: In cases 2, 5, and 6, we assume N is the left child of its parent P. If it is the right child, left and right
should be reversed throughout these three cases. Again, the code examples take both cases into account.
Case 2: S is red. In this case we reverse the colors of P and S, and then
rotate left at P, turning S into N's grandparent. Note that P has to be
black as it had a red child. Although all paths still have the same
number of black nodes, now N has a black sibling and a red parent, so
we can proceed to step 4, 5, or 6. (Its new sibling is black because it
was once the child of the red S.) In later cases, we will relabel N's new
sibling as S.
void delete_case2(struct node *n)
{
    struct node *s = sibling(n);
    if (s->color == RED) {
        n->parent->color = RED;
        s->color = BLACK;
        if (n == n->parent->left)
            rotate_left(n->parent);
        else
            rotate_right(n->parent);
    }
    delete_case3(n);
}
Case 3: P, S, and S's children are black. In this case, we simply repaint
S red. The result is that all paths passing through S, which are precisely
those paths not passing through N, have one less black node. Because
deleting N's original parent made all paths passing through N have one
less black node, this evens things up. However, all paths through P
now have one fewer black node than paths that do not pass through P,
so property 5 (all paths from any given node to its leaf nodes contain the same number of black nodes) is still
violated. To correct this, we perform the rebalancing procedure on P, starting at case 1.
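void delete_case3(struct node *n)
{
    struct node *s = sibling(n);
    if ((n->parent->color == BLACK) &&
        (s->color == BLACK) &&
        (s->left->color == BLACK) &&
        (s->right->color == BLACK)) {
        s->color = RED;
        delete_case1(n->parent);
    } else
        delete_case4(n);
}
Case 4: S and S's children are black, but P is red. In this case, we simply exchange the colors of S and P. This does
not affect the number of black nodes on paths going through S, but it does add one to the number of black nodes on
paths going through N, making up for the deleted black node on those paths.
void delete_case4(struct node *n)
{
    struct node *s = sibling(n);
    if ((n->parent->color == RED) &&
        (s->color == BLACK) &&
        (s->left->color == BLACK) &&
        (s->right->color == BLACK)) {
        s->color = RED;
        n->parent->color = BLACK;
    } else
        delete_case5(n);
}
Case 5: S is black, S's left child is red, S's right child is black, and N is the left child of its parent. In this case
we rotate right at S, so that S's left child becomes S's parent and N's new sibling. We then exchange the colors of S
and its new parent. All paths still have the same number of black nodes, but now N has a black sibling whose right
child is red, so we fall into case 6. Neither N nor its parent are affected by this transformation.
void delete_case5(struct node *n)
{
    struct node *s = sibling(n);
    if (s->color == BLACK) { /* this test is trivial due to case 2: the new sibling was a child of the red S,
                                so it cannot be red */
        /* the following statements just force the red to be on the left of the
           left of the parent, or right of the right, so case six will rotate correctly. */
        if ((n == n->parent->left) &&
            (s->right->color == BLACK) &&
            (s->left->color == RED)) { /* this last test is trivial too due to cases 2-4. */
            s->color = RED;
            s->left->color = BLACK;
            rotate_right(s);
        } else if ((n == n->parent->right) &&
                   (s->left->color == BLACK) &&
                   (s->right->color == RED)) { /* this last test is trivial too due to cases 2-4. */
            s->color = RED;
            s->right->color = BLACK;
            rotate_left(s);
        }
    }
    delete_case6(n);
}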
Case 6: S is black, S's right child is red, and N is the left child of its
parent P. In this case we rotate left at P, so that S becomes the parent
of P and S's right child. We then exchange the colors of P and S, and
make S's right child black. The subtree still has the same color at its
root, so Properties 4 (Both children of every red node are black) and 5
(All paths from any given node to its leaf nodes contain the same
number of black nodes) are not violated. However, N now has one
additional black ancestor: either P has become black, or it was black and S was added as a black grandparent. Thus,
the paths passing through N pass through one additional black node.
Meanwhile, if a path does not go through N, then there are two possibilities:
It goes through N's new sibling. Then, it must go through S and P, both formerly and currently, as they have only
exchanged colors and places. Thus the path contains the same number of black nodes.
It goes through N's new uncle, S's right child. Then, it formerly went through S, S's parent, and S's right child
(which was red), but now only goes through S, which has assumed the color of its former parent, and S's right
child, which has changed from red to black (assuming S's color: black). The net effect is that this path goes
through the same number of black nodes.
Either way, the number of black nodes on these paths does not change. Thus, we have restored Properties 4 (Both
children of every red node are black) and 5 (All paths from any given node to its leaf nodes contain the same number
of black nodes). The white node in the diagram can be either red or black, but must refer to the same color both
before and after the transformation.
void delete_case6(struct node *n)
{
    struct node *s = sibling(n);
    s->color = n->parent->color;
    n->parent->color = BLACK;
    if (n == n->parent->left) {
        s->right->color = BLACK;
        rotate_left(n->parent);
    } else {
        s->left->color = BLACK;
        rotate_right(n->parent);
    }
}
Again, the function calls all use tail recursion, so the algorithm is in-place. In the algorithm above, all cases are
chained in order, except in delete case 3 where it can recurse to case 1 back to the parent node: this is the only case
where an in-place implementation will effectively loop.
Additionally, no tail recursion ever occurs on a child node, so the tail recursion loop can only move from a child
back to its successive ancestors. No more than O(log n) loops back to case 1 will occur (where n is the total number
of nodes in the tree before deletion). If a rotation occurs in case 2 (which is the only possibility of rotation within the
loop of cases 1-3), then the parent of the node N becomes red after the rotation and we will exit the loop. Therefore
at most one rotation will occur within this loop. Since no more than two additional rotations will occur after exiting
the loop, at most three rotations occur in total.
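Proof of asymptotic bounds
A red-black tree which contains $n$ internal nodes has a height of $O(\log n)$.
Definitions:
$h(v)$ = height of the subtree rooted at node $v$.
$bh(v)$ = the number of black nodes (not counting $v$ if it is black) from $v$ to any leaf in the subtree (called the
black-height).
Lemma: A subtree rooted at node $v$ has at least $2^{bh(v)} - 1$ internal nodes.
Proof of lemma (by induction on height):
Basis: $h(v) = 0$. If $v$ has a height of zero then it must be a leaf, therefore $bh(v) = 0$, and the subtree has
$2^{bh(v)} - 1 = 2^0 - 1 = 0$ internal nodes.
Inductive step: if every $v$ such that $h(v) = k$ has at least $2^{bh(v)} - 1$ internal nodes, then every $v'$
such that $h(v') = k+1$ has at least $2^{bh(v')} - 1$ internal nodes.
Since $v'$ has $h(v') > 0$ it is an internal node. As such it has two children each of which have a black-height of
either $bh(v')$ or $bh(v') - 1$ (depending on whether the child is red or black, respectively). By the inductive
hypothesis each child has at least $2^{bh(v') - 1} - 1$ internal nodes, so $v'$ has at least:
$$2^{bh(v') - 1} - 1 + 2^{bh(v') - 1} - 1 + 1 = 2^{bh(v')} - 1$$
internal nodes.
Using this lemma we can now show that the height of the tree is logarithmic. Since at least half of the nodes on any
path from the root to a leaf are black (property 4 of a red-black tree), the black-height of the root is at least
$h(\text{root})/2$. By the lemma we get:
$$n \ge 2^{h(\text{root})/2} - 1 \iff \log_2(n + 1) \ge \frac{h(\text{root})}{2} \iff h(\text{root}) \le 2 \log_2(n + 1)$$
Therefore the height of the root is $O(\log n)$.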
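As a rough numeric illustration, the lemma gives $n \ge 2^{h(\text{root})/2} - 1$, that is, $h(\text{root}) \le 2 \log_2(n + 1)$. The figure of one million internal nodes below is chosen only for this example:

```latex
n = 10^6: \qquad h(\text{root}) \le 2 \log_2(10^6 + 1) \approx 2 \times 19.93 = 39.86
```

Since the height is an integer, such a tree has height at most 39, whereas an unbalanced binary search tree holding the same keys could be up to $10^6$ levels deep.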
Insertion complexity
In the tree code there is only one loop, in which the node x at the root of the subtree whose red-black property we
wish to restore can be moved up the tree by one level at each iteration.
Since the original height of the tree is O(log n), there are O(log n) iterations. So overall the insert routine has
O(log n) complexity.
Parallel algorithms
Parallel algorithms for constructing red-black trees from sorted lists of items can run in constant time or O(log log n)
time, depending on the computer model, if the number of processors available is proportional to the number of items.
Fast search, insertion, and deletion parallel algorithms are also known.
External links
Red–Black Tree Demonstration (http://www.ece.uc.edu/~franco/C321/html/RedBlack/redblack.html)
OCW MIT Lecture by Prof. Erik Demaine on Red Black Trees (http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-046j-introduction-to-algorithms-sma-5503-fall-2005/video-lectures/lecture-10-red-black-trees-rotations-insertions-deletions/)
Binary Search Tree Insertion Visualization (https://www.youtube.com/watch?v=_VbTnLV8plU) on YouTube: visualization of random and pre-sorted data insertions, in elementary binary search trees and left-leaning red–black trees
Scapegoat tree
In computer science, a scapegoat tree is a self-balancing binary search tree, discovered by Arne Andersson and
again by Igal Galperin and Ronald L. Rivest. It provides worst-case O(log n) lookup time, and O(log n) amortized
insertion and deletion time.
Unlike most other self-balancing binary search trees that provide worst case O(log n) lookup time, scapegoat trees
have no additional per-node memory overhead compared to a regular binary search tree: a node stores only a key and
two pointers to the child nodes. This makes scapegoat trees easier to implement and, due to data structure alignment,
can reduce node overhead by up to one-third.
Theory
A binary search tree is said to be weight balanced if half the nodes are on the left of the root, and half on the right.
An α-weight-balanced node is therefore defined as meeting the following conditions:
size(left) <= α*size(node)
size(right) <= α*size(node)
Where size can be defined recursively as:
function size(node)
 if node = nil
  return 0
 else
  return size(node->left) + size(node->right) + 1
end
An α of 1 therefore would describe a linked list as balanced, whereas an α of 0.5 would only match almost complete
binary trees.
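The two conditions and the recursive size function can be sketched in C++ as follows. This is a minimal illustration, not a reference implementation: the `Node` type and the names `subtree_size` and `is_alpha_weight_balanced` are assumptions introduced here.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type: a key and two child pointers, nothing else.
struct Node {
    int key;
    Node* left;
    Node* right;
};

// Number of nodes in the subtree rooted at node (0 for an empty subtree).
std::size_t subtree_size(const Node* node) {
    if (node == nullptr) return 0;
    return subtree_size(node->left) + subtree_size(node->right) + 1;
}

// True if node satisfies both alpha-weight-balance conditions:
// size(left) <= alpha * size(node) and size(right) <= alpha * size(node).
bool is_alpha_weight_balanced(const Node* node, double alpha) {
    assert(node != nullptr);
    const double limit = alpha * static_cast<double>(subtree_size(node));
    return static_cast<double>(subtree_size(node->left)) <= limit &&
           static_cast<double>(subtree_size(node->right)) <= limit;
}
```

With α = 1 every node passes the check, including the nodes of a degenerate chain, which matches the remark about linked lists above.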
A binary search tree that is α-weight-balanced must also be α-height-balanced, that is
height(tree) <= log_{1/α}(NodeCount)
Scapegoat trees are not guaranteed to keep α-weight-balance at all times, but are always loosely α-height-balanced in
that
height(scapegoat tree) <= log_{1/α}(NodeCount) + 1
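The height limit can be computed directly from NodeCount and α. The sketch below assumes that the limit is rounded down to an integer and that an insertion is treated as too deep when its depth exceeds that limit; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// floor(log_{1/alpha}(node_count)), for 0.5 < alpha < 1 and node_count >= 1.
int alpha_height_limit(std::size_t node_count, double alpha) {
    assert(node_count >= 1 && alpha > 0.5 && alpha < 1.0);
    const double limit = std::log(static_cast<double>(node_count)) /
                         std::log(1.0 / alpha);
    return static_cast<int>(std::floor(limit));
}

// True if a node at the given depth (edges from the root) lies deeper than
// the alpha-height-balance limit for a tree with node_count nodes.
bool is_deep(int depth, std::size_t node_count, double alpha) {
    return depth > alpha_height_limit(node_count, alpha);
}
```

Because the result comes from floating-point logarithms, values of NodeCount that are exact powers of 1/α may round either way; a production implementation would probably want an integer-only comparison.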
This makes scapegoat trees similar to red-black trees in that they both have restrictions on their height. They differ
greatly though in their implementations of determining where the rotations (or in the case of scapegoat trees,
rebalances) take place. Whereas red-black trees store additional 'color' information in each node to determine the
location, scapegoat trees find a scapegoat which isn't α-weight-balanced to perform the rebalance operation on. This
is loosely similar to AVL trees, in that the actual rotations depend on 'balances' of nodes, but the means of
determining the balance differs greatly. Since AVL trees check the balance value on every insertion/deletion, it is
typically stored in each node; scapegoat trees are able to calculate it only as needed, which is only when a scapegoat
needs to be found.
Unlike most other self-balancing search trees, scapegoat trees are entirely flexible as to their balancing. They support
any α such that 0.5 < α < 1. A high α value results in fewer balances, making insertion quicker but lookups and
deletions slower, and vice versa for a low α. Therefore in practical applications, an α can be chosen depending on
how frequently these actions should be performed.
Operations
Insertion
Insertion is implemented with the same basic ideas as an unbalanced binary search tree, however with a few
significant changes.
When finding the insertion point, the depth of the new node must also be recorded. This is implemented via a simple
counter that gets incremented during each iteration of the lookup, effectively counting the number of edges between
the root and the inserted node. If this node violates the α-height-balance property (defined above), a rebalance is
required.
To rebalance, an entire subtree rooted at a scapegoat undergoes a balancing operation. The scapegoat is defined as
being an ancestor of the inserted node which isn't α-weight-balanced. There will always be at least one such
ancestor. Rebalancing any of them will restore the α-height-balanced property.
One way of finding a scapegoat is to climb from the new node back up to the root and select the first node that isn't
α-weight-balanced.
Climbing back up to the root requires O(log n) storage space, usually allocated on the stack, or parent pointers. This
can actually be avoided by pointing each child at its parent as you go down, and repairing on the walk back up.
To determine whether a potential node is a viable scapegoat, we need to check its α-weight-balanced property. To do
this we can go back to the definition:
size(left) <= α*size(node)
size(right) <= α*size(node)
However a large optimisation can be made by realising that we already know two of the three sizes, leaving only the
third having to be calculated.
Consider the following example to demonstrate this. Assuming that we're climbing back up to the root:
size(parent) = size(node) + size(sibling) + 1
But as:
size(inserted node) = 1.
The case is trivialized down to:
size[x+1] = size[x] + size(sibling) + 1
Where x = this node, x + 1 = parent and size(sibling) is the only function call actually required.
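The climb described above can be sketched as follows. This version assumes that nodes carry parent pointers (one of the two options mentioned earlier); the `PNode` type and the function names are hypothetical. The size of the current subtree is carried upward, so only the sibling's size is computed at each step.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type with a parent pointer.
struct PNode {
    int key;
    PNode* left;
    PNode* right;
    PNode* parent;
};

std::size_t subtree_size(const PNode* n) {
    return n ? subtree_size(n->left) + subtree_size(n->right) + 1 : 0;
}

// Climbs from a newly inserted leaf towards the root and returns the first
// ancestor that is not alpha-weight-balanced, or nullptr if there is none.
PNode* find_scapegoat(PNode* inserted, double alpha) {
    assert(inserted != nullptr);
    std::size_t size_x = 1;  // size(inserted node) = 1
    PNode* x = inserted;
    while (x->parent != nullptr) {
        PNode* p = x->parent;
        PNode* sibling = (p->left == x) ? p->right : p->left;
        // The only size computation needed at this level.
        const std::size_t size_sibling = subtree_size(sibling);
        const std::size_t size_p = size_x + size_sibling + 1;
        const double limit = alpha * static_cast<double>(size_p);
        if (static_cast<double>(size_x) > limit ||
            static_cast<double>(size_sibling) > limit) {
            return p;
        }
        x = p;
        size_x = size_p;
    }
    return nullptr;
}
```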
Once the scapegoat is found, the subtree rooted at the scapegoat is completely rebuilt to be perfectly balanced. This
can be done in O(n) time by traversing the nodes of the subtree to find their values in sorted order and recursively
choosing the median as the root of the subtree.
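A minimal sketch of such a rebuild is shown below, assuming a hypothetical `BNode` type. It reuses the existing nodes: an in-order traversal collects them in sorted order, and the middle node of each range becomes the root of that range. The caller is assumed to reattach the returned root to the scapegoat's former parent.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node type for this sketch.
struct BNode {
    int key;
    BNode* left;
    BNode* right;
};

// In-order traversal: appends the nodes of the subtree in sorted key order.
void flatten(BNode* n, std::vector<BNode*>& out) {
    if (n == nullptr) return;
    flatten(n->left, out);
    out.push_back(n);
    flatten(n->right, out);
}

// Relinks nodes[lo, hi) into a balanced subtree and returns its root.
BNode* build_balanced(std::vector<BNode*>& nodes, std::size_t lo, std::size_t hi) {
    if (lo >= hi) return nullptr;
    const std::size_t mid = lo + (hi - lo) / 2;  // median of the range
    BNode* root = nodes[mid];
    root->left = build_balanced(nodes, lo, mid);
    root->right = build_balanced(nodes, mid + 1, hi);
    return root;
}

// Rebuilds the subtree rooted at scapegoat in time linear in its size.
BNode* rebuild(BNode* scapegoat) {
    assert(scapegoat != nullptr);
    std::vector<BNode*> nodes;
    flatten(scapegoat, nodes);
    return build_balanced(nodes, 0, nodes.size());
}

// Height in edges; -1 for an empty subtree.
int height(const BNode* n) {
    if (n == nullptr) return -1;
    const int l = height(n->left);
    const int r = height(n->right);
    return 1 + (l > r ? l : r);
}
```

This sketch uses O(n) auxiliary space for the vector and recursion proportional to the subtree height while flattening; it is meant to show the idea rather than the most space-efficient variant.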
As rebalance operations take O(n) time (dependent on the number of nodes of the subtree), insertion has a worst case
performance of O(n) time. However, because these worst-case scenarios are spread out, insertion takes O(log n)
amortized time.
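Sketch of proof for cost of insertion
Define the imbalance of a node v as the absolute difference between the sizes of its left and right subtrees, minus 1,
or 0 if that value is negative:
$I(v) = \max(|\mathrm{size}(\mathrm{left}(v)) - \mathrm{size}(\mathrm{right}(v))| - 1,\ 0)$
Immediately after the subtree rooted at v is rebuilt, $I(v) = 0$.
Lemma: Immediately before the subtree rooted at v is rebuilt, $I(v) = \Omega(|v|)$, where $|v|$ is the number of nodes
in the subtree. ($\Omega$ is big-Omega notation.)
Proof of lemma:
Let $v_0$ be the root of a subtree immediately after rebuilding, so that $h(v_0) = \log(|v_0| + 1)$. If there are
$\Omega(|v_0|)$ degenerate insertions (that is, where each inserted node increases the height by 1), then
$I(v) = \Omega(|v_0|)$, $h(v) = h(v_0) + \Omega(|v_0|)$ and $\log(|v|) \leq \log(|v_0| + 1) + 1$.
Since $I(v) = \Omega(|v|)$ before rebuilding, there were $\Omega(|v|)$ insertions into the subtree rooted at v that
did not result in rebuilding. Each of these insertions can be performed in O(log n) time, and the final insertion that
causes rebuilding costs $O(|v|)$. Using aggregate analysis it becomes clear that the amortized cost of an insertion is
O(log n):
$\frac{\Omega(|v|)\,O(\log n) + O(|v|)}{\Omega(|v|)} = O(\log n)$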
Deletion
Scapegoat trees are unusual in that deletion is easier than insertion. To enable deletion, scapegoat trees need to store
an additional value with the tree data structure. This property, which we will call MaxNodeCount simply represents
the highest achieved NodeCount. It is set to NodeCount whenever the entire tree is rebalanced, and after insertion is
set to max(MaxNodeCount, NodeCount).
To perform a deletion, we simply remove the node as you would in a simple binary search tree, but if
NodeCount <= MaxNodeCount / 2
then we rebalance the entire tree about the root, remembering to set MaxNodeCount to NodeCount.
This gives deletion a worst-case performance of O(n) time; amortized over a sequence of operations, however, it takes O(log n) time.
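The bookkeeping for NodeCount and MaxNodeCount can be sketched as below. The struct and method names are hypothetical, the actual removal and rebuild are left to the caller, and the condition NodeCount <= MaxNodeCount / 2 is written as 2 * NodeCount <= MaxNodeCount to avoid integer-division rounding.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper that tracks only the two counters used by deletion.
struct ScapegoatCounters {
    std::size_t node_count = 0;
    std::size_t max_node_count = 0;

    // After an insertion: MaxNodeCount = max(MaxNodeCount, NodeCount).
    void on_insert() {
        ++node_count;
        if (node_count > max_node_count) max_node_count = node_count;
    }

    // After a node has been removed. Returns true if the caller should now
    // rebuild the entire tree; in that case MaxNodeCount is reset.
    bool on_delete() {
        assert(node_count > 0);
        --node_count;
        if (2 * node_count <= max_node_count) {
            max_node_count = node_count;
            return true;
        }
        return false;
    }
};
```

For example, after eight insertions the first three deletions leave the tree alone and the fourth triggers a full rebuild.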
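Sketch of proof for cost of deletion
Suppose the scapegoat tree has n elements and has just been rebuilt (in other words, it is a complete binary tree). At
most $n/2 - 1$ deletions can be performed before the tree must be rebuilt. Each of these deletions takes $O(\log n)$
time (the amount of time to search for the element and flag it as deleted). The $(n/2)$-th deletion causes the tree to be
rebuilt and takes $O(\log n) + O(n)$ (or just $O(n)$) time. Using aggregate analysis it becomes clear that the
amortized cost of a deletion is $O(\log n)$:
$\frac{\sum_{1}^{n/2} O(\log n) + O(n)}{n/2} = \frac{\frac{n}{2} O(\log n) + O(n)}{n/2} = O(\log n)$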
Lookup
Lookup is not modified from a standard binary search tree, and has a worst-case time of O(log n). This is in contrast
to splay trees which have a worst-case time of O(n). The reduced node memory overhead compared to other
self-balancing binary search trees can further improve locality of reference and caching.
External links
Scapegoat Tree Applet (http://people.ksp.sk/~kuko/bak/index.html) by Kubo Kovac
Scapegoat Trees: Galperin and Rivest's paper describing scapegoat trees (http://cg.scs.carleton.ca/~morin/teaching/5408/refs/gr93.pdf)
On Consulting a Set of Experts and Searching (full version paper) (http://publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-700.pdf)
Open Data Structures - Chapter 8 - Scapegoat Trees (http://opendatastructures.org/versions/edition-0.1e/ods-java/8_Scapegoat_Trees.html)
Splay tree
Splay tree
Type: Tree
Invented: 1985
          Average      Worst case
Space     O(n)         O(n)
Search    O(log n)     amortized O(log n)
Insert    O(log n)     amortized O(log n)
Delete    O(log n)     amortized O(log n)
A splay tree is a self-adjusting binary search tree with the additional property that recently accessed elements are
quick to access again. It performs basic operations such as insertion, look-up and removal in O(log n) amortized
time. For many sequences of non-random operations, splay trees perform better than other search trees, even when
the specific pattern of the sequence is unknown. The splay tree was invented by Daniel Dominic Sleator and Robert
Endre Tarjan in 1985.
All normal operations on a binary search tree are combined with one basic operation, called splaying. Splaying the
tree for a certain element rearranges the tree so that the element is placed at the root of the tree. One way to do this is
to first perform a standard binary tree search for the element in question, and then use tree rotations in a specific
fashion to bring the element to the top. Alternatively, a top-down algorithm can combine the search and the tree
reorganization into a single phase.
Advantages
Good performance for a splay tree depends on the fact that it is self-optimizing, in that frequently accessed nodes
will move nearer to the root where they can be accessed more quickly. The worst-case height, though unlikely, is
O(n), with the average being O(log n). Having frequently used nodes near the root is an advantage for nearly all
practical applications (also see Locality of reference),[citation needed] and is particularly useful for implementing
caches and garbage collection algorithms.
Advantages include:
Simple implementation: simpler than other self-balancing binary search trees, such as red-black trees or AVL
trees.
Comparable performance: average-case performance is as efficient as other trees.[citation needed]
Small memory footprint: splay trees do not need to store any bookkeeping data.
Possibility of creating a persistent data structure version of splay trees, which allows access to both the previous
and new versions after an update. This can be useful in functional programming, and requires amortized O(log n)
space per update.
Working well with nodes containing identical keys, contrary to other types of self-balancing trees. Even with
identical keys, performance remains amortized O(log n). All tree operations preserve the order of the identical
nodes within the tree, which is a property similar to stable sorting algorithms. A carefully designed find operation
can return the leftmost or rightmost node of a given key.
Disadvantages
Perhaps the most significant disadvantage of splay trees is that the height of a splay tree can be linear. For example,
this will be the case after accessing all n elements in non-decreasing order. Since the height of a tree corresponds to
the worst-case access time, this means that the actual cost of an operation can be slow. However the amortized
access cost of this worst case is logarithmic, O(log n). Also, the expected access cost can be reduced to O(log n) by
using a randomized variant.
A splay tree can be worse than a static tree by at most a constant factor.
The representation of splay trees can change even when they are accessed in a 'read-only' manner (i.e. by find
operations). This complicates the use of such splay trees in a multi-threaded environment. Specifically, extra
management is needed if multiple threads are allowed to perform find operations concurrently.
Operations
Splaying
When a node x is accessed, a splay operation is performed on x to move it to the root. To perform a splay operation
we carry out a sequence of splay steps, each of which moves x closer to the root. By performing a splay operation on
the node of interest after every access, the recently accessed nodes are kept near the root and the tree remains
roughly balanced, so that we achieve the desired amortized time bounds.
Each particular step depends on three factors:
Whether x is the left or right child of its parent node, p,
whether p is the root or not, and if not
whether p is the left or right child of its parent, g (the grandparent of x).
It is important to remember to set gg (the great-grandparent of x) to now point to x after any splay operation. If gg is
null, then x obviously is now the root and must be updated as such.
There are three types of splay steps, each of which has a left- and right-handed case. For the sake of brevity, only one
of these two is shown for each type. These three types are:
Zig Step: This step is done when p is the root. The tree is rotated on the edge between x and p. Zig steps exist to deal
with the parity issue and will be done only as the last step in a splay operation and only when x has odd depth at the
beginning of the operation.
Zig-zig Step: This step is done when p is not the root and x and p are either both right children or are both left
children. The picture below shows the case where x and p are both left children. The tree is rotated on the edge
joining p with its parent g, then rotated on the edge joining x with p. Note that zig-zig steps are the only thing that
differentiates splay trees from the rotate-to-root method introduced by Allen and Munro prior to the introduction of
splay trees.
Zig-zag Step: This step is done when p is not the root and x is a right child and p is a left child or vice versa. The
tree is rotated on the edge between p and x, and then rotated on the resulting edge between x and g.
Insertion
To insert a node x into a splay tree:
1. First insert the node as with a normal binary search tree.
2. Then splay the newly inserted node x to the top of the tree.
Deletion
To delete a node x, we use the same method as with a binary search tree: if x has two children, we swap its value
with that of either the rightmost node of its left sub tree (its in-order predecessor) or the leftmost node of its right
subtree (its in-order successor). Then we remove that node instead. In this way, deletion is reduced to the problem of
removing a node with 0 or 1 children.
Unlike a binary search tree, in a splay tree after deletion, we splay the parent of the removed node to the top of the
tree.
Alternatively, the node to be deleted is first splayed, i.e. brought to the root of the tree, and then deleted. This leaves
the tree with two subtrees. Either the maximum element of the left subtree (method 1) or the minimum element of the
right subtree (method 2) is then splayed to the root of its subtree. For method 1, the right subtree is made the right
child of the resulting left subtree, whose root becomes the root of the melded tree.
Implementation
Below is an implementation of splay trees in C++, which uses pointers to represent each node of the tree. This
implementation is based on the second method of deletion on a splay tree.
#include <functional>
#ifndef SPLAY_TREE
#define SPLAY_TREE

template< typename T, typename Comp = std::less< T > >
class splay_tree {
private:
  Comp comp;
  unsigned long p_size;

  struct node {
    node *left, *right;
    node *parent;
    T key;
    node( const T& init = T( ) ) : left( 0 ), right( 0 ), parent( 0 ), key( init ) { }
  } *root;

  // Rotate the edge between x and its right child y, so that y moves up.
  void left_rotate( node *x ) {
    node *y = x->right;
    x->right = y->left;
    if( y->left ) y->left->parent = x;
    y->parent = x->parent;
    if( !x->parent ) root = y;
    else if( x == x->parent->left ) x->parent->left = y;
    else x->parent->right = y;
    y->left = x;
    x->parent = y;
  }

  // Mirror image of left_rotate: the left child y of x moves up.
  void right_rotate( node *x ) {
    node *y = x->left;
    x->left = y->right;
    if( y->right ) y->right->parent = x;
    y->parent = x->parent;
    if( !x->parent ) root = y;
    else if( x == x->parent->left ) x->parent->left = y;
    else x->parent->right = y;
    y->right = x;
    x->parent = y;
  }

  // Move x to the root by a sequence of zig, zig-zig and zig-zag steps.
  void splay( node *x ) {
    while( x->parent ) {
      if( !x->parent->parent ) {
        // Zig: the parent of x is the root.
        if( x->parent->left == x ) right_rotate( x->parent );
        else left_rotate( x->parent );
      } else if( x->parent->left == x && x->parent->parent->left == x->parent ) {
        // Zig-zig, both left children.
        right_rotate( x->parent->parent );
        right_rotate( x->parent );
      } else if( x->parent->right == x && x->parent->parent->right == x->parent ) {
        // Zig-zig, both right children.
        left_rotate( x->parent->parent );
        left_rotate( x->parent );
      } else if( x->parent->left == x && x->parent->parent->right == x->parent ) {
        // Zig-zag: x is a left child, its parent is a right child.
        right_rotate( x->parent );
        left_rotate( x->parent );
      } else {
        // Zig-zag: x is a right child, its parent is a left child.
        left_rotate( x->parent );
        right_rotate( x->parent );
      }
    }
  }

  // Replace the subtree rooted at u with the subtree rooted at v (v may be null).
  void replace( node *u, node *v ) {
    if( !u->parent ) root = v;
    else if( u == u->parent->left ) u->parent->left = v;
    else u->parent->right = v;
    if( v ) v->parent = u->parent;
  }

  node* subtree_minimum( node *u ) {
    while( u->left ) u = u->left;
    return u;
  }

  node* subtree_maximum( node *u ) {
    while( u->right ) u = u->right;
    return u;
  }

public:
  // p_size must start at zero; the original listing left it uninitialized.
  splay_tree( ) : p_size( 0 ), root( 0 ) { }

  void insert( const T &key ) {
    node *z = root;
    node *p = 0;

    // Ordinary binary search tree descent to find the insertion point.
    while( z ) {
      p = z;
      if( comp( z->key, key ) ) z = z->right;
      else z = z->left;
    }

    z = new node( key );
    z->parent = p;

    if( !p ) root = z;
    else if( comp( p->key, z->key ) ) p->right = z;
    else p->left = z;

    splay( z );
    p_size++;
  }

  // Plain lookup; note that this function does not splay the node it finds.
  node* find( const T &key ) {
    node *z = root;
    while( z ) {
      if( comp( z->key, key ) ) z = z->right;
      else if( comp( key, z->key ) ) z = z->left;
      else return z;
    }
    return 0;
  }

  void erase( const T &key ) {
    node *z = find( key );
    if( !z ) return;

    splay( z );  // z is now the root

    if( !z->left ) replace( z, z->right );
    else if( !z->right ) replace( z, z->left );
    else {
      // Put the minimum of the right subtree in place of z.
      node *y = subtree_minimum( z->right );
      if( y->parent != z ) {
        replace( y, y->right );
        y->right = z->right;
        y->right->parent = y;
      }
      replace( z, y );
      y->left = z->left;
      y->left->parent = y;
    }

    delete z;  // release the unlinked node (missing in the original listing)
    p_size--;
  }

  // Both functions assume a non-empty tree.
  const T& minimum( ) { return subtree_minimum( root )->key; }
  const T& maximum( ) { return subtree_maximum( root )->key; }

  bool empty( ) const { return root == 0; }
  unsigned long size( ) const { return p_size; }
};

#endif // SPLAY_TREE
Below is an implementation of splay trees in C#. This implementation is based on the second method of
deletion on a splay tree.
using System;
using System.Collections.Generic;

namespace Trees
{
    class SplayNode<T> where T : IComparable
    {
        public SplayNode<T> Left = null;
        public SplayNode<T> Right = null;
        public SplayNode<T> Parent = null;
        public T Node;
    }

    // Note: this listing stores smaller values in the right subtree and larger
    // values in the left subtree. Insert, FindNode and Remove all rely on that
    // mirrored ordering.
    class SplayTree<T> where T : IComparable
    {
        public SplayTree()
        {
            Root = null;
        }

        public void Insert(T node)
        {
            if (node == null)
            {
                return;
            }
            SplayNode<T> z = Root;
            SplayNode<T> p = null;
            while (z != null)
            {
                p = z;
                if (node.CompareTo(p.Node) < 0)
                {
                    z = z.Right;
                }
                else
                {
                    z = z.Left;
                }
            }
            z = new SplayNode<T>();
            z.Node = node;
            z.Parent = p;
            if (p == null)
            {
                Root = z;
            }
            else if (node.CompareTo(p.Node) < 0)
            {
                p.Right = z;
            }
            else
            {
                p.Left = z;
            }
            Splay(z);
            Count++;
        }

        // Returns true if the value is present; a node that is found is splayed
        // to the root. (The original listing returned an undefined Depth value.)
        public bool Find(T node)
        {
            SplayNode<T> find = FindNode(node);
            if (find == null)
            {
                return false;
            }
            Splay(find);
            return true;
        }

        public void Remove(T node)
        {
            SplayNode<T> find = FindNode(node);
            Remove(find);
        }

        private void Remove(SplayNode<T> node)
        {
            if (node == null)
            {
                return;
            }
            Splay(node);
            if ((node.Left != null) && (node.Right != null))
            {
                // Under the mirrored ordering, the rightmost node of the left
                // subtree holds the smallest value greater than node.
                SplayNode<T> min = node.Left;
                while (min.Right != null)
                {
                    min = min.Right;
                }
                min.Right = node.Right;
                node.Right.Parent = min;
                node.Left.Parent = null;
                Root = node.Left;
            }
            else if (node.Right != null)
            {
                node.Right.Parent = null;
                Root = node.Right;
            }
            else if (node.Left != null)
            {
                node.Left.Parent = null;
                Root = node.Left;
            }
            else
            {
                Root = null;
            }
            node.Parent = null;
            node.Left = null;
            node.Right = null;
            Count--;
        }

        private SplayNode<T> FindNode(T node)
        {
            SplayNode<T> z = Root;
            while (z != null)
            {
                if (node.CompareTo(z.Node) < 0)
                {
                    z = z.Right;
                }
                else if (node.CompareTo(z.Node) > 0)
                {
                    z = z.Left;
                }
                else
                {
                    return z;
                }
            }
            return null;
        }

        // The source text is missing this method and the opening lines of
        // MakeRightChildParent; both are reconstructed here by symmetry with
        // the surviving part of MakeRightChildParent.
        // Rotates the left child c above its parent p.
        void MakeLeftChildParent(SplayNode<T> c, SplayNode<T> p)
        {
            if (p.Parent != null)
            {
                if (p == p.Parent.Left)
                {
                    p.Parent.Left = c;
                }
                else
                {
                    p.Parent.Right = c;
                }
            }
            if (c.Right != null)
            {
                c.Right.Parent = p;
            }
            c.Parent = p.Parent;
            p.Parent = c;
            p.Left = c.Right;
            c.Right = p;
        }

        // Rotates the right child c above its parent p.
        void MakeRightChildParent(SplayNode<T> c, SplayNode<T> p)
        {
            if (p.Parent != null)
            {
                if (p == p.Parent.Left)
                {
                    p.Parent.Left = c;
                }
                else
                {
                    p.Parent.Right = c;
                }
            }
            if (c.Left != null)
            {
                c.Left.Parent = p;
            }
            c.Parent = p.Parent;
            p.Parent = c;
            p.Right = c.Left;
            c.Left = p;
        }

        void Splay(SplayNode<T> x)
        {
            while (x.Parent != null)
            {
                SplayNode<T> Parent = x.Parent;
                SplayNode<T> GrandParent = Parent.Parent;
                if (GrandParent == null)
                {
                    // Zig step.
                    if (x == Parent.Left)
                    {
                        MakeLeftChildParent(x, Parent);
                    }
                    else
                    {
                        MakeRightChildParent(x, Parent);
                    }
                }
                else
                {
                    if (x == Parent.Left)
                    {
                        if (Parent == GrandParent.Left)
                        {
                            // Zig-zig step.
                            MakeLeftChildParent(Parent, GrandParent);
                            MakeLeftChildParent(x, Parent);
                        }
                        else
                        {
                            // Zig-zag step.
                            MakeLeftChildParent(x, x.Parent);
                            MakeRightChildParent(x, x.Parent);
                        }
                    }
                    else
                    {
                        if (Parent == GrandParent.Left)
                        {
                            // Zig-zag step.
                            MakeRightChildParent(x, x.Parent);
                            MakeLeftChildParent(x, x.Parent);
                        }
                        else
                        {
                            // Zig-zig step.
                            MakeRightChildParent(Parent, GrandParent);
                            MakeRightChildParent(x, Parent);
                        }
                    }
                }
            }
            Root = x;
        }

        public void Clear()
        {
            while (Root != null)
            {
                Remove(Root);
            }
        }

        SplayNode<T> Root = null;
        int Count = 0;
    }
}
Analysis
A simple amortized analysis of static splay trees can be carried out using the potential method. Suppose that size(r)
is the number of nodes in the subtree rooted at r (including r) and rank(r) = log2(size(r)). Then the potential function
P(t) for a splay tree t is the sum of the ranks of all the nodes in the tree. This will tend to be high for poorly balanced
trees, and low for well-balanced trees. We can bound the amortized cost of any zig-zig or zig-zag operation by:
amortized cost = cost + P(t_f) - P(t_i) ≤ 3(rank_f(x) - rank_i(x)),
where x is the node being moved towards the root, and the subscripts "f" and "i" indicate after and before the
operation, respectively. When summed over the entire splay operation, this telescopes to 3(rank(root)) which is
O(log n). Since there's at most one zig operation, this only adds a constant.
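The potential function can be computed directly from its definition. The sketch below is only an illustration of that definition; the `RNode` type and the function names are hypothetical, and the tree shape alone (no keys) is enough to evaluate it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical node type; only the shape of the tree matters here.
struct RNode {
    RNode* left;
    RNode* right;
};

// Adds rank(r) = log2(size(r)) for every node r of the subtree to sum and
// returns the size of the subtree.
std::size_t accumulate_ranks(const RNode* r, double& sum) {
    if (r == nullptr) return 0;
    const std::size_t s =
        accumulate_ranks(r->left, sum) + accumulate_ranks(r->right, sum) + 1;
    sum += std::log2(static_cast<double>(s));
    return s;
}

// P(t): the sum of the ranks of all nodes in the tree.
double potential(const RNode* root) {
    double sum = 0.0;
    accumulate_ranks(root, sum);
    assert(sum >= 0.0);  // every rank is non-negative because size >= 1
    return sum;
}
```

For three nodes, a chain has potential log2(1) + log2(2) + log2(3), while a balanced tree has log2(1) + log2(1) + log2(3), so the chain's potential is larger by exactly 1.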
Performance theorems
There are several theorems and conjectures regarding the worst-case runtime for performing a sequence S of m
accesses in a splay tree containing n elements.
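Balance Theorem
The cost of performing the sequence S is $O(m \log n + n \log n)$.
Static Optimality Theorem
Let $q_i$ be the number of times element i is accessed in S. The cost of performing S is
$O\left(m + \sum_{i=1}^{n} q_i \log \frac{m}{q_i}\right)$.
Static Finger Theorem
Let $i_j$ be the element accessed in the $j$-th access of S and let f be any fixed element (the finger). The cost of
performing S is $O\left(m + n \log n + \sum_{j=1}^{m} \log(|i_j - f| + 1)\right)$.
Working Set Theorem
Let $t(j)$ be the number of distinct elements accessed between access j and the previous time element $i_j$ was
accessed. The cost of performing S is $O\left(m + n \log n + \sum_{j=1}^{m} \log(t(j) + 1)\right)$.
Dynamic Finger Theorem
The cost of performing S is $O\left(m + n + \sum_{j=1}^{m} \log(|i_{j+1} - i_j| + 1)\right)$.
Scanning Theorem
Also known as the Sequential Access Theorem. Accessing the n elements of a splay tree in symmetric order
takes O(n) time, regardless of the initial structure of the splay tree. The tightest upper bound proven so far is
$4.5n$.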
In addition to the proven performance guarantees for splay trees there is an unproven conjecture of great interest
from the original Sleator and Tarjan paper. This conjecture is known as the dynamic optimality conjecture and it
basically claims that splay trees perform as well as any other binary search tree algorithm up to a constant factor.
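Dynamic Optimality Conjecture: Let $A$ be any binary search tree algorithm that accesses an element $x$ by
traversing the path from the root to $x$ at a cost of $d(x) + 1$, and that between accesses can make any rotations in
the tree at a cost of 1 per rotation. Let $A(S)$ be the cost for $A$ to perform the sequence $S$ of
accesses. Then the cost for a splay tree to perform the same accesses is $O(n + A(S))$.
There are several corollaries of the dynamic optimality conjecture that remain unproven:
Traversal Conjecture: Let $T_1$ and $T_2$ be two splay trees containing the same elements. Let $S$ be the sequence
obtained by visiting the elements in $T_2$ in preorder (i.e., depth first search order). The total cost of performing the
sequence $S$ of accesses on $T_1$ is $O(n)$.
Deque Conjecture: Let $S$ be a sequence of $m$ double-ended queue operations (push, pop, inject, eject). Then the
cost of performing $S$ on a splay tree is $O(m + n)$.
Split Conjecture: Let $S$ be any permutation of the elements of the splay tree. Then the cost of deleting the
elements in the order $S$ is $O(n)$.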
External links
NIST's Dictionary of Algorithms and Data Structures: Splay Tree (http://www.nist.gov/dads/HTML/splaytree.html)
Implementations in C and Java (by Daniel Sleator) (http://www.link.cs.cmu.edu/link/ftp-site/splaying/)
(link is broken: page not found) Pointers to splay tree visualizations (http://wiki.algoviz.org/AlgovizWiki/SplayTrees)
Fast and efficient implementation of Splay trees (http://github.com/fbuihuu/libtree)
Top-Down Splay Tree Java implementation (http://github.com/cpdomina/SplayTree)
Zipper Trees (http://arxiv.org/abs/1003.0139)
splay tree video (http://www.youtube.com/watch?v=G5QIXywcJlY)
Tango tree
A Tango tree is a type of binary search tree proposed by Erik D. Demaine, Dion Harmon, John Iacono, and Mihai
Patrascu in 2004. It is an online binary search tree that achieves an $O(\log \log n)$ competitive ratio relative to the
optimal offline binary search tree, while only using $O(\log \log n)$ additional bits of memory per node. This
improved upon the previous best known competitive ratio, which was $O(\log n)$.
Structure
Tango trees work by partitioning a binary search tree into a set of preferred paths, which are themselves stored in
auxiliary trees (so the tango tree is represented as a tree of trees).
Reference Tree
To construct a tango tree, we simulate a complete binary search tree called the reference tree, which is simply a
traditional binary search tree containing all the elements. This tree never shows up in the actual implementation, but
is the conceptual basis behind the following pieces of a tango tree.
Preferred Paths
First, we define for each node its preferred child, which informally is the most-recently-touched child by a
traditional binary search tree lookup. More formally, consider a subtree T, rooted at p, with children l (left) and r
(right). We set r as the preferred child of p if the most recently accessed node in T is in the subtree rooted at r, and l
as the preferred child otherwise. Note that if the most recently accessed node of T is p itself, then l is the preferred
child by definition.
A preferred path is defined by starting at the root and following the preferred children until reaching a leaf node.
Removing the nodes on this path partitions the remainder of the tree into a number of subtrees, and we recurse on
each subtree (forming a preferred path from its root, which partitions the subtree into more subtrees).
Auxiliary Trees
To represent a preferred path, we store its nodes in a balanced binary search tree, specifically a red-black tree. Each
non-leaf node n in a preferred path P has a non-preferred child c, which is the root of a new auxiliary tree.
We attach this other auxiliary tree's root (c) to n in P, thus linking the auxiliary trees together. We also augment the
auxiliary tree by storing at each node the minimum and maximum depth (depth in the reference tree, that is) of nodes
in the subtree under that node.
Algorithm
Searching
To search for an element in the tango tree, we simply simulate searching the reference tree. We start by searching the
preferred path connected to the root, which is simulated by searching the auxiliary tree corresponding to that
preferred path. If the auxiliary tree doesn't contain the desired element, the search terminates on the parent of the root
of the subtree containing the desired element (the beginning of another preferred path), so we simply proceed by
searching the auxiliary tree for that preferred path, and so forth.
Updating
In order to maintain the structure of the tango tree (auxiliary trees correspond to preferred paths), we must do some
updating work whenever preferred children change as a result of searches. When a preferred child changes, the top
part of a preferred path becomes detached from the bottom part (which becomes its own preferred path) and
reattached to another preferred path (which becomes the new bottom part). In order to do this efficiently, we'll define
cut and join operations on our auxiliary trees.
Join
Our join operation will combine two auxiliary trees as long as they have the property that the top node of one (in the
reference tree) is a child of the bottom node of the other (essentially, that the corresponding preferred paths can be
concatenated). This will work based on the concatenate operation of red-black trees, which combines two trees as
long as they have the property that all elements of one are less than all elements of the other, and split, which does
the reverse. In the reference tree, note that there exist two nodes in the top path such that a node is in the bottom path
if and only if its key-value is between them. Now, to join the bottom path to the top path, we simply split the top path
between those two nodes, then concatenate the two resulting auxiliary trees on either side of the bottom path's
auxiliary tree, and we have our final, joined auxiliary tree.
Cut
Our cut operation will break a preferred path into two parts at a given node, a top part and a bottom part. More
formally, it'll partition an auxiliary tree into two auxiliary trees, such that one contains all nodes at or above a certain
depth in the reference tree, and the other contains all nodes below that depth. As in join, note that the top part has
two nodes that bracket the bottom part. Thus, we can simply split on each of these two nodes to divide the path into
three parts, then concatenate the two outer ones so we end up with two parts, the top and bottom, as desired.
Analysis
In order to bound the competitive ratio for tango trees, we must find a lower bound on the performance of the
optimal offline tree that we use as a benchmark. Once we find an upper bound on the performance of the tango tree,
we can divide them to bound the competitive ratio.
Interleave Bound
To find a lower bound on the work done by the optimal offline binary search tree, we again use the notion of
preferred children. When considering an access sequence (a sequence of searches), we keep track of how many times
a reference tree node's preferred child switches. The total number of switches (summed over all nodes) gives an
asymptotic lower bound on the work done by any binary search tree algorithm on the given access sequence. This is
called the interleave lower bound.[1]
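Counting preferred-child switches for a concrete access sequence can be sketched as follows. Several assumptions are made for this illustration: the keys are the integers 1..n, the reference tree is the implicit balanced tree whose subtree over keys lo..hi is rooted at their midpoint, and the first time a node receives a preferred child is not counted as a switch. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times any reference-tree node changes its preferred child
// while the given keys (each in 1..n) are accessed in order.
std::size_t count_preferred_switches(int n, const std::vector<int>& accesses) {
    assert(n >= 1);
    // pref[k] for the node with key k: 0 = no preferred child yet,
    // -1 = left child preferred, +1 = right child preferred.
    std::vector<int> pref(static_cast<std::size_t>(n) + 1, 0);
    std::size_t switches = 0;
    for (int key : accesses) {
        int lo = 1, hi = n;
        while (lo <= hi) {
            const int mid = lo + (hi - lo) / 2;  // root of the subtree lo..hi
            // Accessing the node itself makes the left child preferred,
            // following the definition of preferred children.
            const int dir = (key > mid) ? +1 : -1;
            if (pref[mid] != 0 && pref[mid] != dir) ++switches;
            pref[mid] = dir;
            if (key == mid) break;
            if (key < mid) hi = mid - 1;
            else lo = mid + 1;
        }
    }
    return switches;
}
```

Alternating between the smallest and largest key forces the root to switch on every access after the first, whereas accessing keys in increasing order causes few switches.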
Tango Tree
In order to connect this to tango trees, we will find an upper bound on the work done by the tango tree for a given
access sequence. Our upper bound will be $(k+1)\,O(\log \log n)$, where k is the number of interleaves.
The total cost is divided into two parts, searching for the element, and updating the structure of the tango tree to
maintain the proper invariants (switching preferred children and re-arranging preferred paths).
Searching
To see that the searching (not updating) fits in this bound, simply note that every time an auxiliary tree search is
unsuccessful and we have to move to the next auxiliary tree, that results in a preferred child switch (since the parent
preferred path now switches directions to join the child preferred path). Since all auxiliary tree searches are
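unsuccessful except the last one (we stop once a search is successful, naturally), we search $k+1$ auxiliary trees.
Each search takes $O(\log \log n)$, because an auxiliary tree's size is bounded by $\log n$, the height of the reference
tree.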
Updating
The update cost fits within this bound as well, because we only have to perform one cut and one join for every
visited auxiliary tree. A single cut or join operation takes only a constant number of searches, splits, and
concatenates, each of which takes logarithmic time in the size of the auxiliary tree, so our update cost is
$(k+1)\,O(\log \log n)$.
Competitive Ratio
Tango trees are $O(\log \log n)$-competitive, because the work done by the optimal offline binary search tree is at
least linear in k (the total number of preferred child switches), and the work done by the tango tree is at most
$(k+1)\,O(\log \log n)$.
References
[1] Demaine, E., Harmon, D., Iacono, J., and Patrascu, M. SIAM Journal on Computing 2007 37:1, 240-251. http://dx.doi.org/10.1137/S0097539705447347
Skip list
Skip List
Type: List
Invented: 1989
Invented by: W. Pugh

Time complexity in big O notation
          Average     Worst case
Space     O(n)        O(n log n)[1]
Search    O(log n)    O(n)
Insert    O(log n)    O(n)
Delete    O(log n)    O(n)
A skip list is a data structure for storing a sorted list of items using a hierarchy of linked lists that connect
increasingly sparse subsequences of the items. These auxiliary lists allow item lookup with efficiency comparable to
balanced binary search trees (that is, with number of probes proportional to log n instead of n).
Each link of the sparser lists skips over many items of the full list in one step, hence the structure's name. These
forward links may be added in a randomized way with a geometric / negative binomial distribution. Insert, search
and delete operations are performed in logarithmic expected time. The links may also be added in a non-probabilistic
way so as to guarantee amortized (rather than merely expected) logarithmic cost.[2]
Description
A skip list is built in layers. The bottom layer is an ordinary ordered linked list. Each higher layer acts as an "express
lane" for the lists below, where an element in layer i appears in layer i+1 with some fixed probability p (two
commonly used values for p are 1/2 or 1/4). On average, each element appears in 1/(1-p) lists, and the tallest element
(usually a special head element at the front of the skip list) in log_{1/p} n lists.
A search for a target element begins at the head element in the top list, and proceeds horizontally until the current
element is greater than or equal to the target. If the current element is equal to the target, it has been found. If the
current element is greater than the target, or the search reaches the end of the linked list, the procedure is repeated
after returning to the previous element and dropping down vertically to the next lower list. The expected number of
steps in each linked list is at most 1/p, which can be seen by tracing the search path backwards from the target until
reaching an element that appears in the next higher list or reaching the beginning of the current list. Therefore, the
total expected cost of a search is (log_{1/p} n)/p, which is O(log n) when p is a constant. By choosing different
values of p, it is possible to trade search costs against storage costs.
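The search and insertion procedures described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Pugh's reference implementation: the names SkipList, Node, max_height and seed are invented for the example, the head is given a fixed maximum height, and the search looks ahead before stepping (instead of stepping and backing up) and always descends to the bottom list.

```python
import random


class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height  # next[i] is the successor in list i


class SkipList:
    def __init__(self, p=0.5, max_height=16, seed=None):
        self.p = p                            # promotion probability
        self.max_height = max_height          # assumed cap on the number of lists
        self.rng = random.Random(seed)
        self.head = Node(None, max_height)    # head takes part in every list

    def _random_height(self):
        # An element in list i also appears in list i+1 with probability p.
        height = 1
        while height < self.max_height and self.rng.random() < self.p:
            height += 1
        return height

    def _predecessors(self, key):
        """For each list, the last node whose key is smaller than key."""
        preds = [self.head] * self.max_height
        node = self.head
        for level in reversed(range(self.max_height)):
            while node.next[level] is not None and node.next[level].key < key:
                node = node.next[level]       # move horizontally
            preds[level] = node               # then drop down one list
        return preds

    def contains(self, key):
        candidate = self._predecessors(key)[0].next[0]
        return candidate is not None and candidate.key == key

    def insert(self, key):
        preds = self._predecessors(key)
        found = preds[0].next[0]
        if found is not None and found.key == key:
            return False                      # keys are kept distinct
        node = Node(key, self._random_height())
        for level in range(len(node.next)):   # splice into every list it joins
            node.next[level] = preds[level].next[level]
            preds[level].next[level] = node
        return True

    def __iter__(self):
        node = self.head.next[0]
        while node is not None:
            yield node.key
            node = node.next[0]
```

Iterating over the structure walks the bottom list, so the keys always come out in sorted order regardless of the coin flips; only the search cost depends on them.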
Implementation details
The elements used for a skip list can contain more than one pointer since they can participate in more than one list.
Insertions and deletions are implemented much like the corresponding linked-list operations, except that "tall"
elements must be inserted into or deleted from more than one linked list.
Θ(n) operations, which force us to visit every node in ascending order (such as printing the entire list), provide the
opportunity to perform a behind-the-scenes derandomization of the level structure of the skip list in an optimal way,
bringing the skip list to O(log n) search time. (Choose the level of the i'th finite node to be 1 plus the number of
times we can repeatedly divide i by 2 before it becomes odd. Also, i=0 for the negative infinity header, as we have
the usual special case of choosing the highest possible level for negative and/or positive infinite nodes.) However,
this also allows someone to know where all of the higher-than-level-1 nodes are and delete them.
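The level rule in the parenthetical can be written out directly. The Python sketch below is only an illustration of that rule for finite nodes numbered from 1; the function name derandomized_level is an assumption made for the example.

```python
def derandomized_level(i):
    """Level of the i'th finite node (i >= 1): 1 plus the number of times
    i can be divided by 2 before it becomes odd."""
    if i < 1:
        raise ValueError("finite nodes are numbered from 1")
    level = 1
    while i % 2 == 0:
        i //= 2
        level += 1
    return level


levels = [derandomized_level(i) for i in range(1, 9)]
# levels is [1, 2, 1, 3, 1, 2, 1, 4]
```

Every second node reaches level 2, every fourth reaches level 3, and so on, which is exactly the regular pattern an adversary can exploit.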
Alternatively, we could make the level structure quasi-random in the following way:
make all nodes level 1
j ← 1
while the number of nodes at level j > 1 do
    for each i'th node at level j do
        if i is odd
            if i is not the last node at level j
                randomly choose whether to promote it to level j+1
            else
                do not promote
            end if
        else if i is even and node i-1 was not promoted
            promote it to level j+1
        end if
    repeat
    j ← j + 1
repeat
Like the derandomized version, quasi-randomization is only done when there is some other reason to be running a
Θ(n) operation (which visits every node).
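As a rough illustration of the quasi-random rule (within each odd-even pair at a level, a coin flip decides whether the odd node is promoted, the even node is promoted exactly when its odd partner was not, and an unpaired last node is not promoted), here is a Python sketch. The function name quasi_random_levels and the list-of-levels representation are assumptions made for the example.

```python
import random


def quasi_random_levels(n, rng=None):
    """Return the level (starting at 1) of each of n nodes, in list order."""
    rng = rng or random.Random()
    levels = [1] * n
    current = list(range(n))            # indices of the nodes at level j or higher
    while len(current) > 1:
        promoted = []
        previous_promoted = False
        for i, index in enumerate(current, start=1):   # i is the 1-based position
            if i % 2 == 1:
                # odd position: flip a coin, unless it is the last node at this level
                promote = i != len(current) and rng.random() < 0.5
            else:
                # even position: promoted exactly when node i-1 was not
                promote = not previous_promoted
            if promote:
                levels[index] += 1
                promoted.append(index)
            previous_promoted = promote
        current = promoted              # the next pass works on level j+1
    return levels
```

Each pass promotes exactly one node out of every complete pair, so the number of nodes halves from one level to the next and the loop stops once a single node remains at the top.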
The advantage of this quasi-randomness is that it doesn't give away nearly as much level-structure related
information to an adversarial user as the de-randomized one. This is desirable because an adversarial user who is
able to tell which nodes are not at the lowest level can pessimize performance by simply deleting higher-level nodes.
The search performance is still guaranteed to be logarithmic.
It would be tempting to make the following "optimization": in the part which says "for each i'th node at level j",
forget about doing a coin flip for each even-odd pair. Just flip a coin once per level to decide whether to promote
only the even ones or only the odd ones. Instead of one coin flip per pair of nodes at every level, there would be
only one per level, Θ(log n) in total. Unfortunately, this gives the adversarial user a 50/50 chance of being correct
upon guessing that all of the even-numbered nodes (among the ones at level 1 or higher) are higher than level one.
This is despite the property that the user has a very low probability of guessing that a particular node is at
level N for some integer N.
The following proves these two claims concerning the advantages of quasi-randomness over the totally
derandomized version. First, to prove that the search time is guaranteed to be logarithmic. Suppose a node n is
searched for, where n is the position of the found node among the nodes of level 1 or higher. If n is even, then there
is a 50/50 chance that it is higher than level 1. However, if it is not higher than level 1 then node n-1 is guaranteed to
be higher than level 1. If n is odd, then there is a 50/50 chance that it is higher than level 1. Suppose that it is not;
there is a 50/50 chance that node n-1 is higher than level 1. Suppose that this is not either; we are guaranteed that
node n-2 is higher than level 1. The analysis can then be repeated for nodes of level 2 or higher, level 3 or higher,
etc. always keeping in mind that n is the position of the node among the ones of level k or higher for integer k. So the
search time is constant in the best case (if the found node is the highest possible level) and 2 times the worst case for
the search time for the totally derandomized skip-list (because we have to keep moving left twice rather than keep
moving left once).
Next, an examination of the probability of an adversarial user's guess of a node being level k or higher being correct.
First, the adversarial user has a 50/50 chance of correctly guessing that a particular node is level 2 or higher. This
event is independent of whether or not the user correctly guesses at some other node being level 2 or higher. If the
user knows the positions of two consecutive nodes of level 2 or higher, and knows that the one on the left is in an
odd numbered position among the nodes of level 2 or higher, the user has a 50/50 chance of correctly guessing which
one is of level 3 or higher. So, the user's probability of being correct, when guessing that a node is level 3 or higher,
is 1/4. Inductively continuing this analysis, we see that the user's probability of guessing that a particular node is
level k or higher is 1/2^(k-1).
The above analyses only work when the number of nodes is a power of two. However, because of the third rule
which says, "Finally, if i is odd and also the last node at level 1 then do not promote." (where we substitute the
appropriate level number for 1) it becomes a sequence of exact-power-of-two-sized skiplists, concatenated onto each
other, for which the analysis does work. In fact, the exact powers of two correspond to the binary representation for
the number of nodes in the whole list.
A skip list upon which we have not recently performed either of the above-mentioned Θ(n) operations does not
provide the same absolute worst-case performance guarantees as more traditional balanced tree data structures,
because it is always possible (though with very low probability) that the coin-flips used to build the skip list will
produce a badly balanced structure. However, they work well in practice, and the randomized balancing scheme has
been argued to be easier to implement than the deterministic balancing schemes used in balanced binary search trees.
Skip lists are also useful in parallel computing, where insertions can be done in different parts of the skip list in
parallel without any global rebalancing of the data structure. Such parallelism can be especially advantageous for
resource discovery in an ad hoc wireless network because a randomized skip list can be made robust to the loss of
any single node.
There has been some evidence that skip lists have worse real-world performance and space requirements than B trees
due to memory locality and other issues.[3]
Indexable skiplist
As described above, a skiplist is capable of fast O(log n) insertion and removal of values from a sorted sequence,
but it has only slow O(n) lookups of values at a given position in the sequence (i.e. return the 500th value);
however, with a minor modification the speed of random access indexed lookups can be improved to O(log n).
For every link, also store the width of the link. The width is defined as the number of bottom layer links being
traversed by each of the higher layer "express lane" links.
For example, here are the widths of the links in an example skip list of ten nodes:
  1                               10
o---> o---------------------------------------------------------> o    Top level
  1           3              2                    5
o---> o---------------> o---------> o---------------------------> o    Level 3
  1        2        1        2                    5
o---> o---------> o---> o---------> o---------------------------> o    Level 2
  1     1     1     1     1     1     1     1     1     1     1
o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o---> o    Bottom level
Head  1st   2nd   3rd   4th   5th   6th   7th   8th   9th   10th  NIL
      Node  Node  Node  Node  Node  Node  Node  Node  Node  Node
Notice that the width of a higher level link is the sum of the component links below it (i.e. the width 10 link spans
the links of widths 3, 2 and 5 immediately below it). Consequently, the sum of all widths is the same on every level
(10 + 1 = 1 + 3 + 2 + 5 = 1 + 2 + 1 + 2 + 5).
To index the skiplist and find the i'th value, traverse the skiplist while counting down the widths of each traversed
link. Descend a level whenever the upcoming width would be too large.
For example, to find the node in the fifth position (Node 5), traverse a link of width 1 at the top level. Now four more
steps are needed but the next width on this level is ten which is too large, so drop one level. Traverse one link of
width 3. Since another step of width 2 would be too far, drop down to the bottom level. Now traverse the final link of
width 1 to reach the target running total of 5 (1+3+1).
function lookupByPositionIndex(i)
    node ← head
    i ← i + 1                           # don't count the head as a step
    for level from top to bottom do
        while i ≥ node.width[level] do  # if next step is not too far
            i ← i - node.width[level]   # subtract the current width
            node ← node.next[level]     # traverse forward at the current level
        repeat
    repeat
    return node.value
end function
This method of implementing indexing is detailed in Section 3.4 Linear List Operations in "A skip list cookbook" by
William Pugh [4].
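The indexed lookup can be tried out on the ten-node example whose widths are listed above (10 + 1 = 1 + 3 + 2 + 5 = 1 + 2 + 1 + 2 + 5). The Python sketch below is illustrative only: the Node class, the build_example helper and the node heights are assumptions chosen to reproduce those widths, and the index is taken to be zero-based, so index 4 is the fifth node. Out-of-range indices are not checked.

```python
class Node:
    def __init__(self, value, height):
        self.value = value
        self.next = [None] * height   # next[0] is the bottom level
        self.width = [0] * height     # width[k] = bottom-level links spanned by next[k]


def build_example():
    """Head, ten nodes and NIL, with the link widths of the example."""
    # position -> number of levels the node takes part in (others: 1)
    heights = {0: 4, 1: 4, 3: 2, 4: 3, 6: 3, 11: 4}
    nodes = [
        Node(None if pos in (0, 11) else f"Node {pos}", heights.get(pos, 1))
        for pos in range(12)
    ]
    for level in range(4):
        members = [pos for pos in range(12) if len(nodes[pos].next) > level]
        for a, b in zip(members, members[1:]):
            nodes[a].next[level] = nodes[b]
            nodes[a].width[level] = b - a
    return nodes[0]


def lookup_by_position_index(head, i):
    """Return the value at zero-based position i."""
    node = head
    i += 1                                        # don't count the head as a step
    for level in reversed(range(len(head.next))):
        # step forward while the next link exists and is not too far
        while node.next[level] is not None and i >= node.width[level]:
            i -= node.width[level]                # subtract the current width
            node = node.next[level]               # traverse at the current level
    return node.value


head = build_example()
fifth = lookup_by_position_index(head, 4)   # follows links of width 1, 3 and 1
```

The extra check for a missing next link is a small safeguard added for this sketch; the pseudocode relies on the widths alone.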
History
Skip lists were first described in 1990 by William Pugh. He details how they work in Pugh, William (June 1990).
"Skip lists: a probabilistic alternative to balanced trees". Communications of the ACM 33 (6): 668–676.
doi:10.1145/78973.78977 [5].
To quote the author:
Skip lists are a probabilistic data structure that seem likely to supplant balanced trees as the implementation
method of choice for many applications. Skip list algorithms have the same asymptotic expected time bounds
as balanced trees and are simpler, faster and use less space.
Usages
List of applications and frameworks that use skip lists:
Cyrus IMAP server offers a "skiplist" backend DB implementation (source file [6])
QMap [7] (up to Qt 4) template class of Qt that provides a dictionary.
Redis, an ANSI-C open-source persistent key/value store for Posix systems, uses skip lists in its implementation
of ordered sets.
nessDB [8], a very fast key-value embedded Database Storage Engine (Using log-structured-merge (LSM) trees),
uses skip lists for its memtable.
skipdb [9] is an open-source database format using ordered key/value pairs.
ConcurrentSkipListSet [10] and ConcurrentSkipListMap [11] in the Java 1.6 API.
leveldb [12], a fast key-value storage library written at Google that provides an ordered mapping from string keys
to string values
Skip lists are used for efficient statistical computations [13] of running medians [14] (also known as moving
medians).
Skip lists are also used in distributed applications (where the nodes represent physical computers, and pointers
represent network connections) and for implementing highly scalable concurrent priority queues with less lock
contention,[15] or even without locking, as well as lockless concurrent dictionaries. There are also several US patents
for using skip lists to implement (lockless) priority queues and concurrent dictionaries.[citation needed]
See Also
Bloom filter
Skip graph
References
[1] http://www.cs.uwaterloo.ca/research/tr/1993/28/root2side.pdf
[2] Deterministic skip lists (http://www.ic.unicamp.br/~celio/peer2peer/skip-net-graph/deterministic-skip-lists-munro.pdf)
[3] http://resnet.uoregon.edu/~gurney_j/jmpc/skiplist.html
[4] http://cg.scs.carleton.ca/~morin/teaching/5408/refs/p90b.pdf
[5] http://dx.doi.org/10.1145%2F78973.78977
[6] http://git.cyrusimap.org/cyrus-imapd/tree/lib/cyrusdb_skiplist.c
[7] http://qt-project.org/doc/qt-4.8/qmap.html#details
[8] https://github.com/shuttler/nessDB
[9] http://www.dekorte.com/projects/opensource/skipdb/
[10] http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentSkipListSet.html
[11] http://download.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentSkipListMap.html
[12] https://code.google.com/p/leveldb/
[13] http://code.activestate.com/recipes/576930/
[14] https://en.wikipedia.org/wiki/Moving_average#Moving_median
[15] Skiplist-based concurrent priority queues (http://dx.doi.org/10.1109/IPDPS.2000.845994)
External links
Skip Lists: A Probabilistic Alternative to Balanced Trees (ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf) William Pugh's original paper
"Skip list" entry (http://nist.gov/dads/HTML/skiplist.html) in the Dictionary of Algorithms and Data
Structures
Skip Lists: A Linked List with Self-Balancing BST-Like Properties (http://msdn.microsoft.com/en-us/library/
ms379573(VS.80).aspx#datastructures20_4_topic4) on MSDN in C# 2.0
SkipDB, a BerkeleyDB-style database implemented using skip lists. (http://dekorte.com/projects/opensource/
SkipDB/)
Skip Lists lecture (MIT OpenCourseWare: Introduction to Algorithm) (http://videolectures.net/
mit6046jf05_demaine_lec12/)
Open Data Structures - Chapter 4 - Skiplists (http://opendatastructures.org/versions/edition-0.1e/ods-java/
4_Skiplists.html)
Demo applets
Skip List Applet (http://people.ksp.sk/~kuko/bak/index.html) by Kubo Kovac
Thomas Wenger's demo applet on skiplists (http://iamwww.unibe.ch/~wenger/DA/SkipList/)
Implementations
A generic Skip List in C++ (http://codingplayground.blogspot.com/2009/01/generic-skip-list-skiplist.html)
by Antonio Gulli
Algorithm::SkipList, implementation in Perl on CPAN (https://metacpan.org/module/Algorithm::SkipList)
John Shipman's implementation in Python (http://infohost.nmt.edu/tcc/help/lang/python/examples/pyskip/)
Raymond Hettinger's implementation in Python (http://code.activestate.com/recipes/576930/)
A Lua port of John Shipman's Python version (http://love2d.org/wiki/Skip_list)
Java Implementation with index based access (https://gist.github.com/dmx2010/5426422)
ConcurrentSkipListSet documentation for Java 6 (http://java.sun.com/javase/6/docs/api/java/util/
concurrent/ConcurrentSkipListSet.html) (and sourcecode (http://www.docjar.com/html/api/java/util/
concurrent/ConcurrentSkipListSet.java.html))
B-tree
B-tree
Type: Tree
Invented: 1972

Time complexity in big O notation
          Average     Worst case
Space     O(n)        O(n)
Search    O(log n)    O(log n)
Insert    O(log n)    O(log n)
Delete    O(log n)    O(log n)
In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access,
insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can
have more than two children. (Comer 1979, p.123) Unlike self-balancing binary search trees, the B-tree is optimized
for systems that read and write large blocks of data. It is commonly used in databases and filesystems.
Overview
In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range.
When data are inserted or removed from a node, its number of child nodes changes. In order to maintain the
pre-defined range, internal nodes may be joined or split. Because a range of child nodes is permitted, B-trees do
not need re-balancing as frequently as other self-balancing search trees, but may waste some space, since nodes are
not entirely full. The lower and upper bounds on the number of child nodes are typically fixed for a particular
implementation. For example, in a 2-3 B-tree (often simply referred to as a 2-3 tree), each internal node may have
only 2 or 3 child nodes.
[Figure: A B-tree of order 2 (Bayer & McCreight 1972) or order 5 (Knuth 1998).]
Each internal node of a B-tree will contain a number of keys. The keys act as separation values which divide its
subtrees. For example, if an internal node has 3 child nodes (or subtrees) then it must have 2 keys: a1 and a2. All
values in the leftmost subtree will be less than a1, all values in the middle subtree will be between a1 and a2, and all
values in the rightmost subtree will be greater than a2.
Usually, the number of keys is chosen to vary between d and 2d, where d is the minimum number of keys and d+1
is the minimum degree or branching factor of the tree. In practice, the keys take up the most space in a node.
The factor of 2 will guarantee that nodes can be split or combined. If an internal node has 2d keys, then adding a
key to that node can be accomplished by splitting the 2d-key node into two d-key nodes and adding the key to the
parent node. Each split node has the required minimum number of keys. Similarly, if an internal node and its
neighbor each have d keys, then a key may be deleted from the internal node by combining with its neighbor.
Deleting the key would make the internal node have d−1 keys; joining the neighbor would add d keys plus one
more key brought down from the neighbor's parent. The result is an entirely full node of 2d keys.
The number of branches (or child nodes) from a node will be one more than the number of keys stored in the node.
In a 2-3 B-tree, the internal nodes will store either one key (with two child nodes) or two keys (with three child
nodes). A B-tree is sometimes described with the parameters (d+1)-(2d+1) or simply with the highest branching
order, (2d+1).
A B-tree is kept balanced by requiring that all leaf nodes be at the same depth. This depth will increase slowly as
elements are added to the tree, but an increase in the overall depth is infrequent, and results in all leaf nodes being
one more node farther away from the root.
B-trees have substantial advantages over alternative implementations when otherwise the time to access the data of a
node greatly exceeds the time spent processing these data, because then the cost of accessing the node may be
amortized over multiple operations within the node. This usually occurs when the node data are in secondary storage
such as disk drives. By maximizing the number of keys within each internal node, the height of the tree decreases
and the number of expensive node accesses is reduced. In addition, rebalancing of the tree occurs less often. The
maximum number of child nodes depends on the information that must be stored for each child node and the size of
a full disk block or an analogous size in secondary storage. While 2-3 B-trees are easier to explain, practical B-trees
using secondary storage need a large number of child nodes to improve performance.
Variants
The term B-tree may refer to a specific design or it may refer to a general class of designs. In the narrow sense, a
B-tree stores keys in its internal nodes but need not store those keys in the records at the leaves. The general class
includes variations such as the B+-tree and the B*-tree.
In the B+-tree, copies of the keys are stored in the internal nodes; the keys and records are stored in leaves; in
addition, a leaf node may include a pointer to the next leaf node to speed sequential access. (Comer 1979, p.129)
The B*-tree balances more neighboring internal nodes to keep the internal nodes more densely packed. (Comer
1979, p.129) This variant requires non-root nodes to be at least 2/3 full instead of 1/2. (Knuth 1998, p.488) To
maintain this, instead of immediately splitting up a node when it gets full, its keys are shared with a node next to
it. When both nodes are full, then the two nodes are split into three.
Counted B-trees store, with each pointer within the tree, the number of elements in the subtree below that
pointer.[1] This allows rapid searches for the Nth record in key order, or counting the number of records between
any two records, and various other related operations.
Etymology unknown
Rudolf Bayer and Ed McCreight invented the B-tree while working at Boeing Research Labs in 1971 (Bayer &
McCreight 1972), but they did not explain what, if anything, the B stands for. Douglas Comer explains:
The origin of "B-tree" has never been explained by the authors. As we shall see, "balanced," "broad," or
"bushy" might apply. Others suggest that the "B" stands for Boeing. Because of his contributions,
however, it seems appropriate to think of B-trees as "Bayer"-trees. (Comer 1979, p.123 footnote 1)
Donald Knuth speculates on the etymology of B-trees in his May, 1980 lecture on the topic "CS144C classroom
lecture about disk storage and B-trees", suggesting the "B" may have originated from Boeing or from Bayer's
name.[2]
Large databases have historically been kept on disk drives. The time to read a record on a disk drive far exceeds the
time needed to compare keys once the record is available. The time to read a record from a disk drive involves a seek
time and a rotational delay. The seek time may be 0 to 20 or more milliseconds, and the rotational delay averages
about half the rotation period. For a 7200 RPM drive, the rotation period is 8.33 milliseconds. For a drive such as the
Seagate ST3500320NS, the track-to-track seek time is 0.8 milliseconds and the average reading seek time is 8.5
milliseconds.[3] For simplicity, assume reading from disk takes about 10 milliseconds.
Naively, then, locating one record out of a million by binary search (about ⌈log2 1,000,000⌉ = 20 comparisons, each
needing its own disk read) would take 20 disk reads times 10 milliseconds per disk read, which is 0.2 seconds.
The time won't be that bad because individual records are grouped together in a disk block. A disk block might be 16
kilobytes. If each record is 160 bytes, then 100 records could be stored in each block. The disk read time above was
actually for an entire block. Once the disk head is in position, one or more disk blocks can be read with little delay.
With 100 records per block, the last 6 or so comparisons don't need to do any disk reads; the comparisons are all
within the last disk block read.
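The arithmetic behind these figures can be reproduced in a few lines of Python. The numbers are the ones used in the text (one million records, about 10 milliseconds per disk read, 100 records per block); the variable names are only illustrative.

```python
import math

records = 1_000_000
read_ms = 10               # assumed cost of one disk read, as in the text
records_per_block = 100    # a 16-kilobyte block of 160-byte records

comparisons = math.ceil(math.log2(records))                  # binary search steps
naive_seconds = comparisons * read_ms / 1000                 # one read per comparison
in_block = math.floor(math.log2(records_per_block))          # steps inside one block
disk_reads = comparisons - in_block                          # steps that still hit disk

print(comparisons, naive_seconds, in_block, disk_reads)      # 20 0.2 6 14
```

The last value is why roughly 13 to 14 comparisons still cost a disk access each even after records are grouped into blocks.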
To speed the search further, the first 13 to 14 comparisons (which each required a disk access) must be sped up.
In addition, a B-tree minimizes waste by making sure the interior nodes are at least half full. A B-tree can handle an
arbitrary number of insertions and deletions.
Technical description
Terminology
Unfortunately, the literature on B-trees is not uniform in its use of terms relating to B-trees. (Folk & Zoellick 1992,
p.362)
Bayer & McCreight (1972), Comer (1979), and others define the order of a B-tree as the minimum number of keys in
a non-root node. Folk & Zoellick (1992) point out that this terminology is ambiguous because the maximum number of
keys is not clear. An order 3 B-tree might hold a maximum of 6 keys or a maximum of 7 keys. Knuth (1998, p.483)
avoids the problem by defining the order to be the maximum number of children (which is one more than the maximum
number of keys).
The term leaf is also inconsistent. Bayer & McCreight (1972) considered the leaf level to be the lowest level of keys,
but Knuth considered the leaf level to be one level below the lowest keys. (Folk & Zoellick 1992, p.363) There are
many possible implementation choices. In some designs, the leaves may hold the entire data record; in other designs,
the leaves may only hold pointers to the data record. Those choices are not fundamental to the idea of a B-tree.[4]
There are also unfortunate choices like using the variable k to represent the number of children when k could be
confused with the number of keys.
For simplicity, most authors assume there are a fixed number of keys that fit in a node. The basic assumption is the
key size is fixed and the node size is fixed. In practice, variable length keys may be employed. (Folk & Zoellick
1992, p.379)
Definition
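According to Knuth's definition, a B-tree of order m is a tree which satisfies the following properties:
1. Every node has at most m children.
2. Every non-leaf node (except the root) has at least ⌈m/2⌉ children.
3. The root has at least two children if it is not a leaf node.
4. A non-leaf node with k children contains k−1 keys.
5. All leaves appear on the same level.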
Each internal node's keys act as separation values which divide its subtrees. For example, if an internal node has 3
child nodes (or subtrees) then it must have 2 keys: a1 and a2. All values in the leftmost subtree will be less than a1,
all values in the middle subtree will be between a1 and a2, and all values in the rightmost subtree will be greater than
a2.
Internal nodes
Internal nodes are all nodes except for leaf nodes and the root node. They are usually represented as an ordered
set of elements and child pointers. Every internal node contains a maximum of U children and a minimum of
L children. Thus, the number of elements is always 1 less than the number of child pointers (the number of
elements is between L−1 and U−1). U must be either 2L or 2L−1; therefore each internal node is at least half
full. The relationship between U and L implies that two half-full nodes can be joined to make a legal node, and
one full node can be split into two legal nodes (if there's room to push one element up into the parent). These
properties make it possible to delete and insert new values into a B-tree and adjust the tree to preserve the
B-tree properties.
The root node
The root node's number of children has the same upper limit as internal nodes, but has no lower limit. For
example, when there are fewer than L−1 elements in the entire tree, the root will be the only node in the tree,
with no children at all.
Leaf nodes
Leaf nodes have the same restriction on the number of elements, but have no children, and no child pointers.
A B-tree of depth n+1 can hold about U times as many items as a B-tree of depth n, but the cost of search, insert, and
delete operations grows with the depth of the tree. As with any balanced tree, the cost grows much more slowly than
the number of elements.
Some balanced trees store values only at leaf nodes, and use different kinds of nodes for leaf nodes and internal
nodes. B-trees keep values in every node in the tree, and may use the same structure for all nodes. However, since
leaf nodes never have children, the B-trees benefit from improved performance if they use a specialized structure.
Let d be the minimum number of children an internal (non-root) node can have. For an ordinary B-tree, d = ⌈m/2⌉.
The worst case height[citation needed] of a B-tree with n entries is:
h_max = ⌊log_d((n+1)/2)⌋ + 1
Comer (1979, p.127) and Cormen et al. (2001, pp.383–384) give a slightly different expression for the worst case
height (perhaps because the root node is considered to have height 0).
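A worst-case height can be sanity-checked by counting the fewest keys a tree of height h can hold: a root with one key and two children, and every other node with the minimum of d children and d−1 keys, which gives 2·d^(h−1) − 1 keys when the root counts as height 1. The Python sketch below uses integer arithmetic to find the tallest tree that n keys can fill; the function names are assumptions made for the example.

```python
def min_keys(d, h):
    """Fewest keys in a B-tree of height h (root counted as height 1).

    The root holds 1 key and has 2 children; every other node holds
    d - 1 keys and, unless it is a leaf, has d children.
    """
    return 2 * d ** (h - 1) - 1


def worst_case_height(n, d):
    """Largest height a B-tree holding n >= 1 keys can have."""
    h = 1
    while min_keys(d, h + 1) <= n:
        h += 1
    return h


height = worst_case_height(1_000_000, 50)   # d = 50 corresponds to order m = 100
```

With d = 50, a tree of height 4 needs at least 249,999 keys and a tree of height 5 at least 12,499,999, so a million keys never require more than 4 levels under these assumptions.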
Algorithms
Search
Searching is similar to searching a binary search tree. Starting at the root, the tree is recursively traversed from top to
bottom. At each level, the search chooses the child pointer (subtree) whose separation values are on either side of the
search value.
Binary search is typically (but not necessarily) used within nodes to find the separation values and child tree of
interest.
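The search just described can be sketched in Python as follows. This is a minimal illustration: the BTreeNode class and its keys and children fields are assumptions made for the example, the descent is written as a loop instead of a recursion, and the standard bisect module performs the binary search inside each node.

```python
from bisect import bisect_left


class BTreeNode:
    def __init__(self, keys, children=None):
        self.keys = keys                  # sorted separation values
        self.children = children or []    # empty for a leaf, else len(keys) + 1 subtrees


def btree_search(node, key):
    """Return the node that holds key, or None if key is absent."""
    while node is not None:
        i = bisect_left(node.keys, key)   # binary search within the node
        if i < len(node.keys) and node.keys[i] == key:
            return node
        if not node.children:             # reached a leaf without a match
            return None
        node = node.children[i]           # subtree lying between keys[i-1] and keys[i]
    return None


leaves = [BTreeNode([1, 2]), BTreeNode([5, 6]), BTreeNode([9, 12])]
root = BTreeNode([4, 8], leaves)
found = btree_search(root, 6)             # the middle leaf
```

Only one node per level is inspected, so the number of node accesses equals the height of the tree in the worst case.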
Insertion
All insertions start at a leaf node. To insert a new element, search the tree
to find the leaf node where the new element should be added. Insert the
new element into that node with the following steps:
1. If the node contains fewer than the maximum legal number of
elements, then there is room for the new element. Insert the new
element in the node, keeping the node's elements ordered.
2. Otherwise the node is full; split it evenly into two nodes as follows:
   1. A single median is chosen from among the leaf's elements and the
      new element.
   2. Values less than the median are put in the new left node and values
      greater than the median are put in the new right node, with the
      median acting as a separation value.
   3. The separation value is inserted in the node's parent, which may
      cause it to be split, and so on. If the node has no parent (i.e., the
      node was the root), create a new root above this node (increasing
      the height of the tree).
If the splitting goes all the way up to the root, it creates a new root with a single separator value and two
children, which is why the lower bound on the size of internal nodes does not apply to the root. The maximum
number of elements per node is U−1. When a node is split, one element moves to the parent, but one element is
added. So, it must be possible to divide the maximum number U−1 of elements into two legal nodes. If this number
is odd, then U=2L and one of the new nodes contains (U−2)/2 = L−1 elements, and hence is a legal node, and the
other contains one more element, and hence it is legal too. If U−1 is even, then U=2L−1, so there are 2L−2
elements in the node. Half of this number is L−1, which is the minimum number of elements allowed per node.
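The basic insertion procedure (insert into a leaf, then split full nodes on the way back up) might look like the following Python sketch. It is an illustration under stated assumptions, not a production B-tree: the class names, the max_children parameter (U in the text) and the in_order helper are invented for the example, keys are assumed to be distinct, and everything is kept in memory.

```python
from bisect import bisect_left


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []     # empty for a leaf


class BTree:
    def __init__(self, max_children=4):
        self.U = max_children              # at most U children, so at most U - 1 keys
        self.root = Node()

    def insert(self, key):
        split = self._insert(self.root, key)
        if split is not None:              # the root was split: grow a new root
            separator, right = split
            self.root = Node([separator], [self.root, right])

    def _insert(self, node, key):
        """Insert key below node; return (separator, right_node) if node was split."""
        i = bisect_left(node.keys, key)
        if node.children:
            split = self._insert(node.children[i], key)
            if split is None:
                return None
            separator, right = split       # a child split: adopt its separator
            node.keys.insert(i, separator)
            node.children.insert(i + 1, right)
        else:
            node.keys.insert(i, key)       # leaf: keep the elements ordered
        if len(node.keys) <= self.U - 1:
            return None                    # still a legal node
        mid = len(node.keys) // 2          # median of old elements plus the new one
        separator = node.keys[mid]
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        node.keys = node.keys[:mid]
        node.children = node.children[:mid + 1]
        return separator, right

    def in_order(self, node=None):
        node = node or self.root
        for i, key in enumerate(node.keys):
            if node.children:
                yield from self.in_order(node.children[i])
            yield key
        if node.children:
            yield from self.in_order(node.children[-1])
```

With U = 4 an overfull node holds four keys; the split leaves two keys on the left, one on the right, and sends the median up, which matches the U = 2L case described above with L = 2.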
An improved algorithm (Mond & Raz 1985) supports a single pass down the tree from the root to the node where the
insertion will take place, splitting any full nodes encountered on the way. This prevents the need to recall the parent
nodes into memory, which may be expensive if the nodes are on secondary storage. However, to use this improved
algorithm, we must be able to send one element to the parent and split the remaining U−2 elements into two legal
nodes, without adding a new element. This requires U = 2L rather than U = 2L−1, which accounts for why some
textbooks impose this requirement in defining B-trees.
Deletion
There are two popular strategies for deletion from a B-tree.
1. Locate and delete the item, then restructure the tree to regain its invariants, OR
2. Do a single pass down the tree, but before entering (visiting) a node, restructure the tree so that once the key to be
deleted is encountered, it can be deleted without triggering the need for any further restructuring
The algorithm below uses the former strategy.
There are two special cases to consider when deleting an element:
1. The element in an internal node is a separator for its child nodes
2. Deleting an element may put its node under the minimum number of elements and children
The procedures for these cases are in order below.
Deletion from a leaf node
1. Search for the value to delete.
2. If the value is in a leaf node, simply delete it from the node.
3. If underflow happens, rebalance the tree as described in section "Rebalancing after deletion" below.
Deletion from an internal node
Each element in an internal node acts as a separation value for two subtrees, therefore we need to find a replacement
for separation. Note that the largest element in the left subtree is still less than the separator. Likewise, the smallest
element in the right subtree is still greater than the separator. Both of those elements are in leaf nodes, and either one
can be the new separator for the two subtrees. Algorithmically described below:
1. Choose a new separator (either the largest element in the left subtree or the smallest element in the right subtree),
remove it from the leaf node it is in, and replace the element to be deleted with the new separator.
2. The previous step deleted an element (the new separator) from a leaf node. If that leaf node is now deficient (has
fewer than the required number of elements), then rebalance the tree starting from the leaf node.
Rebalancing after deletion
Rebalancing starts from a leaf and proceeds toward the root until the tree is balanced. If deleting an element from a
node has brought it under the minimum size, then some elements must be redistributed to bring all nodes up to the
minimum. Usually, the redistribution involves moving an element from a sibling node that has more than the
minimum number of elements. That redistribution operation is called a rotation. If no sibling can spare an element, then the
deficient node must be merged with a sibling. The merge causes the parent to lose a separator element, so the parent
may become deficient and need rebalancing. The merging and rebalancing may continue all the way to the root.
Since the minimum element count doesn't apply to the root, making the root be the only deficient node is not a
problem. The algorithm to rebalance the tree is as follows:[citation needed]
If the deficient node's right sibling exists and has more than the minimum number of elements, then rotate left
1. Copy the separator from the parent to the end of the deficient node (the separator moves down; the deficient
node now has the minimum number of elements)
2. Replace the separator in the parent with the first element of the right sibling (right sibling loses one element but
still has at least the minimum number of elements)
3. The tree is now balanced
Otherwise, if the deficient node's left sibling exists and has more than the minimum number of elements, then
rotate right
1. Copy the separator from the parent to the start of the deficient node (the separator moves down; deficient node
now has the minimum number of elements)
2. Replace the separator in the parent with the last element of the left sibling (left sibling loses one element but still
has at least the minimum number of elements)
3. The tree is now balanced
Otherwise, if both immediate siblings have only the minimum number of elements, then merge with a sibling,
sandwiching between them the separator taken from their parent
1. Copy the separator to the end of the left node (the left node may be the deficient node or it may be the sibling
with the minimum number of elements)
2. Move all elements from the right node to the left node (the left node now has the maximum number of
elements, and the right node empty)
3. Remove the separator from the parent along with its empty right child (the parent loses an element)
If the parent is the root and now has no elements, then free it and make the merged node the new root (tree
becomes shallower)
Otherwise, if the parent has fewer than the required number of elements, then rebalance the parent
Note: The rebalancing operations are different for B+-trees (e.g., rotation is different because parent has copy of the key) and B*-tree (e.g.,
three siblings are merged into two siblings).
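The left rotation and the merge described above can be sketched in Python as follows. This is a minimal illustration rather than a complete deletion routine: the `Node` class, the function names and the convention that `parent.keys[i]` separates `parent.children[i]` from `parent.children[i + 1]` are assumptions made for the example. For internal nodes the sketch also moves the matching child pointer, a step the description above leaves implicit.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical in-memory node: sorted keys plus child pointers (empty for a leaf).
    keys: list
    children: list = field(default_factory=list)


def rotate_left(parent, i):
    """Repair deficient child i by borrowing from its right sibling i + 1."""
    node, right = parent.children[i], parent.children[i + 1]
    node.keys.append(parent.keys[i])        # the separator moves down
    parent.keys[i] = right.keys.pop(0)      # the sibling's first element moves up
    if right.children:                      # internal nodes: move the matching subtree too
        node.children.append(right.children.pop(0))


def merge(parent, i):
    """Merge child i + 1 into child i, pulling separator i down between them."""
    left, right = parent.children[i], parent.children[i + 1]
    left.keys.append(parent.keys.pop(i))    # the parent loses one element
    left.keys.extend(right.keys)
    left.children.extend(right.children)
    del parent.children[i + 1]              # drop the now-empty right node
```

After `merge`, the caller would still need to check whether the parent itself has become deficient, as the last two cases of the algorithm require.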
Sequential access
While freshly loaded databases tend to have good sequential behavior, this behavior becomes increasingly difficult
to maintain as a database grows, resulting in more random I/O and performance challenges.
Initial construction
In applications, it is frequently useful to build a B-tree to represent a large existing collection of data and then update
it incrementally using standard B-tree operations. In this case, the most efficient way to construct the initial B-tree is
not to insert every element in the initial collection successively, but instead to construct the initial set of leaf nodes
directly from the input, then build the internal nodes from these. This approach to B-tree construction is called
bulkloading. Initially, every leaf but the last one has one extra element, which will be used to build the internal
nodes.[citation needed]
For example, if the leaf nodes have maximum size 4 and the initial collection is the integers 1 through 24, we would
initially construct 4 leaf nodes containing 5 values each and 1 which contains 4 values:
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24
We build the next level up from the leaves by taking the last element from each leaf node except the last one. Again,
each node except the last will contain one extra value. In the example, suppose the internal nodes contain at most 2
values (3 child pointers). Then the next level up of internal nodes would be:
5 10 15
20
1 2 3 4
6 7 8 9
11 12 13 14
16 17 18 19
21 22 23 24
This process is continued until we reach a level with only one node and it is not overfilled. In the example only the
root level remains:
15
5 10
20
1 2 3 4
6 7 8 9
11 12 13 14
16 17 18 19
21 22 23 24
In filesystems
In addition to its use in databases, the B-tree is also used in filesystems to allow quick random access to an arbitrary
block in a particular file. The basic problem is turning the file block address into a disk block (or perhaps to a
cylinder-head-sector) address.
Some operating systems require the user to allocate the maximum size of the file when the file is created. The file
can then be allocated as contiguous disk blocks. Converting to a disk block: the operating system just adds the file
block address to the starting disk block of the file. The scheme is simple, but the file cannot exceed its created size.
Other operating systems allow a file to grow. The resulting disk blocks may not be contiguous, so mapping logical
blocks to physical blocks is more involved.
MS-DOS, for example, used a simple File Allocation Table (FAT). The FAT has an entry for each disk block,[6] and
that entry identifies whether its block is used by a file and if so, which block (if any) is the next disk block of the
same file. So, the allocation of each file is represented as a linked list in the table. In order to find the disk address of
a given file block, the operating system (or disk utility) must sequentially follow the file's linked list in the FAT. Worse,
to find a free disk block, it must sequentially scan the FAT. For MS-DOS, that was not a huge penalty because the
disks and files were small and the FAT had few entries and relatively short file chains. In the FAT12 filesystem
(used on floppy disks and early hard disks), there were no more than 4,080 [7] entries, and the FAT would usually be
resident in memory. As disks got bigger, the FAT architecture began to confront penalties. On a large disk using
FAT, it may be necessary to perform disk reads to learn the disk location of a file block to be read or written.
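The sequential chain walk described above can be illustrated with a short Python sketch. The table is modelled as a mapping from a disk block to the next disk block of the same file. The `END_OF_CHAIN` marker and the function name are hypothetical; real FAT variants use reserved entry values instead.

```python
END_OF_CHAIN = -1  # hypothetical end marker, not the value used by any real FAT variant


def disk_block_of(fat, first_block, file_block):
    """Return the disk block holding logical block `file_block` of a file.

    `fat[b]` is the next disk block after `b` in the same file. The cost is
    linear in `file_block`, which is the penalty the text describes.
    """
    block = first_block
    for _ in range(file_block):
        block = fat[block]
        if block == END_OF_CHAIN:
            raise IndexError("file block is past the end of the file")
    return block


# A three-block file stored in disk blocks 2 -> 5 -> 3.
fat = {2: 5, 5: 3, 3: END_OF_CHAIN}
print(disk_block_of(fat, first_block=2, file_block=2))
```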
TOPS-20 (and possibly TENEX) used a 0 to 2 level tree that has similarities to a B-tree[citation needed]. A disk block
was 512 36-bit words. If the file fit in a 512 (2^9) word block, then the file directory would point to that physical disk
block. If the file fit in 2^18 words, then the directory would point to an aux index; the 512 words of that index would
either be NULL (the block isn't allocated) or point to the physical address of the block. If the file fit in 2^27 words,
then the directory would point to a block holding an aux-aux index; each entry would either be NULL or point to an
aux index. Consequently, the physical disk block for a 2^27 word file could be located in two disk reads and read on
the third.
Apple's filesystem HFS+, Microsoft's NTFS, AIX (jfs2) and some Linux filesystems, such as btrfs and Ext4, use
B-trees.
B*-trees are used in the HFS and Reiser4 file systems.
Variations
Access concurrency
Lehman and Yao showed that all read locks could be avoided (and thus concurrent access greatly improved) by
linking the tree blocks at each level together with a "next" pointer. This results in a tree structure where both
insertion and search operations descend from the root to the leaf. Write locks are only required as a tree block is
modified. This maximizes access concurrency by multiple users, an important consideration for databases and/or
other B-tree based ISAM storage methods. The cost associated with this improvement is that empty pages cannot be
removed from the B-tree during normal operations. (However, see [8] for various strategies to implement node
merging, and source code at.)
United States Patent 5283894, granted in 1994, appears to show a way to use a 'Meta Access Method' [9] to allow
concurrent B+Tree access and modification without locks. The technique accesses the tree 'upwards' for both
searches and updates by means of additional in-memory indexes that point at the blocks in each level in the block
cache. No reorganization for deletes is needed and there are no 'next' pointers in each block as in Lehman and Yao.
Notes
[1] Counted B-Trees (http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html), retrieved 2010-01-25
[2] Knuth's video lectures from Stanford (http://scpd.stanford.edu/knuth/index.jsp)
[3] Seagate Technology LLC, Product Manual: Barracuda ES.2 Serial ATA, Rev. F., publication 100468393, 2008 (http://www.seagate.com/staticfiles/support/disc/manuals/NL35 Series & BC ES Series/Barracuda ES.2 Series/100468393f.pdf), page 6
[4] avoided the issue by saying an index element is a (physically adjacent) pair of (x,a) where x is the key, and a is some associated information.
The associated information might be a pointer to a record or records in a random access, but what it was didn't really matter. states, "For this
paper the associated information is of no further interest."
[5] If n is zero, then no root node is needed, so the height of an empty tree is not well defined.
[6] For FAT, what is called a "disk block" here is what the FAT documentation calls a "cluster", which is a fixed-size group of one or more
contiguous whole physical disk sectors. For the purposes of this discussion, a cluster has no significant difference from a physical sector.
[7] Two of these were reserved for special purposes, so only 4078 could actually represent disk blocks (clusters).
[8] http://www.dtic.mil/cgi-bin/GetTRDoc?AD=ADA232287&Location=U2&doc=GetTRDoc.pdf
[9] Lockless Concurrent B+Tree (http://www.freepatentsonline.com/5283894.html)
References
Bayer, R.; McCreight, E. (1972), "Organization and Maintenance of Large Ordered Indexes" (http://www.minet.uni-jena.de/dbis/lehre/ws2005/dbs1/Bayer_hist.pdf), Acta Informatica 1 (3): 173-189
Comer, Douglas (June 1979), "The Ubiquitous B-Tree", Computing Surveys 11 (2): 121-137, doi:10.1145/356770.356776 (http://dx.doi.org/10.1145/356770.356776), ISSN 0360-0300 (http://www.worldcat.org/issn/0360-0300).
Cormen, Thomas; Leiserson, Charles; Rivest, Ronald; Stein, Clifford (2001), Introduction to Algorithms (Second
ed.), MIT Press and McGraw-Hill, pp. 434-454, ISBN 0-262-03293-7. Chapter 18: B-Trees.
Folk, Michael J.; Zoellick, Bill (1992), File Structures (2nd ed.), Addison-Wesley, ISBN 0-201-55713-4
Knuth, Donald (1998), Sorting and Searching, The Art of Computer Programming, Volume 3 (Second ed.),
Addison-Wesley, ISBN 0-201-89685-0. Section 6.2.4: Multiway Trees, pp. 481-491. Also, pp. 476-477 of
section 6.2.3 (Balanced Trees) discusses 2-3 trees.
Mond, Yehudit; Raz, Yoav (1985), "Concurrency Control in B+-Trees Databases Using Preparatory Operations"
(http://www.informatik.uni-trier.de/~ley/db/conf/vldb/MondR85.html), VLDB'85, Proceedings of 11th
International Conference on Very Large Data Bases: 331-334.
Original papers
Bayer, Rudolf; McCreight, E. (July 1970), Organization and Maintenance of Large Ordered Indices,
Mathematical and Information Sciences Report No. 20, Boeing Scientific Research Laboratories.
Bayer, Rudolf (1971), "Binary B-Trees for Virtual Memory", Proceedings of 1971 ACM-SIGFIDET Workshop
on Data Description, Access and Control, San Diego, California. November 11-12, 1971.
External links
B-Tree animation applet (http://slady.net/java/bt/view.php) by slady
B-tree and UB-tree on Scholarpedia (http://www.scholarpedia.org/article/B-tree_and_UB-tree) Curator: Dr
Rudolf Bayer
B-Trees: Balanced Tree Data Structures (http://www.bluerwhite.org/btree)
NIST's Dictionary of Algorithms and Data Structures: B-tree (http://www.nist.gov/dads/HTML/btree.html)
B-Tree Tutorial (http://cis.stvincent.edu/html/tutorials/swd/btree/btree.html)
The InfinityDB BTree implementation (http://www.boilerbay.com/infinitydb/TheDesignOfTheInfinityDatabaseEngine.htm)
Cache Oblivious B(+)-trees (http://supertech.csail.mit.edu/cacheObliviousBTree.html)
Dictionary of Algorithms and Data Structures entry for B*-tree (http://www.nist.gov/dads/HTML/bstartree.html)
Open Data Structures - Section 14.2 - B-Trees (http://opendatastructures.org/versions/edition-0.1e/ods-java/14_2_B_Trees.html)
Counted B-Trees (http://www.chiark.greenend.org.uk/~sgtatham/algorithms/cbtree.html)
B+ tree
A B+ tree is an n-ary tree with a variable but often large number of children per node. A B+ tree consists
of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.[1]
A B+ tree can be viewed as a B-tree in which each node contains only keys (not key-value pairs), and to which an
additional level is added at the bottom with linked leaves.
Figure: A simple B+ tree example linking the keys 1-7 to data values d1-d7. The linked list (red)
allows rapid in-order traversal.
Overview
The order, or branching factor, b of a B+ tree measures the capacity of nodes (i.e., the number of children nodes) for
internal nodes in the tree. The actual number of children for a node, referred to here as m, is constrained for internal
nodes so that $\lceil b/2 \rceil \le m \le b$. The root is an exception: it is allowed to have as few as two children. For
example, if the order of a B+ tree is 7, each internal node (except for the root) may have between 4 and 7 children;
the root may have between 2 and 7. Leaf nodes have no children, but are constrained so that the number of keys must
be at least $\lfloor b/2 \rfloor$ and at most $b - 1$. In the situation where a B+ tree is nearly empty, it only contains one node,
which is a leaf node. (The root is also the single leaf, in this case.) This node is permitted to have as few as one key
if necessary, and at most b.
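Node Type                        Children Type                  Min Children   Max Children   Example b = 7   Example b = 100
Root Node (the only node)        Records                        1              b              1 - 7           1 - 100
Root Node                        Internal Nodes or Leaf Nodes   2              b              2 - 7           2 - 100
Internal Node                    Internal Nodes or Leaf Nodes   ⌈b/2⌉          b              4 - 7           50 - 100
Leaf Node                        Records                        ⌊b/2⌋          b - 1          3 - 6           50 - 99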
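The occupancy limits can be checked with a few lines of Python. The bounds used here are the ones implied by the worked examples for b = 7 and b = 100 (at least half of b, rounded up, for internal nodes, and rounded down for leaves); the function name and the dictionary keys are assumptions made for this sketch.

```python
def occupancy_bounds(b):
    """Return (min, max) children or records per node type for a B+ tree of order b."""
    return {
        "root (only node)": (1, b),
        "root": (2, b),
        "internal": (-(-b // 2), b),   # ceil(b / 2) using integer arithmetic
        "leaf": (b // 2, b - 1),       # floor(b / 2) up to b - 1
    }


for order in (7, 100):
    print(order, occupancy_bounds(order))
```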
Algorithms
Search
The root of a B+ Tree represents the whole range of values in the tree, where every internal node represents a subinterval.
We are looking for a value k in the B+ Tree. Starting from the root, we are looking for the leaf which may contain the
value k. At each node, we figure out which internal pointer we should follow. An internal B+ Tree node has at most d ≤ b
children, where every one of them represents a different sub-interval. We select the corresponding node by
searching on the key values of the node.
Function: search (k)
    return tree_search (k, root);
Function: tree_search (k, node)
    if node is a leaf then
        return node;
    switch k do
        case k < k_0
            return tree_search(k, p_0);
        case k_i ≤ k < k_{i+1}
            return tree_search(k, p_{i+1});
        case k_d ≤ k
            return tree_search(k, p_{d+1});
This pseudocode assumes that no duplicates are allowed.
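The same descent can be written as a small runnable Python sketch. The `Leaf` and `Internal` classes are assumptions made for the example; an internal node with keys k_0 ... k_d holds d + 2 children, and `bisect_right` picks the child in the same way as the case analysis above (it counts the keys that are less than or equal to k).

```python
from bisect import bisect_right


class Leaf:
    def __init__(self, keys, values):
        self.keys, self.values = keys, values          # parallel sorted lists


class Internal:
    def __init__(self, keys, children):
        assert len(children) == len(keys) + 1
        self.keys, self.children = keys, children


def tree_search(k, node):
    """Return the leaf that may contain key k."""
    while isinstance(node, Internal):
        # Number of separator keys <= k is the index of the child to follow.
        node = node.children[bisect_right(node.keys, k)]
    return node


def search(k, root):
    """Return the value stored under k, or None if k is absent (no duplicates)."""
    leaf = tree_search(k, root)
    if k in leaf.keys:
        return leaf.values[leaf.keys.index(k)]
    return None


leaves = [Leaf([1, 2], ["d1", "d2"]), Leaf([3, 4], ["d3", "d4"]), Leaf([5, 6, 7], ["d5", "d6", "d7"])]
root = Internal([3, 5], leaves)
print(search(4, root), search(8, root))
```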
Insertion
Perform a search to determine what bucket the new record should go into.
If the bucket is not full (at most b - 1 entries after the insertion), add the record.
Otherwise, split the bucket.
    Allocate new leaf and move half the bucket's elements to the new bucket.
    Insert the new leaf's smallest key and address into the parent.
    If the parent is full, split it too.
        Add the middle key to the parent node.
    Repeat until a parent is found that need not split.
If the root splits, create a new root which has one key and two pointers. (That is, the value that gets pushed to the
new root gets removed from the original node.)
B-trees grow at the root and not at the leaves.
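The leaf-level part of these steps (add the record, and split when the bucket overflows) can be sketched in Python. The function name and the returned triple are assumptions for illustration; propagating the separator into the parent, and splitting parents in turn, is left out.

```python
from bisect import insort


def insert_into_leaf(keys, key, b):
    """Insert `key` into the sorted leaf `keys` of a B+ tree of order b.

    A leaf holds at most b - 1 keys. Returns (left, right, separator);
    `right` and `separator` are None when no split was needed. Note that
    the input list is modified in place.
    """
    insort(keys, key)
    if len(keys) <= b - 1:
        return keys, None, None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]     # move half of the entries to a new leaf
    return left, right, right[0]             # the new leaf's smallest key goes to the parent


print(insert_into_leaf([1, 2, 3], 4, b=4))
```

Because the leaves of a B+ tree keep every key, the separator is copied into the parent rather than removed from the new leaf.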
Deletion
Start at root, find leaf L where entry belongs.
Remove the entry.
    If L is at least half-full, done.
    If L has fewer entries than it should,
        Try to re-distribute, borrowing from sibling (adjacent node with same parent as L).
        If re-distribution fails, merge L and sibling.
If a merge occurred, delete the entry (pointing to L or its sibling) from the parent of L.
Merge could propagate to root, decreasing height.
Bulk-loading
Given a collection of data records, we want to create a B+ tree index on some key field. One approach is to insert
each record into an empty tree. However, it is quite expensive, because each entry requires us to start from the root
and go down to the appropriate leaf page. An efficient alternative is to use bulk-loading.
The first step is to sort the data entries according to a search key.
We allocate an empty page to serve as the root, and insert a pointer to the first page of entries into it.
When the root is full, we split the root, and create a new root page.
Keep inserting entries to the right-most index page just above the leaf level, until all entries are indexed.
Note (1) when the right-most index page above the leaf level fills up, it is split; (2) this action may, in turn, cause a
split of the right-most index page one step closer to the root; and (3) splits only occur on the right-most path from the
root to the leaf level.
Characteristics
For a b-order B+ tree with h levels of index:
The maximum number of records stored is
The minimum number of records stored is
The minimum number of keys is
operations
operations
Implementation
The leaves (the bottom-most index blocks) of the B+ tree are often linked to one another in a linked list; this makes
range queries or an (ordered) iteration through the blocks simpler and more efficient (though the aforementioned
upper bound can be achieved even without this addition). This does not substantially increase space consumption or
maintenance on the tree. This illustrates one of the significant advantages of a B+tree over a B-tree; in a B-tree, since
not all keys are present in the leaves, such an ordered linked list cannot be constructed. A B+tree is thus particularly
useful as a database system index, where the data typically resides on disk, as it allows the B+tree to actually provide
an efficient structure for housing the data itself (this is described in [6] as index structure "Alternative 1").
If a storage system has a block size of B bytes, and the keys to be stored have a size of k, arguably the most efficient
B+ tree is one where b=(B/k)-1. Although theoretically the one-off is unnecessary, in practice there is often a little
extra space taken up by the index blocks (for example, the linked list references in the leaf blocks). Having an index
block which is slightly larger than the storage system's actual block represents a significant performance decrease;
therefore erring on the side of caution is preferable.
If nodes of the B+ tree are organized as arrays of elements, then it may take a considerable time to insert or delete an
element as half of the array will need to be shifted on average. To overcome this problem, elements inside a node can
be organized in a binary tree or a B+ tree instead of an array.
B+ trees can also be used for data stored in RAM. In this case a reasonable choice for block size would be the size of
processor's cache line.
Space efficiency of B+ trees can be improved by using some compression techniques. One possibility is to use delta
encoding to compress keys stored into each block. For internal blocks, space saving can be achieved by either
compressing keys or pointers. For string keys, space can be saved by using the following technique: Normally the ith
entry of an internal block contains the first key of block i+1. Instead of storing the full key, we could store the
shortest prefix of the first key of block i+1 that is strictly greater (in lexicographic order) than last key of block i.
There is also a simple way to compress pointers: if we suppose that some consecutive blocks i, i+1...i+k are stored
contiguously, then it will suffice to store only a pointer to the first block and the count of consecutive blocks.
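The prefix technique for string keys can be sketched in Python. The function name is an assumption; the sketch relies only on ordinary lexicographic string comparison and expects the last key of block i to sort strictly before the first key of block i+1.

```python
def shortest_separator(last_key, first_key):
    """Return the shortest prefix of `first_key` that is strictly greater than `last_key`.

    `last_key` is the last key of block i and `first_key` the first key of block i+1.
    """
    if not last_key < first_key:
        raise ValueError("keys must be in strictly increasing order")
    for n in range(1, len(first_key) + 1):
        if first_key[:n] > last_key:
            return first_key[:n]
    return first_key  # not reached: the full key is already greater than last_key


print(shortest_separator("smith", "smooth"))
```

Any search key that is greater than or equal to the stored prefix is routed to block i+1 or later, so the prefix separates the two blocks just as the full key would.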
All the above compression techniques have some drawbacks. First, a full block must be decompressed to extract a
single element. One technique to overcome this problem is to divide each block into sub-blocks and compress them
separately. In this case searching or inserting an element will only need to decompress or compress a sub-block
instead of a full block. Another drawback of compression techniques is that the number of stored elements may vary
considerably from a block to another depending on how well the elements are compressed inside each block.
History
The B-tree was first described in the paper "Organization and Maintenance of Large Ordered Indices", Acta
Informatica 1: 173-189 (1972), by Rudolf Bayer and Edward M. McCreight. There is no single paper introducing the
B+ tree concept. Instead, the notion of maintaining all data in leaf nodes is repeatedly brought up as an interesting
variant. An early survey of B-trees also covering B+ trees is Douglas Comer: "The Ubiquitous B-Tree [6]", ACM
Computing Surveys 11(2): 121-137 (1979). Comer notes that the B+ tree was used in IBM's VSAM data access
software and he refers to an IBM published article from 1973.
References
[1]
[2]
[3]
[4]
[5]
[6]
External links
B+ tree in Python, used to implement a list (http://pypi.python.org/pypi/blist)
Dr. Monge's B+ Tree index notes (http://www.cecs.csulb.edu/~monge/classes/share/B+TreeIndexes.html)
Evaluating the performance of CSB+-trees on Mutithreaded Architectures (http://blogs.ubc.ca/lrashid/files/2011/01/CCECE07.pdf)
Effect of node size on the performance of cache conscious B+-trees (http://www.cs.wisc.edu/~jignesh/publ/cci.pdf)
Fractal Prefetching B+-trees (http://www.pittsburgh.intel-research.net/people/gibbons/papers/fpbptrees.pdf)
Towards pB+-trees in the field: implementations Choices and performance (http://leo.saclay.inria.fr/events/EXPDB2006/PAPERS/Jonsson.pdf)
Cache-Conscious Index Structures for Main-Memory Databases (https://oa.doria.fi/bitstream/handle/10024/2906/cachecon.pdf?sequence=1)
Cache Oblivious B(+)-trees (http://supertech.csail.mit.edu/cacheObliviousBTree.html)
The Power of B-Trees: CouchDB B+ Tree Implementation (http://books.couchdb.org/relax/appendix/btrees)
Implementations
In the example shown, keys are listed in the nodes and values below them. Each complete English word has an
arbitrary integer value associated with it. A trie can be seen as a deterministic finite automaton, although the symbol
on each edge is often implicit in the order of the branches.
It is not necessary for keys to be explicitly stored in nodes. (In the figure, words are shown only to illustrate how the
trie works.)
Though tries are most commonly keyed by character strings, they don't need to be. The same algorithms can easily
be adapted to serve similar functions of ordered lists of any construct, e.g., permutations on a list of digits or shapes.
In particular, a bitwise trie is keyed on the individual bits making up a short, fixed-size piece of data such as an integer
or a memory address.
Applications
As a replacement for other data structures
As mentioned, a trie has a number of advantages over binary search trees. A trie can also be used to replace a hash
table, over which it has the following advantages:
Looking up data in a trie is faster in the worst case, O(m) time (where m is the length of a search string),
compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash
function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an
imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Trie
Buckets in a trie which are analogous to hash table buckets that store key collisions are necessary only if a single
key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
Tries do have some drawbacks as well:
Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed
on a hard disk drive or some other secondary storage device where the random-access time is high compared to
main memory.
Some keys, such as floating point numbers, can lead to long chains and prefixes that are not particularly
meaningful. Nevertheless a bitwise trie can handle standard IEEE single and double format floating point
numbers.
Some tries can require more space than a hash table, as memory may be allocated for each character in the search
string, rather than a single chunk of memory for the whole entry, as in most hash tables.
Dictionary representation
A common application of a trie is storing a predictive text or autocomplete dictionary, such as found on a mobile
telephone. Such applications take advantage of a trie's ability to quickly search for, insert, and delete entries;
however, if storing dictionary words is all that is required (i.e. storage of information auxiliary to each word is not
required), a minimal deterministic acyclic finite state automaton would use less space than a trie. This is because an
acyclic deterministic finite automaton can compress identical branches from the trie which correspond to the same
suffixes (or parts) of different words being stored.
Tries are also well suited for implementing approximate matching algorithms, including those used in spell checking
and hyphenation software.
Algorithms
We can describe lookup (and membership) easily. Given a recursive trie type, storing an optional value at each node,
and a list of children tries, indexed by the next character, (here, represented as a Haskell data type):
data Trie a =
     Trie { value    :: Maybe a
          , children :: [(Char, Trie a)] }
We can look up a value in the trie as follows:
find :: String -> Trie a -> Maybe a
find []     t = value t
find (k:ks) t = case lookup k (children t) of
                  Nothing -> Nothing
                  Just ct -> find ks ct
In an imperative style, and assuming an appropriate data type in place, we can describe the same algorithm in Python
(here, specifically for testing membership). Note that children is a map of a node's children, and we say that a
"terminal" node is one which contains a valid word.
def find(node, key):
    for char in key:
        if char not in node.children:
            return None
        else:
            node = node.children[char]
    return node.value
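For a version that can be run on its own, the data type that the lookup assumes can be sketched as follows. The `Node` class and the `insert` helper are assumptions added for illustration; `find` is the function from the text.

```python
class Node:
    """Hypothetical trie node: a map of children plus an optional value."""

    def __init__(self):
        self.children = {}   # maps a character to the child Node
        self.value = None    # set only on "terminal" nodes


def insert(root, key, value):
    """Walk down from the root, creating missing nodes, and store the value."""
    node = root
    for char in key:
        node = node.children.setdefault(char, Node())
    node.value = value


def find(node, key):
    for char in key:
        if char not in node.children:
            return None
        node = node.children[char]
    return node.value


root = Node()
insert(root, "tea", 3)
insert(root, "ten", 12)
print(find(root, "tea"), find(root, "te"), find(root, "to"))
```

With this representation a stored value of None cannot be told apart from a missing key, so the sketch is only suitable when every stored value is non-None.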
A Ruby version
class Trie
  def initialize
    @root = Hash.new
  end

  def build(str)
    node = @root
    str.each_char do |ch|
      node[ch] ||= Hash.new
      node = node[ch]
    end
    node[:end] = true
  end

  def find(str)
    node = @root
    str.each_char do |ch|
      return nil unless node = node[ch]
    end
    node[:end] && true
  end
end
A Java version
import java.util.List;

public class Trie {
    private Node root = new Node('\0', "");

    public Trie() {}

    public Trie(List<String> argInitialWords) {
        for (String word : argInitialWords) {
            addWord(word);
        }
    }

    public void addWord(String argWord) {
        addWord(argWord.toCharArray());
    }

    public void addWord(char[] argWord) {
        Node currentNode = root;
        for (int i = 0; i < argWord.length; i++) {
            // Create the child for this character if it does not exist yet.
            if (!currentNode.containsChildValue(argWord[i])) {
                currentNode.addChild(new Node(argWord[i],
                        currentNode.getValue() + argWord[i]));
            }
            currentNode = currentNode.getChild(argWord[i]);
        }
        currentNode.setIsWord(true);
    }

    public boolean containsPrefix(String argPrefix) {
        return contains(argPrefix.toCharArray(), false);
    }

    public boolean containsWord(String argWord) {
        return contains(argWord.toCharArray(), true);
    }

    public Node getWord(String argString) {
        Node node = getNode(argString.toCharArray());
        return node != null && node.isWord() ? node : null;
    }

    public Node getPrefix(String argString) {
        return getNode(argString.toCharArray());
    }

    @Override
    public String toString() {
        return root.toString();
    }

    // A prefix only needs a matching node; a word also needs the node's word flag.
    private boolean contains(char[] argString, boolean argIsWord) {
        Node node = getNode(argString);
        return node != null && (!argIsWord || node.isWord());
    }

    private Node getNode(char[] argString) {
        Node currentNode = root;
        for (int i = 0; i < argString.length && currentNode != null; i++) {
            currentNode = currentNode.getChild(argString[i]);
            if (currentNode == null) {
                return null;
            }
        }
        return currentNode;
    }
}
class Node {
private
private
private
private
        return isValidWord;
    }

    public void setIsWord(boolean argIsWord) {
        isValidWord = argIsWord;
    }

    public String toString() {
        return value;
    }
}
Simplified Java version
This Java version does not store the character in each node and does not require separate methods for String vs.
char[].
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Trie {
    private Node root = new Node("");

    public Trie() {}

    public Trie(List<String> argInitialWords) {
        for (String word : argInitialWords) {
            addWord(word);
        }
    }

    public void addWord(String argWord) {
        char argChars[] = argWord.toCharArray();
        Node currentNode = root;
        for (int i = 0; i < argChars.length; i++) {
            if (!currentNode.containsChildValue(argChars[i])) {
                currentNode.addChild(argChars[i],
                        new Node(currentNode.getValue() + argChars[i]));
            }
            currentNode = currentNode.getChild(argChars[i]);
        }
        currentNode.setIsWord(true);
    }

    public boolean containsPrefix(String argPrefix) {
        return contains(argPrefix, false);
    }

    public boolean containsWord(String argWord) {
        return contains(argWord, true);
    }

    public Node getWord(String argString) {
        Node node = getNode(argString);
        return node != null && node.isWord() ? node : null;
    }

    public Node getPrefix(String argString) {
        return getNode(argString);
    }
class Node {
    private final String value;
    private Map<Character, Node> children = new HashMap<Character, Node>();
    private boolean isValidWord;

    public Node(String argValue) {
        value = argValue;
    }
                    break;
                if (trie.containsWord(word))
                    System.out.println(word + " found");
                else if (trie.containsPrefix(word)) {
                    if (confirm(word + " is a prefix. Add it as a word?"))
                        trie.addWord(word);
                }
                else {
                    if (confirm("Add " + word + "?"))
                        trie.addWord(word);
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static boolean confirm(String question)
            throws IOException
    {
        while (true) {
            System.out.print(question + " ");
            String ans = br.readLine().trim();
            if (ans.equalsIgnoreCase("N") ||
                    ans.equalsIgnoreCase("NO"))
                return false;
            else if (ans.equalsIgnoreCase("Y") ||
                    ans.equalsIgnoreCase("YES"))
                return true;
            System.out.println("Please answer Y, YES, or N, NO");
        }
    }
}
A C version
#include<stdio.h>
#include<malloc.h>
typedef struct trie
{
int words;
int prefixes;
struct trie *edges[26];
}trie;
trie * initialize(trie *node)
Trie
325
{
int i;
if(node==NULL)
node=(trie *)malloc(sizeof(trie));
node->words=0;
node->prefixes=0;
for(i=0;i<26;i++)
node->edges[i]=NULL;
return node;
}
trie * addWord(trie *ver,char *str)
{
printf("%c",str[0]);
if(str[0]=='\0')
{
ver->words=ver->words+1;
}
else
{
ver->prefixes=(ver->prefixes)+1;
char k;
k=str[0];
str++;
int index=k-'a';
if(ver->edges[index]==NULL)
{
ver->edges[index]=initialize(ver->edges[index]);
}
ver->edges[index]=addWord(ver->edges[index],str);
}
return ver;
}
int countWords(trie *ver,char *str)
{
if(str[0]=='\0')
return ver->words;
else
{
int k=str[0]-'a';
str++;
if(ver->edges[k]==NULL)
return 0;
return countWords(ver->edges[k],str);
Trie
326
}
}
int countPrefix(trie *ver,char *str)
{
if(str[0]=='\0')
return ver->prefixes;
else
{
int k=str[0]-'a';
str++;
if(ver->edges[k]==NULL)
return 0;
return countPrefix(ver->edges[k],str);
}
}
int main()
{
trie *start=NULL;
start=initialize(start);
int ch=1;
while(ch)
{
printf("\n 1. Insert a word ");
printf("\n 2. Count words");
printf("\n 3. Count prefixes");
printf("\n 0. Exit\n");
printf("\nEnter your choice: ");
scanf("%d",&ch);
char input[1000];
switch(ch)
{
case 1:
printf("\nEnter a word to insert: ");
scanf("%s",input);
start=addWord(start,input);
break;
case 2:
printf("\nEnter a word to count words: ");
scanf("%s",input);
printf("\n%d",countWords(start,input));
break;
case 3:
printf("\nEnter a word to count prefixes: ");
scanf("%s",input);
printf("\n%d",countPrefix(start,input));
break;
}
}
return 0;
}
A Python version
from collections import defaultdict

class Trie:
    def __init__(self):
        self.root = defaultdict(Trie)  # child nodes, keyed by character
        self.value = None              # value stored at this node, if any

    def add(self, s, value):
        """Add the string `s` to the
        `Trie` and map it to the given value."""
        head, tail = s[0], s[1:]
        cur_node = self.root[head]
        if not tail:
            cur_node.value = value
            return  # No further recursion
        cur_node.add(tail, value)

    def lookup(self, s, default=None):
        """Look up the value corresponding to
        the string `s`. Expand the trie to cache the search."""
        head, tail = s[0], s[1:]
        node = self.root[head]
        if tail:
            return node.lookup(tail, default)
        return node.value if node.value is not None else default

    def remove(self, s):
        """Remove the string s from the Trie.
        Returns *True* if the string was a member."""
        head, tail = s[0], s[1:]
        if head not in self.root:
            return False  # Not contained
        node = self.root[head]
        if tail:
            return node.remove(tail)
        was_member = node.value is not None
        node.value = None  # Unmap the key; longer keys below it are kept
        return was_member

    def prefix(self, s):
        """Check whether the string `s` is a prefix
        of some member. Don't expand the trie on negatives
        (cf. lookup)"""
        if not s:
            return True
        head, tail = s[0], s[1:]
        if head not in self.root:
            return False  # Not contained
        node = self.root[head]
        return node.prefix(tail)

    def items(self):
        """Return an iterator over the (key, value) items of the `Trie`."""
        for char, node in self.root.items():
            if node.value is not None:
                yield char, node.value
            for suffix, value in node.items():
                yield char + suffix, value
Sorting
Lexicographic sorting of a set of keys can be accomplished with a simple trie-based algorithm as follows:
Insert all keys in a trie.
Output all keys in the trie by means of pre-order traversal, which results in output that is in lexicographically
increasing order. Pre-order traversal is a kind of depth-first traversal. In-order traversal is another kind of
depth-first traversal that is more appropriate for outputting the values that are in a binary search tree rather than a
trie.
This algorithm is a form of radix sort.
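The two steps above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name trie_sort, the nested-dictionary node layout, and the empty-string end marker are all assumptions made for the example.

```python
def trie_sort(keys):
    """Sort strings by inserting them into a trie and reading it back pre-order."""
    END = ""  # hypothetical end-of-key marker; no single character equals ""
    root = {}
    # Step 1: insert all keys into the trie.
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = node.get(END, 0) + 1  # count duplicates of the same key

    # Step 2: pre-order traversal, visiting children in character order.
    out = []

    def visit(node, prefix):
        # A key ending at this node is emitted before any longer key below it.
        out.extend([prefix] * node.get(END, 0))
        for ch in sorted(c for c in node if c != END):
            visit(node[ch], prefix + ch)

    visit(root, "")
    return out
```

Because a node is visited before its children, a key such as "ban" is output before "banana", which is what lexicographic order requires.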
A trie forms the fundamental data structure of Burstsort, which in 2007 was the fastest known string sorting
algorithm. Faster string sorting algorithms have since been developed.
Bitwise tries
Bitwise tries are much the same as a normal character based trie except that individual bits are used to traverse what
effectively becomes a form of binary tree. Generally, implementations use a special CPU instruction to very quickly
find the first set bit in a fixed length key (e.g. GCC's __builtin_clz() intrinsic). This value is then used to index a 32
or 64 entry table which points to the first item in the bitwise trie with that number of leading zero bits. The search
then proceeds by testing each subsequent bit in the key and choosing child[0] or child[1] appropriately until the item
is found.
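The lookup scheme described above can be sketched in Python. This is only an illustration under stated assumptions: the class names, the 32-bit key width, and the node layout are hypothetical, and Python's int.bit_length() stands in for a count-leading-zeros instruction (the number of leading zero bits is KEY_BITS minus the bit length).

```python
KEY_BITS = 32  # assumed fixed key width


class _Node:
    __slots__ = ("child", "present", "value")

    def __init__(self):
        self.child = [None, None]  # child[0] and child[1]
        self.present = False       # True if a key ends at this node
        self.value = None


class BitwiseTrie:
    def __init__(self):
        # One sub-trie per possible bit length (0..KEY_BITS), i.e. per
        # possible number of leading zero bits in the key.
        self.table = [None] * (KEY_BITS + 1)

    def insert(self, key, value):
        top = key.bit_length()
        if self.table[top] is None:
            self.table[top] = _Node()
        node = self.table[top]
        # Walk the bits below the highest set bit, most significant first.
        for shift in range(top - 2, -1, -1):
            bit = (key >> shift) & 1
            if node.child[bit] is None:
                node.child[bit] = _Node()
            node = node.child[bit]
        node.present, node.value = True, value

    def find(self, key, default=None):
        top = key.bit_length()
        node = self.table[top]
        for shift in range(top - 2, -1, -1):
            if node is None:
                return default
            node = node.child[(key >> shift) & 1]
        if node is None or not node.present:
            return default
        return node.value
```

Keys are assumed to be non-negative integers below 2**KEY_BITS. All keys in one table slot share the same bit length, so every stored key ends at the same depth within its sub-trie.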
Although this process might sound slow, it is very cache-local and highly parallelizable due to the lack of register
dependencies and therefore in fact has excellent performance on modern out-of-order execution CPUs. A red-black
tree for example performs much better on paper, but is highly cache-unfriendly and causes multiple pipeline and
TLB stalls on modern CPUs which makes that algorithm bound by memory latency rather than CPU speed. In
comparison, a bitwise trie rarely accesses memory and when it does it does so only to read, thus avoiding SMP cache
coherency overhead, and hence is becoming increasingly the algorithm of choice for code which does a lot of
insertions and deletions such as memory allocators (e.g. recent versions of Doug Lea's well-known allocator
(dlmalloc) and its descendants).
A reference implementation of bitwise tries in C and C++ useful for further study can be found at
http://www.nedprod.com/programs/portable/nedtries/.
Compressing tries
When the trie is mostly static, i.e. all insertions or deletions of keys from a prefilled trie are disabled and only
lookups are needed, and when the trie nodes are not keyed by node specific data (or if the node's data is common) it
is possible to compress the trie representation by merging the common branches. This application is typically used
for compressing lookup tables when the total set of stored keys is very sparse within their representation space.
For example it may be used to represent sparse bitsets (i.e. subsets of a much larger fixed enumerable set) using a trie
keyed by the bit element position within the full set, with the key created from the string of bits needed to encode the
integral position of each element. The trie will then have a very degenerate form with many missing branches, and
compression becomes possible by storing the leaf nodes (set segments with fixed length) and combining them after
detecting the repetition of common patterns or by filling the unused gaps.
Such compression is also typically used in the implementation of the various fast lookup tables needed to retrieve
Unicode character properties (for example to represent case mapping tables, or lookup tables containing the
combination of base and combining characters needed to support Unicode normalization). For such application, the
representation is similar to transforming a very large unidimensional sparse table into a multidimensional matrix, and
then using the coordinates in the hyper-matrix as the string key of an uncompressed trie. The compression will then
consist of detecting and merging the common columns within the hyper-matrix to compress the last dimension in the
key; each dimension of the hypermatrix stores the start position within a storage vector of the next dimension for
each coordinate value, and the resulting vector is itself compressible when it is also sparse, so each dimension
(associated to a layer level in the trie) is compressed separately.
Some implementations do support such data compression within dynamic sparse tries and allow insertions and
deletions in compressed tries, but generally this has a significant cost when compressed segments need to be split or
merged, and some tradeoff has to be made between the smallest size of the compressed trie and the speed of updates,
by limiting the range of global lookups for comparing the common branches in the sparse trie.
The result of such compression may look similar to an attempt to transform the trie into a directed acyclic graph (DAG),
because the reverse transform from a DAG to a trie is obvious and always possible. However, the compression is constrained by the
form of the key chosen to index the nodes.
Another compression approach is to "unravel" the data structure into a single byte array. This approach eliminates
the need for node pointers which reduces the memory requirements substantially and makes memory mapping
possible which allows the virtual memory manager to load the data into memory very efficiently.
Another compression approach is to "pack" the trie. Liang describes a space-efficient implementation of a sparse
packed trie applied to hyphenation, in which the descendants of each node may be interleaved in memory.
Unlike most other data structures, tries have the peculiar feature that the code path, and hence the time required, is
almost identical for insert, delete, and find operations. As a result, for situations where code is inserting, deleting and
finding in equal measure, tries can handily beat binary search trees, as well as provide a better basis for the CPU's
instruction and branch caches.
The following are the main advantages of tries over binary search trees (BSTs):
Looking up keys is faster. Looking up a key of length m takes worst case O(m) time. A BST performs O(log(n))
comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the
tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m
log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during
lookup, such as array indexing using a character, are fast on real machines.
Tries are more space-efficient when they contain a large number of short keys, since nodes are shared between
keys with common initial subsequences.
Tries facilitate longest-prefix matching.
The number of internal nodes from root to leaf equals the length of the key. Balancing the tree is therefore of no
concern.
The following are the main advantages of tries over hash tables:
Tries support ordered iteration, whereas iteration over a hash table will result in a pseudorandom order given by
the hash function (and further affected by the order of hash collisions, which is determined by the
implementation).
Tries facilitate longest-prefix matching, but hashing does not, as a consequence of the above. Performing such a
"closest fit" find can, depending on implementation, be as quick as an exact find.
Tries tend to be faster on average at insertion than hash tables because hash tables must rebuild their index when
it becomes full - a very expensive operation. Tries therefore have much better bounded worst-case time costs,
which is important for latency-sensitive programs.
Since no hash function is used, tries are generally faster than hash tables for small keys.
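Longest-prefix matching, listed among the advantages above, can be illustrated with a short Python sketch. The function names and the use of None as an end-of-key marker inside nested dictionaries are assumptions made for this example.

```python
def build_trie(keys):
    """Build a nested-dict trie; the key None marks the end of a stored key."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = True  # None can never collide with a character
    return root


def longest_prefix(root, query):
    """Return the longest stored key that is a prefix of `query`, or None."""
    node = root
    best = "" if None in root else None
    for i, ch in enumerate(query):
        if ch not in node:
            break
        node = node[ch]
        if None in node:
            best = query[:i + 1]  # remember the deepest complete key so far
    return best
```

The search follows a single root-to-node path, so its cost is bounded by the length of the query, the same as for an exact lookup.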
Notes
References
de la Briandais, R. (1959). "File Searching Using Variable Length Keys". Proceedings of the Western Joint
Computer Conference: 295–298.
External links
Radix tree
In computer science, a radix tree (also patricia trie or radix trie or
compact prefix tree) is a space-optimized trie data structure where
each node with only one child is merged with its child. The result is
that the number of children of every internal node is at most the radix r
of the radix trie, where r is a positive integer and a power x of 2,
having x ≥ 1. Unlike in regular tries, edges can be labeled with
sequences of elements as well as single elements. This makes them
much more efficient for small sets (especially if the strings are long)
and for sets of strings that share long prefixes. Unlike regular trees
(where whole keys are compared en masse from their beginning up to the point of inequality), the key at each node is
compared chunk-of-bits by chunk-of-bits, where the quantity of bits in that chunk at that node is the radix r of the
radix trie. When the r is 2, the radix trie is binary (i.e., compare that node's 1-bit portion of the key), which
minimizes sparseness at the expense of maximizing trie depth, i.e., maximizing up to conflation of nondiverging
bit-strings in the key. When r is an integer power of 2 greater than or equal to 4, then the radix trie is an r-ary trie, which
lessens the depth of the radix trie at the expense of potential sparseness.
As an optimization, edge labels can be stored in constant size by using two pointers to a string (for the first and last
elements).
Note that although the examples in this article show strings as sequences of characters, the type of the string
elements can be chosen arbitrarily (for example, as a bit or byte of the string representation when using multibyte
character encodings or Unicode).
Applications
As mentioned, radix trees are useful for constructing associative arrays with keys that can be expressed as strings.
They find particular application in the area of IP routing, where the ability to contain large ranges of values with a
few exceptions is particularly suited to the hierarchical organization of IP addresses.[1] They are also used for
inverted indexes of text documents in information retrieval.
Operations
Radix trees support insertion, deletion, and searching operations. Insertion adds a new string to the trie while trying
to minimize the amount of data stored. Deletion removes a string from the trie. Searching operations include (but are
not necessarily limited to) exact lookup, find predecessor, find successor, and find all strings with a prefix. All of
these operations are O(k) where k is the maximum length of all strings in the set, where length is measured in the
quantity of bits equal to the radix of the radix trie.
Lookup
The lookup operation determines if a string exists in a trie. Most
operations modify this approach in some way to handle their specific
tasks. For instance, the node where a string terminates may be of
importance. This operation is similar to tries except that some edges
consume multiple elements.
The following pseudo code assumes that these classes exist.
Edge
Node targetNode
string label
Node
Array of Edges edges
function isLeaf()
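function lookup(string x)
{
// Begin at the root with no elements found
Node traverseNode := root;
int elementsFound := 0;
// Traverse until a leaf is found or it is not possible to continue
while (traverseNode != null && !traverseNode.isLeaf() && elementsFound < x.length)
{
// Follow the edge whose label is a prefix of the not yet matched remainder of x
Edge nextEdge := select edge from traverseNode.edges where edge.label is a prefix of x.suffix(elementsFound);
if (nextEdge != null) { traverseNode := nextEdge.targetNode; elementsFound += nextEdge.label.length; }
else traverseNode := null; // No matching edge: terminate the loop
}
// A match is found if we arrive at a leaf node and have used up exactly x.length elements
return (traverseNode != null && traverseNode.isLeaf() && elementsFound == x.length);
}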
Insertion
To insert a string, we search the tree until we can make no further progress. At this point we either add a new
outgoing edge labeled with all remaining elements in the input string, or if there is already an outgoing edge sharing
a prefix with the remaining input string, we split it into two edges (the first labeled with the common prefix) and
proceed. This splitting step ensures that no node has more children than there are possible string elements.
Several cases of insertion are shown below, though more may exist. Note that r simply represents the root. It is
assumed that edges can be labelled with empty strings to terminate strings where necessary and that the root has no
incoming edge.
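The insertion procedure, including the edge-splitting step, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the names RadixNode, insert and lookup are hypothetical, and a boolean flag on each node is used to mark the end of a stored string instead of the empty-labelled edges assumed in the text.

```python
class RadixNode:
    def __init__(self):
        self.edges = {}      # edge label (non-empty string) -> child RadixNode
        self.is_key = False  # True if a stored string ends at this node


def _common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(root, s):
    node = root
    while s:
        for label, child in list(node.edges.items()):
            k = _common_prefix_len(label, s)
            if k == 0:
                continue
            if k < len(label):
                # Split the edge: the first part keeps the common prefix,
                # the second part carries the rest of the old label.
                mid = RadixNode()
                mid.edges[label[k:]] = child
                del node.edges[label]
                node.edges[label[:k]] = mid
                child = mid
            node, s = child, s[k:]
            break
        else:
            # No edge shares a prefix: add one edge with all remaining elements.
            leaf = RadixNode()
            leaf.is_key = True
            node.edges[s] = leaf
            return
    node.is_key = True


def lookup(root, s):
    node = root
    while s:
        for label, child in node.edges.items():
            if s.startswith(label):
                node, s = child, s[len(label):]
                break
        else:
            return False
    return node.is_key
```

For example, inserting "test" and then "team" splits the edge labelled "test" into an edge "te" followed by the two edges "st" and "am".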
Figure: Insert 'slower' while keeping 'slow'.
Deletion
To delete a string x from a tree, we first locate the leaf representing x. Then, assuming x exists, we remove the
corresponding leaf node. If the parent of our leaf node has only one other child, then that child's incoming label is
appended to the parent's incoming label and the child is removed.
Additional operations
Find all strings with common prefix: Returns an array of strings which begin with the same prefix.
Find predecessor: Locates the largest string less than a given string, by lexicographic order.
Find successor: Locates the smallest string greater than a given string, by lexicographic order.
History
Donald R. Morrison first described what he called "Patricia trees" in 1968;[2] the name comes from the acronym
PATRICIA, which stands for "Practical Algorithm To Retrieve Information Coded In Alphanumeric". Gernot
Gwehenberger independently invented and described the data structure at about the same time.[3] PATRICIA tries
are radix tries with radix equal to 2, which means that each bit of the key is compared individually and each node is a
two-way (i.e., left versus right) branch.
Variants
A common extension of radix trees uses two colors of nodes, 'black' and 'white'. To check if a given string is stored
in the tree, the search starts from the top and follows the edges of the input string until no further progress can be
made. If the search-string is consumed and the final node is a black node, the search has failed; if it is white, the
search has succeeded. This enables us to add a large range of strings with a common prefix to the tree, using white
nodes, then remove a small set of "exceptions" in a space-efficient manner by inserting them using black nodes.
The HAT-trie is a radix tree based cache-conscious data structure that offers efficient string storage and retrieval,
and ordered iterations. Performance, with respect to both time and space, is comparable to the cache-conscious
hash table. See HAT trie implementation notes at [4].
References
[1] Knizhnik, Konstantin. "Patricia Tries: A Better Index For Prefix Searches" (http://www.ddj.com/architect/208800854), Dr. Dobb's
Journal, June 2008.
[2] Morrison, Donald R. Practical Algorithm to Retrieve Information Coded in Alphanumeric (http://portal.acm.org/citation.cfm?id=321481)
[3] G. Gwehenberger, Anwendung einer binären Verweiskettenmethode beim Aufbau von Listen. (http://cr.yp.to/bib/1968/gwehenberger.html)
Elektronische Rechenanlagen 10 (1968), pp. 223–226
[4] http://code.google.com/p/hat-trie
External links
Algorithms and Data Structures Research & Reference Material: PATRICIA (http://www.csse.monash.edu.
au/~lloyd/tildeAlgDS/Tree/PATRICIA/), by Lloyd Allison, Monash University
Patricia Tree (http://www.nist.gov/dads/HTML/patriciatree.html), NIST Dictionary of Algorithms and Data
Structures
Crit-bit trees (http://cr.yp.to/critbit.html), by Daniel J. Bernstein
Radix Tree API in the Linux Kernel (http://lwn.net/Articles/175432/), by Jonathan Corbet
Kart (key alteration radix tree) (http://code.dogmap.org/kart/), by Paul Jarc
Implementations
GNU C++ Standard library has a trie implementation (http://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/
trie_based_containers.html)
Java implementation of Radix Tree (http://badgenow.com/p/radixtree/), by Tahseen Ur Rehman
Java implementation of Concurrent Radix Tree (http://code.google.com/p/concurrent-trees/), by Niall
Gallagher
C# implementation of a Radix Tree (http://paratechnical.blogspot.com/2011/03/
radix-tree-implementation-in-c.html)
Practical Algorithm Template Library (http://code.google.com/p/patl/), a C++ library on PATRICIA tries
(VC++ >=2003, GCC G++ 3.x), by Roman S. Klyujkov
Patricia Trie C++ template class implementation (http://www.codeproject.com/KB/string/
PatriciaTrieTemplateClass.aspx), by Radu Gruian
Haskell standard library implementation (http://hackage.haskell.org/packages/archive/containers/latest/doc/
html/Data-IntMap.html) "based on big-endian patricia trees". Web-browsable source code (http://hackage.
haskell.org/packages/archive/containers/latest/doc/html/src/Data-IntMap.html).
Patricia Trie implementation in Java (http://code.google.com/p/patricia-trie/), by Roger Kapsi and Sam Berlin
Crit-bit trees (http://github.com/agl/critbit) forked from C code by Daniel J. Bernstein
Patricia Trie implementation in C (http://cprops.sourceforge.net/gen/docs/trie_8c-source.html), in libcprops
(http://cprops.sourceforge.net)
Patricia Trees : efficient sets and maps over integers in (http://www.lri.fr/~filliatr/ftp/ocaml/ds) OCaml, by
Jean-Christophe Filliâtre
Suffix tree
In computer science, a suffix tree (also called PAT
tree or, in an earlier form, position tree) is a
compressed trie containing all the suffixes of the given
text as their keys and positions in the text as their
values. A suffix tree allows a particularly fast
implementation of many important string operations.
The construction of such a tree for the string S takes time and space linear in the length of S.
History
Suffix tree for the text BANANA. Each substring is terminated with
special character $. The six paths from the root to a leaf (shown as
boxes) correspond to the six suffixes A$, NA$, ANA$, NANA$,
ANANA$ and BANANA$. The numbers in the leaves give the start
position of the corresponding suffix. Suffix links, drawn dashed, are
used during construction.
Definition
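The suffix tree for the string S of length n is defined as a tree such that the paths from the root to the leaves
correspond one-to-one to the suffixes of S, every edge spells a non-empty string, and every internal node (except
perhaps the root) has at least two children. Since such a tree does not exist for all strings, S is padded with a
terminal symbol not seen in the string (usually
denoted $). This ensures that no suffix is a prefix of another, and that there will be n leaf nodes, one for each of the
n suffixes of S. Since all internal non-root nodes are branching, there can be at most n − 1 such nodes, and
n + (n − 1) + 1 = 2n nodes in total (n leaves, n − 1 internal non-root nodes, 1 root).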
Suffix links are a key feature for older linear-time construction algorithms, although most newer algorithms, which
are based on Farach's algorithm, dispense with suffix links. In a complete suffix tree, all internal non-root nodes have
a suffix link to another internal node. If the path from the root to a node spells the string χα, where χ is a single
character and α is a string (possibly empty), it has a suffix link to the internal node representing α. See for
example the suffix link from the node for ANA to the node for NA in the figure above. Suffix links are also used in
some algorithms running on the tree.
Functionality
A suffix tree for a string S of length n can be built in Θ(n) time, if the letters come from an alphabet of integers
in a polynomial range (in particular, this is true for constant-sized alphabets).[3] For larger alphabets, the running
time is dominated by first sorting the letters to bring them into a range of size O(n); in general, this takes
O(n log n) time. The costs below are given under the assumption that the alphabet is constant.
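Assume that a suffix tree has been built for the string S of length n, or that a generalised suffix tree has been built
for the set of strings D = {S_1, S_2, ..., S_K} of total length n = n_1 + n_2 + ... + n_K. You can:
Search for strings:
Check whether a string P of length m is a substring in O(m) time.
Find all z occurrences of a pattern of length m in O(m + z) time.[5]
Find properties of the strings:
Find the longest common substrings of the strings S_i and S_j in Θ(n_i + n_j) time.
Find the longest repeated substrings in Θ(n) time.
The suffix tree can be prepared for constant time lowest common ancestor retrieval between nodes in Θ(n)
time.[11] You can then also:
Find the longest common prefix between two suffixes in Θ(1) time.[12]
Search for a pattern P of length m with at most k mismatches in O(kn + z) time, where z is the number of hits.
Find all z maximal palindromes in Θ(n) time,[14] or in Θ(gn) time if gaps of length g are allowed, or in Θ(kn) time
if k mismatches are allowed.[15]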
Applications
Suffix trees can be used to solve a large number of string problems that occur in text-editing, free-text search,
computational biology and other application areas. Primary applications include:
String search, in O(m) complexity, where m is the length of the sub-string (but with initial O(n) time required to
build the suffix tree for the string)
Finding the longest repeated substring
Finding the longest common substring
Finding the longest palindrome in a string
Suffix trees are often used in bioinformatics applications, searching for patterns in DNA or protein sequences (which
can be viewed as long strings of characters). The ability to search efficiently with mismatches might be considered
their greatest strength. Suffix trees are also used in data compression; they can be used to find repeated data, and can
be used for the sorting stage of the Burrows–Wheeler transform. Variants of the LZ77 compression scheme, such as LZSS, use
suffix trees. A suffix tree is also used in suffix tree clustering, a data clustering algorithm used in some
search engines.[19]
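Two of the applications listed above, substring search and the longest repeated substring, can be illustrated in Python. The sketch below deliberately swaps in an uncompressed suffix trie built naively in quadratic time and space rather than a linear-time suffix tree; the queries work the same way, but the construction cost does not match the bounds stated above. The function names and the nested-dictionary layout are assumptions.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie.

    Assumes `terminator` does not occur in `text`.
    """
    text += terminator
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Substring search: follow one path of length len(pattern) from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def longest_repeated_substring(trie):
    """Return the string spelled by the deepest branching node."""
    best = ""
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        # A node with two or more children is reached by at least two suffixes,
        # so the string spelled on the way to it occurs at least twice.
        if len(node) >= 2 and len(path) > len(best):
            best = path
        for ch, child in node.items():
            stack.append((child, path + ch))
    return best
```

For the text BANANA used in the figure, the deepest branching node spells ANA, which occurs at positions 2 and 4.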
Implementation
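If each node and edge can be represented in Θ(1) space, the entire tree can be represented in Θ(n) space. The
total length of all the strings on all of the edges in the tree is O(n²), but each edge can be stored as the position and
length of a substring of S, giving a total space usage of Θ(n) computer words.
Let σ be the size of the alphabet. Then you have the following costs: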
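Child representation              Lookup      Insertion   Traversal
Sibling lists / unsorted arrays   O(σ)        Θ(1)        Θ(1)
Hash maps                         Θ(1)        Θ(1)        O(σ)
Balanced search tree              O(log σ)    O(log σ)    O(1)
Sorted arrays                     O(log σ)    O(σ)        O(1)
Hash maps + sibling lists         O(1)        O(1)        O(1)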
Note that the insertion cost is amortised, and that the costs for hashing are given for perfect hashing.
The large amount of information in each edge and node makes the suffix tree very expensive, consuming about 10 to
20 times the memory size of the source text in good implementations. The suffix array reduces this requirement to a
factor of 8 (for array including LCP values built within 32-bit address space and 8-bit characters.) This factor
depends on the properties and may reach 2 with usage of 4-byte wide characters (needed to contain any symbol in
some UNIX-like systems, see wchar_t) on 32-bit systems. Researchers have continued to find smaller indexing
structures.
External construction
Suffix trees quickly outgrow the main memory on standard machines for sequence collections in the order of
gigabytes. As such, their construction calls for external memory approaches.
There are theoretical results for constructing suffix trees in external memory. The algorithm by Farach-Colton,
Ferragina & Muthukrishnan (2000) is theoretically optimal, with an I/O complexity equal to that of sorting. However
the overall intricacy of this algorithm has prevented, so far, its practical implementation.[20]
On the other hand, there have been practical works for constructing disk-based suffix trees which scale to (few)
GB/hours. The state of the art methods are TDD, TRELLIS, DiGeST, and B2ST.
TDD and TRELLIS scale up to the entire human genome (approximately 3GB), resulting in a disk-based suffix
tree of a size in the tens of gigabytes. However, these methods cannot efficiently handle collections of sequences
exceeding 3GB. DiGeST performs significantly better and is able to handle collections of sequences in the order of
6GB in about 6 hours. All these methods can efficiently build suffix trees for the case when the tree does not fit in
main memory, but the input does. The most recent method, B2ST, scales to handle inputs that do not fit in main
memory. ERA is a recent parallel suffix tree construction method that is significantly faster. ERA can index the
entire human genome in 19 minutes on an 8-core desktop computer with 16GB RAM. On a simple Linux cluster
with 16 nodes (4GB RAM per node), ERA can index the entire human genome in less than 9 minutes.[21]
Notes
[1] Giegerich & Kurtz (1997).
[2] http://www.cs.uoi.gr/~kblekas/courses/bioinformatics/Suffix_Trees1.pdf
[3] Farach (1997).
[4] , p.92.
[5] , p.123.
[6] Baeza-Yates & Gonnet (1996).
[7] , p.132.
[8] , p.125.
[9] , p.144.
[10] , p.166.
[11] , Chapter 8.
[12] , p.196.
[13] , p.200.
[14] , p.198.
[15] , p.201.
[16] , p.204.
[17] , p.205.
[18] , pp.197199.
[19] First introduced by Zamir & Etzioni (1998).
[20] Smyth (2003).
[21] Mansour et al. (2011).
References
Baeza-Yates, Ricardo A.; Gonnet, Gaston H. (1996), "Fast text searching for regular expressions or automaton
searching on tries", Journal of the ACM 43 (6): 915–936, doi:10.1145/235809.235810
(http://dx.doi.org/10.1145/235809.235810).
Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2008), "A new method for indexing genomes using
on-disk suffix trees", CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge
Management, New York, NY, USA: ACM, pp. 649–658.
Barsky, Marina; Stege, Ulrike; Thomo, Alex; Upton, Chris (2009), "Suffix trees for very large genomic
sequences", CIKM '09: Proceedings of the 18th ACM Conference on Information and Knowledge Management,
Phoophakdee, Benjarath; Zaki, Mohammed J. (2007), "Genome-scale disk-based suffix tree indexing", SIGMOD
'07: Proceedings of the ACM SIGMOD International Conference on Management of Data, New York, NY, USA:
ACM, pp. 833–844.
Smyth, William (2003), Computing Patterns in Strings, Addison-Wesley.
Tata, Sandeep; Hankins, Richard A.; Patel, Jignesh M. (2003), "Practical Suffix Tree Construction", VLDB '03:
Proceedings of the 30th International Conference on Very Large Data Bases, Morgan Kaufmann, pp. 36–47.
Ukkonen, E. (1995), "On-line construction of suffix trees"
(http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf), Algorithmica 14 (3): 249–260, doi:10.1007/BF01206331
(http://dx.doi.org/10.1007/BF01206331).
Weiner, P. (1973), "Linear pattern matching algorithms", 14th Annual IEEE Symposium on Switching and
Automata Theory, pp. 1–11, doi:10.1109/SWAT.1973.13 (http://dx.doi.org/10.1109/SWAT.1973.13).
Zamir, Oren; Etzioni, Oren (1998), "Web document clustering: a feasibility demonstration", SIGIR '98:
Proceedings of the 21st annual international ACM SIGIR conference on Research and development in
information retrieval, New York, NY, USA: ACM, pp. 46–54.
External links
Suffix Trees (http://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm) by Sartaj Sahni
Suffix Trees (http://www.allisons.org/ll/AlgDS/Tree/Suffix/) by Lloyd Allison
NIST's Dictionary of Algorithms and Data Structures: Suffix Tree (http://www.nist.gov/dads/HTML/
suffixtree.html)
suffix_tree (http://mila.cs.technion.ac.il/~yona/suffix_tree/) ANSI C implementation of a Suffix Tree
libstree (http://www.cl.cam.ac.uk/~cpk25/libstree/), a generic suffix tree library written in C
Tree::Suffix (https://metacpan.org/module/Tree::Suffix), a Perl binding to libstree
Strmat (http://www.cs.ucdavis.edu/~gusfield/strmat.html) a faster generic suffix tree library written in C
(uses arrays instead of linked lists)
SuffixTree (http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/) a Python binding to Strmat
Universal Data Compression Based on the Burrows-Wheeler Transformation: Theory and Practice (http://www.
balkenhol.net/papers/t1043.pdf.gz), application of suffix trees in the BWT
Theory and Practice of Succinct Data Structures (http://www.cs.helsinki.fi/group/suds/), C++
implementation of a compressed suffix tree
Practical Algorithm Template Library (http://code.google.com/p/patl/), a C++ library with suffix tree
implementation on PATRICIA trie, by Roman S. Klyujkov
A Java implementation (http://en.literateprograms.org/Suffix_tree_(Java))
A Java implementation of Concurrent Suffix Tree (http://code.google.com/p/concurrent-trees/)
Suffix array
Type: Array
Invented by: Manber & Myers (1990)
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a simple, yet powerful data
structure which is used, among others, in full text indices, data compression algorithms and within the field of
bioinformatics[1].
Suffix arrays were introduced by Manber & Myers (1990) as a simple, space efficient alternative to suffix trees.
They have independently been discovered by Gonnet, Baeza-Yates & Snider (1992) under the name PAT array.
Definition
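Let S = s_1 s_2 ... s_n be a string and let S[i,j] denote the substring of S ranging from i to j.
The suffix array A of S is defined to be an array of integers providing the starting positions of the suffixes of S in
lexicographical order. That is, an entry A[i] contains the starting position of the i-th smallest suffix of S, and thus
for all 1 < i ≤ n: S[A[i−1],n] < S[A[i],n].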
Example
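Consider the text S = banana$ to be indexed:
i    1 2 3 4 5 6 7
S[i] b a n a n a $
The text ends with the special sentinel letter $ that is unique and lexicographically smaller than any other character.
The text has the following suffixes:
Suffix   i
banana$  1
anana$   2
nana$    3
ana$     4
na$      5
a$       6
$        7
These suffixes can be sorted in lexicographic order:
Suffix   i
$        7
a$       6
ana$     4
anana$   2
banana$  1
na$      5
nana$    3
The suffix array A contains the starting positions of these sorted suffixes:
i    1 2 3 4 5 6 7
A[i] 7 6 4 2 1 5 3
So for example, A[3] contains the value 4, and therefore refers to the suffix starting at position 4 within S, which
is the suffix ana$.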
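The banana$ example can be checked with a few lines of Python. This is a naive sketch that sorts the suffixes directly; the function name is hypothetical, positions are 1-based to match the tables, and the code relies on $ comparing smaller than the lowercase letters in Python's string ordering.

```python
def suffix_array_naive(text):
    """Return the 1-based suffix array of `text` by sorting all suffixes directly."""
    n = len(text)
    return sorted(range(1, n + 1), key=lambda i: text[i - 1:])


A = suffix_array_naive("banana$")
print(A)  # [7, 6, 4, 2, 1, 5, 3]
```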
Space Efficiency
Suffix arrays were introduced by Manber & Myers (1990) in order to improve over the space requirements of suffix
trees: suffix arrays store n integers. Assuming an integer requires 4 bytes, a suffix array requires 4n
bytes in total.
This is significantly less than the space required by a suffix tree.
However, in certain applications, the space requirements of suffix arrays may still be prohibitive. Analyzed in bits, a
suffix array requires O(n log n) space, whereas the original text over an alphabet of size σ requires only
O(n log σ) bits. For a human genome with σ = 4 and n = 3.4 × 10^9, the suffix array would therefore occupy about
16 times more memory than the genome itself.
Construction Algorithms
A naive approach to construct a suffix array is to use a comparison-based sorting algorithm. These algorithms
require O(n log n) suffix comparisons, but a suffix comparison runs in O(n) time, so the overall runtime of this
approach is O(n² log n).
More advanced algorithms take advantage of the fact that the suffixes to be sorted are not arbitrary strings but related
to each other. These algorithms strive to achieve the following goals:[4]
minimal asymptotic complexity
lightweight in space, meaning little or no working memory beside the text and the suffix array itself is needed
fast in practice
One of the first algorithms to achieve all goals is the SA-IS algorithm of Nong, Zhang & Chan (2009). The algorithm
is also rather simple (< 100 LOC) and can be enhanced to simultaneously construct the LCP array.[5] The SA-IS
algorithm is one of the fastest known suffix array construction algorithms. A careful implementation by Yuta Mori[6]
outperforms most other linear or super-linear construction approaches.
Besides time and space requirements, suffix array construction algorithms are also differentiated by their supported
alphabet: constant alphabets where the alphabet size is bounded by a constant, integer alphabets where characters are
integers in a range depending on n, and general alphabets where only character comparisons are allowed.[7]
Most suffix array construction algorithms are based on one of the following approaches:[4]
Prefix doubling algorithms are based on a strategy of Karp, Miller & Rosenberg (1972). The idea is to find
prefixes that honor the lexicographic ordering of suffixes. The assessed prefix length doubles in each iteration of
the algorithm until a prefix is unique and provides the rank of the associated suffix.
Recursive algorithms follow the approach of the suffix tree construction algorithm by Farach (1997) to
recursively sort a subset of suffixes. This subset is then used to infer a suffix array of the remaining suffixes. Both
of these suffix arrays are then merged to compute the final suffix array.
Induced copying algorithms are similar to recursive algorithms in the sense that they use an already sorted subset
to induce a fast sort of the remaining suffixes. The difference is that these algorithms favor iteration over
recursion to sort the selected suffix subset. A survey of this diverse group of algorithms has been put together by
Puglisi, Smyth & Turpin (2007).
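As an illustration of the prefix doubling strategy, the following sketch ranks suffixes by their first k characters and doubles k until all ranks are distinct. It is a simplified variant that re-sorts with a comparison sort in every round (O(n log^2 n) overall), not any particular published algorithm; the function name and the 0-based output are assumptions made for this example.

```python
def suffix_array_prefix_doubling(text):
    """Return the 0-based suffix array of text using prefix doubling."""
    n = len(text)
    if n == 0:
        return []
    rank = [ord(c) for c in text]  # initial rank: the first character only
    sa = list(range(n))
    k = 1
    while True:
        # A suffix is keyed by the rank of its first k characters and the
        # rank of the following k characters (-1 if it ends before that).
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            # Equal keys share a rank; a differing key starts a new rank.
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # every suffix has a unique rank
            return sa
        k *= 2  # the compared prefix length doubles each round
```

For banana$ this returns the same order as the example above, shifted to 0-based positions.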
A well-known recursive algorithm for integer alphabets is the DC3 / skew algorithm of Kärkkäinen & Sanders
(2003). It runs in linear time and has successfully been used as the basis for parallel[8] and external memory[9] suffix
array construction algorithms.
Recent work by Salson et al. (2009) proposes an algorithm for updating the suffix array of a text that has been edited
instead of rebuilding a new suffix array from scratch. Even if the theoretical worst-case time complexity is
O(n log n), it appears to perform well in practice: experimental results from the authors showed that their
implementation of dynamic suffix arrays is generally more efficient than rebuilding when considering the insertion
of a reasonable number of letters in the original text.
Applications
The suffix array of a string can be used as an index to quickly locate every occurrence of a substring pattern P
within the string S. Finding every occurrence of the pattern is equivalent to finding every suffix that begins with
the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array and
can be found efficiently with two binary searches. The first search locates the starting position of the interval, and the
second one determines the end position:
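def search(P, S, A):
    # A is the suffix array of S with 0-based positions.
    # Returns the half-open interval (s, r) of indices into A whose
    # suffixes begin with P; the interval is empty (s == r) if P does not occur.
    n = len(A)
    m = len(P)
    l, r = 0, n
    while l < r:                      # find the first suffix >= P
        mid = (l + r) // 2
        if S[A[mid]:A[mid] + m] < P:
            l = mid + 1
        else:
            r = mid
    s = l
    r = n
    while l < r:                      # find the first suffix that does not begin with P
        mid = (l + r) // 2
        if S[A[mid]:A[mid] + m] == P:
            l = mid + 1
        else:
            r = mid
    return (s, r)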
Finding the substring pattern P of length m in the string S of length n takes O(m log n) time, given that each of
the O(log n) binary search steps compares up to m characters.
Notes
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]
References
Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2004). "Replacing suffix trees with enhanced
suffix arrays". Journal of Discrete Algorithms 2: 53. doi: 10.1016/S1570-8667(03)00065-0 (http://dx.doi.org/
10.1016/S1570-8667(03)00065-0).
Manber, Udi; Myers, Gene (1990). "Suffix arrays: a new method for on-line string searches". In Proceedings of
the first annual ACM-SIAM symposium on Discrete algorithms 90 (319): 327.
Gonnet, G.H; Baeza-Yates, R.A; Snider, T (1992). "New indices for text: PAT trees and PAT arrays".
Information retrieval: data structures and algorithms.
Kurtz, S (1999). "Reducing the space requirement of suffix trees". Software-Practice and Experience 29 (13):
1149. doi: 10.1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O (http://dx.doi.org/10.
1002/(SICI)1097-024X(199911)29:13<1149::AID-SPE274>3.0.CO;2-O).
Abouelhoda, Mohamed Ibrahim; Kurtz, Stefan; Ohlebusch, Enno (2002). "The Enhanced Suffix Array and Its
Applications to Genome Analysis". Algorithms in Bioinformatics. Lecture Notes in Computer Science 2452.
p.449. doi: 10.1007/3-540-45784-4_35 (http://dx.doi.org/10.1007/3-540-45784-4_35).
ISBN978-3-540-44211-0.
Puglisi, Simon J.; Smyth, W. F.; Turpin, Andrew H. (2007). "A taxonomy of suffix array construction
algorithms". ACM Computing Surveys 39 (2): 4. doi: 10.1145/1242471.1242472 (http://dx.doi.org/10.1145/
1242471.1242472).
Nong, Ge; Zhang, Sen; Chan, Wai Hong (2009). "Linear Suffix Array Construction by Almost Pure
Induced-Sorting". 2009 Data Compression Conference. p.193. doi: 10.1109/DCC.2009.42 (http://dx.doi.org/
10.1109/DCC.2009.42). ISBN978-0-7695-3592-0.
Fischer, Johannes (2011). "Inducing the LCP-Array". Algorithms and Data Structures. Lecture Notes in
Computer Science 6844. p.374. doi: 10.1007/978-3-642-22300-6_32 (http://dx.doi.org/10.1007/
978-3-642-22300-6_32). ISBN978-3-642-22299-3.
Salson, M.; Lecroq, T.; Léonard, M.; Mouchard, L. (2010). "Dynamic extended suffix arrays". Journal of
Discrete Algorithms 8 (2): 241. doi: 10.1016/j.jda.2009.02.007 (http://dx.doi.org/10.1016/j.jda.2009.02.
007).
Burkhardt, Stefan; Kärkkäinen, Juha (2003). "Fast Lightweight Suffix Array Construction and Checking".
Combinatorial Pattern Matching. Lecture Notes in Computer Science 2676. p. 55. doi: 10.1007/3-540-44888-8_5
(http://dx.doi.org/10.1007/3-540-44888-8_5). ISBN 978-3-540-40311-1.
Karp, Richard M.; Miller, Raymond E.; Rosenberg, Arnold L. (1972). "Rapid identification of repeated patterns
in strings, trees and arrays". Proceedings of the fourth annual ACM symposium on Theory of computing - STOC
'72. p.125. doi: 10.1145/800152.804905 (http://dx.doi.org/10.1145/800152.804905).
Farach, M. (1997). "Optimal suffix tree construction with large alphabets". Proceedings 38th Annual Symposium
on Foundations of Computer Science. p.137. doi: 10.1109/SFCS.1997.646102 (http://dx.doi.org/10.1109/
SFCS.1997.646102). ISBN0-8186-8197-7.
Kärkkäinen, Juha; Sanders, Peter (2003). "Simple Linear Work Suffix Array Construction". Automata, Languages
and Programming. Lecture Notes in Computer Science 2719. p. 943. doi: 10.1007/3-540-45061-0_73
(http://dx.doi.org/10.1007/3-540-45061-0_73). ISBN 978-3-540-40493-4.
Dementiev, Roman; Kärkkäinen, Juha; Mehnert, Jens; Sanders, Peter (2008). "Better external memory suffix
array construction". Journal of Experimental Algorithmics 12: 1. doi: 10.1145/1227161.1402296
(http://dx.doi.org/10.1145/1227161.1402296).
Kulla, Fabian; Sanders, Peter (2007). "Scalable parallel suffix array construction". Parallel Computing 33 (9):
605. doi: 10.1016/j.parco.2007.06.004 (http://dx.doi.org/10.1016/j.parco.2007.06.004).
External links
Suffix sorting module for BWT in C code (http://code.google.com/p/compression-code/downloads/list)
Suffix Array Implementation in Ruby (http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/
1845)
Suffix array library and tools (http://sary.sourceforge.net/index.html.en)
Project containing various Suffix Array c/c++ Implementations with a unified interface (http://pizzachili.dcc.
uchile.cl/)
A fast, lightweight, and robust C API library to construct the suffix array (http://code.google.com/p/
libdivsufsort/)
Suffix Array implementation in Python (http://code.google.com/p/pysuffix/)
Van Emde Boas tree
Type: Non-binary tree
Invented: 1975
Space: O(n)
Search: O(log log n)
Insert: O(log log n)
Delete: O(log log n)
A Van Emde Boas tree (or Van Emde Boas priority queue), also known as a vEB tree, is a tree data structure
which implements an associative array with m-bit integer keys. It performs all operations in O(log m) time. Note
that m is the size of the keys in bits, so O(log m) is O(log log n) in a tree where every key below n is set, which is
exponentially better than a full self-balancing binary search tree. The vEB tree also has good space efficiency when
it contains a large number of elements, as discussed below. It was invented by a team led by Peter van Emde Boas in
1975.[1]
Supported operations
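A vEB supports the operations of an ordered associative array, which includes the usual associative array operations
(Insert, Delete and Lookup) along with two more order operations, FindNext and FindPrevious, which find the
stored key nearest to a given key from above and from below, respectively.[2]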
A vEB tree also supports the operations Minimum and Maximum, which return the minimum and maximum element
stored in the tree respectively.[3] These both run in O(1) time, since the minimum and maximum element are stored
as attributes in each tree.
How it works
For the sake of simplicity, let log2 m = k for some integer k. Define M = 2^m. A vEB tree T over the universe
{0,...,M-1} has a root node that stores an array T.children of length M^(1/2). T.children[i] is a pointer to a
vEB tree that is responsible for the values {i·M^(1/2),...,(i+1)·M^(1/2)-1}. Additionally, T stores two values
T.min and T.max as well as an auxiliary vEB tree T.aux.
[Figure: An example Van Emde Boas tree with dimension 5 and the root's aux structure after 1, 2, 3, 5, 8 and
10 have been inserted.]
Data is stored in a vEB tree as follows: The smallest value currently in the tree is stored in T.min and the largest
value is stored in T.max. Note that T.min is not stored anywhere else in the vEB tree, while T.max is.
If T is empty then we use the convention that T.max = -1 and T.min = M.
Any other value x is stored in the subtree T.children[i] where i = floor(x / M^(1/2)). The auxiliary tree T.aux
keeps track of which children are non-empty, so T.aux contains the value j if and only if T.children[j] is non-empty.
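When M^(1/2) is a power of two, the child index i and the position inside that child can be read directly from the bits of x. The helper names split_key and join_key below are assumptions made for this sketch, which also assumes that m is even.

```python
def split_key(x, m):
    """Split an m-bit key into (child index, key inside that child)."""
    half = m // 2                # sqrt(M) = 2**half when m is even
    i = x >> half                # floor(x / sqrt(M)): which child stores x
    lo = x & ((1 << half) - 1)   # x mod sqrt(M): the value stored in that child
    return i, lo


def join_key(i, lo, m):
    """Inverse of split_key: rebuild the full key from its two halves."""
    return (i << (m // 2)) | lo


# With m = 4 (M = 16), the key 13 = 0b1101 belongs to child 0b11 = 3
# and is stored there as 0b01 = 1.
print(split_key(13, 4))  # (3, 1)
```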
FindNext
The operation FindNext(T, x) that searches for the successor of an element x in a vEB tree proceeds as follows: If
x ≤ T.min then the search is complete, and the answer is T.min. If x > T.max then the next element does not exist,
return M. Otherwise, let i = floor(x / M^(1/2)) and lo = x mod M^(1/2). If lo ≤ T.children[i].max then the value
being searched for is contained in T.children[i], so the search proceeds recursively in T.children[i]. Otherwise, we
search T.aux for the first value j larger than i. This gives us the index j of the first subtree that contains an
element larger than x. The algorithm then returns T.children[j].min. The element found on the children level needs
to be composed with the high bits (the index of the subtree it came from) to form a complete next element.
function FindNext(T, x)
    if x <= T.min then
        return T.min
    if x > T.max then // no next element
        return M
    i = floor(x / sqrt(M))
    lo = x % sqrt(M)
    if lo <= T.children[i].max then
        return i * sqrt(M) + FindNext(T.children[i], lo)
    j = FindNext(T.aux, i + 1) // first non-empty subtree after i
    return j * sqrt(M) + T.children[j].min
end
Note that, in any case, the algorithm performs O(1) work and then possibly recurses on a subtree over a universe of
size M^(1/2) (an m/2 bit universe). This gives a recurrence for the running time of T(m) = T(m/2) + O(1), which
resolves to O(log m) = O(log log M).
Insert
The call insert(T, x) that inserts a value x into a vEB tree T operates as follows:
If T is empty then we set T.min = T.max = x and we are done.
Otherwise, if x < T.min then we insert T.min into the subtree i responsible for T.min and then set T.min = x. If
T.children[i] was previously empty, then we also insert i into T.aux.
Otherwise, if x > T.max then we insert x into the subtree i responsible for x and then set T.max = x. If T.children[i]
was previously empty, then we also insert i into T.aux.
Otherwise, T.min < x < T.max so we insert x into the subtree i responsible for x. If T.children[i] was previously
empty, then we also insert i into T.aux.
In code:
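function Insert(T, x)
    if T.min > T.max then // T is empty
        T.min = T.max = x
        return
    if x == T.min or x == T.max then // x is already stored
        return
    if x < T.min then
        swap(x, T.min) // x now holds the old minimum, which moves into a subtree
    if x > T.max then
        T.max = x
    i = floor(x / sqrt(M))
    if T.children[i].min > T.children[i].max then // subtree was empty
        Insert(T.aux, i)
    Insert(T.children[i], x % sqrt(M))
end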
Delete
Deletion from vEB trees is the trickiest of the operations. The call Delete(T, x) that deletes a value x from a vEB tree
T operates as follows:
If T.min = T.max = x then x is the only element stored in the tree and we set T.min = M and T.max = -1 to indicate
that the tree is empty.
Otherwise, if x = T.min then we need to find the second-smallest value y in the vEB tree, delete it from its current
location, and set T.min = y. Because T.max is itself stored in a subtree, the second-smallest value y is
T.children[T.aux.min].min (composed with the high bits T.aux.min), so it can be found in O(1) time. We then delete
y from the subtree that contains it.
Similarly, if x = T.max then we need to find the second-largest value y in the vEB tree and set T.max = y. We
delete x from the subtree that contains it; afterwards the second-largest value y is either
T.children[T.aux.max].max or, if no subtree has any element left, T.min, so it can be found in O(1) time.
In the case where x is not T.min or T.max, and T has no other elements, we know x is not in T and return without
further operations.
Otherwise, we have the typical case where x ≠ T.min and x ≠ T.max. In this case we delete x from the subtree
T.children[i] that contains x.
In any of the above cases, if we delete the last element x or y from any subtree T.children[i] then we also delete i
from T.aux.
In code:
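function Delete(T, x)
    if x < T.min or x > T.max then // x is not in T (also covers an empty T)
        return
    if T.min == T.max then // x is the only element
        T.min = M
        T.max = -1
        return
    if x == T.min then
        j = T.aux.min
        x = j * sqrt(M) + T.children[j].min // second-smallest value
        T.min = x // promote it, then remove it from its subtree below
    i = floor(x / sqrt(M))
    Delete(T.children[i], x % sqrt(M))
    if T.children[i].min > T.children[i].max then // subtree became empty
        Delete(T.aux, i)
    if x == T.max then
        if T.aux.min > T.aux.max then // no subtree has elements left
            T.max = T.min
        else
            j = T.aux.max
            T.max = j * sqrt(M) + T.children[j].max
end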
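The three operations can be combined into a small runnable sketch. The class name VEB, the lazily allocated dictionary of children and the explicit base case for universes of at most two values are assumptions of this example, not part of the description above; the sketch stores a set of keys and keeps the convention that T.min is not stored in any child while T.max is.

```python
class VEB:
    """Set of integers in {0, ..., 2**m - 1}, following the layout in the text."""

    def __init__(self, m):
        self.m = m
        self.M = 1 << m                   # universe size
        self.lo_bits = m // 2             # low-order bits handled by each child
        self.min, self.max = self.M, -1   # empty-tree convention from the text
        self.children = {}                # child index -> VEB, allocated lazily
        self.aux = None                   # VEB over indices of non-empty children

    def _split(self, x):
        return x >> self.lo_bits, x & ((1 << self.lo_bits) - 1)

    def _join(self, i, lo):
        return (i << self.lo_bits) | lo

    def _empty(self):
        return self.min > self.max

    def insert(self, x):
        if self._empty():
            self.min = self.max = x
            return
        if x == self.min or x == self.max:
            return                        # already present
        if x < self.min:
            x, self.min = self.min, x     # old minimum moves down into a child
        if x > self.max:
            self.max = x
        if self.m <= 1:
            return                        # base case: min and max hold everything
        i, lo = self._split(x)
        if i not in self.children:
            self.children[i] = VEB(self.lo_bits)
        child = self.children[i]
        if child._empty():                # child becomes non-empty: record it
            if self.aux is None:
                self.aux = VEB(self.m - self.lo_bits)
            self.aux.insert(i)
        child.insert(lo)

    def find_next(self, x):
        """Smallest stored value >= x, or M if there is none."""
        if x <= self.min:
            return self.min
        if x > self.max:
            return self.M
        if self.m <= 1:
            return self.max
        i, lo = self._split(x)
        child = self.children.get(i)
        if child is not None and lo <= child.max:
            return self._join(i, child.find_next(lo))
        j = self.aux.find_next(i + 1)     # first non-empty child after i
        return self._join(j, self.children[j].min)

    def delete(self, x):
        if x < self.min or x > self.max:
            return                        # not present (covers the empty tree)
        if self.min == self.max:
            self.min, self.max = self.M, -1
            return
        if self.m <= 1:                   # base case holding both 0 and 1
            self.min = self.max = 1 - x
            return
        if x == self.min:
            j = self.aux.min
            x = self._join(j, self.children[j].min)   # second-smallest value
            self.min = x                  # promote it, then remove it below
        i, lo = self._split(x)
        child = self.children.get(i)
        if child is None:
            return                        # x was never stored
        child.delete(lo)
        if child._empty():
            self.aux.delete(i)
        if x == self.max:
            if self.aux._empty():
                self.max = self.min
            else:
                j = self.aux.max
                self.max = self._join(j, self.children[j].max)
```

For example, after inserting 1, 2, 3, 5, 8 and 10 into VEB(4), find_next(4) returns 5 and find_next(11) returns 16, the universe size, to signal that no next element exists.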
Discussion
The assumption that log m is an integer is unnecessary. The operations x / M^(1/2) and x % M^(1/2) can be replaced
by taking only the higher-order ceil(m/2) and the lower-order floor(m/2) bits of x, respectively. On any existing
machine, this is more efficient than division or remainder computations.
The implementation described above uses pointers and occupies a total space of O(M) = O(2^m). This can be
seen as follows. The recurrence is S(M) = O(M^(1/2)) + (M^(1/2) + 1)·S(M^(1/2)). One can, fortunately, also show
that S(M) = O(M) by induction.
References
[1] Peter van Emde Boas: Preserving order in a forest in less than logarithmic time (Proceedings of the 16th Annual
Symposium on Foundations of Computer Science 10: 75-84, 1975)
[2] Gudmund Skovbjerg Frandsen: Dynamic algorithms: Course notes on van Emde Boas trees (PDF)
(http://www.daimi.au.dk/~gudmund/dynamicF04/vEB.pdf) (University of Aarhus, Department of Computer Science)
[3] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Third
Edition. MIT Press, 2009. ISBN 0-262-53305-8. Chapter 20: The van Emde Boas tree, pp. 531–560.
Further reading
Erik Demaine, Shantonu Sen, and Jeff Lindy. Massachusetts Institute of Technology. 6.897: Advanced Data
Structures (Spring 2003). Lecture 1 notes: Fixed-universe successor problem, van Emde Boas (http://theory.
csail.mit.edu/classes/6.897/spring03/scribe_notes/L1/lecture1.pdf). Lecture 2 notes: More van Emde Boas,
... (http://theory.csail.mit.edu/classes/6.897/spring03/scribe_notes/L2/lecture2.pdf).
van Emde Boas, P.; Kaas, R.; Zijlstra, E. (1976). "Design and implementation of an efficient priority queue"
(http://www.springerlink.com/content/h63507n460256241/). Mathematical Systems Theory 10: 99–127. doi:
10.1007/BF01683268 (http://dx.doi.org/10.1007/BF01683268).
Fusion tree
A fusion tree is a type of tree data structure that implements an associative array on w-bit integers. It uses O(n)
space and performs searches in O(log_w n) time, which is asymptotically faster than a traditional self-balancing binary
search tree, and actually better than the van Emde Boas tree when w is large. It achieves this speed by exploiting
certain constant-time operations that can be done on a machine word. Fusion trees were invented in 1990 by Michael
Fredman and Dan Willard.[1]
Several advances have been made since Fredman and Willard's original 1990 paper. In 1999[2] it was shown how to
implement fusion trees under the AC^0 model, in which multiplication no longer takes constant time. A dynamic
version of fusion trees using hash tables was proposed in 1996[3], which matched the O(log_w n) runtime in
expectation. Another dynamic version using exponential trees was proposed in 2007[4], which yields worst-case
runtimes of O(log_w n + log log u) per operation, where u is the size of the largest key. It remains open whether
dynamic fusion trees can achieve O(log_w n) per operation with high probability.
How it works
A fusion tree is essentially a B-tree with branching factor of w^(1/5) (any small exponent is also possible), which gives
it a height of O(log_w n). To achieve the desired runtimes for updates and queries, the fusion tree must be able to
search a node containing up to w^(1/5) keys in constant time. This is done by compressing ("sketching") the keys so that
all can fit into one machine word, which in turn allows comparisons to be done in parallel. The rest of this article
will describe the operation of a static fusion tree; that is, only queries are supported.
Sketching
Sketching is the method by which each w-bit key at a node containing k keys is compressed into only k-1 bits. Each
key x may be thought of as a path in the full binary tree of height w starting at the root and ending at the leaf
corresponding to x. To distinguish two paths, it suffices to look at their branching point (the first bit where the two
keys differ). All k paths together have k-1 branching points, so at most k-1 bits are needed to distinguish any two of
the k keys.
An important property of the sketch function is that it preserves the order of the keys. That is, sketch(x) <
sketch(y) for any two keys x < y.
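The branching-point argument can be checked with a short sketch. The function names below are assumptions for this example, and the bit extraction loops over the positions one by one, so it illustrates what the sketch contains rather than how a fusion tree computes it in constant time. It relies on the fact that the branching points of all pairs of keys already occur between keys that are adjacent in sorted order.

```python
def branching_bits(keys):
    """Bit positions (0 = least significant) at which the paths of distinct keys branch."""
    keys = sorted(keys)
    positions = set()
    for a, b in zip(keys, keys[1:]):
        # The most significant bit in which two adjacent keys differ
        # is the point where their root-to-leaf paths separate.
        positions.add((a ^ b).bit_length() - 1)
    return sorted(positions, reverse=True)


def sketch(x, positions):
    """Concatenate the bits of x at the given positions, most significant first."""
    s = 0
    for p in positions:
        s = (s << 1) | ((x >> p) & 1)
    return s


keys = [0b0000, 0b0010, 0b1100, 0b1111]
positions = branching_bits(keys)               # [3, 1]: at most k - 1 positions
print([sketch(x, positions) for x in keys])    # [0, 1, 2, 3], same order as the keys
```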
3. (b_r + m_r) - (b_1 + m_1) ≤ r^4. That is, the sketch bits are packed into a range of size at most r^4.
An inductive argument shows how the m_i can be constructed. Let m_1 = w - b_1. Suppose that 1 < t ≤ r and that m_1,
m_2, ..., m_(t-1) have already been chosen. Then pick the smallest integer m_t such that both properties (1) and (2) are
satisfied. Property (1) requires that m_t ≠ b_i - b_j + m_l for all 1 ≤ i, j ≤ r and 1 ≤ l ≤ t-1. Thus, there are
less than t·r^2 ≤ r^3 values that m_t must avoid. Since m_t is chosen to be minimal,
(b_t + m_t) ≤ (b_(t-1) + m_(t-1)) + r^3. This implies Property (3).
The approximate sketch is thus computed as follows:
1. Mask out all but the sketch bits with a bitwise AND.
2. Multiply the key by the predetermined constant m. This operation actually requires two machine words, but this
can still be done in constant time.
3. Mask out all but the shifted sketch bits. These are now contained in a contiguous block of at most r^4 < w^(4/5) bits.
For the rest of this article, sketching will be taken to mean approximate sketching.
Parallel comparison
The purpose of the compression achieved by sketching is to allow all of the keys to be stored in one w-bit word. Let
the node sketch of a node be the bit string
1sketch(x_1)1sketch(x_2)...1sketch(x_k)
We can assume that the sketch function uses exactly b ≤ r^4 bits. Then each block uses 1 + b ≤ w^(4/5) bits, and
since k ≤ w^(1/5), the total number of bits in the node sketch is at most w.
A brief notational aside: for a bit string s and nonnegative integer m, let s^m denote the concatenation of s to itself m
times. If t is also a bit string, st denotes the concatenation of t to s.
The node sketch makes it possible to search the keys for any b-bit integer y. Let z = (0y)^k, which can be computed in
constant time (multiply y by the constant (0^b 1)^k). Note that 1sketch(x_i) - 0y is always positive, but preserves its
leading 1 iff sketch(x_i) ≥ y. We can thus compute the smallest index i such that sketch(x_i) ≥ y as follows:
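1. Subtract z from the node sketch.
2. Take the bitwise AND of the difference and the constant (10^b)^k. This clears all but the leading bit of each block.
3. Find the most significant bit of the result.
4. Compute i from the position of that bit, using the fact that each block is b+1 bits wide.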
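The comparison can be simulated with Python integers standing in for the machine word. This is a sketch under stated assumptions: the function names are hypothetical, the sketches are given in increasing order with sketch(x_1) in the most significant block, and bit_length() plays the role of the constant-time most-significant-bit computation.

```python
def node_sketch(sketches, b):
    """Pack 1 sketch(x_1) 1 sketch(x_2) ... into one integer, x_1 in the highest block."""
    word = 0
    for s in sketches:
        word = (word << (b + 1)) | (1 << b) | s
    return word


def parallel_rank(sketches, b, y):
    """1-based index of the first sketch >= y, or None if every sketch is smaller."""
    k = len(sketches)
    node = node_sketch(sketches, b)
    ones = sum(1 << (j * (b + 1)) for j in range(k))  # the constant (0^b 1)^k
    z = y * ones                                      # (0y)^k: y copied into every block
    flags = (node - z) & (ones << b)                  # keep only the leading bit of each block
    if flags == 0:
        return None
    p = flags.bit_length() - 1                        # position of the most significant set bit
    return k - (p - b) // (b + 1)                     # convert the bit position to a block index


print(parallel_rank([0, 1, 2, 3], 2, 2))  # 3: the third sketch is the first one >= 2
```

Because every block starts with a 1 and y fits in b bits, no subtraction borrows from a neighbouring block, which is what lets a single subtraction compare all k sketches at once.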
Desketching
For an arbitrary query q, parallel comparison computes the index i such that
sketch(x_(i-1)) ≤ sketch(q) ≤ sketch(x_i)
Unfortunately, the sketch function is not in general order-preserving outside the set of keys, so it is not necessarily
the case that x_(i-1) ≤ q ≤ x_i. What is true is that, among all of the keys, either x_(i-1) or x_i has the longest
common prefix with q. This is because any key y with a longer common prefix with q would also have more sketch bits
in common with q, and thus sketch(y) would be closer to sketch(q) than any sketch(x_j).
The length of the longest common prefix between two w-bit integers a and b can be computed in constant time by finding
the most significant bit of the bitwise XOR between a and b. This can then be used to mask out all but the longest
common prefix. In what follows, let p denote the longest common prefix of q with x_(i-1) or x_i, whichever is longer.
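The XOR trick can be written out as follows. The function names are assumptions for this sketch, and Python's int.bit_length() stands in for the constant-time most-significant-bit operation.

```python
def lcp_length(a, b, w):
    """Number of leading bits shared by the w-bit integers a and b."""
    # The highest set bit of a XOR b marks the first position where they differ.
    return w - (a ^ b).bit_length()


def lcp_mask(a, b, w):
    """Return a with every bit below the common prefix cleared."""
    length = lcp_length(a, b, w)
    mask = ((1 << length) - 1) << (w - length)
    return a & mask


# 1011 and 1001 agree on their first two bits, "10".
print(lcp_length(0b1011, 0b1001, 4))     # 2
print(bin(lcp_mask(0b1011, 0b1001, 4)))  # 0b1000
```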
Note that p identifies exactly where q branches off from the set of keys. If the next bit of q is 0, then the successor of
q is contained in the p1 subtree, and if the next bit of q is 1, then the predecessor of q is contained in the p0 subtree.
This suggests the following algorithm:
1. Use parallel comparison to find the index i such that sketch(x_(i-1)) ≤ sketch(q) ≤ sketch(x_i).
2. Compute the longest common prefix p of q and either x_(i-1) or x_i (taking the longer of the two).
3. Let l-1 be the length of the longest common prefix p.
1. If the l-th bit of q is 0, let e = p10^(w-l). Use parallel comparison to search for the successor of sketch(e). This
is the actual successor of q.
2. If the l-th bit of q is 1, let e = p01^(w-l). Use parallel comparison to search for the predecessor of sketch(e).
This is the actual predecessor of q.
4. Once either the predecessor or successor of q is found, the exact position of q among the set of keys is
determined.
References
[1] M. L. Fredman and D. E. Willard. BLASTING through the information theoretic barrier with FUSION TREES. Proceedings of the
twenty-second annual ACM symposium on Theory of Computing, 1-7, 1990.
[2] A. Andersson, P. B. Miltersen, and M. Thorup. Fusion trees can be implemented with AC0 instructions only. Theoretical Computer Science,
215:337-344, 1999.
[3] R. Raman. Priority queues: Small, monotone, and trans-dichotomous. Algorithms - ESA 96, 121-137, 1996.
[4] A. Andersson and M. Thorup. Dynamic ordered sets with exponential search trees. Journal of the ACM, 54:3:13, 2007.
License
Creative Commons Attribution-Share Alike 3.0
//creativecommons.org/licenses/by-sa/3.0/