String Matching
@Copyright:AM
Date:13.10.2014
The naive string-matching algorithm
The naive algorithm finds all valid shifts using a loop that checks the condition P[1
. . m] =T[s + 1 . . s + m] for each of the n - m +1 possible values of s.
NAI VE- STRI NG- MATCHER( T, P)
1 n length[ T]
2 m l engt h[ P]
3 for s 0 to n - m
4 do if P[ 1 . . m] = T[ s + 1 . . s + m]
5 then pr i nt " Pat t er n occur s wi t h shi f t " s
The naive string-matching procedure can be interpreted graphically as sliding a
"template" containing the pattern over the text, noting for which shifts all of the
characters on the template equal the corresponding characters in the text, as
illustrated in Figure. The for loop beginning on line 3 considers each possible shift
explicitly. The test on line 4 determines whether the current shift is valid or not;
this test involves an implicit loop to check corresponding character positions until
all positions match successfully or a mismatch is found. Line 5 prints out each
valid shift s.
Figure: The operation of the naive string matcher for the pattern P
= aab and the text T = acaabc. We can imagine the pattern P as a "template"
that we slide next to the text. Parts (a)-(d) show the four successive
alignments tried by the naive string matcher. In each part, vertical lines
connect corresponding regions found to match (shown shaded), and a
jagged line connects the first mismatched character found, if any. One
occurrence of the pattern is found, at shift s = 2, shown in part (c).
Procedure NAI VE- STRI NG-MATCHER takes time ((n - m +1)m) in the worst case. For
example, consider the text string a
n
(a string of n a's) and the pattern a
m
. For each of
the n - m +1 possible values of the shift s, the implicit loop on line 4 to compare
corresponding characters must execute m times to validate the shift. The worst-
case running time is thus ((n - m +1)m), which is (n
2
) if m = n/2.
As we shall see, NAI VE- STRI NG-MATCHER is not an optimal procedure for this
problem. Indeed, in this chapter we shall show an algorithm with a worst-case
String Matching
@Copyright:AM
Date:13.10.2014
running time of O(n +m). The naive string-matcher is inefficient because
information gained about the text for one value of s is totally ignored in
considering other values of s. Such information can be very valuable, however. For
example, if P =aaab and we find that s =0 is valid, then none of the shifts 1, 2, or
3 are valid, since T[4] =b. In the following sections, we examine several ways to
make effective use of this sort of information.
Knuth-Morris-Pratt Algorithm
The Partial Match Table
The key to KMP, of course, is the partial match table. The main obstacle
between me and understanding KMP was the fact that I didnt quite fully
grasp what the values in the partial match table really meant. I will now try
to explain them in the simplest words possible.
Heres the partial match table for the pattern abababca:
1
2
3
char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
If I have an eight-character pattern (lets say abababca for the duration of
this example), my partial match table will have eight cells. If Im looking at
the eighth and last cell in the table, Im interested in the entire pattern
(abababca). If Im looking at the seventh cell in the table, Im only
interested in the first seven characters in the pattern (abababc); the
eighth one (a) is irrelevant, and can go fall off a building or something. If
Im looking at the sixth cell of the in the table you get the idea. Notice that
I havent talked about what each cell means yet, but just what its referring
to.
Now, in order to talk about the meaning, we need to know about proper
prefixes and proper suffixes.
String Matching
@Copyright:AM
Date:13.10.2014
Proper prefix: All the characters in a string, with one or more cut off the
end. S, Sn, Sna, and Snap are all the proper prefixes of Snape.
Proper suffix: All the characters in a string, with one or more cut off the
beginning. agrid, grid, rid, id, and d are all proper suffixes of
Hagrid.
With this in mind, I can now give the one-sentence meaning of the values in
the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a
proper suffix in the same (sub)pattern.
Lets examine what I mean by that. Say were looking in the third cell. As
youll remember from above, this means were only interested in the first
three characters (aba). In aba, there are two proper prefixes (a and
ab) and two proper suffixes (a and ba). The proper prefix ab does
not match either of the two proper suffixes. However, the proper prefix a
matches the proper suffix a. Thus, the length of the longest proper prefix
that matches a proper suffix, in this case, is 1.
Lets try it for cell four. Here, were interested in the first four characters
(abab). We have three proper prefixes (a, ab, and aba) and three
proper suffixes (b, ab, and bab). This time, ab is in both, and is two
characters long, so cell four gets value 2.
Just because its an interesting example, lets also try it for cell five, which
concerns ababa. We have four proper prefixes (a, ab, aba, and
abab) and four proper suffixes (a, ba, aba, and baba). Now, we
have two matches: a and aba are both proper prefixes and proper
suffixes. Since aba is longer than a, it wins, and cell five gets value 3.
Lets skip ahead to cell seven (the second-to-last cell), which is concerned
with the pattern abababc. Even without enumerating all the proper
prefixes and suffixes, it should be obvious that there arent going to be any
String Matching
@Copyright:AM
Date:13.10.2014
matches; all the suffixes will end with the letter c, and none of the prefixes
will. Since there are no matches, cell seven gets 0.
Finally, lets look at cell eight, which is concerned with the entire pattern
(abababca). Since they both start and end with a, we know the value will
be at least 1. However, thats where it ends; at lengths two and up, all the
suffixes contain a c, while only the last prefix (abababc) does. This seven-
character prefix does not match the seven-character suffix (bababca), so
cell eight gets 1.
How to use the Partial Match Table
We can use the values in the partial match table to skip ahead (rather than
redoing unnecessary old comparisons) when we find partial matches. The
formula works like this:
If a partial match of length partial_match_length is found
and t abl e[ par t i al _mat ch_l engt h] > 1, we may skip
ahead par t i al _mat ch_l engt h - t abl e[ par t i al _mat ch_l engt h -
1] characters.
Lets say were matching the pattern abababca against the text
bacbababaabcbab. Heres our partial match table again for easy reference:
1
2
3
char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
The first time we get a partial match is here:
1
2
3
bacbababaabcbab
|
Abababca
This is a partial_match_length of 1. The value
at t abl e[ par t i al _mat ch_l engt h - 1] (or t abl e[ 0] ) is 0, so we dont get to
skip ahead any. The next partial match we get is here:
String Matching
@Copyright:AM
Date:13.10.2014
1
2
3
bacbababaabcbab
|||||
abababca
This is a partial_match_length of 5. The value
at t abl e[ par t i al _mat ch_l engt h - 1] (or t abl e[ 4] ) is 3. That means we get
to skip ahead par t i al _mat ch_l engt h - t abl e[ par t i al _mat ch_l engt h -
1] (or 5 - t abl e[ 4] or 5 - 3 or 2) characters:
1
2
3
4
5
// x denotes a skip
bacbababaabcbab
xx|||
abababca
This is a partial_match_length of 3. The value
at t abl e[ par t i al _mat ch_l engt h - 1] (or t abl e[ 2] ) is 1. That means we get
to skip ahead par t i al _mat ch_l engt h - t abl e[ par t i al _mat ch_l engt h -
1] (or 3 - t abl e[ 2] or 3 - 1 or 2) characters:
1
2
3
4
5
// x denotes a skip
bacbababaabcbab
xx|
abababca
At this point, our pattern is longer than the remaining characters in the
text, so we know theres no match.
The Knuth-Morris-Pratt matching algorithm is given in pseudocode below as the
procedure KMP-MATCHER. It is mostly modeled after FI NI TE-AUTOMATON-MATCHER, as
we shall see. KMP-MATCHER calls the auxiliary procedure COMPUTE-PREFI X-
FUNCTI ON to compute .
KMP- MATCHER( T, P)
1 n length[ T]
2 m length[ P]
3 COMPUTE- PREFI X- FUNCTI ON( P)
4 q 0
5 for i 1 to n
6 do while q > 0 and P[ q + 1] T[ i]
7 do q [ q]
8 if P[ q + 1] = T[ i]
9 then q q + 1
10 if q == m
String Matching
@Copyright:AM
Date:13.10.2014
11 then pr i nt " Pat t er n occur s wi t h shi f t " i - m
12 q [ q]
COMPUTE- PREFI X- FUNCTI ON( P)
1 m length[ P]
2 [ 1] 0
3 k 0
4 for q 2 to m
5 do while k > 0 and P[ k + 1] P[ q]
6 do k [k]
7 if P[ k + 1] = P[ q]
8 then k k + 1
9 [q] k
10 return
We begin with an analysis of the running times of these procedures. Proving these
procedures correct will be more complicated.
Running-time analysis
The running time of COMPUTE-PREFI X-FUNCTI ON is O(m), using an amortized analysis
(see Chapter 18). We associate a potential of k with the current state k of the
algorithm. This potential has an initial value of 0, by line 3. Line 6
decreases k whenever it is executed, since [k] <k. Since [k] 0 for
all k, however, k can never become negative. The only other line that affects k is
line 8, which increases k by at most one during each execution of the for loop
body. Since k <q upon entering the for loop, and since q is incremented in each
iteration of the for loop body, k <q always holds. (This justifies the claim that
[q] <q as well, by line 9.) We can pay for each execution of the while loop body
on line 6 with the corresponding decrease in the potential function, since [k]
<k. Line 8 increases the potential function by at most one, so that the amortized
cost of the loop body on lines 5-9 is O(1). Since the number of outer-loop
iterations is O(m), and since the final potential function is at least as great as the
initial potential function, the total actual worst-case running time of COMPUTE-
PREFI X-FUNCTI ON is O(m).
The Knuth-Morris-Pratt algorithm runs in time O(m + n). The call of COMPUTE-
PREFI X-FUNCTI ON takes O(m) time as we have just seen, and a similar amortized
analysis, using the value of q as the potential function, shows that the remainder of
KMP-MATCHER takes O(n) time.
String Matching
@Copyright:AM
Date:13.10.2014
Compared to Fl NI TE-AUTOMATON-MATCHER, by using rather than , we have
reduced the time for preprocessing the pattern from O(m | |) to O(m), while
keeping the actual matching time bounded by O(m + n).
10/14/2014
4/13
Matching with Finite State Automata
Finite state automata are machines that have a finite number of states, and move between states as
they read input one symbol at a time.
Algorithms that construct (or simulate) finite state automata can be very efficient matchers, but they
require some preprocessing to construct.
Finite State Automata
Finite state automata are widely used in computability theory as well as in practical algorithms. It's
worth knowing about them even if you are not doing string matching.
A finite state automaton (FSA) or finite automaton M is a 5-tuple (Q, q
0
, A, , ) where
Q is a finite set of states
q
0
Q is the start state
A Q is a set of accepting states
is a finite input alphabet
: Q x Q is the transition function of M.
The FSA starts in state q
0
. As each character (symbol) of the input string is read, it uses to determine
what state to transition into. Whenever M is in a state of A, the input read so far is accepted.
Here is a simple example: q
0
= 0 and A = {1}. (What are Q, , and ?)
10/14/2014
5/13
What strings does this FSA accept?
We define the final state function :
*
Q such that (w) is the final state M ends up in after
reading w:
() = q
0
(wa) = ((w), a) for w
*
, a
String Matching Automata
Let's see how they work by an example. This is the string matching automaton for P = ababaca:
The start state is 0 and the only accepting state is 7. Any transitions not shown
(e.g., if c is read while in states 1, 2, 3, 4, 6, and 7) are assumed to go back to
state 0.
This automaton can be represented with the table to the right. The shading
shows the sequence of states through which a successful match transitions.
These transitions correspond to the darkened arrows in the diagram.
We can run this automaton continuously on a text T, and if and when state 7 is
entered output the relevant positions: the match will start with shift i m or at
position i m + 1.
The following code simulates any FSA on input text T, given the FSA's table
and pattern length m:
For example, below is a run on T = abababacaba, which includes P = ababaca starting at position
3: ab ababaca ba.
10/14/2014
6/13
State 7 is reached at i = 9, so the pattern occurs starting at i m + 1, or 9 7 +
1 = 3. The FSA keeps going after a match, as it may find other occurrences of
the pattern.
Unlike the brute force approach, past work is not thrown away: the transitions
following either failure to match a character or success to match the entire
pattern pick up with the input read so far.
For example,
After Failure: At i = 5, ababa has been matched in
T[1 .. 5] and c was expected but not found at T[6].
Rather than starting over, the FSA transitions to state
(5, b) = 4 to indicate that the pattern prefix P
4
=
abab has matched the present text suffix T[3 .. 6].
After Success: At, i = 9, we are in state 7 (success), and a b is seen. We need not start over: the
FSA transitions to state (7, b) = 2 to reflect the fact that there is already a match to the prefix
P
2
= ab at T[9 .. 11].
This makes FSAs much more efficient than brute force in the matching phase. In fact, matching is
(n). But how do we build them?
Constructing Finite State Automata (Preprocessing Phase)
In general, the FSA is constructed so that the state number tells us how much of a prefix of P has been
matched.
If the pattern P is of length m and the FSA is in state m, then the pattern has been matched.
If the state number is smaller than m, then the state number is the length of the prefix of P
matched.
Another definition is needed to formalize this.
Definitions and Strategy
The suffix function corresponding to P of length m is
P
:
*
{0, 1, ... m} such that
P
(w) is the
length of the longest prefix of P that is also a suffix of x:
P
(w) = max {k : P
k
w}.
For example, if P = ab then
() = 0
(ccaca) = 1
(ccab) = 2
10/14/2014
7/13
(For simplicity, we leave out the subscript P when it is clear from the context.) Then we can define
the automaton for pattern P[1 .. m] as:
Q = {0, 1 .. m}
q
0
= 0
A = {m}
is a superset of the characters in P
(q, a) = (P
q
a) for any state q and character a. (P
q
a is the concatenation of the first q
characters of P with the character a.)
By defining (q, a) = (P
q
a), the state of the FSA keeps track of the longest prefix of the pattern P
that has matched the input text T so far.
In order for a substring of T ending at T[i] to match some prefix P
j
, then this prefix P
j
must be a suffix
of T[i].
We design such that the state q = (T[i]) gives the length of the longest prefix of P that matches a
suffix of T. We have:
q = (T
i
) = (T
i
), and P
q
T
i
(P
q
is a suffix of T
i
).
Given a match so far to P
q
(which may be ) and reading character a, there are two kinds of
transitions:
When a = P[q + 1], a continues to match the pattern, so (q, a) = q + 1 (going along the dark
"spine" arrows of the example).
When a P[q + 1], a fails to match the pattern. The preprocessing algorithm given below
matches the pattern against itself to identify the longest smaller prefix of P that is still
matched.
An example of this second case was already noted above for
(5, b) = 4.
Preprocessing Procedure
The following procedure computes the transition function from a pattern P[1 .. m].
10/14/2014
8/13
The nested loops process all combinations of states q and characters a needed for the cells of the table
representing .
Lines 4-8 set (q, a) to the largest k such that P
k
P
q
a.
The preprocessor is matching P against itself.
Thus, knowledge of the structure of P is used to retain information about the match so far, even
when matches fail.
By starting at the largest possible value of k (line 4) and working down (lines 5-7) we guarantee
that we get the longest prefix of P that has been matched so far.
If the match succeeds at k = q + 1 then this transition indicates a successful match for the
current q and a.
The loop is guaranteed to end at k = 0, because P
0
= is a suffix of any string.
Analysis
Compute-Transition-Function requires m*|| for the
nested outer loops.
Within these loops, the inner repeat runs at most m + 1
times; and the test on line 7 can require comparing up to m
characters. Together lines 5-7 contribute O(m
2
).
Therefore, Compute-Transition-Function is O(m
3
||).
This is rather expensive preprocessing, but the (n)
matching is the best that can be expected.
You an see C code and a Java applet animation at http://www-igm.univ-
mlv.fr/~lecroq/string/nod43.html