String Matching
Algorithms
Presented by
Swapan Shakhari
Under the guidance of
Dr. Prasun Ghosal
What is String Matching?
• Checking whether two or more strings are
the same.
• Finding a string (the pattern) inside another
string (the text), i.e. looking for a substring.
Text: ATGCTTATCG
Pattern: ATC
Algorithms Discussed
• Knuth–Morris–Pratt algorithm
• Boyer–Moore string search algorithm
• Bitap algorithm (for exact string searching)
Knuth–Morris–Pratt Algorithm
Knuth–Morris–Pratt algorithm
Inventors
• Donald Knuth
• Vaughan Pratt and
• James H. Morris.
Knuth–Morris–Pratt algorithm
Outline of the Algorithm
• The Knuth–Morris–Pratt string searching
algorithm (or KMP algorithm) searches for
occurrences of a "word" W within a main "text
string" S.
• It employs the observation that, when a
mismatch occurs, the word itself embodies
sufficient information to determine where the
next match could begin.
• It thus bypasses re-examination of previously
matched characters.
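The "information embodied in the word" is usually stored as a failure table computed from W alone. A minimal sketch follows, assuming the common prefix-function formulation; the name `prefix_function` is illustrative and not taken from the slides.

```python
def prefix_function(word: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of word[:i+1]
    that is also a suffix of word[:i+1]."""
    fail = [0] * len(word)
    k = 0  # length of the border currently being extended
    for i in range(1, len(word)):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and word[i] != word[k]:
            k = fail[k - 1]
        if word[i] == word[k]:
            k += 1
        fail[i] = k
    return fail


print(prefix_function("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```

The entry 2 at index 5 records that "AB" is both a prefix and a suffix of "ABCDAB", which is what lets the search keep those two characters after a mismatch.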
Knuth–Morris–Pratt algorithm
Worked example
• Let, W = "ABCDABD" and
S = "ABC ABCDAB ABCDABCDABDE".
• At any given time, the algorithm is in a state
determined by two integers:
– m, denoting the position within S where the
prospective match for W begins,
– i, denoting the index of the currently considered
character in W.
Knuth–Morris–Pratt algorithm
Worked example
• In each step we compare S[m+i] with W[i] and
advance i if they are equal. At the start of the
run (m = 0, i = 0) the alignment is:
m: 01234567890123456789012
S: ABC ABCDAB ABCDABCDABDE
W: ABCDABD
i: 0123456
Knuth–Morris–Pratt algorithm
Worked example
• We proceed by comparing successive
characters of W to "parallel" characters of S,
moving from one to the next if they match.
• In the fourth step, we get S[3] = ' ' and W[3] =
'D', a mismatch.
Knuth–Morris–Pratt algorithm
Worked example
• Rather than beginning to search again at S[1],
we note that no 'A' occurs between positions
0 and 3 in S, except at 0.
Knuth–Morris–Pratt algorithm
Worked example
• Hence, having checked all those characters
previously, we know that there is no chance of
finding the beginning of a match if we check
them again.
Knuth–Morris–Pratt algorithm
Worked example
• Therefore, we move on to the next character,
setting m = 4 and i = 0.
Knuth–Morris–Pratt algorithm
Worked example
• Matching from m = 4 succeeds through "ABCDAB";
at W[6] = 'D' and S[10] = ' ', we again have a mismatch.
Knuth–Morris–Pratt algorithm
Worked example
• The algorithm passed an "AB", which could be
the beginning of a new match.
– Since those two characters are already known to
match, the algorithm simply resets m = 8, i = 2.
Knuth–Morris–Pratt algorithm
Worked example
• This attempt fails at its first comparison:
W[2] = 'C' but S[10] = ' '.
– reset m = 11, i = 0.
Knuth–Morris–Pratt algorithm
Worked example
• We again have a mismatch.
– W[6] == 'D' but S[17] == 'C'.
Knuth–Morris–Pratt algorithm
Worked example
• Reasoning as before (S[15] == W[0]), we set m
= 15 and, because the two-character string "AB"
is already matched, set i = 2.
• Found a match at S[15].
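The worked example can be replayed with a short sketch that keeps the same two state variables m and i. It assumes the prefix-function form of the failure table; the name `kmp_search` and the choice to return 0 for an empty word are illustrative assumptions, not part of the slides.

```python
def kmp_search(s: str, w: str) -> int:
    """Return the 0-based start of the first occurrence of w in s, or -1."""
    if not w:
        return 0  # assumption: an empty word matches at position 0
    # Failure table: fail[j] = longest proper border of w[:j+1].
    fail = [0] * len(w)
    k = 0
    for j in range(1, len(w)):
        while k > 0 and w[j] != w[k]:
            k = fail[k - 1]
        if w[j] == w[k]:
            k += 1
        fail[j] = k

    m = 0  # position in s where the prospective match begins
    i = 0  # index of the currently considered character in w
    while m + i < len(s):
        if s[m + i] == w[i]:
            i += 1
            if i == len(w):
                return m
        elif i > 0:
            # Keep the fail[i-1] characters that are already known to match.
            m += i - fail[i - 1]
            i = fail[i - 1]
        else:
            m += 1
    return -1


print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))  # 15
```

On this input the sketch passes through the states (m = 8, i = 2), (m = 11, i = 0) and (m = 15, i = 2) described above. It also visits m = 3 and m = 10 for a single failed comparison each, a step the slides fold into the jump to m = 4 and m = 11.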
Boyer–Moore string search
Algorithm
The standard benchmark in the practical
string-search literature.
Boyer–Moore string search
Algorithm
Inventors
• Robert S. Boyer and
• J Strother Moore
• in 1977
Boyer–Moore string search
Algorithm
Some Definitions Required
• S[i] refers to the character at index i of
string S, counting from 1.
• S[i..j] refers to the substring of string S starting
at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in
range [1, n], where n is the length of S.
Boyer–Moore string search
Algorithm
Some Definitions Required
• A suffix of S is a substring S[i..n] for some i in
range [1, n], where n is the length of S.
• The string to be searched for is called
the pattern and is referred to with symbol P.
• The string being searched in is called
the text and is referred to with symbol T.
Boyer–Moore string search
Algorithm
Some Definitions Required
• The length of P is n.
• The length of T is m.
• An alignment of P to T is an index k in T such
that the last character of P is aligned with
index k of T.
• A match or occurrence of P occurs at an
alignment if P is equivalent to T[(k-n+1)..k].
Boyer–Moore string search
Algorithm
Explanation
The Boyer-Moore algorithm searches for
occurrences of P in T by performing explicit
character comparisons at different
alignments. Instead of a brute-force search of
all alignments (of which there are m - n + 1),
Boyer-Moore uses information gained by
preprocessing P to skip as many alignments as
possible.
Boyer–Moore string search
Algorithm
Explanation
The algorithm begins at alignment k = n,
so the start of P is aligned with the start of T.
Characters in P and T are then compared
starting at index n in P and k in T , moving
backward: the strings are matched from the
end of P to the start of P.
Boyer–Moore string search
Algorithm
Explanation
The comparisons continue until either the
beginning of P is reached (which means there
is a match) or a mismatch occurs, upon which
the alignment is shifted to the right by the
maximum value permitted by a number of rules.
Boyer–Moore string search
Algorithm
Explanation
The comparisons are performed again at
the new alignment, and the process repeats
until the alignment is shifted past the end
of T, which means no further matches will be
found.
The shift rules are implemented as
constant-time table lookups, using tables
generated during the preprocessing of P.
Boyer–Moore string search
Algorithm
Explanation
Shift Rules
A shift is calculated by applying two rules:
the bad character rule and the good suffix
rule. The actual shifting offset is the maximum
of the shifts calculated by these rules.
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
The idea of Bad Character Rule is to shift P
more than 1 character when possible.
For each character x, let R(x) be the position
of the right-most occurrence of character x in
P.
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
R(x) is defined to be zero if x does not occur in
P.
Time to construct table R: O(n) – length of P.
Space used by R: O(|∑|)
Access time of R: O(1)
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
Example of R
Pattern P = A C C T T T
x:     O/W  A  C  T
R(x):   0   1  3  6
(O/W = any other character)
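The table R for this example can be built in one left-to-right pass over P. A small sketch follows, using 1-based positions as in the slides; the name `rightmost_positions` is an illustrative assumption.

```python
def rightmost_positions(pattern: str) -> dict[str, int]:
    """Map each character of pattern to its right-most 1-based position."""
    r = {}
    for pos, ch in enumerate(pattern, start=1):
        r[ch] = pos  # later occurrences overwrite earlier ones
    return r


R = rightmost_positions("ACCTTT")
print(R)              # {'A': 1, 'C': 3, 'T': 6}
print(R.get("G", 0))  # 0, because G does not occur in P
```

Using a dictionary with a default of 0 plays the role of the O/W column; a fixed array indexed by character code would give the O(|∑|) space figure quoted above.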
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
In a particular alignment of P against T, let the
rightmost n-i characters of P match the
corresponding characters in T, and let the character
P(i) fail to match the text character below it, T(k)
(here k is the text index of the mismatch). Let
j = R(T(k)) be the rightmost position of character
T(k) in P.
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
If j < i, then shift P to the right so that P[j] is
aligned below T[k], that is, shift by i - j.
In general: shift by max{1, i - R(T(k))}.
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
• If j > i, then shift P to the right by 1.
• If R(T(k)) = 0, that is, T(k) does not occur in P,
align P[1,…,n] with T[k+1,…,k+n].
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Bad Character Rule
x:     O/W  A  C  T
R(x):   0   1  3  6
T:        G A A C C T T T
P:        A C C T T T        (mismatch at i = 5, T(k) = C, R(C) = 3)
P shift:      A C C T T T    (shift by 5 - 3 = 2)
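The example above can be checked with a search loop that applies only the bad character rule. This is a sketch under stated assumptions: it keeps the 1-based indices of the slides, omits the good suffix rule entirely, and the name `bad_character_search` is illustrative.

```python
def bad_character_search(text: str, pattern: str) -> int:
    """Return the 1-based start of the first match, or 0 if there is none
    (0 is also returned for an empty pattern)."""
    n, m = len(pattern), len(text)  # n = |P|, m = |T|, as in the slides
    if n == 0 or n > m:
        return 0
    # R(x): right-most 1-based position of x in P; missing characters count as 0.
    R = {ch: pos for pos, ch in enumerate(pattern, start=1)}
    k = n  # alignment: text index under the last character of P
    while k <= m:
        i, h = n, k  # i walks backward through P, h through T
        while i >= 1 and pattern[i - 1] == text[h - 1]:
            i -= 1
            h -= 1
        if i == 0:
            return k - n + 1  # all of P matched
        # Mismatch at P(i) against T(h): apply the bad character shift.
        k += max(1, i - R.get(text[h - 1], 0))
    return 0


print(bad_character_search("GAACCTTT", "ACCTTT"))  # 3
```

For T = GAACCTTT the first alignment mismatches at i = 5 against C, the shift is max(1, 5 - 3) = 2, and the second alignment matches at text position 3.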
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: Suppose for a given alignment
of P and T, a substring t of T matches a suffix
of P, but a mismatch occurs at the next
comparison to the left.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
(t marks the suffix of P matched in T)
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: Then find, if it exists, the right-most
copy t' of t in P such that t' is not a suffix of P and the
character to the left of t' in P differs from the
character to the left of t in P.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
(t' and t mark the two copies within P)
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: Shift P to the right so that
substring t' in P aligns with substring t in T.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
(t' and t mark the two copies within P)
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
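Description: If no such copy t' exists, shift P by the
least amount so that a prefix of the shifted P matches
a suffix of t. If no such shift is possible, then
shift P by n places to the right.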
(Example with different text and pattern)
T = A T G G C A T G A A G A A A G A A T T G A T
P = A G A A G A A
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: If no such shift is possible, then
shift P by n places to the right.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
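The mismatch cases of the good suffix rule can be sketched as a brute-force function. The assumptions are stated plainly: real implementations precompute these shifts in tables during preprocessing, this sketch recomputes them on demand, it uses 0-based indices instead of the 1-based convention of the slides, it includes the fallback of aligning a prefix of P with a suffix of t, and the name `good_suffix_shift` is illustrative.

```python
def good_suffix_shift(p: str, i: int) -> int:
    """Shift for a mismatch at 0-based index i of p, given that the
    suffix t = p[i+1:] has already matched the text."""
    n = len(p)
    t = p[i + 1:]
    if not t:
        return 1  # nothing matched yet, so this rule gives no information
    # Case 1: right-most copy t' of t that is not a suffix of p and whose
    # preceding character differs from the mismatched character p[i].
    for s in range(i, 0, -1):
        if p[s:s + len(t)] == t and p[s - 1] != p[i]:
            return (i + 1) - s  # align t' under t
    # Case 2: longest prefix of p that is also a suffix of t.
    for q in range(len(t), 0, -1):
        if p[:q] == t[len(t) - q:]:
            return n - q
    # Case 3: no overlap is possible, so shift p past t entirely.
    return n


print(good_suffix_shift("GAAAGAA", 4))  # 3
print(good_suffix_shift("AGAAGAA", 0))  # 3
print(good_suffix_shift("ATGC", 2))     # 4
```

The first call assumes the mismatch falls on the second G of GAAAGAA with t = "AA" matched; the copy "AA" preceded by A is then aligned under t. The second call has no copy t' and falls back to the prefix "AGAA"; the third has no overlap at all and shifts by n.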
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: If an occurrence of P is found, then
shift P by the least amount so that a proper prefix of
the shifted P matches a suffix of the occurrence
of P in T.
T = A T G G C A A T T G G A A A G A A T T G A T
P = G A A A G A A
Boyer–Moore string search
Algorithm
Explanation
Shift Rules: The Good Suffix Rule
Description: If no such shift is possible, then
shift P by n places, that is, shift P past t.
(Example with different text and pattern)
T = A T G G C A A T G C G A A A G A A T T G A T
P = A T G C
Bitap Algorithm
(for exact string searching)
Bitap Algorithm
(for exact string searching)
Inventors
• The bitap algorithm for exact string searching
was invented by Bálint Dömölki in 1964 and
extended by R. K. Shyamasundar in 1977.
Bitap Algorithm
(for exact string searching)
Pseudo code
bitap_search(text : string, pattern : string)
    m := length(pattern)
    if m == 0 return -1
    /* Initialize the bit array R. */
    R := new array[m+1] of bit, initially all 0
    R[0] = 1
    for i = 0; i < length(text); i += 1:
        /* Update the bit array. */
        for k = m; k >= 1; k -= 1:
            R[k] = R[k-1] & (text[i] == pattern[k-1])
        if R[m]: return i - m + 1
    return -1
Bitap Algorithm
(for exact string searching)
Explanation of the Algorithm
The algorithm begins by setting up a bit array
containing one bit for each element of the
pattern plus one extra bit.
It is then able to do most of the work
with bitwise operations, which are extremely
fast.
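The remark about bitwise operations is easiest to see in the word-parallel variant, where the whole bit array lives in one integer and each pattern character gets a precomputed bitmask. The sketch below is that shift-and variant, not the array pseudocode from the slides; the name `bitap_shift_and` is an assumption, and returning -1 for an empty pattern simply mirrors the pseudocode.

```python
def bitap_shift_and(text: str, pattern: str) -> int:
    """Return the 0-based start of the first match, or -1."""
    m = len(pattern)
    if m == 0:
        return -1
    # masks[c] has bit k set exactly when pattern[k] == c.
    masks: dict[str, int] = {}
    for k, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << k)
    accept = 1 << (m - 1)  # bit set when the whole pattern has matched
    state = 0              # bit k set: pattern[:k+1] matches, ending at the current character
    for i, ch in enumerate(text):
        # Extend every partial match by one character in a single step.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return i - m + 1
    return -1


print(bitap_shift_and("ATTGCAC", "TGCA"))  # 2
```

Each text character costs one shift, one OR and one AND, which is where the speed comes from as long as the pattern fits in a machine word (Python integers are unbounded, so that limit does not show up here).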
Bitap Algorithm
(for exact string searching)
Explanation of the Algorithm
Initially, the first position of the bit array contains 1
and all the remaining positions contain 0.
Then, for every character of the text from start to
end, update the bit array from the last position
down to the first position (the 1st, not the 0th).
Bitap Algorithm
(for exact string searching)
Explanation of the Algorithm
The current bit array position is set to 1
if the previous bit array position is 1 and the
text character equals the pattern character at
the previous bit array position.
Bitap Algorithm
(for exact string searching)
Explanation of the Algorithm
Bit_array[current_position] = Bit_array[previous_position] & (text[i] == pattern[previous_position])
for (i = 0; i < text.size(); i += 1)
    for (k = m; k >= 1; k -= 1)
        r[k] = r[k-1] & (text[i] == pattern[k-1]);
Bitap Algorithm
(for exact string searching)
Explanation of the Algorithm
A match is found when the content of the
last position of the bit array becomes 1.
if (Bit_array[last_position])
    found a match!
Bitap Algorithm
(for exact string searching)
Explanation with an example
The text is: ATTGCAC
The pattern is: TGCA
m = 4 (pattern length)
i= index of the text
r= bit array
Initial bit array is: 1 0 0 0 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 0
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 0 0
k= 3, r= 1 0 0 0 0
k= 2, r= 1 0 0 0 0
k= 1, r= 1 0 0 0 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 1
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 0 0
k= 3, r= 1 0 0 0 0
k= 2, r= 1 0 0 0 0
k= 1, r= 1 1 0 0 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 2
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 1 0 0 0
k= 3, r= 1 1 0 0 0
k= 2, r= 1 1 0 0 0
k= 1, r= 1 1 0 0 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 3
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 1 0 0 0
k= 3, r= 1 1 0 0 0
k= 2, r= 1 1 1 0 0
k= 1, r= 1 0 1 0 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 4
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 1 0 0
k= 3, r= 1 0 1 1 0
k= 2, r= 1 0 0 1 0
k= 1, r= 1 0 0 1 0
Bitap Algorithm
(for exact string searching)
Explanation with an example
i= 5
text = ATTGCAC
pattern = TGCA
k= 4, r= 1 0 0 1 1
k= 3, r= 1 0 0 0 1
k= 2, r= 1 0 0 0 1
k= 1, r= 1 0 0 0 1
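The trace above can be reproduced with a direct translation of the pseudocode that also records the bit array after each text character. This is a sketch only; the name `bitap_trace` and the returned history list are illustrative additions.

```python
def bitap_trace(text: str, pattern: str):
    """Return (match_index, history), where history holds a copy of the
    bit array r after each processed text character."""
    m = len(pattern)
    if m == 0:
        return -1, []
    r = [0] * (m + 1)
    r[0] = 1  # the empty prefix always matches
    history = []
    for i, ch in enumerate(text):
        # Update from the last position down so r[k-1] is still the old value.
        for k in range(m, 0, -1):
            r[k] = r[k - 1] & int(ch == pattern[k - 1])
        history.append(r.copy())
        if r[m]:
            return i - m + 1, history
    return -1, history


index, history = bitap_trace("ATTGCAC", "TGCA")
print(index)  # 2
for i, row in enumerate(history):
    print(i, row)
```

The last recorded row is 1 0 0 0 1 at i = 5, so the match starts at index 5 - 4 + 1 = 2 of the text.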
Bitap Algorithm
(for exact string searching)
Properties
Due to the data structures required by the
algorithm, it performs best on patterns shorter than
a constant length (typically the machine word size),
and it also prefers inputs over a small alphabet
(it is suitable for DNA strings).
The array version shown here runs in O(mn)
operations, where m is the pattern length and n the
text length, no matter the structure of the text or
the pattern.