String Matching:

Sushma Prajapati
Assistant Professor
CO Dept
CKPCET, Surat
Email: sushma.prajapati@ckpcet.ac.in
Outline
● Introduction
● The naive string matching algorithm
● The Rabin-Karp algorithm
● String Matching with finite automata
● The Knuth-Morris-Pratt algorithm.
Introduction
● A string matching algorithm is also called a "string searching algorithm".
● As with most algorithms, the main considerations for string searching are speed
and efficiency.
● The problem is to find all occurrences of a pattern P[1..m] within a text T[1..n].
● P occurs with shift s (beginning at position s+1 of T) if:
○ P[1]=T[s+1], P[2]=T[s+2],…,P[m]=T[s+m].
● If P occurs with shift s, then s is called a valid shift; otherwise, s is an invalid shift.
The naive string matching algorithm
Input: P and T, the pattern and text strings; m, the length of P; n, the length of T. The
pattern is assumed to be nonempty.

Output: The return value is the index in T where a copy of P begins, or -1 if no match for
P is found.
The naive string matching : Introduction
● Naive pattern searching is the simplest of the pattern searching algorithms.
● It checks the pattern against the text at every possible position. The algorithm is
adequate for small texts. It needs no preprocessing phase and no extra space: a single
pass over the candidate shifts finds every occurrence.
● The naive approach tests every possible placement of the pattern P[1..m] relative to
the text T[1..n]. For each shift s = 0, 1, ..., n-m in turn, it compares T[s+1..s+m]
with P[1..m]. It returns all the valid shifts found.
The naive string matching : Algorithm
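The procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the original slide's pseudocode; the function name `naive_match` is an assumption, and shifts are reported 0-based, so a returned s means the pattern begins at T[s+1] in the slides' 1-based notation.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):          # try s = 0, 1, ..., n-m
        if text[s:s + m] == pattern:    # compare T[s+1..s+m] with P[1..m]
            shifts.append(s)
    return shifts


print(naive_match("xabxyabxyabxz", "abxyabxz"))  # [5]
```

Returning a list of shifts follows the "all valid shifts" formulation; a variant that returns only the first index, or -1 when there is none, would stop at the first match instead.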

The naive string matching : Time Analysis
● The best case occurs when the first character of the pattern does not appear in the text at all.
○ Example: T[] = "BBACCAADDEE"; P[] = "HBB";
○ Every shift fails on its first comparison, so the number of comparisons in the best case is O(n).
● The worst case occurs in the following scenarios.
○ When all characters of the text and the pattern are the same.
■ T[] = "DDDDDDDDDDDD" ; P[]="DDDDD"
○ When only the last character of the pattern differs.
■ T[] = "VVVVVVVVVVVVK" ; P[]="VVVK"
○ The number of comparisons in the worst case is O(m*(n-m+1)).
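To make the best and worst cases concrete, the sketch below counts character comparisons for the three example inputs. The helper name `count_comparisons` is hypothetical, and the matcher is assumed to stop comparing at the first mismatch within each shift.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Count the character comparisons made by the naive matcher."""
    n, m = len(text), len(pattern)
    comparisons = 0
    for s in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if text[s + j] != pattern[j]:
                break                   # stop at the first mismatch for this shift
    return comparisons


examples = [("BBACCAADDEE", "HBB"),
            ("DDDDDDDDDDDD", "DDDDD"),
            ("VVVVVVVVVVVVK", "VVVK")]
for text, pattern in examples:
    n, m = len(text), len(pattern)
    print(pattern, count_comparisons(text, pattern), "of at most", m * (n - m + 1))
```

In the first example each shift costs a single comparison, giving n-m+1 in total. In the other two every shift costs m comparisons, which reaches the m*(n-m+1) bound.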
Problem with naive string matching algorithm
● The naive string matcher is inefficient because information gained about the text
for one value of s is entirely ignored in considering other values of s.
● Example: T=xabxyabxyabxz, P=abxyabxz (v = match, X = mismatch)

    xabxyabxyabxz
    abxyabxz          s = 0
    X
     abxyabxz         s = 1
     vvvvvvvX
      abxyabxz        s = 2

● Whenever a mismatch occurs after several characters have matched, the comparison
restarts in T at the character that follows the start of the previous attempt, so
characters of T that were already matched are examined again.
Better Approach for string matching
● To do some preprocessing based on either pattern or text
● Some of String matching algorithms based on these are
○ The Rabin-Karp Algorithm
○ String Matching with finite automata
○ The Knuth-Morris-Pratt algorithm.
Rabin – Karp Algorithm
● The Rabin-Karp string searching algorithm calculates a hash value for the pattern
and for each M-character substring of the text to be compared.
● If the hash values are unequal, the algorithm moves on and calculates the hash value
for the next M-character substring.
● If the hash values are equal, the algorithm does a brute-force comparison
between the pattern and the M-character substring.
● In this way, there is only one hash comparison per text substring, and the brute-force
comparison is needed only when the hash values match.
Notation used in algorithm
● Let Σ = {0,1,2, . . .,9}.
● We can view a string of k consecutive characters as representing a length-k decimal
number.
● Let p denote the decimal value (hash code) of P[1..m].
● Let $t_s$ denote the decimal value (hash code) of the length-m substring T[s+1..s+m]
of T[1..n], for s = 0, 1, . . ., n-m.
● $t_s = p$ if and only if
○ T[s+1..s+m] = P[1..m], that is, s is a valid shift.
● $p = P[m] + 10\,(P[m-1] + 10\,(P[m-2] + \cdots + 10\,(P[2] + 10\,P[1]) \cdots))$
We can compute p in O(m) time by evaluating this nested form (Horner's rule).
● Similarly, we can compute $t_0$ from T[1..m] in O(m) time.
Notation used in algorithm(Contd…)
● $t_{s+1}$ can be computed from $t_s$ in constant time:

$t_{s+1} = 10\,(t_s - 10^{m-1}\,T[s+1]) + T[s+m+1]$

● Example: T = 314152, m = 5, s = 0, so $t_s = 31415$, T[s+1] = 3 and T[s+m+1] = 2:

$t_{s+1} = 10\,(31415 - 10000 \cdot 3) + 2 = 14152$

● Thus p and $t_0, t_1, \ldots, t_{n-m}$ can all be computed in O(n+m) time.
● Hence all occurrences of the pattern P[1..m] in the text T[1..n] can be found in
O(n+m) time, provided each arithmetic operation on these m-digit numbers takes
constant time.
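The rolling update can be checked with a few lines of Python. This is only a sketch: the function name `window_values` is an assumption, the text is taken to be a string of decimal digits, and Python's unbounded integers hide the word-size problem that arises for large m.

```python
def window_values(text: str, m: int) -> list[int]:
    """Decimal values t_0, ..., t_{n-m} of every length-m window of a digit string."""
    n = len(text)
    if m == 0 or m > n:
        return []
    t = 0
    for ch in text[:m]:                 # Horner's rule: t_0 in O(m) steps
        t = 10 * t + int(ch)
    values = [t]
    high = 10 ** (m - 1)                # weight of the leading digit
    for s in range(n - m):              # t_{s+1} from t_s with O(1) arithmetic steps
        # text[s] is T[s+1] and text[s+m] is T[s+m+1] in 1-based notation
        t = 10 * (t - high * int(text[s])) + int(text[s + m])
        values.append(t)
    return values


print(window_values("314152", 5))  # [31415, 14152]
```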
Notation used in algorithm(Contd…)
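● However, p and $t_s$ may be too large to work with conveniently.
● A simple solution: carry out all calculations modulo a selected value q.
● For a d-ary alphabet, select q to be a large prime such that d·q fits into one computer
word.
● Working modulo q, equal hash values no longer guarantee a match: a hit with
T[s+1..s+m] ≠ P[1..m] is a spurious hit and must be ruled out by explicit comparison.
● The recurrence equation can be rewritten as

$t_{s+1} = \bigl(d\,(t_s - h\,T[s+1]) + T[s+m+1]\bigr) \bmod q$

where $h = d^{m-1} \pmod q$ is the value of the digit “1” in the high order position of an
m-digit text window.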
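As an illustration of the modular update $t_{s+1} = (d\,(t_s - h\,T[s+1]) + T[s+m+1]) \bmod q$, here is one worked step that continues the earlier example T = 314152 with m = 5 and d = 10. The modulus q = 13 is an assumption chosen for the illustration; it does not come from the slides.

```latex
\begin{align*}
h   &= 10^{4} \bmod 13 = 3, \qquad t_0 = 31415 \bmod 13 = 7,\\
t_1 &= \bigl(10\,(t_0 - h\,T[1]) + T[6]\bigr) \bmod 13 \\
    &= \bigl(10\,(7 - 3 \cdot 3) + 2\bigr) \bmod 13 = (-18) \bmod 13 = 8 .
\end{align*}
```

This agrees with the direct computation 14152 mod 13 = 8, while every intermediate value stays small.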
Rabin – Karp : Algorithm
RABIN-KARP-MATCHER(T,P,d,q)

Input: text T, pattern P, radix d, and the prime q.
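The body of the matcher did not survive extraction, so the following Python sketch shows one way the procedure could look, assuming T and P are strings of decimal digits. The function name `rabin_karp`, the default q = 13, and the extra spurious-hit counter are assumptions made for illustration, not the slide's pseudocode.

```python
def rabin_karp(text: str, pattern: str, d: int = 10, q: int = 13):
    """Return (valid_shifts, spurious_hits) for digit strings text and pattern."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return [], 0
    h = pow(d, m - 1, q)                # d^(m-1) mod q
    p = t = 0
    for i in range(m):                  # preprocessing: p and t_0, both mod q
        p = (d * p + int(pattern[i])) % q
        t = (d * t + int(text[i])) % q
    shifts, spurious = [], 0
    for s in range(n - m + 1):
        if p == t:                      # hash hit: verify character by character
            if text[s:s + m] == pattern:
                shifts.append(s)
            else:
                spurious += 1
        if s < n - m:                   # roll the window one position to the right
            t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
    return shifts, spurious


# Illustrative input (not taken from the slides): one valid shift, one spurious hit.
print(rabin_karp("2359023141526739921", "31415", d=10, q=13))  # ([6], 1)
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction inside the rolling update needs no extra correction. In languages where the remainder can be negative, q would have to be added back.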


Rabin – Karp Algorithm : Example
Rabin - Karp : Time Analysis
● The average and best case running time is O(n+m).
● The worst-case running time is O(nm).
● The worst case of the Rabin-Karp algorithm occurs when all characters of the pattern and
the text are the same, because the hash value of every substring of txt[] then matches the
hash value of pat[] and each hit triggers a full comparison. For example, pat[] = “AAA”
and txt[] = “AAAAAAA”.
Rabin - Karp : Problem to Solve
Working modulo q=11, how many spurious hits does the Rabin-Karp matcher
encounter in the text T=3141592653589793 when looking for the pattern P=26?
String Matching with Finite Automata
● In this algorithm we preprocess the pattern and build a 2D array that represents a
finite automaton (FA).
● Constructing the FA is the tricky part of this algorithm.
● Once the FA is built, searching is simple: we start from the first state of the
automaton and the first character of the text.
● At every step, we read the next character of the text, look up the next state in the
FA, and move to that state.
● Whenever we reach the final state, the pattern has been found in the text.
● The matching takes O(n) time since each character is examined once.
Finite Automata
● A finite automaton (FA) is a simple idealized machine used to recognize patterns
within input taken from some character set (or alphabet). The job of an FA is to
accept or reject an input depending on whether the pattern defined by the FA
occurs in the input.
● A finite automaton is a 5-tuple (Q, Σ, δ, q0, F), where
○ Q: the finite set of states
○ Σ: the finite input alphabet
○ δ: the transition function, δ: Q × Σ → Q
○ q0: the start state
○ F : the set of final (accepting) states
Finite Automata : Algorithm
● Once we have constructed a finite automaton for the pattern, searching a text
t1, t2, ..., tn for the pattern is straightforward.
● Search time is O(n): each character in the text is examined just once, in
sequential order.
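A sketch of both phases in Python follows. The names `build_transitions` and `automaton_match` are assumptions, the alphabet is passed in explicitly and is assumed to contain every pattern character, and the table is built by a direct application of the definition (state = length of the longest pattern prefix that is a suffix of what has been read), which is simple but far from the fastest construction.

```python
def build_transitions(pattern: str, alphabet: str) -> list[dict[str, int]]:
    """delta[q][a]: longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):              # states 0..m; state m is the accepting state
        row = {}
        for a in alphabet:
            candidate = pattern[:q] + a
            k = min(m, q + 1)
            while k > 0 and not candidate.endswith(pattern[:k]):
                k -= 1
            row[a] = k
        delta.append(row)
    return delta


def automaton_match(text: str, pattern: str, alphabet: str) -> list[int]:
    """Scan text once and report each 0-based shift at which the pattern occurs."""
    delta = build_transitions(pattern, alphabet)
    m = len(pattern)
    q, shifts = 0, []
    for i, ch in enumerate(text):
        q = delta[q].get(ch, 0)         # a character outside the alphabet resets to state 0
        if q == m:
            shifts.append(i - m + 1)
    return shifts


print(automaton_match("xabxyabxyabxz", "abxyabxz", "abxyz"))  # [5]
```

The scan itself does one table lookup per text character, which is where the O(n) search time comes from. The cost of this simple table construction grows with both the pattern length and the alphabet size.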
KMP Algorithm
● Knuth, Morris and Pratt proposed a linear time algorithm for the string matching
problem.
● This approach is similar to the finite state automaton.
● The key observation: when a mismatch occurs after several characters have matched,
the matched part of the text is identical to a prefix of the pattern; therefore we can
match the pattern against itself, precomputing a prefix function that tells us how far
we can shift ahead.
● This means we can dispense with computing the transition function 𝛿
altogether.
Components of KMP Algorithm
● The Prefix Function (Π):
○ The prefix function Π for a pattern encapsulates knowledge about how the pattern matches against
shifts of itself. This information can be used to avoid useless shifts of the pattern P. In other
words, it avoids backtracking in the text T.
● The KMP Matcher:
○ With text T, pattern P and prefix function Π as inputs, it finds the occurrences of P in T and
reports the shift at which each occurrence begins.
Computing the prefix function
COMPUTE-PREFIX-FUNCTION (P)
    m ← length [P]                      // P is the pattern to be matched
    Π [1] ← 0
    k ← 0
    for q ← 2 to m
        do while k > 0 and P [k + 1] ≠ P [q]
               do k ← Π [k]
           if P [k + 1] = P [q]
               then k ← k + 1
           Π [q] ← k
    return Π
Example of prefix function
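The worked example from this slide was not recoverable, so the sketch below computes the prefix function in Python for two patterns instead. The pattern "ababaca" is an assumed sample input; "abxyabxz" is the pattern from the naive-matcher example. Indices are 0-based here, so `pi[q]` corresponds to Π[q+1] in the pseudocode.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[q]: length of the longest proper prefix of pattern[:q+1] that is also its suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0                               # number of pattern characters currently matched
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]               # fall back to the next shorter border
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))   # [0, 0, 1, 2, 3, 0, 1]
print(compute_prefix_function("abxyabxz"))  # [0, 0, 0, 0, 1, 2, 3, 0]
```

For "abxyabxz" the value 3 at the seventh position records that "abx" is both a prefix and a suffix of "abxyabx". That is how far the pattern can be slid after the mismatch on 'z' without losing a possible match.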
KMP Matcher
The KMP Matcher, with pattern P, text T and prefix function Π as input, finds the matches
of P in T. The following pseudocode computes the matching component of the KMP algorithm:

KMP-MATCHER (T, P)
1.  n ← length [T]
2.  m ← length [P]
3.  Π ← COMPUTE-PREFIX-FUNCTION (P)
4.  q ← 0                               // number of characters matched
5.  for i ← 1 to n                      // scan T from left to right
6.      do while q > 0 and P [q + 1] ≠ T [i]
7.             do q ← Π [q]             // next character does not match
8.         if P [q + 1] = T [i]
9.             then q ← q + 1           // next character matches
10.        if q = m                     // is all of P matched?
11.            then print "Pattern occurs with shift" i - m
12.                 q ← Π [q]           // look for the next match
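A runnable Python counterpart of the matcher could look like the sketch below. The function name `kmp_match` is an assumption, indices are 0-based, and the prefix function is computed inline so that the block is self-contained. Instead of printing, it collects the shifts in a list.

```python
def kmp_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []                       # the pattern is assumed to be nonempty

    # Prefix function: pi[j] is the longest border of pattern[:j+1].
    pi = [0] * m
    k = 0
    for j in range(1, m):
        while k > 0 and pattern[k] != pattern[j]:
            k = pi[k - 1]
        if pattern[k] == pattern[j]:
            k += 1
        pi[j] = k

    shifts = []
    q = 0                               # number of pattern characters matched so far
    for i, ch in enumerate(text):       # scan the text from left to right, never backing up
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]               # next character does not match
        if pattern[q] == ch:
            q += 1                      # next character matches
        if q == m:                      # all of the pattern matched
            shifts.append(i - m + 1)
            q = pi[q - 1]               # look for the next match
    return shifts


print(kmp_match("xabxyabxyabxz", "abxyabxz"))  # [5]
```

On the text from the naive-matcher example, the mismatch on 'z' at shift 1 makes the matcher fall back to q = 3 and continue from the same text position. The naive method would instead restart at shift 2 and re-read characters it had already matched.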
KMP Runtime Analysis
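The for loop beginning in step 5 runs n times, where n is the length of the text T.
Steps 1, 2 and 4 take constant time, and step 3 (COMPUTE-PREFIX-FUNCTION) takes O(m) time.
Within the loop, q increases by at most 1 per iteration (step 9) and every pass through the
while loop (step 7) decreases q, so the while loop executes at most n times in total. Thus
the running time of the matching loop is O(n), and the whole algorithm runs in O(m + n).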
