Approximate Matching (String Algorithms 2007)
Approximate pattern matching
Given string x = abbacbbbababacabbbba and pattern p = bbba, find all “almost”-occurrences of p in x.
[Figure: x with almost-occurrences of p marked at positions 1, 6 and 17.]
String distance
A number of string distances have been suggested, e.g.:
Hamming distance: d(x, y) = number of positions at which x and y differ (for |x| = |y|). d(abca, abaa) = 1, d(abca, abab) = 2
Levenshtein distance: d(x, y) = minimum number of deletions and insertions needed to transform x into y. d(abca, abaa) = 2, d(abca, aba) = 1
Edit distance: d(x, y) = minimum number of insertions, deletions, or substitutions needed to transform x into y. d(abca, abaa) = 1, d(abca, cca) = 2
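The three distances above can be sketched in a few lines of Python. This is only an illustration: the function names and the `allow_substitution` flag are assumptions made here, not part of the slides, and the dynamic-programming formulation is the standard textbook one rather than anything the slides prescribe.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str, allow_substitution: bool = True) -> int:
    """Minimum number of edits turning x into y (row-by-row dynamic programming).

    With allow_substitution=False only insertions and deletions are counted,
    which is the distance the slide calls "Levenshtein".
    """
    prev = list(range(len(y) + 1))            # distance from "" to y[:j]
    for i, a in enumerate(x, start=1):
        cur = [i]                             # distance from x[:i] to ""
        for j, b in enumerate(y, start=1):
            best = min(prev[j] + 1, cur[j - 1] + 1)   # delete a / insert b
            if a == b:
                best = min(best, prev[j - 1])         # characters agree, no cost
            elif allow_substitution:
                best = min(best, prev[j - 1] + 1)     # substitute a -> b
            cur.append(best)
        prev = cur
    return prev[-1]
```

Called on the slide's own examples, `hamming("abca", "abab")` gives 2, `edit_distance("abca", "abaa", allow_substitution=False)` gives 2, and `edit_distance("abca", "cca")` gives 2.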
k-Approximate matching
Given string x and pattern p, find all indices i in x where d(x[i..i+h], p) ≤ k for some h.
This is a generic problem for the various distance functions d.
Generic Algorithm
  for i = 1..|x|:
    if d(x[i..i'], p) <= k for some i' > i:
      report match at i
Time usage: O(n^2 ∙ “time to calculate distance”), with n = |x|.
(But see Sect. 10.1 for an O(nm) dynamic programming algorithm, with m = |p|, that works for most distance functions.)
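A minimal sketch of the generic algorithm, assuming the distance is passed in as a callable. The helper `length_strict_hamming` is a hypothetical stand-in used only to make the example runnable, and the sketch tries every non-empty substring starting at i, which is slightly more permissive than the slide's i' > i.

```python
from typing import Callable, List


def length_strict_hamming(a: str, b: str) -> float:
    """Hamming distance; strings of different length count as infinitely far apart."""
    if len(a) != len(b):
        return float("inf")
    return sum(c != d for c, d in zip(a, b))


def generic_matches(x: str, p: str, k: int,
                    dist: Callable[[str, str], float]) -> List[int]:
    """Report every 1-based start i such that some substring of x starting at i has dist <= k."""
    matches = []
    for i in range(len(x)):
        # Try every end position: this is the O(n^2) pair loop from the slide,
        # with one distance computation per (start, end) pair.
        if any(dist(x[i:end], p) <= k for end in range(i + 1, len(x) + 1)):
            matches.append(i + 1)
    return matches
```

With k = 0 this reduces to exact matching, so on x = abbacbbbababacabbbba and p = bbba it reports the two exact occurrences at 6 and 17.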
The k-mismatch problem
Let, for strings x and y with |x| = |y| = n, d(x, y) = |{i | i = 1..n, x[i] ≠ y[i]}| (the Hamming distance).
The k-mismatch problem: given string x and pattern p with m = |p|, find all indices i in x where d(x[i..i+m-1], p) ≤ k.
Simple k-mismatch algorithm
  for i = 1..|x|-|p|+1:
    count = 0
    for j = 1..|p|:
      if x[i+j-1] != p[j]:
        count = count + 1
    if count <= k:
      report match at i
Time usage: O(|x||p|)
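The simple algorithm translates almost line for line into Python. The sketch below uses 0-based indexing internally and reports 1-based positions to match the slides; the function name is an assumption.

```python
from typing import List


def k_mismatch_naive(x: str, p: str, k: int) -> List[int]:
    """Return all 1-based positions i where x[i..i+|p|-1] differs from p in at most k places."""
    m = len(p)
    matches = []
    for i in range(len(x) - m + 1):       # every window of length |p|
        count = 0
        for j in range(m):
            if x[i + j] != p[j]:          # compare window character j with p[j]
                count += 1
        if count <= k:
            matches.append(i + 1)         # convert to the slides' 1-based positions
    return matches
```

For x = babacbbbababacabbbba, p = bbba and k = 2, the first three reported positions are 1, 5 and 6, the same matches the 2-mismatch example finds at i = 4, 8 and 9.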
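Bit-vector approach
Inspired by SHIFT-and-OR.
Use a state bit-vector s to speed up the simple algorithm.
For each index j in p, s uses log(|p|+1) bits (rounded up) to store
s[j] = d(x[i-j+1..i], p[1..j]),
the number of mismatches between the prefix p[1..j] and the j characters of x ending at position i.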
Using state vector s
Notice that s holds information about more than one comparison: each cell s[j] belongs to a different alignment of p against x.
Conceptually, p is positioned at |p| places along x at once: s tries to match p at start positions i-|p|+1 .. i.
Using state vector s
When s[|p|] ≤ k and i ≥ |p|, we have an occurrence of p in x at position i-|p|+1.
Example: 2-mismatch
x = babacbbbababacabbbba, p = bbba. Cells are listed left to right as s[1] s[2] s[3] s[4]; cells whose alignment would start before x[1] are shown as 0.
i=0: s = 000 000 000 000
i=1: s = 000 000 000 000
i=2: s = 001 001 000 000
i=3: s = 000 001 001 000
i=4: s = 001 001 010 001   Match at i-4+1 = 1
i=5: s = 001 010 010 011
i=6: s = 000 001 010 011
i=7: s = 000 000 001 011
i=8: s = 000 000 000 010   Match at i-4+1 = 5
i=9: s = 001 001 001 000   Match at i-4+1 = 6
Constructing s
Let s_i be the state vector in iteration i. Then
s_i[j] = s_{i-1}[j-1] + t,
where t = 0 if p[j] = x[i] and t = 1 otherwise.
Special case: s_i[0] = 0 for all i.
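Before packing anything into bits, the recurrence can be tried out with a plain list of counters. This is a sketch for intuition only; `update_state` is a name chosen here, and the list holds cells s[0]..s[|p|].

```python
from typing import List


def update_state(s_prev: List[int], p: str, c: str) -> List[int]:
    """One step of s_i[j] = s_{i-1}[j-1] + t, with the special case s_i[0] = 0."""
    s = [0] * (len(p) + 1)
    for j in range(1, len(p) + 1):
        t = 0 if p[j - 1] == c else 1     # p[j] in the slides' 1-based notation
        s[j] = s_prev[j - 1] + t
    return s


# Feed the first four characters of x = babacbbbababacabbbba through the update.
state = [0, 0, 0, 0, 0]
history = []
for ch in "baba":
    state = update_state(state, "bbba", ch)
    history.append(state)
```

After the fourth character the last cell holds 1, i.e. one mismatch between p and x[1..4], which is within k = 2.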
Table t
The number t, where t = 0 if p[j] = x[i] and t = 1 otherwise, can be pre-calculated and stored in a matrix:
t[h,j] = 0^{log(|p|+1)}       if p[j] == h
t[h,j] = 0^{log(|p|+1)-1} 1   if p[j] != h
with rows indexed by the alphabet and columns indexed by indices in p, using log(|p|+1) bits per cell.
Table t
Why use log(|p|+1) bits per cell? To be able to add t[h] and s a word at a time, silly!
[Figure: s_{i-1}, t[h] and s_i drawn as rows of cells split into machine words; s_i is obtained by adding the rows word by word.]
Table t for pattern p = bbba
j:         1   2   3   4
t['a']:  001 001 001 000
t['b']:  000 000 000 001
t['c']:  001 001 001 001
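The table for p = bbba can be reproduced with Python integers standing in for machine words. In this sketch (names are assumptions) cell 1 is the most significant cell and cell |p| the least significant, matching the left-to-right layout of the table; `cells` is a small helper for printing a word cell by cell.

```python
from typing import Dict


def build_t(p: str, alphabet: str, b: int) -> Dict[str, int]:
    """Pack one b-bit cell per pattern index: 0 where p[j] == h, 1 otherwise."""
    m = len(p)
    table = {}
    for h in alphabet:
        word = 0
        for j in range(1, m + 1):
            cell = 0 if p[j - 1] == h else 1
            word |= cell << (b * (m - j))   # cell m ends up in the lowest b bits
        table[h] = word
    return table


def cells(word: int, b: int, count: int) -> str:
    """Render the lowest `count` cells of `word` as space-separated b-bit groups."""
    mask = (1 << b) - 1
    groups = [(word >> (b * (count - 1 - c))) & mask for c in range(count)]
    return " ".join(format(g, f"0{b}b") for g in groups)


t = build_t("bbba", "abc", b=3)
```

Here `cells(t["a"], 3, 4)` renders as `001 001 001 000`, the first row of the table.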
Example
x = babacbbbababacabbbba, p = bbba, 3 bits per cell. Cells are listed left to right as s[0] s[1] s[2] s[3] s[4]; each step shifts the previous state one cell to the right and adds t[x[i]].
i=0:           s_0 = 000 000 000 000 000
i=1, x[1]=b:   s_0 = 000 000 000 000 000   t[b] = 000 000 000 001   s_1 = 000 000 000 000 001
i=2, x[2]=a:   s_1 = 000 000 000 000 001   t[a] = 001 001 001 000   s_2 = 000 001 001 001 000
i=3, x[3]=b:   s_2 = 000 001 001 001 000   t[b] = 000 000 000 001   s_3 = 000 000 001 001 010
i=4, x[4]=a:   s_3 = 000 000 001 001 010   t[a] = 001 001 001 000   s_4 = 000 001 001 010 001   MATCH!
i=5, x[5]=c:   s_4 = 000 001 001 010 001   t[c] = 001 001 001 001   s_5 = 000 001 010 010 011
i=6, x[6]=b:   s_5 = 000 001 010 010 011   t[b] = 000 000 000 001   s_6 = 000 000 001 010 011
Algorithm
Preprocessing:
  b = log(|p|+1)
  for h in Σ and j = 1..|p|: t[h,j] = 0^{b-1}1
  for j = 1..|p|: t[p[j],j] = 0^b
  s = 0^{b(|p|+1)}
Main:
  for i = 1..|x|:
    s = (s >> b) + t[x[i]]
    if s[|p|] <= k and i >= |p|:
      report i-|p|+1 as match
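A runnable sketch of this algorithm, under a few assumptions that are not in the slides: a Python integer plays the role of the multi-word bit-vector, the alphabet is taken to be the characters occurring in x and p, and b is rounded up to a whole number of bits via `int.bit_length`.

```python
from typing import List


def k_mismatch_bitvector(x: str, p: str, k: int) -> List[int]:
    """k-mismatch search using one packed counter cell per pattern prefix."""
    m = len(p)
    b = m.bit_length()                    # ceil(log2(m + 1)) bits per cell
    cell_mask = (1 << b) - 1

    # Preprocessing: t[h] has cell j equal to 0 if p[j] == h and 1 otherwise.
    # Cell 1 is the most significant cell, cell m the least significant.
    t = {}
    for h in set(x) | set(p):
        word = 0
        for ch in p:
            word = (word << b) | (0 if ch == h else 1)
        t[h] = word

    s = 0                                 # cells s[0..m], all zero
    matches = []
    for i, c in enumerate(x, start=1):
        s = (s >> b) + t[c]               # shift every count one cell on, add mismatches
        if i >= m and (s & cell_mask) <= k:   # lowest cell is s[|p|]
            matches.append(i - m + 1)
    return matches
```

No cell can overflow into its neighbour, because s[j] is at most j ≤ |p| and b bits hold values up to at least |p|. On the example string the function reports the matches at 1, 5 and 6 first.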
Runtime complexity
Preprocessing takes time O(|Σ||p|log(|p|+1)/w + |p|), where w is the machine word size in bits.
Main search takes time O(|x||p|log(|p|+1)/w).
Improvements
When k « |p|, we are wasting bits (and time)! We count all mismatches, but we only need to know whether there are more than k or not.
We can replace the log(|p|+1)-bit cells in s with cells of log(k+1)+1 bits (enough for the numbers 0..k plus an overflow bit), and add a vector o that remembers overflows.
Updated operation
We update the operation
  s = (s >> b) + t[x[i]]
where b = log(|p|+1), to be
  s = (s >> b) + t[x[i]]
  o = (o >> b) | (s & OF_MASK)
  s = s & NEG_OF_MASK
where b = log(k+1)+1 and
  OF_MASK = (10^{b-1})^{|p|}
  NEG_OF_MASK = ~OF_MASK = (01^{b-1})^{|p|}
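The masks and a single update step can be checked on concrete numbers. The sketch below assumes |p| = 4 and k = 2, so b = 3, and replays the step for i = 7 of the overflow example (state 000 001 010 011 010, next character c); the function names are illustrative only.

```python
from typing import Tuple


def overflow_masks(m: int, k: int) -> Tuple[int, int, int]:
    """Return (b, OF_MASK, NEG_OF_MASK) for m cells of b = ceil(log2(k+1)) + 1 bits."""
    b = k.bit_length() + 1
    of_mask = 0
    for _ in range(m):
        of_mask = (of_mask << b) | (1 << (b - 1))     # top bit of every cell
    neg_of_mask = ((1 << (b * m)) - 1) ^ of_mask      # remaining b-1 bits of every cell
    return b, of_mask, neg_of_mask


def step(s: int, o: int, t_c: int, b: int, of_mask: int, neg_of_mask: int) -> Tuple[int, int]:
    """Apply the three-line update from the slide once."""
    s = (s >> b) + t_c
    o = (o >> b) | (s & of_mask)      # remember cells that just reached the overflow bit
    s = s & neg_of_mask               # clear overflow bits so cells cannot carry over
    return s, o


b, of_mask, neg_of_mask = overflow_masks(4, 2)
s7, o7 = step(0b000_001_010_011_010, 0, 0b001_001_001_001, b, of_mask, neg_of_mask)
```

The last cell reaches 100, so the overflow bit moves into `o7` and the cell in `s7` is reset to 000.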
Algorithm
Preprocessing:
  b = log(k+1)+1
  OF_MASK = (10^{b-1})^{|p|} ; NEG_OF_MASK = ~OF_MASK
  for h in Σ and j = 1..|p|: t[h,j] = 0^{b-1}1
  for j = 1..|p|: t[p[j],j] = 0^b
  s = 0^{b(|p|+1)} ; o = 0^{b(|p|+1)}
Main:
  for i = 1..|x|:
    s = (s >> b) + t[x[i]]
    o = (o >> b) | (s & OF_MASK)
    s = s & NEG_OF_MASK
    if s[|p|] + o[|p|] <= k and i >= |p|:
      report i-|p|+1 as match
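Putting the pieces together gives the following sketch of the overflow variant. As before, a Python integer stands in for the multi-word vectors and the alphabet is assumed to be the characters seen in x and p; neither choice comes from the slides.

```python
from typing import List


def k_mismatch_overflow(x: str, p: str, k: int) -> List[int]:
    """k-mismatch search with narrow cells plus a separate overflow vector."""
    m = len(p)
    b = k.bit_length() + 1                # ceil(log2(k + 1)) count bits + 1 overflow bit
    cell_mask = (1 << b) - 1
    of_mask = 0
    for _ in range(m):
        of_mask = (of_mask << b) | (1 << (b - 1))

    # Preprocessing: same table as before, only with narrower cells.
    t = {}
    for h in set(x) | set(p):
        word = 0
        for ch in p:
            word = (word << b) | (0 if ch == h else 1)
        t[h] = word

    s = 0
    o = 0
    matches = []
    for i, c in enumerate(x, start=1):
        s = (s >> b) + t[c]
        o = (o >> b) | (s & of_mask)      # overflow flags travel with their alignment
        s &= ~of_mask                     # keep only the low b-1 bits of each cell
        # An overflow flag is worth 2^(b-1) > k, so any flagged cell fails the test.
        if i >= m and (s & cell_mask) + (o & cell_mask) <= k:
            matches.append(i - m + 1)
    return matches
```

On x = babacacbababacabbbba with p = bbba and k = 2 the first two reported positions are 1 and 3, corresponding to the MATCH! steps at i = 4 and i = 6 of the overflow example.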
Example
x = babacacbababacabbbba, p = bbba, k = 2, b = 3. Cells are listed left to right as cell 0 .. cell 4; s_i' denotes s before the overflow bits are cleared.

i=0:
  s_0 = 000 000 000 000 000
  o_0 = 000 000 000 000 000

i=1 (x[1] = b):
  (s_0 >> b)       = 000 000 000 000 000
  t['b']           =     000 000 000 001
  s_1'             = 000 000 000 000 001
  (o_0 >> b)       = 000 000 000 000 000
  (s_1' & OF_MASK) = 000 000 000 000 000
  o_1              = 000 000 000 000 000
  s_1              = 000 000 000 000 001

i=2 (x[2] = a):
  (s_1 >> b)       = 000 000 000 000 000
  t['a']           =     001 001 001 000
  s_2'             = 000 001 001 001 000
  (o_1 >> b)       = 000 000 000 000 000
  (s_2' & OF_MASK) = 000 000 000 000 000
  o_2              = 000 000 000 000 000
  s_2              = 000 001 001 001 000

i=3 (x[3] = b):
  (s_2 >> b)       = 000 000 001 001 001
  t['b']           =     000 000 000 001
  s_3'             = 000 000 001 001 010
  (o_2 >> b)       = 000 000 000 000 000
  (s_3' & OF_MASK) = 000 000 000 000 000
  o_3              = 000 000 000 000 000
  s_3              = 000 000 001 001 010

i=4 (x[4] = a):
  (s_3 >> b)       = 000 000 000 001 001
  t['a']           =     001 001 001 000
  s_4'             = 000 001 001 010 001
  (o_3 >> b)       = 000 000 000 000 000
  (s_4' & OF_MASK) = 000 000 000 000 000
  o_4              = 000 000 000 000 000
  s_4              = 000 001 001 010 001   MATCH!

i=5 (x[5] = c):
  (s_4 >> b)       = 000 000 001 001 010
  t['c']           =     001 001 001 001
  s_5'             = 000 001 010 010 011
  (o_4 >> b)       = 000 000 000 000 000
  (s_5' & OF_MASK) = 000 000 000 000 000
  o_5              = 000 000 000 000 000
  s_5              = 000 001 010 010 011

i=6 (x[6] = a):
  (s_5 >> b)       = 000 000 001 010 010
  t['a']           =     001 001 001 000
  s_6'             = 000 001 010 011 010
  (o_5 >> b)       = 000 000 000 000 000
  (s_6' & OF_MASK) = 000 000 000 000 000
  o_6              = 000 000 000 000 000
  s_6              = 000 001 010 011 010   MATCH!

i=7 (x[7] = c):
  (s_6 >> b)       = 000 000 001 010 011
  t['c']           =     001 001 001 001
  s_7'             = 000 001 010 011 100
  (o_6 >> b)       = 000 000 000 000 000
  (s_7' & OF_MASK) = 000 000 000 000 100
  o_7              = 000 000 000 000 100
  s_7              = 000 001 010 011 000

i=8 (x[8] = b):
  (s_7 >> b)       = 000 000 001 010 011
  t['b']           =     000 000 000 001
  s_8'             = 000 000 001 010 100
  (o_7 >> b)       = 000 000 000 000 000
  (s_8' & OF_MASK) = 000 000 000 000 100
  o_8              = 000 000 000 000 100
  s_8              = 000 000 001 010 000
Time complexity
Preprocessing takes time O(|Σ||p|log(k)/w + |p|).
Main search takes time O(|x||p|log(k)/w).
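To get a feel for the difference between the two bounds, the number of machine words per state vector can be computed directly. The figures below are an illustration under assumed parameters (|p| = 100, k = 2, w = 64), not measurements from the slides, and the helper names are hypothetical.

```python
def cell_bits_plain(m: int) -> int:
    """Bits per cell when counting up to |p| mismatches: ceil(log2(m + 1))."""
    return m.bit_length()


def cell_bits_overflow(k: int) -> int:
    """Bits per cell when counting up to k plus one overflow bit: ceil(log2(k + 1)) + 1."""
    return k.bit_length() + 1


def words_per_vector(m: int, b: int, w: int = 64) -> int:
    """Machine words needed for a vector of m + 1 cells of b bits each."""
    bits = b * (m + 1)
    return -(-bits // w)                  # ceiling division


plain_words = words_per_vector(100, cell_bits_plain(100))
overflow_words = words_per_vector(100, cell_bits_overflow(2))
```

With these parameters the plain vector needs 12 words of 7-bit cells, while the overflow variant needs 5 words for s (and the same again for o), so each shift-and-add touches fewer words per text character.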
