Knuth Morris Pratt (KMP) Algorithm

Knuth Morris Pratt Algorithm – Introduction

Knuth Morris Pratt algorithm is an effective way of finding patterns from the text.

  • This algorithm is also known as KMP (Knuth-Morris-Pratt) algorithm. This is the first linear time algorithm for string matching. It utilizes the concept of a naïve approach in some different ways. This approach keeps track of matched part of the pattern.
  • The main idea of the algorithm is to avoid computation of transition function d and reduce useless shifts performed in a naïve approach.
  • By maintaining the information of the already processed string, this approach reduces the time to O(m + n). That is, in the worst case, the KMP algorithm examines all characters of the input pattern and text exactly once.
  • This improvement is achieved by using the auxiliary function π. The naïve approach was not storing anything it has already matched, and that’s why it was shifting the pattern right side by just one. Computation of auxiliary function π takes O(m) time.
  • Function π is very useful, it computes the number of shifts of the pattern by finding pattern matches within itself.

Sliding rule:

  • The naïve approach slides the pattern by one right side. The KMP approach slides the string multiple steps and reduces comparisons. Let us consider the pattern
    P = p1p2p3…pm-1pm.
  • For any string P, sub-string P’ of the first i characters is called the prefix of P, where i = 1 to m. Null string also belongs prefix of P. String P’ is proper prefix of P if and only if P’ ≠ P.
  • Similarly, for any string P, substring P’ of last i characters is called the suffix of P, where i = 1 to m. Null string also belongs suffix of P. String P’ is the proper suffix of P if and only if P’ ≠ P.
  • Consider the pattern P = abcbdabc ;
  • Prefixes of P are : {f, a, ab, abc, abcd, abcda, abcdab, abcdabc}, where abcdabc is not the proper prefix of P.
  • Suffixes of P are : {f, c, bc, abc, dabc, cdabc, bcdabc, abcdabc}, where abcdabc is not the proper suffix of P.
  • Suppose the pattern P[1….q] is matched with the text at T[(i – q + 1)… i]. The first mismatch occurs at position P[q + 1] and T[i + 1]. Slide the pattern right by x, where x is the length of the longest proper prefix of P[1…q] and also the longest suffix of P[1…q]

Working principle

If P[1…q] matches with T[(i – q + 1) …i], there are two possibilities :

  • If P[q + 1] = T[i + 1], matched string length is increased by one. If q = m, a complete match is found and the algorithm terminates.
  • If P[q + 1] ≠ T[i + 1], then the slide pattern P to right by longest common suffix and prefix of P[1…q].

Knuth Morris Pratt – Algorithm

Algorithm PREFIX_FUNCTION(P)
// P is the pattern of length m

π[1] ← 0
k ← 0
for  q ← 2 to m do
   while k > 0 and P[k+1] ≠ P[q] do
      k ← π[k]
   end
   if P[k+1] == P[q] then
      k ← k + 1
   end
   π[q] ← k
end
return π

The following algorithm finds the pattern within the text :

Algorithm KMP_STRING_MATCHING(T, P)
// T is text of length n
// P  is pattern of length m

π  ← PREFIX_FUNCTION(P)

q ← 0

for i ← 1 to n do
   while  q > 0 and P[q + 1] ≠ T[i] do
      q ← π[q]
   end
   if P[q + 1] == T[i] then
      q ← q + 1
   end
end

if q == m then
   print “Pattern found with shift”, i – m
end
q ← π[q]

Complexity Analysis

Function π is computed in O(m) time where as the actual string-matching routine scans n characters. So worst case running time of the algorithm is given by  O(m + n).

Examples

Example: Compute the prefix function for pattern ababaca

Solution:

Initially, k = 0, q = 2, π[1] = 0 and m = 7

i1234567
P[i]ababaca
p[i]0      

Iteration 1 : q = 2, k = 0

k is not greater than 0, so skip while loop

P[k + 1]  =   P[1] = a and P[q] = P[2] = b

P[k + 1] ≠ P[q], so π[q] = k → π[2] = 0

i1234567
P[i]ababaca
p[i]00     

Iteration 2 : q = 3, k = 0

k is not greater than 0, so skip while loop

P[k + 1] = P[1] = a and P[q] = P[3] = a

P[k + 1] = P[q], so k = k + 1  → k = 1

π[q] = k → π[3] = 1

i1234567
P[i]ababaca
p[i]001    

Iteration 3 : q = 4, k = 1

k is greater than 0, so enter into while loop

P[k + 1]  =   P[2] = b and P[q] = P[4] = b,

k > 0 but P[k + 1] = P[q]         // break while loop

P[k + 1]  =   P[q], so k = k + 1  → k = 2

π[q] = k → π[4] = 2

i1234567
P[i]ababaca
p[i]0012   

Iteration 4 : q = 5, k = 2

k is greater than 0, so enter into while loop

P[k + 1] = P[3] = a and P[q] = P[5] = a,

k > 0 but P[k + 1] = P[q]         // break while loop

P[k + 1] = P[q], so k = k + 1  → k = 3

π[q] = k → π[5] = 3

i1234567
P[i]ababaca
p[i]00123  

Iteration 5 : q = 6, k = 3

k is greater than 0, so enter into while loop

P[k + 1] = P[4] = b and P[q] = P[6] = c,

k  > 0  and P[k + 1] ≠ P[q]      // Enter in while loop

k = π[k] → k = π[3] = 1

Now,   P[k + 1]  =   P[2] = b and P[q] = c

Still k > 0 and P[k + 1] ≠ P[q]      // Enter in while loop again

k = π[k] → k = π[1] = 0

As k    =   0, While loop breaks now

P[k + 1] =   P[1] = a and P[q] = P[6] = c

P[k + 1] ≠ P[q], So, π[q] = k  → π[6] = 0

i1234567
P[i]ababaca
p[i]001230 

Iteration 6 : q = 7, k = 0

k is not greater than 0, so skip while loop

P[k + 1] =   P[1] = a and P[q] = P[7] = a

P[k + 1] =   P[q], so k = k + 1 = 1

π[q] = k → π[7] = 1

i1234567
P[i]ababaca
p[i]0012301

Example: Check if pattern P = abababca exits in text T = bacbababaabcbab.

Solution:

  • The π function for a given pattern would be,
i12345678
P[i]abababca
p[i]00123401
  • The first partial match is found in the second position. The length of this partial match is 1. π[1] = 0, so we won’t be able to skip ahead. We will continue matching by moving patterns right by one.
Knuth Morris Pratt Algorithm
  • The next partial match is found in position 5. The length of this match is 5. The next character of P and T does not match. π[5] =3, which means we get to skip ahead.
Knuth Morris Pratt Algorithm

             Skip   =    partial length – π[partial length]

                              =   5 – π[5] = 5 – 3 = 2

  • So we can skip two characters ahead.
  • The next partial match is found in position 7. The length of this match is 3. The next character of P and T does not match. π[3] =1, which means we get to skip ahead. The Cross indicates skipped characters.
Knuth Morris Pratt Algorithm

                  Skip    =   partial length – π[partial length]

                              =   3 – π[3] = 3 – 1 = 2

  • So we can skip two characters ahead.
  • Now, the pattern is longer than the remaining text, so the match does not found.

Comparison of All Methods

If m is the length of the pattern and n is the length of the text to be searched then,

AlgorithmPreprocessing timeMatching time
Naïve string matching algorithm0O((n-m)*m)
Rabin Karp string matching algorithmO(m)Average: O(n + m) Worst: O((n-m)*m)
Finite state automataO(m |S|)O(n)
Knuth-Morris-Pratt algorithmO(m)O(n)

Youtube Channel: https://shorturl.at/inryW

Leave a Reply

Your email address will not be published. Required fields are marked *