Knuth Morris Pratt (KMP) Algorithm
Knuth Morris Pratt Algorithm – Introduction
Knuth Morris Pratt algorithm is an effective way of finding patterns from the text.
- This algorithm is also known as KMP (Knuth-Morris-Pratt) algorithm. This is the first linear time algorithm for string matching. It utilizes the concept of a naïve approach in some different ways. This approach keeps track of matched part of the pattern.
- The main idea of the algorithm is to avoid computation of transition function d and reduce useless shifts performed in a naïve approach.
- By maintaining the information of the already processed string, this approach reduces the time to O(m + n). That is, in the worst case, the KMP algorithm examines all characters of the input pattern and text exactly once.
- This improvement is achieved by using the auxiliary function π. The naïve approach was not storing anything it has already matched, and that’s why it was shifting the pattern right side by just one. Computation of auxiliary function π takes O(m) time.
- Function π is very useful, it computes the number of shifts of the pattern by finding pattern matches within itself.
Sliding rule:
- The naïve approach slides the pattern by one right side. The KMP approach slides the string multiple steps and reduces comparisons. Let us consider the pattern
P = p1p2p3…pm-1pm. - For any string P, sub-string P’ of the first i characters is called the prefix of P, where i = 1 to m. Null string also belongs prefix of P. String P’ is proper prefix of P if and only if P’ ≠ P.
- Similarly, for any string P, substring P’ of last i characters is called the suffix of P, where i = 1 to m. Null string also belongs suffix of P. String P’ is the proper suffix of P if and only if P’ ≠ P.
- Consider the pattern P = abcbdabc ;
- Prefixes of P are : {f, a, ab, abc, abcd, abcda, abcdab, abcdabc}, where abcdabc is not the proper prefix of P.
- Suffixes of P are : {f, c, bc, abc, dabc, cdabc, bcdabc, abcdabc}, where abcdabc is not the proper suffix of P.
- Suppose the pattern P[1….q] is matched with the text at T[(i – q + 1)… i]. The first mismatch occurs at position P[q + 1] and T[i + 1]. Slide the pattern right by x, where x is the length of the longest proper prefix of P[1…q] and also the longest suffix of P[1…q]
Working principle
If P[1…q] matches with T[(i – q + 1) …i], there are two possibilities :
- If P[q + 1] = T[i + 1], matched string length is increased by one. If q = m, a complete match is found and the algorithm terminates.
- If P[q + 1] ≠ T[i + 1], then the slide pattern P to right by longest common suffix and prefix of P[1…q].
Knuth Morris Pratt – Algorithm
Algorithm PREFIX_FUNCTION(P)
// P is the pattern of length m
π[1] ← 0
k ← 0
for q ← 2 to m do
while k > 0 and P[k+1] ≠ P[q] do
k ← π[k]
end
if P[k+1] == P[q] then
k ← k + 1
end
π[q] ← k
end
return π
The following algorithm finds the pattern within the text :
Algorithm KMP_STRING_MATCHING(T, P)
// T is text of length n
// P is pattern of length m
π ← PREFIX_FUNCTION(P)
q ← 0
for i ← 1 to n do
while q > 0 and P[q + 1] ≠ T[i] do
q ← π[q]
end
if P[q + 1] == T[i] then
q ← q + 1
end
end
if q == m then
print “Pattern found with shift”, i – m
end
q ← π[q]
Complexity Analysis
Function π is computed in O(m) time where as the actual string-matching routine scans n characters. So worst case running time of the algorithm is given by O(m + n).
Examples
Example: Compute the prefix function for pattern ababaca
Solution:
Initially, k = 0, q = 2, π[1] = 0 and m = 7
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 |
Iteration 1 : q = 2, k = 0
k is not greater than 0, so skip while loop
P[k + 1] = P[1] = a and P[q] = P[2] = b
P[k + 1] ≠ P[q], so π[q] = k → π[2] = 0
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 |
Iteration 2 : q = 3, k = 0
k is not greater than 0, so skip while loop
P[k + 1] = P[1] = a and P[q] = P[3] = a
P[k + 1] = P[q], so k = k + 1 → k = 1
π[q] = k → π[3] = 1
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 |
Iteration 3 : q = 4, k = 1
k is greater than 0, so enter into while loop
P[k + 1] = P[2] = b and P[q] = P[4] = b,
k > 0 but P[k + 1] = P[q] // break while loop
P[k + 1] = P[q], so k = k + 1 → k = 2
π[q] = k → π[4] = 2
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 | 2 |
Iteration 4 : q = 5, k = 2
k is greater than 0, so enter into while loop
P[k + 1] = P[3] = a and P[q] = P[5] = a,
k > 0 but P[k + 1] = P[q] // break while loop
P[k + 1] = P[q], so k = k + 1 → k = 3
π[q] = k → π[5] = 3
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 | 2 | 3 |
Iteration 5 : q = 6, k = 3
k is greater than 0, so enter into while loop
P[k + 1] = P[4] = b and P[q] = P[6] = c,
k > 0 and P[k + 1] ≠ P[q] // Enter in while loop
k = π[k] → k = π[3] = 1
Now, P[k + 1] = P[2] = b and P[q] = c
Still k > 0 and P[k + 1] ≠ P[q] // Enter in while loop again
k = π[k] → k = π[1] = 0
As k = 0, While loop breaks now
P[k + 1] = P[1] = a and P[q] = P[6] = c
P[k + 1] ≠ P[q], So, π[q] = k → π[6] = 0
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 | 2 | 3 | 0 |
Iteration 6 : q = 7, k = 0
k is not greater than 0, so skip while loop
P[k + 1] = P[1] = a and P[q] = P[7] = a
P[k + 1] = P[q], so k = k + 1 = 1
π[q] = k → π[7] = 1
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
P[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
Example: Check if pattern P = abababca exits in text T = bacbababaabcbab.
Solution:
- The π function for a given pattern would be,
i | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
P[i] | a | b | a | b | a | b | c | a |
p[i] | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
- The first partial match is found in the second position. The length of this partial match is 1. π[1] = 0, so we won’t be able to skip ahead. We will continue matching by moving patterns right by one.
- The next partial match is found in position 5. The length of this match is 5. The next character of P and T does not match. π[5] =3, which means we get to skip ahead.
Skip = partial length – π[partial length]
= 5 – π[5] = 5 – 3 = 2
- So we can skip two characters ahead.
- The next partial match is found in position 7. The length of this match is 3. The next character of P and T does not match. π[3] =1, which means we get to skip ahead. The Cross indicates skipped characters.
Skip = partial length – π[partial length]
= 3 – π[3] = 3 – 1 = 2
- So we can skip two characters ahead.
- Now, the pattern is longer than the remaining text, so the match does not found.
Comparison of All Methods
If m is the length of the pattern and n is the length of the text to be searched then,
Algorithm | Preprocessing time | Matching time |
Naïve string matching algorithm | 0 | O((n-m)*m) |
Rabin Karp string matching algorithm | O(m) | Average: O(n + m) Worst: O((n-m)*m) |
Finite state automata | O(m |S|) | O(n) |
Knuth-Morris-Pratt algorithm | O(m) | O(n) |
Youtube Channel: https://shorturl.at/inryW