# Rabin-Karp String Matching Algorithm

## Working Mechanism

- Comparing numbers is easier and cheaper than comparing strings. Rabin Karp algorithm represents strings in numbers.
- Suppose p represents values corresponding to pattern P[1…m] of length m. And t
_{s}represents values of m-length substrings T[(s + 1) … (s + m)] for s = 0, 1, 2, …, n – m. - We can compute p in O(m) time and all t
_{s}can be computed in O(n – m + 1) time. - Rabin Karp algorithm is based on hashing technique. It first computes the hash value of p and t
_{s}. - If hash values are same, i.e. if hash(p) = hash(t
_{s}), we check the equality of inverse hash similar to a naïve method. If hash values are not same, no need to compare actual string. - On the hash match, actual characters of both strings are compared using brute force approach. If the pattern is found, then it is called
*hit*. Otherwise, it is called a*spurious hit*. - For example, let us consider the hash value of string T = ABCDE is 38 and hash of string P = ABCDX is 71. Clearly, hash values are not same, so strings cannot be same. Brute force approach does five comparisons whereas Rabin Karp dose only one comparison.
- However, the same hash value does not ensure the string match. Two different strings can have same hash values. That is why we need to compare them character by character on hash hit.
- Fig. 10.4.2 shows the difference between Brute Force and Rabin Karp approach.

- Given pattern P[1…m], we can derive its numeric value p in base d in O(m) time as follow:

p = P[m] + d(P[m – 1] + d(P[m – 2] + … + d(P[2] + dP[1]) … ))

- Similarly, we can derive numeric value of first substring t
_{0}of length m from text T[1…n] in O(m) time. Remaining all t_{i}, i = 1, 2, 3, …, n – m, can be derived in constant time. - Given t
_{s}, we can compute t_{s}+1 as,

t_{s+1 }= d(t_{s} – d^{m-1} T[s + 1]) + T[s + m + 1]

- Assume that T = [4, 3, 1, 5, 6, 7, 5, 9, 3] and P = [1, 5, 6]. Here length of P is 3, so m = 3. Consider that, for given pattern P, its value p = 156, and t
_{0}= 431.

t_{1 }= 10(431 – 10^{2}T[1]) + T[4]

= 10(431 – 400 ) + 5

= 315

- Values of p and t
_{s}may be too large to process. We can reduce these values by taking its modulo with suitable number q. typically, q is a prime number. - Mod function has some nice mathematical property.
- [(a mod k) + (b mod k)] mod k = (a + b) mod k

- (a mod k) mod k = a mod k

- Computing hash value of every subsequence of m character of text may turn out to be time-consuming. However, a bit of mathematics makes it easier.
- Suppose t
_{s}represents decimal value of substring

T[(s + 1) … (s + m)]. If hash of t_{s}is known, hash value can directly be derived for t_{s+1}as follow :

Hash for t_{s+1} = (d*(t_{s} – T[s + 1]h) + T[s + m + 1]) mod q,

Where h = d^{m – 1} mod q

**Two facts :**

t_{s} = p mod q, does not mean t_{s} = p

t_{s} ≠ p mod q, means t_{s} ≠ p

- If p = 45365, t
_{s}= 64371 and q = 11,

45365 mod 13 = 8

64371 mod 13 = 8

- From above example, it is clear that two
*different*strings might have same mod. We can reject all negative tests to rule out all possible invalid shifts. And all positive tests must be validated to overcome spurious hits. - Spurious hits reduce with a larger value of q, and it increases with smaller q.

## Algorithm

**Algorithm** RABIN_KARP(T, P)
// T is text of length n
// P is pattern of length m
h ← power(d, m – 1) mod q
p ← 0
t_{0} ← 0
**for** i ← 1 to m **do**
p ← (d*p + P[i]) mod q
t_{0} ← ((d * t_{0} ) + T[i]) mod q
**end**
**for** s ← 0 to n – m **do**
**if** p == t_{0} **then**
**if** P[1…m] == T[s+1 … s + m] **then**
**print** “Match found at shift”, s
**end**
**end**
**if** s < (n – m) **then**
t_{s + 1} ← (d*(t_{s} – T[s + 1]) * h ) + T[s + m + 1]
**end**
**end**

## Complexity Analysis

- Rabin Karp algorithm is a randomized algorithm. In most of the cases, it runs in linear time, i.e. in O(n). However, the worst case of Rabin-Karp algorithm is as bad as a naïve algorithm, i.e. O(mn), but it’s rare.
- It can happen only when prime number used for hashing is very small.

## Examples

**Example: Explain spurious hits in Rabin-Karp string matching algorithm with an example. Working modulo q = 13, how many spurious hits does the Rabin-Karp matcher encounter in the text T = 2359023141526739921 when looking for the pattern P = 31415 ? **

**Solution:**

Given pattern P = 31415, and prime number q = 13

P mod q = 31415 mod 13 = 7

Let us find hash value for given text :

T[1…5] = 23590 mod 13 = 8

T[2…6] = 35902 mod 13 = 9

:

T[(m – 4) … m] = 39921 mod 13 = 11

- The hash value of pattern P is 7. We have two such values in a hash(T). So there may be a spurious hit or actual string. In given text T, one hit is an actual match and one is spurious hit.
- If the hash of pattern and any substring in the text is same, we have two possibilities :

**(i) Hit**** :** Pattern and text are same

**(ii) Spurious hit**** :** Hash value is same but pattern and corresponding text is not same

Here, m = length of pattern = 5.

Consider

t_{s} = 23590. Next value t_{s+1} is derived as,

t_{s+1} = 10(t_{s}) – 10^{m}* T[s+1] + T[s + m + 1] )

= (10*23590) – 10^{5} * 2 + 2

= 235900 – 200000 + 2 = 35902

t_{s+2} = 10(t_{s+1}) – 10^{m}* T[s+2] + T[s + m + 2] )

= (10*35902) – 10^{5} * 3 + 3

= 359020 – 300000 + 3 = 59023

- In same way, we can compute the next t
_{s+i}using incremental approach. - Rabin Karp algorithm matches hash value, rather than directly comparing actual string value. If hash value of pattern P and the hash value of subsequence in string T are same, the actual value of strings is compared using brute force approach. Like t
_{s+1}, we can also derive hash value incrementally as shown below.

Calculation for hash of 14152

If t_{s} is known, the hash value can directly be derived for t_{s+1} as follow.

Hash for t_{s+1 }= (d*(t_{s} – T[s+1]h) + T[s + m + 1]) mod q

= 10*(31415 – (3*10^{4} mod 13)) + 2 mod 13

= 10(31415 – 9) + 2 mod 13

= 314062 mod 13 = 8