Understanding how rolling hash works with modulus in Rabin Karp algorithm

Asked 23/4, 2016 at 17:26 Answered 17/8, 2024 at 9:32

I am having trouble understanding how the rolling hash algorithm works after the hash has been reduced to modulus value by dividing by a prime number.

Consider the sequence of 5 digits in the number 123456.

The first chunk is 12345. I store the value, in the next window, 6 comes in and 1 goes out.

So the new hash will be (12345-1*10^4)*10 + 6 = 23456. This is fairly intuitive.

As obvious, these numbers are large, so we need a modulus function to keep them small. Suppose I take 101 as the prime number for this purpose.

So 12345 will reduce to 23. How then, from this, will I derive the rolling hash for the next window, 23456?

Myers answered 23/4, 2016 at 17:26 Comment(0)

You calculate it the same way you calculate 23456, but always with modulo 101.

(((23 - (10^4 mod 101))*10) mod 101 + 6) mod 101 = 24.

which is the value you want because 23456 mod 101 = 24.

Breathy answered 23/4, 2016 at 18:3 Comment(2)

Thanks. That was easy to understand! – Myers 23/4, 2016 at 18:25

Did you miss the first one? Shouldn't it be (((23 - (1*10^4 mod 101))*10) mod 101 + 6) mod 101 = 24 ? – Alliterative 19/6, 2020 at 21:44

Answer by @dejvuth is right - i would add this specifically when doing rabin-karp is that sometimes you may end up with a -ve modulus value - in that case it is better to take the +ve equivalent of that modulus value - so that checking if the same modulus was seen before is easier.

For example: with this pattern "abcdabc" - and hash function: hash(i) = (49*S[i]+7*S[i+1]+1*S[i+2])%1123

Results in:

"abc" -> 1046
"bcd" -> 1103
"cda" -> 33
"dab" -> 62
"abc" -> -77

second occurence of "abc" results is -77 that is modulo equivalent of 1046 since (-77 + 1123 = 1046)

PS: i don't have enough "reputation" at this time to add this as a comment..

Wuhu answered 22/7, 2020 at 20:35 Comment(1)

I cannot thank you enough for this answer. I do not know modular arithmetic yet, so I wasted 2 hours on basically rectifying what was not broken. – Christianna 6/1, 2021 at 22:20

Takes me really a long while pondering the answer given by @dejvuth, my mathematical intuition is simply too poor :(

I put the justification below just in case anyone hits the same confusion as I did, also kindly teach if anything wrong:

Firstly note the distributive laws in modulo:

(a + b) mod n = [(a mod n) + (b mod n)] mod n

a * b mod n = [(a mod n) * (b mod n)] mod n

Now let's define a hash function below, where 'c' is the character we currently processing, q is a small prime while m is the largest prime smaller than word size:

h = 0
for c in window:
    h = ( h * q + ord(c) ) % m

For simplicity and without loss of generality, let's analyze using window size of 2, note I denote ord(c1) simply as c1, etc:

h = ( ( ( 0 * q + c0 ) % m ) * q + c1 ) % m )
  = ( c0 * q^1 % m + c1 * q^0 ) % m
  = ( c0 * q^1 % m + c1 * q^0 % m ) % m, due to distributive law and a % b % b = a % b
  = ( c0 * q^1 + c1 * q^0 ) % m, due to distributive law

Now it becomes obvious that, even though modulo is used, the series can still be written in the form of

h = ( ∑ ci * q ^ ( n - i + 1 ) ) % m, where n is the window size.

Finally, let's see what need be done when window size is exceeded and we need to get rid of the first char in the window, still we use window size of 2 to demonstrate, please pay extra attention on which part the modulo is applied so as to get the result:

h = ( h - ( c0 * q^1 ) % m ) % m
  = ( ( c0 * q^1 + c1 * q^0 ) % m - ( c0 * q^1 ) % m ) % m
  = ( ( c0 * q^1 + c1 * q^0 ) - ( c0 * q^1 ) ) % m
  = ( c1 * q^0 ) % m

Bang, now we get what we want, the hash value is just the same as c1 being the first char in current window, simply let the rolling go on an on until we exhaust the string.

Syncopation answered 17/8, 2024 at 9:32 Comment(0)

Recommended topics

Hot tags