I think I conceptually understand RabinKarp pattern matching algorithm using rolling hash. While going through a sample implementation here, I find that a large prime number q
is added to the previously computed rolling hash.
for (int i = m; i < n; i++) {
// Remove leading digit, add trailing digit, check for match.
txtHash = (txtHash + q - RM*txt.charAt(i-m) % q) % q; //Why +q here?
txtHash = (txtHash*R + txt.charAt(i)) % q;
// match
int offset = i - m + 1;
if ((patHash == txtHash) && check(txt, offset))
return offset;
}
I am not sure why this is needed. Can I get some help with this?
In my limited testing, I get the same result whether or not q
term in included.
Does this have something to do with which version of algorithm (Monte Carlo/Las Vegas) is being implemented?
The +q
term is there to avoid dealing with negative numbers.
We want txtHash
to always lie in the interval [0;q[
, without this +q
it could also be in ]-q;0[
.
This could lead to missing the pattern. e.g if patHash = 0xdead
but you compute txtHash = -q+0xdead
. Those two values are equal mathematically mod q
but different with Java taken % q
.