Currently working on a string search program for a uni module, and I've managed to implement the algorithms successfully, at least to the point where they find the string consistently. I implemented Boyer-Moore and Rabin-Karp. I also threw in a brute force search after one of my coursemates hit this exact problem, and realised I had the same issue: brute force is faster than Rabin-Karp on a wordlist.
Rabin-Karp seems to spend most of its time performing the rolling hash. At first I wondered whether I was just getting a lot of collisions, but I managed to get collisions down to 3 with a huge prime number. That added a little time, I'm assuming due to the size of the prime, but it seemed pretty clear that the rolling hash itself was causing the problem.
This is the rolling hash section:
//hashes don't match, rehash using the rolling hash to slide on to the next window
if (counter < (stringLength - patternLength)) {
    stringHash = (MAXCHAR * (stringHash - stringFile[counter] * hash)
                  + stringFile[counter + patternLength]) % prime;
    if (stringHash < 0) {
        stringHash += prime; //the subtraction can make the hash negative, so wrap it back into range
    }
}
if (!found) {
    counter++;
}
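For context, the hashes are initialised before that loop along these lines (simplified); the rolling update above only works if hash holds MAXCHAR^(patternLength - 1) % prime:

//precompute the weight of the outgoing character: MAXCHAR^(patternLength - 1) % prime
long hash = 1;
for (int i = 0; i < patternLength - 1; i++) {
    hash = (hash * MAXCHAR) % prime;
}
//build the initial hashes of the pattern and of the first text window
long patternHash = 0, stringHash = 0;
for (int i = 0; i < patternLength; i++) {
    patternHash = (MAXCHAR * patternHash + pattern[i]) % prime;
    stringHash = (MAXCHAR * stringHash + stringFile[i]) % prime;
}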
I wanted to try searching through a huge text file, so I used the rockyou wordlist. Boyer-Moore is very happy with it, and Rabin-Karp takes under a second. Brute force is taking less than half the time of Rabin-Karp, though, which to me just doesn't make sense?
Am I misunderstanding how these algorithms should be applied, or is there a problem with the rolling hash process that I'm using?
Brute force string search is the special case of Rabin-Karp with a constant hash function (so every rolling hash matches).
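To make that special case concrete, here's a minimal brute force search in C (the names are mine, not from the question): the hash check is simply gone, so every window goes straight to the character comparison that Rabin-Karp only performs on a hash match.

#include <string.h>

//brute force: behave as if the hash matched at every position
int bruteForceSearch(const char *text, int n, const char *pattern, int m) {
    for (int i = 0; i + m <= n; i++) {
        if (memcmp(text + i, pattern, m) == 0) {
            return i; //match at position i
        }
    }
    return -1; //no match
}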
The worst case complexity is the same for both algorithms (O(nm)), as is the average case complexity for most definitions of "average case" (linear in the length of the text).
In these situations, Rabin-Karp will take longer due to the overhead of computing and checking a good hash: a multiply-and-mod hash update at every position costs more than the single character comparison brute force typically needs to reject a position.
The problem with brute force, compared to Rabin-Karp, is that the bad cases do sometimes occur in real life. If you're searching for pathnames, for example, it can happen that your pattern shares a long prefix with many or most of the pathnames and partial pathnames in the file, and that will make brute force take a long time.
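You can demonstrate this with a deliberately repetitive input. In this sketch (a hypothetical demo, not your code), every window matches the first three characters of the pattern before failing, so the comparison count grows like n * m instead of n:

#include <stdio.h>
#include <string.h>

int main(void) {
    //pathological input: a long run of 'a' searched for "aaab"
    char text[10001];
    memset(text, 'a', 10000);
    text[10000] = '\0';
    const char *pattern = "aaab";
    int n = 10000, m = 4;
    long comparisons = 0;

    for (int i = 0; i + m <= n; i++) {
        for (int j = 0; j < m; j++) {
            comparisons++;
            if (text[i + j] != pattern[j]) {
                break; //only the final 'b' ever mismatches
            }
        }
    }
    //prints roughly n * m (about 40000) comparisons, not n
    printf("%ld comparisons for n=%d, m=%d\n", comparisons, n, m);
    return 0;
}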
With Rabin-Karp, the bad cases are very unlikely to occur in real life. They really only happen under "adversarial" conditions, where the file and pattern are constructed purposefully to take a long time, with specific knowledge of the hash function you use.
Even so... Rabin-Karp is not a great algorithm for single-pattern searching. It becomes much more useful when you're searching for many strings simultaneously and you can look up the rolling hash in a dictionary of potential matches.
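Here's a sketch of that multi-pattern use (hypothetical code; it assumes equal-length ASCII patterns and uses a small prime for readability): one rolling hash per window is compared against every pattern's hash, and only a hash match triggers a real comparison. With many patterns you'd replace the inner scan with a lookup in a hash table keyed on the hash value.

#include <stdio.h>
#include <string.h>

#define MAXCHAR 256
#define PRIME 101 //small prime for readability; a large one cuts collisions

//search one text for up to 16 equal-length patterns with a single rolling hash
void multiSearch(const char *text, const char *patterns[], int k, int m) {
    int n = (int)strlen(text);
    long h = 1, textHash = 0;
    long patternHash[16]; //sketch assumes k <= 16

    //h = MAXCHAR^(m - 1) % PRIME, the weight of the outgoing character
    for (int i = 0; i < m - 1; i++) {
        h = (h * MAXCHAR) % PRIME;
    }
    //hash every pattern once, up front
    for (int p = 0; p < k; p++) {
        patternHash[p] = 0;
        for (int i = 0; i < m; i++) {
            patternHash[p] = (MAXCHAR * patternHash[p] + patterns[p][i]) % PRIME;
        }
    }
    //hash the first window of the text
    for (int i = 0; i < m; i++) {
        textHash = (MAXCHAR * textHash + text[i]) % PRIME;
    }
    //slide the window; one hash update serves every pattern
    for (int i = 0; i + m <= n; i++) {
        for (int p = 0; p < k; p++) {
            if (textHash == patternHash[p] && memcmp(text + i, patterns[p], m) == 0) {
                printf("pattern %d found at position %d\n", p, i);
            }
        }
        if (i + m < n) {
            textHash = (MAXCHAR * (textHash - text[i] * h) + text[i + m]) % PRIME;
            if (textHash < 0) {
                textHash += PRIME; //the subtraction can make the hash negative
            }
        }
    }
}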