c++cstringalgorithmhash

Rolling hash in Rabin-Karp


I am trying to implement the Rabin-Karp for finding the substring; and I got stuck at the rolling hash(trying to use the formula suggested in Wikipedia).

#define MOD 1000000007
unsigned long long rolling_hash(const char *str)
{
        unsigned long long hash = 0;
        size_t str_len = strlen(str);
        for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                hash = hash + str[i] * pow(257, k);
        //      hash = hash % MOD;
        }
        return hash;
}

int main(void)
{
        printf("%llu\n", rolling_hash("TestString"));
        printf("%llu\n", rolling_hash("estStringh"));
        unsigned long long old = rolling_hash("TestString");
        // Add a character to the end
        // since the last char in old was multiplied by 1, now multiply it by
        // the base and then add the _new_ character to the end
        old = old * 257 + 'h';
        //old = old % MOD;
        // Remove a char from the start
        // Simply, remove the hash value of the first character
        old = old - 'T' * pow(257, 10);;

        printf("\n%llu\n", old);
        return 0;
}

The code above works perfectly fine as long as I do not introduce any remainder operations; once I uncomment my % operations, things break down and the answer I get from the changes over the rolling hash won't equal that which's being printed by the second print.

janisz's answer:
The suggestion to change the hash generator as in janisz's answer got the remainder to work when adding new characters but NOT when removing the old ones.
Note: I am using my own pow function to work with unsigned long long


Solution

  • Hash genrator code is wrong. It should be

    hash = (hash*257 + str[i]) % MOD;
    

    and unncoment old_hash = old_hash % MOD;. Also change the way you generate new hash from previous

    (old_hash - to_delete_char * pow(257, str_len-1)) % MOD;
    

    Take a look at your code. First 2 lines are perfectly good. What happen in the loop. First of all you are doing as much multiplies as you can. In my approach I use Horner scheme of computing hash becouse hash is a polynomial.

    Why it works when without modulus and with not. I think it's a coincidence becouse you overflow integer with 8 characters (log(2^64)/log(257) = 8).

    Now what's wrong with removing characters. to_delete_char * pow(257, str_len); should be to_delete_char * pow(257, str_len-1); index should start from 0 not 1 to mach your generator.

    EDIT: I think problem was in pow function. As I wrote above it overflow just with 8 characters. In your example you have 10 so it can't work.

    EDIT: It turns out that adding and removing character must be done as a one operation. Probably due to equivalents but I'm not sure.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    
    #define MOD 787
    
    unsigned long long pow(int x, int y)
    {
        unsigned long long ret = 1;
        for (int i=0;i<y;i++)
            ret = (ret*x)%MOD;
        return ret;
    }
    unsigned long long rolling_hash(const char *str)
    {
            unsigned long long hash = 0;
            size_t str_len = strlen(str);
            for(int i = 0, k = str_len -1; i < str_len; i++, k--) {
                    hash = hash + (str[i] * pow(257, k))%MOD;
                    hash = hash % MOD;
            }
            return hash;
    }
    
    int main(void)
    {
            char input[] = "TestString";
            printf("Input: %llu\n", rolling_hash(input));
            printf("Expected: %llu\n", rolling_hash("estStringh"));
            unsigned long long old = rolling_hash(input);
            // Add a character to the end
            // and Remove a char from the start
    
            unsigned long long  h = (input[0] * pow(257, strlen(input)))%MOD;
            old = ((old * 257) + 'h' - h) % MOD;
    
            printf("Actual: %llu\n", old);
            return 0;
    }