php, c++, mysql, murmurhash

How to generate a 64bit Murmur hash v2 in PHP 7.2?


I've got a MySQL database containing Murmur2 hashes (stored as unsigned 64-bit integers) that were generated with the Percona UDF that ships with Percona's distribution of MySQL: https://github.com/percona/build-test/blob/master/plugin/percona-udf/murmur_udf.cc

My problem is that I now need to generate the same hashes on the PHP side, but I can't find an existing implementation, or tweak one, so that it produces the same output for the same input.

Things I've tried:

  1. Copying the C++ function from the Percona UDF into my fork of a PHP extension that originally produced 32-bit hashes: https://github.com/StirlingMarketingGroup/php_murmurhash. This almost worked, in that it compiled, but when I call the function from PHP, the Apache server crashes with a segfault, and I'm not familiar enough with C++ or PHP extensions to debug it.

The segfault is triggered by this call:

var_dump(murmurhash('Hello World'));

The same call works fine with the original extension, https://github.com/kibae/php_murmurhash, which produces 32-bit hashes, when I follow its instructions. Once I replaced the hash function (the only change is in MurmurHash2.cpp: https://github.com/StirlingMarketingGroup/php_murmurhash/blob/master/MurmurHash2.cpp), the same call crashes the PHP script.

  2. Porting the Percona UDF C++ function to PHP. I'm not sure my PHP function handles the pointer incrementing correctly, but I suspect the main reason the PHP version produces entirely different output is that PHP doesn't support unsigned integers.

Here is the PHP function that I've written as a port from the Percona C++ function

function murmurhash2(string $s) : int {
    $len = strlen($s);
    $seed = 0;

    $m = 0x5bd1e995;
    $r = 24;

    $h1 = $seed ^ $len;
    $h2 = 0;

    $i = 0;

    while ($len >= 8) {
        $k1 = ord($s[$i++]);
        $k1 *= $m; $k1 ^= $k1 >> $r; $k1 *= $m;
        $h1 *= $m; $h1 ^= $k1;
        $len -= 4;

        $k2 = ord($s[$i++]);
        $k2 *= $m; $k2 ^= $k2 >> $r; $k2 *= $m;
        $h2 *= $m; $h2 ^= $k2;
        $len -= 4;
    }

    if ($len >= 4) {
        $k1 = ord($s[$i++]);
        $k1 *= $m; $k1 ^= $k1 >> $r; $k1 *= $m;
        $h1 *= $m; $h1 ^= $k1;
        $len -= 4;
    }

    switch ($len) {
        case 3: $h2 ^= ord($s[2]) << 16;
        case 2: $h2 ^= ord($s[1]) << 8;
        case 1: $h2 ^= ord($s[0]);
                $h2 *= $m;
    };

    $h1 ^= $h2 >> 18; $h1 *= $m;
    $h2 ^= $h1 >> 22; $h2 *= $m;
    $h1 ^= $h2 >> 17; $h1 *= $m;

    $h = $h1;

    $h = ($h << 32) | $h2;
    return $h;
}
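For comparison, here is a minimal C++ sketch of the building blocks that a PHP port like the one above has to reproduce. It assumes the C++ original walks the key through a pointer to 32-bit unsigned integers on a little-endian machine (which the `$len -= 4` steps suggest) and that all arithmetic wraps at 32 bits. The helper names `read_u32_le`, `mix32`, and `combine64` are made up for illustration and are not taken from the Percona source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: load four consecutive bytes as one little-endian
// 32-bit word. Note that ord($s[$i++]) in the PHP port reads a single byte.
inline std::uint32_t read_u32_le(const unsigned char* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Hypothetical helper: the "k *= m; k ^= k >> r; k *= m" step from the port.
// With uint32_t, every multiplication silently wraps modulo 2^32.
inline std::uint32_t mix32(std::uint32_t k) {
    const std::uint32_t m = 0x5bd1e995u;  // same constant as $m in the port
    const int r = 24;                     // same shift as $r in the port
    k *= m;
    k ^= k >> r;
    k *= m;
    return k;
}

// Hypothetical helper: join two 32-bit halves into one unsigned 64-bit value,
// the equivalent of "$h = ($h1 << 32) | $h2" without sign or overflow issues.
inline std::uint64_t combine64(std::uint32_t h1, std::uint32_t h2) {
    return (static_cast<std::uint64_t>(h1) << 32) | h2;
}
```

Under these assumptions, the port differs from the C++ in three ways: it consumes one byte per step instead of four, its intermediate products are not reduced to 32 bits, and its tail `switch` indexes `$s[0]` to `$s[2]` from the start of the string rather than from the current offset. In PHP, `unpack('V', ...)` and masking each product with `& 0xFFFFFFFF` would be one way to mimic the first two points on a 64-bit build, though that has not been checked against the Percona output here.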

Within MySQL I get this

select murmur_hash('Hello World'), cast(murmur_hash('Hello World') as unsigned), conv(cast(murmur_hash('Hello World') as unsigned), 10, 16);
-- -8846466548632298438 9600277525077253178 853B098B6B655C3A

And in PHP I get

var_dump(murmurhash2('Hello World'));
// int(5969224437940092928)

Neither the signed nor the unsigned MySQL value matches my PHP output.

Is there something that can be fixed with either of my previous two approaches, or maybe an already working approach that I can use instead?


Solution

  • I've solved this myself by porting the Percona hashing function from its MySQL UDF directly into a PHP extension.

    Installation and usage instructions are posted here https://github.com/StirlingMarketingGroup/php-murmur-hash


    Example output

    In MySQL, the Percona extension is used like this:

    select`murmur_hash`('Yeet')
    -- -7850704420789372250
    

    And in PHP

    php -r 'echo murmur_hash("Yeet");'
    // -7850704420789372250
    

    Note that both environments treat the result as a signed integer. In MySQL you can get the unsigned value with cast(`murmur_hash`('Yeet') as unsigned), but PHP doesn't support unsigned integers, so the value stays signed there.
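The signed and unsigned results are the same 64 bits read two different ways. The following C++ sketch illustrates that reinterpretation, using the values quoted in this thread; the helper names `as_unsigned`, `unsigned_decimal`, and `unsigned_hex` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Reinterpret a signed 64-bit value as unsigned (reduction modulo 2^64).
// The bit pattern stays the same; only its interpretation changes.
inline std::uint64_t as_unsigned(std::int64_t v) {
    return static_cast<std::uint64_t>(v);
}

// Unsigned decimal string, comparable to MySQL's cast(... as unsigned).
inline std::string unsigned_decimal(std::int64_t v) {
    return std::to_string(as_unsigned(v));
}

// 16-digit uppercase hex string, comparable to MySQL's conv(..., 10, 16)
// for values whose top hex digit is non-zero.
inline std::string unsigned_hex(std::int64_t v) {
    char buf[17];
    std::snprintf(buf, sizeof buf, "%016llX",
                  static_cast<unsigned long long>(as_unsigned(v)));
    return std::string(buf);
}
```

With these helpers, `unsigned_decimal(-8846466548632298438LL)` yields `9600277525077253178` and `unsigned_hex` of the same value yields `853B098B6B655C3A`, matching the three MySQL columns shown in the question. On the PHP side, `sprintf` documents a `%u` conversion that prints an integer as an unsigned decimal, so `sprintf('%u', murmur_hash('Yeet'))` may be enough when only a string representation is needed; that call was not tested against the extension.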