(I am VERY sorry I am not allowed to add many URLs to help me better explain my problems in this post because I am new on StackOverflow and my StackOverflow account has very low privilege).
Summary
Can anyone please guide me on how to modify murmurhash3.js
(below) so that it produces the same hash as MurmurHash3.cpp
(below) does? I provide a simple python code "simple_python_wrapper.py" for MurmurHash3.cpp based on what I need. The simple_python_wrapper.py should work on your computer if you have sklearn installed.
I have heavily used Murmurhash3.cpp
(shown below) while using transform
from sklearn (a Python Machine Learning Library): from sklearn.feature_extraction._hashing import transform
in one of my machine learning projects. transform
uses Murmurhash3.cpp
deep down in sklearn's implementation/import tree.
More details
hash % (2^18) {that is "hash modulus 2^18"} based on MurmurHash3.cpp
"hello" gives 260679
"there" gives 45525
hash % (2^18) {that is "hash modulus 2^18"} based on murmurhash3.js
"hello" gives -58999
"there" gives 65775
murmurhash3.js
/*
* The MurmurHash3 algorithm was created by Austin Appleby. This JavaScript port was authored
* by whitequark (based on Java port by Yonik Seeley) and is placed into the public domain.
* The author hereby disclaims copyright to this source code.
*
* This produces exactly the same hash values as the final C++ version of MurmurHash3 and
* is thus suitable for producing the same hash values across platforms.
*
* There are two versions of this hash implementation. First interprets the string as a
* sequence of bytes, ignoring most significant byte of each codepoint. The second one
* interprets the string as a UTF-16 codepoint sequence, and appends each 16-bit codepoint
* to the hash independently. The latter mode was not written to be compatible with
* any other implementation, but it should offer better performance for JavaScript-only
* applications.
*
* See http://github.com/whitequark/murmurhash3-js for future updates to this file.
*/
var MurmurHash3 = {
mul32: function(m, n) {
var nlo = n & 0xffff;
var nhi = n - nlo;
return ((nhi * m | 0) + (nlo * m | 0)) | 0;
},
hashBytes: function(data, len, seed) {
var c1 = 0xcc9e2d51, c2 = 0x1b873593;
var h1 = seed;
var roundedEnd = len & ~0x3;
for (var i = 0; i < roundedEnd; i += 4) {
var k1 = (data.charCodeAt(i) & 0xff) |
((data.charCodeAt(i + 1) & 0xff) << 8) |
((data.charCodeAt(i + 2) & 0xff) << 16) |
((data.charCodeAt(i + 3) & 0xff) << 24);
k1 = this.mul32(k1, c1);
k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17); // ROTL32(k1,15);
k1 = this.mul32(k1, c2);
h1 ^= k1;
h1 = ((h1 & 0x7ffff) << 13) | (h1 >>> 19); // ROTL32(h1,13);
h1 = (h1 * 5 + 0xe6546b64) | 0;
}
k1 = 0;
switch(len % 4) {
case 3:
k1 = (data.charCodeAt(roundedEnd + 2) & 0xff) << 16;
// fallthrough
case 2:
k1 |= (data.charCodeAt(roundedEnd + 1) & 0xff) << 8;
// fallthrough
case 1:
k1 |= (data.charCodeAt(roundedEnd) & 0xff);
k1 = this.mul32(k1, c1);
k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17); // ROTL32(k1,15);
k1 = this.mul32(k1, c2);
h1 ^= k1;
}
// finalization
h1 ^= len;
// fmix(h1);
h1 ^= h1 >>> 16;
h1 = this.mul32(h1, 0x85ebca6b);
h1 ^= h1 >>> 13;
h1 = this.mul32(h1, 0xc2b2ae35);
h1 ^= h1 >>> 16;
return h1;
},
hashString: function(data, len, seed) {
var c1 = 0xcc9e2d51, c2 = 0x1b873593;
var h1 = seed;
var roundedEnd = len & ~0x1;
for (var i = 0; i < roundedEnd; i += 2) {
var k1 = data.charCodeAt(i) | (data.charCodeAt(i + 1) << 16);
k1 = this.mul32(k1, c1);
k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17); // ROTL32(k1,15);
k1 = this.mul32(k1, c2);
h1 ^= k1;
h1 = ((h1 & 0x7ffff) << 13) | (h1 >>> 19); // ROTL32(h1,13);
h1 = (h1 * 5 + 0xe6546b64) | 0;
}
if((len % 2) == 1) {
k1 = data.charCodeAt(roundedEnd);
k1 = this.mul32(k1, c1);
k1 = ((k1 & 0x1ffff) << 15) | (k1 >>> 17); // ROTL32(k1,15);
k1 = this.mul32(k1, c2);
h1 ^= k1;
}
// finalization
h1 ^= (len << 1);
// fmix(h1);
h1 ^= h1 >>> 16;
h1 = this.mul32(h1, 0x85ebca6b);
h1 ^= h1 >>> 13;
h1 = this.mul32(h1, 0xc2b2ae35);
h1 ^= h1 >>> 16;
return h1;
}
};
if(typeof module !== "undefined" && typeof module.exports !== "undefined") {
module.exports = MurmurHash3;
}
Here is the HTML code + Javascript I am using to test the javascript
https://jsbin.com/gicomikike/edit?html,js,output
<html>
<head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.0/jquery.min.js"></script>
<script src="murmurhash3.js"></script>
<script>
function call_murmurhash3_32_gc () {
var key = $('textarea#textarea1').val();
var seed = 0;
var hash = MurmurHash3.hashString (key, key.length, seed);
$('div#div1').text(hash);
}
</script>
</head>
<body>
Body
<form>
<textarea rows="4" cols="50" id=textarea1></textarea>
<br>
<input type="button" value="Hash" onclick="call_murmurhash3_32_gc()"/>
</form>
<div id=div1>
</div>
</body>
</html>
simple_python_wrapper.py
This makes use of MurmurHash3.cpp in sklearn's import tree.
from sklearn.feature_extraction._hashing import transform
import numpy as np
def getHashIndex (words):
raw_X = words
n_features = 262144 # 2 ** 18
dtype = np.float32 #np.float64
#transform(raw_X, Py_ssize_t n_features, dtype)
indices_a, indptr, values = transform (raw_X, n_features, dtype)
return indices_a
words = [[("hello", 1), ("there", 1)]]
print getHashIndex (words)
Output
[260679 45525]
MurmurHash3.cpp
I copied this code is available here
https://github.com/karanlyons/murmurHash3.js/blob/master/murmurHash3.js
//-----------------------------------------------------------------------------
// MurmurHash3 was written by Austin Appleby, and is placed in the public
// domain. The author hereby disclaims copyright to this source code.
// Note - The x86 and x64 versions do _not_ produce the same results, as the
// algorithms are optimized for their respective platforms. You can still
// compile and run any of them on any platform, but your performance with the
// non-native version will be less than optimal.
#include "MurmurHash3.h"
//-----------------------------------------------------------------------------
// Platform-specific functions and macros
// Microsoft Visual Studio
#if defined(_MSC_VER)
#define FORCE_INLINE __forceinline
#include <stdlib.h>
#define ROTL32(x,y) _rotl(x,y)
#define ROTL64(x,y) _rotl64(x,y)
#define BIG_CONSTANT(x) (x)
// Other compilers
#else // defined(_MSC_VER)
#if defined(GNUC) && ((GNUC > 4) || (GNUC == 4 && GNUC_MINOR >= 4))
/* gcc version >= 4.4 4.1 = RHEL 5, 4.4 = RHEL 6.
* Don't inline for RHEL 5 gcc which is 4.1 */
#define FORCE_INLINE attribute((always_inline))
#else
#define FORCE_INLINE
#endif
inline uint32_t rotl32 ( uint32_t x, int8_t r )
{
return (x << r) | (x >> (32 - r));
}
inline uint64_t rotl64 ( uint64_t x, int8_t r )
{
return (x << r) | (x >> (64 - r));
}
#define ROTL32(x,y) rotl32(x,y)
#define ROTL64(x,y) rotl64(x,y)
#define BIG_CONSTANT(x) (x##LLU)
#endif // !defined(_MSC_VER)
//-----------------------------------------------------------------------------
// Block read - if your platform needs to do endian-swapping or can only
// handle aligned reads, do the conversion here
FORCE_INLINE uint32_t getblock ( const uint32_t * p, int i )
{
return p[i];
}
FORCE_INLINE uint64_t getblock ( const uint64_t * p, int i )
{
return p[i];
}
//-----------------------------------------------------------------------------
// Finalization mix - force all bits of a hash block to avalanche
FORCE_INLINE uint32_t fmix ( uint32_t h )
{
h ^= h >> 16;
h *= 0x85ebca6b;
h ^= h >> 13;
h *= 0xc2b2ae35;
h ^= h >> 16;
return h;
}
//----------
FORCE_INLINE uint64_t fmix ( uint64_t k )
{
k ^= k >> 33;
k *= BIG_CONSTANT(0xff51afd7ed558ccd);
k ^= k >> 33;
k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53);
k ^= k >> 33;
return k;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x86_32 ( const void * key, int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 4;
uint32_t h1 = seed;
uint32_t c1 = 0xcc9e2d51;
uint32_t c2 = 0x1b873593;
//----------
// body
const uint32_t * blocks = (const uint32_t *)(data + nblocks*4);
for(int i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i);
k1 *= c1;
k1 = ROTL32(k1,15);
k1 *= c2;
h1 ^= k1;
h1 = ROTL32(h1,13);
h1 = h1*5+0xe6546b64;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*4);
uint32_t k1 = 0;
switch(len & 3)
{
case 3: k1 ^= tail[2] << 16;
case 2: k1 ^= tail[1] << 8;
case 1: k1 ^= tail[0];
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len;
h1 = fmix(h1);
*(uint32_t*)out = h1;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x86_128 ( const void * key, const int len,
uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
uint32_t h1 = seed;
uint32_t h2 = seed;
uint32_t h3 = seed;
uint32_t h4 = seed;
uint32_t c1 = 0x239b961b;
uint32_t c2 = 0xab0e9789;
uint32_t c3 = 0x38b34ae5;
uint32_t c4 = 0xa1e38b93;
//----------
// body
const uint32_t * blocks = (const uint32_t *)(data + nblocks*16);
for(int i = -nblocks; i; i++)
{
uint32_t k1 = getblock(blocks,i*4+0);
uint32_t k2 = getblock(blocks,i*4+1);
uint32_t k3 = getblock(blocks,i*4+2);
uint32_t k4 = getblock(blocks,i*4+3);
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b;
k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747;
k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35;
k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
uint32_t k1 = 0;
uint32_t k2 = 0;
uint32_t k3 = 0;
uint32_t k4 = 0;
switch(len & 15)
{
case 15: k4 ^= tail[14] << 16;
case 14: k4 ^= tail[13] << 8;
case 13: k4 ^= tail[12] << 0;
k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4;
case 12: k3 ^= tail[11] << 24;
case 11: k3 ^= tail[10] << 16;
case 10: k3 ^= tail[ 9] << 8;
case 9: k3 ^= tail[ 8] << 0;
k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3;
case 8: k2 ^= tail[ 7] << 24;
case 7: k2 ^= tail[ 6] << 16;
case 6: k2 ^= tail[ 5] << 8;
case 5: k2 ^= tail[ 4] << 0;
k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2;
case 4: k1 ^= tail[ 3] << 24;
case 3: k1 ^= tail[ 2] << 16;
case 2: k1 ^= tail[ 1] << 8;
case 1: k1 ^= tail[ 0] << 0;
k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len;
h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;
h1 = fmix(h1);
h2 = fmix(h2);
h3 = fmix(h3);
h4 = fmix(h4);
h1 += h2; h1 += h3; h1 += h4;
h2 += h1; h3 += h1; h4 += h1;
((uint32_t*)out)[0] = h1;
((uint32_t*)out)[1] = h2;
((uint32_t*)out)[2] = h3;
((uint32_t*)out)[3] = h4;
}
//-----------------------------------------------------------------------------
void MurmurHash3_x64_128 ( const void * key, const int len,
const uint32_t seed, void * out )
{
const uint8_t * data = (const uint8_t*)key;
const int nblocks = len / 16;
uint64_t h1 = seed;
uint64_t h2 = seed;
uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5);
uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f);
//----------
// body
const uint64_t * blocks = (const uint64_t *)(data);
for(int i = 0; i < nblocks; i++)
{
uint64_t k1 = getblock(blocks,i*2+0);
uint64_t k2 = getblock(blocks,i*2+1);
k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729;
k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5;
}
//----------
// tail
const uint8_t * tail = (const uint8_t*)(data + nblocks*16);
uint64_t k1 = 0;
uint64_t k2 = 0;
switch(len & 15)
{
case 15: k2 ^= uint64_t(tail[14]) << 48;
case 14: k2 ^= uint64_t(tail[13]) << 40;
case 13: k2 ^= uint64_t(tail[12]) << 32;
case 12: k2 ^= uint64_t(tail[11]) << 24;
case 11: k2 ^= uint64_t(tail[10]) << 16;
case 10: k2 ^= uint64_t(tail[ 9]) << 8;
case 9: k2 ^= uint64_t(tail[ 8]) << 0;
k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2;
case 8: k1 ^= uint64_t(tail[ 7]) << 56;
case 7: k1 ^= uint64_t(tail[ 6]) << 48;
case 6: k1 ^= uint64_t(tail[ 5]) << 40;
case 5: k1 ^= uint64_t(tail[ 4]) << 32;
case 4: k1 ^= uint64_t(tail[ 3]) << 24;
case 3: k1 ^= uint64_t(tail[ 2]) << 16;
case 2: k1 ^= uint64_t(tail[ 1]) << 8;
case 1: k1 ^= uint64_t(tail[ 0]) << 0;
k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1;
};
//----------
// finalization
h1 ^= len; h2 ^= len;
h1 += h2;
h2 += h1;
h1 = fmix(h1);
h2 = fmix(h2);
h1 += h2;
h2 += h1;
((uint64_t*)out)[0] = h1;
((uint64_t*)out)[1] = h2;
}
//-----------------------------------------------------------------------------
Let me explain that a bit more.
from sklearn.feature_extraction._hashing import transform
makes use of this code https://github.com/scikit-learn/scikit-learn/blob/412996f09b6756752dfd3736c306d46fca8f1aa1/sklearn/feature_extraction/_hashing.pyx
which makes use of this
from sklearn.utils.murmurhash cimport murmurhash3_bytes_s32
which in turn makes use of this
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/murmurhash.pyx
which is built on this
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/src/MurmurHash3.cpp . So, MurmurHash3.cpp is very important. I need a Javascript version of this exact MurmurHash3.cpp such that the Javascript code and MurmurHash3.cpp would produce the same result.
I need this because I want to make some of my machine learning tools available online and the hashing needs to be done on the client's web browser.
So far, I have found a few Javascript implementations of MurmurHash3. However, murmurhash3.js https://github.com/whitequark/murmurhash3-js/blob/master/murmurhash3.js
seems to be closest (in terms of codes structure) to the MurmurHash3.cpp used by sklearn. But I still d not get the same hash from both of them.
Can anyone please guide me on how to modify murmurhash3.js
(above) so that it produces the same hash as MurmurHash3.cpp
(above) does?
Based on suggestions from @ChristopherOicles, I changed my Javascript
code (my the header off my HTML code) to use hashBytes
instead of hashString
as shown below. I also noticed that I needed to change the returned value of hashBytes
to its absolute value for my purpose (so I did). These solve my problems and now I get the same hash from the Python/C++ codes and from the Javascript codes.
Modified Javascript Function in my HTML file
<script>
function call_murmurhash3_32_gc () {
var key = $('textarea#textarea1').val();
var seed = 0;
var hash = MurmurHash3.hashBytes (key, key.length, seed);
$('div#div1').text(Math.abs (hash) % 262144);
}
</script>
My complete solution is here https://jsbin.com/qilokot/edit?html,js,output .
Again, many thanks to Christopher Oicles
and to everyone who tried to help me in some ways.