I am new to python and trying to create a bloomFilter based on Bit torrent BEP 33. I have created Bloom Filter but it is not exactly what I am looking for. Here's what I need and I have not completely understood this situation. If someone here can explain ...
//fixed parameters
k = 2
m = 256*8
//the filter
byte[m/8] bloom ## What is this part?
function insertIP(byte[] ip) {
byte[20] hash = sha1(ip)
int index1 = hash[0] | hash[1] << 8
int index2 = hash[2] | hash[3] << 8
// truncate index to m (11 bits required)
index1 %= m ## ?
index2 %= m ## ?
// set bits at index1 and index2
bloom[index1 / 8] |= 0x01 << index1 % 8 ## ??
bloom[index2 / 8] |= 0x01 << index2 % 8 ## ??
}
// insert IP 192.168.1.1 into the filter:
insertIP(byte[4] {192,168,1,1})
And this is what I have created
import hashlib
m = 2048
def hashes(s):
index = [0, 0]
#for c in s:
#o = ord(c)
index[0] = hashlib.sha224(index[0]).hexdigest ## This needs integer hash
index[1] = hashlib.sha224(index[1]).hexdigest ## same as above
return [x % m for x in index]
class BloomFilter(object):
def __init__(self):
self.bitarray = [0] * m
def add(self, s):
for x in hashes(s):
self.bitarray[x] = 1
#print self.bitarray
def query(self, s):
return all(self.bitarray[x] == 1 for x in hashes(s))
shazib=BloomFilter()
shazib.add('192.168.0.1')
print shazib.query('192.168.0.1')
First, an explanation of the code...
//fixed parameters
k = 2
This is the most baffling line to me; k
isn't used at all...
m = 256*8
This is the number of bits in 256 bytes.
//the filter
byte[m/8] bloom ## What is this part?
bloom
is an array of 256 bytes, i.e. 256 * 8 bits, i.e. m
bits. Each bit in bloom
will contain information about what values are in the filter.
function insertIP(byte[] ip) {
byte[20] hash = sha1(ip)
This creates a 20-byte hash of ip
.
int index1 = hash[0] | hash[1] << 8
int index2 = hash[2] | hash[3] << 8
These two lines calculate two indices into bloom
based on the hash. Basically, index1
is the concatenation of the first two bytes of hash
, and index2
is the concatenation of the second two bytes of hash
.
// truncate index to m (11 bits required)
index1 %= m ## ?
index2 %= m ## ?
These two lines truncate the values so that they don't exceed the range of possible indices into bloom
. The %
is the mod operator; it returns the remainder after division. (17 % 4 = 1, 22 % 5 = 2 and so on.) Remember that bloom is 256 * 8 bits long? Eleven bits allows us to encode 2 ** 11 possible indices, i.e. 2048 values, i.e. 256 * 8 values.
// set bits at index1 and index2
bloom[index1 / 8] |= 0x01 << index1 % 8 ## ??
bloom[index2 / 8] |= 0x01 << index2 % 8 ## ??
We're treating bloom
as a bit-array, so we have to do some bit-twiddling to access the correct bit. First, divide indexA
by 8, to get the correct byte, then truncate indexA
using the %
operator to get the correct bit within that byte.
}
// insert IP 192.168.1.1 into the filter:
insertIP(byte[4] {192,168,1,1})
And voila, we have a bloom filter. If you printed it out bitwise, it would look like this:
data-> 001011000101110011000001001000100...
indices-> 000000000011111111112222222222333...
012345678901234567890123456789012...
And if a particular i.p., when hashed, generates an index1
of 5
and an index2
of 9
, then it would be considered "in" the filter, because the bits at indices 5
and 9
are set to 1
. Of course, there can be false positives, because multiple different values could result in the same indices; but there can be no false negatives.
import hashlib
m = 2048
def hashes(s):
index = [0, 0]
#for c in s:
#o = ord(c)
index[0] = hashlib.sha224(index[0]).hexdigest ## This needs integer hash
index[1] = hashlib.sha224(index[1]).hexdigest ## same as above
Here's your first problem. index[0]
and index[1]
need to be integers. Also, hashlib.sha224(index[0]).hexdigest
returns a method. You have to call the method to get anything out of it, like this: hashlib.sha224(index[0]).hexdigest()
. Also, if you want this to work in a way that's identical to the above code, you could convert the hash to an int (you can use int(x, 16)
to convert a hexadecimal string into an integer) and then extract the first two bytes using & 65535
, then shift it by two bytes using >> 16
, then extract those two bytes using & 65535
again. Once you've got that correct, the rest works.
return [x % m for x in index]
class BloomFilter(object):
def __init__(self):
self.bitarray = [0] * m
def add(self, s):
for x in hashes(s):
self.bitarray[x] = 1
#print self.bitarray
def query(self, s):
return all(self.bitarray[x] == 1 for x in hashes(s))
shazib=BloomFilter()
shazib.add('192.168.0.1')
print shazib.query('192.168.0.1')