I am not much of a programmer but i have read some where that hash tables are fast when checking existence of data.
I have a list containing 4 million elements of at least 120 bytes each which i need to check against another list of the same criteria and normally if the elements where 4 byte each i would do
//declare array where i can check if data exists
uint8_t data1[1<<24]={0};
//existence counter
size_t count=0;
//load data, this is just an example
void load(){
for(uint32_t x=0; x < 0xFFFFFFFF; x++){
data1[x>>8]= x&0xFF;
}
}
//checking for existence
void check(uint32_t *data2, size_t size){
for(size_t i = 0; i < size; i++){
if(data1[data2[i]>>8] == data2[i]&0xFF){
count++;
}
}
}
This works perfectly for this type.
I tried using the CRC32()
hashing algorithm mostly used in zip file format to hash each 120 byte element and store it like i described in the code above but it seems to inaccurate, Simply speaking it isn't robust enough.
Details about the lists
Both lists are derived recurrence relation equation in which i have theory that, there exist a point in which a list1[x] == list2[y]
, So they are no duplicates (meaning a 120 byte value can not exist more than once in a list), All values are unique.
List1[2^22] // 120 bytes each
List2[2^22] // 120 bytes each
So intent to count how many values from list2
exists in list1
without doing the work of 2^44
, This way i can write the occurrence probability.
I am expecting a way of hashing a 120 byte value into something that can stored and checked without enormous workload, right now i can't even do this with 8 byte elements.
I have read many posts but there are not clear to me.
When using a CRC32 for hashing each element of a list with 4 million elements, there will only be very few elements with the same hash, i.e. very few collisions. However, you cannot use an entire CRC32 as a hash value, because then the hash table would have to consist of 4,294,967,296
buckets, which would probably require 32 GiB of memory, depending on what type of hash table and collision resolution you use.
When using a hash table with separate chaining as collision resolution, a good rule of thumb is to make the number of buckets in the hash table similar to the number of elements in the hash table. So it would probably be good to make the hash table have a size of 4 million buckets, or maybe 2^22
buckets, which is 4,194,304
. This would probably only require 32 MiB of memory, which is 1024 times less than the amount of memory mentioned above. However, this means that only 22 bits of the 32 bits of the CRC32 would be used, which means that there would be significantly more collisions. However, this does not matter much, for the following reason:
If you store the 4 million elements of the first list in a hash table with 4 million buckets, and use separate chaining as collision resolution, then you will have to compare every element in the second list with maybe 2 elements from the first list that are in the hash table, on average. If every element has a size of 120 bytes, then you will have to compare a total of 240 bytes, on average, for every element in the second list. If you embed the full CRC32 hash in the struct
of the elements of the list, then you can first compare the full CRC32 hashes and only compare the 120 bytes if the full CRC32 hashes match.
So, if you have to compare 4 million elements each with 2 elements on average, you will only have 8 million comparisons in total.