I have a bloom filter containing 100M items, using 12 hashes and a false positive probability of 1 in 1 trillion, and it takes up ~1.3GB. I need to run hundreds of billions of checks on this filter, using 64 independent threads. Would having multiple copies of the filter in memory improve performance? What is the limiting factor here? Is it copies of the filter, or memory channels, or number of memory sticks, or something else?
I'm using C on Windows with a 32-core Threadripper and 256GB memory. The code is complex, so I don't want to change it unless there would be a decent increase in performance.
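For reference, here's a quick standalone sanity check of those parameters (not my real code) using the classic false-positive estimate p ≈ (1 − e^(−kn/m))^k; with n = 100M items, m = ~1.3GB of bits, and k = 12 hashes, it lands near 1 in a trillion:

```c
/* compile: cc check.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    double n = 100e6;        /* items inserted */
    double m = 1.3e9 * 8.0;  /* filter size in bits (~1.3GB) */
    double k = 12.0;         /* hash functions per lookup */
    double p = pow(1.0 - exp(-k * n / m), k);
    printf("bits/item = %.1f, p = %.2e\n", m / n, p); /* ~104.0, ~3e-12 */
    return 0;
}
```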
Having multiple copies of the data wouldn't help, and would probably hurt.
Here's why: a processor typically shares its L3 cache between multiple cores. If there is only one copy of the data, then when two cores request the same part of the bloom filter, the L3 cache can satisfy both requests while holding just one copy of that cache line. However, if you have multiple copies of the data in memory, two cores requesting the same data from different copies will force both copies into L3. L3 is a fairly scarce resource (typically in the range of 4MB-256MB; a 32-core Threadripper has 128MB), and duplicating the filter would waste it.
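To make that concrete, here's a minimal sketch of the read path (an assumed layout with a hypothetical double-hashing probe, not your actual code). Because a lookup only reads the bit array, all 64 threads can safely share a single copy with no locking, and any hot cache lines stay deduplicated in L3:

```c
#include <stdint.h>

typedef struct {
    const uint64_t *bits;  /* the ~1.3GB bit array: built once, then read-only */
    uint64_t nbits;        /* number of bits in the array */
} bloom_t;

/* hypothetical probe: derive the i-th bit index from two base hashes
 * (Kirsch-Mitzenmacher double hashing) */
static inline uint64_t probe(uint64_t h1, uint64_t h2, int i, uint64_t nbits) {
    return (h1 + (uint64_t)i * h2) % nbits;
}

static int bloom_maybe_contains(const bloom_t *bf, uint64_t h1, uint64_t h2) {
    for (int i = 0; i < 12; i++) {  /* k = 12 hashes, as in your setup */
        uint64_t b = probe(h1, h2, i, bf->nbits);
        if (!((bf->bits[b >> 6] >> (b & 63)) & 1))
            return 0;               /* definitely not in the set */
    }
    return 1;                       /* probably in the set */
}
```

Every worker thread just calls bloom_maybe_contains against the same bloom_t; since nothing is written after construction, there is no coherence traffic to avoid in the first place.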