[SOLVED] URL filtering on top of Redis: Bloom filters or HyperLogLog data structure

URL filtering on top of Redis: Bloom filters or HyperLogLog data structure

I want to implement URL filtering for the distributed crawling system on top of Redis database (e.g. don't visit the same URL twice, so I need somehow to keep tracking all of them with the minimal memory fingerprint, there is no need to store full URLs, just check if some particular URL has been visited or not). Bloom filters sounds right in this case, and I saw a native module for Redis implementing the Bloom filters. But it also has the built-in HyperLogLog data structure, so I'm wondering which one is a better choice in my scenario.

Solution

Bloom filter is totally different from HyperLogLog. Bloom filter is used for checking if there're some duplicated items, while HyperLogLog is used for distinct counting. In your case, you should use Bloom filter.

Also see this question for their differences.