javasettime-complexitybig-ohyperloglog

What Algorithm is used by java.util.HashSet and java.util.TreeSet to store unique values in its structure?


I have come across multiple algorithms such as Flajolet-Martin algorithm , HyperLogLog to find out unique elements from a list of elements and suddenly became curious about how Java calculates it? And what is the Time-complexity in each of these cases to store and find unique values?


Solution

  • Flajolet-Martin and HyperLogLog algorithms are about getting an approximate count of the distinct elements (the count-distinct problem) in one pass of a stream of N elements with O(N) time and modest (much better than O(N)) memory usage.

    An implementation of the Map API does not need a solution to the "count-distinct" problem.

    (Aside: TreeMap and HashMap already keep a precomputed count of the number of entries in the map1; i.e. Map.size(). Provided that you don't get into thread-safety problems the result is accurate (not approximate). The cost of calling size() is O(1). The cost of maintaining it is O(U) where U is the number of map addition and removal operations performed.)

    More generally, Flajolet-Martin algorithm or HyperLogLog do not / cannot form the basis for a Map data structure. They do not address the dictionary problem.

    The algorithms used by HashMap and TreeMap are (respectively) hash table and binary tree algorithms. There are more details in the respective javadocs, and the full source code (with comments) is readily available for you to look at. (Google for "java.util.HashMap" source ... for example.)


    1 - Interestingly, ConcurrentHashMap doesn't work this way ... because updating the size field would be a concurrency bottleneck. Instead, the size() operation is O(N).