Randomly select two sets ,both set contains distinct keys (one key may belongs to multiple sets ,one set can never contain duplicate keys ).
Return a integer which represents for the number of keys belongs to both sets .
For example intersect({1,2,3,4},{3,4,5}) returns 2 .
I just need the size of the intersection .I don't need to know exactly which keys are there in the intersection .
Are there any datastructures support this kind of operations in less than O(n) time ?
Edit:
Reading the data out does requires O(n) time, but that would not lead to the conclusion that you can't do the intersection operation in less than O(n) time .
Image this scenario:
I have N sets,each contains 100 keys . I read them up, that's N*100 operations .Now I want to know witch pair of sets have the largest intersection ,that's O(N²) intersection operations .So I want to reduce the complexity of the intersection operation .I don't really care how much time it takes to read and construct the sets ,it's at most N*100,that's nothing compares to O(N²) intersection operations .
Be aware ,there's no way that you can find the pair of sets that have the largest intersection by doing less than O(N²) intersection operations ,I can prove that .You must do all the intersection operations .
(he basic idea is ,let's imagine a complete graph ,with N vertices ,each represent for a set ,and Nx(N-1)/2 edges, each represents for a intersection for the connected pair .Now give each edge an non-negetive weight all you want(represents for the intersection size) ,I can always construct N sets satisfy those Nx(N-1)/2 edge weights .That proves my claim .)
I would advice you to take a look at the two possible alternatives, which work particularly well in practice (especially in case of the large sets).
A Bloom filter is a space-efficient (based on the bit array) probabilistic data structure, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not.
There is a trade-off between False Positive rate and the memory footprint of the Bloom Filter. Hence, it is possible to estimate the appropriate size of the Bloom Filter for different use cases.
Each set can be associated with its own Bloom Filter. It is very easy to obtain the Bloom Filter, which corresponds to the intersection of the different sets: all bit arrays, which correspond to the different Bloom Filters, can be combined using the bitwise AND
operation.
Having the Bloom Filter, which corresponds to the intersection, it is possible to find quickly the items, which are present in all intersected sets.
Apart from that, it is possible to estimate the cardinality of the intersection without the actual iteration over the entire sets: https://en.wikipedia.org/wiki/Bloom_filter#The_union_and_intersection_of_sets
A Skip List is a data structure that allows fast search and intersection within an ordered sequence of elements. Fast search and intersection are made possible by maintaining a linked hierarchy of subsequences, each skipping over fewer elements.
Succinctly saying, the Skip List is very similar to the plain Linked List data structure, however each node of the Skip List has a couple of additional pointers to items, which are located further (pointers, which "skips" over the couple of other nodes of the list).
Hence, in order to obtain the intersection - it is needed to maintain the pointers inside all Skip Lists, which are being intersected. During the intersection of the Skip Lists pointers are jumping over the items, which are not present in all intersected Skip Lists. Hence, usually the runtime complexity of the intersection operation is faster then O(N)
.
The algorithm of the intersection of Skip Lists is described in the book "Introduction to Information Retrieval" (written by Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): http://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html
Skip Lists are actively used in a high-performance, full-featured text search engine library: Apache Lucene (Skip Lists are used inside the Inverted Index component).
Here is an additional Stackoverflow question about the usage of Skip Lists in Lucene: how lucene use skip list in inverted index?