I have two very large ordered vectors of integers v and w. What is the most efficient way of finding the positions of the entries of w with values in v? I am hoping for something way more efficient than doing a binary search for every number in v. (Planning to implement it in C++.)
EDIT: Per request, an example: if v={3, 10, 20} and w={2,3,5,11,15,20} then I need to find positions i=1 and i=5 because w[1]=3 and w[5]=20. Now imagine that v has size 10^10 and w size 10^11.
EDIT2: Vectors have no duplicates. The numbers in vectors have 20 digits. So they are too large to put a hash table (index table) of values of v in memory.
This is trivially done in linear time.
Track the current index in both w
and v
. For each index in w
, iterate through v
until you find an item that's bigger or equal. If it's equal, then emit the index.
//emits the indecies of items in A that are also in B
template<typename AIter, typename BIter, typename DIter>
DIter set_union_indecies(AIter a_first, AIter a_last, BIter b_first, BIter b_last, DIter dest) {
unsigned index = 0;
while (a_first != a_last) {
// find the next item in B thats at least as big as the current item in A
while(b_first != b_last && *b_first < *a_first)
++b_first;
if (b_first == b_last) return dest;
// if they're the same, then emit the index
if (!(*a_first < *b_first))
*dest++ = index;
// check next value in A
++index;
++a_first;
}
return dest;
}
https://coliru.stacked-crooked.com/a/c28a23cc26e50d79
As long as they're reasonably similar in size, then this is O(v+m) with a very tiny constant. If they're wildly different in size, then you may want something that's more of a divide-and-conquer. I think that would still be O(w*logN) (or vice-versa) though, so not orders of magnitude faster than a binary search per-element.