algorithmsearchbinary-search

Finding multiple value positions in a large vector


I have two very large ordered vectors of integers v and w. What is the most efficient way of finding the positions of the entries of w with values in v? I am hoping for something way more efficient than doing a binary search for every number in v. (Planning to implement it in C++.)

EDIT: Per request, an example: if v={3, 10, 20} and w={2,3,5,11,15,20} then I need to find positions i=1 and i=5 because w[1]=3 and w[5]=20. Now imagine that v has size 10^10 and w size 10^11.

EDIT2: Vectors have no duplicates. The numbers in vectors have 20 digits. So they are too large to put a hash table (index table) of values of v in memory.


Solution

  • This is trivially done in linear time.

    Track the current index in both w and v. For each index in w, iterate through v until you find an item that's bigger or equal. If it's equal, then emit the index.

    //emits the indecies of items in A that are also in B
    template<typename AIter, typename BIter, typename DIter>
    DIter set_union_indecies(AIter a_first, AIter a_last, BIter b_first, BIter b_last, DIter dest) {
        unsigned index = 0;
        while (a_first != a_last) {
            // find the next item in B thats at least as big as the current item in A
            while(b_first != b_last && *b_first < *a_first)
                ++b_first;
            if (b_first == b_last) return dest;
            // if they're the same, then emit the index
            if (!(*a_first < *b_first))
                *dest++ = index;
            // check next value in A
            ++index;
            ++a_first;
        }
        return dest;
    }
    

    https://coliru.stacked-crooked.com/a/c28a23cc26e50d79

    As long as they're reasonably similar in size, then this is O(v+m) with a very tiny constant. If they're wildly different in size, then you may want something that's more of a divide-and-conquer. I think that would still be O(w*logN) (or vice-versa) though, so not orders of magnitude faster than a binary search per-element.