Tags: c++, duplicates, stdvector, stl-algorithm, stdset

How to remove duplicates from unsorted std::vector while keeping the original ordering using algorithms?


I have a vector of integers from which I need to remove duplicates while keeping the order of the first occurrence of each integer. I can do it as shown below, but I imagine there is a better way that makes fuller use of STL algorithms. The insertion is out of my control, so I cannot check for duplicates before inserting.

#include <set>
#include <vector>

int unsortedRemoveDuplicates(std::vector<int> &numbers) {
    std::set<int> uniqueNumbers;
    std::vector<int>::iterator allItr = numbers.begin();
    std::vector<int>::iterator unique = allItr;
    std::vector<int>::iterator endItr = numbers.end();

    for (; allItr != endItr; ++allItr) {
        const bool isUnique = uniqueNumbers.insert(*allItr).second;

        if (isUnique) {
            *unique = *allItr;
            ++unique;
        }
    }

    const int duplicates = static_cast<int>(endItr - unique);

    numbers.erase(unique, endItr);
    return duplicates;
}

How can this be done using STL algorithms?


Solution

  • The naive way is to use std::set, as everyone tells you. It's overkill: a node-based container has poor cache locality, which makes it slow.
    The smart* way is to use std::vector appropriately (make sure to see the footnote at the bottom):

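    #include <algorithm>
    #include <cstddef>
    #include <iterator>
    #include <vector>

    // Compare iterators by the values they point to.
    struct target_less
    {
        template<class It>
        bool operator()(It const &a, It const &b) const { return *a < *b; }
    };
    struct target_equal
    {
        template<class It>
        bool operator()(It const &a, It const &b) const { return *a == *b; }
    };

    // Moves the first occurrence of each value to the front, preserving order,
    // and returns the new logical end. Requires random-access iterators,
    // because the iterators themselves are sorted with operator<.
    template<class It> It uniquify(It begin, It const end)
    {
        std::vector<It> v;
        v.reserve(static_cast<std::size_t>(std::distance(begin, end)));
        for (It i = begin; i != end; ++i)
        { v.push_back(i); }
        // Group equal values; stability keeps the earliest iterator first.
        std::stable_sort(v.begin(), v.end(), target_less());
        // Keep one iterator (the earliest) per distinct value.
        v.erase(std::unique(v.begin(), v.end(), target_equal()), v.end());
        // Put the surviving iterators back into positional order.
        std::sort(v.begin(), v.end());
        std::size_t j = 0;
        for (It i = begin; i != end && j != v.size(); ++i)
        {
            if (i == v[j])
            {
                // Swap the kept element into the next free slot at the front.
                using std::iter_swap; iter_swap(i, begin);
                ++j;
                ++begin;
            }
        }
        return begin;
    }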
    

    Then you can use it like:

    int main()
    {
        std::vector<int> v;
        v.push_back(6);
        v.push_back(5);
        v.push_back(5);
        v.push_back(8);
        v.push_back(5);
        v.push_back(8);
        v.erase(uniquify(v.begin(), v.end()), v.end());  // v is now {6, 5, 8}
    }
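
    If you want the same interface as the function in the question (take the vector, return the number of duplicates removed), the same sort-based idea can be sketched with indices instead of iterators. This is only a sketch, assuming C++11 (lambdas, std::iota); the name removeDuplicatesKeepOrder is made up for illustration and is not part of the code above:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (illustration only): removes duplicates from `numbers`,
// keeps the first occurrence of each value in its original relative order,
// and returns how many elements were removed.
int removeDuplicatesKeepOrder(std::vector<int> &numbers)
{
    // idx holds the positions 0, 1, ..., n-1.
    std::vector<std::size_t> idx(numbers.size());
    std::iota(idx.begin(), idx.end(), std::size_t(0));

    // Group equal values together; stability keeps the earliest position
    // first within each group.
    std::stable_sort(idx.begin(), idx.end(),
        [&](std::size_t a, std::size_t b) { return numbers[a] < numbers[b]; });

    // Keep only the first (earliest) position of each group.
    idx.erase(std::unique(idx.begin(), idx.end(),
        [&](std::size_t a, std::size_t b) { return numbers[a] == numbers[b]; }),
        idx.end());

    // Restore the original ordering of the survivors.
    std::sort(idx.begin(), idx.end());

    // Compact the survivors to the front. Since idx[k] >= k, every source
    // element is read before anything can overwrite it.
    for (std::size_t k = 0; k != idx.size(); ++k)
        numbers[k] = numbers[idx[k]];

    const int duplicates = static_cast<int>(numbers.size() - idx.size());
    numbers.resize(idx.size());
    return duplicates;
}
```

    With the vector from the example above (6, 5, 5, 8, 5, 8), this leaves 6, 5, 8 in the vector and returns 3.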
    

    *Note: That's the smart way in typical cases, where the number of duplicates isn't too high. For a more thorough performance analysis, see this related answer to a related question.
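
    For comparison, the set-based approach can also be phrased with an STL algorithm, which is what the question literally asks for. A minimal sketch, assuming C++11: it swaps in std::unordered_set for std::set, the helper name removeDuplicatesWithSet is hypothetical, and it relies on std::remove_if applying the predicate to each element once, front to back, which is how the mainstream implementations behave:

```cpp
#include <algorithm>
#include <unordered_set>
#include <vector>

// Hypothetical helper (illustration only): set-based removal of duplicates
// that keeps the first occurrence of each value and returns the number of
// elements removed.
int removeDuplicatesWithSet(std::vector<int> &numbers)
{
    std::unordered_set<int> seen;

    // insert().second is false when the value is already in the set, so the
    // predicate returns true (remove) for every repeated value.
    // Assumption: the predicate is called once per element, in order.
    const auto newEnd = std::remove_if(numbers.begin(), numbers.end(),
        [&seen](int value) { return !seen.insert(value).second; });

    const int duplicates = static_cast<int>(numbers.end() - newEnd);
    numbers.erase(newEnd, numbers.end());
    return duplicates;
}
```

    The set lives outside the lambda and is captured by reference, so it does not matter if the algorithm copies the predicate internally.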

    Benchmark

    A benchmark based on this answer's uniquify(), showing that it is indeed faster, is available at:

    https://github.com/nh2/cpp-dedup-benchmark

    [Figure: benchmark graph]