cudathruststream-compaction

Thrust: Removing duplicates in key-value arrays


I have a pair of arrays of equal size, I will call them keys and values.

For example:

K: V
1: 99
1: 100
1: 100
1: 100
1: 103
2: 103
2: 105
3: 45
3: 67

The keys are sorted and the values associated with each key are sorted. How do I remove the value duplicates associated with each key and its corresponding key?

That is, I want to compact the above to:

1: 99
1: 100
1: 103
2: 103 <-- This should remain, since key is different
2: 105
3: 45
3: 67

I looked at the stream compaction functions available in Thrust, but was not able to find anything which does this. Is this possible with Thrust? Or do I need to write my own kernel to mark the duplicates in a stencil and then remove them?


Solution

  • The keys are sorted and the values associated with each key are also sorted. Thus, we can consider that the key-value pairs are sorted. thrust::unique will work directly on this if it can see these 2 vectors as a single vector. This can be achieved by zipping up the 2 items (key-value) at each position into a single tuple using zip_iterator.

    Here is how to achieve this in-place and also trim the key-value vectors to only the unique elements:

    typedef thrust::device_vector< int >                IntVector;
    typedef IntVector::iterator                         IntIterator;
    typedef thrust::tuple< IntIterator, IntIterator >   IntIteratorTuple;
    typedef thrust::zip_iterator< IntIteratorTuple >    ZipIterator;
    
    IntVector keyVector;
    IntVector valVector;
    
    ZipIterator newEnd = thrust::unique( thrust::make_zip_iterator( thrust::make_tuple( keyVector.begin(), valVector.begin() ) ),
                                         thrust::make_zip_iterator( thrust::make_tuple( keyVector.end(), valVector.end() ) ) );
    
    IntIteratorTuple endTuple = newEnd.get_iterator_tuple();
    
    keyVector.erase( thrust::get<0>( endTuple ), keyVector.end() );
    valVector.erase( thrust::get<1>( endTuple ), valVector.end() );
    

    If you want to compact and produce a separate result stream, you need to write your own binary predicate for your type which looks at both elements of the tuple. The thrust::zip_iterator can be used to form a virtual tuple iterator from separate arrays.

    A complete working example of how you might do it looks like this:

    #include <iostream>
    #include <thrust/tuple.h>
    #include <thrust/functional.h>
    #include <thrust/device_vector.h>
    #include <thrust/iterator/zip_iterator.h>
    #include <thrust/unique.h>
    
    // Binary predicate for a tuple pair
    typedef thrust::tuple<int, int> tuple_t;
    struct tupleEqual
    {
      __host__ __device__
        bool operator()(tuple_t x, tuple_t y)
        {
          return ( (x.get<0>()== y.get<0>()) && (x.get<1>() == y.get<1>()) );
        }
    };
    
    typedef thrust::device_vector<int>::iterator  intIterator;
    typedef thrust::tuple<intIterator, intIterator> intIteratorTuple;
    typedef thrust::zip_iterator<intIteratorTuple> zipIterator;
    typedef thrust::device_vector<tuple_t>::iterator tupleIterator;
    
    int main(void)
    {
      thrust::device_vector<int> k(9), v(9);
      thrust::device_vector<tuple_t> kvcopy(9);
    
      k[0] = 1; k[1] = 1; k[2] = 1; 
      k[3] = 1; k[4] = 1; k[5] = 2;
      k[6] = 2; k[7] = 3; k[8] = 3;
    
      v[0] = 99;  v[1] = 100; v[2] = 100;
      v[3] = 100; v[4] = 103; v[5] = 103;
      v[6] = 105; v[7] = 45;  v[8] = 67;
    
      zipIterator kvBegin(thrust::make_tuple(k.begin(),v.begin()));
      zipIterator kvEnd(thrust::make_tuple(k.end(),v.end()));
      thrust::copy(kvBegin, kvEnd, kvcopy.begin());
    
      tupleIterator kvend = 
            thrust::unique(kvcopy.begin(), kvcopy.end(), tupleEqual());
    
      for(tupleIterator kvi = kvcopy.begin(); kvi != kvend; kvi++) {
        tuple_t r = *kvi;
        std::cout << r.get<0>() << "," << r.get<1>() << std::endl;
      }
    
      return 0;
    }