sortingmatrixeigen

How to sort the Eigen::Matrix that have std::pair<int, int> and delete the duplicates of indices


I need a way to sort and delete the duplicates of row and column indices of a sparse matrix. But at first I need to sort indices and later find a method to delete the duplicates.

Eigen::Matrix<std::pair<int, int>, Eigen::Dynamic, Eigen::Dynamic> unsorted_indices; 

How can write a sort method for this? or method that I can achieve the both at once.


Solution

  • I can easily achieve this with std::set<pair<int,int>> […] When unknows grow over a million then things are very costly

    Matrix is the wrong data structure since you apparently make no use of it being two-dimensional. If std::set works for you, you can use a plain old std::vector

    #include <algorithm>
    
    std::vector<std::pair<int, int>> indices;
    for(...)
      indices.emplace_back(row_i, col_j);
    std::sort(indices.begin(), indices.end());
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
    indices.shrink_to_fit();
    

    vector is more memory efficient than std::set (for the cost of a single entry in std::set including allocator overhead you can put 6 entries into a vector). The erase(unique(…) stuff is known as the erase-remove idiom

    Hash table

    If the number of duplicates is large, removing them early with a set-like structure may be better. However, std::set has a relatively high memory overhead and costly O(log(n)) insertion cost. std::unordered_set is more efficient with O(1) and has slightly lower memory usage but it requires you to write your own hash function.

    Unless you need to avoid external libraries, use a flat hash table. They have a much lower memory overhead. Boost comes with one that supports std::pair out-of-the-box:

    #include <boost/unordered/unordered_flat_set.hpp>
    
    boost::unordered_flat_set<std::pair<int, int>> set;
    for(...)
      set.insert({row_i, col_j});
    std::vector<std::pair<int, int>> indices(set.begin(), set.end());
    std::sort(indices.begin(), indices.end());
    

    Bitmap

    As an alternative, you can use a simple bitmap. The memory consumption is then fixed to rows * cols / 8 byte as opposed to the at least 8 byte per entry in the hash table (not accounting for the hash table's overhead). So if your matrix is filled at least 1/64 (1.5%), the bitmap is more memory-efficient (or 1/320 compared to regular std::unordered_set). Scanning it for set bits also automatically produces a sorted sequence, saving us this step.

    #include <bit>
    // C++20, using std::countr_zero
    #include <cstdint>
    // using std::uint32_t
    
    std::size_t words_per_row = (cols + 31) / 32;
    std::vector<std::uint32_t> bitmap(words_per_row * rows);
    for(...) {
      std::size_t index = row_i * words_per_row + col_j / 32;
      bitmap[index] |= 1 << (col_j % 32);
    }
    std::vector<std::pair<int, int>> indices;
    for(std::size_t row_i = 0; row_i < rows; ++row_i) {
      for(std::size_t word = 0; word < words_per_row; ++word) {
        std::uint32_t bits = bitmap[row_i * words_per_row + word];
        while(bits) {
          std::uint32_t bit_idx = std::countr_zero(bits);
          int col_j = word * 32 + bit_idx;
          indices.emplace_back(row_i, col_j);
          bits &= ~(1 << bit_idx); // clear bit
        }
      }
    }