[SOLVED] How to sort an Eigen::Matrix of std::pair<int, int> and delete duplicate indices

I need a way to sort the row and column indices of a sparse matrix and delete the duplicates. First I need to sort the indices, then find a method to delete the duplicates.

Eigen::Matrix<std::pair<int, int>, Eigen::Dynamic, Eigen::Dynamic> unsorted_indices;

How can I write a sort method for this? Or is there a method that achieves both at once?

Solution

I can easily achieve this with std::set<pair<int,int>> […] When unknowns grow over a million then things are very costly

A matrix is the wrong data structure, since you apparently make no use of it being two-dimensional. If std::set works for you, you can use a plain old std::vector instead:

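#include <algorithm>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> indices;
for(...)  // placeholder: iterate over your (row_i, col_j) entries
  indices.emplace_back(row_i, col_j);
// std::pair compares lexicographically, so this sorts by row, then by column
std::sort(indices.begin(), indices.end());
// std::unique compacts the distinct entries to the front and returns the new end;
// erase then drops everything behind it
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
indices.shrink_to_fit();  // optional, non-binding request to release unused capacity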

std::vector is more memory-efficient than std::set: for the cost of a single entry in std::set, including allocator overhead, you can put about 6 entries into a vector. The erase(unique(…)) combination is a variant of the erase-remove idiom.
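To make the effect of the sort-then-unique step concrete, here is a minimal self-contained sketch. The function name sort_and_deduplicate is made up for illustration and is not part of Eigen or the standard library.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical helper: sorts the (row, col) pairs and removes duplicates in place.
void sort_and_deduplicate(std::vector<std::pair<int, int>>& indices) {
    // std::pair's operator< is lexicographic: row first, then column.
    std::sort(indices.begin(), indices.end());
    // After sorting, equal pairs are adjacent, which is what std::unique requires.
    indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
}
```

Called on the pairs (2,1), (0,3), (2,1), (0,0), (0,3), this leaves (0,0), (0,3), (2,1) in the vector.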

Hash table

If the number of duplicates is large, removing them early with a set-like structure may be better. However, std::set has a relatively high memory overhead and a costly O(log(n)) insertion. std::unordered_set is more efficient, with O(1) insertion on average and slightly lower memory usage, but it requires you to write your own hash function because the standard library provides no std::hash specialization for std::pair.
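If you do want to stay with std::unordered_set, one possible hash function is sketched below. The names PairHash and IndexSet are made up for illustration, and the packing assumes that int is 32 bits wide. How well the result spreads over buckets depends on your standard library's std::hash for integers, so treat it as a starting point rather than a tuned solution.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

// Hypothetical hash functor: packs both indices into one 64-bit key
// (assumes 32-bit int) and forwards it to the standard integer hash.
struct PairHash {
    std::size_t operator()(const std::pair<int, int>& p) const noexcept {
        const auto hi = static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.first));
        const auto lo = static_cast<std::uint64_t>(static_cast<std::uint32_t>(p.second));
        return std::hash<std::uint64_t>{}((hi << 32) | lo);
    }
};

// Set of (row, col) pairs that silently ignores repeated insertions.
using IndexSet = std::unordered_set<std::pair<int, int>, PairHash>;
```

Inserting (1,2) twice and (2,1) once into an IndexSet leaves two elements. The contents still have to be copied into a vector and sorted afterwards, since the set has no defined iteration order.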

Unless you need to avoid external libraries, use a flat hash table. Flat hash tables have a much lower memory overhead. Boost comes with one (boost::unordered_flat_set, available since Boost 1.81) that supports std::pair out of the box:

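#include <algorithm>
#include <utility>
#include <vector>
#include <boost/unordered/unordered_flat_set.hpp>

boost::unordered_flat_set<std::pair<int, int>> set;
for(...)  // placeholder: iterate over your (row_i, col_j) entries
  set.insert({row_i, col_j});  // duplicates are rejected on insertion
// the set is unordered, so copy it out and sort once at the end
std::vector<std::pair<int, int>> indices(set.begin(), set.end());
std::sort(indices.begin(), indices.end());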

Bitmap

As an alternative, you can use a simple bitmap. The memory consumption is then fixed at rows * cols / 8 bytes, as opposed to at least 8 bytes per entry in the hash table (not accounting for the hash table's overhead). So if at least 1/64 (about 1.6%) of the matrix entries are set, the bitmap is more memory-efficient (the threshold drops to 1/320 compared to a regular std::unordered_set). Scanning the bitmap for set bits also produces a sorted sequence automatically, saving us the sort step.

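#include <bit>      // C++20, using std::countr_zero
#include <cstddef>  // using std::size_t
#include <cstdint>  // using std::uint32_t
#include <utility>
#include <vector>

// one bit per matrix cell; each row is padded to a whole number of 32-bit words
std::size_t words_per_row = (cols + 31) / 32;
std::vector<std::uint32_t> bitmap(words_per_row * rows);
for(...) {  // placeholder: iterate over your (row_i, col_j) entries
  std::size_t index = row_i * words_per_row + col_j / 32;
  // shift an unsigned 1 so that setting bit 31 never goes through a signed int
  bitmap[index] |= std::uint32_t{1} << (col_j % 32);
}
std::vector<std::pair<int, int>> indices;
for(std::size_t row_i = 0; row_i < rows; ++row_i) {
  for(std::size_t word = 0; word < words_per_row; ++word) {
    std::uint32_t bits = bitmap[row_i * words_per_row + word];
    while(bits) {
      int bit_idx = std::countr_zero(bits);  // position of the lowest set bit
      int col_j = static_cast<int>(word * 32) + bit_idx;
      indices.emplace_back(static_cast<int>(row_i), col_j);
      bits &= bits - 1; // clear the lowest set bit
    }
  }
}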
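
The break-even figures quoted for the bitmap can be reproduced with a small back-of-the-envelope calculation. The helper names below are made up for illustration. The default of 8 bytes per entry is the bare size of a std::pair<int, int> with 32-bit int, and 40 bytes is simply the per-entry cost implied by the 1/320 ratio above, not a measured value.

```cpp
#include <cstddef>

// Hypothetical helper: bytes used by the bitmap, one bit per matrix cell,
// with each row padded to a whole number of 32-bit words.
constexpr std::size_t bitmap_bytes(std::size_t rows, std::size_t cols) {
    return rows * ((cols + 31) / 32) * 4;
}

// Hypothetical helper: lower bound for a hash set holding `entries` distinct pairs.
// bytes_per_entry = 8 ignores all table overhead; pass 40 for the std::unordered_set estimate.
constexpr std::size_t hash_set_bytes(std::size_t entries, std::size_t bytes_per_entry = 8) {
    return entries * bytes_per_entry;
}
```

For a 1024 x 1024 matrix, bitmap_bytes gives 131072 bytes, which equals hash_set_bytes for 16384 entries, that is 1/64 of all cells.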