I need a way to sort and delete the duplicates of row and column indices of a sparse matrix. But at first I need to sort indices and later find a method to delete the duplicates.
Eigen::Matrix<std::pair<int, int>, Eigen::Dynamic, Eigen::Dynamic> unsorted_indices;
How can write a sort method for this? or method that I can achieve the both at once.
I can easily achieve this with
std::set<pair<int,int>>
[…] When unknows grow over a million then things are very costly
Matrix is the wrong data structure since you apparently make no use of it being two-dimensional. If std::set
works for you, you can use a plain old std::vector
#include <algorithm>
std::vector<std::pair<int, int>> indices;
for(...)
indices.emplace_back(row_i, col_j);
std::sort(indices.begin(), indices.end());
indices.erase(std::unique(indices.begin(), indices.end()), indices.end());
indices.shrink_to_fit();
vector
is more memory efficient than std::set
(for the cost of a single entry in std::set
including allocator overhead you can put 6 entries into a vector
). The erase(unique(…)
stuff is known as the erase-remove idiom
If the number of duplicates is large, removing them early with a set-like structure may be better. However, std::set
has a relatively high memory overhead and costly O(log(n)) insertion cost. std::unordered_set
is more efficient with O(1) and has slightly lower memory usage but it requires you to write your own hash function.
Unless you need to avoid external libraries, use a flat hash table. They have a much lower memory overhead. Boost comes with one that supports std::pair
out-of-the-box:
#include <boost/unordered/unordered_flat_set.hpp>
boost::unordered_flat_set<std::pair<int, int>> set;
for(...)
set.insert({row_i, col_j});
std::vector<std::pair<int, int>> indices(set.begin(), set.end());
std::sort(indices.begin(), indices.end());
As an alternative, you can use a simple bitmap. The memory consumption is then fixed to rows * cols / 8
byte as opposed to the at least 8 byte per entry in the hash table (not accounting for the hash table's overhead). So if your matrix is filled at least 1/64 (1.5%), the bitmap is more memory-efficient (or 1/320 compared to regular std::unordered_set
). Scanning it for set bits also automatically produces a sorted sequence, saving us this step.
#include <bit>
// C++20, using std::countr_zero
#include <cstdint>
// using std::uint32_t
std::size_t words_per_row = (cols + 31) / 32;
std::vector<std::uint32_t> bitmap(words_per_row * rows);
for(...) {
std::size_t index = row_i * words_per_row + col_j / 32;
bitmap[index] |= 1 << (col_j % 32);
}
std::vector<std::pair<int, int>> indices;
for(std::size_t row_i = 0; row_i < rows; ++row_i) {
for(std::size_t word = 0; word < words_per_row; ++word) {
std::uint32_t bits = bitmap[row_i * words_per_row + word];
while(bits) {
std::uint32_t bit_idx = std::countr_zero(bits);
int col_j = word * 32 + bit_idx;
indices.emplace_back(row_i, col_j);
bits &= ~(1 << bit_idx); // clear bit
}
}
}