I am reading through an Nvidia article about solving sparse linear systems on the GPU, and I got stuck on the construction of the chainPtrHost
data structure. I understand what it does, but not how it is constructed: the article mentions that it depends on the available parallelism across the levels, but how that is determined doesn't seem to be explained.
Does anyone know how it can be constructed?
Paper for reference: Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU
My understanding is that one simply iterates through the levels sequentially on the host, as in the following basic, untested C++ snippet:
#include <vector>

// Build the 1-based chainPtrHost array from the level offsets, where
// levels[i] .. levels[i+1] delimit the rows belonging to level i.
std::vector<int> create_chains(const std::vector<int> &levels, int threshold) {
    const int num_levels = static_cast<int>(std::size(levels)) - 1;
    std::vector<int> chains{};
    chains.reserve(std::size(levels)); // optional optimization to avoid re-allocation
    bool no_open_chain = true;
    for (int idx = 0; idx < num_levels; ++idx) {
        const int num_rows = levels[idx + 1] - levels[idx];
        // A level with more rows than one thread block can handle becomes
        // its own single-link chain; it also terminates any open chain.
        const bool new_single_link_chain = (num_rows > threshold);
        if (no_open_chain || new_single_link_chain) { // new chain starts here
            chains.push_back(idx + 1); // 1-based level index
        }
        no_open_chain = new_single_link_chain;
    }
    chains.push_back(num_levels + 1); // 1-based end of last chain
    return chains;
}
The right value for threshold depends on the implementation of the single-block solver kernel. If each thread operates on one row and there is no restriction on the block size from occupancy, shared memory, etc., it could be 1024, i.e. the maximum number of threads (and hence rows) per block.
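To illustrate why threshold is tied to the block size: on the host, each chain is then dispatched either to a single-block kernel (one launch per chain, synchronizing between levels inside the block) or to a per-level multi-block kernel. A minimal sketch of that decision, with the kernel launches replaced by strings and all names my own invention, not taken from the paper:

```cpp
#include <string>
#include <vector>

// Hypothetical dispatch loop: `chains` and `levels` are 1-based offset
// arrays as in the snippet above. Returns, per chain, which kind of kernel
// a real implementation would launch (strings stand in for CUDA launches).
std::vector<std::string> dispatch_plan(const std::vector<int> &chains,
                                       const std::vector<int> &levels,
                                       int threshold) {
    std::vector<std::string> plan;
    for (std::size_t c = 0; c + 1 < std::size(chains); ++c) {
        const int first_level = chains[c]; // 1-based index of the chain's first level
        const int rows = levels[first_level] - levels[first_level - 1];
        if (rows > threshold) {
            // Single-link chain: one level too large for a single block,
            // so it is processed by a multi-block kernel on its own.
            plan.push_back("multi_block");
        } else {
            // By construction every level in this chain fits in one block,
            // so a single block can walk all of its levels in one launch,
            // using __syncthreads() between consecutive levels.
            plan.push_back("single_block");
        }
    }
    return plan;
}
```

With the example chains {1, 4, 5, 6} from above this yields single-block, multi-block, single-block, matching the intuition that only the 2091-row level needs more than one block.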