algorithm, math, cuda, gpu, sparse-matrix

Solution of sparse linear systems on the GPU, paper from Nvidia


I am reading through an Nvidia paper about solving sparse linear systems on the GPU. I got stuck on the construction of the chainPtrHost data structure: I understand what it does, but not how it is constructed. The paper mentions that it depends on the available parallelism across multiple levels, but how that is determined does not seem to be explained.

Does anyone know how it can be constructed?

Paper for reference: Parallel Solution of Sparse Triangular Linear Systems in the Preconditioned Iterative Methods on the GPU


Solution

  • My understanding is that one just iterates through the levels sequentially on the host like in the following, very basic, untested C++ snippet:

    #include <vector>
    
    // levels: level pointer array of size num_levels + 1, where
    //   levels[i + 1] - levels[i] gives the number of rows in level i.
    // threshold: maximum number of rows a level may contain and still be
    //   grouped with its neighbors into a single-block chain.
    // Returns the 1-based chain pointer array (chainPtrHost).
    std::vector<int> create_chains(const std::vector<int> &levels, int threshold) {
        const int num_levels = static_cast<int>(std::size(levels) - 1);
        
        std::vector<int> chains{};
        chains.reserve(std::size(levels)); // optional optimization to avoid re-allocation
    
        bool no_open_chain = true;
    
        for (int idx = 0; idx < num_levels; ++idx) {
            const int num_rows = levels[idx + 1] - levels[idx];
            const bool new_single_link_chain = (num_rows > threshold);
    
            if (no_open_chain || new_single_link_chain) { // new chain starts here
                chains.push_back(idx + 1); // 1-based index
            }
    
            no_open_chain = new_single_link_chain;
        }
        chains.push_back(num_levels + 1); // 1-based end of last chain
    
        return chains;
    }
    

    The right value for threshold depends on the implementation of the single-block solver kernel. If each thread operates on one row and there is no restriction on the block size from occupancy, shared memory, etc., it could be 1024, i.e. the maximum number of threads (and therefore rows) per block.
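
    To make the construction concrete, here is a small, self-contained driver. The levels array and threshold are made up for illustration, and create_chains is repeated from above so the snippet compiles on its own:

    ```cpp
    #include <cstdio>
    #include <vector>

    // Same function as in the snippet above, repeated for self-containment.
    std::vector<int> create_chains(const std::vector<int> &levels, int threshold) {
        const int num_levels = static_cast<int>(std::size(levels) - 1);
        std::vector<int> chains{};
        chains.reserve(std::size(levels));
        bool no_open_chain = true;
        for (int idx = 0; idx < num_levels; ++idx) {
            const int num_rows = levels[idx + 1] - levels[idx];
            const bool new_single_link_chain = (num_rows > threshold);
            if (no_open_chain || new_single_link_chain) {
                chains.push_back(idx + 1);
            }
            no_open_chain = new_single_link_chain;
        }
        chains.push_back(num_levels + 1);
        return chains;
    }

    int main() {
        // Hypothetical level pointer array describing 5 levels with
        // 2, 3, 1, 1194 and 1 rows, respectively.
        const std::vector<int> levels{0, 2, 5, 6, 1200, 1201};

        // Levels 1-3 are small enough to share one single-block chain,
        // level 4 exceeds the threshold and becomes a chain of its own,
        // and level 5 then starts a new single-block chain.
        const std::vector<int> chains = create_chains(levels, 1024);

        for (const int c : chains) std::printf("%d ", c); // prints "1 4 5 6"
        std::printf("\n");
        return 0;
    }
    ```

    Chain i then spans levels chains[i] through chains[i + 1] - 1, matching the 1-based convention of the paper.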