Tags: asynchronous, parallel-processing, julia, julia-gpu

Julia: Parallel CUSPARSE calculations on multiple GPUs


I have n separate GPUs, each storing its own data. I would like to have each of them perform a set of calculations simultaneously. The CUDArt documentation here describes the use of streams to asynchronously call custom C kernels in order to achieve parallelization (see also this other example here). With custom kernels, this can be accomplished through the use of the stream argument in CUDArt's implementation of the launch() function. As far as I can tell, however, the CUSPARSE (or CUBLAS) functions don't have a similar option for stream specification.

Is this possible with CUSPARSE, or do I just need to drop down to C if I want to use multiple GPUs?
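
For concreteness, the custom-kernel pattern I'm referring to looks roughly like the sketch below. This is written from memory of the CUDArt docs, so the exact Stream()/CuModule()/launch() signatures may need adjusting, and the kernel module and its argument are just placeholders:

    using CUDArt

    devlist  = devices(dev->true)
    streams  = [(device(dev); Stream()) for dev in devlist]              ## one stream per device
    dev_data = [(device(dev); CudaArray(rand(10^6))) for dev in devlist]

    md       = CuModule("mykernel.ptx")        ## placeholder: some compiled custom kernel
    mykernel = CuFunction(md, "mykernel")

    @sync begin
        for (idx, dev) in enumerate(devlist)
            @async begin
                device(dev)
                ## queue the kernel asynchronously on this device's stream
                launch(mykernel, 128, 256, (dev_data[idx],); stream=streams[idx])
                wait(streams[idx])             ## yield to the other tasks until this stream finishes
            end
        end
    end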

REVISED Bounty Update

Ok, so I finally have a relatively decent solution working. But I'm sure it could be improved in a million ways; it's quite hacky right now. In particular, I'd love suggestions for solutions along the lines of what I tried and wrote about in this SO question (which I never got working properly). I'd be delighted to award the bounty to anyone with further ideas here.


Solution

  • Ok, so I think I've finally come upon something that works at least relatively well. I'd still be delighted to award the bounty to anyone with further improvements, particularly ones based on the design I attempted (but failed) to implement as described in this SO question, but any other improvements or suggestions are welcome too.

    The key breakthrough for getting things like CUSPARSE and CUBLAS to parallelize over multiple GPUs is that you need to create a separate handle for each GPU. E.g., from the documentation on the cuBLAS API:

    The application must initialize the handle to the cuBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.

    This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, cuBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

    (emphasis added)

    See here and here for some additional helpful docs.
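
    In Julia terms, the pattern that quote describes boils down to something like the following miniature (it relies on the cusparseCreate and cusparseHandle_t exports from my modified CUSPARSE package described below, so treat it as a sketch of the idea rather than stand-alone code):

    using CUDArt, CUSPARSE   ## modified CUSPARSE, as described below

    devlist = devices(dev->true)
    Handles = Array(Array{Ptr{Void},1}, length(devlist))

    for (idx, dev) in enumerate(devlist)
        device(dev)                        ## bind the host task to device `dev` first...
        Handles[idx] = cusparseHandle_t[0]
        cusparseCreate(Handles[idx])       ## ...so this handle is created on, and tied to, `dev`
    end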

    Now, in order to actually move forward on this, I had to do a bunch of rather messy hacking. In the future, I'm hoping to get in touch with the folks who developed the CUSPARSE and CUBLAS packages to see about incorporating this into the packages themselves. For the time being though, this is what I did:

    First, the CUSPARSE and CUBLAS packages come with functions to create handles. But I had to modify the packages a bit to export those functions (along with other needed functions and object types) so that I could actually access them myself.

    Specifically, I added to CUSPARSE.jl the following:

    export libcusparse, SparseChar
    

    to libcusparse_types.jl the following:

    export cusparseHandle_t, cusparseOperation_t, cusparseMatDescr_t, cusparseStatus_t
    

    to libcusparse.jl the following:

    export cusparseCreate
    

    and to sparse.jl the following:

    export getDescr, cusparseop
    

    Through all of these, I was able to get functional access to the cusparseCreate() function, which can be used to create new handles (I couldn't just use CUSPARSE.cusparseCreate() because that function depends on a bunch of other functions and data types that weren't exported). From there, I defined a new version of the matrix multiplication operation I wanted, which takes an additional argument, the Handle, and feeds it into the ccall() to libcusparse. Below is the full code:

    using CUDArt, CUSPARSE  ## note: modified version of CUSPARSE, as indicated above.
    
    N = 10^3;
    M = 10^6;
    p = 0.1;
    
    devlist = devices(dev->true);
    nGPU = length(devlist)
    
    dev_X = Array(CudaSparseMatrixCSR, nGPU)
    dev_b = Array(CudaArray, nGPU)
    dev_c = Array(CudaArray, nGPU)
    Handles = Array(Array{Ptr{Void},1}, nGPU)
    
    
    for (idx, dev) in enumerate(devlist)
        println("sending data to device $dev")
        device(dev) ## switch to given device
        dev_X[idx] = CudaSparseMatrixCSR(sprand(N,M,p))
        dev_b[idx] = CudaArray(rand(M))
        dev_c[idx] = CudaArray(zeros(N))
        Handles[idx] = cusparseHandle_t[0]
        cusparseCreate(Handles[idx])
    end
    
    
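    ## Pmv!: sparse matrix-vector product  Y = alpha*op(A)*X + beta*Y  via cusparseDcsrmv,
    ## taking an explicit cuSPARSE handle so that each call is dispatched to the device
    ## that the handle (and the data) lives on.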
    function Pmv!(
        Handle::Array{Ptr{Void},1},
        transa::SparseChar,
        alpha::Float64,
        A::CudaSparseMatrixCSR{Float64},
        X::CudaVector{Float64},
        beta::Float64,
        Y::CudaVector{Float64},
        index::SparseChar)
        Mat     = A
        cutransa = cusparseop(transa)
        m,n = Mat.dims
        cudesc = getDescr(A,index)
        device(device(A))  ## necessary to switch to the device associated with the handle and data for the ccall 
        ccall(
            (:cusparseDcsrmv, libcusparse),
            cusparseStatus_t,
            (cusparseHandle_t, cusparseOperation_t, Cint,
            Cint, Cint, Ptr{Float64}, Ptr{cusparseMatDescr_t},
            Ptr{Float64}, Ptr{Cint}, Ptr{Cint}, Ptr{Float64},
            Ptr{Float64}, Ptr{Float64}),
            Handle[1],
            cutransa, m, n, Mat.nnz, [alpha], &cudesc, Mat.nzVal,
            Mat.rowPtr, Mat.colVal, X, [beta], Y
        )
    end
    
    function test(Handles, dev_X, dev_b, dev_c, idx)
        Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
        device(idx-1)  ## hacky: assumes the device IDs in devlist are 0, 1, ... in order
        return to_host(dev_c[idx])
    end
    
    
    function test2(Handles, dev_X, dev_b, dev_c)
    
        @sync begin
            for (idx, dev) in enumerate(devlist)
                @async begin
                    Pmv!(Handles[idx], 'N',  1.0, dev_X[idx], dev_b[idx], 0.0, dev_c[idx], 'O')
                end
            end
        end
        Results = Array(Array{Float64}, nGPU)
        for (idx, dev) in enumerate(devlist)
            device(dev)
            Results[idx] = to_host(dev_c[idx]) ## to_host() doesn't require setting the correct device first, but it is quicker if you do
        end
    
        return Results
    end
    
    ## Function times given after initial run for compilation
    @time a = test(Handles, dev_X, dev_b, dev_c, 1); ## 0.010849 seconds (12 allocations: 8.297 KB)
    @time b = test2(Handles, dev_X, dev_b, dev_c);   ## 0.011503 seconds (68 allocations: 19.641 KB)
    
    # julia> a == b[1]
    # true
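
    One loose end in the code above: the handles get created but never released. The cuSPARSE C API provides cusparseDestroy() for that; I didn't export a wrapper for it, but a direct ccall along the following lines should do it (a sketch, reusing the libcusparse, cusparseHandle_t, and cusparseStatus_t exports from above):

    for (idx, dev) in enumerate(devlist)
        device(dev)    ## switch to the device this handle belongs to
        ## C signature: cusparseStatus_t cusparseDestroy(cusparseHandle_t handle)
        ccall((:cusparseDestroy, libcusparse), cusparseStatus_t,
              (cusparseHandle_t,), Handles[idx][1])
    end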