image-processing cuda synchronization blur dynamic-parallelism

How to ensure that a child kernel has finished processing before the parent kernel continues?


I am learning CUDA and decided to do a basic image box blur demo as a way to get familiar.

When I try to make a child kernel compute the sum of the neighboring pixels, the sum is always 0.

After some research, I figured out that the child kernel is not finishing before the parent kernel continues, and that too many kernels are being launched, exceeding the device limit.

I saw some people calling cudaDeviceSynchronize() in device code, but it is deprecated now.

How can I achieve my goal? Or is the idea of using dynamic parallelism to compute the sum of neighboring cells wrong?

__global__ void sumValues(uchar3* img, Pair2i* size, int origin, Pair3u* tmpSums, uchar kernel_size)
{   
    int Idx = origin + (threadIdx.y - (int)ceil((kernel_size - 1) / 2.0)) * size->y + (threadIdx.x - (int)ceil((kernel_size - 1) / 2.0));

    if (Idx < 0 || Idx >= size->x * size->y) return;

    atomicAdd((&tmpSums[origin].x), unsigned int(img[Idx].x));
    atomicAdd((&tmpSums[origin].y), unsigned int(img[Idx].y));
    atomicAdd((&tmpSums[origin].z), unsigned int(img[Idx].z));
}

__global__ void blur(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
{
    unsigned int Idx = access2d(threadIdx, blockIdx, blockDim, gridDim);

    if (Idx >= size->x * size->y) return;

    tmpSums[Idx] = { 0, 0, 0 };

    dim3 dimBlock(kernel_size, kernel_size);

    sumValues<<<1, dimBlock>>>(img, size, Idx, tmpSums, kernel_size);

    res[Idx].x = tmpSums[Idx].x / (kernel_size * kernel_size);
    res[Idx].y = tmpSums[Idx].y / (kernel_size * kernel_size);
    res[Idx].z = tmpSums[Idx].z / (kernel_size * kernel_size);
}

For a little more context, I am using OpenCV to read the image and I pass its data pointer to the kernel that computes the blurred image. The parent kernel (blur) calls the child kernel (sumValues) to compute the sum and store it in tmpSums; the parent kernel then takes the average and assigns it to the resulting image.
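
Roughly, the host side looks like this (a simplified sketch, not my exact code; assume Pair2i/Pair3u are aliases for int2/uint3):

#include <opencv2/opencv.hpp>

using Pair2i = int2;
using Pair3u = uint3;
using uchar = unsigned char;

__global__ void blur(uchar3*, Pair2i*, uchar3*, Pair3u*, uchar);  // shown above

int main()
{
    cv::Mat img = cv::imread("input.png");     // 8-bit, 3-channel BGR
    size_t npix = img.total();
    Pair2i size = { img.cols, img.rows };      // assumption: x = width, y = height
    uchar3 *d_img, *d_res;
    Pair2i *d_size;
    Pair3u *d_tmpSums;
    cudaMalloc(&d_img, npix * sizeof(uchar3));
    cudaMalloc(&d_res, npix * sizeof(uchar3));
    cudaMalloc(&d_tmpSums, npix * sizeof(Pair3u));
    cudaMalloc(&d_size, sizeof(Pair2i));
    cudaMemcpy(d_img, img.data, npix * sizeof(uchar3), cudaMemcpyHostToDevice);
    cudaMemcpy(d_size, &size, sizeof(Pair2i), cudaMemcpyHostToDevice);
    dim3 block(16, 16);
    dim3 grid((img.cols + 15) / 16, (img.rows + 15) / 16);
    blur<<<grid, block>>>(d_img, d_size, d_res, d_tmpSums, 3);  // odd kernel size
    cudaDeviceSynchronize();
}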


Solution

  • I suppose this isn't the code you are actually using. This will not compile:

    atomicAdd((&tmpSums[origin].x), unsigned int(img[Idx].x));
    

    I assume you meant:

    atomicAdd((&tmpSums[origin].x), (unsigned int)(img[Idx].x));
    

    Yes, it's possible to use CDP to solve the problem, roughly as you have outlined it. No, I wouldn't recommend it, based on what I see here and assumptions about sizes.

    Preamble:

    1. With CDP2 (which is recommended instead of CDP1 moving forward, and required if you are using a cc9.0 or higher GPU), having the parent kernel code depend on child kernel results is a prohibited design practice:

    This means that the modifications made by the threads in the child grid are never guaranteed to become available to the parent grid. To access modifications made by child_launch, a tail_launch kernel is launched into the cudaStreamTailLaunch stream.

    And that gives us one clue as to a possible approach in your case.

    2. In CDP2, we must also be aware of the pending launch limit, which defaults to 2048. Your kernel wants to launch one child kernel per parent kernel thread, i.e. per pixel. If the number of pixels is significantly larger than 2048, you're going to run into this limit.

    3. Any time you are having trouble with CUDA code, and particularly with CDP, I strongly encourage proper CUDA error checking, including in device code when using the device runtime API or launching CDP kernels. (Both this item and the previous one are illustrated in the sketch after this list.)

    4. As a general rule in CUDA, it's unwise to use kernel launches to perform "small" amounts of work. What is a "small" amount of work? It is a kernel launch whose number of threads would not fill a GPU. Pretty much by definition, any kernel launch of the form <<<1, XXX>>> can only run on a single SM, and could not possibly fill any modern GPU that has more than one SM. "But wait, I'm launching 2048 of those! Doesn't that help?" The GPU has a maximum resident grid limit (i.e. on kernel launches actually executing) that is lower than 2048; for most modern GPUs it is currently 128. So this might be able to fill your GPU, depending on the grid (or, in your case, threadblock) size of your child kernels. And this does not take into account that child kernels, just like parent kernels, have a launch latency or "overhead" that is usually on the order of 3-15 microseconds. That overhead could be a pretty high cost to pay if your child kernel is doing very little work (on the order of 3-15 microseconds of work, or less).
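
    As a minimal, self-contained illustration of items 2 and 3 (the kernel names and sizes here are illustrative, not from the question; compile with -rdc=true on a CDP2-capable toolkit):

    #include <cstdio>

    __global__ void child() { }

    __global__ void parent()
    {
        child<<<1, 32>>>();
        // item 3: check for launch errors in device code, right after the launch
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) printf("child launch failed: %s\n", cudaGetErrorString(err));
    }

    int main()
    {
        // item 2: raise the pending launch limit above its 2048 default when the
        // parent grid can have more outstanding child launches than that
        cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);
        parent<<<4096, 1>>>();  // one child launch per parent thread
        cudaDeviceSynchronize();
    }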

    In short, CDP might not be the way to go. At least there are hazards to be aware of.

    Possible solutions:

    Can it be done? Yes, probably, at least in your case. The key is to use streams to order the work. I can think of several approaches; one is already hinted at. I expect that none of these methods will be interesting from a performance standpoint (primarily due to item 4 above), so I'm not going to spend a lot of time trying to demonstrate good performance. I'm convinced that, for what you have shown, much better performance is obtained simply by running a monolithic kernel using a pretty standard 2D stencil design.

    1. Put everything into the child kernel, using a single thread of the child kernel to finish the work, rather than the parent kernel.

    2. Use ordinary created streams (or the NULL stream) with multiple child kernel launches. Launching kernels into the same stream guarantees that (a) the second child kernel launched will wait until the first child kernel is finished, and (b) the results computed by the first child kernel in global memory will be visible to the second child kernel.

    3. Use many "fire and forget" kernels (kernels launched into the NULL stream work as well) followed by a single "tail launch" kernel (see the sketch after this list).

    4. Use an ordinary, monolithic (non-CDP) 2D stencil kernel.
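
    Before the full example, here is the tail-launch pattern from method 3 in isolation (a minimal sketch under the same CDP2 assumptions; the kernels and sizes are illustrative):

    #include <cstdio>

    __global__ void produce(int *data) { data[threadIdx.x] = threadIdx.x; }

    // a tail launch begins only after the parent grid and all of its
    // fire-and-forget work complete, so produce()'s writes are visible here
    __global__ void consume(int *data) { printf("%d ", data[threadIdx.x]); }

    __global__ void parent(int *data)
    {
        produce<<<1, 32>>>(data);                           // fire-and-forget style
        consume<<<1, 32, 0, cudaStreamTailLaunch>>>(data);  // tail launch
    }

    int main()
    {
        int *d;
        cudaMalloc(&d, 32 * sizeof(int));
        parent<<<1, 1>>>(d);
        cudaDeviceSynchronize();
    }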

    Here is an example showing all 4 methods and the original method in your question:

    # cat t242.cu
    #include <cstdio>
    #include <iostream>
    
    using Pair2i = int2;
    using Pair3u = uint3;
    using uchar = unsigned char;
    const int img_dim = 64;
    
    #define check(x)  if (x != cudaSuccess) std::cout << __LINE__ << " " << cudaGetErrorString(x) << std::endl
    #define kcheck(x) if (x != cudaSuccess) printf("device: %s\n", cudaGetErrorString(x))
    
    
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    __managed__ bool test_ok;
    
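    // computes a flattened 1D index for the calling thread; the parameters are
    // unused and the built-in variables are read directly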
    __device__ unsigned int access2d(uint3 t, uint3 b, dim3 d, dim3 g){
            return ((blockIdx.y*gridDim.x + blockIdx.x)*blockDim.y+threadIdx.y)*blockDim.x+threadIdx.x;}
    
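    // original method from the question (cast fixed): the parent (blur) launches
    // sumValues per pixel and reads tmpSums immediately, with no ordering guarantee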
    __global__ void sumValues(uchar3* img, Pair2i* size, int origin, Pair3u* tmpSums, uchar kernel_size)
    {
        int Idx = origin + (threadIdx.y - (int)ceil((kernel_size - 1) / 2.0)) * size->y + (threadIdx.x - (int)ceil((kernel_size - 1) / 2.0));
    
        if (Idx < 0 || Idx >= size->x * size->y) return;
    
        atomicAdd((&tmpSums[origin].x), (unsigned int)(img[Idx].x));
        atomicAdd((&tmpSums[origin].y), (unsigned int)(img[Idx].y));
        atomicAdd((&tmpSums[origin].z), (unsigned int)(img[Idx].z));
    }
    
    __global__ void blur(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
        unsigned int Idx = access2d(threadIdx, blockIdx, blockDim, gridDim);
    
        if (Idx >= size->x * size->y) return;
    
        tmpSums[Idx] = { 0, 0, 0 };
    
        dim3 dimBlock(kernel_size, kernel_size);
    
        sumValues<<<1, dimBlock>>>(img, size, Idx, tmpSums, kernel_size);
    
        cudaError_t err = cudaGetLastError();
        kcheck(err);
    
        res[Idx].x = tmpSums[Idx].x / (kernel_size * kernel_size);
        res[Idx].y = tmpSums[Idx].y / (kernel_size * kernel_size);
        res[Idx].z = tmpSums[Idx].z / (kernel_size * kernel_size);
    }
    
    template <typename T>
    __global__ void test_equal(T *d1, T *d2, size_t tsz){
    
      for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < tsz; i+=gridDim.x*blockDim.x){
        if (d1[i].x != d2[i].x) test_ok = false;
        if (d1[i].y != d2[i].y) test_ok = false;
        if (d1[i].z != d2[i].z) test_ok = false;}
    }
    
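    // method 1: everything happens in the child kernel; a single child thread
    // computes the average, so the parent never reads the child's results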
    __global__ void sumValues1(uchar3* img, Pair2i* size, int origin, volatile Pair3u* tmpSums, uchar kernel_size, uchar3* res)
    {
        int Idx = origin + (threadIdx.y - (int)ceil((kernel_size - 1) / 2.0)) * size->y + (threadIdx.x - (int)ceil((kernel_size - 1) / 2.0));
    
        if (Idx >= 0 &&  Idx < size->x * size->y) {
    
          atomicAdd((unsigned int *)(&tmpSums[origin].x), (unsigned int)(img[Idx].x));
          atomicAdd((unsigned int *)(&tmpSums[origin].y), (unsigned int)(img[Idx].y));
          atomicAdd((unsigned int *)(&tmpSums[origin].z), (unsigned int)(img[Idx].z));
        }
        if (!threadIdx.x && !threadIdx.y){
    
          res[origin].x = tmpSums[origin].x / (kernel_size * kernel_size);
          res[origin].y = tmpSums[origin].y / (kernel_size * kernel_size);
          res[origin].z = tmpSums[origin].z / (kernel_size * kernel_size);
        }
    }
    
    __global__ void blur1(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
        unsigned int Idx = access2d(threadIdx, blockIdx, blockDim, gridDim);
    
        if (Idx >= size->x * size->y) return;
    
        tmpSums[Idx] = { 0, 0, 0 };
    
        dim3 dimBlock(kernel_size, kernel_size);
    
        sumValues1<<<1, dimBlock>>>(img, size, Idx, tmpSums, kernel_size, res);
    
        cudaError_t err = cudaGetLastError();
        kcheck(err);
    }
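
    // method 2: blur2 launches sumValues and then finish_up into the same (NULL)
    // stream; stream ordering guarantees finish_up runs after sumValues and sees
    // its results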
    __global__ void finish_up(int Idx, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
          res[Idx].x = tmpSums[Idx].x / (kernel_size * kernel_size);
          res[Idx].y = tmpSums[Idx].y / (kernel_size * kernel_size);
          res[Idx].z = tmpSums[Idx].z / (kernel_size * kernel_size);
    }
    
    __global__ void blur2(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
        unsigned int Idx = access2d(threadIdx, blockIdx, blockDim, gridDim);
    
        if (Idx >= size->x * size->y) return;
    
        tmpSums[Idx] = { 0, 0, 0 };
    
        dim3 dimBlock(kernel_size, kernel_size);
    
        sumValues<<<1, dimBlock>>>(img, size, Idx, tmpSums, kernel_size);
        cudaError_t err = cudaGetLastError();
        kcheck(err);
        finish_up<<<1,1>>>(Idx, res, tmpSums, kernel_size);
        err = cudaGetLastError();
        kcheck(err);
    }
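
    // method 3: every parent thread fire-and-forgets a sumValues child; one parent
    // thread launches finish_up3 into cudaStreamTailLaunch, which runs only after
    // the parent grid and all of its child work have completed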
    __global__ void finish_up3(uchar3* res, Pair3u* tmpSums, uchar kernel_size, size_t img_size)
    {
        for (size_t Idx = threadIdx.x+blockDim.x*blockIdx.x; Idx < img_size; Idx+=gridDim.x*blockDim.x){
          res[Idx].x = tmpSums[Idx].x / (kernel_size * kernel_size);
          res[Idx].y = tmpSums[Idx].y / (kernel_size * kernel_size);
          res[Idx].z = tmpSums[Idx].z / (kernel_size * kernel_size);}
    }
    
    __global__ void blur3(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
        unsigned int Idx = access2d(threadIdx, blockIdx, blockDim, gridDim);
    
        if (Idx >= size->x * size->y) return;
    
        tmpSums[Idx] = { 0, 0, 0 };
    
        dim3 dimBlock(kernel_size, kernel_size);
    
        sumValues<<<1, dimBlock>>>(img, size, Idx, tmpSums, kernel_size);
        cudaError_t err = cudaGetLastError();
        kcheck(err);
        if (Idx == 0){
          finish_up3<<<((size->x*size->y)+511)/512, 512, 0, cudaStreamTailLaunch>>>(res, tmpSums, kernel_size, size->x*size->y);
          err = cudaGetLastError();
          kcheck(err);}
    }
    
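    // method 4: ordinary monolithic (non-CDP) 2D stencil kernel; each thread sums
    // its own neighborhood directly, with no child kernels or atomics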
    __global__ void blur4(uchar3* img, Pair2i* size, uchar3* res, Pair3u* tmpSums, uchar kernel_size)
    {
        int idx = threadIdx.x+blockDim.x*blockIdx.x;
        int idy = threadIdx.y+blockDim.y*blockIdx.y;
        if ((idx < size->x) && (idy < size->y)){
          unsigned tsx = 0;
          unsigned tsy = 0;
          unsigned tsz = 0;
          for (int oy = 0; oy < kernel_size; oy++)
            for (int ox = 0; ox < kernel_size; ox++){
              int kx = idx+ox-(kernel_size>>1); // for odd kernel_size
              int ky = idy+oy-(kernel_size>>1);
              if ((kx >= 0) && (kx < size->x) && (ky >= 0) && (ky < size->y)){
                tsx += img[ky*size->x + kx].x;
                tsy += img[ky*size->x + kx].y;
                tsz += img[ky*size->x + kx].z;}
            }
          res[idy*size->x+idx].x = tsx / (kernel_size * kernel_size);
          res[idy*size->x+idx].y = tsy / (kernel_size * kernel_size);
          res[idy*size->x+idx].z = tsz / (kernel_size * kernel_size);}
    }
    
    int main(){
    
      cudaError_t err = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, img_dim*img_dim*2);  // reserve enough space for two child kernel launches per pixel
      check(err);
      uchar3 *i, *r, *r2;
      Pair2i s = {img_dim, img_dim};
      Pair2i *d_s;
      Pair3u *t;
      uchar ks = 5; // assume odd values only
      cudaMalloc(&i, img_dim*img_dim*sizeof(i[0]));
      cudaMalloc(&r, img_dim*img_dim*sizeof(r[0]));
      cudaMalloc(&t, img_dim*img_dim*sizeof(t[0]));
      cudaMalloc(&r2, img_dim*img_dim*sizeof(r2[0]));
      cudaMemset(i, 4, img_dim*img_dim*sizeof(i[0]));
      cudaMemset(r, 0, img_dim*img_dim*sizeof(r[0]));
      cudaMemset(t, 0, img_dim*img_dim*sizeof(t[0]));
      cudaMalloc(&d_s, sizeof(Pair2i));
      cudaMemcpy(d_s, &s, sizeof(Pair2i), cudaMemcpyHostToDevice);
      err = cudaGetLastError();
      check(err);
      unsigned long long dt;
      // original method
      dim3 b(16,16);
      dim3 g((img_dim+b.x-1)/b.x, (img_dim+b.y-1)/b.y);
      dt = dtime_usec(0);
      blur<<<g, b>>>(i, d_s, r, t, ks);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      std::cout << "original method time: " << dt/(float)USECPSEC << "s" << std::endl;
      check(err);
      // proposal 1
      cudaMemset(r, 0, img_dim*img_dim*sizeof(r[0]));
      cudaMemset(t, 0, img_dim*img_dim*sizeof(t[0]));
      dt = dtime_usec(0);
      blur1<<<g, b>>>(i, d_s, r, t, ks);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      std::cout << "method 1 time: " << dt/(float)USECPSEC << "s" << std::endl;
      check(err);
      // proposal 2
      cudaMemset(r2, 0, img_dim*img_dim*sizeof(r2[0]));
      cudaMemset(t, 0, img_dim*img_dim*sizeof(t[0]));
      dt = dtime_usec(0);
      blur2<<<g, b>>>(i, d_s, r2, t, ks);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      std::cout << "method 2 time: " << dt/(float)USECPSEC << "s" << std::endl;
      check(err);
      test_ok = true;
      test_equal<<<88*3, 512>>>(r, r2, img_dim*img_dim);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      check(err);
      if (!test_ok) std::cout << "mismatch method 1/2" << std::endl;
      // proposal 3
      cudaMemset(r2, 0, img_dim*img_dim*sizeof(r2[0]));
      cudaMemset(t, 0, img_dim*img_dim*sizeof(t[0]));
      dt = dtime_usec(0);
      blur3<<<g, b>>>(i, d_s, r2, t, ks);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      std::cout << "method 3 time: " << dt/(float)USECPSEC << "s" << std::endl;
      check(err);
      test_ok = true;
      test_equal<<<88*3, 512>>>(r, r2, img_dim*img_dim);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      check(err);
      if (!test_ok) std::cout << "mismatch method 1/3" << std::endl;
      // proposal 4
      cudaMemset(r2, 0, img_dim*img_dim*sizeof(r2[0]));
      cudaMemset(t, 0, img_dim*img_dim*sizeof(t[0]));
      dt = dtime_usec(0);
      blur4<<<g, b>>>(i, d_s, r2, t, ks);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      dt = dtime_usec(dt);
      std::cout << "method 4 time: " << dt/(float)USECPSEC << "s" << std::endl;
      check(err);
      test_ok = true;
      test_equal<<<88*3, 512>>>(r, r2, img_dim*img_dim);
      err = cudaGetLastError();
      check(err);
      err = cudaDeviceSynchronize();
      check(err);
      if (!test_ok) std::cout << "mismatch method 1/4" << std::endl;
    #if DEBUG
      uchar3 *h_r = new uchar3[img_dim*img_dim];
      cudaMemcpy(h_r, r, img_dim*img_dim*sizeof(uchar3), cudaMemcpyDeviceToHost);
      for (int i = 0; i < img_dim; i++){
        for (int j = 0; j < img_dim; j++)  std::cout << (int)(h_r[i*img_dim+j].x) << " ";
        std::cout << std::endl;}
      cudaMemcpy(h_r, r2, img_dim*img_dim*sizeof(uchar3), cudaMemcpyDeviceToHost);
      std::cout << std::endl;
      for (int i = 0; i < img_dim; i++){
        for (int j = 0; j < img_dim; j++)  std::cout << (int)(h_r[i*img_dim+j].x) << " ";
        std::cout << std::endl;}
    #endif
    }
    # nvcc -o t242 t242.cu -arch=sm_89 -rdc=true
    # ./t242
    original method time: 0.003729s
    method 1 time: 0.001868s
    method 2 time: 0.003716s
    method 3 time: 0.001913s
    method 4 time: 4.4e-05s
    mismatch method 1/4
    #
    

    Notes: