image-processing cuda alpha-blending

Increasing the number of threads has no effect on runtime


I have tried to implement an alpha image blending algorithm in CUDA C. There are no errors in my code; it compiles fine. My expectation was that running the code with more threads would decrease the runtime, but instead I see a nearly flat pattern: with 1 thread the runtime was 8.060539e-01 s, with 4 threads it was 7.579031e-01 s, with 8 threads it was 7.810102e-01 s, and with 256 threads it was 7.875319e-01 s.

Here is my code:

#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"

__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){

        int col = threadIdx.x + blockIdx.x*blockDim.x;
        int row = threadIdx.y + blockIdx.y*blockDim.y;

        if(col<width && row<height){
                size_t img_size = width * height * channels;

                if (Pout != NULL)
                {
                        for (size_t i = 0; i < img_size; i++)
                        {
                                Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
                        }
                }
        }
}

int main(int argc, char* argv[]){

        int thread_count;
        double start, finish;
        float alpha;
        int width, height, channels;
        unsigned char *new_img;
        thread_count = strtol(argv[1], NULL, 10);
        printf("Enter the value for alpha:");
        scanf("%f", &alpha);
        unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
        unsigned char *orange = stbi_load("orange.jpg", &width, &height, &channels, 0);
        size_t img_size = width * height * channels;
        //unsigned char *new_img = malloc(img_size);
        cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
        cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
        cudaMallocManaged(&orange, img_size*sizeof(unsigned char));
        GET_TIME(start);
        image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
        cudaDeviceSynchronize();
        GET_TIME(finish);
        stbi_write_jpg("new_image.jpg", width, height, channels, new_img, 100);
        cudaFree(new_img);
        cudaFree(apple);
        cudaFree(orange);
        printf("\n Elapsed time for cuda = %e seconds\n", finish-start);

}

After seeing this odd pattern in the runtimes I am a bit skeptical about my implementation. Can anyone explain why I get these runtimes even though my code has no apparent bug?


Solution

  • Let's start here:

    image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
    

    It seems evident you don't understand the kernel launch syntax:

    <<<1,16,thread_count>>>
    

    The first number (1) is the number of blocks to launch. The second number (16) is the number of threads per block. The third number (thread_count) is the size of the dynamically allocated shared memory in bytes.

    So our first observation will be that although you claimed to have changed the thread count, you didn't. You were changing the number of bytes of dynamically allocated shared memory. Since your kernel code doesn't use shared memory, this is a completely meaningless variable.

    Let's also observe your kernel code:

    for (size_t i = 0; i < img_size; i++)
    {
        Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
    }
    

    For every thread that passes your if test, each of those threads executes the entire for-loop and processes the entire image. That is not the general idea when writing CUDA kernels. The general idea is to break up the work so that each thread does a portion of it, not the whole job.

    These are very basic observations. If you take advantage of an orderly introduction to CUDA, such as here, you can get beyond some of these basic concepts.

    We could also point out that your kernel nominally expects a 2D launch, and you are not providing one, and perhaps many other observations. Another important concept that you are missing is that you cannot do this:

        unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
        ...
        cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
    

    and expect anything sensible to come from it. The cudaMallocManaged call overwrites the pointer that stbi_load returned, so apple ends up pointing at freshly allocated, uninitialized memory while the loaded image data is leaked. If you want to see how data is moved from a host allocation to the device, study nearly any CUDA sample code, such as vectorAdd; a managed allocation does not let you overwrite a pointer this way and get anything useful from it.

    I'll provide an example of how one might go about doing what I think you are suggesting, without providing a complete tutorial on CUDA. To provide an example, I'm going to skip the STB image loading routines. To understand the work you are trying to do here, the actual image content does not matter.

    Here's an example of an image processing kernel (1D) that will:

    1. Process the entire image, only once
    2. Use less time, roughly speaking, as you increase the thread count.

    You haven't provided your timer routine/code, so I'll provide my own:

    $ cat t2130.cu
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start=0){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    unsigned char *i_load(int w, int h, int c, int init){
      unsigned char *res = new unsigned char[w*h*c];
      for (int i = 0; i < w*h*c; i++) res[i] = init;
      return res;
    }
    
    __global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
    
      if (Pout != NULL)
        {
        size_t img_size = width * height * channels;
        for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < img_size; i+=gridDim.x*blockDim.x) // grid-stride loop
          {
            Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
          }
        }
    }
    
    int main(int argc, char* argv[]){
    
            int threads_per_block = 64;
            unsigned long long dt;
            float alpha;
            int width = 1920;
            int height = 1080;
            int channels = 3;
            size_t img_size = width * height * channels;
            int thread_count = img_size;
            if (argc > 1) thread_count = atoi(argv[1]);
            unsigned char *new_img, *m_apple, *m_orange;
            printf("Enter the value for alpha:");
            scanf("%f", &alpha);
            unsigned char *apple = i_load(width, height, channels, 10);
            unsigned char *orange = i_load(width, height, channels, 70);
            //unsigned char *new_img = malloc(img_size);
            cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
            cudaMallocManaged(&m_apple,img_size* sizeof(unsigned char));
            cudaMallocManaged(&m_orange, img_size*sizeof(unsigned char));
            memcpy(m_apple, apple, img_size);
            memcpy(m_orange, orange, img_size);
            int blocks;
            if (thread_count < threads_per_block) {threads_per_block = thread_count; blocks = 1;}
            else {blocks = thread_count/threads_per_block;}
            printf("running with %d blocks of %d threads\n", blocks, threads_per_block);
            dt = dtime_usec(0);
            image_blend<<<blocks, threads_per_block>>>(new_img,m_apple, m_orange, width, height, channels,alpha);
            cudaDeviceSynchronize();
            dt = dtime_usec(dt);
            cudaError_t err = cudaGetLastError();
            if (err != cudaSuccess) printf("CUDA Error: %s\n", cudaGetErrorString(err));
            else printf("\n Elapsed time for cuda = %e seconds\n", dt/(float)USECPSEC);
            cudaFree(new_img);
            cudaFree(m_apple);
            cudaFree(m_orange);
    }
    $ nvcc -o t2130 t2130.cu
    $ ./t2130 1
    Enter the value for alpha:0.2
    running with 1 blocks of 1 threads
    
     Elapsed time for cuda = 5.737880e-01 seconds
    $ ./t2130 2
    Enter the value for alpha:0.2
    running with 1 blocks of 2 threads
    
     Elapsed time for cuda = 3.230150e-01 seconds
    $ ./t2130 32
    Enter the value for alpha:0.2
    running with 1 blocks of 32 threads
    
     Elapsed time for cuda = 4.865200e-02 seconds
    $ ./t2130 64
    Enter the value for alpha:0.2
    running with 1 blocks of 64 threads
    
     Elapsed time for cuda = 2.623300e-02 seconds
    $ ./t2130 128
    Enter the value for alpha:0.2
    running with 2 blocks of 64 threads
    
     Elapsed time for cuda = 1.546000e-02 seconds
    $ ./t2130
    Enter the value for alpha:0.2
    running with 97200 blocks of 64 threads
    
     Elapsed time for cuda = 5.809000e-03 seconds
    $
    

    (CentOS 7, CUDA 11.4, V100)

    The key methodology that allows the kernel to do all the work (only once) while making use of an "arbitrary" number of threads efficiently is the grid-stride loop.