c++cudatesla

Why my cuda program became slower after using 128 threads on blocks?


I have a simple cuda application with the following code:

#include <stdio.h>
#include <sys/time.h>
#include <stdint.h>
__global__
void daxpy(int n, int a, int *x, int *y) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  y[i] = x[i];
  int j;
  for(j = 0; j < 1024*10000; ++j) {
     y[i] += j%10;
  }
}
// debug time
void calc_time(struct timeval *start, const char *msg) {
   struct timeval end;
   gettimeofday(&end, NULL);
   uint64_t us = end.tv_sec * 1000000 + end.tv_usec - (start->tv_sec * 1000000 + start->tv_usec);
   printf("%s cost us = %llu\n", msg, us);
   memcpy(start, &end, sizeof(struct timeval));
}
void do_test() {
   unsigned long n = 1536;
   int *x, *y, a, *dx, *dy;
   a = 2.0;
   x = (int*)malloc(sizeof(int)*n);
   y = (int*)malloc(sizeof(int)*n);
   for(i = 0; i < n; ++i) {
      x[i] = i;
   }

   cudaMalloc((void**)&dx, n*sizeof(int));
   cudaMalloc((void**)&dy, n*sizeof(int));
   struct timeval start;
   gettimeofday(&start, NULL);
   cudaMemcpy(dx, x, n*sizeof(int), cudaMemcpyHostToDevice);

   daxpy<<<1, 512>>>(n, a, dx, dy); // this line 
   cudaThreadSynchronize();
   cudaMemcpy(y, dy, n*sizeof(int), cudaMemcpyDeviceToHost);
   calc_time(&start, "do_test ");
   cudaFree(dx);
   cudaFree(dy);
   free(x);
   free(y);
}
int main() {
   do_test();
   return 0;
}

The gpu kernel call is daxpy<<<1, 512>>>(n, a, dx, dy) and I performed some tests using different block sizes:

... and made the following observations:

I would like to ask what is causing the difference in execution time for block sizes 128 and 129.

My GPU is tesla K80:

CUDA Driver Version / Runtime Version          6.5 / 6.5
CUDA Capability Major/Minor version number:    3.7
Total amount of global memory:                 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP:     2496 CUDA Cores
GPU Clock rate:                                824 MHz (0.82 GHz)
Memory Clock rate:                             2505 Mhz
Memory Bus Width:                              384-bit
L2 Cache Size:                                 1572864 bytes
Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
Total amount of constant memory:               65536 bytes
Total amount of shared memory per block:       49152 bytes
Total number of registers available per block: 65536
Warp size:                                     32
Maximum number of threads per multiprocessor:  2048
Maximum number of threads per block:           1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch:                          2147483647 bytes
Texture alignment:                             512 bytes
Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
Run time limit on kernels:                     No
Integrated GPU sharing Host Memory:            No
Support host page-locked memory mapping:       Yes
Alignment requirement for Surfaces:            Yes
Device has ECC support:                        Enabled
Device supports Unified Addressing (UVA):      Yes
Device PCI Bus ID / PCI location ID:           135 / 0

Solution

  • After providing us with the exact time differences in one of the comments, i.e.:

    I think it indirectly supports my theory of issue being related to warp scheduling. Look at the GK210 whitepaper, which is a chip used in K80:

    Therefore, for 129 threads, scheduling cannot happen at once, because SMX has to schedule 5 warps, i.e. scheduling will happen in two steps.

    If the above is true, then I would expect:

    192 is the number of cores on the SMX, see whitepaper. As a reminder - entire blocks are always scheduled for one SMX and so obviously if you spawn more than 192 threads then those for sure won't be able to execute in parallel and execution time should be higher for 193+ number of threads.

    You can verify the above thesis by simplifying your kernel code to the degree where it will do almost nothing so it should be more or less obvious whether the execution takes longer only due to scheduling (there will be no other limiting factors such as memory throughput).

    Disclaimer: The above are just my assumptions as I don't have access to K80, nor any other GPU with quad warp scheduler so I cannot profile your code properly. But anyway, I believe that is the task for you - why not to use nvprof and profile your code yourself? Then you should be able to see where the time difference lies.