
How to debug a CUDA kernel properly?


I have this function:

void initializeCudaMatrixWithCPUMatrix(float *hostA, float *cudaA, int n,
                                       int m) {
  // float testt = hostA[888];
  cudaMalloc((void **)&cudaA, n * m * sizeof(float));
  cudaMemcpy(cudaA, hostA, n * m * sizeof(float), cudaMemcpyHostToDevice);
  cudaDeviceSynchronize();
  // testt = cudaA[888];
}

I call it like this:

  initializeCudaMatrixWithCPUMatrix(model->embeddingMatrix,
                                    model->embeddingMatrixCuda, 408, 1024);

broadcastedEmbeddingMatrix is allocated like this (I changed it to this after einpoklum's answer; it previously had the same mistake he mentioned):

  float *broadcastedEmbeddingMatrix;
  gpuMallocMatrix(&broadcastedEmbeddingMatrix, trademodel->maxnum * 128, 1024);
  cudaDeviceSynchronize();

where gpuMallocMatrix is defined as:

void gpuMallocMatrix(float **A, int m, int n) {
  cudaError_t cudaStatus = cudaMalloc((void **)A, n * m * sizeof(float));
}

Then I call:

broadcastMatrix(trademodel->embeddingMatrixCuda, 
                broadcastedEmbeddingMatrix, 408, 1024, 128); 

and then in a separate file I have:

extern "C" void broadcastMatrix( float *Matrix, float *BroadcastedMatrix,
                                 int rows, int cols, int batchsize )   {   
    dim3 rows2d(rows, batchsize);
    broadcastMatrixKernel<<<rows2d, cols>>>(Matrix, BroadcastedMatrix,
                                            rows, cols, batchsize); 
}

And then in that file as well the kernel itself:

__global__ void broadcastMatrixKernel(float *Matrix, float *BroadcastedMatrix, 
                                      int rows, int cols, int batchsize) {
   int rowIdx = blockIdx.x;
   int batchIdx = blockIdx.y;
   int colIdx = threadIdx.x;
   if (rowIdx < rows && colIdx < cols && batchIdx < batchsize) { 
       BroadcastedMatrix[batchIdx * rows * cols + rowIdx * cols + colIdx] = 
       Matrix[rowIdx * cols + colIdx];   
   }
}

And I get: received signal CUDA_EXCEPTION_14, Warp Illegal Address for this call right here:

broadcastMatrix(trademodel->embeddingMatrixCuda, 
                broadcastedEmbeddingMatrix, 408, 1024, 128);

I thought that maybe one of those matrices was empty; that's why I wanted to check with testt = cudaA[888];. Since that doesn't work, I've commented those lines out to clarify that they're not really part of the code.
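
(For what it's worth, a device pointer can't be dereferenced from host code; one way to peek at a single device value is to copy it back to the host first, e.g.:)

  float testt;
  cudaMemcpy(&testt, cudaA + 888, sizeof(float), cudaMemcpyDeviceToHost);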

@paleonix I use cuda-gdb (or maybe just gdb? I can't look it up at the moment) with a launch.json and VS Code for debugging, so I can use breakpoints etc. But if I step into the kernel function itself, it doesn't work as I'd expect. It also doesn't happen for the same block/thread pair every time. It's usually block (0,0,0) and thread (418,0,0), or something like (480,0,0) or (416,0,0); it always happens within the same (the first) block, just the thread varies. Maybe I could show you the launch.json and we could fix the debugging. I'd massively prefer a proper launch.json (i.e. proper debugging within the kernel) over printf statements.
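
(For reference, a minimal launch.json sketch for kernel-level debugging with NVIDIA's Nsight Visual Studio Code Edition extension might look like the following; "your_binary" is a placeholder, and the target has to be built with nvcc -g -G so device debug symbols exist:)

  {
    "version": "0.2.0",
    "configurations": [
      {
        "name": "CUDA: launch and debug kernel",
        "type": "cuda-gdb",
        "request": "launch",
        "program": "${workspaceFolder}/your_binary"
      }
    ]
  }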

Some additional information: CUDA must be installed correctly, since just before the warp error I call another kernel and that one works. BroadcastedMatrix is initialized with cudaMalloc, and cudaMalloc returns cudaSuccess for that call, so I think that's not the issue. (cudaMalloc also returns cudaSuccess when allocating embeddingMatrixCuda.)

I hope this question makes sense, best regards :)


Solution

  • You are leaking the device-side allocation in initializeCudaMatrixWithCPUMatrix: you're letting cudaMalloc() set the value of cudaA, which is a local float * variable, not a reference to an external float *; the variable ceases to exist when the function returns.

    So, when you make your call of the function:

    initializeCudaMatrixWithCPUMatrix(
        model->embeddingMatrix, model->embeddingMatrixCuda, 408, 1024);
    

    the value of model->embeddingMatrixCuda remains the same after the function returns.
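
    One way to fix this is to pass the device pointer by address, so the value written by cudaMalloc() survives the return; a minimal sketch of that fix:

    void initializeCudaMatrixWithCPUMatrix(float *hostA, float **cudaA, int n,
                                           int m) {
      cudaMalloc((void **)cudaA, n * m * sizeof(float));
      cudaMemcpy(*cudaA, hostA, n * m * sizeof(float), cudaMemcpyHostToDevice);
    }

    // called as:
    initializeCudaMatrixWithCPUMatrix(
        model->embeddingMatrix, &model->embeddingMatrixCuda, 408, 1024);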

    As for everything happening later in your code: I suggest you start by getting your allocation right; then, if you're still having trouble, maybe ask another question, and please make it very focused.


    Some more advice, beyond your specific question:

    1. Consider having your functions take a GPU ID, since who knows whether your code is always going to be using the first/default GPU?

    2. (As @paleonix has also suggested:) Always check your API calls for errors. See this SO question about how to do that conveniently.
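
      One common pattern (a sketch; the CUDA_CHECK name is mine, not necessarily what the linked question uses) is a macro that checks every runtime call, plus cudaGetLastError() after kernel launches, since launches don't return an error directly:

      #include <stdio.h>
      #include <stdlib.h>

      #define CUDA_CHECK(call)                                         \
        do {                                                           \
          cudaError_t err_ = (call);                                   \
          if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
          }                                                            \
        } while (0)

      // usage:
      CUDA_CHECK(cudaMalloc((void **)&cudaA, n * m * sizeof(float)));
      broadcastMatrixKernel<<<rows2d, cols>>>(/* ... */);
      CUDA_CHECK(cudaGetLastError());      // catches launch-configuration errors
      CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous execution errors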

    3. Even if you're writing C-style rather than C++-style code - consider structuring your data better. So, if you use matrices, define something like:

      struct matrix {
         float* data;
         size_t dims[2]; // dims[0] is the major dimension
      };
      

      And then you could write a function such as:

      struct matrix create_gpu_copy_of_matrix(struct matrix on_host, int gpu_id);
      

      and explain in a comment that the returned matrix is "owning", i.e. that it's the caller's responsibility to manage its GPU-side memory. You might then be able to write something like:

      model->embedding[GPU_DEVICE_SIDE] = create_gpu_copy_of_matrix(
          model->embedding[HOST_SIDE], chosen_gpu_id);
      

      without even having to repeat the matrix dimensions; without the possibility of getting them wrong; and with much better readability, IMNSHO.
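
      For concreteness, here is a sketch of what such a function could look like (the cudaSetDevice() call is my assumption about how the gpu_id parameter from suggestion 1 would be honored):

      struct matrix create_gpu_copy_of_matrix(struct matrix on_host, int gpu_id) {
        struct matrix on_device = on_host; // copies the dimensions over
        size_t size = on_host.dims[0] * on_host.dims[1] * sizeof(float);
        cudaSetDevice(gpu_id);             // suggestion 1: explicit GPU choice
        cudaMalloc((void **)&on_device.data, size);
        cudaMemcpy(on_device.data, on_host.data, size, cudaMemcpyHostToDevice);
        return on_device;  // "owning": the caller must eventually cudaFree(on_device.data)
      }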