Tags: cuda, nvidia, cuda-gdb

Optimizing the solution of the 2D diffusion (heat) equation in CUDA


I have already checked earlier questions on SO about this issue, but I cannot see how they relate here.

I am solving the 2D diffusion (heat) equation with CUDA, and it turns out that my GPU code runs slower than its CPU counterpart.

Here is my code:

// kernel definition
__global__ void diffusionSolver(double* A, int n_x, int n_y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    // update only interior points: the product is zero whenever i or j lies on the boundary
    if (i < n_x && j < n_y && i*(n_x-i-1)*j*(n_y-j-1) != 0)
        A[i+n_y*j] = A[i+n_y*j] + (A[i-1+n_y*j] + A[i+1+n_y*j] + A[i+(j-1)*n_y] + A[i+(j+1)*n_y] - 4.0*A[i+n_y*j])/40.0;
}

And the main function:

int main()
{
    int n_x = 200;
    int n_y = 200;
    double *phi;
    double *dummy;
    double *phi_old;
    int i, j;

    phi     = (double *) malloc(n_x*n_y*sizeof(double));
    phi_old = (double *) malloc(n_x*n_y*sizeof(double));
    dummy   = (double *) malloc(n_x*n_y*sizeof(double));
    int iterationMax = 200;

    for (j = 0; j < n_y; j++)
    {
        for (i = 0; i < n_x; i++)
        {
            if ((.4*n_x - i)*(.6*n_x - i) < 0)
                phi[i+n_y*j] = -1;
            else
                phi[i+n_y*j] = 1;
        }
    }

    double *dev_phi;
    cudaMalloc((void **) &dev_phi, n_x*n_y*sizeof(double));
    cudaMemcpy(dev_phi, phi, n_x*n_y*sizeof(double), cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(10,100);
    dim3 numBlocks(n_x*n_y / threadsPerBlock.x, n_x*n_y / threadsPerBlock.y);

    for (int z = 0; z < iterationMax; z++)
    {
        if (z % 100 == 0)
            cout << z/100 << "\n";
        diffusionSolver<<<numBlocks, threadsPerBlock>>>(dev_phi, n_x, n_y);
    }

    cudaMemcpy(phi, dev_phi, n_x*n_y*sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dev_phi);
    return 0;
}

The problem with this code is that it runs slower than a simple CPU-only iterative method. I don't know much about profilers; so far I have only run cuda-memcheck, which reports 0 errors. How can I find out which portion of the code is performing slowly, and speed it up? I am working in a Linux environment. Thanks in advance for any help.
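
One minimal way to see how much time the kernel loop itself takes is to bracket it with CUDA events. The sketch below reuses dev_phi, iterationMax and the launch configuration from the code above; the event variable names are only placeholders:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int z = 0; z < iterationMax; z++)
        diffusionSolver<<<numBlocks, threadsPerBlock>>>(dev_phi, n_x, n_y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the recorded work to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cout << "kernel loop: " << ms << " ms\n";

    cudaEventDestroy(start);
    cudaEventDestroy(stop);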


Solution

  • The worst problem I see is that you are launching far too many blocks for the size of the input array. At the moment you are computing the grid size as:

    dim3 numBlocks(n_x*n_y / threadsPerBlock.x, n_x*n_y / threadsPerBlock.y);
    

    which yields a grid size of (4000,400) blocks for an input array of only 200x200. That is clearly incorrect. The calculation should be something like:

    int nbx = (n_x / threadsPerBlock.x) + (((n_x % threadsPerBlock.x) == 0) ? 0 : 1);
    int nby = (n_y / threadsPerBlock.y) + (((n_y % threadsPerBlock.y) == 0) ? 0 : 1);
    dim3 numBlocks(nbx,nby);
    

    which would yield a grid size of (20,2) blocks, i.e. 40 blocks in total, or 40,000 times fewer than you are currently launching.

    There are other optimisations you could consider making to the kernel, but they pale into insignificance compared with a mistake of this magnitude. A sketch of the corrected launch in context is shown below.
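
    For concreteness, this is roughly what the corrected launch would look like, keeping the original 10x100 block shape; the divUp helper name is just illustrative:

    // integer ceiling division, used to round the grid size up (helper name is illustrative)
    static int divUp(int a, int b) { return (a + b - 1) / b; }

    ...

    dim3 threadsPerBlock(10, 100);                       // 1000 threads per block, as in the question
    dim3 numBlocks(divUp(n_x, threadsPerBlock.x),        // 200/10  = 20 blocks in x
                   divUp(n_y, threadsPerBlock.y));       // 200/100 =  2 blocks in y

    for (int z = 0; z < iterationMax; z++)
        diffusionSolver<<<numBlocks, threadsPerBlock>>>(dev_phi, n_x, n_y);

    cudaDeviceSynchronize();                             // make sure the kernels have finished before copying back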