c++ cuda numerical-computing

CUDA access to a matrix stored in RAM, and whether this can be implemented


Recently I started working with numerical computation, solving mathematical problems numerically, programming in C++ with OpenMP. But now my problem is too big and takes days to solve even when parallelized. So I'm thinking of learning CUDA to reduce the time, but I have some doubts.

The heart of my code is the following function. The inputs are two pointers to vectors. N_mesh_points_x,y,z are pre-defined integers, weights_x,y,z are column matrices, kern_1 is an exponential function, and table_kernel is a function that accesses a pre-calculated 50 GB matrix stored in RAM.

void Kernel::paralel_iterate(std::vector<double>* K1, std::vector<double>* K2)
{
  double r, sum_1 = 0, sum_2 = 0;
  double phir;

  // Outer loops: one pass per output point (l,m,p)
  for (int l = 0; l < N_mesh_points_x; l++){
    for (int m = 0; m < N_mesh_points_y; m++){
      for (int p = 0; p < N_mesh_points_z; p++){
        sum_1 = 0;
        sum_2 = 0;

        // Inner loops: reduction over all mesh points (i,j,k) except (l,m,p)
        #pragma omp parallel for schedule(dynamic) private(phir) reduction(+: sum_1,sum_2)
        for (int i = 0; i < N_mesh_points_x; i++){
          for (int j = 0; j < N_mesh_points_y; j++){
            for (int k = 0; k < N_mesh_points_z; k++){

              if (!(i==l) || !(j==m) || !(k==p)){
                phir = weights_x[i]*weights_y[j]*weights_z[k]*kern_1(i,j,k,l,m,p);
                sum_1 += phir * (*K1)[position(i,j,k)];
                sum_2 += phir;
              }

            }
          }
        }
        (*K2)[position(l,m,p)] = sum_1 + (table_kernel[position(l,m,p)] - sum_2) * (*K1)[position(l,m,p)];
      }
    }
  }

  return;
}

My questions are:


Solution

  • Can I program, at least the central part of this function, in CUDA? I only parallelized the internal loops with OpenMP because parallelizing all the loops gave the wrong answer.

    Yes, you should be able to program the portion that you currently have in the OpenMP scope as a CUDA kernel.
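
    For reference, here is a minimal sketch of how the OpenMP-scoped reduction over (i,j,k) for one fixed output point (l,m,p) might look as a CUDA kernel. It assumes (this is not in the original post) that kern_1() and position() have been re-implemented as __device__ functions, that the weights and K1 have been copied to the device as plain double arrays, and that Nx/Ny/Nz stand for N_mesh_points_x/y/z. The grid-stride loop and atomicAdd-based reduction are just one simple way to structure it:

    // Sketch: each thread accumulates partial sums over a slice of the
    // flattened (i,j,k) index space, then adds them to two global accumulators.
    __global__ void inner_sums(const double* __restrict__ K1,
                               const double* __restrict__ weights_x,
                               const double* __restrict__ weights_y,
                               const double* __restrict__ weights_z,
                               int Nx, int Ny, int Nz,
                               int l, int m, int p,
                               double* sum_1, double* sum_2)
    {
        long long tid   = blockIdx.x * (long long)blockDim.x + threadIdx.x;
        long long total = (long long)Nx * Ny * Nz;
        double local_1 = 0.0, local_2 = 0.0;

        // Grid-stride loop over all (i,j,k)
        for (long long n = tid; n < total; n += (long long)blockDim.x * gridDim.x) {
            int i = (int)(n / ((long long)Ny * Nz));
            int j = (int)((n / Nz) % Ny);
            int k = (int)(n % Nz);
            if (i == l && j == m && k == p) continue;      // skip the output point itself

            double phir = weights_x[i] * weights_y[j] * weights_z[k]
                        * kern_1(i, j, k, l, m, p);        // assumed __device__ function
            local_1 += phir * K1[position(i, j, k)];       // assumed __device__ function
            local_2 += phir;
        }

        // Simplest correct reduction: atomics on the two accumulators
        // (atomicAdd on double requires compute capability 6.0 or newer).
        // A shared-memory block reduction would be faster.
        atomicAdd(sum_1, local_1);
        atomicAdd(sum_2, local_2);
    }

    Launching this once per (l,m,p) keeps the structure of your current code; a faster variant would also map the outer loops onto the grid (for example, one block per output point with a block-level reduction), which avoids the per-point launch overhead.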

  • The function table_kernel accesses a big matrix; the matrix is too big to be stored in my video card's memory, so the file will stay in RAM. Is this a problem? Can CUDA easily access data in RAM? Or can this not be done, so that everything needs to be stored on the video card?

    Since you only access this outside the OpenMP scope, if you only use a CUDA kernel for the work that you are currently doing with OpenMP, it should not be necessary to access table_kernel from the GPU, and therefore this should not be an issue. If you attempt to add additional loops to be parallelized on the GPU, then this may become an issue. Since the access would be relatively infrequent (compared to the processing going on in the inner loops), if you wanted to pursue this, you could try making the table_kernel data available to the GPU via cudaHostAlloc - basically mapping host memory into the GPU address space. This is normally a significant performance hazard, but if you make infrequent accesses to it as mentioned, it may or may not be a serious performance issue.
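
    A minimal sketch of that approach, assuming the table can be (re)built as one contiguous double array; the function name and the table_size parameter are illustrative, not from the original code:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Allocate pinned, mapped ("zero-copy") host memory for the table and
    // return both the host pointer (to be filled in) and a device pointer
    // that kernels can dereference directly; each such read goes over PCIe.
    double* alloc_mapped_table(size_t table_size, double** d_table_out)
    {
        double* h_table = nullptr;
        cudaError_t err = cudaHostAlloc((void**)&h_table,
                                        table_size * sizeof(double),
                                        cudaHostAllocMapped);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaHostAlloc: %s\n", cudaGetErrorString(err));
            return nullptr;
        }

        // Device-side alias of the same host allocation
        err = cudaHostGetDevicePointer((void**)d_table_out, h_table, 0);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaHostGetDevicePointer: %s\n", cudaGetErrorString(err));
            cudaFreeHost(h_table);
            return nullptr;
        }
        return h_table;   // fill on the host; pass *d_table_out to kernels
    }

    Frequently-read data (the weights, K1, K2) should still live in device memory; only the occasional table_kernel lookups would go through this mapped allocation.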

    Note that you won't be able to use or access std::vector in device code, so those types of data containers would probably have to be realized as ordinary double arrays.
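
    As a minimal sketch of that, assuming the kernels only need the raw buffers (the helper names here are illustrative):

    #include <cuda_runtime.h>
    #include <vector>

    // Copy a std::vector<double> into a plain device array; only the raw
    // buffer (v.data()) crosses the host/device boundary, never the vector.
    double* vector_to_device(const std::vector<double>& v)
    {
        double* d_ptr = nullptr;
        cudaMalloc((void**)&d_ptr, v.size() * sizeof(double));
        cudaMemcpy(d_ptr, v.data(), v.size() * sizeof(double),
                   cudaMemcpyHostToDevice);
        return d_ptr;   // pass this pointer to kernels; cudaFree() when done
    }

    // Copy results back into an already-sized std::vector (e.g. K2)
    void device_to_vector(const double* d_ptr, std::vector<double>& v)
    {
        cudaMemcpy(v.data(), d_ptr, v.size() * sizeof(double),
                   cudaMemcpyDeviceToHost);
    }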