N-body OpenCL code : error CL_OUT_OF_HOST_MEMORY with GPU card NVIDIA A6000

I would like to make run an old N-body which uses OpenCL.

I have 2 cards NVIDIA A6000 with NVLink, a component which binds from an hardware (and maybe software ?) point of view these 2 GPU cards.

But at the execution, I get the following result:

Here is the kernel code used (I have put pragma that I estimate useful for NVIDIA cards):

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel
void
nbody_sim(
    __global double4* pos ,
    __global double4* vel,
    int numBodies,
    double deltaTime,
    double epsSqr,
    __local double4* localPos,
    __global double4* newPosition,
    __global double4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Gravitational constant
    double G_constant = 227.17085e-74;

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    double4 myPos = pos[gid];
    double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // Calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            double4 r = localPos[j] - myPos;
            double distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            double invDist = 1.0f / sqrt(distSqr + epsSqr);
            double invDistCube = invDist * invDist * invDist;
            double s = G_constant * localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    double4 oldVel = vel[gid];

    // updated position and velocity
    double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    double4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

The part of code which sets up the Kernel code is below:

int NBody::setupCL()
{
  cl_int status = CL_SUCCESS;
  cl_event writeEvt1, writeEvt2;

  // The block is to move the declaration of prop closer to its use
  cl_command_queue_properties prop = 0;
  commandQueue = clCreateCommandQueue(
      context,
      devices[current_device],
      prop,
      &status);
  CHECK_OPENCL_ERROR( status, "clCreateCommandQueue failed.");

    ...

// create a CL program using the kernel source
  const char *kernelName = "NBody_Kernels.cl";
  FILE *fp = fopen(kernelName, "r");
  if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
  }
  char *source = (char*)malloc(10000);
  int sourceSize = fread( source, 1, 10000, fp);
  fclose(fp);

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  // Build the program
  status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

  status = waitForEventAndRelease(&writeEvt1);
  CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt1) Failed");

  status = waitForEventAndRelease(&writeEvt2);
  CHECK_ERROR(status, NBODY_SUCCESS, "WaitForEventAndRelease(writeEvt2) Failed");

  return NBODY_SUCCESS;
}

So, the errors occurs at the creation of the Kernel code. Is there a way to consider the 2 GPU as a unique GPU with NVLINK component ? I mean from a software point of view ?

How can I fix this error of creation of Kernel code ?

Update 1

I) I have voluntarily restricted the number of GPU devices to only one GPU by modifying this loop below (actually, it remains only one iteration):

  // Print device index and device names
  //for(cl_uint i = 0; i < deviceCount; ++i)
  for(cl_uint i = 0; i < 1; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

  // Set id = 0 for currentDevice with deviceType
  *currentDevice = 0;

  free(deviceIds);

  return NBODY_SUCCESS;
}

and doing after the classical call:

 status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);

But error remains, below the message:

II) If I don't modify this loop and apply the solution suggested,i.e set devices[current_device] instead of devices I get a compilation error like this:

In file included from NBody.hpp:8,
                 from NBody.cpp:1:
/opt/AMDAPPSDK-3.0/include/CL/cl.h:863:16: note:   initializing argument 3 of ‘cl_int clBuildProgram(cl_program, cl_uint, _cl_device_id* const*, const char*, void (*)(cl_program, void*), void*)’
                const cl_device_id * /* device_list */,

How could I circumvent this issue of compilation ?

Update 2

I have printed the values of status variable in this portion of my code:

and I get a value for status = -44. From CL/cl.h, it would correspond to a CL_INVALID_PROGRAM error:

and then, when I execute the application, I get:

I wonder if I didn't miss to put special pragma in kernel code since i am using OpenCL on NVIDIA cards, don't I ?

By the way, what is the type of the variables devices ? I can't manage to print it correctly.

Update 3

I have added the following lines but still -44 error at the execution. Instead of putting all the concerned code, I provide the following link to download the source file: http://31.207.36.11/NBody.cpp and the Makefile used for compilation: http://31.207.36.11/Makefile . Maybe someone will find some errors but I would like mostly know why I get this error -44 .

Update 4

I am taking over this project.

Here is the result of clinfo command:

$ clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 3.0 CUDA 11.4.94
  Platform Name:                 NVIDIA CUDA
  Platform Vendor:               NVIDIA Corporation
  Platform Extensions:               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info


  Platform Name:                 NVIDIA CUDA
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     10deh
  Max compute units:                 84
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               64
  Max work group size:               1024
  Preferred vector width char:           1
  Preferred vector width short:          1
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          1
  Native vector width short:             1
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1800Mhz
  Address bits:                  64
  Max memory allocation:             12762480640
  Image support:                 Yes
  Max number of images read arguments:       256
  Max number of images write arguments:      32
  Max image 2D width:                32768
  Max image 2D height:               32768
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                16384
  Max samplers within kernel:            32
  Max size of kernel argument:           4352
  Alignment (bits) of base address:      4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               128
  Cache size:                    2408448
  Global memory size:                51049922560
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory type:                 Scratchpad
  Local memory size:                 49152
  Max pipe arguments:                0
  Max pipe active reservations:          0
  Max pipe packet size:              0
  Max global variable size:          0
  Max global variable preferred total size:  0
  Max read/write image args:             0
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:
    Coarse grain buffer:             Yes
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     32
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1000
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:
    Out-of-Order:                Yes
    Profiling :                  Yes
  Queue on Device properties:
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x1e97440
  Name:                      NVIDIA RTX A6000
  Vendor:                    NVIDIA Corporation
  Device OpenCL C version:           OpenCL C 1.2
  Driver version:                470.57.02
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 3.0 CUDA
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info


  Device Type:                   CL_DEVICE_TYPE_GPU
  Vendor ID:                     10deh
  Max compute units:                 84
  Max work items dimensions:             3
    Max work items[0]:               1024
    Max work items[1]:               1024
    Max work items[2]:               64
  Max work group size:               1024
  Preferred vector width char:           1
  Preferred vector width short:          1
  Preferred vector width int:            1
  Preferred vector width long:           1
  Preferred vector width float:          1
  Preferred vector width double:         1
  Native vector width char:          1
  Native vector width short:             1
  Native vector width int:           1
  Native vector width long:          1
  Native vector width float:             1
  Native vector width double:            1
  Max clock frequency:               1800Mhz
  Address bits:                  64
  Max memory allocation:             12762578944
  Image support:                 Yes
  Max number of images read arguments:       256
  Max number of images write arguments:      32
  Max image 2D width:                32768
  Max image 2D height:               32768
  Max image 3D width:                16384
  Max image 3D height:               16384
  Max image 3D depth:                16384
  Max samplers within kernel:            32
  Max size of kernel argument:           4352
  Alignment (bits) of base address:      4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                     Yes
    Quiet NaNs:                  Yes
    Round to nearest even:           Yes
    Round to zero:               Yes
    Round to +ve and infinity:           Yes
    IEEE754-2008 fused multiply-add:         Yes
  Cache type:                    Read/Write
  Cache line size:               128
  Cache size:                    2408448
  Global memory size:                51050315776
  Constant buffer size:              65536
  Max number of constant args:           9
  Local memory type:                 Scratchpad
  Local memory size:                 49152
  Max pipe arguments:                0
  Max pipe active reservations:          0
  Max pipe packet size:              0
  Max global variable size:          0
  Max global variable preferred total size:  0
  Max read/write image args:             0
  Max on device events:              0
  Queue on device max size:          0
  Max on device queues:              0
  Queue on device preferred size:        0
  SVM capabilities:
    Coarse grain buffer:             Yes
    Fine grain buffer:               No
    Fine grain system:               No
    Atomics:                     No
  Preferred platform atomic alignment:       0
  Preferred global atomic alignment:         0
  Preferred local atomic alignment:      0
  Kernel Preferred work group size multiple:     32
  Error correction support:          0
  Unified memory for Host and Device:        0
  Profiling timer resolution:            1000
  Device endianess:              Little
  Available:                     Yes
  Compiler available:                Yes
  Execution capabilities:
    Execute OpenCL kernels:          Yes
    Execute native function:             No
  Queue on Host properties:
    Out-of-Order:                Yes
    Profiling :                  Yes
  Queue on Device properties:
    Out-of-Order:                No
    Profiling :                  No
  Platform ID:                   0x1e97440
  Name:                      NVIDIA RTX A6000
  Vendor:                    NVIDIA Corporation
  Device OpenCL C version:           OpenCL C 1.2
  Driver version:                470.57.02
  Profile:                   FULL_PROFILE
  Version:                   OpenCL 3.0 CUDA
  Extensions:                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_fp64 cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_nv_copy_opts cl_khr_gl_event cl_nv_create_buffer cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_nv_kernel_attribute cl_khr_device_uuid cl_khr_pci_bus_info

So I have one platform with 2 GPU cards A6000.

Given the fact that I want to make run the original version of my code (i.e using a single GPU card), I have to select only one ID in the source NBody.cpp (I will see in a second time how to manage with 2 GPU cards but this is for after). So, I have just modified in this source.

Instead of:

  // Print device index and device names
  for(cl_uint i = 0; i < deviceCount; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

I did:

// Print device index and device names
  //for(cl_uint i = 0; i < deviceCount; ++i)
  for(cl_uint i = 0; i < 1; ++i)
  {
    char deviceName[1024];
    status = clGetDeviceInfo(deviceIds[i], CL_DEVICE_NAME, sizeof(deviceName), deviceName, NULL);
    CHECK_OPENCL_ERROR(status, "clGetDeviceInfo failed");

    std::cout << "Device " << i << " : " << deviceName <<" Device ID is "<<deviceIds[i]<< std::endl;
  }

As you can see, I have forced to take into account deviceIds[0], that is to say, a single GPU card.

A critical point is also the part of building program.

  // create a CL program using the kernel source 
  const char *kernelName = "NBody_Kernels.cl";
  FILE *fp = fopen(kernelName, "r");
  if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
  }
  char *source = (char*)malloc(10000);
  int sourceSize = fread( source, 1, 10000, fp);
  fclose(fp);

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  // Build the program
  //status = clBuildProgram(program, 1, devices, NULL, NULL, NULL);
  status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);
  printf("status1 = %d\n", status);
  //printf("devices = %d\n", devices[current_device]);

  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  printf("status2 = %d\n", status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

At the execution, I get the following values for status1 and status2:

Selected Platform Vendor : NVIDIA Corporation
deviceCount = 2/nDevice 0 : NVIDIA RTX A6000 Device ID is 0x55c38207cdb0
status1 = -44
devices = -2113661720
status2 = -44
clCreateKernel failed.
clSetKernelArg failed. (updatedPos)
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.
clEnqueueNDRangeKernel failed.

The first error is a failed creation of kernel. Here my NBody_Kernels.cl source:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel
void 
nbody_sim(
    __global double4* pos ,
    __global double4* vel,
    int numBodies,
    double deltaTime,
    double epsSqr,
    __local double4* localPos,
    __global double4* newPosition,
    __global double4* newVelocity)
{
    unsigned int tid = get_local_id(0);
    unsigned int gid = get_global_id(0);
    unsigned int localSize = get_local_size(0);

    // Gravitational constant
    double G_constant = 227.17085e-74;

    // Number of tiles we need to iterate
    unsigned int numTiles = numBodies / localSize;

    // position of this work-item
    double4 myPos = pos[gid];
    double4 acc = (double4) (0.0f, 0.0f, 0.0f, 0.0f);

    for(int i = 0; i < numTiles; ++i)
    {
        // load one tile into local memory
        int idx = i * localSize + tid;
        localPos[tid] = pos[idx];

        // Synchronize to make sure data is available for processing
        barrier(CLK_LOCAL_MEM_FENCE);

        // Calculate acceleration effect due to each body
        // a[i->j] = m[j] * r[i->j] / (r^2 + epsSqr)^(3/2)
        for(int j = 0; j < localSize; ++j)
        {
            // Calculate acceleration caused by particle j on particle i
            double4 r = localPos[j] - myPos;
            double distSqr = r.x * r.x  +  r.y * r.y  +  r.z * r.z;
            double invDist = 1.0f / sqrt(distSqr + epsSqr);
            double invDistCube = invDist * invDist * invDist;
            double s = G_constant * localPos[j].w * invDistCube;

            // accumulate effect of all particles
            acc += s * r;
        }

        // Synchronize so that next tile can be loaded
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    double4 oldVel = vel[gid];

    // updated position and velocity
    double4 newPos = myPos + oldVel * deltaTime + acc * 0.5f * deltaTime * deltaTime;
    newPos.w = myPos.w;
    double4 newVel = oldVel + acc * deltaTime;

    // write to global memory
    newPosition[gid] = newPos;
    newVelocity[gid] = newVel;
}

The modified source can be found here:

last modified code

I don't know how to solve the creation of this Kernel code and the following values status1 = -44 and status2 = -44.

Update 5

I have added clGetProgramBuildInfo to the code the following snippet to be able to see what's wrong with the clCreateKernl failed error:

// Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);

  if (clBuildProgram(program, 1, devices, NULL, NULL, NULL) != CL_SUCCESS)
  {
    // Determine the size of the log
    size_t log_size;
    clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    // Allocate memory for the log
    char *log = (char *) malloc(log_size);

    cout << "size log =" << log_size << endl;
    // Get the log
    clGetProgramBuildInfo(program, devices[current_device], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

    // Print the log
    printf("%s\n", log);
    }


  // get a kernel object handle for a kernel with the given name
  kernel = clCreateKernel(
      program,
      "nbody_sim",
      &status);
  CHECK_OPENCL_ERROR(status, "clCreateKernel failed.");

Unfortunately, this function clGetProgramBuildInfo only gives the output:

Selected Platform Vendor : NVIDIA Corporation
Device 0 : NVIDIA RTX A6000 Device ID is 0x562857930980
size log =16
log =
clCreateKernel failed.

How can I print the content of "value" ?

Update 6

If I do a printf on :

  // Create a program from the kernel source
  program = clCreateProgramWithSource(context, 1, (const char **)&source, (const size_t *)&sourceSize, &status);
printf("status clCreateProgramWithSourceContext = %d\n", status);

I get an status=-6 which corresponds to CL_OUT_OF_HOST_MEMORY

Which are the tracks which allow to fix this ?

Partial solution

By compiling with Intel compilers (icc and icpc), compilation is performed well and code is running fine. I don't understand why it doesn't work with GNU gcc/g++-8 compiler. If someone had an idea ...

Solution

Your kernel code looks good and the cache tiling implementation is correct. Only make sure that the number of bodies is a multiple of local size, or alternatively limit the inner for loop to the global size additionally.

OpenCL allows usage of multiple devices in parallel. You need to make a thread with a queue for each device separately. You also need to take care of device-device communications and synchronization manually. Data transfer happens over PCIe (you also can do remote direct memory access); but you can't use NVLink with OpenCL. This should not be an issue in your case though as you need only little data transfer compared to the amount of arithmetic.

A few more remarks:

In many cases N-body requires FP64 to sum up the forces and resolve positions at very different length scales. However on the A6000, FP64 performance is very poor, just like on GeForce Ampere. FP32 would be significantly (~64x) faster, but is likely insufficient in terms of accuracy here. For efficient FP64 you would need an A100 or MI100.
Instead of 1.0/sqrt, use rsqrt. This is hardware supported and almost as fast as a multiplication.
Make sure to use either FP32 float (1.0f) or FP64 double (1.0) literals consistently. Using double literals with float triggers double arithmetic and casting of the result back to float which is much slower.

EDIT: To help you out with the error message: Most probably the error at clCreateKernel (what value does status have after calling clCreateKernel?) hints that program is invalid. This might be because you give clBuildProgram a vector of 2 devices, but set the number of devices to only 1 and also have context only for 1 device. Try

status = clBuildProgram(program, 1, &devices[current_device], NULL, NULL, NULL);

with only a single device.

To go multi-GPU, create two threads on the CPU that run NBody::setupCL() independently for GPUs 0 and 1, and then do synchronization manually.

EDIT 2: I see nowhere that you create context. Without a valid context, program will be invalid, so clBuildProgram will throw error -44. Call

context = clCreateContext(0, 1, &devices[current_device], NULL, NULL, NULL);

before you do anything with context.

N-body OpenCL code : error CL_​OUT_​OF_​HOST_​MEMORY with GPU card NVIDIA A6000