[SOLVED] Why do I get "Present table dump for device[1]: NVIDIA Tesla GPU 0" in this code?

Here is the header file DataHolder.h:

#ifndef DATAHOLDER_H
#define DATAHOLDER_H
using FloatingType=float;
int LIFE=0;
const int GL=2000000;
template <typename Floating> class DataHolder {
public:
  Floating particles[GL];
public:
  DataHolder(){}
 ~DataHolder(){}
 void Propagate();
 void InitParticle();
 };
template <typename Floating> void DataHolder<Floating>::Propagate()
{
#pragma acc parallel loop copy(LIFE) present(particles)
  for(int i=0; i<LIFE; ++i) Floating r0= particles[i];
}
template <typename Floating> void DataHolder<Floating>::InitParticle()
{
#pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL]) copy(LIFE)
{
  particles[LIFE]=0.0f;
#pragma acc atomic update
  ++LIFE;
}
}
#endif//DATAHOLDER_H

I use it in the file main.cpp:

#include <iostream>
#include "DataHolder.h"
#include <accelmath.h>
#include <openacc.h>
#include <cuda.h>
#include <cuda_runtime.h>
int main(int argc, char **argv)
{
  DataHolder<FloatingType> * d;
  cudaMalloc((void**) & d, sizeof(DataHolder<FloatingType>));
  std::cout<<"sizeof(DataHolder<FloatingType>)=" 
  <<sizeof(DataHolder<FloatingType>)/1024/1024<<" MB"<<std::endl;  
  LIFE=0;
  int step=0;
  d->InitParticle();
  cudaFree(d);
}

The program compiles, but fails with:

sizeof(DataHolder)=7 MB
hostptr=0x501520000,stride=1,size=2000000,eltsize=4,flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0, threadid=1
host:0x604b60 device:0x501ce0000 size:4 presentcount:1+0 line:26 name:LIFE
allocated block device:0x501ce0000 size:512 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=(null) host:0x501520000
 file:/home/70-gaa/NFbuild_script_CHECK_GPU/ERROR/T3DataHolder.h _ZN10DataHolderIfE12InitParticleEv line:26

Why? What is wrong?

I compile the code to run on a GeForce GTX 650 Ti GPU with this command line:

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -mcmodel=medium -ta=tesla:cc30,managed -fast -Mcuda=cuda10.1 --c++11"

I use the PGI 19.4 C++ compiler, gcc 5.3.1, Fedora 23 x86_64, CUDA 10.1, and CUDA driver version 418.67.

Solution

The "present" clause checks whether a variable is present on the device, but it can only check variables managed by the OpenACC runtime. Here, you allocate the data with cudaMalloc, which the runtime does not manage. In such cases, replace "present" with "deviceptr" to tell the runtime that it is a CUDA device pointer. You will also need to add the "this" pointer, since it is a device pointer as well.
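
As a rough illustration of the "deviceptr" clause, here is a minimal sketch. ZeroParticles is a hypothetical free function, not part of the original code. It takes the raw pointer as an argument, which sidesteps the "this" pointer question, and it assumes that in an OpenACC build the caller passes memory obtained from cudaMalloc (or acc_malloc).

```cpp
// Hypothetical helper, not from the original post.
// In an OpenACC build, `particles` must already be a device address
// (for example from cudaMalloc); deviceptr tells the runtime to use it
// as-is instead of looking it up in the present table.
void ZeroParticles(float* particles, int n)
{
#pragma acc parallel loop deviceptr(particles)
  for (int i = 0; i < n; ++i)
    particles[i] = 0.0f;
}
```

When the file is compiled without OpenACC, the pragma is ignored and the loop runs on the host, so an ordinary host array can be passed for a quick check of the logic.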

However, your code will then segfault on the host, because you dereference "d", which is not a valid host pointer.

The simplest solution here is to not use cudaMalloc and instead allocate "d" with "new". Since you are using CUDA Unified Memory, the data movement of "d" will be handled by the CUDA driver.
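
A minimal sketch of the "new"-based variant follows. Holder, N, life and RunOnce are hypothetical, trimmed-down stand-ins for DataHolder, GL, LIFE and main from the question. The clauses are kept as in the original code, on the assumption that the build uses -ta=tesla:managed, so that memory from "new" is managed and, as described above, the driver handles its movement.

```cpp
// Hypothetical, trimmed-down stand-ins for the names in the question.
constexpr int N = 1000;  // plays the role of GL, kept small here
int life = 0;            // plays the role of LIFE

struct Holder {
  float particles[N];
  void InitParticle();
};

void Holder::InitParticle()
{
  // Same clauses as in the question; assumes a managed-memory build.
#pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:N]) copy(life)
  {
    particles[life] = 0.0f;
#pragma acc atomic update
    ++life;
  }
}

// Allocates the object with new, so `d` is a valid host pointer and
// d->InitParticle() can be called from host code.
int RunOnce()
{
  life = 0;
  Holder* d = new Holder;
  d->InitParticle();
  int count = life;
  delete d;
  return count;
}
```

Calling RunOnce() from main takes the place of the cudaMalloc/cudaFree pair in the original main.cpp. Without OpenACC enabled, the pragmas are ignored and the same code runs entirely on the host.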