I have used OpenMP in a number of my past projects. Separately, I have also written CUDA and OpenCL GPU code for several open-source projects.
I've heard a lot about OpenMP 5 adding support for offloading to NVIDIA/AMD GPUs, and started some experiments porting my CUDA/OpenCL code to OpenMP 5, but got stuck when offloading the code to an NVIDIA GPU. I think I might have missed some basics, hence I am seeking some pointers here.
I decided to use C++ for this project. My full source code (~400 lines of C++) can be accessed at https://github.com/fangq/umcx
To compile/test the code, one should run
git clone https://github.com/fangq/umcx.git
cd umcx/src
make # build multi-threading without gpu offloading
./umcx cube60 # running the benchmark
The above multi-threaded version builds and runs properly with g++ 11 or newer. The code also works on NVIDIA GPUs when built with nvc using
make clean
make nvc
./umcx cube60
However, I ran into many problems when trying to build it using g++-12/13 for nvptx-none. For example, if I run
make clean
make nvidia CXX=g++-12
[Update Dec 31, 2024] The missing sinf math function error was fixed by adding -foffload="-lm", as suggested by Mat Colgrove in the comment below.
Now g++-12 is able to produce a binary; however, running it produces the memory error below.
$ make nvidia CXX=g++-12
g++-12 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-12 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
/usr/bin/ld: /tmp/cc2w2usX.crtoffloadtable.o: warning: relocation against `__offload_funcs_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
$ ./umcx cube60
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
libgomp: cuMemFree_v2 error: an illegal memory access was encountered
libgomp: device finalization failed
Running compute-sanitizer ./umcx cube60 reports many invalid memory read errors at line 185 of the code:
========= Invalid __global__ read of size 4 bytes
========= at main$_omp_fn$1+0x14f0 in umcx.cpp:185
========= by thread (0,4,0) in block (25,0,0)
========= Address 0x7f7f651712d0 is out of bounds
========= and is 522,050 bytes after the nearest allocation at 0x7f7f64c00000 of size 5,184,399 bytes
Interestingly, g++-12 was also able to build an amdgcn-offloaded binary with the same linker warning as above, but running it fails:
$ make amd CXX=g++-12
g++-12 -g -Wall -Wextra -std=c++14 -O3 -foffload=amdgcn-amdhsa="-march=gfx906" -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-12 -g -Wall -Wextra -std=c++14 -O3 -foffload=amdgcn-amdhsa="-march=gfx906" -foffload="-lm" -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
/usr/bin/ld: /tmp/ccep7zpg.crtoffloadtable.o: warning: relocation against `__offload_funcs_end' in read-only section `.rodata'
/usr/bin/ld: warning: creating DT_TEXTREL in a PIE
$ ./umcx cube60
libgomp: Offload data incompatible with GCN plugin (expected 3, received 2)
libgomp: Cannot map target functions or variables (expected 1, have 4294967295)
I am not able to get g++-13 to work. If I build it with make nvidia CXX=g++-13, I get a different error:
g++-13 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-13 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
x86_64-linux-gnu-accel-nvptx-none-gcc-13: fatal error: cannot read spec file ‘libgomp.spec’: No such file or directory
compilation terminated.
nvptx mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-13 returned 1 exit status
On a different machine with the same Xubuntu 22.04 distro, the same g++-12/13 installed from the PPA, and CUDA 12.6, both compilers gave me a Value 'sm_30' is not defined for option 'gpu-name' error.
g++-13 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -c umcx.cpp -o umcx.o
g++-13 -g -Wall -Wextra -std=c++14 -O3 -foffload=nvptx-none -fcf-protection=none -fno-stack-protector -DGPU_OFFLOAD -fopenmp -o umcx umcx.o
ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'
nvptx-as: ptxas returned 255 exit status
nvptx mkoffload: fatal error: x86_64-linux-gnu-accel-nvptx-none-gcc-13 returned 1 exit status
So my questions are: what is causing these build/runtime errors, and what is the proper way to map a dynamic array stored inside a nested class? OpenMP 5.x provides declare mapper()/map(mapper(id)) for this, but gcc does not yet support mapper. nvc++ successfully builds the following directives for mapping a dynamic array inside a nested class
map(alloc: inputvol.vol) map(to: inputvol.vol[0:inputvol.dimxyzt]) map(alloc: outputvol.vol) map(from: outputvol.vol[0:outputvol.dimxyzt])
and produced correct results on the GPU. Is this supported on gcc?
I want to thank everyone for the helpful comments. I believe I now have answers to most of my earlier questions, and I want to write a short summary here.
To map data stored in embedded pointers inside a struct/class
such as
struct Dataset {
    int len = 0;
    float *data = nullptr;
    Dataset() {}
    Dataset(int len0, float *data0) {...}
} readonlydata, readwritedata;
the following OpenMP map clauses work on most compilers (gcc 11 to 14, clang 16+)
map(to: readonlydata) map(to: readonlydata.data[0:readonlydata.len]) \
map(tofrom: readwritedata) map(tofrom: readwritedata.data[0:readwritedata.len])
The embedded data pointer must be mapped separately in order to pass the pointed-to buffer to the device.
This was largely inspired by the "Deep-Copy" OpenACC example shared by Mat Colgrove:
https://developer.download.nvidia.com/assets/pgi-legacy-support/Deep-Copy-Support-in-OpenACC_PGI.pdf
It appears that OpenMP also supports using runtime variables as the array-section length.
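To make this concrete, below is a minimal, self-contained sketch of how these map clauses attach to a target construct; the run() function, the kernel loop and the main() scaffolding are illustrative placeholders, not code taken from umcx.

#include <vector>

struct Dataset {
    int len = 0;
    float *data = nullptr;
};

void run(Dataset& readonlydata, Dataset& readwritedata) {
    // map each struct first (copies len and the host pointer value), then map the
    // pointed-to buffer as an array section so that the runtime allocates device
    // memory and attaches the device copy of the pointer to it
    #pragma omp target teams distribute parallel for \
        map(to: readonlydata) map(to: readonlydata.data[0:readonlydata.len]) \
        map(tofrom: readwritedata) map(tofrom: readwritedata.data[0:readwritedata.len])
    for (int i = 0; i < readwritedata.len; i++) {
        readwritedata.data[i] += readonlydata.data[i % readonlydata.len];
    }
}

int main() {
    std::vector<float> a(1000, 1.0f), b(1000, 2.0f);
    Dataset ro, rw;
    ro.len = (int)a.size();  ro.data = a.data();
    rw.len = (int)b.size();  rw.data = b.data();
    run(ro, rw);
    return 0;
}

With g++, this builds with -fopenmp -foffload=nvptx-none (plus -foffload="-lm" if math functions are used), the same flags as in the make nvidia target above.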
Based on the OpenMP 5.1 examples, another way to map such nested dynamic data is to use declare mapper(), which applies not to an individual variable but to the struct type (typedef):
typedef struct Dataset dataset;
#pragma omp declare mapper(dataset ds) map(ds, ds.data[0:ds.len])
Unfortunately, it appears that the declare mapper() directive is currently not supported by either gcc or nvc.
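For reference, here is a sketch of how the mapper-based version would look on a compiler that supports it; the scale() kernel is an illustrative placeholder.

struct Dataset {
    int len = 0;
    float *data = nullptr;
};
typedef struct Dataset dataset;

// declare the default mapper for the type: whenever a 'dataset' object is mapped,
// the buffer pointed to by its 'data' member is mapped along with it
#pragma omp declare mapper(dataset ds) map(ds, ds.data[0:ds.len])

void scale(dataset& readwritedata, float factor) {
    // the default mapper is applied implicitly, so no per-construct
    // array-section clause is needed for the embedded pointer
    #pragma omp target teams distribute parallel for map(tofrom: readwritedata)
    for (int i = 0; i < readwritedata.len; i++) {
        readwritedata.data[i] *= factor;
    }
}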
Now, regarding gcc, clang and nvc: the completeness and robustness of their OpenMP GPU offloading support are quite uneven, and all remain buggy to some degree.
Among these three compilers, nvc is the most robust and also delivers the highest GPU speed after offloading; however, it is only available on Linux. gcc/clang can build on Mac/Windows, but both produced slow, unoptimized binaries. gcc-12 is relatively the most stable, but its binary is also quite slow. gcc-11 can build my code, but the binary does not run properly on some GPUs; gcc-13/14 can both build the code, but the binaries won't run. I have found a number of reported regressions related to those error messages.
Some commonly seen gcc issues and error messages when building for nvptx with gcc-11 to 13:
- missing math functions (such as sinf) at link time: fixed by adding -foffload="-lm" in the linking command
- ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name': adding -foffload="-march=sm_35" won't help
- the number of teams does not appear to follow the num_teams() clause or the OMP_NUM_TEAMS setting, and is capped to a small number, which severely underutilizes the GPU hardware; I suspect that it is related to this fix (maybe a regression?): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109875
- libgomp: Offload data incompatible with GCN plugin (expected 3, received 2) / libgomp: Cannot map target functions or variables (expected 1, have 4294967295) errors when running
As of now (Jan 2025), gcc's GPU offloading is still quite buggy and unoptimized. nvc is the quicker way to get the code to build and run.