I am trying to compile a simple test problem using OpenMP offloading for an Nvidia GPU. I am using gcc with the nvptx-none target. I have installed the gcc+nvptx package with spack (or compiled gcc-13 with nvptx-tools myself, the results are the same). During linking, I get the error:
unresolved symbol _fputwc_r
collect2: error: ld returned 1 exit status
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /path/to/spack/opt/spack/linux-centos8-x86_64_v3/gcc-13.0.0/gcc-12.2.0-6olbpwbs53cquwnpsvrmuxprmaofwjtk/libexec/gcc/x86_64-pc-linux-gnu/12.2.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
Compiling with -fno-stack-protector, as recommended e.g. here or here, does not alleviate the problem. -fno-lto does, but then the offloading doesn't work. Different optimization flags make no difference.
The ld that is used seems to be the system installation. The spack installation provides another ld in spack/linux-centos8-x86_64_v3/gcc-13.0.0/gcc-12.2.0-6olbpwbs53cquwnpsvrmuxprmaofwjtk/nvptx-none, but spack doesn't normally add this to the PATH. I guess with good reason, because including it leads to
as: unrecognized option '--64'
nvptx-as: missing .version directive at start of file '/tmp/cc9YfveM.s'
Is this a problem with the linker, or something else? The problem only occurs when actually including a parallel for loop; just setting #pragma omp target does not trigger it. The device is recognized, and code inside this pragma runs on the device according to OpenMP, as long as no parallel region is present, which would produce the above error.
Additional information:
The system is Rocky Linux release 8.7 (Green Obsidian)
The test program I am executing is based on the OpenMP test programs. Its full code is:
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void saxpy(float a, float* x, float* y, int sz) {
    #pragma omp target teams distribute parallel for simd \
        num_teams(3) map(to:x[0:sz]) map(tofrom:y[0:sz])
    for (int i = 0; i < sz; i++) {
        if (omp_is_initial_device()) {
            printf("Running on host\n");
        } else {
            int nthreads = omp_get_num_threads();
            int nteams = omp_get_num_teams();
            printf("Running on device with %d teams (fixed) in total and %d threads in each team\n", nteams, nthreads);
        }
        fprintf(stdout, "Thread %d %i\n", omp_get_thread_num(), i);
        y[i] = a * x[i] + y[i];
    }
}

int main(int argc, char** argv) {
    float a = 2.0;
    int sz = 16;
    float *x = calloc(sz, sizeof *x);
    float *y = calloc(sz, sizeof *y);
    //Set values
    int num_devices = omp_get_num_devices();
    printf("Number of available devices %d\n", num_devices);
    saxpy(a, x, y, sz);
    return 0;
}
I try to compile it with
gcc -O0 -fopenmp -foffload=nvptx-none -o mintest mintest.c
or with the flags mentioned above.
I guess the issue is that GCC cannot deal with the printf within the code region that is running on the GPU. GPUs are typically not good at any form of I/O, so you should avoid calling things like printf, read, write, etc. from within an offloaded code region.
If you want to detect if the code was running on the GPU device or the host, then you can use a pattern like this:
void test_on_gpu(void) {
    int on_device = 0;
    #pragma omp target teams map(from:on_device)
    {
        #pragma omp parallel
        {
            #pragma omp master
            {
                if (0 == omp_get_team_num()) {
                    on_device = !omp_is_initial_device();
                }
            }
        }
    }
    printf("on GPU: %s\n", on_device ? "yes" : "no");
}
What the code does is:
- offload execution to the device (target)
- run the check on a single thread (master) in the first OpenMP team and the parallel region there
- transfer the result back to the host (map(from:on_device))