I was using OpenMP with target offload and found that my application segfaults when OpenMP is limited to a single thread. I was able to boil it down to the following snippet:
#include <omp.h>

int main() {
    int res = 0;
    #pragma omp parallel num_threads(1)
    {
        #pragma omp single
        {
            #pragma omp taskgroup task_reduction(+:res)
            {
                #pragma omp target in_reduction(+:res) nowait
                {
                    res++;
                }
            }
        }
    }
}
Compiled with
clang++ -fopenmp -fopenmp-targets=nvptx64 --offload-arch=sm_61 -O0 main.cpp
using clang 17.0.0 and CUDA 12.1, and run on an Ubuntu 22.04 machine with a 12700K/1080 Ti.
This segfaults when num_threads is set to 1, but works fine with more than one thread (e.g. num_threads(2)) or when nowait is not specified on the target task, so that it synchronizes at the end of the target region.
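For comparison, the working variant mentioned above (the same nesting with nowait removed) can be sketched as follows. The function name count_on_target_sync is made up for this illustration and is not part of my application. If no offload device is available, the target region simply runs on the host.

```cpp
// Sketch of the variant without nowait. count_on_target_sync is a
// hypothetical name used only for this illustration.
int count_on_target_sync() {
    int res = 0;
    #pragma omp parallel num_threads(1)
    {
        #pragma omp single
        {
            #pragma omp taskgroup task_reduction(+:res)
            {
                // No nowait here: the encountering thread waits for the
                // target region to finish before it leaves the construct.
                #pragma omp target in_reduction(+:res)
                {
                    res++;
                }
            }
        }
    }
    // The taskgroup has completed, so the reduction result is in res.
    return res;
}
```

Calling this from main and printing the return value should give 1 when built with the same command line as the snippet above.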
From my understanding, this should work just fine even with a single thread.
According to mailing list posts and GitHub issues, as of this writing, support for in_reduction in clang is incomplete or missing.