I'm using OpenMP target offloading to offload some nested loops to the GPU. I'm using the nowait clause to run it asynchronously, which makes it a task. With the same input values, the result differs from the one obtained without offloading (e.g. CPU: sum=0.99, offloading: sum=0.5).
When I remove the nowait clause, it works just fine. So I think the issue is that it becomes an OpenMP task, and I'm struggling to get it right.
    #pragma omp target teams distribute parallel for reduction( +: sum) collapse(2) nowait depend(in: a, b) depend(out: sum)
    for (int i = 1; i <= n; i++)
    {
        for (int j = 1; j <= n; j++)
        {
            double c = 0;
            for (int k = 0; k < n; k++)
            {
                c += /* some computation */
            }
            sum += fabs(c);
        }
    }
The OpenMP 5.2 specification states:
The target construct generates a target task. The generated task region encloses the target region. If a depend clause is present, it is associated with the target task. [...]. If the nowait clause is present, execution of the target task may be deferred. If the nowait clause is not present, the target task is an included task.
This means that, with nowait, your code is executed in a task whose execution may be deferred. In the worst case it can run as late as the end of the parallel region, but it always completes before any task that depends on it starts, and before any taskwait directive waiting for the target task (or a construct with similar behaviour, such as taskgroup) returns. Because of that, you must not modify the working arrays (or release them), and you must not read sum, during this time span. If you do, the behaviour is undefined.
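As a minimal sketch of that rule, the following shows a deferred target region whose result is read only after a taskwait. The function name offload_abs_sum, the row-major layout of a and b, the matrix-product body, and the explicit map clauses are assumptions made for illustration; they are not taken from the question's code.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: sum of |(a * b)[i][j]| for two n x n row-major matrices. */
double offload_abs_sum(const double *a, const double *b, int n)
{
    assert(a && b && n > 0);
    double sum = 0.0;

    /* Deferred target task: it may start at any point before the taskwait below. */
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(+: sum) map(to: a[0:n*n], b[0:n*n]) map(tofrom: sum) \
        nowait depend(out: sum)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double c = 0.0;
            for (int k = 0; k < n; k++) {
                c += a[i * n + k] * b[k * n + j];
            }
            sum += fabs(c);
        }
    }

    /* Independent host work may go here, but it must not touch a, b or sum. */

    /* Wait for the target task; only now is sum safe to read. */
    #pragma omp taskwait
    return sum;
}
```

Reading sum, or freeing a or b, before the taskwait would race with the deferred target task. That kind of race could plausibly produce a wrong sum like the one described in the question.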
You should pay particular attention to the correctness of the synchronization points and task dependencies in your code (we cannot check that from the incomplete code provided).
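An alternative to an explicit taskwait is to let a dependent task consume the result. The sketch below is one possible arrangement, not the asker's actual code: the function mean_abs and its one-dimensional loop are hypothetical. The host task with depend(in: sum) is ordered after the target task with depend(out: sum), and the implicit barrier at the end of the parallel region ensures both have finished before the function returns.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: mean of |x[i]|, with the sum computed in a target task. */
double mean_abs(const double *x, int n)
{
    assert(x && n > 0);
    double sum = 0.0;
    double mean = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* Producer: deferred target task that writes sum. */
        #pragma omp target teams distribute parallel for reduction(+: sum) \
            map(to: x[0:n]) map(tofrom: sum) nowait depend(out: sum)
        for (int i = 0; i < n; i++) {
            sum += fabs(x[i]);
        }

        /* Consumer: host task that cannot start before the producer completes. */
        #pragma omp task depend(in: sum) shared(sum, mean)
        mean = sum / n;
    }
    /* The implicit barrier of the parallel region has completed all tasks. */
    return mean;
}
```

Both tasks are siblings generated by the same thread inside the single region, which is required for the depend clauses to order them.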