The examples that I have seen for OpenMP's omp atomic generally involve updating a scalar, and usually report that it is faster than omp critical. In my application I wish to update elements of an allocated array, with some overlap between the elements that different threads will update, and I find that atomic is substantially slower than critical. Does it make a difference that it is an array, and am I using it correctly?
#include <stdlib.h>
#include <assert.h>
#include <omp.h>
#define N_EACH 10000000
#define N_OVERLAP 100000
#if !defined(OMP_CRITICAL) && !defined(OMP_ATOMIC)
#error Must define OMP_CRITICAL or OMP_ATOMIC
#endif
#if defined(OMP_CRITICAL) && defined(OMP_ATOMIC)
#error Must define only one of either OMP_CRITICAL or OMP_ATOMIC
#endif
int main(void) {
  int const n = omp_get_max_threads() * N_EACH -
                (omp_get_max_threads() - 1) * N_OVERLAP;
  int *const a = (int *)calloc(n, sizeof(int));
#pragma omp parallel
  {
    int const thread_idx = omp_get_thread_num();
    int i;
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
    for (i = 0; i < N_EACH; i++) {
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
      a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
    }
  }
  /* Check result is correct */
#ifndef NDEBUG
  {
    int *const b = (int *)calloc(n, sizeof(int));
    int thread_idx;
    int i;
    for (thread_idx = 0; thread_idx < omp_get_max_threads(); thread_idx++) {
      for (i = 0; i < N_EACH; i++) {
        b[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
      }
    }
    for (i = 0; i < n; i++) {
      assert(a[i] == b[i]);
    }
    free(b);
  }
#endif /* NDEBUG */
  free(a);
}
Note that in this simplified example we can determine in advance which elements will overlap, so it would be more efficient to only apply atomic/critical when updating those, but in my real application this is not possible.
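For illustration, here is roughly what that would look like in this simplified example (a sketch only: it treats the first and last threads like the others for brevity, so a few of their non-overlapping updates are still guarded, and base is just a shorthand I introduce here):

int const base = thread_idx * (N_EACH - N_OVERLAP);
for (i = 0; i < N_EACH; i++) {
  /* Only the first and last N_OVERLAP iterations can touch elements
     that a neighbouring thread also writes. */
  if (i < N_OVERLAP || i >= N_EACH - N_OVERLAP) {
#pragma omp atomic update
    a[base + i] += i;
  } else {
    a[base + i] += i; /* element is private to this thread */
  }
}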
When I compile this using:
gcc -O2 atomic_vs_critical.c -DOMP_CRITICAL -DNDEBUG -fopenmp -o critical
gcc -O2 atomic_vs_critical.c -DOMP_ATOMIC -DNDEBUG -fopenmp -o atomic
and run with time ./critical, I get:
real    0m0.110s
user    0m0.086s
sys     0m0.058s
and with time ./atomic, I get:
real    0m0.205s
user    0m0.742s
sys     0m0.032s
So the critical version takes about half the wall-clock time of the atomic one (and I get the same result when I repeat the runs).
There is another post that claims critical is slower than atomic, but that post uses a scalar, and when I run the code it provides, the atomic result is actually slightly faster than the critical one.
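For reference, the scalar version those posts benchmark looks roughly like this (my own minimal reconstruction, not the exact code from that post); note that for a scalar the synchronization necessarily sits inside the loop in both variants:

long sum = 0;
#pragma omp parallel
{
  int i;
  for (i = 0; i < N_EACH; i++) {
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
    sum += i;
  }
}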
Your comparison is not fair: the #pragma omp critical is placed before the for loop, so each thread runs its entire loop while holding the lock, and since the loop body then contains no synchronization at all, the compiler can vectorize it. The #pragma omp atomic update, in contrast, is inside the loop, which turns every single update into an atomic read-modify-write and prevents vectorization. This difference in vectorization causes the surprising runtimes. For a fair comparison, place both inside the loop:
for (i = 0; i < N_EACH; i++) {
#ifdef OMP_CRITICAL
#pragma omp critical
#endif /* OMP_CRITICAL */
#ifdef OMP_ATOMIC
#pragma omp atomic update
#endif /* OMP_ATOMIC */
  a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
}
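You can check the vectorization difference directly with gcc's vectorizer report (assuming a reasonably recent gcc; -fopt-info-vec prints a note for each loop that gets vectorized):

gcc -O2 -fopenmp -fopt-info-vec -DOMP_CRITICAL -DNDEBUG atomic_vs_critical.c -o critical
gcc -O2 -fopenmp -fopt-info-vec -DOMP_ATOMIC -DNDEBUG atomic_vs_critical.c -o atomic

With your original code, the update loop should be reported as vectorized only in the critical build.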
Because of this vectorization issue, your real program will most probably run fastest if you use a single thread only.
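If you do need multiple threads, one alternative worth benchmarking (a suggestion of mine, not something from your post: it requires OpenMP 4.5+ for array-section reductions, and it gives every thread a private copy of the array, so it only pays off if that fits comfortably in memory) is to let the threads accumulate into private copies that OpenMP combines at the end, removing all per-element synchronization:

/* Assumes a is declared as int *a rather than int *const a (reduction
   list items must not be const-qualified). Each thread gets a
   zero-initialized private copy of a[0..n-1]; the copies are summed
   back into a when the region ends. */
#pragma omp parallel reduction(+: a[0:n])
{
  int const thread_idx = omp_get_thread_num();
  int i;
  for (i = 0; i < N_EACH; i++) {
    a[thread_idx * (N_EACH - N_OVERLAP) + i] += i;
  }
}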