I'm trying to understand ccNUMA systems, but I'm a little confused about how OpenMP's scheduling can hurt performance. Say we have the code below. What happens if c1 is smaller or bigger than c0? I understand the general idea that different chunk sizes lead to remote accesses, but I read somewhere that for small chunk sizes something happens with cache lines, and I got really confused.
#pragma omp parallel for schedule(static, c0)
for (int i = 0; i < N; i++)
    A[i] = 0;

#pragma omp parallel for schedule(static, c1)
for (int i = 0; i < N; i++)
    B[i] = A[i] * i;
When A[] has been allocated using malloc, the OS has only promised that you will get the memory the pointer points to. No physical memory has been allocated yet, that is, the physical memory pages have not been assigned. That happens when you execute the first parallel region and touch the data for the first time (see also "first-touch policy"): on the first access, the OS allocates the physical page on the same NUMA domain as the thread performing the touch.
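To illustrate the consequence, here is a common sketch (not code from the question; the pinning via OMP_PROC_BIND is an assumption needed for placement to stay stable). If a single thread touches the whole array first, every page lands on that thread's node; the parallel initialization distributes the pages instead:

/* Bad on ccNUMA: malloc only reserves virtual memory, and the serial
   memset performs the first touch, so ALL pages of A end up on the
   master thread's NUMA node; threads on other nodes read remotely. */
double *A = malloc(N * sizeof *A);
memset(A, 0, N * sizeof *A);

/* Better: let the parallel loop perform the first touch, so each
   thread's chunks are placed on its own NUMA node (assuming the
   threads are pinned, e.g. with OMP_PROC_BIND=true). */
double *A2 = malloc(N * sizeof *A2);
#pragma omp parallel for schedule(static, c0)
for (long i = 0; i < N; i++)
    A2[i] = 0.0;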
So, depending on how you choose c0, you get a certain distribution of the memory pages across the system. With a bit of math involved you can actually determine which value of c0 will lead to which distribution of the memory pages.
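To make that math concrete, here is a small sketch. All numbers are illustrative assumptions, not from the question: 4 KiB pages, 8-byte array elements, 4 threads, c0 = 100. It prints, for each page, which thread's chunk starts the page, and whether the page is covered by chunks of several threads, in which case its placement depends on which thread happens to touch it first:

#include <stdio.h>

/* With schedule(static, c0) and T threads, chunks are handed out
   round-robin: element i lies in chunk i / c0, which is executed by
   thread (i / c0) % T. */
static int owner(long i, long c0, int T) { return (int)((i / c0) % T); }

int main(void) {
    const long ELEMS_PER_PAGE = 4096 / sizeof(double); /* 512 doubles per 4 KiB page */
    const long c0 = 100, N = 4096;
    const int T = 4;

    for (long p = 0; p * ELEMS_PER_PAGE < N; p++) {
        long lo = p * ELEMS_PER_PAGE;
        long hi = lo + ELEMS_PER_PAGE < N ? lo + ELEMS_PER_PAGE : N;
        int single = 1;
        for (long i = lo + 1; i < hi; i++)
            if (owner(i, c0, T) != owner(lo, c0, T)) { single = 0; break; }
        printf("page %ld: chunk at page start owned by thread %d%s\n",
               p, owner(lo, c0, T),
               single ? "" : " (touched by several threads: placement depends on timing)");
    }
    return 0;
}

Note that a chunk size that is a multiple of the page size (in elements) gives every page exactly one owning thread, which is the clean case.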
In the second loop, you're using a c1 that is potentially different from c0. For certain values of c1 (especially c1 equal to c0) you should see almost no NUMA traffic on the system, while for others you'll see a lot. Again, it's simple to determine those values mathematically.
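For instance, a sketch of the matching case. The chunk of 4096 doubles (a multiple of the 4 KiB page size) and the thread pinning via OMP_PROC_BIND=true are illustrative assumptions, not from the question:

#include <stdlib.h>

#define N  (1L << 24)
#define C0 4096L   /* illustrative: 4096 doubles = eight 4 KiB pages per chunk */

int main(void) {
    double *A = malloc(N * sizeof *A);
    double *B = malloc(N * sizeof *B);

    /* First touch: each page of A lands on the NUMA node of the
       thread whose chunk initializes it. */
    #pragma omp parallel for schedule(static, C0)
    for (long i = 0; i < N; i++)
        A[i] = 0.0;

    /* Same schedule kind, chunk size, and iteration count: iteration i
       runs on the same thread as in the first loop, so A[i] is a local
       read, and B's pages are first-touched locally as well. */
    #pragma omp parallel for schedule(static, C0)
    for (long i = 0; i < N; i++)
        B[i] = A[i] * i;

    free(A);
    free(B);
    return 0;
}

With c1 != c0 (say, half of it), some iterations of the second loop run on a different thread than the one that first-touched the corresponding pages of A, and those reads cross NUMA domains.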
Another thing you might be facing is false sharing. If c0 and c1 are chosen such that the data processed by a chunk is smaller than a cache line, you'll see that the same cache line is shared across multiple threads and thus bounces between the different caches of the system.
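Concretely, assuming 64-byte cache lines and 8-byte doubles (illustrative values; check your hardware), one line holds 8 elements, so any chunk size below 8 forces adjacent threads to write into the same line:

/* With schedule(static, 1), threads 0..7 write B[0]..B[7], which all
   live in the same 64-byte cache line, so the line ping-pongs between
   the cores' caches on every write. */
#pragma omp parallel for schedule(static, 1)   /* chunk < 8 doubles: false sharing */
for (int i = 0; i < N; i++)
    B[i] = A[i] * i;

/* Choosing the chunk as a multiple of the line size (8 doubles here,
   and ideally a multiple of the page size for NUMA placement) gives
   each thread whole cache lines and avoids the bouncing: */
#pragma omp parallel for schedule(static, 8)   /* one full line per chunk */
for (int i = 0; i < N; i++)
    B[i] = A[i] * i;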