I'm trying to benchmark the computation of f(x) while varying the number of threads on every iteration.
f(x) = c * ln(x) * cos(x)
n=10000000
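For reference, a minimal version of f (with c treated as an arbitrary placeholder constant, and math.h's log() used as ln) would be:

#include <math.h>

static const double c = 2.0;                 /* c: arbitrary placeholder constant */

double f( double x )
{
    return c * log( x ) * cos( x );          /* ln(x) is log() in <math.h>        */
}

f1, the faster polynomial approximation mentioned in the comments below, is not shown here.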
for (int pp = 2; pp < 17; pp++)
{
    p = pp;
    int chunk = n / p;                       // integer division acts like floor
    omp_set_num_threads(p);
    double start_parallel = omp_get_wtime();

    //start parallel
    #pragma omp parallel shared(tt, chunk) private(i)
    {
        //printf("thread number %d\n", omp_get_thread_num());
        #pragma omp for schedule(dynamic, chunk) nowait
        for (i = 0; i < n; i++)
        {
            //tt[i] = f(tt[i]);
            tt[i] = f1(tt[i]);               // the speedup is much higher with f1, since the log and cos
                                             // computations are approximated by polynomials; see the function.
        }
    } //end parallel

    double end_parallel = omp_get_wtime();
    double cpu_time_used_parallel = end_parallel - start_parallel;
    printf("parallel: for n=%d, p=%d, time taken=%f, speedup=%f\n",
           n, p, cpu_time_used_parallel,
           cpu_time_used_seq / cpu_time_used_parallel);
}
Result:
Started varying threads:
parallel: for n=10000000, p=2, time taken=0.153774, speedup=3.503831
parallel: for n=10000000, p=3, time taken=0.064447, speedup=8.360370
parallel: for n=10000000, p=4, time taken=0.044694, speedup=12.055239
parallel: for n=10000000, p=5, time taken=0.048700, speedup=11.063550
parallel: for n=10000000, p=6, time taken=0.039009, speedup=13.811989
parallel: for n=10000000, p=7, time taken=0.041735, speedup=12.910017
parallel: for n=10000000, p=8, time taken=0.041268, speedup=13.055919
parallel: for n=10000000, p=9, time taken=0.039032, speedup=13.804157
parallel: for n=10000000, p=10, time taken=0.038970, speedup=13.825767
parallel: for n=10000000, p=11, time taken=0.039843, speedup=13.522884
parallel: for n=10000000, p=12, time taken=0.041356, speedup=13.028237
parallel: for n=10000000, p=13, time taken=0.041039, speedup=13.128763
parallel: for n=10000000, p=14, time taken=0.047433, speedup=11.359218
parallel: for n=10000000, p=15, time taken=0.048430, speedup=11.125202
parallel: for n=10000000, p=16, time taken=0.051950, speedup=10.371477
Note: the speedup here is computed against the sequential run (threads = 1).
The speedup does not seem to be much affected by the variation of p (the number of threads).
Am I doing this right, or does the cause lie in the fact that increasing the number of threads is simply not effective here (i.e., theoretically speaking, changing p won't seriously affect O(myprogram))?
Q : Am I doing this right...?
A : No, sorry, you are not doing this right. Let's analyse the reasons and sketch some hints for HPC-grade performance benchmarking together:
Well, you already know the initial benchmark design was not well crafted. The effect of the add-on overhead costs, demonstrated in the contemporary criticism of the original, overhead-naive Amdahl's law, grows ever more important as the instantiation costs become smaller and smaller relative to the other classes of processing- and I/O-related costs, and the timing details below show this very well.
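To make this concrete, one common way to write it (my notation, not the original article's) is to extend the classical speedup with the add-on terms that every real run pays:

    S_{naive}(N)    = \frac{1}{ s + \frac{1 - s}{N} }
    S_{overhead}(N) = \frac{1}{ s + \frac{1 - s}{N} + o_{setup}(N) + o_{join}(N) }

where s is the serial fraction, N the number of threads, and o_setup(N), o_join(N) the fork/join, scheduling and other add-on costs expressed as fractions of the single-thread run time. For small amounts of useful work per thread, the o-terms easily dominate the 1/N term and the measured speedup saturates, no matter how large N grows.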
The code has immense add-on overhead costs compared to the "computation" part. This is the strongest marker of a flawed use of the technology capabilities of an otherwise legal syntax-constructor (OpenMP here, map-reduce somewhere else, list-comprehension in other cases, or some other syntax-sugar trick elsewhere).
Ultimate performance is an Art of Balancing (right-balancing the costs and the benefits: any sort of imbalance means a lost performance edge).
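To see how large these add-on costs actually are on a given machine, one can time an otherwise empty parallel region (a minimal sketch of mine, not the question's code, assuming nothing beyond <omp.h> and <stdio.h>):

#include <omp.h>
#include <stdio.h>

int main( void )
{
    const int reps = 1000;                      /* average over many fork/joins   */
    for ( int p = 2; p < 17; p++ )
    {
        omp_set_num_threads( p );
        double t0 = omp_get_wtime();
        for ( int r = 0; r < reps; r++ )
        {
            #pragma omp parallel
            {
                /* intentionally empty: only the fork/join add-on cost is paid */
            }
        }
        double t1 = omp_get_wtime();
        printf( "p=%2d : avg parallel-region fork/join overhead ~ %.9f [s]\n",
                p, ( t1 - t0 ) / reps );
    }
    return 0;
}

Comparing that figure against the measured "computation" time tells you directly how much of each run is pure management cost rather than useful work.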
The second sin was to ignore the "behind-the-scene" [TIME]-domain penalty of [SPACE]-domain scaling. The larger the n, the larger the memory that has to be allocated and the more penalties come in from (horrifying) cache-line inefficiencies, all of that with zero protection against falling into the hell of memory swapping:
n=1E3 1E4 1E5 1E6 1E7 1E8 1E9 :
______________________________________________________________________________:_____
1.000 1.000 1.000 1.000 1.000 1.000 1.000 : p= 1
0.930 1.403 0.902 1.536 1.492 1.517 0.356 : p= 2
1.075 2.319 2.207 1.937 2.001 1.991 1.489++ : p= 3
1.497+++++ 2.636++++ 1.563 1.657 2.571 2.144 0.687 : p= 4
1.226++ 2.548+++ 0.957 2.025 2.357 1.731 1.569++++ : p= 5
1.255+++ 1.805 2.704 2.020 2.348 1.502 0.989 : p= 6
0.957 0.581 3.104++ 2.124 2.486 2.002 0.838 : p= 7
1.151 1.376 2.449 2.154 2.573 1.536 0.776 : p= 8
1.135 1.685 2.388 2.506+++ 2.852++++ 2.311 1.676+++++ : p= 9
1.285++++ 2.492++ 2.497 2.568++++ 2.647+ 2.467 1.413+ : p=10
1.177 2.314+ 2.709+ 2.174 2.688+++ 2.634++++ 0.606 : p=11
1.216+ 2.293 2.442 2.287 2.550 2.551++ 1.256 : p=12
1.034 2.148 1.802 2.361++ 2.635 2.554+++ 1.181 : p=13
0.999 0.440 3.672+++++ 2.774+++++ 2.927+++++ 2.839+++++ 1.496+++ : p=14
1.091 1.217 3.285++++ 2.284 2.525 2.356 1.005 : p=15
0.937 2.850+++++ 3.185+++ 2.334+ 2.655++ 2.508+ 0.889 : p=16
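To put that [SPACE]-domain penalty into rough numbers (assuming tt[] holds 8-byte doubles; the cache and RAM sizes are typical figures, not measured on the benchmark host):

    n = 10^{7}: \quad 10^{7} \times 8\,\text{B} = 80\,\text{MB} \;\gg\; \text{a typical L3 cache of a few tens of MB}
    n = 10^{9}: \quad 10^{9} \times 8\,\text{B} = 8\,\text{GB}  \;\approx\; \text{the whole RAM of a smaller host, i.e. swapping territory}

So, from n ~ 1E7 upwards the loop is dominated by MEM-I/O rather than by the f1() arithmetic, which is exactly what the rightmost columns of the table above show.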
If benchmarking the computation, benchmark the computation and avoid the MEM-I/O (see the sketch after this list).
If benchmarking, always check the whole landscape, so as to get a feel for the in-cache side-effects (better: learn how to avoid / evict any such effects from skewing the measured results).
Always avoid any form of sharing wherever it may be avoided: map the processing over disjunct areas of tt[], yet covering the "whole" tt[], in cache-line-coherent blocks, so as to avoid false-sharing and to "at least" re-use any data from the already fetched data-blocks, for which you have already paid the MEM-I/O fetching costs (if the vector-based LOAD/STORE is indeed a must - see above).
Permit and actively harness any HPC-vectorisation tricks available with FMA4 / SIMD / AVX512.
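A minimal sketch of these points (my illustration, not a drop-in replacement for the original code; the synthetic input values and c = 2.0 are assumptions): a scalar reduction avoids the tt[] store traffic, schedule(static) gives each thread one contiguous, disjoint, cache-line-coherent block of indices, and the simd clause invites the compiler to use the FMA / AVX units:

#include <math.h>
#include <omp.h>
#include <stdio.h>

#define N 10000000

int main( void )
{
    double sum = 0.0;                            /* reduction target: no array stores       */
    double t0  = omp_get_wtime();

    /* static schedule -> disjoint, contiguous index blocks per thread (no false-sharing),
       simd            -> vectorised inner loop, where the hardware permits                 */
    #pragma omp parallel for simd schedule(static) reduction(+:sum)
    for ( long i = 1; i <= N; i++ )
    {
        double x = (double) i;                   /* synthetic input, an assumption           */
        sum += 2.0 * log( x ) * cos( x );        /* the f(x) kernel, c assumed to be 2.0     */
    }

    double t1 = omp_get_wtime();
    printf( "compute-only: %f [s], checksum = %f\n", t1 - t0, sum );
    return 0;
}

In a real benchmark one would also repeat the measurement several times and discard the first (cache- and thread-pool-warming) run, so that the in-cache side-effects mentioned above do not skew the very first timing.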