My system: Linux, 12 cores, with cores 2-11 isolated. Cores 0 and 1 are almost 100% busy with some other program; all the remaining cores are idle.
export GOMP_CPU_AFFINITY=2,3,4
export OMP_NUM_THREADS=3
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp
The output is:
Performance counter stats for './test_openmp':
47,654.74 msec task-clock:u # 2.981 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
115,358 page-faults:u # 2.421 K/sec
159,245,881,934 cycles:u # 3.342 GHz
250,009,309,156 instructions:u # 1.57 insn per cycle
20,002,132,172 branches:u # 419.730 M/sec
117,268 branch-misses:u # 0.00% of all branches
110,002,614,320 L1-dcache-loads:u # 2.308 G/sec
10,796,435,741 L1-dcache-load-misses:u # 9.81% of all L1-dcache accesses
0 LLC-loads:u # 0.000 /sec
0 LLC-load-misses:u # 0.00% of all LL-cache accesses
15.986638336 seconds time elapsed
47.175831000 seconds user
0.414928000 seconds sys
export GOMP_CPU_AFFINITY=1,2,3,4
export OMP_NUM_THREADS=4
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp
The output is:
pid: 4118342
Performance counter stats for './test_openmp':
48,241.03 msec task-clock:u # 1.072 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
119,879 page-faults:u # 2.485 K/sec
161,605,704,451 cycles:u # 3.350 GHz
250,011,376,400 instructions:u # 1.55 insn per cycle
20,002,726,448 branches:u # 414.641 M/sec
118,657 branch-misses:u # 0.00% of all branches
110,002,938,510 L1-dcache-loads:u # 2.280 G/sec
10,796,444,713 L1-dcache-load-misses:u # 9.81% of all L1-dcache accesses
0 LLC-loads:u # 0.000 /sec
0 LLC-load-misses:u # 0.00% of all LL-cache accesses
45.012033357 seconds time elapsed
47.764469000 seconds user
0.399934000 seconds sys
My question is: the second time I assigned one more core (core 1) to the program, so why is the running time much longer (15.98 s vs 45.01 s) and the CPU utilization much lower (2.98 vs 1.07)?
Here is the test code I ran.
#include <iostream>
#include <cstdint>
#include <unistd.h>

constexpr int64_t N = 100000;
int m = N;
int n = N;

int main() {
    double* a = new double[N];
    double* c = new double[N];
    double* b = new double[N * N];

    std::cout << "pid: " << getpid() << std::endl;

    // a[i] = sum over j of B(i,j) * c[j], with B stored column-major,
    // so the inner loop reads b with a stride of N.
    #pragma omp parallel for default(none) shared(m, n, a, b, c)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += b[i + j*N] * c[j];
        a[i] = sum;
    }
    return 0;
}
When you don't specify a schedule for the worksharing loop, the schedule is implementation defined. Most implementations pick the static schedule because it has the lowest runtime overhead for most workloads. The static schedule distributes the same number of iterations to each thread.
In your case that is exactly the problem: with GOMP_CPU_AFFINITY=1,2,3,4 one thread is pinned to core 1, which is already almost fully loaded by the other program, so it only gets a fraction of that core. Under a static schedule its equal-sized chunk of iterations becomes the bottleneck while the other threads sit idle at the loop's implicit barrier (roughly: ~48 s of total CPU work split four ways is ~12 s per thread, and stretching the straggler's 12 s over a shared core accounts for the ~45 s of wall time). You specifically want to allow OpenMP to distribute the work differently across the threads, so try adding schedule(dynamic) to the parallel for directive.
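For example, a sketch of your loop with an explicit dynamic schedule (the chunk size of 64 is only a starting point to keep scheduling overhead down, not a tuned value):

    #pragma omp parallel for default(none) shared(m, n, a, b, c) schedule(dynamic, 64)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += b[i + j*N] * c[j];
        a[i] = sum;
    }

With this, the thread stuck on the contended core simply takes fewer chunks, and the threads on idle cores pick up the rest instead of waiting at the barrier.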
You can also select schedule(runtime) and control the schedule through the OMP_SCHEDULE environment variable for each execution, without recompiling.
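For example (same loop with the schedule deferred to run time; the OMP_SCHEDULE values below are just illustrations):

    #pragma omp parallel for default(none) shared(m, n, a, b, c) schedule(runtime)
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += b[i + j*N] * c[j];
        a[i] = sum;
    }

Then, for each run:

export OMP_SCHEDULE="dynamic,64"
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp

export OMP_SCHEDULE="guided"
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp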