c++ linux openmp

How can one shared, busy CPU core affect OpenMP's overall CPU utilization?


My setup: a Linux system with 12 cores, of which cores 2-11 are isolated. Cores 0 and 1 are almost 100% busy with another program. All the other cores are idle.

First test run:

export GOMP_CPU_AFFINITY=2,3,4
export OMP_NUM_THREADS=3
taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp

The output is:

 Performance counter stats for './test_openmp':

         47,654.74 msec task-clock:u              #    2.981 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           115,358      page-faults:u             #    2.421 K/sec                  
   159,245,881,934      cycles:u                  #    3.342 GHz                    
   250,009,309,156      instructions:u            #    1.57  insn per cycle         
    20,002,132,172      branches:u                #  419.730 M/sec                  
           117,268      branch-misses:u           #    0.00% of all branches        
   110,002,614,320      L1-dcache-loads:u         #    2.308 G/sec                  
    10,796,435,741      L1-dcache-load-misses:u   #    9.81% of all L1-dcache accesses
                 0      LLC-loads:u               #    0.000 /sec                   
                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses

      15.986638336 seconds time elapsed

      47.175831000 seconds user
       0.414928000 seconds sys

Second test run:

export GOMP_CPU_AFFINITY=1,2,3,4
export OMP_NUM_THREADS=4

taskset -c $GOMP_CPU_AFFINITY perf stat -d ./test_openmp

The output is:

pid: 4118342

 Performance counter stats for './test_openmp':

         48,241.03 msec task-clock:u              #    1.072 CPUs utilized          
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
           119,879      page-faults:u             #    2.485 K/sec                  
   161,605,704,451      cycles:u                  #    3.350 GHz                    
   250,011,376,400      instructions:u            #    1.55  insn per cycle         
    20,002,726,448      branches:u                #  414.641 M/sec                  
           118,657      branch-misses:u           #    0.00% of all branches        
   110,002,938,510      L1-dcache-loads:u         #    2.280 G/sec                  
    10,796,444,713      L1-dcache-load-misses:u   #    9.81% of all L1-dcache accesses
                 0      LLC-loads:u               #    0.000 /sec                   
                 0      LLC-load-misses:u         #    0.00% of all LL-cache accesses

      45.012033357 seconds time elapsed

      47.764469000 seconds user
       0.399934000 seconds sys

My question is: why, when I gave the program one more core (core 1) in the second run, is the running time much longer (15.98 s vs. 45.01 s) and the CPU utilization much lower (2.98 vs. 1.07)?

Here is the test code I ran.

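#include <cstdint>
#include <iostream>
#include <unistd.h>

constexpr int64_t N = 100000;
int m = N;
int n = N;

int main() {
  // The arrays are allocated but never initialized in this test.
  double* a = new double[N];
  double* c = new double[N];
  double* b = new double[N * N];

  std::cout << "pid: " << getpid() << std::endl;

  // No schedule clause here, so the loop schedule is implementation defined.
#pragma omp parallel for default(none) shared(m, n, a, b, c)
  for (int i = 0; i < m; i++) {
    double sum = 0.0;
    for (int j = 0; j < n; j++)
      sum += b[i + j * N] * c[j];
    a[i] = sum;  // runs once per i, after the inner loop
  }

  delete[] a;
  delete[] b;
  delete[] c;
  return 0;
}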


Solution

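  • When you don't specify a schedule for the worksharing loop, the schedule is implementation defined. Most implementations pick the static schedule because it has the lowest runtime overhead for most workloads. The static schedule gives each thread the same number of iterations, so the thread pinned to the busy core 1 gets as much work as the others but far less CPU time. The other threads finish early and then sit idle at the loop's implicit barrier, which is why the elapsed time grows and the reported CPU utilization drops.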

    In your case, you want OpenMP to be free to distribute the work unevenly across the threads. Try adding schedule(dynamic) to the parallel for directive.

    You can also select schedule(runtime) and control the schedule by setting the OMP_SCHEDULE environment variable for each execution.
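
To illustrate the two suggestions, here is a minimal sketch of the same loop with an explicit schedule clause. The function names matvec_dynamic and matvec_runtime, the use of std::vector, and the chunk size of 64 are assumptions made for this example and do not come from the original test program.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not in the original post):
// computes a[i] = sum over j of b[i + j*n] * c[j], as in the question's loop.
std::vector<double> matvec_dynamic(const std::vector<double>& b,
                                   const std::vector<double>& c, long n) {
  const std::size_t un = static_cast<std::size_t>(n);
  assert(b.size() == un * un && c.size() == un);
  std::vector<double> a(un, 0.0);
  // dynamic: a thread that finishes its chunk fetches the next one, so a
  // thread on a contended core simply ends up processing fewer chunks.
  // The chunk size 64 is an arbitrary value chosen for illustration.
#pragma omp parallel for schedule(dynamic, 64)
  for (long i = 0; i < n; ++i) {
    const std::size_t ui = static_cast<std::size_t>(i);
    double sum = 0.0;
    for (std::size_t j = 0; j < un; ++j)
      sum += b[ui + j * un] * c[j];
    a[ui] = sum;
  }
  return a;
}

// Same computation, but the schedule is read from OMP_SCHEDULE at run time.
std::vector<double> matvec_runtime(const std::vector<double>& b,
                                   const std::vector<double>& c, long n) {
  const std::size_t un = static_cast<std::size_t>(n);
  assert(b.size() == un * un && c.size() == un);
  std::vector<double> a(un, 0.0);
#pragma omp parallel for schedule(runtime)
  for (long i = 0; i < n; ++i) {
    const std::size_t ui = static_cast<std::size_t>(i);
    double sum = 0.0;
    for (std::size_t j = 0; j < un; ++j)
      sum += b[ui + j * un] * c[j];
    a[ui] = sum;
  }
  return a;
}
```

With GCC, the pragmas take effect only when compiling with -fopenmp. For the runtime variant, the schedule can then be switched per run without recompiling, for example with export OMP_SCHEDULE="dynamic,64" or export OMP_SCHEDULE="static". Dynamic scheduling adds some bookkeeping per chunk, so the best chunk size is workload dependent and worth measuring.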