Tags: c, multithreading, parallel-processing, openmp

Parallel Computation of the Number π using OpenMP


I'm using the following code in order to parallelize the computation of π.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define N 1000000 // Number of intervals for integration

double calculate_pi_parallel_trapezoidal() {
    double integral = 0.0;
    double dx = 1.0 / N;

    #pragma omp parallel for reduction(+:integral)
    for (int i = 0; i < N; i++) {
        double x = (i + 0.5) * dx;
        double fx = sqrt(1.0 - x * x);
        integral += fx * dx;
    }

    return 4.0 * integral;
}

int main() {
    double pi;
    double start_time, end_time;
    int num_threads;

    // Set the environment variable to control thread affinity
    setenv("GOMP_CPU_AFFINITY", "0-7", 1);

    // Get the maximum number of threads available
    #pragma omp parallel
    {
        #pragma omp master
        {
            num_threads = omp_get_max_threads();
        }
    }
    printf("Maximum number of threads: %d\n", num_threads);

    // Experiment with different numbers of threads
    for (int threads = 1; threads <= num_threads; threads++) {
        // Set the number of threads for the following parallel region
        omp_set_num_threads(threads);

        start_time = omp_get_wtime(); // Record the start time

        pi = calculate_pi_parallel_trapezoidal();

        end_time = omp_get_wtime(); // Record the end time

        printf("Threads: %d, Approximation of Pi (Trapezoidal Rule): %.16f, Time taken: %f seconds\n", threads, pi, end_time - start_time);
    }

    return 0;
}

I use the OpenMP API for this problem, as you can see in the code snippet above. The command line to compile the code is gcc -fopenmp -O3 pi.c -o pi -lm, with -O3 for aggressive optimization and -lm to link libm (the math library), since I'm using sqrt. Then I simply run the code using ./pi. The results of executing the above code are the following:

Maximum number of threads: 8
Threads: 1, Approximation of Pi (Trapezoidal Rule): 3.1415926539343633, Time taken: 0.006868 seconds
Threads: 2, Approximation of Pi (Trapezoidal Rule): 3.1415926539341976, Time taken: 0.003119 seconds
Threads: 3, Approximation of Pi (Trapezoidal Rule): 3.1415926539342567, Time taken: 0.001779 seconds
Threads: 4, Approximation of Pi (Trapezoidal Rule): 3.1415926539342074, Time taken: 0.001705 seconds
Threads: 5, Approximation of Pi (Trapezoidal Rule): 3.1415926539342571, Time taken: 0.001142 seconds
Threads: 6, Approximation of Pi (Trapezoidal Rule): 3.1415926539342460, Time taken: 0.000825 seconds
Threads: 7, Approximation of Pi (Trapezoidal Rule): 3.1415926539342056, Time taken: 0.001365 seconds
Threads: 8, Approximation of Pi (Trapezoidal Rule): 3.1415926539342003, Time taken: 0.019961 seconds

Interestingly enough, every time I run the code I get slightly different results:

Maximum number of threads: 8
Threads: 1, Approximation of Pi (Trapezoidal Rule): 3.1415926539343633, Time taken: 0.005567 seconds
Threads: 2, Approximation of Pi (Trapezoidal Rule): 3.1415926539341976, Time taken: 0.002621 seconds
Threads: 3, Approximation of Pi (Trapezoidal Rule): 3.1415926539342571, Time taken: 0.003349 seconds
Threads: 4, Approximation of Pi (Trapezoidal Rule): 3.1415926539342074, Time taken: 0.001916 seconds
Threads: 5, Approximation of Pi (Trapezoidal Rule): 3.1415926539342571, Time taken: 0.001457 seconds
Threads: 6, Approximation of Pi (Trapezoidal Rule): 3.1415926539342460, Time taken: 0.000927 seconds
Threads: 7, Approximation of Pi (Trapezoidal Rule): 3.1415926539342056, Time taken: 0.000866 seconds
Threads: 8, Approximation of Pi (Trapezoidal Rule): 3.1415926539342007, Time taken: 0.019658 seconds

I don't know the reason for this, but I have a feeling it's normal; am I wrong?
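For what it's worth, here is a minimal, standalone sketch (separate from the program above) of the kind of effect I suspect: floating-point addition is not associative, so summing the same terms in a different order can change the last digits, and as far as I understand the per-thread partial sums of an OpenMP reduction may be combined in an unspecified order.

#include <stdio.h>

// Standalone illustration (not part of my pi program): sum the same terms
// in ascending and descending order and compare the totals. Since
// floating-point addition is not associative, the two results may differ
// in their last digits.
int main(void) {
    const int n = 1000000;
    double forward = 0.0, backward = 0.0;

    for (int i = 0; i < n; i++)
        forward += 1.0 / (1.0 + i);    // small index (large terms) first

    for (int i = n - 1; i >= 0; i--)
        backward += 1.0 / (1.0 + i);   // large index (small terms) first

    printf("forward : %.17g\n", forward);
    printf("backward: %.17g\n", backward);
    printf("diff    : %.3g\n", forward - backward);
    return 0;
}

If the two totals printed by a sketch like this differ in their last digits, that would match the kind of run-to-run variation I see above.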

As you may understand, I have 8 logical processors (hardware threads) and 4 physical cores (two threads per core). I run the code on Ubuntu 22.04.

Now the question is: even though I have 8 hardware threads and run the code with up to 8 threads, the run time does not keep decreasing all the way up to the highest number of threads used; from 7 threads on, it increases again! From what I understood, I shouldn't expect this behavior, and the run time should keep decreasing up to the highest thread count, given that I have 8 hardware threads and use all 8 of them to parallelize.

I have read about different reasons for such an issue, such as load balancing, overhead, and thread affinity, but I have no idea how to diagnose what the problem actually is or how to fix it. Could you please chip in and suggest your theories and tips on how I can resolve the issue and, at the same time, learn more about multithreading and parallelization?

UPDATE

I made some adjustments to N based on some of the tips given in the comments: I increased N from 10^6 to 10^9 and I get the following results:

Maximum number of threads: 8
Threads: 1, Approximation of Pi (Trapezoidal Rule): 3.1415926535902123, Time taken: 1.379732 seconds
Threads: 2, Approximation of Pi (Trapezoidal Rule): 3.1415926535901511, Time taken: 0.746554 seconds
Threads: 3, Approximation of Pi (Trapezoidal Rule): 3.1415926535899734, Time taken: 0.518842 seconds
Threads: 4, Approximation of Pi (Trapezoidal Rule): 3.1415926535898255, Time taken: 0.414103 seconds
Threads: 5, Approximation of Pi (Trapezoidal Rule): 3.1415926535899268, Time taken: 0.518961 seconds
Threads: 6, Approximation of Pi (Trapezoidal Rule): 3.1415926535899241, Time taken: 0.490220 seconds
Threads: 7, Approximation of Pi (Trapezoidal Rule): 3.1415926535897527, Time taken: 0.443656 seconds
Threads: 8, Approximation of Pi (Trapezoidal Rule): 3.1415926535899521, Time taken: 0.413821 seconds

So, based on some of the suggestions in the comments: since OpenMP only scales up to the number of physical cores (as suggested by some fellow members, see the comments) and I only have 4 physical cores, the run time should decrease only up to 4 threads, as can be seen in the results above; then, when we move to 5 threads (logical cores), there is a jump in the run time, and from there it decreases again.

Would somebody please explain this phenomenon?

Thanks a bunch


Solution

  • About logical cores, and why using them does not necessarily speed up computations (thanks to Jérôme Richard, who corrected my initial answer)

    First consider a CPU with N physical cores, and a program that runs with N threads. The OS is responsible for scheduling the threads on the cores. For simplicity let's imagine no other thread is running, so in practice the OS can leave each thread on a core indefinitely.

    As long as each core receives instructions and data fast enough from memory, this is all good: the code is said to be "CPU bound" (performance is limited only by the CPU) and it scales perfectly with more cores/threads: the speed-up in terms of elapsed time is N (e.g. going from 1 to 4 threads ideally divides the elapsed time by 4).

    Now, if each core does not receive data fast enough (because the bandwidth between the RAM and the CPU is limited), then it is regularly idle, and the code is said to be "memory bound". The speed-up is (sometimes much) less than N and is bounded by the bandwidth. This is nicely illustrated in this answer, and a rough way to observe the difference on your own machine is sketched at the end of this post.

    Once a core gets idle, the OS scheduler removes the current thread from the core and replaces it with another thread (it switches the threads), which hopefully has enough data ready to keep the core busy. One problem is that switching threads comes at a cost: the OS has to save the state (e.g. the registers) of the thread that was on the core and restore the state of the thread that replaces it. During this time, the core performs no useful computation.

    Here come the logical cores (a technique known as Simultaneous MultiThreading (SMT), or Hyper-Threading on Intel CPUs): each physical core is viewed by the OS as 2 cores, so the OS can schedule 2 threads on it simultaneously. The core still has a single instruction pipeline, but it can interleave instructions from both threads, and each thread has its own "context" (e.g. there are 2 sets of registers). If one thread gets idle, the core may still be busy with the other thread. (The second sketch at the end of this post shows how to see which logical CPU each OpenMP thread lands on.)

    So it can be tempting to run as many threads as logical cores (2*N) in order to keep the physical cores constantly busy, but this is not necessarily a win:

      • if the code is CPU bound, each physical core is already busy almost 100% of the time, so scheduling a second thread on it brings little or no additional throughput;
      • if the code is memory bound, the bottleneck is the memory bandwidth, which does not increase with the number of threads, so the extra threads do not help either (*).

    (*) This can even hurt instead of helping, because more threads often means more random accesses to memory, hence more cache misses. This is very visible on the performance curves of the other answer.

    It may sound a bit restrictive, but the logical cores are useful anyway, because the OS can schedule threads from processes other than your main compute process (that is, mostly system processes in an HPC context) without necessarily having to switch out your computing threads.
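If it helps, here are two small sketches of my own (illustrations under stated assumptions, not rigorous benchmarks) that can be used to explore the points above on a given machine.

The first one contrasts a compute-bound loop (plenty of arithmetic per iteration, no array traffic) with a memory-bound loop (a plain streaming sum over an array much larger than the caches), timing both for a growing number of threads. The array size and the kernels are arbitrary choices; compile it like the code in the question, e.g. gcc -fopenmp -O3 membound.c -o membound -lm (the file name is just an example). Typically the compute-bound loop keeps scaling up to roughly the number of physical cores, while the memory-bound one stops scaling much earlier because it saturates the memory bandwidth.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

#define M (1 << 25)   // ~33M doubles, ~256 MB: far larger than the CPU caches

int main(void) {
    double *a = malloc((size_t)M * sizeof *a);
    if (a == NULL) return 1;
    for (long i = 0; i < M; i++) a[i] = 1.0 / (i + 1.0);

    for (int t = 1; t <= omp_get_max_threads(); t++) {
        omp_set_num_threads(t);

        // Compute-bound kernel: lots of arithmetic, no array accesses.
        double s1 = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:s1)
        for (long i = 0; i < M; i++)
            s1 += sqrt((double)i) * sin((double)i);
        double t1 = omp_get_wtime();

        // Memory-bound kernel: one addition per element streamed from RAM.
        double s2 = 0.0;
        double t2 = omp_get_wtime();
        #pragma omp parallel for reduction(+:s2)
        for (long i = 0; i < M; i++)
            s2 += a[i];
        double t3 = omp_get_wtime();

        printf("threads=%d  compute-bound: %.4f s  memory-bound: %.4f s  (checks: %g %g)\n",
               t, t1 - t0, t3 - t2, s1, s2);
    }

    free(a);
    return 0;
}

The second one simply reports which logical CPU each OpenMP thread is currently executing on. sched_getcpu() is Linux-specific (available on Ubuntu 22.04); comparing its output with the CPU-to-core mapping shown by lscpu -e tells you whether two of your threads share a physical core. Note that without pinning (e.g. OMP_PROC_BIND=true or OMP_PLACES=cores) the OS may migrate threads, so the reported CPU is only a snapshot.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // Each OpenMP thread reports the logical CPU it is running on right now.
        printf("OpenMP thread %d of %d is on logical CPU %d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}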