c++, cmake, openmp

No speedup observed when using OpenMP in a C++ project


I tried to add parallel execution to my C++ project.

Based on this example, I defined my app.cpp as follows:

#include <chrono>
#include <cstdio>  // printf
#include <iostream>
#include <omp.h>

int sum_serial(int n) {
    int sum = 0;
    for (int i = 0; i <= n; ++i) {
        sum += i;
    }
    return sum;
}

// Parallel programming function
int sum_parallel(int n) {
    int sum = 0;
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i <= n; ++i) {
        sum += i;
    }
    return sum;
}

int main(int argc, char* argv[]) {
    // Beginning of parallel region
#pragma omp parallel
    { printf("Hello World... from thread = %d\n", omp_get_thread_num()); }
    // Set threads number.
#if defined(_OPENMP)
    omp_set_num_threads(2);
#endif

    {
        const int n = 100000000;

        auto start_time = std::chrono::high_resolution_clock::now();

        int result_serial = sum_serial(n);

        auto end_time = std::chrono::high_resolution_clock::now();

        std::chrono::duration<double> serial_duration = end_time - start_time;

        start_time = std::chrono::high_resolution_clock::now();

        int result_parallel = sum_parallel(n);
        end_time = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double> parallel_duration = end_time - start_time;

        std::cout << "Serial result: " << result_serial << std::endl;
        std::cout << "Parallel result: " << result_parallel << std::endl;
        std::cout << "Serial duration: " << serial_duration.count() << " seconds" << std::endl;
        std::cout << "Parallel duration: " << parallel_duration.count() << " seconds" << std::endl;
        std::cout << "Speedup: " << serial_duration.count() / parallel_duration.count()
                  << std::endl;
    }
    return 0;
}

What surprised me is that there is no speedup. In fact, the parallel execution is considerably slower. My output is:

Serial result: 987459712
Parallel result: 987459712
Serial duration: 0.132073 seconds
Parallel duration: 0.645815 seconds
Speedup: 0.204507

Note that my CMake file is:

add_executable(app app.cpp)
target_compile_features(app PRIVATE cxx_std_17)

add_compile_options(-Wall -O3 -fopenmp)


target_link_libraries(app PUBLIC gomp)

What is wrong?

I must be doing something wrong. I understand that a large number of threads can result in a barely observable speedup, but in this particular case 2 threads should be faster than 1, right? I assume my compilation setup is wrong, but I can't figure out what the problem is.


Solution

  • I tried your example and noticed that only one thread was used, even with OpenMP.
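
    To check this on your own build, a small diagnostic along these lines can help. It is only a sketch: threads_in_parallel_region and openmp_status are hypothetical helper names, not part of OpenMP, and the only OpenMP call used is omp_get_num_threads. It relies on the _OPENMP macro, which compilers define only when OpenMP support is enabled (for GCC, with -fopenmp).

```cpp
#include <string>
#ifdef _OPENMP
#include <omp.h>  // only needed when the compiler runs with OpenMP enabled
#endif

// Hypothetical helper: number of threads that actually execute a parallel region.
// Returns 1 when the pragma is ignored (OpenMP disabled) or only one thread is granted.
int threads_in_parallel_region() {
    int count = 1;
#ifdef _OPENMP
#pragma omp parallel
    {
        // Exactly one thread of the team records the team size.
#pragma omp single
        count = omp_get_num_threads();
    }
#endif
    return count;
}

// Hypothetical helper: reports whether OpenMP was enabled at compile time.
std::string openmp_status() {
#ifdef _OPENMP
    return "OpenMP enabled, _OPENMP = " + std::to_string(_OPENMP);
#else
    return "OpenMP disabled: #pragma omp lines are ignored";
#endif
}
```

    If threads_in_parallel_region() returns 1 on a multi-core machine, or openmp_status() reports that OpenMP is disabled, the -fopenmp flag most likely did not reach the compiler.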

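    The cause seems to be your CMake file: add_compile_options only affects targets created after the call, and yours comes after add_executable, so -O3 and -fopenmp never reach app. Use target_compile_options instead of add_compile_options. I used the following and observed some speedup: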

    cmake_minimum_required(VERSION 3.8)  # cxx_std_17 requires CMake 3.8 or newer
    project(foobar)
    add_executable(app app.cpp)
    target_compile_features(app PRIVATE cxx_std_17)
    target_compile_options(app PUBLIC -Wall -O3 -fopenmp)
    target_link_libraries(app PUBLIC gomp)
    

    I also used the following snippet, which sums sin(i) instead of i, so that each iteration has more work to do:

    #include <chrono>
    #include <cstdio>  // printf
    #include <iostream>
    #include <omp.h>
    #include <cmath>
    
    auto sum_serial(int n) {
        double sum = 0;
        for (int i = 0; i <= n; ++i) {
            sum += std::sin(i);
        }
        return sum;
    }
    
    // Parallel programming function
    auto sum_parallel(int n) {
        double sum = 0;
    #pragma omp parallel for reduction(+ : sum)
        for (int i = 0; i <= n; ++i) {
            sum += std::sin(i);
        }
        return sum;
    }
    
    int main(int argc, char* argv[]) {
        // Beginning of parallel region
    #pragma omp parallel
        { printf("Hello World... from thread = %d\n", omp_get_thread_num()); }
        // Set threads number.
    #if defined(_OPENMP)
        omp_set_num_threads(2);
    #endif
    
        {
            const int n = 100000000;
    
            auto start_time = std::chrono::high_resolution_clock::now();
    
            auto result_serial = sum_serial(n);
    
            auto end_time = std::chrono::high_resolution_clock::now();
    
            std::chrono::duration<double> serial_duration = end_time - start_time;
    
            start_time = std::chrono::high_resolution_clock::now();
    
            auto result_parallel = sum_parallel(n);
            end_time = std::chrono::high_resolution_clock::now();
            std::chrono::duration<double> parallel_duration = end_time - start_time;
    
            std::cout << "Serial result    : " << result_serial << std::endl;
            std::cout << "Parallel result  : " << result_parallel << std::endl;
            std::cout << "Serial duration  : " << serial_duration.count() << " seconds" << std::endl;
            std::cout << "Parallel duration: " << parallel_duration.count() << " seconds" << std::endl;
            std::cout << "Speedup          : " << serial_duration.count() / parallel_duration.count() << std::endl;
        }
        return 0;
    }
    

    I got the following output:

    Serial result    : 1.71365
    Parallel result  : 1.71365
    Serial duration  : 1.08041 seconds
    Parallel duration: 0.546919 seconds
    Speedup          : 1.97545
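
    Two side notes on the original integer version, sketched below under assumptions: sum_parallel_64 and best_time_seconds are hypothetical names, and the repeat count is arbitrary. First, the sum 0 + 1 + ... + 100000000 equals 5000000050000000, which does not fit in a 32-bit int; the printed 987459712 matches that value reduced modulo 2^32, so a 64-bit accumulator is needed for a correct result. Second, a single timed call is noisy, so taking the best of several runs gives a steadier comparison.

```cpp
#include <chrono>
#include <cstdint>

// Same reduction as the question's sum_parallel, but with a 64-bit accumulator
// so the result cannot overflow for n = 100000000.
std::int64_t sum_parallel_64(int n) {
    std::int64_t sum = 0;
#pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i <= n; ++i) {
        sum += i;
    }
    return sum;
}

// Hypothetical helper: best wall-clock time in seconds over `repeats` calls of f().
// steady_clock is used because it is monotonic.
template <typename F>
double best_time_seconds(F&& f, int repeats) {
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        f();
        const auto t1 = std::chrono::steady_clock::now();
        const double dt = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || dt < best) {
            best = dt;
        }
    }
    return best;
}
```

    A call could look like best_time_seconds([&] { result = sum_parallel_64(n); }, 5), with result declared outside the lambda and printed afterwards so the compiler cannot discard the computation.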