cmultithreadingpthreadspthread-join

Why doesn't multithreading improve performance in this program for finding primes?


Running time is about the same, regardless of the number of threads. I am having trouble figuring out why. I know that the threads are running in parallel as they are supposed to, but I don't have even a good guess as to why there would be no performance improvement. (approx. 21 seconds to find all primes less than 8 million, for both single and multiple threads) What is going on here?

typedef struct prime_finder_vars {
    long from;
    long to;
    int idx;
} PrimeFinderVars;

int is_prime(long num) {
    int limit = round(sqrt(num));
    for (long i = 2; i <= limit; i++) {
        if (num % i == 0)
            return FALSE;
    }
    return TRUE;
}

void *prime_finder(void *pf) {

    PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;

    long next_cand = pf_vars->from;
    while (next_cand < pf_vars->to) {
        if (is_prime(next_cand)) {
            ++counts[pf_vars->idx];
        }
        next_cand += 2;
    }
    return pf;
}


int main(void) {

    struct timespec start;
    struct timespec end;
    double start_sec, end_sec, elapsed_sec;
    int sum = 0;

    clock_gettime(CLOCK_REALTIME, &start);

    pthread_t threads[NUM_THREADS];
    PrimeFinderVars vars[NUM_THREADS];

    int slice_size = SEARCH_RANGE / NUM_THREADS;

    for (int i = 0; i < NUM_THREADS; i++) {

        vars[i].from = i * slice_size + 1;
        vars[i].to = (i + 1) * slice_size;
        vars[i].idx = i;

        pthread_create(&threads[i], NULL, prime_finder, &vars[i]);

    }

    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        sum += counts[i];
    }

    clock_gettime(CLOCK_REALTIME, &end);

    start_sec = start.tv_sec + start.tv_nsec / NANO_PER_SEC;
    end_sec = end.tv_sec + end.tv_nsec / NANO_PER_SEC;
    elapsed_sec = end_sec - start_sec;
}

Solution

  • This is an interesting question. Everything Mikhail Vladimirov says is true but I decided to do some testing on my laptop to see what I got. My laptop is a modern MacBook pro with an eight core i9. I'm not sure if it is hyper threaded or not, but here are my results:

    Time to execute by threads I tested with the number of threads varying between 1 and 50 and a search range of 10,000,000.

    With one thread it takes nearly eleven seconds but this drops rapidly to around 1.5 seconds with 16 threads, and it doesn't get any better after that.

    My conclusion is

    1. My comment on Mikhail's answer about the cost of the thread functions is wrong, at least on my platform. I see no increased overhead with more threads
    2. There's something wrong with your thread library.

    I think you probably need to satisfy yourself that the threads really are running in parallel on separate cores. One explanation of your results could be that they are all competing for the same CPU.


    Just for fun I decided to try profiling the program.

    Profile of the first few iterations of thread count

    Each step represents another core going to 100%. I'm not sure why the prt with three threads doesn't go to 300%, but you can see with four threads that it goes up to 400% straight away but comes down in steps of 100%. This is the effect of you splitting the task into equal ranges and the threads dealing with lower numbers finishing sooner.


    The first 16 data points

    Threads Time
    1   11.893418
    2   7.352520
    3   5.117278
    4   4.062026
    5   3.511605
    6   2.892274
    7   2.401555
    8   2.172573
    9   1.910534
    10  1.864023
    11  1.860944
    12  1.369277
    13  1.628883
    14  1.196646
    15  1.626215
    16  1.548878
    

    The code I used to produce the test results (slightly modified from yours).

    #include <stdio.h>
    #include <pthread.h>
    #include <math.h>
    #include <stdbool.h>
    
    #define SEARCH_RANGE    10000000
    #define NANO_PER_SEC    1000000000
    
    typedef struct prime_finder_vars {
        long from;
        long to;
        int* count;
    } PrimeFinderVars;
    
    int is_prime(long num) {
        int limit = round(sqrt(num));
        for (long i = 2; i <= limit; i++) {
            if (num % i == 0)
                return false;
        }
        return true;
    }
    
    
    void *prime_finder(void *pf)
    {
    
        PrimeFinderVars *pf_vars = (PrimeFinderVars *) pf;
    
        long next_cand = pf_vars->from;
        while (next_cand < pf_vars->to)
        {
            if (is_prime(next_cand))
            {
                (*pf_vars->count)++ ;
            }
            next_cand += 2;
        }
        return pf;
    }
    
    
    void trial(int numThreads)
    {
        struct timespec start;
        struct timespec end;
        double start_sec, end_sec, elapsed_sec;
        int sum = 0;
    
        clock_gettime(CLOCK_REALTIME, &start);
    
        int counts[numThreads];
        pthread_t threads[numThreads];
        PrimeFinderVars vars[numThreads];
    
        int slice_size = SEARCH_RANGE / numThreads;
    
        for (int i = 0; i < numThreads; i++)
        {
            counts[i] = 0;
            vars[i].from = i * slice_size + 1;
            vars[i].to = (i + 1) * slice_size;
            vars[i].count = &counts[i];
    
            pthread_create(&threads[i], NULL, prime_finder, &vars[i]);
    
        }
    
        for (int i = 0; i < numThreads; i++)
        {
            pthread_join(threads[i], NULL);
            sum += counts[i];
        }
    
        clock_gettime(CLOCK_REALTIME, &end);
    
        start_sec = (double)start.tv_sec + (double)start.tv_nsec / NANO_PER_SEC;
        end_sec = (double)end.tv_sec + (double)end.tv_nsec / NANO_PER_SEC;
        elapsed_sec = end_sec - start_sec;
        printf("%d\t%f\n", numThreads, elapsed_sec);
    }
    
    int main()
    {
        printf("Threads\tTime\n");
        for (int threads = 1 ; threads <= 50 ; ++threads)
        {
            trial(threads);
        }
    }
    
    

    I pursued this a bit further over the last day or two. Firstly, I was intrigued about why there seemed to be a double line of timings: After about 12 threads, a run would take either 1.5 seconds or 1 second. I theorised above it was because of the bug Mikhail mentioned so I plotted the actual answer given for each number of threads and found that, while the answer was usually around 664,579 it would often be around half that and unsurprisingly, when the answer was half the real answer, that corresponded to the lower of the two timing lines. enter image description here

    So I fixed that bug and the double line effect disappeared. However, I was still getting more than one different answer depending on the number of threads. enter image description here

    The reason for this is that there are two more bugs.

    1. the original algorithm fails to test the top number in each range.
    2. The size of the ranges is calculated by doing an integer division of the search range by the number of threads. Unless there is no remainder, numbers at the top of the search range will not be checked.

    I fixed both bugs and did a third run. This didn't affect the timings appreciably, but I got the same answer for each number of threads used.

    For comparison, I wrote a sieve of Eratosthenes and timed that too. Using it and a single thread took only 0.2 seconds - about seven times faster than the fastest number of cores.

    I've published a spreadsheet of the results and there's a git repo of the code.