OpenMP parallel for is substantially slower than sequential

I am running a simulation of random events happening in accordance with a given set of probabilities, until n events are hit, and repeating for as many iterations as desired. When I run this sequentially, it finishes a small test set of 24 iterations in 5 seconds. When I set a higher number of threads (16), the time involved becomes 55 seconds. I made these comparisons using time bash -c "...".

I am sorry for asking another one of these questions, but I have been searching for a solution for several hours now. I think the problem has to be some sort of race condition, but I just cannot seem to figure out on what. Variables target, n_probabilities and probabilities should all be private to the thread, and the majority of the time is spent between the print statements 'Started thread %d' and 'Finished thread %d'. This suggests to me that mutation_iterations[i] = simulate_iteration(target, n_probabilities, probabilities); then has to be the offending line, and mutation_iterations is a shared variable, but it should only be accessed a single time per thread so I don't see how it could cause this much of a speed drop.

However, looking at gprof output I don't really see an unusual amount of time spent on the function in question (although the 6.15 for main seems weird to me, as it is not the actual runtime):

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 98.86      6.08     6.08       24   253.33   253.33  simulate_iteration
  0.98      6.14     0.06                             _init
  0.16      6.15     0.01                             main

Admittedly this is my first time reading gprof, so I may be reading this entirely wrong. But as I see it, my program seems to be running for 49 seconds longer than gprof says it should, and I cannot figure out where this comes from.

The offending code:

    int **mutation_iterations = malloc(iterations * sizeof(int *)); 
    srand(0); // Set random number seed

    #pragma omp parallel shared(mutation_iterations) 
    {
        int target = desired_mutations;
        int n_probabilities = simulation_size;

        double probabilities[simulation_size];
        for (int i = 0; i < simulation_size; i++) {
            probabilities[i] = simulation_set[i].probability;
        }

        #pragma omp for
        for (int i = 0; i < iterations; i++) {
                mutation_iterations[i] = NULL;
                printf("Started thread %d\n", i);
                mutation_iterations[i] = simulate_iteration(target, n_probabilities, probabilities);
                printf("Finished thread %d\n", i);
        }
    }

The simulate_iteration(...) function:

int * simulate_iteration(int target, int n_probabilities, double probabilities[]) {
    int *counts = calloc(n_probabilities, sizeof(int));

    int total = 0;
    while (total < target) {
        for (int i = 0; i < n_probabilities; i++) {
            double rand_val = (double) rand() / RAND_MAX;
            if (rand_val < probabilities[i]) {
                counts[i]++;
                total++;
            }
        }
    }

    return counts;
}

I must be overlooking something but I just can't see what. Any and all help would be much appreciated.

Solution

Turns out I was overlooking a contention in rand(), thanks @Homer512 for pointing it out. I ended up using rand_r and that fixed it.

For anyone trying to do something similar, here is a minimum reproducible sample of the null mutational model I was making with the fix included:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <csv.h>
#include <omp.h>
#include <pthread.h>


int * simulate_iteration(unsigned int c_iteration, int target, int n_probabilities, double probabilities[]) {;
    int *counts = calloc(n_probabilities, sizeof(int));

    int total = 0;
    while (total < target) {
        for (int i = 0; i < n_probabilities; i++) {
            double rand_val = (double) rand_r(&c_iteration) / RAND_MAX;
            if (rand_val < probabilities[i]) {
                counts[i]++;
                total++;
            }
        }
    }

    return counts;
}

int main(int argc, char *argv[]) {
    int iterations = 24;
    int **mutation_iterations = malloc(iterations * sizeof(int *)); 

    #pragma omp parallel shared(mutation_iterations) 
    {
        unsigned int gen_seed = 0;
        int target = 500;
        int n_probabilities = 500;

        double probabilities[500];
        for (int i = 0; i < n_probabilities; i++) {
            probabilities[i] = (double) rand_r(&gen_seed) / RAND_MAX / 10;
        }

        #pragma omp for
        for (int i = 0; i < iterations; i++) {
                mutation_iterations[i] = NULL;
                mutation_iterations[i] = simulate_iteration(i, target, n_probabilities, probabilities);
        }
    }
    
    // Print the results
    printf("gene");
    for (int i = 0; i < 500; i++) {
        printf("\titer_%d", i);
    }
    printf("\n");
    for (int j = 0; j < 500; j++) {
        printf("%d", j);
        for (int i = 0; i < iterations; i++) {
            printf("\t%d", mutation_iterations[i][j]);
        }
        printf("\n");
    }

    for (int i = 0; i < iterations; i++) free(mutation_iterations[i]); 
    free(mutation_iterations);

    return 0;

}