c++ multithreading openmp contention

Splitting up a program into 4 threads is slower than a single thread


I've been writing a raytracer the past week, and have come to a point where it's doing enough that multi-threading would make sense. I have tried using OpenMP to parallelize it, but running it with more threads is actually slower than running it with one.

Reading over other similar questions, especially about OpenMP, one suggestion was that gcc optimizes serial code better. However, running the compiled code below with export OMP_NUM_THREADS=1 is twice as fast as with export OMP_NUM_THREADS=4; that is, it's the same compiled binary on both runs, only the thread count changes.

Running the program with time:

> export OMP_NUM_THREADS=1; time ./raytracer
real    0m34.344s
user    0m34.310s
sys     0m0.008s


> export OMP_NUM_THREADS=4; time ./raytracer
real    0m53.189s
user    0m20.677s
sys     0m0.096s

User time is a lot smaller than real time, which is unusual when using multiple cores: user should be larger than real, since several cores are running at the same time.

Code that I have parallelized using OpenMP

void Raytracer::render( Camera& cam ) {

    // let the camera know to use this raytracer for probing the scene
    cam.setSamplingFunc(getSamplingFunction());

    int i, j;

    #pragma omp parallel private(i, j)
    {

        // Construct a ray for each pixel.
        #pragma omp for schedule(dynamic, 4)
        for (i = 0; i < cam.height(); ++i) {
            for (j = 0; j < cam.width(); ++j) {
                cam.computePixel(i, j);
            }
        }
    }
}
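As an aside, the same loop can be written more compactly in modern OpenMP style (a sketch only, not a fix for the slowdown; `FakeCam` is an illustrative stand-in for the real `Camera`): loop variables declared inside the `for` statements are private automatically, and `collapse(2)` lets the runtime schedule over both dimensions at once.

```cpp
#include <vector>

// Hypothetical stand-in for Camera, just so the sketch is self-contained:
// computePixel writes one value per pixel, as in the question.
struct FakeCam {
    int h = 4, w = 4;
    std::vector<int> sensor = std::vector<int>(h * w, 0);
    void computePixel(int i, int j) { sensor[i * w + j] = i * w + j; }
};

void render(FakeCam& cam) {
    // i and j are private per thread automatically because they are
    // declared in the loop headers; collapse(2) merges both loops into
    // a single iteration space for scheduling.
    #pragma omp parallel for collapse(2) schedule(dynamic, 4)
    for (int i = 0; i < cam.h; ++i)
        for (int j = 0; j < cam.w; ++j)
            cam.computePixel(i, j);
}
```

Without `-fopenmp` the pragma is simply ignored and the loop runs serially, so the code behaves identically either way.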

When reading this question I thought I had found my answer: it talks about how the glibc implementation of rand() synchronizes calls to itself in order to preserve the random-number-generator state between threads. Since I use rand() quite a lot for Monte Carlo sampling, I thought that was the problem. I got rid of the calls to rand(), replacing them with a single value, but using multiple threads was still slower. EDIT: oops, turns out I didn't test this correctly; it was the random values!

Now that those are out of the way, I will discuss an overview of what's being done on each call to computePixel, so hopefully a solution can be found.

In my raytracer I essentially have a scene tree with all the objects in it. This tree is traversed a lot during computePixel when objects are tested for intersection; however, no writes are done to this tree or to any objects. computePixel essentially reads the scene a bunch of times, calling methods on the objects (all of which are const methods), and at the very end writes a single value to its own pixel array. This is the only part I am aware of where more than one thread will try to write to the same member variable. There is no synchronization anywhere, since no two threads can write to the same cell in the pixel array.

Can anyone suggest places where there could be some kind of contention? Things to try?

Thank you in advance.

EDIT: Sorry, was stupid not to provide more info on my system.

Code for computePixel:

class Camera {

    // constructors, destructors
    private:
        // this is the array that is being written to, but not read from.
        Colour* _sensor; // allocated using new at construction.
};

void Camera::computePixel(int i, int j) const {

    Colour col;

    // simple code to construct appropriate ray for the pixel
    Ray3D ray(/* params */);
    col += _sceneSamplingFunc(ray); // calls a const method that traverses scene. 

    _sensor[i*_scrWidth+j] += col;
}

From the suggestions, it might be the tree traversal that causes the slow-down. One other aspect: there is quite a lot of recursion involved once the sampling function is called (recursive bouncing of rays). Could this cause these problems?


Solution

  • Thanks everyone for the suggestions, but after further profiling, and getting rid of other contributing factors, random-number generation did turn out to be the culprit.

    As outlined in the question above, rand() needs to keep track of its state from one call to the next. If several threads are trying to modify this state, it would cause a race condition, so the default implementation in glibc is to lock on every call, to make the function thread-safe. This is terrible for performance.
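    The POSIX rand_r variant makes that hidden state explicit: the caller owns the seed, so no locking is needed and each thread can keep its own. A minimal sketch (rand_r is POSIX, declared in stdlib.h; the wrapper name is just illustrative):

    ```cpp
    #include <stdlib.h>

    // Each caller passes its own state; rand_r touches no global data,
    // so two threads using separate seeds never contend on a lock.
    int next_random(unsigned int* state) {
        return rand_r(state); // returns a value in [0, RAND_MAX]
    }
    ```

    Two identical seeds produce identical sequences, which also makes per-thread results reproducible.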

    Unfortunately, the solutions to this problem that I've seen on Stack Overflow are all local, i.e. they deal with the problem in the scope where rand() is called. Instead I propose a "quick and dirty" solution that anyone can use in their program to implement independent random number generation for each thread, requiring no synchronization.

    I have tested the code, and it works: there is no locking, and no noticeable slowdown as a result of calls to threadrand. Feel free to point out any blatant mistakes.

    threadrand.h

    #ifndef THREAD_RAND_H_
    #define THREAD_RAND_H_
    
    // max number of thread states to store
    const int maxThreadNum = 100;
    
    void init_threadrand();
    
    // requires openmp, for thread number
    int threadrand();
    
    #endif // THREAD_RAND_H_
    

    threadrand.cpp

    #include "threadrand.h"
    #include <cstdlib>
    #include <boost/scoped_ptr.hpp>
    #include <omp.h>
    
    // can be replaced with array of ordinary pointers, but need to
    // explicitly delete previous pointer allocations, and do null checks.
    //
    // Importantly, the double indirection tries to avoid putting all the
    // thread states on the same cache line, which would cause cache invalidations
    // to occur on other cores every time rand_r would modify the state.
    // (i.e. false sharing)
    // A better implementation would be to store each state in a structure
    // that is the size of a cache line
    static boost::scoped_ptr<unsigned int> randThreadStates[maxThreadNum];
    
    // reinitialize the array of thread state pointers, with random
    // seed values.
    void init_threadrand() {
        for (int i = 0; i < maxThreadNum; ++i) {
            randThreadStates[i].reset(new unsigned int(std::rand()));
        }
    }
    
    // requires openmp, for thread number, to index into array of states.
    int threadrand() {
        int i = omp_get_thread_num();
        return rand_r(randThreadStates[i].get());
    }
    

    Now you can initialize the random states for the threads from main() using init_threadrand(), and subsequently get a random number using threadrand() when running several threads in OpenMP.
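    As the comment in threadrand.cpp suggests, the double indirection can also be replaced by storing each state in a cache-line-sized slot, which avoids false sharing with no heap allocation at all. A sketch (assuming 64-byte cache lines; the names and the explicit thread_id parameter are illustrative — in the real code the index would come from omp_get_thread_num()):

    ```cpp
    #include <stdlib.h>

    const int kMaxThreads = 100;

    // Pad each thread's state out to a full (assumed 64-byte) cache line,
    // so an update by one thread never invalidates another thread's line.
    struct alignas(64) PaddedState {
        unsigned int seed;
    };

    static PaddedState g_states[kMaxThreads];

    void init_states(unsigned int base_seed) {
        for (int i = 0; i < kMaxThreads; ++i)
            g_states[i].seed = base_seed + i;
    }

    // rand_r only touches the caller-supplied state, so no locking occurs.
    int padded_threadrand(int thread_id) {
        return rand_r(&g_states[thread_id].seed);
    }
    ```

    alignas(64) forces sizeof(PaddedState) up to 64 bytes, so adjacent thread states can never share a cache line.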