c++ multithreading performance mutex ppl

Multithreading alternative to mutex in parallel_for


I'm fairly new to C++, so please pardon me if this is a stupid question, but I didn't find a good example of what I'm looking for on the internet.

Basically I'm using a parallel_for loop to find the maximum inside a 2D array (with a bunch of other operations in between). First of all, I don't even know if this is the best approach, but given the size of this 2D array, I thought splitting the calculations would be faster.

My code:

vector<vector<double>> InterpU(1801, vector<double>(3601, 0));
Concurrency::parallel_for(0, 1801, [&](int i) {

    long k = 0; long l = 0;
    pair<long, long> Normalized;
    double InterpPointsU[4][4];     // 4x4 neighbourhood fed to the bicubic interpolation
    double jRes;
    double iRes = i * 0.1;
    double RelativeY, RelativeX;
    int p, q;

    // Find the grid cell containing iRes and the fractional offset within it.
    while (iRes >= (k + 1) * DeltaTheta) k++;
    RelativeX = iRes / DeltaTheta - k;
    for (long j = 0; j < 3600; j++)
    {
        jRes = j * 0.1;
        while (jRes >= (l + 1) * DeltaPhi) l++;
        RelativeY = jRes / DeltaPhi - l;
        p = 0;
        for (long m = k - 1; m < k + 3; m++)
        {
            q = 0;
            for (long n = l - 1; n < l + 3; n++)
            {
                Normalized = Normalize(m, n, PointsTheta, PointsPhi);
                InterpPointsU[p][q] = U[Normalized.first][Normalized.second];
                q++;
            }
            p++;
        }
        InterpU[i][j] = bicubicInterpolate(InterpPointsU, RelativeX, RelativeY);
        if (InterpU[i][j] > MaxU)
        {
            SharedDataLock.lock();
            MaxU = InterpU[i][j];
            SharedDataLock.unlock();
        }
    }
    InterpU[i][3600] = InterpU[i][0]; // wrap around: the last column duplicates the first
});

You can see here that I'm using a mutex called SharedDataLock to protect multiple threads accessing the same resource. MaxU is a variable that should contain only the maximum of the InterpU vector. The code works well, but since I'm having performance problems, I began to look into atomic and some other things.

Is there a good example of how to modify similar code to make it faster?


Solution

  • As mentioned by VTT, you can simply find the local maximum for each thread and merge them afterwards with the use of combinable:

    Concurrency::combinable<double> CombinableMaxU;
    Concurrency::parallel_for(0, 1801, [&](int i) {
        ...
            CombinableMaxU.local() = std::max(CombinableMaxU.local(), InterpU[i][j]);
    });
    MaxU = std::max(MaxU, CombinableMaxU.combine(std::max<double>));
    
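    For context, here is a minimal, self-contained version of that reduction pattern (MSVC/PPL). The data, sizes, and names below are placeholders standing in for the question's actual computation:

    #include <ppl.h>        // Concurrency::parallel_for, Concurrency::combinable
    #include <algorithm>
    #include <iostream>
    #include <limits>
    #include <vector>

    int main()
    {
        std::vector<std::vector<double>> data(1801, std::vector<double>(3601, 0.0));
        data[42][7] = 5.0; // placeholder value so the example has a nontrivial maximum

        // Each thread accumulates into its own thread-local copy; no locking in the hot loop.
        Concurrency::combinable<double> localMax(
            [] { return -std::numeric_limits<double>::infinity(); });

        Concurrency::parallel_for(0, 1801, [&](int i) {
            for (int j = 0; j < 3601; ++j)
                localMax.local() = std::max(localMax.local(), data[i][j]);
        });

        // Merge the per-thread maxima once, after the parallel region.
        double maxU = localMax.combine(
            [](double a, double b) { return std::max(a, b); });

        std::cout << maxU << '\n'; // prints 5
    }

    Passing a lambda to combine also sidesteps the overload ambiguity that some compilers report when std::max<double> is used as a function argument.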

    Note that your current code is actually wrong (unless MaxU is atomic): you read MaxU outside the lock while it can be written simultaneously by other threads. Generally, you must not read a value that is being written simultaneously unless both sides are protected by atomic semantics, or by locks and memory fences. One reason is that a variable access may well consist of multiple memory accesses, depending on how the type is supported by the hardware.

    But in your case, you even have a classic race condition:

    MaxU == 1
      Thread a                 |   Thread b
    InterpU[i][j] = 3          | InterpU[i][j] = 2
    if (3 > MaxU)              | if (2 > MaxU)
    SharedDataLock.lock();     | SharedDataLock.lock();
    (gets the lock)            | (waiting for lock)
    MaxU = 3                   | ...
    SharedDataLock.unlock();   | ...
    ...                        | (gets the lock)
                               | MaxU = 2
                               | SharedDataLock.unlock();
    MaxU == 2

    Locks are hard.
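
    For reference, the minimal fix within the original locking approach (assuming SharedDataLock is a std::mutex) is to keep both the comparison and the update inside the critical section:

    {
        std::lock_guard<std::mutex> guard(SharedDataLock); // unlocks automatically on scope exit
        if (InterpU[i][j] > MaxU)   // compare and write under the same lock
            MaxU = InterpU[i][j];
    }

    This is correct but takes the lock on every iteration, which is exactly the serialization the combinable reduction avoids.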

    You can also use an atomic and compute the maximum on that. However, I would guess¹ that it still doesn't perform well inside the loop², and outside the loop it doesn't matter whether you use atomics or locks.

    ¹: When in doubt, don't guess - measure!

    ²: Just because something is atomic and supported by hardware doesn't mean it is as efficient as accessing local data. First, atomic instructions are often much more costly than their non-atomic counterparts; second, you have to deal with very bad cache effects, because cores/caches will fight over ownership of the data. While atomics may be more elegant in many cases (not this one, IMHO), a reduction is faster most of the time.
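
    For completeness, a sketch of that atomic variant: since std::atomic<double> has no built-in fetch-max, a compare-exchange loop is the usual pattern (updateMax here is a hypothetical helper, not part of the question's code):

    #include <atomic>

    std::atomic<double> MaxU{0.0};

    void updateMax(std::atomic<double>& target, double value)
    {
        double current = target.load(std::memory_order_relaxed);
        // compare_exchange_weak reloads `current` on failure, so the loop
        // retries only while `value` still exceeds the latest observed maximum.
        while (value > current &&
               !target.compare_exchange_weak(current, value))
        {
        }
    }

    Even when this is correct, every contended update bounces the cache line between cores, which is why the per-thread reduction above usually wins inside a hot loop.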