I was revisiting some code I wrote a long time ago and decided to rewrite it to make better use of threads (and of programming in general...).
It is located here: https://github.com/buddhabrot/buddhabrot/blob/master/basic.c
It is an application that renders a buddhabrot fractal. For reasons outside the scope of this question it is hard to use memoization to optimize this, and if you profile it, over 99% of the time is spent in the innermost loop, which eventually does:
buddhabrot[col][row]++;
Multiple threads execute this code. Since incrementing is not thread-safe, I put a dedicated mutex around each part of this memory: every addressable location in the buddhabrot array has its own mutex.
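Roughly what I have now, sketched with hypothetical WIDTH/HEIGHT dimensions (the real code uses different names and sizes):

```c
#include <pthread.h>

#define WIDTH  512   /* hypothetical image dimensions */
#define HEIGHT 512

static unsigned int buddhabrot[WIDTH][HEIGHT];
static pthread_mutex_t locks[WIDTH][HEIGHT];   /* one mutex per counter */

static void init_locks(void)
{
    for (int col = 0; col < WIDTH; col++)
        for (int row = 0; row < HEIGHT; row++)
            pthread_mutex_init(&locks[col][row], NULL);
}

/* Called from the innermost loop of every worker thread. */
static void bump(int col, int row)
{
    pthread_mutex_lock(&locks[col][row]);
    buddhabrot[col][row]++;
    pthread_mutex_unlock(&locks[col][row]);
}
```

Two threads only ever contend when they hit the exact same pixel, but the lock array is as large as the image itself.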
Now, this is more efficient than using one global lock, of course (which would make all the threads wait for each other), but it is less memory-efficient: each mutex takes up memory of its own. I am also wondering about other repercussions of having millions of mutexes in a pthreads implementation.
I now have two other strategies to consider:
Use a less dense set of mutex locks, one per "region" of the map. For instance, a lock for [col/16][row/16] would only block a thread when it visits the same 16-pixel region as another one. The density of the locks could be adjusted dynamically. But while modeling this I started wondering whether I'm re-solving an existing problem that kernels might even implement already, and I can't find a way to do it without slowing things down. I also thought about "trees of mutexes", but all of this is just too slow inside this loop (as an indication: after optimizing the order of some math operations behind the compiler's back I could squeeze out about 30% more processor time). Is there a name for this topic? How do I look for more information on "mutex density planning"?
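For what it's worth, this pattern is commonly called lock striping. A minimal sketch of the region idea, again assuming hypothetical WIDTH/HEIGHT and a fixed 16x16 granularity:

```c
#include <pthread.h>

#define WIDTH  512   /* hypothetical image dimensions */
#define HEIGHT 512
#define REGION 16    /* lock granularity: one mutex per 16x16 block */

static unsigned int buddhabrot[WIDTH][HEIGHT];
static pthread_mutex_t region_locks[WIDTH / REGION][HEIGHT / REGION];

static void init_region_locks(void)
{
    for (int i = 0; i < WIDTH / REGION; i++)
        for (int j = 0; j < HEIGHT / REGION; j++)
            pthread_mutex_init(&region_locks[i][j], NULL);
}

static void bump(int col, int row)
{
    /* Threads collide only when working inside the same block,
       but the lock table shrinks by a factor of REGION*REGION. */
    pthread_mutex_t *m = &region_locks[col / REGION][row / REGION];
    pthread_mutex_lock(m);
    buddhabrot[col][row]++;
    pthread_mutex_unlock(m);
}
```

The trade-off is false contention: two threads touching different pixels in the same block still serialize.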
Copy the memory for each thread so I don't need to mutex around it at all. This is even more memory-inefficient, but it would avoid having millions of mutexes with unknown repercussions.
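A sketch of that second strategy (private per-thread histograms merged once at the end; WIDTH/HEIGHT and the single illustrative increment are placeholders for the real render loop):

```c
#include <pthread.h>
#include <stdlib.h>

#define WIDTH  512   /* hypothetical image dimensions */
#define HEIGHT 512

static unsigned int buddhabrot[WIDTH][HEIGHT];          /* final image */
static pthread_mutex_t merge_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    (void)arg;
    /* Private histogram: increments here need no locking at all. */
    unsigned int (*local)[HEIGHT] = calloc(WIDTH, sizeof *local);
    if (!local)
        return NULL;

    /* The render loop would go here; illustrative increment: */
    local[10][20]++;

    /* One merge at the end; this is the only synchronized section. */
    pthread_mutex_lock(&merge_lock);
    for (int col = 0; col < WIDTH; col++)
        for (int row = 0; row < HEIGHT; row++)
            buddhabrot[col][row] += local[col][row];
    pthread_mutex_unlock(&merge_lock);

    free(local);
    return NULL;
}
```

Memory cost is one full histogram per thread, but the hot loop runs completely lock-free.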
So, is there anything else, anything better I could do?
You can use atomic increment functions such as InterlockedIncrement from <intrin.h> on Windows platforms:
#include <intrin.h>

/* Ask MSVC to emit the interlocked operations as intrinsics. */
#pragma intrinsic(_InterlockedExchangeAdd, _InterlockedIncrement, _InterlockedDecrement, _InterlockedCompareExchange, _InterlockedExchange)
#define InterlockedExchangeAdd     _InterlockedExchangeAdd
#define InterlockedIncrement       _InterlockedIncrement
#define InterlockedDecrement       _InterlockedDecrement
#define InterlockedCompareExchange _InterlockedCompareExchange
#define InterlockedExchange        _InterlockedExchange

/* Other common intrinsics can be enabled the same way: */
#pragma intrinsic(abs, fabs, labs, memcmp, memcpy, memset, strcat, strcmp, strcpy, strlen)
#pragma intrinsic(acos, cosh, pow, tanh, asin, fmod, sinh)
#pragma intrinsic(atan, exp, log10, sqrt, atan2, log, sin, tan, cos)
The increment is atomic, so there is no need for millions of mutexes or a global lock on your matrix.
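Since you are using pthreads, you are probably not on Windows; the portable equivalent is C11's <stdatomic.h> (or the GCC/Clang __atomic builtins on older compilers). A minimal sketch, assuming hypothetical WIDTH/HEIGHT:

```c
#include <stdatomic.h>

#define WIDTH  512   /* hypothetical image dimensions */
#define HEIGHT 512

/* atomic_uint counters: on mainstream platforms each fetch-add
   compiles to a single lock-free hardware atomic instruction. */
static atomic_uint buddhabrot[WIDTH][HEIGHT];

static void bump(int col, int row)
{
    /* Relaxed ordering suffices: the counters are independent and
       are only read after all worker threads have been joined. */
    atomic_fetch_add_explicit(&buddhabrot[col][row], 1,
                              memory_order_relaxed);
}
```

This gives you per-pixel granularity, like your per-pixel mutexes, with no extra memory beyond the counters themselves.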