I am writing a multithreaded program in which a bunch of `std::async` calls spawn a fixed number of threads that live for the duration of the entire program. Each thread works off the same `const BigData` structure on a read-only basis. The reads from `const BigData` are frequent and random, but the threads are otherwise completely independent. Can one reasonably expect perfect scaling, or should one expect a slowdown from the additional memory accesses?
EDIT: After some profiling, this seems to be the culprit:
class Point {
    friend Point operator+(const Point& lhs, const Point& rhs) noexcept {
        return Point{lhs.x + rhs.x, lhs.y + rhs.y, lhs.z + rhs.z};
    }

    friend Point operator-(const Point& lhs, const Point& rhs) noexcept {
        return Point{lhs.x - rhs.x, lhs.y - rhs.y, lhs.z - rhs.z};
    }

public:
    Point() noexcept;
    Point(const Real& x, const Real& y, const Real& z) noexcept
        : x{x}, y{y}, z{z} {}

private:
    Real x{0};
    Real y{0};
    Real z{0};
};
After refactoring my code to avoid unnecessary calls to `operator+` and `operator-`, I seem to get better scaling.
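For reference, the kind of refactor that tends to help here is replacing temporary-producing expressions like `acc = acc + p` with in-place updates, since each call to the binary `operator+` constructs a fresh `Point`. A minimal sketch of that idea (assuming `Real` is `double`; the `sum` loop is a hypothetical hot path, not the asker's actual code):

```cpp
#include <cstddef>
#include <vector>

using Real = double;  // assumption: Real is a plain floating-point type

struct Point {
    Real x{0}, y{0}, z{0};

    // In-place addition: mutates *this instead of constructing a temporary.
    Point& operator+=(const Point& rhs) noexcept {
        x += rhs.x;
        y += rhs.y;
        z += rhs.z;
        return *this;
    }
};

// Accumulate in place: no temporary Point per iteration, unlike
// `acc = acc + p` with the binary operator+.
Point sum(const std::vector<Point>& pts) noexcept {
    Point acc;
    for (const Point& p : pts) acc += p;
    return acc;
}
```

In practice compilers often optimize the temporaries away for a small trivially-copyable struct like this, so whether it matters is exactly the kind of thing profiling (as done above) has to confirm.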
Yes, there can be a slowdown. Main memory (RAM) bandwidth is limited, and if multiple cores read a lot of data quickly, you may saturate the memory bus. Maximum memory bandwidth is typically a few tens of gigabytes per second; see the specification page for your particular processor (the i9-9900K, for example, is listed at 41.6 GB/s). For illustration, eight cores each streaming data at 8 GB/s would together demand 64 GB/s, well above that limit.
In addition, all cores on one physical package share a single L3 cache, so if you read some data more than once you may see fewer cache hits as your threads evict each other's data from L3 (the largest cache).
If you want to know how much slowdown a given configuration causes, you have only one choice: test it. Also consider adding prefetch instructions if you know ahead of time which memory you are likely to need, especially when your access pattern is non-sequential.
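One way to run such a test is a small harness that launches the same read-only workload on 1, 2, 4, ... threads via `std::async` and compares wall-clock times. A rough sketch, where `read_workload` is a hypothetical stand-in for the real work (random reads from a shared array, mimicking the `const BigData` access pattern):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Stand-in for the real read-only workload: pseudo-random reads from a
// shared array, using a cheap LCG so the index sequence is unpredictable
// to the prefetcher. Returns a checksum so the reads cannot be elided.
std::uint64_t read_workload(const std::vector<std::uint64_t>& data,
                            std::size_t reads) {
    std::uint64_t h = 0, idx = 12345;
    for (std::size_t i = 0; i < reads; ++i) {
        idx = idx * 6364136223846793005ULL + 1442695040888963407ULL;
        h += data[idx % data.size()];  // random-ish read
    }
    return h;
}

// Run `nthreads` copies of the workload concurrently and return the
// elapsed wall-clock time in seconds.
double time_with_threads(const std::vector<std::uint64_t>& data,
                         unsigned nthreads, std::size_t reads) {
    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::future<std::uint64_t>> futures;
    for (unsigned t = 0; t < nthreads; ++t)
        futures.push_back(std::async(std::launch::async, read_workload,
                                     std::cref(data), reads));
    for (auto& f : futures) f.get();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `time_with_threads(data, n, reads)` for n = 1, 2, 4, 8 with `data` sized well beyond L3 (say, a few hundred MiB) shows whether the time stays roughly flat (good scaling) or grows with the thread count (memory contention). Keep the per-call work large enough that `std::async`'s thread-launch overhead is negligible.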