I am using threading in C++ on macOS for the first time.
Below is my code. (Motivation behind my code)
#include <iostream>
#include <thread>
#include <chrono>

using namespace std;
using namespace chrono;

const long long maxLimit = 2e9;

void getEvenSum(long long &sum) {
    for (int i = 0; i <= maxLimit; i += 2) {
        sum += i;
    }
}

void getOddSum(long long &sum) {
    for (int i = 1; i <= maxLimit; i += 2) {
        sum += i;
    }
}

int main() {
    auto startTime = high_resolution_clock::now();
    long long evenSum = 0, oddSum = 0;
    thread evenSumThread(getEvenSum, ref(evenSum));
    thread oddSumThread(getOddSum, ref(oddSum));
    evenSumThread.join();
    oddSumThread.join();
    auto endTime = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(endTime - startTime);
    cout << " final sum is " << evenSum << " " << oddSum << endl;
    cout << " time taken with thread :" << duration.count() / (long double)1e6 << endl;

    startTime = high_resolution_clock::now();
    evenSum = 0, oddSum = 0;
    getEvenSum(evenSum);
    getOddSum(oddSum);
    endTime = high_resolution_clock::now();
    duration = duration_cast<microseconds>(endTime - startTime);
    cout << " final sum is " << evenSum << " " << oddSum << endl;
    cout << " time taken without thread " << duration.count() / (long double)1e6 << endl;
    return 0;
}
OUTPUT :
final sum is 1000000001000000000 1000000000000000000
time taken with thread :5.01665
final sum is 1000000001000000000 1000000000000000000
time taken without thread 2.83442
The output is quite unexpected: the threaded version takes 5 seconds while the non-threaded version takes only 2.8 seconds. How?!
Solutions I have tried and failed:

g++ --std=c++11 file.cpp
g++ --std=c++11 -O3 -s -DNDEBUG file.cpp (the time changed drastically, but threading still took longer)
g++ --std=c++17 file.cpp
clang++ -std=c++11 file.cpp (got an error that clang is not present)

PS: The same code used to work when I practiced on Linux two years ago.
You should enable optimizations like -O3. On recent GCC and Clang, with -O3 both functions are completely optimized away; you can prevent that optimization by declaring the argument volatile:
void getEvenSum(volatile unsigned long long& sum) {
for (unsigned long long i = 0; i <= maxLimit; i += 2) {
sum += i;
}
}
This also converts the variables to unsigned long long to avoid signed overflow, which is undefined behavior.
The second problem is now false sharing: both counters sit on the same cache line, so every write by one thread invalidates that line in the other core's cache. One way around it is to pad the variables apart, or to align each of them to a cache-line boundary (hint: use std::hardware_destructive_interference_size, available since C++17):
alignas(64) unsigned long long evenSum = 0;
alignas(64) unsigned long long oddSum = 0;
With that, the multithreaded version is now faster than the single-threaded version. online godbolt result
By disabling those optimizations we went from the functions taking zero time to taking some time... so did we really make the code faster?!