I am trying to measure the performance of concurrent insertion in folly
hashmap. A simplified version of a program for such insertion is brought here:
#include <folly/concurrency/ConcurrentHashMap.h>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
const int kNumMutexLocks = 2003;
std::unique_ptr<std::mutex[]> mutices(new std::mutex[kNumMutexLocks]);
__inline__ void
concurrentInsertion(unsigned int threadId, unsigned int numInsertionsPerThread,
unsigned int numInsertions, unsigned int numUniqueKeys,
folly::ConcurrentHashMap<int, int> &follyMap) {
int base = threadId * numInsertionsPerThread;
for (int i = 0; i < numInsertionsPerThread; i++) {
int idx = base + i;
if (idx >= numInsertions)
break;
int val = idx;
int key = val % numUniqueKeys;
mutices[key % kNumMutexLocks].lock();
auto found = follyMap.find(key);
if (found != follyMap.end()) {
int oldVal = found->second;
if (oldVal < val) {
follyMap.assign(key, val);
}
} else {
follyMap.insert(key, val);
}
mutices[key % kNumMutexLocks].unlock();
}
}
void func(unsigned int numInsertions, float keyValRatio) {
const unsigned int numThreads = 12; // Simplified just for this post
unsigned int numUniqueKeys = numInsertions * keyValRatio;
unsigned int numInsertionsPerThread = ceil(numInsertions * 1.0 / numThreads);
std::vector<std::thread> insertionThreads;
insertionThreads.reserve(numThreads);
folly::ConcurrentHashMap<int, int> follyMap;
auto start = std::chrono::steady_clock::now();
for (int i = 0; i < numThreads; i++) {
insertionThreads.emplace_back(std::thread([&, i] {
concurrentInsertion(i, numInsertionsPerThread, numInsertions,
numUniqueKeys, follyMap);
}));
}
for (int i = 0; i < numThreads; i++) {
insertionThreads[i].join();
}
auto end = std::chrono::steady_clock::now();
auto diff = end - start;
float insertionTimeMs =
std::chrono::duration<double, std::milli>(diff).count();
std::cout << "i: " << numInsertions << "\tj: " << keyValRatio
<< "\ttime: " << insertionTimeMs << std::endl;
}
int main() {
std::vector<float> js = {0.5, 0.25};
for (auto j : js) {
std::cout << "-------------" << std::endl;
for (int i = 2048; i < 4194304 * 8; i *= 2) {
func(i, j);
}
}
}
The problem is that using this loop in main, suddenly increases the measured time in the func
function. That is, if I call the function directly from main without any loop (as shown in what follows), the measure time for some cases is suddenly more than 100X smaller.
int main() {
func(2048, 0.25); // ~ 100X faster now that the loop is gone.
}
Possible Reasons
More Details
Please note that if I unroll the loop in main, I have the same issue. That is, the following program has the same problem:
int main() {
performComputation(input A);
...
performComputation(input Z);
}
Sample Output
The output of the first program is shown here:
i: 2048 j: 0.5 time: 1.39932
...
i: 16777216 j: 0.5 time: 3704.33
-------------
i: 2048 j: 0.25 time: 277.427 <= sudden increase in execution time
i: 4096 j: 0.25 time: 157.236
i: 8192 j: 0.25 time: 50.7963
i: 16384 j: 0.25 time: 133.151
i: 32768 j: 0.25 time: 8.75953
...
i: 2048 j: 0.25 time: 162.663
Running the func
alone in main with i=2048
and j=0.25
yields:
i: 2048 j: 0.25 time: 1.01
Any comment/insight is highly appreciated.
I can't be too confident on the exact details, but it seems to me to be a result of memory allocation in building the map. I replicated the behaviour you're seeing using a plain unordered_map
and a single mutex
, and making the map object in func
static
fixed it entirely. (Actually now it's slightly slower the first time around, since no memory has been allocated for the map yet, and then faster and a consistent time every subsequent run.)
I'm not sure why this makes a difference, since the map has been destructed and the memory should have been freed. For some reason it seems the map's freed memory isn't reused on subsequent calls to func
. Perhaps someone else more knowledgeable than I can elaborate on this.
Edit: reduced minimal, reproducible example and output
void func(int num_insertions)
{
const auto start = std::chrono::steady_clock::now();
std::unordered_map<int, int> map;
for (int i = 0; i < num_insertions; ++i)
{
map.emplace(i, i);
}
const auto end = std::chrono::steady_clock::now();
const auto diff = end - start;
const auto time = std::chrono::duration<double, std::milli>(diff).count();
std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
}
int main()
{
func(2048);
func(16777216);
func(2048);
}
With non-static map:
i: 2048 time: 0.6035
i: 16777216 time: 4629.03
i: 2048 time: 124.44
With static map:
i: 2048 time: 0.6524
i: 16777216 time: 4828.6
i: 2048 time: 0.3802
Another edit: should also mention that the static version also requires a call to map.clear()
at the end, though that's not really relevant to the question of the performance of the insertions.