This week I've been trying to track down an unusual memory behavior: when I run my code with a single thread it has a certain memory footprint, but when I run it with multiple threads the footprint increases a lot for no apparent reason.
I know strange things can happen due to the non-deterministic nature of a multithreaded program, so it's better to debug this with a deterministic approach. However, even in a fully deterministic setting I failed to see why the memory footprint increases.
Here's the smallest program I've been able to write that reproduces the problem:
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <openssl/sha.h>
#include <sparsehash/sparse_hash_map>
#include <thread>

struct mystruct
{
    uint64_t values[16]; // 128-byte key

    inline bool operator==(const mystruct &other) const {
        return !memcmp(values, other.values, sizeof(values));
    }
    inline bool operator!=(const mystruct &other) const {
        return !(*this == other);
    }
};

namespace std {
template<>
class hash<mystruct> {
public:
    inline size_t operator()(const mystruct &s) const
    {
        return s.values[0];
    }
};
}

google::sparse_hash_map<mystruct, uint64_t, std::hash<mystruct>> h;

const uint64_t N = 1ull * 1024ull * 1024ull * 1024ull;
const int step = 128;
volatile uint64_t next = 0; // the key whose turn it is to be inserted
uint64_t total_keys = 0;
const uint64_t THREADS = 1;

void insert_data(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char *)&data, 8, buf); // 32-byte digest of the 8-byte counter
        memcpy(&b, buf, sizeof(mystruct));      // build the 128-byte key from buf
        while (next != data)                    // spin until it's this thread's turn
            ;
        h[b] = 1;
        total_keys++;
        next += step;                           // hand over to the next key's owner
        data += THREADS * step;
    }
}

int main() {
    std::thread t[THREADS];
    for (uint64_t i = 0; i < THREADS; i++) {
        t[i] = std::thread(insert_data, i * step);
    }
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    while (total_keys < N) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
        uint64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
        double speed = 1000.0 * (double)total_keys / (double)elapsed_ms;
        std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms
                  << " ms (" << std::setprecision(0) << speed << " keys/s)"
                  << ", next key is " << next << std::endl;
    }
    for (uint64_t i = 0; i < THREADS; i++) {
        t[i].join();
    }
}
The idea of the program is dead simple:
- derive a deterministic sequence of 128-byte keys by hashing the counters 0, 128, 256, ... with SHA256;
- split the sequence across THREADS threads, each inserting its slice into a shared hashtable;
- force a single global insertion order, so every run inserts exactly the same keys in exactly the same order;
- let the main thread print the progress until the OS kills the process for eating too much memory.
So the goal is to verify that, for a predefined sequence of inserts, the memory consumption is the same regardless of how many threads push the data into the hashtable; that is, the number of keys inserted before the OS kills the process should be roughly the same (I don't mind a few keys of difference, e.g. due to the exact moment the OS decides to kill the application). Well, the difference can be huge.
The insert_data function uses a simple spin loop to guarantee the insertion sequence, that is, the numbers 0, 128, 256, and so on (using a lock/mutex instead makes no difference; a sketch of that variant follows).
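For reference, here is a minimal sketch of one possible lock-based variant. It is illustrative only: insert_data_locked, m, and cv are names made up for this snippet.

#include <condition_variable>
#include <mutex>

std::mutex m;
std::condition_variable cv;

void insert_data_locked(uint64_t data) {
    unsigned char buf[256];
    mystruct b;
    for (uint64_t j = 0; j < N / THREADS; j++) {
        SHA256((unsigned char *)&data, 8, buf);
        memcpy(&b, buf, sizeof(mystruct));
        {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return next == data; }); // block until it's this thread's turn
            h[b] = 1;
            total_keys++;
            next += step;
        }
        cv.notify_all(); // wake whichever thread owns the next slot
        data += THREADS * step;
    }
}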
I compiled the program with the -O3 flag, and these are the results when the above code runs with different values of the THREADS variable:
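(For reference, a plausible build line, assuming g++, the header-only sparsehash package, and OpenSSL's libcrypto; the file name is made up:

g++ -O3 -std=c++11 -pthread repro.cpp -lcrypto

)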
THREADS=1:
Processed 54241 items in 100 ms (542410 keys/s), next key is 6942848
Processed 104857 items in 200 ms (524285 keys/s), next key is 13421696
...
Processed 13410458 items in 35698 ms (375664 keys/s), next key is 1716538624
Processed 13421773 items in 35799 ms (374920 keys/s), next key is 1717986944
...
... 6 seconds of stalled output while the hashtable is being resized ...
...
Processed 13421773 items in 41245 ms (325416 keys/s), next key is 1717986944
Processed 13438143 items in 41345 ms (325025 keys/s), next key is 1720082432
...
Processed 23580346 items in 65567 ms (359637 keys/s), next key is 3018284416
Killed
THREADS=2:
Processed 69922 items in 101 ms (692297 keys/s), next key is 8950016
Processed 121766 items in 202 ms (602802 keys/s), next key is 15586048
...
Processed 13409271 items in 37098 ms (361455 keys/s), next key is 1716386688
Processed 13421773 items in 37198 ms (360820 keys/s), next key is 1717986944
...
... 6 seconds of stalled output while the hashtable is being resized ...
...
Processed 13421773 items in 43204 ms (310660 keys/s), next key is 1717986944
Processed 13443021 items in 43304 ms (310434 keys/s), next key is 1720706688
...
Processed 20024294 items in 64129 ms (312250 keys/s), next key is 2563110656
Killed
THREADS=4:
Processed 31 items in 100 ms (310 keys/s), next key is 3968
Processed 33882 items in 200 ms (169410 keys/s), next key is 4336896
...
Processed 13399262 items in 48949 ms (273739 keys/s), next key is 1715105536
Processed 13421773 items in 49049 ms (273640 keys/s), next key is 1717986944
...
... 5 seconds of stalled output while the hashtable is being resized ...
...
Processed 13421773 items in 54091 ms (248133 keys/s), next key is 1717986944
Processed 13421773 items in 54201 ms (247630 keys/s), next key is 1717986944
Killed
As you can see, all three tests identify the same key, 1717986944, as the one that triggers the hashtable resize, and in all of them the resize happens at the same inserted-key count, 13421773. This confirms that the inserts always happen in strictly the same order regardless of the number of running threads.
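(That number itself is a good sanity check: assuming sparse_hash_map's default maximum occupancy of 80%, a table of 2^24 = 16777216 buckets tolerates 0.8 * 2^24 = 13421772.8 entries, so insert number 13421773 is exactly the one that forces a resize to 2^25 buckets.)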
However, before being killed, the THREADS=1 run inserted 3556052 more keys (corresponding to 434 MB) than the THREADS=2 run, and 6602521 more keys (corresponding to 805 MB) than the THREADS=4 run.
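(The MB figures follow directly from the key size: each mystruct key is 16 * 8 = 128 bytes, so 3556052 * 128 bytes is roughly 434 MiB and 6602521 * 128 bytes is roughly 805 MiB, not counting the 8-byte values and the table overhead.)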
Can anyone explain why I get such weird memory consumption even under these deterministic conditions? Is there something I'm actually missing?
Posting this as an answer since I was able to understand what's going on.
After a good amount of time and debugging I found out that the root cause of the memory consumption is a pathological alloc/dealloc pattern that produces heap fragmentation; there are no memory leaks whatsoever. I was able to reproduce the problem with a custom hashtable implementation too.
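To illustrate the mechanism in isolation, here is a minimal standalone sketch of heap fragmentation. This is not the original program, and malloc_stats/malloc_trim are glibc-specific:

#include <cstdio>
#include <cstdlib>
#include <malloc.h> // malloc_stats, malloc_trim (glibc)
#include <vector>

int main() {
    std::vector<void *> blocks;
    blocks.reserve(1000000);

    // Allocate a large number of small blocks; they are carved
    // out of the heap back to back.
    for (int i = 0; i < 1000000; i++)
        blocks.push_back(malloc(64));

    // Free every other block: roughly half the heap is now free,
    // but every page still contains live blocks, so the allocator
    // cannot return the freed memory to the OS.
    for (size_t i = 0; i < blocks.size(); i += 2)
        free(blocks[i]);

    printf("--- after freeing every other block ---\n");
    malloc_stats(); // "in use bytes" is about half of "system bytes"

    malloc_trim(0); // releases only whole free pages, which fragmentation prevents here
    printf("--- after malloc_trim(0) ---\n");
    malloc_stats(); // "system bytes" barely shrinks

    for (size_t i = 1; i < blocks.size(); i += 2)
        free(blocks[i]);
    return 0;
}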
As weird as it is, in both cases adding a malloc_trim(0); at the end of the main loop, right after the std::cout statement that prints the progress, compensates a bit and allows the program to reclaim more memory.
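Concretely, the monitor loop becomes the following (malloc_trim is glibc-specific and declared in <malloc.h>):

#include <malloc.h> // for malloc_trim (glibc)

while (total_keys < N) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    uint64_t elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count();
    double speed = 1000.0 * (double)total_keys / (double)elapsed_ms;
    std::cout << std::fixed << "Processed " << total_keys << " items in " << elapsed_ms
              << " ms (" << std::setprecision(0) << speed << " keys/s)"
              << ", next key is " << next << std::endl;
    malloc_trim(0); // ask glibc to return free heap pages to the OS
}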