I've read the documentation about this parameter, but the difference is really huge! When enabled, the memory usage of a simple program (see below) is about 7 GB and when it's disabled, the reported usage is about 160 KB.
top
also shows about 7 GB, which kinda confirms the result with pages-as-heap=yes
.
(I have a theory, but I don't believe it would explain such huge difference, so - asking for some help).
What especially bothers me, is that most of the reported memory usage is used by std::string
, while what?
is never printed (meaning - the actual capacity is pretty small).
I do need to use pages-as-heap=yes
while profiling my app, I just wonder how to avoid the "false positives"
The code snippet:
#include <iostream>
#include <thread>
#include <vector>
#include <chrono>
void run()
{
while (true)
{
std::string s;
s += "aaaaa";
s += "aaaaaaaaaaaaaaa";
s += "bbbbbbbbbb";
s += "cccccccccccccccccccccccccccccccccccccccccccccccccc";
if (s.capacity() > 1024) std::cout << "what?" << std::endl;
std::this_thread::sleep_for(std::chrono::seconds(1));
}
}
int main()
{
std::vector<std::thread> workers;
for( unsigned i = 0; i < 192; ++i ) workers.push_back(std::thread(&run));
workers.back().join();
}
Compiled with: g++ --std=c++11 -fno-inline -g3 -pthread
With pages-as-heap=yes
:
100.00% (7,257,714,688B) (page allocation syscalls) mmap/mremap/brk, --alloc-fns, etc.
->99.75% (7,239,757,824B) 0x54E0679: mmap (mmap.c:34)
| ->53.63% (3,892,314,112B) 0x545C3CF: new_heap (arena.c:438)
| | ->53.63% (3,892,314,112B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->53.63% (3,892,314,112B) 0x5463248: malloc (malloc.c:2911)
| | ->53.63% (3,892,314,112B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x401252: run() (test.cpp:11)
| | ->53.63% (3,892,314,112B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->53.63% (3,892,314,112B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->53.63% (3,892,314,112B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->53.63% (3,892,314,112B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->53.63% (3,892,314,112B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->53.63% (3,892,314,112B) 0x54E63DB: clone (clone.S:109)
| |
| ->35.14% (2,550,136,832B) 0x545C35B: new_heap (arena.c:427)
| | ->35.14% (2,550,136,832B) 0x545CC1F: arena_get2.part.3 (arena.c:646)
| | ->35.14% (2,550,136,832B) 0x5463248: malloc (malloc.c:2911)
| | ->35.14% (2,550,136,832B) 0x4CB7E76: operator new(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF8E37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x4CF9FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x401252: run() (test.cpp:11)
| | ->35.14% (2,550,136,832B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->35.14% (2,550,136,832B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->35.14% (2,550,136,832B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->35.14% (2,550,136,832B) 0x4CE2C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->35.14% (2,550,136,832B) 0x51C96B8: start_thread (pthread_create.c:333)
| | ->35.14% (2,550,136,832B) 0x54E63DB: clone (clone.S:109)
| |
| ->10.99% (797,306,880B) 0x51CA1D4: pthread_create@@GLIBC_2.2.5 (allocatestack.c:513)
| ->10.99% (797,306,880B) 0x4CE2DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x4CE2ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->10.99% (797,306,880B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->10.99% (797,306,880B) 0x401353: main (test.cpp:24)
|
->00.25% (17,956,864B) in 1+ places, all below ms_print's threshold (01.00%)
while with pages-as-heap=no
:
96.38% (159,289B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->43.99% (72,704B) 0x4EBAEFE: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->43.99% (72,704B) 0x40106B8: call_init.part.0 (dl-init.c:72)
| ->43.99% (72,704B) 0x40107C9: _dl_init (dl-init.c:30)
| ->43.99% (72,704B) 0x4000C68: ??? (in /lib/x86_64-linux-gnu/ld-2.23.so)
|
->33.46% (55,296B) 0x40138A3: _dl_allocate_tls (dl-tls.c:322)
| ->33.46% (55,296B) 0x53D126D: pthread_create@@GLIBC_2.2.5 (allocatestack.c:588)
| ->33.46% (55,296B) 0x4EE9DC1: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>, void (*)()) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x4EE9ECB: std::thread::_M_start_thread(std::shared_ptr<std::thread::_Impl_base>) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->33.46% (55,296B) 0x401BEA: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->33.46% (55,296B) 0x401353: main (test.cpp:24)
|
->12.12% (20,025B) 0x4EFFE37: std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00C69: std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00D22: std::string::reserve(unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.12% (20,025B) 0x4F00FB1: std::string::append(char const*, unsigned long) (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| ->12.07% (19,950B) 0x401285: run() (test.cpp:14)
| | ->12.07% (19,950B) 0x403929: void std::_Bind_simple<void (*())()>::_M_invoke<>(std::_Index_tuple<>) (functional:1700)
| | ->12.07% (19,950B) 0x403864: std::_Bind_simple<void (*())()>::operator()() (functional:1688)
| | ->12.07% (19,950B) 0x4037D2: std::thread::_Impl<std::_Bind_simple<void (*())()> >::_M_run() (thread:115)
| | ->12.07% (19,950B) 0x4EE9C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| | ->12.07% (19,950B) 0x53D06B8: start_thread (pthread_create.c:333)
| | ->12.07% (19,950B) 0x56ED3DB: clone (clone.S:109)
| |
| ->00.05% (75B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.58% (9,216B) 0x40315B: __gnu_cxx::new_allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >::allocate(unsigned long, void const*) (new_allocator.h:104)
| ->05.58% (9,216B) 0x402FC2: std::allocator_traits<std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> > >::allocate(std::allocator<std::_Sp_counted_ptr_inplace<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, (__gnu_cxx::_Lock_policy)2> >&, unsigned long) (alloc_traits.h:488)
| ->05.58% (9,216B) 0x402D4B: std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::thread::_Impl<std::_Bind_simple<void (*())()> >*, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:616)
| ->05.58% (9,216B) 0x402BDE: std::__shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> >, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr_base.h:1090)
| ->05.58% (9,216B) 0x402A76: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > >::shared_ptr<std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::_Sp_make_shared_tag, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:316)
| ->05.58% (9,216B) 0x402771: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::allocate_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > >, std::_Bind_simple<void (*())()> >(std::allocator<std::thread::_Impl<std::_Bind_simple<void (*())()> > > const&, std::_Bind_simple<void (*())()>&&) (shared_ptr.h:594)
| ->05.58% (9,216B) 0x402325: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::make_shared<std::thread::_Impl<std::_Bind_simple<void (*())()> >, std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (shared_ptr.h:610)
| ->05.58% (9,216B) 0x401F9C: std::shared_ptr<std::thread::_Impl<std::_Bind_simple<void (*())()> > > std::thread::_M_make_routine<std::_Bind_simple<void (*())()> >(std::_Bind_simple<void (*())()>&&) (thread:196)
| ->05.58% (9,216B) 0x401BC4: std::thread::thread<void (*)()>(void (*&&)()) (thread:138)
| ->05.58% (9,216B) 0x401353: main (test.cpp:24)
|
->01.24% (2,048B) 0x402C9A: __gnu_cxx::new_allocator<std::thread>::allocate(unsigned long, void const*) (new_allocator.h:104)
->01.24% (2,048B) 0x402AF5: std::allocator_traits<std::allocator<std::thread> >::allocate(std::allocator<std::thread>&, unsigned long) (alloc_traits.h:488)
->01.24% (2,048B) 0x402928: std::_Vector_base<std::thread, std::allocator<std::thread> >::_M_allocate(unsigned long) (stl_vector.h:170)
->01.24% (2,048B) 0x40244E: void std::vector<std::thread, std::allocator<std::thread> >::_M_emplace_back_aux<std::thread>(std::thread&&) (vector.tcc:412)
->01.24% (2,048B) 0x40206D: void std::vector<std::thread, std::allocator<std::thread> >::emplace_back<std::thread>(std::thread&&) (vector.tcc:101)
->01.24% (2,048B) 0x401C82: std::vector<std::thread, std::allocator<std::thread> >::push_back(std::thread&&) (stl_vector.h:932)
->01.24% (2,048B) 0x401366: main (test.cpp:24)
Please ignore the crappy handling of the threads, it's just a very short example.
It appears, that this is not related to std::string
at all. As @Lawrence suggested, this can be reproduced by simply allocating a single int
on the heap (with new
). I believe @Lawrence is very close to the real answer here, quoting his comments (easier for further readers):
Lawrence:
@KirilKirov The string allocation is not actually taking that much space... Each thread gets it's initial stack and then heap access maps some large amount of space (around 70m) that gets inaccurately reflected. You can measure it by just declaring 1 string and then having a spin loop... the same virtual memory usage is shown – Lawrence Sep 28 at 14:51
me:
@Lawrence - you're damn right! OK, so, you're saying (and it appears to be like this), that on each thread, on the first heap allocation, the memory manager (or the OS, or whatever) dedicates huge chunk of memory for the threads' heap needs? And this chunk will be reused later (or shrinked, if necessary)? – Kiril Kirov Sep 28 at 15:45
Lawrence:
@KirilKirov something of that nature... exact allocations probably depends on malloc implementation and whatnot – Lawrence 2 days ago
massif
with --pages-as-heap=yes
and the top
column you are observing both measure the virtual memory used by a process. This virtual memory includes all space mmap
'd in the implementation of malloc and during the creation of threads. For example, the default stack size for a thread will be 8192k
which is reflected in the creation of each thread and contributes to the virtual memory footprint.
The specific allocation scheme will be dependent on implementation but it seems that the first heap allocation on a new thread will mmap
roughly 65 megabytes of space. This can be viewed by looking at the pmap output for a process.
Excerpt from a very similar program to the example:
75170: ./a.out
0000000000400000 24K r-x-- a.out
0000000000605000 4K r---- a.out
0000000000606000 4K rw--- a.out
0000000001b6a000 200K rw--- [ anon ]
00007f669dfa4000 4K ----- [ anon ]
00007f669dfa5000 8192K rw--- [ anon ]
00007f669e7a5000 4K ----- [ anon ]
00007f669e7a6000 8192K rw--- [ anon ]
00007f669efa6000 4K ----- [ anon ]
00007f669efa7000 8192K rw--- [ anon ]
...
00007f66cb800000 8192K rw--- [ anon ]
00007f66cc000000 132K rw--- [ anon ]
00007f66cc021000 65404K ----- [ anon ]
00007f66d0000000 132K rw--- [ anon ]
00007f66d0021000 65404K ----- [ anon ]
00007f66d4000000 132K rw--- [ anon ]
00007f66d4021000 65404K ----- [ anon ]
...
00007f6880586000 8192K rw--- [ anon ]
00007f6880d86000 1056K r-x-- libm-2.23.so
00007f6880e8e000 2044K ----- libm-2.23.so
...
00007f6881c08000 4K r---- libpthread-2.23.so
00007f6881c09000 4K rw--- libpthread-2.23.so
00007f6881c0a000 16K rw--- [ anon ]
00007f6881c0e000 152K r-x-- ld-2.23.so
00007f6881e09000 24K rw--- [ anon ]
00007f6881e33000 4K r---- ld-2.23.so
00007f6881e34000 4K rw--- ld-2.23.so
00007f6881e35000 4K rw--- [ anon ]
00007ffe9d75b000 132K rw--- [ stack ]
00007ffe9d7f8000 12K r---- [ anon ]
00007ffe9d7fb000 8K r-x-- [ anon ]
ffffffffff600000 4K r-x-- [ anon ]
total 7815008K
It seems that malloc becomes more conservative as you approach some threshold of virtual memory per process. Also, my comment about libraries being mapped separately was misguided (they should be shared per process)