According to Valgrind's memcheck tool, if I allocate a large local variable in a function and launch that function using multiprocessing.Pool().apply_async(), the heap size of both the subprocess and the main process increases. Why does main's heap size increase?
I am working with a multiprocessing pool of workers, each of which will be dealing with a large amount of data from an input file. I want to see how my memory footprint scales based on the size of the input file. To do this, I ran my script under Valgrind using memcheck with the technique described in this SO answer. (I have since learned that Valgrind's Massif tool is better suited for this, so I will use it instead going forward.)
There was something that seemed odd in the memcheck output that I would like help understanding.
I am using CPython 2.7.6 on Red Hat Linux, and running memcheck like this:
valgrind --tool=memcheck --suppressions=./valgrind-python.supp python test.py
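For reference, the Massif run I mentioned would look roughly like this (just the standard invocation; massif.out.<pid> stands for whatever output file Massif writes):

valgrind --tool=massif python test.py
ms_print massif.out.<pid>

test.py itself is just: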
import multiprocessing
def mem_user():
    tmp = 'a'*1
    return

pool = multiprocessing.Pool(processes=1)
pool.apply_async(mem_user)
pool.close()
pool.join()
Heap Summaries (one per process):
total heap usage: 45,193 allocs, 32,392 frees, 7,221,910 bytes allocated
total heap usage: 44,832 allocs, 22,006 frees, 7,181,635 bytes allocated
If I change the tmp = 'a'*1 line to tmp = 'a'*10000000, I get these summaries:
total heap usage: 44,835 allocs, 22,009 frees, 27,181,763 bytes allocated
total heap usage: 45,195 allocs, 32,394 frees, 17,221,998 bytes allocated
Why do the heap sizes of both processes increase? I understand that space for objects is allocated on the heap, so the larger heap certainly makes sense for one of the processes. But I expected a subprocess to be given its own heap, stack, and instance of the interpreter, so I don't understand why a local variable allocated in the subprocess increased main's heap size as well. If they share the same heap, then does CPython implement its own version of fork() that doesn't allocate unique heap space to the subprocess?
The problem has nothing to do with how fork is implemented. You can see for yourself that multiprocessing calls os.fork, which is a very thin wrapper around fork.
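One quick way to check that yourself, assuming a Unix build of CPython 2.7 (where the fork-based Popen lives in multiprocessing.forking):

>>> import inspect
>>> import multiprocessing.forking
>>> # The Unix Popen.__init__ starts the worker with a plain os.fork() call.
>>> print inspect.getsource(multiprocessing.forking.Popen.__init__)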
So, what is going on?
The compiler sees the 'a' * 10000000 in your source code and folds it into a literal of 10000000 characters. That means the compiled module is now 10000000 bytes bigger, and since it's imported in both processes, they both grow by that much.
To see this:
$ python2.7
>>> def f():
... temp = 'a' * 10
...
>>> f.__code__.co_consts
(None, 'a', 10, 'aaaaaaaaaa')
>>> import dis
>>> dis.dis(f)
2 0 LOAD_CONST 3 ('aaaaaaaaaa')
3 STORE_FAST 0 (temp)
6 LOAD_CONST 0 (None)
9 RETURN_VALUE
Notice that the compiler is smart enough to add 'aaaaaaaaaa' to the constants, but not smart enough to also remove 'a' and 10. That's because it uses a very narrow peephole optimizer. Besides the fact that it doesn't know whether you're also using 'a' somewhere else in the same function, it doesn't want to remove a value from the middle of the co_consts list and go back and update every other bytecode to use the shifted-up indices.
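If you want to keep the big string out of the compiled module, one option is to build it from a runtime value; the peephole optimizer only folds expressions whose operands are all constants, so multiplying by a variable can't be folded at compile time. A rough sketch of the change to mem_user:

# 'length' is an ordinary variable, so 'a' * length can't be folded into a
# constant; the big string is only created when the worker calls mem_user.
length = 10000000

def mem_user():
    tmp = 'a' * length
    return

With that version the literal never lands in co_consts, so importing the module doesn't carry the 10000000 bytes into either process.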
I don't actually know why the child ends up growing by 20000000 bytes instead of 10000000. Presumably it's ending up with its own copy of the module, or at least the code object, instead of using the copy shared from the parent. But if I try to print id(f.__code__) or anything else, I get the same values in the parent and child, so…
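For reference, the check described in the last paragraph amounts to something like this (a minimal sketch, not the exact code that was run):

import os

def f():
    temp = 'a' * 10000000

# Fork and print the code object's id in both processes; as noted above,
# the value comes out the same in the parent and the child.
pid = os.fork()
print os.getpid(), id(f.__code__)
if pid:
    os.waitpid(pid, 0)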