Recently I started having some problems with Django (3.1) tests, which I finally tracked down to some kind of memory leak.
I normally run my suite (roughly 4000 tests at the moment) with `--parallel=4`, which results in a high memory watermark of roughly 3GB (starting from 500MB or so).
For auditing purposes, though, I occasionally run it with `--parallel=1` - when I do this, memory usage keeps increasing, ending up over the VM's allocated 6GB.
I spent some time looking at the data, and it became clear that the culprit is, somehow, WebTest - more specifically its `response.html` and `response.forms`: each call during a test case can allocate a few MB (two or three, generally) that don't get released at the end of the test method or, more importantly, even at the end of the `TestCase`.
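For context, this is the kind of access pattern the tests rely on (a minimal sketch based on django-webtest's documented `WebTest` base class; the URL, form id and field are made up):

```python
from django_webtest import WebTest


class SignupFlowTests(WebTest):
    def test_signup_form_errors(self):
        # self.app.get() returns a webtest TestResponse: .html parses the
        # body with BeautifulSoup, .forms builds a Form object per <form>.
        response = self.app.get('/signup/')        # hypothetical URL
        form = response.forms['signup-form']       # hypothetical form id
        form['email'] = 'not-an-email'             # hypothetical field
        response = form.submit()
        self.assertIn('valid email', response.html.text)
```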
I've tried everything I could think of: `gc.collect()` with `gc.DEBUG_LEAK` shows me a whole lot of collectable items, but it frees no memory at all; using `delattr()` on various `TestCase` and `TestResponse` attributes resulted in no change at all; and so on.
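For reference, the gc experiment was roughly this (a sketch; where exactly to call it from in the test lifecycle is up to you):

```python
import gc

# DEBUG_LEAK = DEBUG_COLLECTABLE | DEBUG_UNCOLLECTABLE | DEBUG_SAVEALL:
# objects found by the collector are reported on stderr and kept in
# gc.garbage for inspection instead of being silently freed.
gc.set_debug(gc.DEBUG_LEAK)


def report_garbage(label):
    """Force a full collection and show how much collectable garbage piled up."""
    collected = gc.collect()
    print(f"{label}: collected={collected}, gc.garbage={len(gc.garbage)}")

# e.g. call report_garbage(self.id()) from tearDown() of a base TestCase
```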
I'm quite literally at my wits' end, so any pointer to solving this (besides editing the thousand or so tests that use WebTest responses, which is really not feasible) would be very much appreciated.
(Please note that I also tried using `guppy`, `tracemalloc` and `memory_profiler`, but none of them gave me any kind of actionable information.)
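For completeness, the tracemalloc attempt looked more or less like this (a sketch; where to place the snapshots in the runner is up to you):

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames so the tracebacks are meaningful
baseline = tracemalloc.take_snapshot()

# ... run a test method / a whole TestCase here ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.compare_to(baseline, 'lineno')[:10]:
    print(stat)  # ten biggest per-line allocation deltas since the baseline
```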
I found that one of our EC2 testing instances isn't affected by the problem, so I spent some more time trying to figure this out.
Initially, I tried to find the "sensible" potential causes - for instance, the cached template loader, which was enabled on my local VM and disabled on the EC2 instance - without success.
Then I went all in: I replicated the EC2 virtualenv (with `pip freeze`) and the settings (copying the dotenv), and checked out the same commit that was running normally on the EC2 instance.
Et voilà! THE MEMORY LEAK IS STILL THERE!
Now I'm officially giving up and will use `--parallel=2` for future test runs, until some absolute guru can point me in the right direction.
And now the memory leak is there even with `--parallel=2`. I guess that's somehow better, since it looks increasingly like a system problem rather than an application problem. It doesn't solve anything, but at least I know it's not my fault.
Thanks to Tim Boddy's reply to this question, I tried using `chap` to figure out what's making memory grow. Unfortunately I can't properly "read" the results, but it looks like some non-Python library is actually causing the problem.
So, this is what I've seen analyzing the core after a few minutes running the tests that I know cause the leak:
chap> summarize writable
49 ranges take 0x1e0aa000 bytes for use: unknown
1188 ranges take 0x12900000 bytes for use: python arena
1 ranges take 0x4d1c000 bytes for use: libc malloc main arena pages
7 ranges take 0x3021000 bytes for use: stack
139 ranges take 0x476000 bytes for use: used by module
1384 writable ranges use 0x38b5d000 (951,439,360) bytes.
chap> count used
3144197 allocations use 0x14191ac8 (337,189,576) bytes.
The interesting point is that the non-leaking EC2 instance shows pretty much the same values as the ones I get from `count used` - which would suggest that those "unknown" ranges are the actual hogs. This is also supported by the output of `summarize used` (showing only the first few lines):
Unrecognized allocations have 886033 instances taking 0x8b9ea38(146,401,848) bytes.
Unrecognized allocations of size 0x130 have 148679 instances taking 0x2b1ac50(45,198,416) bytes.
Unrecognized allocations of size 0x40 have 312166 instances taking 0x130d980(19,978,624) bytes.
Unrecognized allocations of size 0xb0 have 73886 instances taking 0xc66ca0(13,003,936) bytes.
Unrecognized allocations of size 0x8a8 have 3584 instances taking 0x793000(7,942,144) bytes.
Unrecognized allocations of size 0x30 have 149149 instances taking 0x6d3d70(7,159,152) bytes.
Unrecognized allocations of size 0x248 have 10137 instances taking 0x5a5508(5,920,008) bytes.
Unrecognized allocations of size 0x500018 have 1 instances taking 0x500018(5,242,904) bytes.
Unrecognized allocations of size 0x50 have 44213 instances taking 0x35f890(3,537,040) bytes.
Unrecognized allocations of size 0x458 have 2969 instances taking 0x326098(3,301,528) bytes.
Unrecognized allocations of size 0x205968 have 1 instances taking 0x205968(2,120,040) bytes.
The size of those single-instance allocations is very similar to the kind of deltas I see if I add calls to `resource.getrusage(resource.RUSAGE_SELF).ru_maxrss` in my test runner when starting/stopping tests - but they're not recognized as Python allocations, hence my feeling that the leak lives outside Python.
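The per-test bookkeeping I'm referring to is roughly this (a sketch using a hypothetical base `TestCase`; note that on Linux `ru_maxrss` is reported in kilobytes and only ever grows, so what you see is the high-water mark):

```python
import resource

from django.test import TestCase


class MemoryWatchingTestCase(TestCase):
    """Log how much the process memory high-water mark grew during each test."""

    def setUp(self):
        super().setUp()
        self._rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    def tearDown(self):
        rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        delta_kb = rss_after - self._rss_before
        if delta_kb > 1024:  # only report tests that pushed the watermark up by >1MB
            print(f"{self.id()}: ru_maxrss grew by {delta_kb} KB")
        super().tearDown()
```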
First of all, a huge apology: I was mistaken in thinking WebTest was the cause of this; the reason was indeed in my own code, not in the libraries or anything else.
The real cause was a mixin class where I, unthinkingly, added a dict as a class attribute, like this:
```python
class MyMixin:
    errors = dict()
```
Since this mixin is used in a few forms, and the tests generate a fair amount of form errors (which are added to the dict), this ended up hogging memory.
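The underlying pitfall is Python's mutable-class-attribute behaviour: the dict is created once, when the class body is executed, and is shared by every instance of every class that uses the mixin, so it can only grow. A minimal, self-contained illustration of the difference (class names here are mine):

```python
class Shared:
    errors = dict()          # one dict, created at class-definition time


class PerInstance:
    def __init__(self):
        self.errors = {}     # a fresh dict for every instance


a, b = Shared(), Shared()
a.errors['x'] = 1
print(b.errors)              # {'x': 1} - same dict, shared, never released

c, d = PerInstance(), PerInstance()
c.errors['x'] = 1
print(d.errors)              # {} - each instance owns (and frees) its own dict
```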
While this is not very interesting in itself, there are a few takeaways that may be helpful to future explorers who stumble across the same kind of problem. They might all be obvious to everybody except me and a single other developer - in which case, hello other developer.
`git bisect` is your friend, as long as you can find a commit that actually works.