I have a program that processes a dataset. When it reads the dataset, I log its memory footprint with memory_usage(deep=True).sum(), and it reports X GB. However, at the same time, process-level profiling of the running app reports about 3.5X GB of RAM consumed. I have tried optimizations that decreased the overall usage, but the ratio between the dataset size and the actual process memory consumption stays the same.
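For concreteness, the two numbers can be read side by side in one process. Below is a minimal sketch, assuming psutil is installed (it is not part of the original example):

import psutil
from pandas import DataFrame

NUMBER_OF_RECORDS = 5_000_000

dataset = DataFrame(
    {
        "a": [1] * NUMBER_OF_RECORDS,
        "b": [2] * NUMBER_OF_RECORDS,
        "c": [3] * NUMBER_OF_RECORDS,
    }
)

# What pandas itself accounts for: the buffers referenced by the frame.
dataset_mb = dataset.memory_usage(deep=True).sum() / 1024 / 1024

# What the OS reports for the whole process (resident set size), which also
# covers the interpreter, imported modules, and any memory Python holds on
# to after temporary objects are freed.
rss_mb = psutil.Process().memory_info().rss / 1024 / 1024

print(f"dataset: {dataset_mb:.1f} MiB, process RSS: {rss_mb:.1f} MiB")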
Questions: Why does the process consume roughly 3.5 times more RAM than the dataset itself, and can that overhead be reduced?
Minimal example:
from memory_profiler import profile
from pandas import DataFrame
NUMBER_OF_RECORDS = 5_000_000
@profile
def main():
    dataset = DataFrame(
        {
            "a": [1] * NUMBER_OF_RECORDS,
            "b": [2] * NUMBER_OF_RECORDS,
            "c": [3] * NUMBER_OF_RECORDS,
        }
    )
    print(f"Memory usage: {dataset.memory_usage().sum() / 1024 / 1024} MB")
main()
Output:
Memory usage: 114.44104385375977 MB
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     88.7 MiB     88.7 MiB           1   @profile
     8                                         def main():
     9    675.3 MiB    472.1 MiB           2       dataset = DataFrame(
    10    203.2 MiB      0.0 MiB           1           {
    11    126.9 MiB     38.2 MiB           1               "a": [1] * NUMBER_OF_RECORDS,
    12    165.0 MiB     38.2 MiB           1               "b": [2] * NUMBER_OF_RECORDS,
    13    203.2 MiB     38.2 MiB           1               "c": [3] * NUMBER_OF_RECORDS,
    14                                                 }
    15                                             )
    16
    17    675.7 MiB      0.3 MiB           1       print(f"Memory usage: {dataset.memory_usage().sum() / 1024 / 1024} MB")
The Python runtime does its own memory management. When you first create the three large lists, Python has to request memory for them from the OS. Once the lists are no longer needed (i.e., once DataFrame returns), that memory is retained by Python for future objects rather than being returned to the OS, which would force Python to request it again later.
If you create the three lists again after creating the data frame, you shouldn't see Python's memory usage increase: the new lists will simply reuse the memory previously occupied by the old ones.
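A quick way to check this is to extend the minimal example above: rebuild the same three lists after the DataFrame exists and profile again. The sketch below keeps the original memory_profiler setup; the second round of allocations should show little or no net increment, though the exact behavior depends on the platform allocator.

from memory_profiler import profile
from pandas import DataFrame

NUMBER_OF_RECORDS = 5_000_000

@profile
def main():
    dataset = DataFrame(
        {
            "a": [1] * NUMBER_OF_RECORDS,
            "b": [2] * NUMBER_OF_RECORDS,
            "c": [3] * NUMBER_OF_RECORDS,
        }
    )
    # The lists passed to DataFrame are unreachable by this point, so their
    # memory is back in Python's hands, though not necessarily the OS's.
    # Allocating equally sized lists again should therefore reuse it, and
    # the profiler should report (almost) no increment for these lines.
    a = [1] * NUMBER_OF_RECORDS
    b = [2] * NUMBER_OF_RECORDS
    c = [3] * NUMBER_OF_RECORDS
    return dataset, a, b, c

main()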