I have a program that processes a dataset. When it reads the dataset, I log its memory footprint with memory_usage(deep=True).sum(), and it reports X GB. However, at the same time, process-level profiling of the running app reports about 3.5X GB of RAM consumed. I have tried optimizations that decreased the overall usage, but the ratio between the dataset size and the actual process memory consumption stays the same.
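For concreteness, the two numbers can be read side by side in one process. Below is a minimal sketch, assuming psutil is installed (it is not part of the original example):

import psutil
from pandas import DataFrame

NUMBER_OF_RECORDS = 5_000_000

dataset = DataFrame(
    {
        "a": [1] * NUMBER_OF_RECORDS,
        "b": [2] * NUMBER_OF_RECORDS,
        "c": [3] * NUMBER_OF_RECORDS,
    }
)

# What pandas itself accounts for: the buffers referenced by the frame.
dataset_mb = dataset.memory_usage(deep=True).sum() / 1024 / 1024

# What the OS reports for the whole process (resident set size), which also
# covers the interpreter, imported modules, and any memory Python holds on
# to after temporary objects are freed.
rss_mb = psutil.Process().memory_info().rss / 1024 / 1024

print(f"dataset: {dataset_mb:.1f} MiB, process RSS: {rss_mb:.1f} MiB")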
Questions: Why does the process consume roughly 3.5 times more RAM than the dataset itself, and can that overhead be reduced?
Minimal example:
from memory_profiler import profile
from pandas import DataFrame
NUMBER_OF_RECORDS = 5_000_000
@profile
def main():
    dataset = DataFrame(
        {
            "a": [1] * NUMBER_OF_RECORDS,
            "b": [2] * NUMBER_OF_RECORDS,
            "c": [3] * NUMBER_OF_RECORDS,
        }
    )
    print(f"Memory usage: {dataset.memory_usage().sum() / 1024 / 1024} MB")
main()
Output:
Memory usage: 114.44104385375977 MB
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     88.7 MiB     88.7 MiB           1   @profile
     8                                         def main():
     9    675.3 MiB    472.1 MiB           2       dataset = DataFrame(
    10    203.2 MiB      0.0 MiB           1           {
    11    126.9 MiB     38.2 MiB           1               "a": [1] * NUMBER_OF_RECORDS,
    12    165.0 MiB     38.2 MiB           1               "b": [2] * NUMBER_OF_RECORDS,
    13    203.2 MiB     38.2 MiB           1               "c": [3] * NUMBER_OF_RECORDS,
    14                                                 }
    15                                             )
    16
    17    675.7 MiB      0.3 MiB           1       print(f"Memory usage: {dataset.memory_usage().sum() / 1024 / 1024} MB")
The Python runtime does its own memory management. When you first create the three large lists, Python has to request memory for them from the OS. Once the lists are no longer needed (i.e., once DataFrame returns), that memory is retained by Python for future objects rather than being returned to the OS, which would force Python to request it again later.
If you create the three lists again after creating the data frame, you shouldn't see Python's memory usage increase: the new lists will simply reuse the memory previously occupied by the old ones.
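A quick way to check this is to extend the minimal example above: rebuild the same three lists after the DataFrame exists and profile again. The sketch below keeps the original memory_profiler setup; the second round of allocations should show little or no net increment, though the exact behavior depends on the platform allocator.

from memory_profiler import profile
from pandas import DataFrame

NUMBER_OF_RECORDS = 5_000_000

@profile
def main():
    dataset = DataFrame(
        {
            "a": [1] * NUMBER_OF_RECORDS,
            "b": [2] * NUMBER_OF_RECORDS,
            "c": [3] * NUMBER_OF_RECORDS,
        }
    )
    # The lists passed to DataFrame are unreachable by this point, so their
    # memory is back in Python's hands, though not necessarily the OS's.
    # Allocating equally sized lists again should therefore reuse it, and
    # the profiler should report (almost) no increment for these lines.
    a = [1] * NUMBER_OF_RECORDS
    b = [2] * NUMBER_OF_RECORDS
    c = [3] * NUMBER_OF_RECORDS
    return dataset, a, b, c

main()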