Tags: python, pandas, overhead

Why does Python face an overhead every 98 executions?


I have a big DataFrame, and I just want to assign a constant to a new column. During the first iterations (1 to 97) everything is fine and the code runs fast. Then the memory usage rockets at iteration 98; it is fine again until iteration 196 (98 iterations later), where the RAM rockets once more, and so on: as the loop continues, the memory spikes at every i that is a multiple of 98...

I guess the mysterious number 98 may vary from PC to PC, and you may have to change the DataFrame size in order to reproduce the problem.

Here is my code:

Edit: I don't think it's garbage collection, because gc.isenabled() returns False at the end of the code:

import gc

import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a': range(n)})

# fill the frame with 99 random object columns first
for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

gc.disable()
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(i)  # slow at every i that is a multiple of 98

gc.isenabled()
> False

And here is my memory usage; the peaks are at iterations i*98 (where i is an integer):

I'm on Windows 10, Python 3.6.1 | Anaconda 4.4.0 | pandas 0.24.2

I have 16 GB RAM & 8 core CPU

[screenshot: memory usage over time, with spikes at every multiple of 98]


Solution

  • Firstly, I can confirm the same behavior on Ubuntu with 16 GB of RAM and GC disabled. Therefore, it is definitely not an issue with GC or Windows memory management.

    Secondly, on my system it slows down after every 99 iterations: after 99, after 198, after 297, and so on. In any case, I have a rather limited swap file, so when RAM+swap fills up, it crashes with the following stack trace:

    294
    295
    296
    297
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
        return self._engine.get_loc(key)
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'test_298'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
        loc = self.items.get_loc(item)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
        return self._engine.get_loc(self._maybe_cast_indexer(key))
      File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
      File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
      File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
    KeyError: 'test_298'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "py-memory-test.py", line 12, in <module>
        data['test_{}'.format(i)] = i
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
        self._set_item(key, value)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
        NDFrame._set_item(self, key, value)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
        self._data.set(key, value)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
        self.insert(len(self.items), item, value)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
        self._consolidate_inplace()
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
        self.blocks = tuple(_consolidate(self.blocks))
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
        _can_consolidate=_can_consolidate)
      File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
        new_values = new_values[argsort]
    MemoryError
    

    Thus, it seems that pandas sometimes does some kind of merging/consolidation/repacking on insert. Let's take a look at the insert function in core/internals/managers.py; it has the following lines:

    def insert(self, loc, item, value, allow_duplicates=False):
        ...
        self._known_consolidated = False
    
        if len(self.blocks) > 100:
            self._consolidate_inplace()
    

    I guess this is exactly what we were looking for!

    Each time we do an insert, a new block is created. When the number of blocks exceeds some limit, extra work (consolidation) is performed. The difference between the 100-block limit in the code and the numbers around 98-99 obtained empirically may be explained by the presence of some extra DataFrame metadata that also needs room in the block list.
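    To watch this happening directly, one can print the block count as the loop runs. A minimal sketch (it pokes at private internals: DataFrame._data is the BlockManager in pandas 0.24, renamed _mgr in later versions, so none of this is public API):

    import pandas as pd

    n = 100000  # smaller than the original 2000000; enough to see the pattern
    data = pd.DataFrame({'a': range(n)})

    for i in range(1, 110):
        data['test_{}'.format(i)] = i
        # each insert appends one new block to the private BlockManager...
        print(i, len(data._data.blocks))
        # ...and once the count exceeds 100, consolidation merges the
        # like-typed blocks back together; that is the expensive step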

    UPD: in order to prove this hypothesis I tried changing 100 -> 1000000 and it worked just fine: no performance gaps, no MemoryError. However, there is no public API to modify this parameter at run time; it is simply hardcoded.
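    For a quick experiment one can also suppress consolidation without editing the installed sources, by monkeypatching the internal method (whose existence the stack trace above confirms) to a no-op. This is a hack against private internals, not a fix: the blocks simply stay fragmented forever.

    import pandas.core.internals.managers as managers

    # no-op the consolidation step that insert() triggers past 100 blocks
    # (private API; the threshold itself is hardcoded in insert())
    managers.BlockManager._consolidate_inplace = lambda self: None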

    UPD2: I submitted an issue to pandas, since a MemoryError doesn't look like appropriate behavior for such a simple program.
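    As a practical workaround (a sketch, not an official pandas recommendation): build all the new columns up front and attach them with a single concat, so the loop never accumulates hundreds of blocks in the first place:

    import pandas as pd
    import numpy as np

    n = 2000000
    data = pd.DataFrame({'a': range(n)})

    # one DataFrame holding all 599 constant columns, attached in a
    # single operation: the blocks are created once instead of being
    # inserted one by one and re-consolidated over and over
    new_cols = pd.DataFrame({'test_{}'.format(i): np.full(n, i)
                             for i in range(1, 600)},
                            index=data.index)
    data = pd.concat([data, new_cols], axis=1)

    The column data is of course the same size either way (roughly 9 GB of int64 at this n); what disappears is the repeated consolidation work and its transient memory spikes.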