I have a big DataFrame and I just want to assign a constant to a new column. The first iterations (1 to 97) are fine and the code runs fast. Then memory usage rockets at iteration 98; it is fine again until iteration 196 (98 iterations later), where the RAM rockets once more, and the loop continues like that: the memory spikes at every i that is a multiple of 98...
I guess the mysterious number 98 may vary from one machine to another, and you may have to change the DataFrame size to reproduce the problem.
Here is my code
Edit: I think it's not garbage collection, because gc.isenabled() returns False at the end of the code.
import gc
import pandas as pd
import numpy as np

n = 2000000
data = pd.DataFrame({'a': range(n)})
for i in range(1, 100):
    data['col_' + str(i)] = np.random.choice(['a', 'b'], n)

gc.disable()
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    print(str(i))  # slow at every i that is a multiple of 98

gc.isenabled()
> False
And here is my memory usage; the peaks are at iteration i*98 (where i is an integer):
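A minimal sketch of one way to log the per-iteration memory behind that curve, assuming the psutil package is available (it is not part of the script above):

import os
import psutil

process = psutil.Process(os.getpid())
for i in range(1, 600):
    data['test_{}'.format(i)] = i
    rss_mb = process.memory_info().rss / 1024 ** 2  # resident memory in MB
    print(i, round(rss_mb))  # RSS should jump at every multiple of ~98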
I'm on Windows 10, Python 3.6.1 | Anaconda 4.4.0 | pandas 0.24.2
I have 16 GB of RAM and an 8-core CPU.
Firstly, I can confirm the same behavior on Ubuntu with 16 GB of RAM and GC disabled, so it is definitely not an issue with GC or Windows memory management.
Secondly, on my system it slows down after every 99 iterations: after 99, after 198, after 297, etc. In any case, I have a rather limited swap file, so when RAM + swap fills up, it crashes with the following stack trace:
294
295
296
297
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1053, in set
loc = self.items.get_loc(item)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1601, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1608, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'test_298'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "py-memory-test.py", line 12, in <module>
data['test_{}'.format(i)] = i
File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3370, in __setitem__
self._set_item(key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 3446, in _set_item
NDFrame._set_item(self, key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3172, in _set_item
self._data.set(key, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1056, in set
self.insert(len(self.items), item, value)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1184, in insert
self._consolidate_inplace()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 929, in _consolidate_inplace
self.blocks = tuple(_consolidate(self.blocks))
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/managers.py", line 1899, in _consolidate
_can_consolidate=_can_consolidate)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/internals/blocks.py", line 3149, in _merge_blocks
new_values = new_values[argsort]
MemoryError
Thus, it seems that pandas sometimes does some kind of merging/consolidation/repacking on insert. Let's take a look at the insert function in core/internals/managers.py; it has the following lines:
def insert(self, loc, item, value, allow_duplicates=False):
    ...
    self._known_consolidated = False

    if len(self.blocks) > 100:
        self._consolidate_inplace()
I guess this is exactly what we were looking for! Each time we do an insert, a new block is created; when the number of blocks exceeds some limit, extra work (consolidation) is performed. The difference between the 100-block limit in the code and the numbers around 98-99 that we observed empirically may be explained by some extra DataFrame metadata that takes up a bit of room, too.
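One can watch the block count climb and collapse directly. A rough sketch, assuming pandas 0.24 where the BlockManager is reachable through the private _data attribute (newer versions rename it to _mgr, and internals may change without notice):

import pandas as pd

df = pd.DataFrame({'a': range(10)})
for i in range(1, 120):
    df['test_{}'.format(i)] = i
    # the block count grows by one per insert, then collapses back to 1
    # once it exceeds 100 and _consolidate_inplace() kicks in
    print(i, len(df._data.blocks))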
UPD: in order to prove this hypothesis I tried changing 100 -> 1000000 and it worked just fine: no performance gaps, no MemoryError. However, there is no public API to modify this parameter at run time; it is simply hardcoded.
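Until that changes, a practical workaround is to avoid the one-insert-per-column pattern and build all the new columns in a single operation, so the block count never approaches the threshold. A sketch of the idea using the public pd.concat API (the column names simply mirror the question):

import pandas as pd

n = 2000000
data = pd.DataFrame({'a': range(n)})

# scalars broadcast against the supplied index, and all 599 columns
# are created as one consolidated block instead of 599 separate ones
new_cols = pd.DataFrame({'test_{}'.format(i): i for i in range(1, 600)},
                        index=data.index)
data = pd.concat([data, new_cols], axis=1)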
UPD2: submitted an issue to pandas, since a MemoryError doesn't look like appropriate behavior for such a simple program.