In a recent post Pandas performance while iterating a state vector, I noticed a performance when slicing pandas dataframes that i do not understand.
The code presented here does not do anything usefull, but highlight the issue:
What baffles me that the way i assign values to extra_columns before the loop affects the loop performance
Python code
import timeit
setup_stmt ="""
import pandas as pd
num_cols = 500
n_iter = 100
extra_column = [ "product"]
columns = [chr(i+65) for i in range(num_cols)]
index= range(n_iter)
"""
stmt1 ="""
df = pd.DataFrame(index = index, columns=extra_column + columns)
df["product"] = "x"
for i in index:
df.loc[i,columns] = 0
"""
stmt2 ="""
df = pd.DataFrame(index = index, columns=extra_column + columns)
df.product = "x"
for i in index:
df.loc[i,columns] = 0
"""
stmt3 ="""
df = pd.DataFrame(index= index, columns=extra_column + columns)
df.loc[index,"product"] = "x"
for i in index:
df.loc[i,columns] = 0
"""
stmt4 ="""
df = pd.DataFrame(index = index, columns=extra_column + columns)
for i in index:
df.loc[i,columns] = 0
df["product"] = "x"
"""
print(f" stmt1 takes { timeit.timeit(setup= setup_stmt, stmt= stmt1, number=10):2.2f} seconds" )
print(f" stmt2 takes { timeit.timeit(setup= setup_stmt, stmt= stmt2, number=10):2.2f} seconds" )
print(f" stmt3 takes { timeit.timeit(setup= setup_stmt, stmt= stmt3, number=10):2.2f} seconds" )
print(f" stmt4 takes { timeit.timeit(setup= setup_stmt, stmt= stmt4, number=10):2.2f} seconds" )
Output
stmt1 takes 20.60 seconds
stmt2 takes 0.46 seconds
stmt3 takes 0.46 seconds
stmt4 takes 0.46 seconds
TL;DR
Your original df
has a single block in memory.
With stmt1
, use of df["product"] = "x"
makes the BlockManager
(an internal memory manager) add a new block. Having multiple blocks adds overhead, as pandas
needs to check and consolidate them each time a row gets modified.
With stmt3
, you do not have this issue, as df.loc[index,"product"] = "x"
is an in-place modification, that keeps the original, single block intact.
stmt2
should be ignored (see note at the end). stmt4
is irrelevant, as the second block is created only after the for loop.
Answer
The difference in performance between your stmt1
and stmt3
has to do with the so-called BlockManager
, which is an internal manager that tries to keep columns with compatible dtypes together as blocks in memory.
Useful information about the use of the BlockManager
for a specific pd.DataFrame
can be retrieved by accessing df._mgr
. With your example:
import pandas as pd
num_cols = 3
n_iter = 3
extra_column = ["product"]
columns = [chr(i+65) for i in range(num_cols)]
index= range(n_iter)
df = pd.DataFrame(index=index, columns=extra_column + columns)
product A B C
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
df._mgr
BlockManager
Items: Index(['product', 'A', 'B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object # all cols
So, here we see that the BlockManager
is working with a single block in memory.
stmt1
: adding a new column / replacing one adds a blockIf now we use bracket notation ([]
) to assign "x" to column "product", we are really re-creating that column. As a result, a second block is created:
df["product"] = "x"
print(df._mgr)
BlockManager
Items: Index(['product', 'A', 'B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object # cols "A, "B", "C"
NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object # col "product"
The important thing here is that this column is replacing the old column "product": it's a new column. E.g., if we use df.loc
to create a new column, the same thing happens:
df.loc[:, "new_col"] = "x"
print(df._mgr)
BlockManager
Items: Index(['product', 'A', 'B', 'C', 'new_col'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumpyBlock: slice(1, 4, 1), 3 x 3, dtype: object # cols "A, "B", "C"
NumpyBlock: slice(0, 1, 1), 1 x 3, dtype: object # col "product"
NumpyBlock: slice(4, 5, 1), 1 x 3, dtype: object # col "new_col"
stmt3
: in-place modification keeps block intactHere we see the difference with df.loc[index,"product"] = "x"
, because in this case we are not re-creating "product", we are simply updating its values. This does not create a new block:
df = pd.DataFrame(index = index, columns=extra_column + columns)
df.loc[index,"product"] = "x"
print(df._mgr)
BlockManager
Items: Index(['product', 'A', 'B', 'C'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumpyBlock: slice(0, 4, 1), 4 x 3, dtype: object # "product" still here
Key takeaways
The upshot of all this for the different versions you use:
stmt1
with df["product"] = "x"
internally has two blocksstmt3
with df.loc[index,"product"] = "x"
internally keeps one blockstmt4
with df["product"] = "x"
after the for loop only has two blocks after that loop.The significant delay for stmt1
is caused by pandas
needing to reconcile multiple blocks each time df.loc[i,columns] = 0
is executed in the loop. These internal checks trigger extra memory operations, as pandas
must align modified rows across separate blocks. This results in a sizeable slowdown compared to the single-block df
.
Interestingly, df.copy
leads to a reset of the blocks. Consequently, adding df = df.copy()
gets the performance of stmt1
very close to stmt3
again:
# adding: `df = df.copy()`
stmt1 ="""
df = pd.DataFrame(index = index, columns=extra_column + columns)
df["product"] = "x"
df = df.copy()
for i in index:
df.loc[i,columns] = 0
"""
print(f" stmt1 takes { timeit.timeit(setup= setup_stmt, stmt= stmt1, number=10):2.2f} seconds" )
print(f" stmt3 takes { timeit.timeit(setup= setup_stmt, stmt= stmt3, number=10):2.2f} seconds" )
Prints:
stmt1 takes 1.00 seconds
stmt3 takes 0.99 seconds
Further reading
Some interesting reads on this complex topic and the difficulty of establishing its influence for specific use cases:
There are plans to replace the BlockManager
: see here, cf. here.
A note on stmt2
stmt2
should be ignored here, because it is not doing what you think it does. "dot notation" is a convenience feature that can provide attribute access to a df
column. But this method comes with a few caveats. One being:
The attribute will not be available if it conflicts with an existing method name, e.g.
s.min
is not allowed, buts['min']
is possible.
This applies here, because df.product
is a method of class pd.DataFrame
. I.e., when you do df.product = "x"
, you are simply overwriting the method and storing the string "x" in its place:
df = pd.DataFrame({"product": [1]})
type(df.product)
method
df.product = "x"
type(df.product)
str
I.e., we never updated the actual df
:
product
0 1 # nothing changed