I want to make a pandas DataFrame that describes the state of a system at different times.
I have created a somewhat simplified version that simulates my change data: df_change.
Question 1
Can you think of a more efficient way of generating the matrix than what I do in this code? I have a state that I update in a loop and insert row by row.
Question 2
This is something I discovered while trying to write the sample code for this question. I see a 20-fold boost in loop iteration performance if I do the assignment to the "product" column after the loop rather than before it. Why is this? I am using Python 3.12.4 and pandas 2.2.2.
df["product"] = "some_product"
#%%
import numpy as np
import pandas as pd
from tqdm import tqdm
num_cols = 600
n_changes = 40000
# simulate changes
extra_colums = ["n_changes", "product"]
columns = [chr(i + 65) for i in range(num_cols)]
state = {icol: np.random.random() for icol in columns}
change_index = np.random.randint(0, 4, n_changes).cumsum()
change_col = [columns[np.random.randint(0, num_cols)] for i in range(n_changes)]
change_val = np.random.normal(size=n_changes)
# create change matrix
df_change = pd.DataFrame(index=change_index)
df_change['col'] = change_col
df_change['val'] = change_val
index = np.unique(change_index)
# %%
# Slow approach gives 5 iterations/s
df = pd.DataFrame(index=index, columns=extra_colums + columns)
df["product"] = "some_product"
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df.loc[i, columns] = pd.Series(state)
# %%
# Fast approach gives 1000 iterations/s
df2 = pd.DataFrame(index=index, columns=extra_colums + columns)
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df2.loc[i, columns] = pd.Series(state)
df2["product"] = "some_product"
Edit
I marked the answer by ouroboros1 as the accepted solution - it works really well and answers Question 1.
I am still curious about Question 2: the difference in pandas performance between the two methods where I iterate through the rows. I found that I can also get performance similar to the original "df2" method depending on how I assign the value before the loop.
The interesting point here is that the pre-assignment changes the performance of the loop that follows.
# Fast approach gives 1000 iterations/s
df3 = pd.DataFrame(index=index, columns=extra_colums + columns)
# df3.loc[index, "product"] = "some_product"  # Fast
# df3["product"] = "some_product"  # Slow
df3.product = "some_product"  # Fast
for i in tqdm(index):
    state.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
    df3.loc[i, columns] = np.array(list(state.values()))
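One way to dig into this - a diagnostic sketch of my own, relying on pandas internals (_mgr and nblocks are not public API) - is to watch how the internal block layout changes when a full column is assigned before any row writes happen:
# Assumption: the pre-assignment changes the internal block layout that the
# subsequent .loc row writes have to work against. _mgr.nblocks is an
# internal, non-public pandas attribute (pandas 2.x). Uses np/pd from above.
probe_cols = [chr(i + 65) for i in range(50)]
probe = pd.DataFrame(index=range(5), columns=["product"] + probe_cols)
print(probe._mgr.nblocks)          # block count right after construction
probe["product"] = "some_product"  # full-column assignment, as in the slow variant
print(probe._mgr.nblocks)          # block count after the column assignment
probe.loc[0, probe_cols] = np.random.random(len(probe_cols))
print(probe._mgr.nblocks)          # block count after one row write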
Accepted answer (ouroboros1)
Here's one approach that should be much faster:
Data sample
num_cols = 4
n_changes = 6
np.random.seed(0) # reproducibility
# setup ...
df_change
   col       val
1    C  0.144044
4    A  1.454274
5    A  0.761038
7    A  0.121675
7    C  0.443863
10   B  0.333674
state
{'A': 0.5488135039273248,
 'B': 0.7151893663724195,
 'C': 0.6027633760716439,
 'D': 0.5448831829968969}
Code
out = (df_change.reset_index()
       .pivot_table(index='index',
                    columns='col',
                    values='val',
                    aggfunc='last')
       .rename_axis(index=None, columns=None)
       .assign(product='some_product')
       .reindex(columns=extra_colums + columns)
       .fillna(pd.DataFrame(state, index=[index[0]]))
       .ffill()
       )
Output
    n_changes       product         A         B         C         D
1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
4         NaN  some_product  1.454274  0.715189  0.144044  0.544883
5         NaN  some_product  0.761038  0.715189  0.144044  0.544883
7         NaN  some_product  0.121675  0.715189  0.443863  0.544883
10        NaN  some_product  0.121675  0.333674  0.443863  0.544883
# note:
# A updated in 4, 5, 7
# B updated in 10
# C updated in 1, 7
Explanation / Intermediates
- df.reset_index to access 'index' inside df.pivot_table.
- For the aggfunc, use 'last': we only need to propagate the last value in case of duplicate 'col' values per index value (see the small sketch after the intermediate below).
- df.rename_axis to reset the index and columns names to None.
# df_change.reset_index().pivot_table(...).rename_axis(...)
           A         B         C
1        NaN       NaN  0.144044
4   1.454274       NaN       NaN
5   0.761038       NaN       NaN
7   0.121675       NaN  0.443863
10       NaN  0.333674       NaN
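To illustrate why 'last' is the right aggregator, here is a tiny hypothetical case (not part of the sample above) where the same column changes twice at one index value; only the final value survives the pivot:
# Hypothetical duplicate: column 'A' changes twice at index value 3.
dup = pd.DataFrame({'col': ['A', 'A'], 'val': [1.0, 2.0]}, index=[3, 3])
dup.reset_index().pivot_table(index='index', columns='col',
                              values='val', aggfunc='last')
# col      A
# index
# 3      2.0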
- df.assign to add column 'product' with a scalar ('some_product').
- df.reindex to get the columns in the desired order (with extra_colums up front). The not yet existing column 'n_changes' will be added with NaN values.
- df.fillna with a pd.DataFrame built from state for only the first index value (index[0]), to fill the first row (alternatively, use df.combine_first). A tiny alignment sketch follows the intermediate below.
# after .fillna(...)
    n_changes       product         A         B         C         D
1         NaN  some_product  0.548814  0.715189  0.144044  0.544883
4         NaN  some_product  1.454274       NaN       NaN       NaN
5         NaN  some_product  0.761038       NaN       NaN       NaN
7         NaN  some_product  0.121675       NaN  0.443863       NaN
10        NaN  some_product       NaN  0.333674       NaN       NaN
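Worth noting (my illustration, with made-up values): fillna with a DataFrame aligns on both index and columns, which is why only the first row's gaps get filled here:
# Made-up values: the filler frame only covers index 1, so row 2 keeps its NaN.
a = pd.DataFrame({'x': [np.nan, np.nan]}, index=[1, 2])
a.fillna(pd.DataFrame({'x': [9.0]}, index=[1]))
#      x
# 1  9.0
# 2  NaN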
- df.ffill to forward-fill the remaining NaN values from the rows above.
Performance comparison:
num_cols = 100
n_changes = 100
np.random.seed(0) # reproducibility
# out:
7.01 ms ± 435 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# df2 (running this *afterwards*, as you are updating `state`)
93.7 ms ± 3.79 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
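The figures above read like IPython %timeit output; a minimal harness along these lines (my reconstruction, not necessarily the exact benchmark used) would be:
# Sketch (assumption): how the two variants might have been timed in IPython.
# Requires the setup cell above, run with num_cols = 100, n_changes = 100.
def pivot_approach():
    return (df_change.reset_index()
            .pivot_table(index='index', columns='col', values='val', aggfunc='last')
            .rename_axis(index=None, columns=None)
            .assign(product='some_product')
            .reindex(columns=extra_colums + columns)
            .fillna(pd.DataFrame(state, index=[index[0]]))
            .ffill())

def loop_approach():
    df2 = pd.DataFrame(index=index, columns=extra_colums + columns)
    s = dict(state)  # copy, so repeated runs don't mutate the shared state
    for i in index:
        s.update(zip(df_change.loc[[i], "col"], df_change.loc[[i], "val"]))
        df2.loc[i, columns] = pd.Series(s)
    df2["product"] = "some_product"
    return df2

# In IPython:
# %timeit pivot_approach()
# %timeit loop_approach()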
Equality check:
df2.astype(out.dtypes).equals(out)
# True
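The astype cast is needed because, as far as I can tell, the loop-built frame stays object-dtyped (it was created empty and filled via .loc), while out has proper float columns:
df2.dtypes.value_counts()  # expect: mostly object
out.dtypes.value_counts()  # expect: mostly float64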