Please imagine I have a dataframe like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.Index(['1', '1', '2', '2'], name='from'), columns=['to'], data=['2', '2', '4', '5'])
df:
     to
from
1     2
1     2
2     4
2     5
Now, I would like to calculate a matrix comprising of the percentage of times each value in the index "from" transitions to each value in column 'to', which is known as a transition matrix. I can achieve this by creating an empty transition matrix first and then populating it with the percentages using a for loop:
#Create an empty matrix to populate later (using sparse dtype to save memory):
matrix = pd.DataFrame(index=df.index.unique(), columns=df.to.unique(), data=0, dtype=pd.SparseDtype(dtype=np.float16, fill_value=0))
matrix:
        2    4    5
from
1     0.0  0.0  0.0
2     0.0  0.0  0.0
for i in range(len(df)):
    from_, to = df.index[i], df.to.iloc[i]
    matrix[to] = matrix[to].sparse.to_dense()  # Convert to dense format because the sparse dtype does not allow value assignment with .loc in the next line
    matrix.loc[from_, to] += 1  # Do a normal insertion with .loc[]
    matrix[to] = matrix[to].astype(pd.SparseDtype(dtype=np.float16, fill_value=0))  # Back to the original sparse format
matrix = (matrix.div(matrix.sum(axis=1), axis=0)*100) # converting counts to percentages
matrix:
          2     4     5
from
1     100.0   0.0   0.0
2       0.0  50.0  50.0
This works. For example, index "1" only transitioned to "2" (100% of the time) and index "2" transitioned to "4" 50% of the time and to "5" the other 50% of the time, as can be verified in df.
Issue: The actual matrix is about 500K by 500K and the for loop takes a really long time to finish. So, is there a vectorized or otherwise efficient way of calculating matrix from df?
Note: I would get a MemoryError without the whole Sparse dtype approach, even with dtype=np.float16 in pd.DataFrame(), so I would prefer to keep it if possible. It would be great if the 500K by 500K matrix did not take up more than 10-12 GB of RAM. Also, if it matters, these percentages will always be in the 0-100 range, obviously.
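(As a rough back-of-envelope check of why the dense version fails:)
500_000 * 500_000 * 2 / 1e9   # ≈ 500 GB for a fully dense 500K x 500K float16 frame, hence the MemoryError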
Option 1: pd.crosstab
Use pd.crosstab with normalize='index', then df.mul and df.rename_axis for formatting.
out = (pd.crosstab(index=df.index,
                   columns=df['to'],
                   normalize='index')
       .mul(100)
       .rename_axis(index='from', columns=None)
       )
Output:
2 4 5
from
1 100.0 0.0 0.0
2 0.0 50.0 50.0
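If memory is a concern with this option, the result can be cast to the same sparse dtype used in the question afterwards (a sketch; note that pd.crosstab itself still materialises a dense table first):
out_sparse = out.astype(pd.SparseDtype(dtype=np.float16, fill_value=0))  # sparse float16 columns, fill value 0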
Option 2: df.pivot_table
Use df.reset_index so you can apply df.pivot_table with aggfunc='size' and fill_value=0, then df.div to divide by df.index.value_counts() along axis=0, and finally df.mul and df.rename_axis for formatting.
out2 = (df.reset_index()
          .pivot_table(index='from', columns='to',
                       values='to', aggfunc='size', fill_value=0)
          .div(df.index.value_counts(), axis=0)
          .mul(100)
          .rename_axis(columns=None)
        )
out2.equals(out)
# True
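For reference, the row-wise denominator in the .div step is simply the number of transitions recorded for each 'from' value; for the example data:
df.index.value_counts()   # each 'from' value appears twice
# 1    2
# 2    2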
Option 3: df.groupby
Use df.groupby with df.index and apply groupby.value_counts with normalize=True, then df.unstack with fill_value=0 to move level='to' (i.e. -1) to columns, and df.mul + df.rename_axis for formatting.
out3 = (df.groupby(df.index)
          .value_counts(normalize=True)
          .unstack(-1, fill_value=0)
          .mul(100)
          .rename_axis(columns=None)
        )
out3.equals(out)
# True
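As a quick sanity check (my addition, not part of the original answer), every row of these results should sum to 100, since each row covers all transitions out of that 'from' value:
out3.sum(axis=1)
# from
# 1    100.0
# 2    100.0
# dtype: float64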
Edit:
With a very sizeable df, I would consider creating a csr_matrix to avoid the MemoryError. In this case, we only use df.groupby + value_counts, and use Index.factorize to pass compatible row_ind and col_ind values.
from scipy.sparse import csr_matrix

g = df.groupby(df.index).value_counts(normalize=True).mul(100)   # percentages as a Series with a ('from', 'to') MultiIndex
idx_codes, idx_unique = g.index.get_level_values(0).factorize()  # integer row positions + unique 'from' labels
col_codes, col_unique = g.index.get_level_values(1).factorize()  # integer column positions + unique 'to' labels
m = csr_matrix((g.values, (idx_codes, col_codes)))
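One extra point worth knowing (my note): csr_matrix infers the shape from the largest row/column index it sees, so you can also pass shape explicitly if you want the dimensions pinned down:
m = csr_matrix((g.values, (idx_codes, col_codes)),
               shape=(len(idx_unique), len(col_unique)))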
If you want, you can turn that back into a df with pd.DataFrame.sparse.from_spmatrix:
df_m = pd.DataFrame.sparse.from_spmatrix(m,
index=idx_unique,
columns=col_unique)
Output:
2 4 5
1 100.0 0.0 0.0
2 0.0 50.0 50.0
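To get a feel for the memory footprint (my addition; these are standard scipy/pandas attributes):
m.data.nbytes + m.indices.nbytes + m.indptr.nbytes   # bytes actually stored by the csr_matrix
df_m.sparse.density                                  # fraction of non-fill cells in the sparse frame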
But understanding this, you can also continue with m and still use idx_unique and col_unique for slicing. E.g.:
bool_idx = col_unique.isin(['4','5'])
m[:, bool_idx].sum(axis=1)
matrix([[  0.],
        [100.]])
Compare:
df_m.loc[:, bool_idx].sum(axis=1)
1 0
2 100.0
dtype: Sparse[float64, 0]
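And if you need a single transition percentage out of m by label, Index.get_indexer can map the labels back to positions (my sketch, not from the original answer):
row = idx_unique.get_indexer(['2'])[0]   # positional row index of 'from' == '2'
col = col_unique.get_indexer(['5'])[0]   # positional column index of 'to' == '5'
m[row, col]                              # 50.0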