Tags: python, pandas, dataframe, numpy, vectorization

Most time- and memory-efficient way to manipulate a Pandas DataFrame


Please imagine I have a dataframe like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(index=pd.Index(['1', '1', '2', '2'], name='from'),
                  columns=['to'],
                  data=['2', '2', '4', '5'])

df:

     to
from   
1     2
1     2
2     4
2     5

Now, I would like to calculate a matrix comprising of the percentage of times each value in the index "from" transitions to each value in column 'to', which is known as a transition matrix. I can achieve this by creating an empty transition matrix first and then populating it with the percentages using a for loop:

#Create an empty matrix to populate later (using sparse dtype to save memory):
matrix = pd.DataFrame(index=df.index.unique(), columns=df.to.unique(), data=0, dtype=pd.SparseDtype(dtype=np.float16, fill_value=0)) 

matrix:

        2    4    5
from               
1     0.0  0.0  0.0
2     0.0  0.0  0.0

for i in range(len(df)):
    from_, to = df.index[i], df.to.iloc[i]
    # Convert to dense format because the sparse dtype does not allow
    # value assignment with .loc in the next line:
    matrix[to] = matrix[to].sparse.to_dense()
    matrix.loc[from_, to] += 1  # do a normal insertion with .loc[]
    # Back to the original sparse format:
    matrix[to] = matrix[to].astype(pd.SparseDtype(dtype=np.float16, fill_value=0))

matrix = (matrix.div(matrix.sum(axis=1), axis=0)*100) # converting counts to percentages

matrix:

          2     4     5
from                   
1     100.0   0.0   0.0
2       0.0  50.0  50.0

This works. For example, index "1" only ever transitions to "2" (100% of the time), while index "2" transitions to "4" 50% of the time and to "5" the other 50%, as can be verified in df.
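
As a quick sanity check (a minimal sketch, using the matrix built above), every row of percentages should sum to roughly 100:

assert np.isclose(matrix.sum(axis=1).astype(float), 100).all()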

Issue: The actual matrix is about 500K by 500K, and the for loop takes a very long time to finish. Is there a vectorized or otherwise efficient way of calculating matrix from df?

Note: Without the whole Sparse dtype approach I get a MemoryError, even with dtype=float16 in pd.DataFrame(), so I would prefer to keep it if possible. Ideally, the 500K-by-500K matrix should not take up more than 10-12 GB of RAM. Also, if it matters, these percentages will always be in the 0-100 range, obviously.
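
For context, a quick back-of-the-envelope calculation (plain arithmetic, not from the original question) shows why a dense matrix is out of reach here:

n = 500_000
# A dense n x n float16 matrix needs 2 bytes per cell:
print(n * n * 2 / 1024**3)  # ~465 GiB, far beyond a 10-12 GB budget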


Solution

    Option 1: pd.crosstab

    out = (pd.crosstab(index=df.index, 
                       columns=df['to'], 
                       normalize='index'
                       )
           .mul(100)
           .rename_axis(index='from', columns=None)
           )
    

    Output:

              2     4     5
    from                   
    1     100.0   0.0   0.0
    2       0.0  50.0  50.0
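
    To see what normalize='index' does, here are the raw counts before normalization (a small illustration on the sample df; axis names may differ slightly by pandas version, which is why the solution calls rename_axis):

    pd.crosstab(index=df.index, columns=df['to'])
    # to    2  4  5
    # from
    # 1     2  0  0
    # 2     0  1  1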
    

    Option 2: df.pivot_table

    out2 = (df.reset_index()
            .pivot_table(index='from', columns='to', 
                         values='to', aggfunc='size', fill_value=0)
            .div(df.index.value_counts(), axis=0)
            .mul(100)
            .rename_axis(columns=None)
            )
    
    out2.equals(out)
    # True
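
    Since pivot_table has no normalize option, the .div step divides each row of counts by its row total. For reference (output shown for a recent pandas; the Series name may differ by version):

    df.index.value_counts()
    # from
    # 1    2
    # 2    2
    # Name: count, dtype: int64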
    

    Option 3: df.groupby

    out3 = (df.groupby(df.index)
            .value_counts(normalize=True)
            .unstack(-1, fill_value=0)
            .mul(100)
            .rename_axis(columns=None)
            )
    
    out3.equals(out)
    # True
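
    For reference, the intermediate Series before unstacking holds the per-group proportions (the Series name depends on the pandas version):

    df.groupby(df.index).value_counts(normalize=True)
    # from  to
    # 1     2     1.0
    # 2     4     0.5
    #       5     0.5
    # Name: proportion, dtype: float64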
    

    Edit:

    With a very sizeable df, I would consider creating a csr_matrix to avoid the MemoryError. In this case, we use only df.groupby + value_counts (skipping the unstack) and apply Index.factorize to pass compatible row_ind and col_ind values to the csr_matrix constructor.

    from scipy.sparse import csr_matrix
    
    g = df.groupby(df.index).value_counts(normalize=True).mul(100)
    
    idx_codes, idx_unique = g.index.get_level_values(0).factorize()
    col_codes, col_unique = g.index.get_level_values(1).factorize() 
    
    m = csr_matrix((g.values, (idx_codes, col_codes)))
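    # For the sample df above, the factorized values come out as:
    #   idx_codes -> array([0, 1, 1]), idx_unique -> Index(['1', '2'])
    #   col_codes -> array([0, 1, 2]), col_unique -> Index(['2', '4', '5'])
    # so m is a 2x3 csr_matrix holding the three percentage values.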
    

    If you want, you can turn that back into a df with pd.DataFrame.sparse.from_spmatrix:

    df_m = pd.DataFrame.sparse.from_spmatrix(m, 
                                             index=idx_unique, 
                                             columns=col_unique)
    

    Output:

           2     4     5
    1  100.0   0.0   0.0
    2    0.0  50.0  50.0
    

    With this understanding, you can also keep working with m directly and still use idx_unique and col_unique for slicing. E.g.:

    bool_idx = col_unique.isin(['4','5'])
    
    m[:, bool_idx].sum(axis=1)
    
    matrix([[  0.],
            [100.]])
    

    Compare:

    df_m.loc[:, bool_idx].sum(axis=1)
    
    1        0
    2    100.0
    dtype: Sparse[float64, 0]