pythonpandasdataframesorting

Expand and then sort dataframe based on the value order in the first row


Suppose I have a DataFrame with the following format of strings separated by commas:

Index ColumnName
0 apple,peach,orange,pear,
1 orange, pear,apple
2 pear
3 peach,apple
4 orange

The actual number of rows will be greater than 10,000.

I want to expand the DataFrame and sort the DataFrame by row 0.

My expected output is below, where None is of type NoneType:

Index 0 1 2 3
0 apple peach orange pear
1 apple None orange pear
2 None None None pear
3 apple peach None None
4 None None orange None

I have expanded the data using the following code:

df = df['ColumnName'].str.split(',', expand=True)  # Expand initial DataFrame

However, I am unable to sort or reorder the data as desired despite trying various combinations of df.sort_values().


Solution

  • Here is another way:

    s = df['columnName'].str.strip(',').str.split(', ?').explode()
    
    s.set_axis(pd.MultiIndex.from_frame(s.groupby(s).ngroup().reset_index())).unstack()
    

    Output:

    0          0       1      2     3
    index                            
    0      apple  orange  peach  pear
    1      apple  orange    NaN  pear
    2        NaN     NaN    NaN  pear
    3      apple     NaN  peach   NaN
    4        NaN  orange    NaN   NaN