pythonpandasdataframesorting

How to order dataframe based on substrings in column?


I have a dataframe with string rows. I want to sort the entire dataframe based on the strings in this column. However, there are some rows that contains a substring that is the text in another row so it is messing up the order. My dataframe looks like:

  col1           col2        col3       col4
Animal           Tiger       Cat         Dog
Name             Adam       Grace       Julia
Street Name1    Pine St    Crown St    Palm Ave
Street Name2    Grey St    Tree St     New St
Color           Green        Blue       Yellow
Interest         Yes         No           Yes
Low Interest     No          No           Yes
High Interest    Yes         No           Yes
City2             x          z             y
City1             m          r             t

I want to sort it so it looks like:

  col1           col2        col3       col4
Name             Adam       Grace       Julia
Street Name1    Pine St    Crown St    Palm Ave
Street Name2    Grey St    Tree St     New St   
City1             m          r             t
City2             x          z             y
Interest         Yes         No           Yes
High Interest    Yes         No           Yes    
Low Interest     No          No           Yes
Animal           Tiger       Cat         Dog
Color           Green        Blue       Yellow

I tried using :

order = ['Name', 'Street Name', 'City', 'Interest','High Interest','Low Interest', 'Animal', 'Color']
df['order'] = df['col1'].apply(order)
df = df.sort_values(by = 'order').drop(columns = 'order')

However, this created an issue where 'Street Name' rows were coming before 'Name' since 'Name' is in both. Like:

     col1           col2        col3       col4
    Street Name1    Pine St    Crown St    Palm Ave
    Street Name2    Grey St    Tree St     New St   
    Name             Adam       Grace       Julia

How can I order this data frame so that it is in the proper order even if it is present substring for another row?

UPDATE: I can't explicitly list 'Street Name1', 'Street Name2' since the number at the end could change. For different scenarios, it could be also 'Street Name1', 'Street Name2'.... 'Street Name7'. So I can't explicitly define the numbers.


Solution

  • You could use str.extract with a crafted regex to get the leading parts of the names, then convert to Categorical and sort with sort_values/numpy.lexsort:

    import re
    
    pattern = '|'.join(map(re.escape, sorted(order, key=len, reverse=True)))
    
    idx = pd.Series(pd.Categorical(df['col1'].str.extract(f'({pattern})',
                                                          expand=False),
                                   categories=order, ordered=True),
                    index=df.index).sort_values(kind='stable').index
    
    out = df.reindex(idx)
    

    If you also want to sort based on the part after the keywords as secondary key:

    import re
    import numpy
    
    order = ['Name', 'Street Name', 'City', 'Interest','High Interest','Low Interest', 'Animal', 'Color']
    
    pattern = '|'.join(map(re.escape, sorted(order, key=len, reverse=True)))
    tmp = df['col1'].str.extract(f'({pattern})(.*)')
    
    out = df.iloc[np.lexsort([pd.to_numeric(tmp[1], errors='coerce'),
                              pd.Categorical(tmp[0], categories=order,
                                             ordered=True).codes])]
    

    Output:

                col1     col2      col3      col4
    1           Name     Adam     Grace     Julia
    2   Street Name1  Pine St  Crown St  Palm Ave
    3   Street Name2  Grey St   Tree St    New St
    9          City1        m         r         t
    8          City2        x         z         y
    5       Interest      Yes        No       Yes
    7  High Interest      Yes        No       Yes
    6   Low Interest       No        No       Yes
    0         Animal    Tiger       Cat       Dog
    4          Color    Green      Blue    Yellow
    

    Intermediate tmp:

                   0  1
    0         Animal   
    1           Name   
    2    Street Name  1
    3    Street Name  2
    4          Color   
    5       Interest   
    6   Low Interest   
    7  High Interest   
    8           City  2
    9           City  1