I have a dataframe with string rows. I want to sort the entire dataframe based on the strings in this column. However, there are some rows that contains a substring that is the text in another row so it is messing up the order. My dataframe looks like:
col1 col2 col3 col4
Animal Tiger Cat Dog
Name Adam Grace Julia
Street Name1 Pine St Crown St Palm Ave
Street Name2 Grey St Tree St New St
Color Green Blue Yellow
Interest Yes No Yes
Low Interest No No Yes
High Interest Yes No Yes
City2 x z y
City1 m r t
I want to sort it so it looks like:
col1 col2 col3 col4
Name Adam Grace Julia
Street Name1 Pine St Crown St Palm Ave
Street Name2 Grey St Tree St New St
City1 m r t
City2 x z y
Interest Yes No Yes
High Interest Yes No Yes
Low Interest No No Yes
Animal Tiger Cat Dog
Color Green Blue Yellow
I tried using :
order = ['Name', 'Street Name', 'City', 'Interest','High Interest','Low Interest', 'Animal', 'Color']
df['order'] = df['col1'].apply(order)
df = df.sort_values(by = 'order').drop(columns = 'order')
However, this created an issue where 'Street Name' rows were coming before 'Name' since 'Name' is in both. Like:
col1 col2 col3 col4
Street Name1 Pine St Crown St Palm Ave
Street Name2 Grey St Tree St New St
Name Adam Grace Julia
How can I order this data frame so that it is in the proper order even if it is present substring for another row?
UPDATE: I can't explicitly list 'Street Name1', 'Street Name2' since the number at the end could change. For different scenarios, it could be also 'Street Name1', 'Street Name2'.... 'Street Name7'. So I can't explicitly define the numbers.
You could use str.extract
with a crafted regex to get the leading parts of the names, then convert to Categorical
and sort with sort_values
/numpy.lexsort
:
import re
pattern = '|'.join(map(re.escape, sorted(order, key=len, reverse=True)))
idx = pd.Series(pd.Categorical(df['col1'].str.extract(f'({pattern})',
expand=False),
categories=order, ordered=True),
index=df.index).sort_values(kind='stable').index
out = df.reindex(idx)
If you also want to sort based on the part after the keywords as secondary key:
import re
import numpy
order = ['Name', 'Street Name', 'City', 'Interest','High Interest','Low Interest', 'Animal', 'Color']
pattern = '|'.join(map(re.escape, sorted(order, key=len, reverse=True)))
tmp = df['col1'].str.extract(f'({pattern})(.*)')
out = df.iloc[np.lexsort([pd.to_numeric(tmp[1], errors='coerce'),
pd.Categorical(tmp[0], categories=order,
ordered=True).codes])]
Output:
col1 col2 col3 col4
1 Name Adam Grace Julia
2 Street Name1 Pine St Crown St Palm Ave
3 Street Name2 Grey St Tree St New St
9 City1 m r t
8 City2 x z y
5 Interest Yes No Yes
7 High Interest Yes No Yes
6 Low Interest No No Yes
0 Animal Tiger Cat Dog
4 Color Green Blue Yellow
Intermediate tmp
:
0 1
0 Animal
1 Name
2 Street Name 1
3 Street Name 2
4 Color
5 Interest
6 Low Interest
7 High Interest
8 City 2
9 City 1