I want to sort a pandas frame by multiple columns. The constraint I have which is giving me trouble ist, that one of the columns (the first) needs to be natural sorted, so I tried the following:
sortedFrame = inFrame.sort_values(by=['Col_Arg', 'Col_Step'],
key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg'])))
but this code results in the frame only being sorted by Col_Arg. E.g. the input frame
Col_Arg | Col_Step |
---|---|
1 First | 20 |
2 Second | 10 |
1 First | 10 |
results in
Col_Arg | Col_Step |
---|---|
1 First | 20 |
1 First | 10 |
2 Second | 10 |
You can imagine Col_Arg as an indexed headline. Inside that indexed headline are steps to execute (Col_2). Since Col_Arg is an string which cannot be transformed in an integer, I want to use natsort, which is working fine to sort for Col_Arg alone, but is not working with multiple column names. The easy way is just to introduce an additional index for the headlines. Then I could just easily use:
sortedFrame = inFrame.sort_values(['Col_Arg_Idx', 'Col_2'])
Since I am quite new to python and pandas I am curious and I want to understand what is my misconception and how you would do it, since I think I should be possible. I can imagine that it has to do with the usage of key:
key: keycallable, optional Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
But that does not mean it is applied on all... I am a little confused.
In order to introduce intermediate steps later, the indexes are initially incremented by 10 starting from 10.
Thanks in advance.
So the problem with your code is that when you use multiple columns inside pandas.DataFrame.sort_values
, alongside the key
parameter, pandas takes each series in the order you've defined inside the parameter by=["col1", "col2"]
and calls that function passing the column values to it in the order they appeared before calling sort_values
.
For example, let's define a simple function that only prints the arguments it receives, and use it as our key
parameter:
import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted
# == Sample DataFrame ============================
df = pd.DataFrame(
{
'Col_Arg': ['First', 'Second', 'First', 'Third', 'Fourth', 'Tenth'],
'Col_Step': [20, 10, 10, 30, 20, 5]
}
)
# == Simple Key Function ============================
def print_values(x):
print(x)
return x
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=lambda x: print_values(x)
)
# Returns:
"""
0 1 First
1 2 Second
2 1 First
3 3 Third
4 4 Fourth
5 10 Tenth
Name: Col_Arg, dtype: object
0 20
1 10
2 10
3 30
4 20
5 5
Name: Col_Step, dtype: int64
"""
So basically pandas.DataFrame.sort_values
passes each column as a series, to your function, and it expects your function to do some transformation to make the column "sortable". Here's the parameter description from pandas documentation:
key
: callable
, optional
Description:
Apply the key
function to the values before sorting. This is similar to the key argument in the builtin sorted()
function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
In other words, if you want to sort both columns in the same pandas.DataFrame.sort_values
operation, you need to pass in a function that's able to convert 'Col_Arg'
to a numeric form, while returning 'Col_Step'
unmodified. Additionally, by using inFrame
in key=lambda x:np.argsort(index_natsorted(inFrame['Col_Arg']))
instead of passing x
, the key function will sort values based on the inFrame
indexes in the order they existed prior to calling the sort_values
function. Here's an example:
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=lambda x: print_values(np.argsort(index_natsorted(df["Col_Step"])))
)
# Prints:
"""
[3 1 2 5 4 0]
[3 1 2 5 4 0]
"""
So, the first time the key
function gets called it sorts the dataframe indexes using [3 1 2 5 4 0]
, then it applies the same order as it did before, but now all indexes have already been moved, so it ends up ruining the sort operation.
As stated previously, the key
function takes each column value in the order they exist prior to the sort operation. So we need to create a function that converts 'Col_Arg'
values to numbers, instead of trying to sort inside the key function. There's a package called number-parser that does that for you. To install it, run the following code:
pip install number-parser
Then, you can create a function to use inside key
like so:
import numpy as np
import pandas as pd
from number_parser import parse_ordinal
def custom_sort(col: pd.Series) -> pd.Series:
if col.name == "Col_Arg":
return col.apply(parse_ordinal)
return col
df.sort_values(
by=["Col_Arg", "Col_Step"],
key=custom_sort
)
# Returns:
"""
Col_Arg Col_Step
2 First 10
0 First 20
1 Second 10
3 Third 30
4 Fourth 20
5 Tenth 5
"""
Solution 2: another option would be to do something like this:
import pandas as pd
import numpy as np
from natsort import index_natsorted, natsort_keygen, natsorted
df.sort_values(
by="Col_Arg", key=lambda col: np.argsort(index_natsorted(col))
).groupby("Col_Arg", as_index=False).apply(
lambda grp: grp.sort_values("Col_Step")
).reset_index(
drop=True
)
# Returns:
"""
Col_Arg Col_Step
0 First 10
1 First 20
2 Fourth 20
3 Second 10
4 Tenth 5
5 Third 30
"""