pythonpandas

When applying `Series.clip` to ordered categorical data, I get a wrong result


pandas 1.5.1 numpy 1.23.4 python 3.10.7

Let's say we have a series of ordered categorical data:

s = pd.Series(['zero','one','two','three','four','five','six'], dtype='category')
s = s.cat.reorder_categories(['zero','one','two','three','four','five','six'], ordered=True)
print(s)

Here's an output of the code above:

0     zero
1      one
2      two
3    three
4     four
5     five
6      six
dtype: category
Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'four' < 'five' < 'six']

If I apply clip with only one of lower= or upper= parameter I get the expected output. But if both of them are passed then the output is somewhat strange:

df = pd.concat([s, s.clip(lower='two'), s.clip(upper='four'), s.clip(lower='two', upper='four')], axis=1)
print(df)

Output, note the last column:

       0      1      2     3
0   zero    two   zero  four
1    one    two    one  four
2    two    two    two  four
3  three  three  three   two
4   four   four   four   two
5   five   five   four   two
6    six    six   four   two

I expected to see in the last column a series ['two','two','two','three','four','four',four'], not the one in the output above.

Also I noticed that replacing of the values for lower=..., upper=... doesn't change the result. See:

pd.concat([s.clip(lower='two',upper='four'), s.clip(lower='four',upper='two')], axis=1)

with output:

      0     1
0  four  four
1  four  four
2  four  four
3   two   two
4   two   two
5   two   two
6   two   two

Why clipping of the ordered categorical data behaves like that?


P.S. Bug report at GitHub: https://github.com/pandas-dev/pandas/issues/49217


Solution

  • This looks like a bug. In the source there is a check that the upper/lower bounds are in the correct order. However I think this is mistakenly using the string values of the categories ('four' < 'two'):

       8095         # GH 2747 (arguments were reversed)
       8096         if (
       8097             lower is not None
       8098             and upper is not None
       8099             and is_scalar(lower)
       8100             and is_scalar(upper)
       8101         ):
       8102             lower, upper = min(lower, upper), max(lower, upper)  # <- HERE
    

    Indeed using zfour in place of four:

    s = pd.Series(['zero','one','two','three','zfour','five','six'], dtype='category')
    s = s.cat.reorder_categories(['zero','one','two','three','zfour','five','six'], ordered=True)
    s.clip(lower='two', upper='zfour')
    

    output:

    0      two
    1      two
    2      two
    3    three
    4    zfour
    5    zfour
    6    zfour
    dtype: category
    Categories (7, object): ['zero' < 'one' < 'two' < 'three' < 'zfour' < 'five' < 'six']