pythonpandasapply

How to assign scores to each value in pandas columns based on percentile range, getting `Truth value of a Series is ambiguous.` error


I need to assign scores to each of the values in many columns of a pandas dataframe, depending on the percentile score range each value falls between.

I have created a function:

import pandas as pd
import numpy as np

def get_percentiles(x, percentile_array):
    percentile_array = np.sort(np.array(percentile_array))
    if x < x.quantile(percentile_array[0]) < 0:
        return 1
    elif (x >= x.quantile(percentile_array[0]) & (x < x.quantile(percentile_array[1]):
        return 2
    elif (x >= x.quantile(percentile_array[1]) & (x < x.quantile(percentile_array[2]):
        return 3
    elif (x >= x.quantile(percentile_array[2]) & (x < x.quantile(percentile_array[3]):
        return 4
    else:
        return 5

Sample data:

df = pd.DataFrame({'col1' : [1,10,5,9,15,4],
                   'col2' : [4,10,15,19,3,2],
                   'col3' : [10,5,6,9,1,24]})

When I try to run the function using apply:

percentile_array = [0.05, 0.25, 0.5, 0.75]

df.apply(lambda x : get_percentiles(x, percentile_array), result_type = 'expand')

I get below error:

Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

The expected output is new dataframe with 3 columns that has the scores between 1 and 5 depending on which percentile range each value in each column falls under


Solution

  • IIUC, you could use rank to compute the percentile (per column), then pandas.cut to bin the values to your reference:

    percentile_array = [0.05, 0.25, 0.5, 0.75]
    bins = [-np.inf]+percentile_array+[np.inf]
    labels = [1, 2, 3, 4, 5]
    
    out = (df.rank(pct=True)
             .apply(lambda c: pd.cut(c, bins=bins, labels=labels))
          )
    

    Alternatively, with numpy.searchsorted:

    percentile_array = [0.05, 0.25, 0.5, 0.75]
    bins = [-np.inf]+percentile_array+[np.inf]
    
    out = pd.DataFrame(np.searchsorted(bins, df.rank(pct=True)),
                       columns=df.columns, index=df.index)
    

    Output:

      col1 col2 col3
    0    2    3    5
    1    5    4    3
    2    3    5    3
    3    4    5    4
    4    5    3    2
    5    3    2    5
    

    Intermediate:

    df.rank(pct=True)
    
           col1      col2      col3
    0  0.166667  0.500000  0.833333
    1  0.833333  0.666667  0.333333
    2  0.500000  0.833333  0.500000
    3  0.666667  1.000000  0.666667
    4  1.000000  0.333333  0.166667
    5  0.333333  0.166667  1.000000
    

    For the original function to work, you would have needed something like:

    def get_percentiles(x, percentile_array):
        percentile_array = np.sort(np.array(percentile_array))
        m1 = x < x.quantile(percentile_array[0])
        m2 = (x >= x.quantile(percentile_array[0])) & (x < x.quantile(percentile_array[1]))
        m3 = (x >= x.quantile(percentile_array[1])) & (x < x.quantile(percentile_array[2]))
        m4 = (x >= x.quantile(percentile_array[2])) & (x < x.quantile(percentile_array[3]))
    
        return np.select([m1, m2, m3, m4], [1, 2, 3, 4], 5)