pythonpandas

How to map scores from one table to another when the cell contains operators


I performed OLS regression on a dataset and I have the predicted Diagnostic_Score but the mapping table (norms) can have two operators - e.g. >= and <=. Is there a way to map the predicted score to the percentile?

My first thought was to map the scores that I can match and the ones that do not match I know must to be associated with a percentile that has an operator in the Diagnostic_Score column. I could then use numpy.select and create the conditions and choices. Does that approach make sense or is there an easier way to map the test percentile to the predicted score without having to manually create a 108 conditions and 108 choices for numpy.select?

Here is a sample

import pandas as pd

d = {'percentile': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2}, 'Subject': {0: 'Math', 1: 'Math', 2: 'Math', 3: 'Math', 4: 'Math', 5: 'Math', 6: 'Math', 7: 'Math', 8: 'Math', 9: 'Math', 10: 'Math', 11: 'Math', 12: 'Math', 13: 'Math', 14: 'Math', 15: 'Math', 16: 'Math', 17: 'Math'}, 'Term': {0: 'Fall', 1: 'Fall', 2: 'Fall', 3: 'Fall', 4: 'Fall', 5: 'Fall', 6: 'Fall', 7: 'Fall', 8: 'Fall', 9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Fall', 13: 'Fall', 14: 'Fall', 15: 'Fall', 16: 'Fall', 17: 'Fall'}, 'Grade_Level': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 0, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5, 15: 6, 16: 7, 17: 8}, 'Diagnostic_Score': {0: '<=296', 1: '<=322', 2: '<=352', 3: '<=372', 4: '<=390', 5: '<=405', 6: '<=412', 7: '<=423', 8: '<=434', 9: '297', 10: '323', 11: '353', 12: '373', 13: '391', 14: '406', 15: '413', 16: '424', 17: '435'}}

norms = pd.DataFrame(d)

df = pd.DataFrame({'Term': ['Fall', 'Fall', 'Fall', 'Fall'],
               'Subject': ['Math', 'Math', 'Math', 'Math'],
               'Grade_Level': [0, 3, 5, 7],
               'Predicted Score': [290, 300, 406, 424]})

My expected output is

   Term Subject  Grade_Level  Predicted Score  Percentile
0  Fall    Math            0              290           1
1  Fall    Math            3              300           1
2  Fall    Math            5              406           2
3  Fall    Math            7              424           2

norms table

    percentile Subject  Term  Grade_Level Diagnostic_Score
0            1    Math  Fall            0            <=296
1            1    Math  Fall            1            <=322
2            1    Math  Fall            2            <=352
3            1    Math  Fall            3            <=372
4            1    Math  Fall            4            <=390
5            1    Math  Fall            5            <=405
6            1    Math  Fall            6            <=412
7            1    Math  Fall            7            <=423
8            1    Math  Fall            8            <=434
9            2    Math  Fall            0              297
10           2    Math  Fall            1              323
11           2    Math  Fall            2              353
12           2    Math  Fall            3              373
13           2    Math  Fall            4              391
14           2    Math  Fall            5              406
15           2    Math  Fall            6              413
16           2    Math  Fall            7              424
17           2    Math  Fall            8              435
...         
            99    Math  Spring          8            >=585

Solution

  • You can extract the prefix and perform a merge and a merge_asof:

    # add group/score
    norms[['group', 'Predicted Score']] = (
        norms['Diagnostic_Score']
        .astype(str)
        .str.extract(r'([<>]=|)(\d+)')
        .astype({1: 'int'})
    )
    # ensure scores are sorted (for the merge_asof)
    norms.sort_values(by='Predicted Score', inplace=True)
    
    # define groups
    low   = norms['group'].eq('<=')
    high  = norms['group'].eq('>=')
    exact = ~(low|high)
    
    # merge
    s_ex = df.merge(norms[exact], on=['Subject', 'Term', 'Predicted Score'], how='left')['percentile']
    s_lo = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
                         norms[low], by=['Subject', 'Term'], on=['Predicted Score'],
                         direction='forward',
                         ).set_index('index')['percentile']
    s_hi = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
                         norms[high], by=['Subject', 'Term'], on=['Predicted Score'],
                         direction='backward'
                         ).set_index('index')['percentile']
    
    # combine
    df['Percentile'] = s_ex.fillna(s_lo).fillna(s_hi).astype(norms['percentile'].dtype)
    

    You can actually simplify to two merge_asof:

    low   = norms['group'].ne('>=')
    s_lo = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
                         norms[low], by=['Subject', 'Term'], on=['Predicted Score'],
                         direction='forward',
                         ).set_index('index')['percentile']
    s_hi = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
                         norms[high], by=['Subject', 'Term'], on=['Predicted Score'],
                         direction='backward'
                         ).set_index('index')['percentile']
    
    df['Percentile'] = s_lo.fillna(s_hi).astype(norms['percentile'].dtype)
    

    Output:

       Term Subject  Grade_Level  Predicted Score  Percentile
    0  Fall    Math            0              290           1
    1  Fall    Math            3              300           1
    2  Fall    Math            5              406           2
    3  Fall    Math            7              424           2