I performed OLS regression on a dataset and I have the predicted Diagnostic_Score
but the mapping table (norms
) can have two operators - e.g. >=
and <=
. Is there a way to map the predicted score to the percentile?
My first thought was to map the scores that I can match and the ones that do not match I know must to be associated with a percentile that has an operator in the Diagnostic_Score
column. I could then use numpy.select
and create the conditions and choices. Does that approach make sense or is there an easier way to map the test percentile to the predicted score without having to manually create a 108 conditions and 108 choices for numpy.select
?
Here is a sample
import pandas as pd
d = {'percentile': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2}, 'Subject': {0: 'Math', 1: 'Math', 2: 'Math', 3: 'Math', 4: 'Math', 5: 'Math', 6: 'Math', 7: 'Math', 8: 'Math', 9: 'Math', 10: 'Math', 11: 'Math', 12: 'Math', 13: 'Math', 14: 'Math', 15: 'Math', 16: 'Math', 17: 'Math'}, 'Term': {0: 'Fall', 1: 'Fall', 2: 'Fall', 3: 'Fall', 4: 'Fall', 5: 'Fall', 6: 'Fall', 7: 'Fall', 8: 'Fall', 9: 'Fall', 10: 'Fall', 11: 'Fall', 12: 'Fall', 13: 'Fall', 14: 'Fall', 15: 'Fall', 16: 'Fall', 17: 'Fall'}, 'Grade_Level': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 0, 10: 1, 11: 2, 12: 3, 13: 4, 14: 5, 15: 6, 16: 7, 17: 8}, 'Diagnostic_Score': {0: '<=296', 1: '<=322', 2: '<=352', 3: '<=372', 4: '<=390', 5: '<=405', 6: '<=412', 7: '<=423', 8: '<=434', 9: '297', 10: '323', 11: '353', 12: '373', 13: '391', 14: '406', 15: '413', 16: '424', 17: '435'}}
norms = pd.DataFrame(d)
df = pd.DataFrame({'Term': ['Fall', 'Fall', 'Fall', 'Fall'],
'Subject': ['Math', 'Math', 'Math', 'Math'],
'Grade_Level': [0, 3, 5, 7],
'Predicted Score': [290, 300, 406, 424]})
My expected output is
Term Subject Grade_Level Predicted Score Percentile
0 Fall Math 0 290 1
1 Fall Math 3 300 1
2 Fall Math 5 406 2
3 Fall Math 7 424 2
norms table
percentile Subject Term Grade_Level Diagnostic_Score
0 1 Math Fall 0 <=296
1 1 Math Fall 1 <=322
2 1 Math Fall 2 <=352
3 1 Math Fall 3 <=372
4 1 Math Fall 4 <=390
5 1 Math Fall 5 <=405
6 1 Math Fall 6 <=412
7 1 Math Fall 7 <=423
8 1 Math Fall 8 <=434
9 2 Math Fall 0 297
10 2 Math Fall 1 323
11 2 Math Fall 2 353
12 2 Math Fall 3 373
13 2 Math Fall 4 391
14 2 Math Fall 5 406
15 2 Math Fall 6 413
16 2 Math Fall 7 424
17 2 Math Fall 8 435
...
99 Math Spring 8 >=585
You can extract
the prefix and perform a merge
and a merge_asof
:
# add group/score
norms[['group', 'Predicted Score']] = (
norms['Diagnostic_Score']
.astype(str)
.str.extract(r'([<>]=|)(\d+)')
.astype({1: 'int'})
)
# ensure scores are sorted (for the merge_asof)
norms.sort_values(by='Predicted Score', inplace=True)
# define groups
low = norms['group'].eq('<=')
high = norms['group'].eq('>=')
exact = ~(low|high)
# merge
s_ex = df.merge(norms[exact], on=['Subject', 'Term', 'Predicted Score'], how='left')['percentile']
s_lo = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
norms[low], by=['Subject', 'Term'], on=['Predicted Score'],
direction='forward',
).set_index('index')['percentile']
s_hi = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
norms[high], by=['Subject', 'Term'], on=['Predicted Score'],
direction='backward'
).set_index('index')['percentile']
# combine
df['Percentile'] = s_ex.fillna(s_lo).fillna(s_hi).astype(norms['percentile'].dtype)
You can actually simplify to two merge_asof
:
low = norms['group'].ne('>=')
s_lo = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
norms[low], by=['Subject', 'Term'], on=['Predicted Score'],
direction='forward',
).set_index('index')['percentile']
s_hi = pd.merge_asof(df.reset_index().sort_values(by='Predicted Score'),
norms[high], by=['Subject', 'Term'], on=['Predicted Score'],
direction='backward'
).set_index('index')['percentile']
df['Percentile'] = s_lo.fillna(s_hi).astype(norms['percentile'].dtype)
Output:
Term Subject Grade_Level Predicted Score Percentile
0 Fall Math 0 290 1
1 Fall Math 3 300 1
2 Fall Math 5 406 2
3 Fall Math 7 424 2