I have the table like below which is stored in a DataFrame. I want to add the tolerance_level = [0, 1, 5, 10, 20, 30, 50, 100, 200, 300, 500, 700] and get the count of ID's for columns A, B, C that fall under the tolerance_level
For example: in the below table count of ID's for column A having tolerance_level of 10% = 2 count of ID's for column A having tolerance_level of 0% = 1 count of ID's for columns A having tolerance_level of 100% = 1 and 700% =1
And count of ID's for column B having tolerance_level of 30% = 2 and so on..... And if a columns has % which is not defined in the tolerance_level then the ID falls under the nearest tolerance_level. For example, if a columns has 900% then it will be in >700% tolerance_level
ID | A | B | C |
---|---|---|---|
1 | 0% | 1% | 5% |
3 | 10% | 30% | 50% |
6 | 100% | 300% | 500% |
7 | 700% | 900% | 50% |
10 | 10% | 30% | 50% |
So, the result would be something like
tolerance_level | A | B | C |
---|---|---|---|
0% | 17 | 100 | 50 |
1% | 10 | 50 | 70 |
5% | 60 | 80 | 40 |
Use merge_asof
for add new column filled by nearest values from helper DataFrame
and count output by crosstab
:
df = pd.DataFrame({'ID': [1, 3, 6, 7, 10],
'A': ['0%', '17%', '108%', '700%', '10%'],
'B': ['1%', '30%', '299%', '900%', '30%'],
'C': ['5%', '50%', '500%', '51%', pd.Timestamp('12-10-2000')]})
print (df)
ID A B C
0 1 0% 1% 5%
1 3 17% 30% 50%
2 6 108% 299% 500%
3 7 700% 900% 51%
4 10 10% 30% 2000-12-10 00:00:00
tolerance_level = [0, 1, 5, 10, 20, 30, 50, 100, 200, 300, 500, 700]
df1 = pd.DataFrame({'tolerance_level':tolerance_level})
df1['tolerance_level'] = df1['tolerance_level'].astype('float')
df2 = df.melt('ID', value_name='tol')
df2['tol'] = pd.to_numeric(df2['tol'].str.rstrip('%'), errors='coerce')
df3 = pd.merge_asof(df2.dropna(subset=['tol']).sort_values('tol'),
df1,
left_on='tol',
right_on='tolerance_level',
direction='nearest')
out = (pd.crosstab(df3['tolerance_level'], df3['variable'])
.rename_axis(columns=None)
.rename(lambda x: f"{x}%")
.reset_index())
print (out)
tolerance_level A B C
0 0% 1 0 0
1 1% 0 1 0
2 5% 0 0 1
3 10% 1 0 0
4 20% 1 0 0
5 30% 0 2 0
6 50% 0 0 3
7 100% 1 0 0
8 300% 0 1 0
9 500% 0 0 1
10 700% 1 1 0