My question is nearly identical to Finding frequency of each value in all categorical columns across a dataframe, but I need the probabilities, instead of the frequencies. We can use the same example dataframe:
df = pd.DataFrame(
{'sub_code': ['CSE01', 'CSE01', 'CSE01', 'CSE02', 'CSE03', 'CSE04',
'CSE05', 'CSE06'],
'stud_level': [101, 101, 101, 101, 101, 101, 101, 101],
'grade': ['STA', 'STA', 'PSA', 'STA', 'STA', 'SSA', 'PSA', 'QSA']})
I tried adapting this answer in the following way:
out = (df.select_dtypes(object)
.melt(var_name="Variable", value_name="Class")
.value_counts(dropna=False, normalize=True)
.reset_index(name="Probability")
.sort_values(by=['Variable', 'Class'], ascending=[True, True])
.reset_index(drop=True))
However, the code doesn't work, because the sum of the class probabilities for each variable is not 1. What am I doing wrong?
IIUC, you need to to groupby
on 'Variable'
after melting. This will give you the probabilities for each value in 'Class'
in relation to each value in 'Variable'
.
out = (
df.select_dtypes(object)
.melt(var_name="Variable", value_name="Class")
.groupby("Variable") # add this line
.value_counts(dropna=False, normalize=True)
.reset_index(name="Probability")
.sort_values(by=["Variable", "Class"], ascending=[True, True])
.reset_index(drop=True)
)
Variable Class Probability
0 grade PSA 0.250
1 grade QSA 0.125
2 grade SSA 0.125
3 grade STA 0.500
4 sub_code CSE01 0.375
5 sub_code CSE02 0.125
6 sub_code CSE03 0.125
7 sub_code CSE04 0.125
8 sub_code CSE05 0.125
9 sub_code CSE06 0.125