I have a dataset with model scores in 3 categories (high, medium and low). The table looks like this:
| Score |
| ------- |
| high |
| high |
| high |
| low |
| low |
| low |
| medium |
| medium |
| medium |
I want to randomly assign these scores to 4 groups: `control`, `treatment 1`, `treatment 2`, and `treatment 3`. The `control` group should have 20% of the observations, and the remaining 80% should be divided equally among the other 3 groups. However, I want the distribution of scores (high, medium and low) to be the same in each group. How can I solve this using Python?
PS: This is just a representation of the actual table, but it will have a lot more observations than this.
You can try `groupby.transform`:
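import numpy as np

cats = ['control', 'treatment 1', 'treatment 2', 'treatment 3']
probs = [.2, .8/3, .8/3, .8/3]
# Draw a group label for every row, independently within each Score level
df['Group'] = (df.groupby('Score')['Score']
                 .transform(lambda x: np.random.choice(cats, size=len(x), p=probs, replace=True))
              )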
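Note that `np.random.choice` draws each label independently, so the 20/80 split and the per-score balance hold only on average; with few rows per score level the realized counts can drift noticeably. If you need the proportions to hold as exactly as the row counts allow, one option is to build the label list with fixed counts per score level and shuffle it. The following is a minimal sketch of that idea, not a definitive implementation: the helper name `assign_exact`, the `Group` column, the fixed seed, and the 45-row sample frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

cats = ['control', 'treatment 1', 'treatment 2', 'treatment 3']
probs = np.array([.2, .8/3, .8/3, .8/3])

def assign_exact(scores, rng):
    """Hypothetical helper: shuffled labels whose counts follow probs as closely as possible."""
    n = len(scores)
    counts = np.floor(probs * n).astype(int)
    # Give any leftover rows to the groups with the largest fractional remainder
    remainder = probs * n - counts
    leftover = n - counts.sum()
    counts[np.argsort(-remainder)[:leftover]] += 1
    labels = np.repeat(cats, counts)
    return rng.permutation(labels)

rng = np.random.default_rng(0)  # seed chosen only to make the example reproducible

# Made-up sample data: 15 rows per score level
df = pd.DataFrame({'Score': ['high'] * 15 + ['low'] * 15 + ['medium'] * 15})
df['Group'] = df.groupby('Score')['Score'].transform(lambda x: assign_exact(x, rng))

# Each score level should show 3 control rows and 4 rows per treatment
print(pd.crosstab(df['Score'], df['Group']))
```

When a score level's row count does not divide cleanly, the counts will differ by at most one row from the target proportions, so the groups stay balanced but not perfectly identical.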