I have a dataframe that looks like
Class_ID Student_ID feature
1 4 31
1 4 86
1 4 2
1 2 11
1 2 0
5 3 2
5 9 3
5 9 2
and I would like to count the number of times a student appear in each Class_ID
, so the desired outcome looks like this:
Class_ID Student_ID feature count
1 4 31 3
1 4 86 3
1 4 2 3
1 2 11 2
1 2 0 2
5 3 2 1
5 9 3 2
5 9 2 2
and here's how I did it:
df['dummy'] = 1
df['count'] = df.groupby(['Class_ID', 'Student_ID'], group_keys=False)['dummy'].transform(lambda x: x.sum())
It works fine, but my actual dataframe is rather large (~ 1M rows), and the code is quite slow, so i would like to ask is there any quicker way/ better way to do it? Thanks.
You can avoid creating a dummy column and transform
with "size"
instead:
df["count"] = df.groupby(["Class_ID", "Student_ID"]).transform("size")
On my machine it's about 3/4x times faster for 1 million rows.
Output:
Class_ID Student_ID feature count
0 1 4 31 3
1 1 4 86 3
2 1 4 2 3
3 1 2 11 2
4 1 2 0 2
5 5 3 2 1
6 5 9 3 2
7 5 9 2 2