pythonpandasdataframegroup-by

Pandas Dataframe groupby count number of rows quickly


I have a dataframe that looks like

Class_ID  Student_ID  feature
1         4           31
1         4           86
1         4           2
1         2           11
1         2           0
5         3           2
5         9           3
5         9           2

and I would like to count the number of times a student appear in each Class_ID, so the desired outcome looks like this:

Class_ID  Student_ID  feature  count
1         4           31       3
1         4           86       3
1         4           2        3
1         2           11       2
1         2           0        2
5         3           2        1
5         9           3        2
5         9           2        2

and here's how I did it:

df['dummy'] = 1
df['count'] = df.groupby(['Class_ID', 'Student_ID'], group_keys=False)['dummy'].transform(lambda x: x.sum())

It works fine, but my actual dataframe is rather large (~ 1M rows), and the code is quite slow, so i would like to ask is there any quicker way/ better way to do it? Thanks.


Solution

  • You can avoid creating a dummy column and transform with "size" instead:

    df["count"] = df.groupby(["Class_ID", "Student_ID"]).transform("size")
    

    On my machine it's about 3/4x times faster for 1 million rows.

    Output:

       Class_ID  Student_ID  feature  count
    0         1           4       31      3
    1         1           4       86      3
    2         1           4        2      3
    3         1           2       11      2
    4         1           2        0      2
    5         5           3        2      1
    6         5           9        3      2
    7         5           9        2      2