
Enumerate each row for each group in a DataFrame


In pandas, how can I add a new column which enumerates rows based on a given grouping?

For instance, assume the following DataFrame:

import pandas as pd
import numpy as np

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
  col_a  col_b
0     A      0
1     B      1
2     C      2
3     A      3
4     A      4
5     C      5
6     B      6
7     B      7
8     A      8
9     C      9

I'd like to add a col_c that gives each row's position (1, 2, 3, ...) within its group, grouping by col_a and sorting by col_b.

Desired output:

  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3

I'm struggling to get to col_c. The grouping and sorting are straightforward with .sort_values(by=['col_a', 'col_b']) (.sort_index(by=...) in older pandas versions); what remains is creating the new column and labeling each row.


Solution

  • There's cumcount, for precisely this case:

    df['col_c'] = df.groupby('col_a').cumcount()
    

    As it says in the docs:

    Number each item in each group from 0 to the length of that group - 1.


    Original answer (written before cumcount was added to pandas).

    You could create a helper function to do this:

    def add_col_c(x):
        x['col_c'] = np.arange(len(x))
        return x
    

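    First sort by column col_a (this was df.sort in the pandas version used at the time; current versions use sort_values, and may order rows within each group differently from the output below):

    In [11]: df.sort_values('col_a', inplace=True)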
    

    then apply this function across each group:

    In [12]: g = df.groupby('col_a', as_index=False)
    
    In [13]: g.apply(add_col_c)
    Out[13]:
      col_a  col_b  col_c
    3     A      3      0
    8     A      8      1
    0     A      0      2
    4     A      4      3
    6     B      6      0
    1     B      1      1
    7     B      7      2
    9     C      9      0
    2     C      2      1
    5     C      5      2
    

    To number from 1 instead of 0, you could use np.arange(1, len(x) + 1).
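
Putting the pieces together for the question's data, here is a minimal sketch, assuming a current pandas version where sort_values replaces the older sort calls. It sorts by both columns, numbers the rows with cumcount() + 1, and cross-checks the result against the np.arange idea from the helper function. Using transform for that cross-check is my own substitution for the apply call above, not part of the original answer.

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Sort so that rows are ordered by col_b within each col_a group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount numbers rows 0..n-1 within each group; add 1 to start at 1.
df['col_c'] = df.groupby('col_a').cumcount() + 1

# Cross-check: the np.arange numbering from the helper function,
# computed per group with transform (an assumption, see the note above).
via_arange = df.groupby('col_a')['col_b'].transform(
    lambda s: np.arange(1, len(s) + 1)
)
assert (df['col_c'] == via_arange).all()

print(df)
```

Because cumcount numbers rows in the order they currently appear in the DataFrame, the sort has to happen before the numbering; with the sort on both columns, this should reproduce the desired output from the question.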