python pandas dataframe

How to create a new pandas column with increasing sequence id, but retain same value within each group


I have a pandas dataframe that looks like the one below:

import pandas as pd

df = pd.DataFrame({'hourOfDay': [5, 5, 8, 8, 13, 13],
                   'category': ['pageA', 'pageB', 'pageA', 'pageB', 'pageA', 'pageB']})

    hourOfDay   category
0   5           pageA
1   5           pageB
2   8           pageA
3   8           pageB
4   13          pageA
5   13          pageB

Now I want to create a new column with a monotonically increasing id that has the same value within each group (hourOfDay). The expected dataframe is shown below.

    hourOfDay   category    index
0           5   pageA       1
1           5   pageB       1
2           8   pageA       2
3           8   pageB       2
4          13   pageA       3
5          13   pageB       3

For simplicity, assume for now that the category column has only two values, although this may be extended later. When I group by hourOfDay, every page category within a group should get the same value. I can do this by splitting the main dataframe into two dataframes (filtered by category), sorting each one, creating a new column with df.groupby("hourOfDay").cumcount(), and finally merging the two dataframes. But this approach seems far too convoluted, so I was wondering whether there is a simpler way to achieve the same thing.


Solution

  • If you need the same index for every row with the same hourOfDay, use GroupBy.ngroup:

    df['index'] = df.groupby('hourOfDay', sort=True).ngroup() + 1
    

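    Or use factorize after sorting by hourOfDay (factorize numbers values in order of first appearance, so the sort is what makes the ids increase with hourOfDay):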

    df = df.sort_values('hourOfDay')
    df['index'] = pd.factorize(df['hourOfDay'])[0] + 1
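
As a rough end-to-end check of the two approaches above, here is a minimal sketch on a deliberately unsorted copy of the example data. The column names index_ngroup, index_factorize and index_rank are made up for illustration. It also assumes that passing sort=True to pd.factorize is acceptable in place of sorting the dataframe first, and it shows a dense rank as a third option that is not part of the original answer.

```python
import pandas as pd

# Same data as in the question, but with the rows shuffled on purpose
df = pd.DataFrame({
    'hourOfDay': [13, 5, 8, 5, 13, 8],
    'category': ['pageA', 'pageA', 'pageA', 'pageB', 'pageB', 'pageB'],
})

# ngroup numbers each group starting at 0, in sorted key order when sort=True
df['index_ngroup'] = df.groupby('hourOfDay', sort=True).ngroup() + 1

# factorize with sort=True assigns codes by sorted unique value,
# so the rows do not have to be reordered beforehand
codes, uniques = pd.factorize(df['hourOfDay'], sort=True)
df['index_factorize'] = codes + 1

# Alternative (an assumption, not from the answer): a dense rank of the key
df['index_rank'] = df['hourOfDay'].rank(method='dense').astype(int)

print(df)
```

All three columns should agree row by row. The row order of df is left untouched, which may matter if the original order carries meaning.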