pythonpandasdataframerunning-count

Identify Duplicated rows with Additional Column


I have the following Dataframe:

   PplNum  RoomNum  Value
0       1        0    265
1       1       12    170
2       2        0    297
3       2       12     85
4       2        0     41
5       2       12    144

Generally the PplNum and RoomNum is generated like this, and it will always follow this format:

for ppl in [1,2,2]:
    for room in [0, 12]:
        print(ppl, room)
1 0
1 12
2 0
2 12
2 0
2 12

But now what I would like to achieve is to mark those duplicates combinations of PplNum and RoomNum so that I can know which combinationss are the first occurrence, which are the second occurrence and so on... So the expected output Dataframe will be like this:

    PplNum  RoomNum  Value  C
0       1        0    265  1
1       1       12    170  1
2       2        0    297  1
3       2       12     85  1
4       2        0     41  2
5       2       12    144  2

Solution

  • you can do it using groupby() together with cumcount() function:

    In [102]: df['C'] = df.groupby(['PplNum','RoomNum']).cumcount() + 1
    
    In [103]: df
    Out[103]:
       PplNum  RoomNum  Value  C
    0       1        0    265  1
    1       1       12    170  1
    2       2        0    297  1
    3       2       12     85  1
    4       2        0     41  2
    5       2       12    144  2
    

    Explanation:

    In [101]: df.groupby(['PplNum','RoomNum']).cumcount() + 1
    Out[101]:
    0    1
    1    1
    2    1
    3    1
    4    2
    5    2
    dtype: int64
    

    Group-By examples