pythonpy-datatable

Pydatatable enumerate rows within each group


Given the following datatable

DT = dt.Frame({'A':['A','A','A','B','B','B'],
               'B':['a','a','b','a','a','a'],
               })

I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:

     A   B   C
 0   A   a   1
 1   A   a   2
 2   A   b   1
 3   B   a   1
 4   B   a   2
 5   B   a   3

According to this thread for pandas cumcount() or rank() would be options, but it does not seem to be defined for pydatatable:

DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]

a) How can I number the rows within groups?

b) Is there a comprehensive resource with all the currently available functions for pydatatable?


Solution

  • Update:

    Datatable now has a cumcount function in dev :

    DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
    
       | A      B          C
       | str32  str32  int64
    -- + -----  -----  -----
     0 | A      a          1
     1 | A      a          2
     2 | A      b          1
     3 | B      a          1
     4 | B      a          2
     5 | B      a          3
    [6 rows x 3 columns]
    

    Old answer:

    This is a hack, in time there should be an in-built way to do cumulative count, or even take advantage of itertools or other performant tools within python while still being very fast :

    Step 1 : Get count of columns A and B and export to list

    result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
    

    Step 2 : Use a combination of itertools chain and list comprehension to get the cumulative counts :

    from itertools import chain
    
    cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
    

    Step 3: Assign result back to DT

    DT['C'] = dt.Frame(tuple(cumcount))
    
    print(DT)
    
    
        A       B   C
       ▪▪▪▪   ▪▪▪▪  ▪▪▪▪
    0   A      a    1
    1   A      a    2
    2   A      b    1
    3   B      a    1
    4   B      a    2
    5   B      a    3
    6 rows × 3 columns