Given the following datatable
DT = dt.Frame({'A':['A','A','A','B','B','B'],
'B':['a','a','b','a','a','a'],
})
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
According to this thread for pandas cumcount() or rank() would be options, but it does not seem to be defined for pydatatable:
DT = DT[:, f[:].extend({'C': cumcount()}),by(f.A,f.B)]
DT = DT[:, f[:].extend({'C': rank(f.B)}),by(f.A,f.B)]
a) How can I number the rows within groups?
b) Is there a comprehensive resource with all the currently available functions for pydatatable?
Update:
Datatable now has a cumcount function in dev :
DT[:, {'C':dt.cumcount() + 1}, by('A', 'B')]
| A B C
| str32 str32 int64
-- + ----- ----- -----
0 | A a 1
1 | A a 2
2 | A b 1
3 | B a 1
4 | B a 2
5 | B a 3
[6 rows x 3 columns]
Old answer:
This is a hack, in time there should be an in-built way to do cumulative count, or even take advantage of itertools
or other performant tools within python while still being very fast :
Step 1 : Get count of columns A
and B
and export to list
result = DT[:, dt.count(), by("A","B")][:,'count'].to_list()
Step 2 : Use a combination of itertools chain and list comprehension
to get the cumulative counts :
from itertools import chain
cumcount = chain.from_iterable([i+1 for i in range(n)] for n in result[0])
Step 3: Assign result back to DT
DT['C'] = dt.Frame(tuple(cumcount))
print(DT)
A B C
▪▪▪▪ ▪▪▪▪ ▪▪▪▪
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
6 rows × 3 columns