I have a pandas DataFrame
like this:
d={'gen':['A','A','A','A','B','B','B','B','C','D','D','D','D','D','D','D','D','D','D'], 'diff':pd.Series([1,1,1,1,2,1,1,1,1,1,1,1,1,2,2,1,1,1], index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17])}
wk = pd.DataFrame(data=d, index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18])
my goal is to count how many occurences the gen
have based on diff
with some criterias:
diff
is 1, andgen
at index i
equal to gen
at index i+1
, andwith this code, I can achieve what I want:
k=0
j=0
z={}
for i in range(wk.shape[0]):
if wk['diff'][i] == 1:
if wk['gen'][i] == wk['gen'][i+1]:
if j == 0:
j+=2
if j%2==0:
k+=1
if j>=2:
j+=1
z[wk['gen'][i]] = k
if wk['gen'][i] != wk['gen'][i+1]:
j=0
k=0
and the result of dictionary z
is: {'A': 2, 'B': 1, 'D': 4}
but when I used a bigger data (more than 410,000 records), the counter didn't always start from 0 when the gen
at index i
is not equal to gen
at index i+1
. What is wrong with my code?
Count the consecutive 1s per group with groupby.count
, perform a floordiv
by 2 (equivalent to your x/2 if x%2==0 else (x-1)/2
), and aggregate again with groupby.sum
before converting to_dict
:
group = wk['diff'].ne(wk.groupby('gen')['diff'].shift()).cumsum()
m = wk['diff'].eq(1)
out = (wk[m].groupby(['gen', group]) # keep only 1s and group
['diff'].count().floordiv(2) # count and floor division
.groupby(level='gen').sum() # sum per "gen" group
.loc[lambda x: x>0].to_dict() # only counts > 0 and convert to dict
)
Output:
{'A': 2, 'B': 1, 'D': 3}
group
and m
:
gen diff group m
0 A 1.0 1 True
1 A 1.0 1 True
2 A 1.0 1 True
3 A 1.0 1 True
4 B 2.0 2 False
5 B 1.0 3 True
6 B 1.0 3 True
7 B 1.0 3 True
8 C 1.0 4 True
9 D 1.0 5 True
10 D 1.0 5 True
11 D 1.0 5 True
12 D 1.0 5 True
13 D 2.0 6 False
14 D 2.0 6 False
15 D 1.0 7 True
16 D 1.0 7 True
17 D 1.0 7 True
18 D NaN 8 False