pythonpandasdataframe

Counting based on criterias in pandas


I have a pandas DataFrame like this:

d={'gen':['A','A','A','A','B','B','B','B','C','D','D','D','D','D','D','D','D','D','D'], 'diff':pd.Series([1,1,1,1,2,1,1,1,1,1,1,1,1,2,2,1,1,1], index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17])}
wk = pd.DataFrame(data=d, index=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18])

my goal is to count how many occurences the gen have based on diff with some criterias:

  1. Count if diff is 1, and
  2. gen at index i equal to gen at index i+1, and
  3. If there's a consecutive 1's, then the count will be like this: if (number of consecutive 1) %2 == 0: count = number of consecutive/2, if not: count = (number of consecutive - 1) /2

with this code, I can achieve what I want:

k=0
j=0
z={}
for i in range(wk.shape[0]):
    if wk['diff'][i] == 1:
        if wk['gen'][i] == wk['gen'][i+1]:
            if j == 0:
                j+=2
            if j%2==0:
                k+=1                
            if j>=2:
                j+=1
            z[wk['gen'][i]] = k
        if wk['gen'][i] != wk['gen'][i+1]:
            j=0
            k=0

and the result of dictionary z is: {'A': 2, 'B': 1, 'D': 4}

but when I used a bigger data (more than 410,000 records), the counter didn't always start from 0 when the gen at index i is not equal to gen at index i+1. What is wrong with my code?


Solution

  • Count the consecutive 1s per group with groupby.count, perform a floordiv by 2 (equivalent to your x/2 if x%2==0 else (x-1)/2), and aggregate again with groupby.sum before converting to_dict:

    group = wk['diff'].ne(wk.groupby('gen')['diff'].shift()).cumsum()
    m = wk['diff'].eq(1)
    
    out = (wk[m].groupby(['gen', group])      # keep only 1s and group
                 ['diff'].count().floordiv(2) # count and floor division
                .groupby(level='gen').sum()   # sum per "gen" group
                .loc[lambda x: x>0].to_dict() # only counts > 0 and convert to dict
          )
    

    Output:

    {'A': 2, 'B': 1, 'D': 3}
    
    Intermediates

    group and m:

       gen  diff  group      m
    0    A   1.0      1   True
    1    A   1.0      1   True
    2    A   1.0      1   True
    3    A   1.0      1   True
    4    B   2.0      2  False
    5    B   1.0      3   True
    6    B   1.0      3   True
    7    B   1.0      3   True
    8    C   1.0      4   True
    9    D   1.0      5   True
    10   D   1.0      5   True
    11   D   1.0      5   True
    12   D   1.0      5   True
    13   D   2.0      6  False
    14   D   2.0      6  False
    15   D   1.0      7   True
    16   D   1.0      7   True
    17   D   1.0      7   True
    18   D   NaN      8  False