python pandas numpy

Count first consecutive matches on a group


I am quite new to pandas and I am trying to count, for each car, the first consecutive instances of color in this DataFrame:

    car   color
0   audi  black
1   audi  black
2   audi   blue
3   audi  black
4    bmw   blue
5    bmw  green
6    bmw   blue
7    bmw   blue
8   fiat  green
9   fiat  green
10  fiat  green
11  fiat   blue

Thanks to jezrael, I have code that counts the total number of times each car's first color appears:

import pandas as pd

df = pd.DataFrame(data={
  'car': ['audi', 'audi', 'audi', 'audi', 'bmw', 'bmw', 'bmw', 'bmw', 'fiat', 'fiat', 'fiat', 'fiat'],'color': ['black', 'black', 'blue', 'black', 'blue', 'green', 'blue', 'blue', 'green', 'green', 'green', 'blue']
})

df1 = (df.groupby('car')['color']
          .transform('first')
          .eq(df['color'])
          .astype('i1')
          .groupby(df['car'])
          .sum()
          .reset_index(name='colour_cars'))

print(df1)

This works well for counting the total:

    car  colour_cars
0  audi            3
1   bmw            3
2  fiat            3

But it turns out that what I really need is the length of the first consecutive run only, that is, the count should stop as soon as the color changes. The result should be:

    car  colour_cars
0  audi            2
1   bmw            1
2  fiat            3

I have tried to use an apply function to stop the series .sum() as soon as .eq returns a False. Any help finding a way to break the count once a False is encountered would be greatly appreciated.


Solution

  • Use:

    # Group by car and by a run id that increases whenever color changes
    df1 = (df.groupby(['car', df.color.ne(df.color.shift()).cumsum()])
             .size()
             .reset_index(level=1, drop=True)
             .reset_index(name='colour_cars')
             .drop_duplicates('car'))
    
    print(df1)
        car  colour_cars
    0  audi            2
    3   bmw            1
    6  fiat            3
    

    Details:

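    Create a helper Series that labels each run of consecutive equal values in the color column: df.color.ne(df.color.shift()) is True wherever the color differs from the previous row, and cumsum() turns those flags into run ids. Group by car together with this helper and call GroupBy.size to get the length of each run. The first DataFrame.reset_index (level=1, drop=True) removes the index level created by the helper, the second reset_index converts the car index into a column, and DataFrame.drop_duplicates keeps only the first row, i.e. the first run, for each car. The helper Series looks like this: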

    print (df.color.ne(df.color.shift()).cumsum())
    0     1
    1     1
    2     2
    3     3
    4     4
    5     5
    6     6
    7     6
    8     7
    9     7
    10    7
    11    8
    Name: color, dtype: int32
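
The question asks for a way to stop the count once .eq returns a False. As an alternative to the run-id approach above, one possible sketch keeps the original "compare with the first color" idea and applies a per-car cumulative product, which stays at 1 only while every row so far has matched. The names matches_first, in_leading_run and leading_counts are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['audi'] * 4 + ['bmw'] * 4 + ['fiat'] * 4,
    'color': ['black', 'black', 'blue', 'black',
              'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})

# True where a row's color equals the first color seen for its car
matches_first = df['color'].eq(df.groupby('car')['color'].transform('first'))

# Running product per car: 1 until the first mismatch, then 0 for the rest
in_leading_run = matches_first.astype(int).groupby(df['car']).cumprod()

# Summing the 1s per car gives the length of the leading run
leading_counts = (in_leading_run.groupby(df['car'])
                                .sum()
                                .reset_index(name='colour_cars'))
print(leading_counts)
```

For the sample data this should give 2 for audi, 1 for bmw and 3 for fiat, and the result already has a clean 0..n index, so no drop_duplicates step is needed.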