I am quite new to pandas. For each car in this DataFrame, I am trying to count the length of the first run of consecutive identical color values:
car color
0 audi black
1 audi black
2 audi blue
3 audi black
4 bmw blue
5 bmw green
6 bmw blue
7 bmw blue
8 fiat green
9 fiat green
10 fiat green
11 fiat blue
Thanks to jezrael, I have code that counts the total number of times each car's first color appears:
import pandas as pd
df = pd.DataFrame(data={
    'car': ['audi', 'audi', 'audi', 'audi', 'bmw', 'bmw', 'bmw', 'bmw',
            'fiat', 'fiat', 'fiat', 'fiat'],
    'color': ['black', 'black', 'blue', 'black', 'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})
df1 = (df.groupby('car')['color']
         .transform('first')
         .eq(df['color'])
         .astype(int)  # Series.view is deprecated in recent pandas versions
         .groupby(df['car'])
         .sum()
         .reset_index(name='colour_cars'))
print(df1)
It works well for counting the total:
car colour_cars
0 audi 3
1 bmw 3
2 fiat 3
But it turns out that what I really need is the length of the first consecutive run only, so the result should be:
car colour_cars
0 audi 2
1 bmw 1
2 fiat 3
I have tried to use an apply function to stop the Series .sum() when .eq encounters a False. Any help finding a way to break the count once .eq returns False would be greatly appreciated.
Use:
# Assign to a new name so the original df stays available afterwards
df1 = (df.groupby(['car', df.color.ne(df.color.shift()).cumsum()])
         .size()
         .reset_index(level=1, drop=True)
         .reset_index(name='colour_cars')
         .drop_duplicates('car'))
print(df1)
car colour_cars
0 audi 2
3 bmw 1
6 fiat 3
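Another option, closer to the .eq idea from the question, is to cut the count off with a cumulative product. This is only a minimal sketch: the names matches_first, leading and result are illustrative and not part of the original code, and it assumes the rows of each car are already in the order you want to scan.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['audi'] * 4 + ['bmw'] * 4 + ['fiat'] * 4,
    'color': ['black', 'black', 'blue', 'black',
              'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})

# True where a row has the same color as the first row of its car
matches_first = df['color'].eq(df.groupby('car')['color'].transform('first'))

# Within each car the running product stays 1 until the first mismatch,
# then drops to 0 for every later row
leading = matches_first.astype(int).groupby(df['car']).cumprod()

# Summing the remaining 1s gives the length of the leading run per car
result = leading.groupby(df['car']).sum().reset_index(name='colour_cars')
print(result)
```

For the sample data this gives 2 for audi, 1 for bmw and 3 for fiat, because later reappearances of the first color are multiplied by the 0 left behind by the first mismatch.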
Details: build a helper Series that labels each run of consecutive equal values in the color column, group by car together with that helper and count the rows with GroupBy.size, drop the helper index level (level=1) with the first DataFrame.reset_index, convert the car index to a column with the second reset_index, and finally keep the first row per car with DataFrame.drop_duplicates. The helper Series looks like this:
print (df.color.ne(df.color.shift()).cumsum())
0 1
1 1
2 2
3 3
4 4
5 5
6 6
7 6
8 7
9 7
10 7
11 8
Name: color, dtype: int32
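To make the individual steps easier to inspect, the chain can also be split into named intermediates. This is just a sketch for exploration: run_id, run_sizes, flat and first_runs are hypothetical names chosen here, and the sample frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'car': ['audi'] * 4 + ['bmw'] * 4 + ['fiat'] * 4,
    'color': ['black', 'black', 'blue', 'black',
              'blue', 'green', 'blue', 'blue',
              'green', 'green', 'green', 'blue'],
})

# Step 1: the id increases by 1 every time the color differs from the previous row
run_id = df['color'].ne(df['color'].shift()).cumsum()

# Step 2: number of rows in each (car, run) pair
run_sizes = df.groupby(['car', run_id]).size()

# Step 3: drop the run id level, then turn the car index into a column
flat = run_sizes.reset_index(level=1, drop=True).reset_index(name='colour_cars')

# Step 4: the first row per car is that car's first run
first_runs = flat.drop_duplicates('car')
print(first_runs)
```

Grouping by car as well as the run id matters: a run that continues across the boundary between two cars shares one id, and the car key is what splits it.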