My DataFrame is:
import pandas as pd
df = pd.DataFrame(
{
'x': ['a', 'a', 'a','b', 'b','c', 'c', 'c',],
'y': list(range(8))
}
)
And this is the expected output. I want to create column z
:
x y z
0 a 0 NaN
1 a 1 NaN
2 a 2 NaN
3 b 3 3
4 b 4 NaN
5 c 5 NaN
6 c 6 NaN
7 c 7 NaN
The logic is:
I want to find the first row after the first group of duplicated rows. For example in column x
, the value a
is the first duplicated value. I want to find one row after the a
values end. And then put the y
of that row for z
column.
This is my attempt that did not give me the output:
m = (df.x.duplicated())
out = df[m]
One option, using a custom mask:
# flag rows after the first group
m = df['x'].ne(df['x'].iat[0]).cummax()
# pick the first one
out = df[m & ~m.shift(fill_value=False)]
If your first value is always a
and you want to find the first non-a
you could also use:
m2 = df['x'].eq('a')
out = df[m2.shift(fill_value=False) & ~m2]
Or, if you're sure there is at least one row after the leading a
s:
out = df.loc[[df['x'].ne('a').idxmax()]]
Output:
x y
3 b 3
Some intermediates (all approaches):
x y m ~m.shift(fill_value=False) m2 m2.shift(fill_value=False) df['x'].ne('a')
0 a 0 False True True False False
1 a 1 False True True True False
2 a 2 False True True True False
3 b 3 True True False True True
4 b 4 True False False False True
5 c 5 True False False False True
6 c 6 True False False False True
7 c 7 True False False False True