I have the following dataframe:
timeseries = pd.date_range("2018-01-01", periods=10, freq="m")
df = pd.DataFrame(data = ["a","a","b","a","a","c","c","c","a","a"], index = timeseries, columns = ['var'])
print(df)
var
2018-01-31 a
2018-02-28 a
2018-03-31 b
2018-04-30 a
2018-05-31 a
2018-06-30 c
2018-07-31 c
2018-08-31 c
2018-09-30 a
2018-10-31 a
I want to extract the minimum and maximum dates for each uninterrupted sequence of values in column "var" and assign them as additional columns. Desired result is:
var min Date max Date
1/31/2018 a 1/31/2018 2/28/2018
2/28/2018 a 1/31/2018 2/28/2018
3/31/2018 b 3/31/2018 3/31/2018
4/30/2018 a 4/30/2018 5/31/2018
5/31/2018 a 4/30/2018 5/31/2018
6/30/2018 c 6/30/2018 8/31/2018
7/31/2018 c 6/30/2018 8/31/2018
8/31/2018 c 6/30/2018 8/31/2018
9/30/2018 a 9/30/2018 10/31/2018
10/31/2018 a 9/30/2018 10/31/2018
for example in column "var" the first time value "a" appears is on 1/31/2018 and then it goes again "a" on 2/28/2018 before it is interrupted by "b". So for min and max dates we will have 1/31/2018 and 2/28/2018.
I thought I can achieve that with groupby operation but didn't manage since groupby aggregates all "a" or "b" or "c" etc.
You need to convert your index to_series
, then form a custom grouper (with shift
+cumsum
) and use a groupby.transform
:
# group successive values
group = df['var'].ne(df['var'].shift()).cumsum()
# form grouper and get transform
t = df.index.to_series().groupby(group).transform
# compute min/max
df['min Date'] = t('min')
df['max Date'] = t('max')
Output:
var min Date max Date
2018-01-31 a 2018-01-31 2018-02-28
2018-02-28 a 2018-01-31 2018-02-28
2018-03-31 b 2018-03-31 2018-03-31
2018-04-30 a 2018-04-30 2018-05-31
2018-05-31 a 2018-04-30 2018-05-31
2018-06-30 c 2018-06-30 2018-08-31
2018-07-31 c 2018-06-30 2018-08-31
2018-08-31 c 2018-06-30 2018-08-31
2018-09-30 a 2018-09-30 2018-10-31
2018-10-31 a 2018-09-30 2018-10-31