I have a pandas dataframe from 2007 to 2017. The data is like this:
date closing_price
2007-12-03 728.73
2007-12-04 728.83
2007-12-05 728.83
2007-12-07 728.93
2007-12-10 728.22
2007-12-11 728.50
2007-12-12 728.51
2007-12-13 728.65
2007-12-14 728.65
2007-12-17 728.70
2007-12-18 728.73
2007-12-19 728.73
2007-12-20 728.73
2007-12-21 728.52
2007-12-24 728.52
2007-12-26 728.90
2007-12-27 728.90
2007-12-28 728.91
2008-01-05 728.88
2008-01-08 728.86
2008-01-09 728.84
2008-01-10 728.85
2008-01-11 728.85
2008-01-15 728.86
2008-01-16 728.89
As you can see, some days are missing in each month. I want to take the first and last available days of each month, calculate the difference between their closing_price values, and put the results in a new dataframe. For example, for the first month the days are 2007-12-03 and 2007-12-28, with closing prices 728.73 and 728.91, so the result would be 0.18. How can I do this?
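You can group df by month and apply a function to each group. Note the to_period call: it converts the datetime values in the date column to monthly periods (frequency "M"), so all rows from the same month share one group key. The .dt accessor requires the date column to have a datetime dtype.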
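import pandas as pd
df["date"] = pd.to_datetime(df["date"])  # .dt needs a datetime column

def calculate(x):
    # Use the rows holding the earliest and latest dates in this month
    start_closing_price = x.loc[x["date"].idxmin(), "closing_price"]
    end_closing_price = x.loc[x["date"].idxmax(), "closing_price"]
    return end_closing_price - start_closing_price

result = df.groupby(df["date"].dt.to_period("M")).apply(calculate)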
# result
date
2007-12 0.18
2008-01 0.01
Freq: M, dtype: float64
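The result above is a Series, while the question asks for a new dataframe. One possible variation, sketched here under some assumptions, uses the built-in "first" and "last" aggregations on date-sorted data instead of a custom function. The sample rows are an abbreviated copy of the question's data, and the names monthly, month, first_price, last_price and difference are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Abbreviated sample from the question (first and last rows of each month kept)
df = pd.DataFrame({
    "date": ["2007-12-03", "2007-12-04", "2007-12-28",
             "2008-01-05", "2008-01-08", "2008-01-16"],
    "closing_price": [728.73, 728.83, 728.91, 728.88, 728.86, 728.89],
})
df["date"] = pd.to_datetime(df["date"])

# Sort by date so "first" and "last" mean earliest and latest available day
df = df.sort_values("date")

# Monthly period used as the group key (hypothetical column name "month")
month = df["date"].dt.to_period("M").rename("month")

# Named aggregation: one output column per (name="function") pair
monthly = df.groupby(month)["closing_price"].agg(
    first_price="first",
    last_price="last",
)
monthly["difference"] = monthly["last_price"] - monthly["first_price"]

# Turn the month index into an ordinary column to get a flat dataframe
monthly = monthly.reset_index()
print(monthly)
```

On this sample the difference column comes out at approximately 0.18 for 2007-12 and 0.01 for 2008-01, matching the Series above. Keeping first_price and last_price alongside the difference makes it easy to check which values were used for each month.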