I have a dataframe with income for a certain period of time. The period is given by a start date and an end date (For example 2023-04-01 and 2023-06-30). The period can vary between 3, 6 and 12 months. My goal is to bring everything to the same period, but I did not find an easy way to do it.
This is how I am doing it now:
df["date_diff"] = (df.date_to.dt.to_period("M") - df.date.dt.to_period("M")).apply(lambda x: x.n) + 1
date_diff
columndate
new_rows = []
for _,row in df.iterrows():
for i in range(row["date_diff"]):
new_row = row.copy()
new_row["new_date"] = row["date"] + relativedelta(months=i) + pd.offsets.MonthEnd()
new_row["new_net_revenues"] = row["net_revenues"] / row["date_diff"]
new_rows.append(new_row)
new_df = pd.DataFrame(new_rows)
This obviously takes a hell lot of time for larger dataframes. I wonder if there is a much faster solution with builtin funtions?
Essentially it is kind of an explode
to the months in the period.
I also tried to just concat the rows according to the number of months in the period and do the math afterwards, but still not very elegant.
Edit:
The code above transforms this:
net_revenues date date_to date_diff
0 0.02 2023-04-01 2023-06-30 3
1 0.01 2023-04-01 2023-06-30 3
2 0.02 2023-01-01 2023-03-31 3
into this:
new_net_revenues new_date
0 0.006667 2023-04-30
0 0.006667 2023-05-31
0 0.006667 2023-06-30
1 0.003333 2023-04-30
1 0.003333 2023-05-31
1 0.003333 2023-06-30
2 0.006667 2023-01-31
2 0.006667 2023-02-28
2 0.006667 2023-03-31
3 0.003333 2023-01-31
3 0.003333 2023-02-28
3 0.003333 2023-03-31
You can create new rows with Index.repeat
, add months by counter by GroupBy.cumcount
and create datetimes by Series.dt.to_timestamp
:
df = pd.DataFrame({'date':['2023-04-01', '2023-01-01'],
'date_to':['2023-06-30', '2023-06-30'],
}).apply(pd.to_datetime).assign(net_revenues=[100,20])
print (df)
date date_to net_revenues
0 2023-04-01 2023-06-30 100
1 2023-01-01 2023-06-30 20
periods = df.date.dt.to_period("M")
diff = (df.date_to.dt.to_period("M") - periods).apply(lambda x: x.n) + 1
new_df = df.loc[df.index.repeat(diff)]
new_df['new_date'] = (periods + new_df.groupby(level=0).cumcount()).dt.to_timestamp("M")
new_df["new_net_revenues"] = new_df["net_revenues"] / diff
print (new_df)
date date_to net_revenues new_date new_net_revenues
0 2023-04-01 2023-06-30 100 2023-04-30 33.333333
0 2023-04-01 2023-06-30 100 2023-05-31 33.333333
0 2023-04-01 2023-06-30 100 2023-06-30 33.333333
1 2023-01-01 2023-06-30 20 2023-01-31 3.333333
1 2023-01-01 2023-06-30 20 2023-02-28 3.333333
1 2023-01-01 2023-06-30 20 2023-03-31 3.333333
1 2023-01-01 2023-06-30 20 2023-04-30 3.333333
1 2023-01-01 2023-06-30 20 2023-05-31 3.333333
1 2023-01-01 2023-06-30 20 2023-06-30 3.333333