pythonpandasdataframedatetime

Pandas Dataframe with mixed periods - 1M, 6M and 12M - to Dataframe with one period only


I have a dataframe with income for a certain period of time. The period is given by a start date and an end date (For example 2023-04-01 and 2023-06-30). The period can vary between 3, 6 and 12 months. My goal is to bring everything to the same period, but I did not find an easy way to do it.

This is how I am doing it now:

        new_rows = []
        for _,row in df.iterrows():
            for i in range(row["date_diff"]):
                new_row = row.copy()
                new_row["new_date"] = row["date"] + relativedelta(months=i) + pd.offsets.MonthEnd()
                new_row["new_net_revenues"] = row["net_revenues"] / row["date_diff"]
                new_rows.append(new_row)
        new_df = pd.DataFrame(new_rows)

This obviously takes a hell lot of time for larger dataframes. I wonder if there is a much faster solution with builtin funtions? Essentially it is kind of an explode to the months in the period. I also tried to just concat the rows according to the number of months in the period and do the math afterwards, but still not very elegant.

Edit:

The code above transforms this:

   net_revenues       date    date_to  date_diff
0          0.02 2023-04-01 2023-06-30          3
1          0.01 2023-04-01 2023-06-30          3
2          0.02 2023-01-01 2023-03-31          3

into this:

   new_net_revenues   new_date
0          0.006667 2023-04-30
0          0.006667 2023-05-31
0          0.006667 2023-06-30
1          0.003333 2023-04-30
1          0.003333 2023-05-31
1          0.003333 2023-06-30
2          0.006667 2023-01-31
2          0.006667 2023-02-28
2          0.006667 2023-03-31
3          0.003333 2023-01-31
3          0.003333 2023-02-28
3          0.003333 2023-03-31

Solution

  • You can create new rows with Index.repeat, add months by counter by GroupBy.cumcount and create datetimes by Series.dt.to_timestamp:

    df = pd.DataFrame({'date':['2023-04-01', '2023-01-01'],
                       'date_to':['2023-06-30', '2023-06-30'], 
                       }).apply(pd.to_datetime).assign(net_revenues=[100,20])
    
    print (df)
            date    date_to  net_revenues
    0 2023-04-01 2023-06-30           100
    1 2023-01-01 2023-06-30            20
    
    periods = df.date.dt.to_period("M")
    diff = (df.date_to.dt.to_period("M") - periods).apply(lambda x: x.n) + 1
    new_df = df.loc[df.index.repeat(diff)]
    new_df['new_date'] = (periods + new_df.groupby(level=0).cumcount()).dt.to_timestamp("M")
    new_df["new_net_revenues"] = new_df["net_revenues"] / diff
    print (new_df)
            date    date_to  net_revenues   new_date  new_net_revenues
    0 2023-04-01 2023-06-30           100 2023-04-30         33.333333
    0 2023-04-01 2023-06-30           100 2023-05-31         33.333333
    0 2023-04-01 2023-06-30           100 2023-06-30         33.333333
    1 2023-01-01 2023-06-30            20 2023-01-31          3.333333
    1 2023-01-01 2023-06-30            20 2023-02-28          3.333333
    1 2023-01-01 2023-06-30            20 2023-03-31          3.333333
    1 2023-01-01 2023-06-30            20 2023-04-30          3.333333
    1 2023-01-01 2023-06-30            20 2023-05-31          3.333333
    1 2023-01-01 2023-06-30            20 2023-06-30          3.333333