pythonpandasgroup-byresampling

Pandas dataframe: resampling time intervals and dividing values proportionally?


I have the following pandas dataframe:

index start_time end_time amount
foo 2023-03-11 09:45:27 2023-03-11 09:58:39 48
bar 2023-03-11 09:59:00 2023-03-11 010:09:00 20

I'm hoping to split the data by hourly intervals, such that amounts would be:

interval amount
2023-03-11 09:00-10:00 50
2023-03-11 10:00-11:00 18

The logic being:

From what I understand, dataframe's .resample method is best suited for splitting by intervals, however:

I'm guessing this is a common need when working with time series in Python. Before I go building out something convoluted...is there any built-in/simple way to tackle it?


Solution

  • Here is one convoluted way to answer your interesting question using Pandas Timestamp.floor, Timestamp.ceil, and explode:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            "index": ["foo", "bar"],
            "start_time": ["2023-03-11 09:45:27", "2023-03-11 09:59:00"],
            "end_time": ["2023-03-11 09:58:39", "2023-03-11 10:09:00"],
            "amount": [48, 20],
        }
    )
    for col in ("start_time", "end_time"):
        df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
    
    # Find intervals
    df["interval"] = df.apply(
        lambda x: [[x["start_time"].floor("H"), x["start_time"].ceil("H")]]
        if (x["end_time"].ceil("H").hour - x["start_time"].floor("H").hour) == 1
        else [
            [x["start_time"].floor("H"), x["end_time"].floor("H")],
            [x["start_time"].ceil("H"), x["end_time"].ceil("H")],
        ],
        axis=1,
    )
    
    # Divide values
    df["amount"] = df.apply(
        lambda x: x["amount"]
        if (x["end_time"].ceil("H").hour - x["start_time"].floor("H").hour) == 1
        else [
            int(
                x["amount"]
                * 60
                * (60 - x["start_time"].minute)
                / (x["end_time"] - x["start_time"]).total_seconds()
            ),
            int(
                x["amount"]
                - x["amount"]
                * 60
                * (60 - x["start_time"].minute)
                / (x["end_time"] - x["start_time"]).total_seconds()
            ),
        ],
        axis=1,
    )
    
    # Deal with lists of intervals
    tmp = df.loc[df["interval"].apply(len) == 2].explode(["interval", "amount"])
    other = df.loc[~df.index.isin(tmp.index), :].pipe(
        lambda df_: df_.assign(interval=df_["interval"].apply(lambda x: x[0]))
    )
    df = pd.concat([other, tmp])
    
    # Compute final values
    df = (
        df.assign(interval=df["interval"].apply(tuple))
        .groupby("interval")
        .agg({"amount": sum})
    )
    

    Then:

    print(df)
    # Output
                                               amount
    interval
    (2023-03-11 09:00:00, 2023-03-11 10:00:00)     50
    (2023-03-11 10:00:00, 2023-03-11 11:00:00)     18