pandasfor-loopdatetimevectorizationperiod

How can I vectorize a for-loop running over a pandas Periodindex where I need to sort datetimes into appropriate period?


I have a Dataframe "timeseries" which has datetimes as its index and I have a PeriodIndex "on":

import numpy as np
import pandas as pd


timeseries = pd.DataFrame(
        index=pd.DatetimeIndex(
            [
                "2000-01-01 12:00:00Z",
                "2000-01-01 13:00:00Z",
                "2000-01-01 14:00:00Z",
                "2000-01-02 13:00:00Z",
                "2000-01-02 18:00:00Z",
                "2000-01-03 14:00:00Z",
                "2000-01-03 20:00:00Z",
                "2000-01-04 13:00:00Z",
            ]
        ),
        data={
            "value1": [6.0, 5.0, 3.0, 7.0, 4.0, 4.0, 5.0, 3.0],
        },
    )
on = pd.PeriodIndex(
    ["2000-01-01", "2000-01-02", "2000-01-04", "2000-01-05"], freq="D"
    )

I would like to add a column to "timeseries" that contains the period in "on" that each respective datetime is in:

                           value1      period
2000-01-01 12:00:00+00:00     6.0  2000-01-01
2000-01-01 13:00:00+00:00     5.0  2000-01-01
2000-01-01 14:00:00+00:00     3.0  2000-01-01
2000-01-02 13:00:00+00:00     7.0  2000-01-02
2000-01-02 18:00:00+00:00     4.0  2000-01-02
2000-01-03 14:00:00+00:00     4.0         NaN
2000-01-03 20:00:00+00:00     5.0         NaN
2000-01-04 13:00:00+00:00     3.0  2000-01-04

So far I have achieved this with a for-loop:

    timeseries["period"] = np.NaN
    for period in on:
        datetimes_in_period = timeseries.index[
            (timeseries.index >= period.start_time.tz_localize("UTC"))
            & (timeseries.index <= period.end_time.tz_localize("UTC"))
        ]
        timeseries["period"].loc[datetimes_in_period] = period

For efficiency's sake I want to avoid loops in Python. How can I vectorize this code?


Solution

  • You can still use .merge_asof as @rorshan suggested.

    If you create a dataframe of the start/end intervals:

    df_on = pd.DataFrame({
       "period":     on,
       "start_time": on.start_time.tz_localize("UTC"), 
       "end_time":   on.end_time.tz_localize("UTC"),
    })
    
    df = pd.merge_asof(timeseries, df_on, left_index=True, right_on="start_time")
    
    # blank out period when not inside
    df["period"] = df["period"].where((df.index >= df["start_time"]) & (df.index <= df["end_time"]))
    
    >>> df[["value1", "period"]]
                               value1      period
    2000-01-01 12:00:00+00:00     6.0  2000-01-01
    2000-01-01 13:00:00+00:00     5.0  2000-01-01
    2000-01-01 14:00:00+00:00     3.0  2000-01-01
    2000-01-02 13:00:00+00:00     7.0  2000-01-02
    2000-01-02 18:00:00+00:00     4.0  2000-01-02
    2000-01-03 14:00:00+00:00     4.0         NaT
    2000-01-03 20:00:00+00:00     5.0         NaT
    2000-01-04 13:00:00+00:00     3.0  2000-01-04