pythonpandasdatetimedroplines-of-code

In python, drop a time range period


I would like to drop 30 minutes of data after column is -1.

I have a large dataset and I need to remove 30 minutes from the dataframe after the column 'diff' == -1

I have the following code:


df12_3.head() 

   index           Timestamp  Status_flag  diff
0      0 2020-03-02 10:40:00            0  -1.0
1      1 2020-03-02 10:41:00            0   0.0
2      2 2020-03-02 10:42:00            0   0.0
3      3 2020-03-02 10:43:00            0   0.0
4      4 2020-03-02 10:44:00            0   0.0

idx =df12_3['Timestamp'][df12_3['diff'] == -1]

idx.head(1)
0      2020-03-02 10:40:00

halfhour = datetime.timedelta(minutes=30)

idx2=df12_3['Timestamp'][df12_3['diff'] == -1]+halfhour

idx2.head(1)

0   2020-03-02 11:10:00

df12_3.loc[(df12_3.index < idx.iloc[0] ) | (df12_3.index > idx2.iloc[0])]

This removes the first 30 minute range. Is there a way to set this up so I can remove every 30minutes from when column diff == -1. I would be also happy to remove by number of rows e.g. when diff==-1 remove the next 30 rows.


Solution

  • You can use a groupby.transform operation and boolean indexing:

    df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    
    out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
                   .transform('first').add(pd.Timedelta('30min'))
                   .le(df['Timestamp'])
                ]
    

    Intermediates (with slightly different input for clarity):

       index           Timestamp  Status_flag  diff  cumsum               first              +30min  >Timestamp
    0      0 2020-03-02 10:40:00            0  -1.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
    1      1 2020-03-02 10:41:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
    2      2 2020-03-02 10:42:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
    3      3 2020-03-02 11:03:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
    4      4 2020-03-02 11:14:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00        True
    

    Alternative:

    out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
                   .transform('first').rsub(df['Timestamp'])
                   .gt(pd.Timedelta('30min'))
                ]