[SOLVED] In python, drop a time range period

In python, drop a time range period

I would like to drop 30 minutes of data after column is -1.

I have a large dataset and I need to remove 30 minutes from the dataframe after the column 'diff' == -1

I have the following code:


df12_3.head() 

   index           Timestamp  Status_flag  diff
0      0 2020-03-02 10:40:00            0  -1.0
1      1 2020-03-02 10:41:00            0   0.0
2      2 2020-03-02 10:42:00            0   0.0
3      3 2020-03-02 10:43:00            0   0.0
4      4 2020-03-02 10:44:00            0   0.0

idx =df12_3['Timestamp'][df12_3['diff'] == -1]

idx.head(1)
0      2020-03-02 10:40:00

halfhour = datetime.timedelta(minutes=30)

idx2=df12_3['Timestamp'][df12_3['diff'] == -1]+halfhour

idx2.head(1)

0   2020-03-02 11:10:00

df12_3.loc[(df12_3.index < idx.iloc[0] ) | (df12_3.index > idx2.iloc[0])]

This removes the first 30 minute range. Is there a way to set this up so I can remove every 30minutes from when column diff == -1. I would be also happy to remove by number of rows e.g. when diff==-1 remove the next 30 rows.

Solution

You can use a groupby.transform operation and boolean indexing:

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
               .transform('first').add(pd.Timedelta('30min'))
               .le(df['Timestamp'])
            ]

Intermediates (with slightly different input for clarity):

   index           Timestamp  Status_flag  diff  cumsum               first              +30min  >Timestamp
0      0 2020-03-02 10:40:00            0  -1.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
1      1 2020-03-02 10:41:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
2      2 2020-03-02 10:42:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
3      3 2020-03-02 11:03:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00       False
4      4 2020-03-02 11:14:00            0   0.0       1 2020-03-02 10:40:00 2020-03-02 11:10:00        True

Alternative:

out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
               .transform('first').rsub(df['Timestamp'])
               .gt(pd.Timedelta('30min'))
            ]