I would like to drop 30 minutes of data after column is -1.
I have a large dataset and I need to remove 30 minutes from the dataframe after the column 'diff' == -1
I have the following code:
df12_3.head()
index Timestamp Status_flag diff
0 0 2020-03-02 10:40:00 0 -1.0
1 1 2020-03-02 10:41:00 0 0.0
2 2 2020-03-02 10:42:00 0 0.0
3 3 2020-03-02 10:43:00 0 0.0
4 4 2020-03-02 10:44:00 0 0.0
idx =df12_3['Timestamp'][df12_3['diff'] == -1]
idx.head(1)
0 2020-03-02 10:40:00
halfhour = datetime.timedelta(minutes=30)
idx2=df12_3['Timestamp'][df12_3['diff'] == -1]+halfhour
idx2.head(1)
0 2020-03-02 11:10:00
df12_3.loc[(df12_3.index < idx.iloc[0] ) | (df12_3.index > idx2.iloc[0])]
This removes the first 30 minute range. Is there a way to set this up so I can remove every 30minutes from when column diff == -1. I would be also happy to remove by number of rows e.g. when diff==-1 remove the next 30 rows.
You can use a groupby.transform
operation and boolean indexing:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
.transform('first').add(pd.Timedelta('30min'))
.le(df['Timestamp'])
]
Intermediates (with slightly different input for clarity):
index Timestamp Status_flag diff cumsum first +30min >Timestamp
0 0 2020-03-02 10:40:00 0 -1.0 1 2020-03-02 10:40:00 2020-03-02 11:10:00 False
1 1 2020-03-02 10:41:00 0 0.0 1 2020-03-02 10:40:00 2020-03-02 11:10:00 False
2 2 2020-03-02 10:42:00 0 0.0 1 2020-03-02 10:40:00 2020-03-02 11:10:00 False
3 3 2020-03-02 11:03:00 0 0.0 1 2020-03-02 10:40:00 2020-03-02 11:10:00 False
4 4 2020-03-02 11:14:00 0 0.0 1 2020-03-02 10:40:00 2020-03-02 11:10:00 True
Alternative:
out = df.loc[df.groupby(df['diff'].eq(-1).cumsum())['Timestamp']
.transform('first').rsub(df['Timestamp'])
.gt(pd.Timedelta('30min'))
]