Say we have a pandas DataFrame like the one below:
ID Name DateTime Days
------------------------------------------
1 AAA 2022-06-22 10:00:05 12
2 BBB 2022-06-22 10:02:00
3 CCC 2022-06-22 10:04:00 16
4 DDD 2022-06-22 10:05:00
5 EEE 2022-06-22 10:05:59 18
I need to fill the missing values in the Days column with the value from the row that is nearest in time. For example, for row #2 the nearest value is in row #1, because the time difference to row #1 (1 min 55 s) is smaller than the difference to row #3 (2 min). The result should look like the table below.
ID Name DateTime Days
------------------------------------------
1 AAA 2022-06-22 10:00:05 12
2 BBB 2022-06-22 10:02:00 12
3 CCC 2022-06-22 10:04:00 16
4 DDD 2022-06-22 10:05:00 18
5 EEE 2022-06-22 10:05:59 18
I tried to do it in a for loop, but it takes too long because I have to process 30k records. Is there any other way of doing it?
Edit:
import pandas as pd
import numpy as np
from datetime import datetime
df = pd.DataFrame([{"Id": 1, "Name": "AAA", "DateTime":pd.Timestamp('2022-06-22 10:00:05'), "Days": 12},
{"Id": 2, "Name": "BBB", "DateTime":pd.Timestamp('2022-06-22 10:02:00'), "Days": None},
{"Id": 3, "Name": "CCC", "DateTime":pd.Timestamp('2022-06-22 10:04:00'), "Days": 16},
{"Id": 4, "Name": "DDD", "DateTime":pd.Timestamp('2022-06-22 10:05:00'), "Days": None},
{"Id": 5, "Name": "EEE", "DateTime":pd.Timestamp('2022-06-22 10:05:59'), "Days": 18}])
df['TimeUpShift'] = df['DateTime'].shift(1)
df['TimeDownShift'] = df['DateTime'].shift(-1)
I am not able to proceed after this.
The for loop logic that does the same thing:
def is_missing(value):
    # A value counts as missing if it is an empty string or NaN/None.
    return value == '' or pd.isna(value)

# Edge rows: fall back to the first/last available value.
last = len(df) - 1
if is_missing(df.loc[0, 'Days']):
    index = df['Days'].first_valid_index()
    df.loc[0, 'Days'] = df.loc[index, 'Days'] if index is not None else None
if is_missing(df.loc[last, 'Days']):
    index = df['Days'].last_valid_index()
    df.loc[last, 'Days'] = df.loc[index, 'Days'] if index is not None else None
for i in range(1, last):
    if is_missing(df.loc[i, 'Days']):
        prevrow = pd.DataFrame()
        nextrow = pd.DataFrame()
        # backward search for the closest earlier row that has a value
        for pi in range(i - 1, -1, -1):
            if not is_missing(df.loc[pi, 'Days']):
                prevrow = df.loc[pi]
                break
        # forward search for the closest later row that has a value
        for ni in range(i + 1, len(df)):
            if not is_missing(df.loc[ni, 'Days']):
                nextrow = df.loc[ni]
                break
        # take the value from whichever neighbour is closer in time
        if df.loc[i, 'DateTime'] - prevrow['DateTime'] <= nextrow['DateTime'] - df.loc[i, 'DateTime']:
            df.loc[i, 'Days'] = prevrow['Days']
        else:
            df.loc[i, 'Days'] = nextrow['Days']
With merge_asof you can merge (match) each row to the row with the nearest DateTime. Both frames must be sorted by DateTime. Only DateTime and Days are taken from the right-hand frame, so the Id and Name of each original row are kept:
pd.merge_asof(df[['Id', 'Name', 'DateTime']],
              df.dropna(subset=['Days'])[['DateTime', 'Days']],
              on='DateTime', direction='nearest')
   Id Name            DateTime  Days
0   1  AAA 2022-06-22 10:00:05  12.0
1   2  BBB 2022-06-22 10:02:00  12.0
2   3  CCC 2022-06-22 10:04:00  16.0
3   4  DDD 2022-06-22 10:05:00  18.0
4   5  EEE 2022-06-22 10:05:59  18.0
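The same matching can be wrapped in a small reusable helper. This is only a sketch: the name `fill_nearest` and its parameters are hypothetical (not part of pandas), and it assumes the index is unique and the time column has no missing values. It sorts by the time column before calling merge_asof, then aligns the matched values back on the original index so that only the gaps are filled.

```python
import pandas as pd


def fill_nearest(df, time_col="DateTime", value_col="Days"):
    """Fill missing value_col entries from the row nearest in time_col.

    Hypothetical helper for illustration; assumes a unique index.
    """
    # Right side: rows that already have a value, sorted by time.
    known = (
        df.loc[df[value_col].notna(), [time_col, value_col]]
        .sort_values(time_col)
    )
    # Left side: every row's timestamp, with the original index kept as a
    # column so the matches can be aligned back afterwards.
    left = (
        df[[time_col]]
        .rename_axis("_orig_index")
        .reset_index()
        .sort_values(time_col, kind="stable")
    )
    matched = pd.merge_asof(left, known, on=time_col, direction="nearest")
    nearest = matched.set_index("_orig_index")[value_col]

    out = df.copy()
    # Only the gaps are filled; existing values are left untouched.
    out[value_col] = df[value_col].fillna(nearest)
    return out


df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "DateTime": pd.to_datetime([
        "2022-06-22 10:00:05",
        "2022-06-22 10:02:00",
        "2022-06-22 10:04:00",
        "2022-06-22 10:05:00",
        "2022-06-22 10:05:59",
    ]),
    "Days": [12, None, 16, None, 18],
})

filled = fill_nearest(df)
print(filled)
```

Because the result is aligned on the index rather than on row position, the helper should also work when the input rows are not in chronological order.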
asof merge within a given tolerance
You can specify a merge tolerance with the tolerance= parameter, which takes a Timedelta:
pd.merge_asof(df,df.dropna(), on='DateTime', direction='nearest', tolerance=pd.Timedelta('1m'), suffixes=('', '_y')) \
.drop(columns=['Days','Id_y', 'Name_y']).rename(columns={'Days_y':'Days'})
Result:
Id Name DateTime Days
0 1 AAA 2022-06-22 10:00:05 12.0
1 2 BBB 2022-06-22 10:02:00 NaN
2 3 CCC 2022-06-22 10:04:00 16.0
3 4 DDD 2022-06-22 10:05:00 18.0
4 5 EEE 2022-06-22 10:05:59 18.0
The example above matches each row to the nearest time only if that time is within 1 minute; rows without a match inside the tolerance stay NaN. For more details on Timedelta, see: https://pandas.pydata.org/docs/user_guide/timedeltas.html
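To see how the tolerance changes the outcome, the sketch below runs the same nearest match on the sample data with three different tolerances. The helper name `nearest_within` is hypothetical and exists only for this illustration; it assumes the frame is already sorted by DateTime.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Name": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "DateTime": pd.to_datetime([
        "2022-06-22 10:00:05",
        "2022-06-22 10:02:00",
        "2022-06-22 10:04:00",
        "2022-06-22 10:05:00",
        "2022-06-22 10:05:59",
    ]),
    "Days": [12, None, 16, None, 18],
})


def nearest_within(df, tolerance):
    """Match each row to the nearest known Days value within `tolerance`.

    Hypothetical helper; assumes df is sorted by DateTime.
    """
    # Only rows with a value, and only the columns needed from the right side.
    known = df.dropna(subset=["Days"])[["DateTime", "Days"]]
    return pd.merge_asof(
        df.drop(columns="Days"),
        known,
        on="DateTime",
        direction="nearest",
        tolerance=pd.Timedelta(tolerance),
    )


for tol in ["30s", "1min", "2min"]:
    print(tol, nearest_within(df, tol)["Days"].tolist())
```

On this data, a 30 second tolerance leaves both gaps unfilled, 1 minute fills only row 4 (its nearest value is 59 seconds away), and 2 minutes fills both rows (row 2 is 1 minute 55 seconds from row 1).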