I am getting the following error: "pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3036-12-31 00:00:00, at position 45100"
I don't want to do the following, as it coerces every unparseable value to NaT (not a time):
s = pd.to_datetime(s, errors='coerce')
Is there a way to keep these dates rather than converting them to NaT/NaN/null?
This is the function causing the error:
def remove_duplicates_based_on_keydate(df, id_col, date_col):
    df = df.copy()
    df.loc[:, date_col] = pd.to_datetime(df[date_col])
    # Sort by the date column in descending order
    df_sorted = df.sort_values(by=date_col, ascending=False)
    # Drop duplicates, keeping the first occurrence (latest date)
    df_unique = df_sorted.drop_duplicates(subset=id_col, keep='first')
    # Sort again by id_col and reset index
    df_unique = df_unique.sort_values(by=id_col).reset_index(drop=True)
    return df_unique
You can't have dates above Timestamp('2262-04-11 23:47:16.854775807'): pandas stores timestamps as nanoseconds in a 64-bit integer, so that value (pd.Timestamp.max) is a hard upper bound.
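You can verify the bound and reproduce the error yourself (a minimal sketch, assuming only pandas is installed):

```python
import pandas as pd

# Nanosecond int64 storage caps the representable range
print(pd.Timestamp.max)  # 2262-04-11 23:47:16.854775807

# Anything beyond that bound raises the error from the question
try:
    pd.to_datetime('3036-12-31 00:00:00')
except pd.errors.OutOfBoundsDatetime as e:
    print(e)
```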
Assuming dates like:
df = pd.DataFrame({'date': ['2036-12-31 00:00:00', '3036-12-31 00:00:00']})
You could convert to periods with PeriodIndex
df['periods'] = pd.PeriodIndex(df['date'].str.extract(r'^(\d{4}-\d\d-\d\d)',
expand=False), freq='D')
Output:
date periods
0 2036-12-31 00:00:00 2036-12-31
1 3036-12-31 00:00:00 3036-12-31
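Since periods sort and compare like datetimes, you can plug this into your function. A sketch under the assumption that date_col holds 'YYYY-MM-DD ...' strings and that daily precision is enough for the ordering:

```python
import pandas as pd

def remove_duplicates_based_on_keydate(df, id_col, date_col):
    df = df.copy()
    # Periods use int64 ordinals, so dates past 2262 are fine
    df[date_col] = pd.PeriodIndex(
        df[date_col].str.extract(r'^(\d{4}-\d\d-\d\d)', expand=False), freq='D')
    # Sort descending so the latest date per id comes first
    df_sorted = df.sort_values(by=date_col, ascending=False)
    df_unique = df_sorted.drop_duplicates(subset=id_col, keep='first')
    return df_unique.sort_values(by=id_col).reset_index(drop=True)

df = pd.DataFrame({'id': [1, 1, 2],
                   'date': ['2036-12-31 00:00:00', '3036-12-31 00:00:00',
                            '2020-01-01 00:00:00']})
print(remove_duplicates_based_on_keydate(df, 'id', 'date'))
```

For id 1 the out-of-bounds 3036 date is kept as the latest, with no NaT coercion.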
Or, for precision to the second:
df = pd.DataFrame({'date': ['2036-12-31 00:00:00', '3036-12-31 01:00:00']})
df['periods'] = pd.PeriodIndex(df['date'], freq='s')
Output:
date periods
0 2036-12-31 00:00:00 2036-12-31 00:00:00
1 3036-12-31 01:00:00 3036-12-31 01:00:00
Alternatively, first convert to datetime with errors='coerce', then cast to object and fillna with the original strings for the out-of-bounds dates:
pd.to_datetime(df['date'], errors='coerce').astype(object).fillna(df['date'])
Note however that this will remain an object column with mixed strings and timestamps, which is not really useful in pandas.
df['mixed'] = pd.to_datetime(df['date'], errors='coerce').astype(object).fillna(df['date'])
print(df['mixed'].tolist())
# [Timestamp('2036-12-31 00:00:00'), '3036-12-31 00:00:00']
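To make the "not really useful" point concrete: the object column loses all datetime functionality, e.g. the .dt accessor no longer works (a small sketch reusing the frame above):

```python
import pandas as pd

df = pd.DataFrame({'date': ['2036-12-31 00:00:00', '3036-12-31 00:00:00']})
df['mixed'] = pd.to_datetime(df['date'], errors='coerce').astype(object).fillna(df['date'])

# object dtype means no vectorized datetime operations
try:
    df['mixed'].dt.year
except AttributeError as e:
    print(e)  # .dt only works on datetimelike columns
```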