pythonpandas

pandas OutOfBoundsDatetime out of scope issue


Am getting the following issue "pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3036-12-31 00:00:00, at position 45100"

I dont want to do the following as this will coerce all errors to NaT (not a time)

s = pd.to_datetime(s, errors='coerce')

Is there no way to keep the dates and not convert them to NaT/nan/null?

this is the function causing the error

def remove_duplicates_based_on_keydate(df, id_col, date_col):
    df = df.copy()
    df.loc[:, date_col] = pd.to_datetime(df[date_col])

    # Sort by the date column in descending order
    df_sorted = df.sort_values(by=date_col, ascending=False)

    # Drop duplicates, keeping the first occurrence (latest date)
    df_unique = df_sorted.drop_duplicates(subset=id_col, keep='first')

    # Sort again by id_col and reset index
    df_unique = df_unique.sort_values(by=id_col).reset_index(drop=True)

    return df_unique

Solution

  • You can't have dates above Timestamp('2262-04-11 23:47:16.854775807'):

    Assuming dates like:

    df = pd.DataFrame({'date': ['2036-12-31 00:00:00', '3036-12-31 00:00:00']})
    

    You could convert to periods with PeriodIndex

    df['periods'] = pd.PeriodIndex(df['date'].str.extract(r'^(\d{4}-\d\d-\d\d)',
                                   expand=False), freq='D')
    

    Output:

                      date     periods
    0  2036-12-31 00:00:00  2036-12-31
    1  3036-12-31 00:00:00  3036-12-31
    

    Or for a precision to the second:

    df = pd.DataFrame({'date': ['2036-12-31 00:00:00', '3036-12-31 01:00:00']})
    
    df['periods'] = pd.PeriodIndex(df['date'], freq='s')
    

    Output:

                      date              periods
    0  2036-12-31 00:00:00  2036-12-31 00:00:00
    1  3036-12-31 01:00:00  3036-12-31 01:00:00
    

    Alternatively, first convert to datetime with errors='coerce', then to object and fillna with the missing dates:

    pd.to_datetime(df['date'], errors='coerce').astype(object).fillna(df['date'])
    

    Note however that this will remain an object column with mixed strings and timestamps, which is not really useful in pandas.

    df['mixed'] = pd.to_datetime(df['date'], errors='coerce').astype(object).fillna(df['date'])
    print(df['mixed'].tolist())
    # [Timestamp('2036-12-31 00:00:00'), '3036-12-31 00:00:00']