pythonpandasdataframedatetime

How to find corrupted date entry when date column type is (object)


I am trying to convert a column of time stamps that are currently of type object to proper datetime. When using pd.to_datetime() on the column I got this error

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2821-11-10, at position 1818

I just want to find that value and correct it ie. should be 2021-11-10. But I can't find a way to search for that entry. The column is too large to physically look at every entry. Any tips?


Solution

  • Convert the dates to datatime and coerce errors. Then filter rows where an error occurred to either be fixed or dropped or further manipulated.

    import pandas as pd
    
    data = {
        'Event': ['Meeting A', 'Meeting B', 'Presentation C', 'Workshop D'],
        'Date': ['2023-01-15', '2923-01-18', '2023-02-01', '2023-02-05']
    }
    
    df = pd.DataFrame(data)
    
    df['dt_date'] = pd.to_datetime(df['Date'], errors = 'coerce') # Note: '2923-01-18' throws error without coerce
    
    df[df['dt_date'].isna()]
    
    

    df with datetime column

    Event Date dt_date
    0 Meeting A 2023-01-15 2023-01-15 00:00:00
    1 Meeting B 2923-01-18 NaT
    2 Presentation C 2023-02-01 2023-02-01 00:00:00
    3 Workshop D 2023-02-05 2023-02-05 00:00:00

    returned df from isna query

    Event Date dt_date
    1 Meeting B 2923-01-18 NaT

    More targeted option

    Based on details, exact value causing error is 2821-11-10. You can get that row with df[df['Date'] == "2821-11-10"] or change that value with a replace function like df['date_clean'] = df['Date'].replace("2821-11-10", "2021-11-10").

    I recommend taking the first approach to identify how many errors exist in the data and then handling the fixing/replacing.