pandasdataframedatetime

Remove rows where pd.to_datetime fails and create separate DF based on failed rows


CONTEXT

I am trying to parse dates in a DataFrame and want to catch all rows that aren't able to be converted and then log those rows/ID's. I want to ignore any rows that fail to parse because it is a null or empty value and only include those that fail such as '|199/142/4' in the example DataFrame below.

CURRENT CODE

# Create a dataframe with float values
data = {
'ID': [1, 2, 3, 4, 5],
'Date': ['09/02/1999', '11/12/2001', '', '', '|199/142/4']
}

df = pd.DataFrame(data)

original_null = df[col].isnull()
    try:
        df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce').dt.strftime('%m/%d/%Y')
        invalid_dates = df[df['Date'].isnull()]
        invalid_dates = invalid_dates[~invalid_dates.index.isin(original_null[original_null].index)]
        if not invalid_dates.empty:
            with open(data_path + client_alias + '_removed_ids.log', 'w') as log_file:
                    log_file.write(f"Removed IDs for column {col}:\n")
                    log_file.write(invalid_dates['ID'].to_string(index=False) + '\n')
                
                
            df = df[~df['ID'].isin(invalid_dates['ID'])]
    except Exception as e:
        None

EXPECTED OUTCOME

A DataFrame that only contains ID 5 and the corresponding date in a string value that could not be parsed.

ID    Date
5     '|199/142/4'

Solution

  • A possible solution:

    m = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce').isna()
    df[m & df['Date'].ne('')]
    

    All dates that are not parsable will be converted to NaT with pd.to_datetime. Thus, we need to look at the NaT to identify the ones that do not correspond to empty dates ('').

    Output:

       ID        Date
    4   5  |199/142/4