CONTEXT
I am trying to parse dates in a DataFrame and want to catch all rows that aren't able to be converted and then log those rows/ID's. I want to ignore any rows that fail to parse because it is a null or empty value and only include those that fail such as '|199/142/4' in the example DataFrame below.
CURRENT CODE
# Create a dataframe with float values
data = {
'ID': [1, 2, 3, 4, 5],
'Date': ['09/02/1999', '11/12/2001', '', '', '|199/142/4']
}
df = pd.DataFrame(data)
original_null = df[col].isnull()
try:
df['Date'] = pd.to_datetime(df['Date'], format='mixed', errors='coerce').dt.strftime('%m/%d/%Y')
invalid_dates = df[df['Date'].isnull()]
invalid_dates = invalid_dates[~invalid_dates.index.isin(original_null[original_null].index)]
if not invalid_dates.empty:
with open(data_path + client_alias + '_removed_ids.log', 'w') as log_file:
log_file.write(f"Removed IDs for column {col}:\n")
log_file.write(invalid_dates['ID'].to_string(index=False) + '\n')
df = df[~df['ID'].isin(invalid_dates['ID'])]
except Exception as e:
None
EXPECTED OUTCOME
A DataFrame that only contains ID 5 and the corresponding date in a string value that could not be parsed.
ID Date
5 '|199/142/4'
A possible solution:
m = pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce').isna()
df[m & df['Date'].ne('')]
All dates that are not parsable will be converted to NaT
with pd.to_datetime
. Thus, we need to look at the NaT
to identify the ones that do not correspond to empty dates (''
).
Output:
ID Date
4 5 |199/142/4