My DataFrame looks like this:
x date_from cleaned_date
1 21 JUNE 23.59 2024-06-23
2 18TH JUN 23:59 2024-06-18
3 01TH JULY (23.59 HRS) 2024-07-01
4 28th June 2023 2023-06-28
5 5TH MAY 2023 2023-05-05
6 JUNE 27, 2023 2023-06-27
I wrote code that extracts the correct date from the date_from column, but somehow it adds a year out of nothing: some rows contain no year information, yet a year still gets filled in. I then wrote code that deals with this by replacing the whole value in the cleaned_date column with None when there is no year information. It looks like this:
df.loc[(~df['date_from'].astype('str').str.contains('2025')) &
(~df['date_from'].astype('str').str.contains('2024')) &
(df['date_from'].astype('str') != 'nan') &
(~df['date_from'].astype('str').str.contains('2023')) &
(~df['date_from'].astype('str').str.contains('2022')) &
(~df['date_from'].astype('str').str.contains('2021')) &
(~df['date_from'].astype('str').str.contains('2020')) &
(~df['date_from'].astype('str').str.contains('2019')), 'cleaned_date'] = None
Unfortunately I have more years to check, so is there a way to use, for instance, a for loop? Do you have any ideas?
Regards
You could replace your many conditions with a single regex:
df.loc[~df['date_from'].astype('str').str.contains('nan|(?:2019|202[0-5])'),
       'cleaned_date'] = None
Output:
x date_from cleaned_date
0 1 21 JUNE 23.59 None
1 2 18TH JUN 23:59 None
2 3 01TH JULY (23.59 HRS) None
3 4 28th June 2023 2023-06-28
4 5 5TH MAY 2023 2023-05-05
5 6 JUNE 27, 2023 2023-06-27
nan # match nan
| # or
(?:2019|202[0-5]) # match 2019 or 2020-2025
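Since you mention having more years to check, you don't have to hard-code the alternation at all: you can build the pattern from a list (or range) of years with str.join. A minimal self-contained sketch using the sample data from your question (the 2019-2030 range is just an assumption; widen it as needed):

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'date_from': ['21 JUNE 23.59', '18TH JUN 23:59', '01TH JULY (23.59 HRS)',
                  '28th June 2023', '5TH MAY 2023', 'JUNE 27, 2023'],
    'cleaned_date': ['2024-06-23', '2024-06-18', '2024-07-01',
                     '2023-06-28', '2023-05-05', '2023-06-27'],
})

# build 'nan|2019|2020|...|2030' instead of hard-coding each year
years = range(2019, 2031)
pattern = 'nan|' + '|'.join(str(y) for y in years)

# null out cleaned_date wherever date_from carries no year information
df.loc[~df['date_from'].astype('str').str.contains(pattern), 'cleaned_date'] = None
print(df)
```

This keeps a single vectorized str.contains call no matter how many years you need.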
If you insist on the many conditions, this is doable with numpy.logical_or.reduce:
import numpy as np

vals = ['nan', '2019', '2020', '2021', '2022', '2023', '2024', '2025']
cond = ~np.logical_or.reduce([df['date_from'].astype('str').str.contains(val)
                              for val in vals])
df.loc[cond, 'cleaned_date'] = None
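And since you asked about a for loop specifically: the same reduction can be written as an explicit loop that ORs each boolean mask into an accumulator with |=. A self-contained sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6],
    'date_from': ['21 JUNE 23.59', '18TH JUN 23:59', '01TH JULY (23.59 HRS)',
                  '28th June 2023', '5TH MAY 2023', 'JUNE 27, 2023'],
    'cleaned_date': ['2024-06-23', '2024-06-18', '2024-07-01',
                     '2023-06-28', '2023-05-05', '2023-06-27'],
})

vals = ['nan', '2019', '2020', '2021', '2022', '2023', '2024', '2025']

# accumulate "contains any of the tokens" one mask at a time
has_year = pd.Series(False, index=df.index)
for val in vals:
    has_year |= df['date_from'].astype('str').str.contains(val)

df.loc[~has_year, 'cleaned_date'] = None
print(df)
```

This is equivalent to the logical_or.reduce version, just more verbose; the vectorized regex approach above is still the most concise.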