
Python: iterate over a pandas DataFrame and replace values that do not contain any string from a list


My DataFrame looks like this:

x  date_from             cleaned_date
1  21 JUNE 23.59          2024-06-23
2  18TH JUN 23:59        2024-06-18   
3  01TH JULY (23.59 HRS) 2024-07-01
4  28th June 2023        2023-06-28
5  5TH MAY 2023          2023-05-05
6  JUNE 27, 2023         2023-06-27

I wrote code that extracts the correct date from the date_from column, but it sometimes adds a year out of nowhere: in some rows date_from contains no year, yet a year still ends up in cleaned_date. To deal with this, I wrote code that replaces the value in cleaned_date with None when date_from has no year information. It looks like this:

df.loc[(~df['date_from'].astype('str').str.contains('2025')) &
                (~df['date_from'].astype('str').str.contains('2024')) &
                 (df['date_from'].astype('str') != 'nan') & 
                 (~df['date_from'].astype('str').str.contains('2023')) &
                 (~df['date_from'].astype('str').str.contains('2022')) &
                 (~df['date_from'].astype('str').str.contains('2021')) &
                 (~df['date_from'].astype('str').str.contains('2020')) &
                 (~df['date_from'].astype('str').str.contains('2019')),  'cleaned_date'] = None

Unfortunately, I have more years to check. Is there a way to do this with, for instance, a for loop? Any ideas?

Regards


Solution

  • You could replace your many conditions with a single regex:

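    # astype('str') turns missing values into the string 'nan', so the
    # 'nan' alternative in the regex can match them (as in the question)
    df.loc[~df['date_from'].astype('str').str.contains('nan|(?:2019|202[0-5])'),
           'cleaned_date'] = None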
    

    Output:

       x              date_from cleaned_date
    0  1          21 JUNE 23.59         None
    1  2         18TH JUN 23:59         None
    2  3  01TH JULY (23.59 HRS)         None
    3  4         28th June 2023   2023-06-28
    4  5           5TH MAY 2023   2023-05-05
    5  6          JUNE 27, 2023   2023-06-27
    

    How the regex works:

    nan                # match nan
    |                  # or
    (?:2019|202[0-5])  # match 2019 or 2020-2025
    

    If you prefer to keep one condition per value, you can combine them with numpy.logical_or.reduce:

    import numpy as np

    vals = ['nan', '2019', '2020', '2021', '2022', '2023', '2024', '2025']
    
    cond = ~np.logical_or.reduce([df['date_from'].astype('str').str.contains(val)
                                  for val in vals])
    
    df.loc[cond, 'cleaned_date'] = None
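
Since the question mentions having more years to check, the alternation can also be generated from a range instead of being typed by hand. The following is only a minimal sketch: the sample frame mirrors the one in the question, and the bounds 2019 to 2030 are an assumed example range, not something stated in the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "date_from": [
        "21 JUNE 23.59", "18TH JUN 23:59", "01TH JULY (23.59 HRS)",
        "28th June 2023", "5TH MAY 2023", "JUNE 27, 2023",
    ],
    "cleaned_date": [
        "2024-06-23", "2024-06-18", "2024-07-01",
        "2023-06-28", "2023-05-05", "2023-06-27",
    ],
})

# Assumed range of accepted years (2019..2030); adjust both bounds as needed.
years = range(2019, 2031)

# Build "nan|2019|2020|...|2030". The 'nan' alternative leaves rows with a
# missing date_from untouched, as in the original conditions.
pattern = "|".join(["nan", *map(str, years)])

# True where date_from mentions one of the years (or is missing).
has_year = df["date_from"].astype(str).str.contains(pattern)

# Blank out cleaned_date wherever no year was found.
df.loc[~has_year, "cleaned_date"] = None
```

This is a plain substring match, so a year would also match inside a longer run of digits. If that matters for your data, wrapping each year in word boundaries (for example r"\b2023\b") is one way to tighten it.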