pythonpandasdataframe

Merge two dataframes on date, taking row based on most recent date


I have two linked dataframes which contain data on the status of something but both the data sources are poor. I have the status (e.g. 1=lost, 2=found, 3=unknown) and the date if found.

For instance, say dfA is so:

ID, status, dateFound, registerDate
1,  1,      NaN ,      5/3/24
2,  2,      1/1/24,    1/1/24
3,  2,      2/1/24,    3/1/24

and 'dfB' is so:

ID, thing_status, dateFound, date_registered
1,  2,            6/4/24,    23/4/24
2,  1,            NaN,       24/12/23
3,  2,            4/1/24,    5/1/24

I wish to create a new dataframe taking the last row from either dfA or dfB based on the latest registerDate/date_registered.

i.e. the ideal output would be:

ID, status, dateFound, registerDate
1,  2,      6/4/24,    23/4/24
2,  2,      1/1/24,    1/1/24
3,  2,      4/1/24,    5/1/24

I am currently doing this using a for-loop but suspect there is a simpler way.


Solution

  • Concat, sort, drop duplicates

    A more pandas-idiomatic and efficient method is to:

    1. Concatenate dfA and dfB.

    2. Sort the combined data by 'ID' and 'registerDate' (ensuring the latest dates come first for each ID).

    3. Drop duplicates, keeping the first occurrence for each 'ID'.

    This leverages pandas' optimized functions effectively.

    Prerequisite steps

    # Change string dates to datetime
    def convert_dates(df, columns, date_format='%d/%m/%y'):
        for col in columns:
            df[col] = pd.to_datetime(df[col], format=date_format, errors='coerce')
    
    convert_dates(dfA, ['dateFound', 'registerDate'])
    convert_dates(dfB, ['dateFound', 'date_registered'])
    
    # Rename columns in dfB
    dfB_renamed = dfB.rename(columns={
        'thing_status': 'status',
        'date_registered': 'registerDate'
    })
    

    Code

    # 1. concat
    combined_df = pd.concat([dfA, dfB_renamed], ignore_index=True)
    
    # 2. Sort
    combined_df_sorted = combined_df.sort_values(
        by=['ID', 'registerDate'],
        ascending=[True, False])
    
    # 3. drop_duplicates
    final_df = combined_df_sorted.drop_duplicates(subset=['ID'], keep='first')
    
    final_df
    

    Output:

    ID  status  dateFound registerDate
     1       2 2024-04-06   2024-04-23
     2       2 2024-01-01   2024-01-01
     3       2 2024-01-04   2024-01-05
    

    Setup for reference

    import pandas as pd
    import numpy as np
    
    # Sample dfA
    data_a = {
        'ID': [1, 2, 3],
        'status': [1, 2, 2],
        'dateFound': [np.nan, '1/1/24', '2/1/24'],
        'registerDate': ['5/3/24', '1/1/24', '3/1/24']
    }
    dfA = pd.DataFrame(data_a)
    
    # Sample dfB
    data_b = {
        'ID': [1, 2, 3],
        'thing_status': [2, 1, 2],
        'dateFound': ['6/4/24', np.nan, '4/1/24'],
        'date_registered': ['23/4/24', '24/12/23', '5/1/24']
    }
    dfB = pd.DataFrame(data_b)