I have two linked dataframes which contain data on the status of something but both the data sources are poor. I have the status (e.g. 1=lost, 2=found, 3=unknown) and the date if found.
For instance, say dfA
is so:
ID, status, dateFound, registerDate
1, 1, NaN , 5/3/24
2, 2, 1/1/24, 1/1/24
3, 2, 2/1/24, 3/1/24
and 'dfB' is so:
ID, thing_status, dateFound, date_registered
1, 2, 6/4/24, 23/4/24
2, 1, NaN, 24/12/23
3, 2, 4/1/24, 5/1/24
I wish to create a new dataframe taking the last row from either dfA
or dfB
based on the latest registerDate/date_registered.
i.e. the ideal output would be:
ID, status, dateFound, registerDate
1, 2, 6/4/24, 23/4/24
2, 2, 1/1/24, 1/1/24
3, 2, 4/1/24, 5/1/24
I am currently doing this using a for-loop but suspect there is a simpler way.
A more pandas-idiomatic and efficient method is to:
Concatenate dfA
and dfB
.
Sort the combined data by 'ID' and 'registerDate' (ensuring the latest dates come first for each ID).
Drop duplicates, keeping the first occurrence for each 'ID'.
This leverages pandas' optimized functions effectively.
# Change string dates to datetime
def convert_dates(df, columns, date_format='%d/%m/%y'):
for col in columns:
df[col] = pd.to_datetime(df[col], format=date_format, errors='coerce')
convert_dates(dfA, ['dateFound', 'registerDate'])
convert_dates(dfB, ['dateFound', 'date_registered'])
# Rename columns in dfB
dfB_renamed = dfB.rename(columns={
'thing_status': 'status',
'date_registered': 'registerDate'
})
# 1. concat
combined_df = pd.concat([dfA, dfB_renamed], ignore_index=True)
# 2. Sort
combined_df_sorted = combined_df.sort_values(
by=['ID', 'registerDate'],
ascending=[True, False])
# 3. drop_duplicates
final_df = combined_df_sorted.drop_duplicates(subset=['ID'], keep='first')
final_df
Output:
ID status dateFound registerDate
1 2 2024-04-06 2024-04-23
2 2 2024-01-01 2024-01-01
3 2 2024-01-04 2024-01-05
import pandas as pd
import numpy as np
# Sample dfA
data_a = {
'ID': [1, 2, 3],
'status': [1, 2, 2],
'dateFound': [np.nan, '1/1/24', '2/1/24'],
'registerDate': ['5/3/24', '1/1/24', '3/1/24']
}
dfA = pd.DataFrame(data_a)
# Sample dfB
data_b = {
'ID': [1, 2, 3],
'thing_status': [2, 1, 2],
'dateFound': ['6/4/24', np.nan, '4/1/24'],
'date_registered': ['23/4/24', '24/12/23', '5/1/24']
}
dfB = pd.DataFrame(data_b)