Thanks in advance for any help you can provide. I have a dataset containing some healthcare data and am trying my hand at using python for EDA/regression modeling on the set. I have one date column [date_of_incident] with a lot of the data missing or incorrect. I also have a [treatment_date] column that has accurate information. I have converted both columns to datetime and created a new column: [dt_diff]=[treatment_date]-[date_of_incident] to find out how many days between the two columns.
I want to use the average of [dt_diff] to impute new dates in the [date_of_incident] column. Basically [new_date_of_incident]=[treatment_date]-[dt_diff].mean(), but I do not want to replace all of the dates in the column. Just the missing or incorrect ones.
For the sake of an example, say the average of [dt_diff] is 7 days. Case A's [date_of_incident] is NaN and has a [treatment_date] of 5/17/2025, Case B's [date_of_incident] is 6/30/1965 and has a [treatment_date] of 5/20/2025. What is the best way for me to change Case A's [date_of_incident] to 5/10/2025 and Case B's [date_of_incident] to 5/13/2025, but for well over 1,000 rows? The dataset is not large enough for me to drop these rows and that column is important to the goal of the analysis.
I calculated the average of [dt_diff] and just don't know how to proceed. I'm a super noob to python coding.
Assuming that you are using pandas
.
Using the following as an example:
>>> df
date_of_incident treatment_date
0 2025-02-20 2025-04-29
1 NaT 2025-05-22
2 NaT 2025-05-27
3 2025-03-14 2025-04-26
4 NaT 2025-05-11
5 2025-02-16 2025-04-23
6 NaT 2025-05-04
7 NaT 2025-05-20
8 2025-03-18 2025-05-28
9 2025-02-22 2025-05-21
A possible solution could be to calculate the average difference like you mentioned, and then create a copy of the two relevant columns, and set/overwrite the "date_of_incident" column to the "treatment_date" value minus the average.
>>> subdf = df[df["date_of_incident"].notna()]
>>> avg_diff = (subdf["treatment_date"] - subdf["date_of_incident"]).mean().round('d')
>>> avg_diff
Timedelta('67 days 00:00:00')
>>> df_copy = df[["date_of_incident", "treatment_date"]].copy()
>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
Then use the update
method on the original DataFrame with the overwrite
keyword argument set to False
to update the missing values with the calculated value from the copy.
>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
>>> df.update(df_copy, overwrite=False)
>>> df
date_of_incident treatment_date
0 2025-02-20 2025-04-29
1 2025-07-28 2025-05-22
2 2025-08-02 2025-05-27
3 2025-03-14 2025-04-26
4 2025-07-17 2025-05-11
5 2025-02-16 2025-04-23
6 2025-07-10 2025-05-04
7 2025-07-26 2025-05-20
8 2025-03-18 2025-05-28
9 2025-02-22 2025-05-21
By using update
with overwrite=False
you are preserving the original values in the column while filling in any missing values that are found in the copy.