python datetime regression logistic-regression

How can I impute missing date values by using the average difference between two date columns?

Thanks in advance for any help you can provide. I have a dataset containing some healthcare data and am trying my hand at using python for EDA/regression modeling on the set. I have one date column [date_of_incident] with a lot of the data missing or incorrect. I also have a [treatment_date] column that has accurate information. I have converted both columns to datetime and created a new column: [dt_diff]=[treatment_date]-[date_of_incident] to find out how many days between the two columns.

I want to use the average of [dt_diff] to impute new dates in the [date_of_incident] column. Basically [new_date_of_incident]=[treatment_date]-[dt_diff].mean(), but I do not want to replace all of the dates in the column. Just the missing or incorrect ones.

For the sake of an example, say the average of [dt_diff] is 7 days. Case A's [date_of_incident] is NaN and has a [treatment_date] of 5/17/2025, Case B's [date_of_incident] is 6/30/1965 and has a [treatment_date] of 5/20/2025. What is the best way for me to change Case A's [date_of_incident] to 5/10/2025 and Case B's [date_of_incident] to 5/13/2025, but for well over 1,000 rows? The dataset is not large enough for me to drop these rows and that column is important to the goal of the analysis.

I calculated the average of [dt_diff] and just don't know how to proceed. I'm a super noob to python coding.

Solution

Assuming that you are using pandas.

Using the following as an example:

>>> df
  date_of_incident treatment_date
0       2025-02-20     2025-04-29
1              NaT     2025-05-22
2              NaT     2025-05-27
3       2025-03-14     2025-04-26
4              NaT     2025-05-11
5       2025-02-16     2025-04-23
6              NaT     2025-05-04
7              NaT     2025-05-20
8       2025-03-18     2025-05-28
9       2025-02-22     2025-05-21

A possible solution could be to calculate the average difference like you mentioned, and then create a copy of the two relevant columns, and set/overwrite the "date_of_incident" column to the "treatment_date" value minus the average.

>>> subdf = df[df["date_of_incident"].notna()]
>>> avg_diff = (subdf["treatment_date"] - subdf["date_of_incident"]).mean().round('d')

>>> avg_diff
Timedelta('67 days 00:00:00')

>>> df_copy = df[["date_of_incident", "treatment_date"]].copy()
>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff

Then use the update method on the original DataFrame with the overwrite keyword argument set to False to update the missing values with the calculated value from the copy.

>>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
>>> df.update(df_copy, overwrite=False)
>>> df
  date_of_incident treatment_date
0       2025-02-20     2025-04-29
1       2025-07-28     2025-05-22
2       2025-08-02     2025-05-27
3       2025-03-14     2025-04-26
4       2025-07-17     2025-05-11
5       2025-02-16     2025-04-23
6       2025-07-10     2025-05-04
7       2025-07-26     2025-05-20
8       2025-03-18     2025-05-28
9       2025-02-22     2025-05-21

By using update with overwrite=False you are preserving the original values in the column while filling in any missing values that are found in the copy.