pythondatetimeregressionlogistic-regression

How can I impute missing date values by using the average difference between two date columns?


Thanks in advance for any help you can provide. I have a dataset containing some healthcare data and am trying my hand at using python for EDA/regression modeling on the set. I have one date column [date_of_incident] with a lot of the data missing or incorrect. I also have a [treatment_date] column that has accurate information. I have converted both columns to datetime and created a new column: [dt_diff]=[treatment_date]-[date_of_incident] to find out how many days between the two columns.

I want to use the average of [dt_diff] to impute new dates in the [date_of_incident] column. Basically [new_date_of_incident]=[treatment_date]-[dt_diff].mean(), but I do not want to replace all of the dates in the column. Just the missing or incorrect ones.

For the sake of an example, say the average of [dt_diff] is 7 days. Case A's [date_of_incident] is NaN and has a [treatment_date] of 5/17/2025, Case B's [date_of_incident] is 6/30/1965 and has a [treatment_date] of 5/20/2025. What is the best way for me to change Case A's [date_of_incident] to 5/10/2025 and Case B's [date_of_incident] to 5/13/2025, but for well over 1,000 rows? The dataset is not large enough for me to drop these rows and that column is important to the goal of the analysis.

I calculated the average of [dt_diff] and just don't know how to proceed. I'm a super noob to python coding.


Solution

  • Assuming that you are using pandas.

    Using the following as an example:

    >>> df
      date_of_incident treatment_date
    0       2025-02-20     2025-04-29
    1              NaT     2025-05-22
    2              NaT     2025-05-27
    3       2025-03-14     2025-04-26
    4              NaT     2025-05-11
    5       2025-02-16     2025-04-23
    6              NaT     2025-05-04
    7              NaT     2025-05-20
    8       2025-03-18     2025-05-28
    9       2025-02-22     2025-05-21
    

    A possible solution could be to calculate the average difference like you mentioned, and then create a copy of the two relevant columns, and set/overwrite the "date_of_incident" column to the "treatment_date" value minus the average.

    >>> subdf = df[df["date_of_incident"].notna()]
    >>> avg_diff = (subdf["treatment_date"] - subdf["date_of_incident"]).mean().round('d')
    
    >>> avg_diff
    Timedelta('67 days 00:00:00')
    
    >>> df_copy = df[["date_of_incident", "treatment_date"]].copy()
    >>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
    

    Then use the update method on the original DataFrame with the overwrite keyword argument set to False to update the missing values with the calculated value from the copy.

    >>> df_copy["date_of_incident"] = df_copy["treatment_date"] + avg_diff
    >>> df.update(df_copy, overwrite=False)
    >>> df
      date_of_incident treatment_date
    0       2025-02-20     2025-04-29
    1       2025-07-28     2025-05-22
    2       2025-08-02     2025-05-27
    3       2025-03-14     2025-04-26
    4       2025-07-17     2025-05-11
    5       2025-02-16     2025-04-23
    6       2025-07-10     2025-05-04
    7       2025-07-26     2025-05-20
    8       2025-03-18     2025-05-28
    9       2025-02-22     2025-05-21
    

    By using update with overwrite=False you are preserving the original values in the column while filling in any missing values that are found in the copy.