pythonpandastransformation

2 date columns comparison to indicate whether a record occured after another


I have a dataframe where I want to return the number (proportion) of patinets that have had a subsequent follow up after diagnosis of disease.

Original DF (1 Patient example)

| patient_id | app_date   | diag_date  | cancer_yn |
|------------|------------|------------|-----------|
| 1          | 2024-01-11 | NaT        | NaN       |
| 1          | 2024-03-14 | 2024-03-14 | 1         |
| 1          | 2024-04-09 | NaT        | NaN       |
| 1          | 2024-09-09 | NaT        | NaN       |

Intermediate DF (Indicates appointment record per patient was a follow up of diagnosis date or not)

| patient_id | app_date   | diag_date  | cancer_yn | fup_yn |
|------------|------------|------------|-----------|--------|
| 1          | 2024-01-11 | NaT        | NaN       | 0      |
| 1          | 2024-03-14 | 2024-03-14 | 1         | 0      |
| 1          | 2024-04-09 | NaT        | NaN       | 1      |
| 1          | 2024-09-09 | NaT        | NaN       | 1      |

Summarised DF (Collapsed, by groupby on patient_id and value_counts() or something similiar applied

| patient_with_fup | count |
|------------------|-------|
| 1                | 24    |
| 0                | 67    |

Original DF (2nd Example) -

Hoping to implement what solution above had done at patient level but for departments, say a patient can have diag_yn across multiple departments but I wnat to check has the patient had follow up like before but for each department.

So, Patient 1 would be recorded twice in example, 1 for Radiology and again for Respiratory as it had a follow up appt for both departments.

dept         | patient_id | app_date   | diag_date  | diag_yn   |
|------------|------------|------------|------------|-----------|
Radiology    | 1          | 2024-01-11 | NaT        | NaN       |
Radiology    | 1          | 2024-03-14 | 2024-03-14 | 1         |
Radiology    | 1          | 2024-04-09 | NaT        | NaN       |
Radiology    | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-02-11 | NaT        | NaN       |
Respiratory  | 1          | 2024-04-14 | 2024-04-14 | 1         |
Respiratory  | 1          | 2024-06-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-01-11 | NaT        | NaN       |
Respiratory  | 2          | 2024-03-14 | 2024-03-14 | 1         |
Respiratory  | 2          | 2024-04-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-09-09 | NaT        | NaN       |

Output (2nd Example)

dept        | patient_with_fup | count |
Radiology   |------------------|-------|
            | 1                | 1     |
            | 0                | 0     |
Respiratory |------------------|-------|
            | 1                | 2     |
            | 0                | 0     |

You can see the 2nd record indicates the appointment where diagnosis was made (diag_date is available and the same as app_date), this patient has had subsequent appointments, I want to flag that this is the case (say follow_ups == 1.

I'm finding it hard to understand how I can groupby different patients and apply value_counts() on a flag indicating a patient has had follow ups after a diagnosis appointment. Suggestions around based way to reshape the data and generate the flag would be great.


Solution

  • Assuming you have a typo in your data and that 2022-03-14 is 2024-03-14, you can identify the subsequent appointment with groupby.transform:

    # ensure datetime
    df[['app_date', 'diag_date']] = df[['app_date', 'diag_date']
                                       ].apply(pd.to_datetime)
    
    df['fup_yn'] = (df.groupby('patient_id')['diag_date']
                      .transform('first').lt(df['app_date'])
                      .astype(int)
                   )
    

    Output:

       patient_id   app_date  diag_date  cancer_yn  fup_yn
    0           1 2024-01-11        NaT        NaN       0
    1           1 2024-03-14 2024-03-14        1.0       0
    2           1 2024-04-09        NaT        NaN       1
    3           1 2024-09-09        NaT        NaN       1
    

    For the final output, you don't really need this intermediate, you could use directly:

    (df.groupby('patient_id')
       .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                              pd.NaT)).any(),
              include_groups=False)
       .astype(int).value_counts()
       .reindex([0, 1], fill_value=0).rename_axis('fup_yn')
       .reset_index()
    )
    

    Output:

       fup_yn  count
    0       0      0
    1       1      1
    

    Here is a more complete example for clarity:

    # part 1
       patient_id   app_date  diag_date  cancer_yn  fup_yn
    0           1 2024-01-11        NaT        NaN       0
    1           1 2024-03-14 2024-03-14        1.0       0
    2           1 2024-04-09        NaT        NaN       1
    3           1 2024-09-09        NaT        NaN       1
    4           2 2023-01-11        NaT        NaN       0
    5           2 2023-03-14 2023-03-14        1.0       0
    6           3 2022-04-09 2022-05-14        NaN       0
    7           3 2022-09-09        NaT        NaN       1
    
    
    # part 2
       fup_yn  count
    0       0      1
    1       1      2