2 date columns comparison to indicate whether a record occured after another

I have a dataframe where I want to return the number (proportion) of patinets that have had a subsequent follow up after diagnosis of disease.

Original DF (1 Patient example)

| patient_id | app_date   | diag_date  | cancer_yn |
|------------|------------|------------|-----------|
| 1          | 2024-01-11 | NaT        | NaN       |
| 1          | 2024-03-14 | 2024-03-14 | 1         |
| 1          | 2024-04-09 | NaT        | NaN       |
| 1          | 2024-09-09 | NaT        | NaN       |

Intermediate DF (Indicates appointment record per patient was a follow up of diagnosis date or not)

| patient_id | app_date   | diag_date  | cancer_yn | fup_yn |
|------------|------------|------------|-----------|--------|
| 1          | 2024-01-11 | NaT        | NaN       | 0      |
| 1          | 2024-03-14 | 2024-03-14 | 1         | 0      |
| 1          | 2024-04-09 | NaT        | NaN       | 1      |
| 1          | 2024-09-09 | NaT        | NaN       | 1      |

Summarised DF (Collapsed, by groupby on patient_id and value_counts() or something similiar applied

| patient_with_fup | count |
|------------------|-------|
| 1                | 24    |
| 0                | 67    |

Original DF (2nd Example) -

Hoping to implement what solution above had done at patient level but for departments, say a patient can have diag_yn across multiple departments but I wnat to check has the patient had follow up like before but for each department.

So, Patient 1 would be recorded twice in example, 1 for Radiology and again for Respiratory as it had a follow up appt for both departments.

dept         | patient_id | app_date   | diag_date  | diag_yn   |
|------------|------------|------------|------------|-----------|
Radiology    | 1          | 2024-01-11 | NaT        | NaN       |
Radiology    | 1          | 2024-03-14 | 2024-03-14 | 1         |
Radiology    | 1          | 2024-04-09 | NaT        | NaN       |
Radiology    | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-02-11 | NaT        | NaN       |
Respiratory  | 1          | 2024-04-14 | 2024-04-14 | 1         |
Respiratory  | 1          | 2024-06-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-01-11 | NaT        | NaN       |
Respiratory  | 2          | 2024-03-14 | 2024-03-14 | 1         |
Respiratory  | 2          | 2024-04-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-09-09 | NaT        | NaN       |

Output (2nd Example)

dept        | patient_with_fup | count |
Radiology   |------------------|-------|
            | 1                | 1     |
            | 0                | 0     |
Respiratory |------------------|-------|
            | 1                | 2     |
            | 0                | 0     |

You can see the 2nd record indicates the appointment where diagnosis was made (diag_date is available and the same as app_date), this patient has had subsequent appointments, I want to flag that this is the case (say follow_ups == 1.

I'm finding it hard to understand how I can groupby different patients and apply value_counts() on a flag indicating a patient has had follow ups after a diagnosis appointment. Suggestions around based way to reshape the data and generate the flag would be great.

Solution

Assuming you have a typo in your data and that 2022-03-14 is 2024-03-14, you can identify the subsequent appointment with groupby.transform:

# ensure datetime
df[['app_date', 'diag_date']] = df[['app_date', 'diag_date']
                                   ].apply(pd.to_datetime)

df['fup_yn'] = (df.groupby('patient_id')['diag_date']
                  .transform('first').lt(df['app_date'])
                  .astype(int)
               )

Output:

   patient_id   app_date  diag_date  cancer_yn  fup_yn
0           1 2024-01-11        NaT        NaN       0
1           1 2024-03-14 2024-03-14        1.0       0
2           1 2024-04-09        NaT        NaN       1
3           1 2024-09-09        NaT        NaN       1

For the final output, you don't really need this intermediate, you could use directly:

(df.groupby('patient_id')
   .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                          pd.NaT)).any(),
          include_groups=False)
   .astype(int).value_counts()
   .reindex([0, 1], fill_value=0).rename_axis('fup_yn')
   .reset_index()
)

Output:

   fup_yn  count
0       0      0
1       1      1

Here is a more complete example for clarity:

# part 1
   patient_id   app_date  diag_date  cancer_yn  fup_yn
0           1 2024-01-11        NaT        NaN       0
1           1 2024-03-14 2024-03-14        1.0       0
2           1 2024-04-09        NaT        NaN       1
3           1 2024-09-09        NaT        NaN       1
4           2 2023-01-11        NaT        NaN       0
5           2 2023-03-14 2023-03-14        1.0       0
6           3 2022-04-09 2022-05-14        NaN       0
7           3 2022-09-09        NaT        NaN       1


# part 2
   fup_yn  count
0       0      1
1       1      2