Multiple groupings and aggregation after taking subsequent dates values to check a condition

I have a dataframe where I want to return the number (proportion) of patients that have had a subsequent follow up after diagnosis of diseaseacross departments.

Example; a patient can have diag_yn across multiple departments but I want to check has the patient had follow up like before but for each department.

So, Patient 1 would be recorded twice in example, 1 for Radiology and again for Respiratory as it had a follow up appt for both departments.

Original DF

dept         | patient_id | app_date   | diag_date  | diag_yn   |
|------------|------------|------------|------------|-----------|
Radiology    | 1          | 2024-01-11 | NaT        | NaN       |
Radiology    | 1          | 2024-03-14 | 2024-03-14 | 1         |
Radiology    | 1          | 2024-04-09 | NaT        | NaN       |
Radiology    | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-02-11 | NaT        | NaN       |
Respiratory  | 1          | 2024-04-14 | 2024-04-14 | 1         |
Respiratory  | 1          | 2024-06-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-01-11 | NaT        | NaN       |
Respiratory  | 2          | 2024-03-14 | 2024-03-14 | 1         |
Respiratory  | 2          | 2024-04-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-09-09 | NaT        | NaN       |

Desired Output

dept        | patient_with_fup | count |
Radiology   |------------------|-------|
            | 1                | 1     |
            | 0                | 0     |
Respiratory |------------------|-------|
            | 1                | 2     |
            | 0                | 0     |

Have been provided the following code but before the dept column was added to the DF.

 (df.groupby('patient_id')
   .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                          pd.NaT)).any(),
          include_groups=False)
   .astype(int).value_counts()
   .reindex([0, 1], fill_value=0).rename_axis('fup_yn')
   .reset_index()
)

Solution

You could start like the original approach, using the two columns as group, then compute the counts with pivot_table:

(df.groupby(['patient_id', 'dept'])
   .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                          pd.NaT)).any(),
          include_groups=False)
   .astype(int).reset_index(name='patient_with_fup')
   .pivot_table(index='patient_with_fup', columns='dept',
                values='patient_id', aggfunc='count')
   .reindex([0, 1], fill_value=0)
   .unstack().reset_index(name='count')
)

Output:

          dept  patient_with_fup  count
0    Radiology                 0      1
1    Radiology                 1      0
2  Respiratory                 0      2
3  Respiratory                 1      0

Example with more groups:

# let's add a new column
df['unit'] = ['A']*11+['B']

groups = ['dept', 'unit']

(df.groupby(['patient_id']+groups)
   .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                          pd.NaT)).any(),
          include_groups=False)
   .astype(int).reset_index(name='patient_with_fup')
   .pivot_table(index='patient_with_fup', columns=groups,
                values='patient_id', aggfunc='count')
   .reindex([0, 1], fill_value=0)
   .unstack().reset_index(name='count')
)

Output:

          dept unit  patient_with_fup  count
0    Radiology    A                 0      1
1    Radiology    A                 1      0
2  Respiratory    A                 0      2
3  Respiratory    A                 1      0
4  Respiratory    B                 0      1
5  Respiratory    B                 1      0