pythonpandasdataframe

Multiple groupings and aggregation after taking subsequent dates values to check a condition


I have a dataframe where I want to return the number (proportion) of patients that have had a subsequent follow up after diagnosis of diseaseacross departments.

Example; a patient can have diag_yn across multiple departments but I want to check has the patient had follow up like before but for each department.

So, Patient 1 would be recorded twice in example, 1 for Radiology and again for Respiratory as it had a follow up appt for both departments.

Original DF

dept         | patient_id | app_date   | diag_date  | diag_yn   |
|------------|------------|------------|------------|-----------|
Radiology    | 1          | 2024-01-11 | NaT        | NaN       |
Radiology    | 1          | 2024-03-14 | 2024-03-14 | 1         |
Radiology    | 1          | 2024-04-09 | NaT        | NaN       |
Radiology    | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-02-11 | NaT        | NaN       |
Respiratory  | 1          | 2024-04-14 | 2024-04-14 | 1         |
Respiratory  | 1          | 2024-06-09 | NaT        | NaN       |
Respiratory  | 1          | 2024-09-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-01-11 | NaT        | NaN       |
Respiratory  | 2          | 2024-03-14 | 2024-03-14 | 1         |
Respiratory  | 2          | 2024-04-09 | NaT        | NaN       |
Respiratory  | 2          | 2024-09-09 | NaT        | NaN       |

Desired Output

dept        | patient_with_fup | count |
Radiology   |------------------|-------|
            | 1                | 1     |
            | 0                | 0     |
Respiratory |------------------|-------|
            | 1                | 2     |
            | 0                | 0     |

Have been provided the following code but before the dept column was added to the DF.

 (df.groupby('patient_id')
   .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                          pd.NaT)).any(),
          include_groups=False)
   .astype(int).value_counts()
   .reindex([0, 1], fill_value=0).rename_axis('fup_yn')
   .reset_index()
)

Solution

  • You could start like the original approach, using the two columns as group, then compute the counts with pivot_table:

    (df.groupby(['patient_id', 'dept'])
       .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                              pd.NaT)).any(),
              include_groups=False)
       .astype(int).reset_index(name='patient_with_fup')
       .pivot_table(index='patient_with_fup', columns='dept',
                    values='patient_id', aggfunc='count')
       .reindex([0, 1], fill_value=0)
       .unstack().reset_index(name='count')
    )
    

    Output:

              dept  patient_with_fup  count
    0    Radiology                 0      1
    1    Radiology                 1      0
    2  Respiratory                 0      2
    3  Respiratory                 1      0
    

    Example with more groups:

    # let's add a new column
    df['unit'] = ['A']*11+['B']
    
    groups = ['dept', 'unit']
    
    (df.groupby(['patient_id']+groups)
       .apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
                                              pd.NaT)).any(),
              include_groups=False)
       .astype(int).reset_index(name='patient_with_fup')
       .pivot_table(index='patient_with_fup', columns=groups,
                    values='patient_id', aggfunc='count')
       .reindex([0, 1], fill_value=0)
       .unstack().reset_index(name='count')
    )
    

    Output:

              dept unit  patient_with_fup  count
    0    Radiology    A                 0      1
    1    Radiology    A                 1      0
    2  Respiratory    A                 0      2
    3  Respiratory    A                 1      0
    4  Respiratory    B                 0      1
    5  Respiratory    B                 1      0