I have a dataframe where I want to return the number (proportion) of patinets that have had a subsequent follow up after diagnosis of disease.
Original DF (1 Patient example)
| patient_id | app_date | diag_date | cancer_yn |
|------------|------------|------------|-----------|
| 1 | 2024-01-11 | NaT | NaN |
| 1 | 2024-03-14 | 2024-03-14 | 1 |
| 1 | 2024-04-09 | NaT | NaN |
| 1 | 2024-09-09 | NaT | NaN |
Intermediate DF (Indicates appointment record per patient was a follow up of diagnosis date or not)
| patient_id | app_date | diag_date | cancer_yn | fup_yn |
|------------|------------|------------|-----------|--------|
| 1 | 2024-01-11 | NaT | NaN | 0 |
| 1 | 2024-03-14 | 2024-03-14 | 1 | 0 |
| 1 | 2024-04-09 | NaT | NaN | 1 |
| 1 | 2024-09-09 | NaT | NaN | 1 |
Summarised DF (Collapsed, by groupby on patient_id
and value_counts()
or something similiar applied
| patient_with_fup | count |
|------------------|-------|
| 1 | 24 |
| 0 | 67 |
Original DF (2nd Example) -
Hoping to implement what solution above had done at patient level but for departments, say a patient can have diag_yn
across multiple departments but I wnat to check has the patient had follow up like before but for each department.
So, Patient 1 would be recorded twice in example, 1 for Radiology and again for Respiratory as it had a follow up appt for both departments.
dept | patient_id | app_date | diag_date | diag_yn |
|------------|------------|------------|------------|-----------|
Radiology | 1 | 2024-01-11 | NaT | NaN |
Radiology | 1 | 2024-03-14 | 2024-03-14 | 1 |
Radiology | 1 | 2024-04-09 | NaT | NaN |
Radiology | 1 | 2024-09-09 | NaT | NaN |
Respiratory | 1 | 2024-02-11 | NaT | NaN |
Respiratory | 1 | 2024-04-14 | 2024-04-14 | 1 |
Respiratory | 1 | 2024-06-09 | NaT | NaN |
Respiratory | 1 | 2024-09-09 | NaT | NaN |
Respiratory | 2 | 2024-01-11 | NaT | NaN |
Respiratory | 2 | 2024-03-14 | 2024-03-14 | 1 |
Respiratory | 2 | 2024-04-09 | NaT | NaN |
Respiratory | 2 | 2024-09-09 | NaT | NaN |
Output (2nd Example)
dept | patient_with_fup | count |
Radiology |------------------|-------|
| 1 | 1 |
| 0 | 0 |
Respiratory |------------------|-------|
| 1 | 2 |
| 0 | 0 |
You can see the 2nd record indicates the appointment where diagnosis was made (diag_date
is available and the same as app_date
), this patient has had subsequent appointments, I want to flag that this is the case (say follow_ups == 1
.
I'm finding it hard to understand how I can groupby different patients and apply value_counts() on a flag indicating a patient has had follow ups after a diagnosis appointment. Suggestions around based way to reshape the data and generate the flag would be great.
Assuming you have a typo in your data and that 2022-03-14
is 2024-03-14
, you can identify the subsequent appointment with groupby.transform
:
# ensure datetime
df[['app_date', 'diag_date']] = df[['app_date', 'diag_date']
].apply(pd.to_datetime)
df['fup_yn'] = (df.groupby('patient_id')['diag_date']
.transform('first').lt(df['app_date'])
.astype(int)
)
Output:
patient_id app_date diag_date cancer_yn fup_yn
0 1 2024-01-11 NaT NaN 0
1 1 2024-03-14 2024-03-14 1.0 0
2 1 2024-04-09 NaT NaN 1
3 1 2024-09-09 NaT NaN 1
For the final output, you don't really need this intermediate, you could use directly:
(df.groupby('patient_id')
.apply(lambda g: g['app_date'].gt(next(iter(g['diag_date'].dropna()),
pd.NaT)).any(),
include_groups=False)
.astype(int).value_counts()
.reindex([0, 1], fill_value=0).rename_axis('fup_yn')
.reset_index()
)
Output:
fup_yn count
0 0 0
1 1 1
Here is a more complete example for clarity:
# part 1
patient_id app_date diag_date cancer_yn fup_yn
0 1 2024-01-11 NaT NaN 0
1 1 2024-03-14 2024-03-14 1.0 0
2 1 2024-04-09 NaT NaN 1
3 1 2024-09-09 NaT NaN 1
4 2 2023-01-11 NaT NaN 0
5 2 2023-03-14 2023-03-14 1.0 0
6 3 2022-04-09 2022-05-14 NaN 0
7 3 2022-09-09 NaT NaN 1
# part 2
fup_yn count
0 0 1
1 1 2