I have an already sorted dataframe of panel data below, where a certain diagnosisCode
is mapped to a positive or false value in hasDisease
. I have 2 arrays which indicate whether a code is positive or negative
positiveDiagnosisCodes = [A] # corresponds to hasDisease = 1
negativeDiagnosisCodes = [B, C, D] # corresponds to hasDisease = 0
ID | Date | diagnosisCode | hasDisease |
---|---|---|---|
2 | 01/01/2020 | A | 1 |
2 | 02/01/2020 | A | 1 |
2 | 03/01/2020 | B | 0 |
2 | 04/01/2020 | B | 0 |
2 | 07/01/2020 | A | 1 |
3 | 02/01/2020 | B | 0 |
3 | 03/01/2020 | A | 1 |
3 | 07/01/2020 | A | 1 |
3 | 11/01/2020 | B | 0 |
3 | 15/01/2020 | A | 1 |
3 | 18/01/2020 | A | 1 |
3 | 19/01/2020 | A | 1 |
3 | 20/01/2020 | B | 0 |
3 | 21/01/2020 | C | 0 |
3 | 22/01/2020 | A | 1 |
3 | 25/01/2020 | A | 1 |
3 | 26/01/2020 | D | 0 |
3 | 28/01/2020 | D | 0 |
I am interested in finding the time it takes for a patient show signs of no disease. Specifically, I want to find the time difference between the
hasDisease
= 1 andhasDisease
= 0*However, this is only valid for negativeDiagnosisCodes
not previously recorded.
Hence in the example above, the time difference (in days for simplicity, but actual data will be in hours) of ID
2 is 2 days, since the earliest timestamp where Disease
= 1 is 01/01/2020
and the earliest timestamp where Disease
= 0 is 03/01/2020
.
For ID
3, there are 2 results - 18 days and 23 days. We take the time difference of 21/01/2020
and 03/01/2020
, where the patient is first diagnosed with code C
. We also take the time difference of 26/01/2020
and 03/01/2020
, where the patient is first diagnosed with code D
. We ignore the entries on 11/01/2020
and 20/01/2020
since the patient was already diagnosed with code B
on 02/01/2020
.
I hope to get an output like this:
ID | Time Difference | diagnosisCode |
---|---|---|
2 | 2 days | B |
3 | 18 days | C |
3 | 23 days | D |
How can I achieve this efficiently? I have tried using pandas groupby() but I am stuck on the aggregation part. Do I need to use more complex data structures like graphs?
First create dataframe using group by to find earliest date for each ID and DiagnosisCode.
import pandas as pd
import numpy as np
from datetime import timedelta
df = df.groupby(['ID', 'diagnosisCode'])[['Date']].min().unstack(1).reset_index()
df = df.droplevel(0, axis = 1)
df.rename(columns={ '': "ID" }, inplace = True)
After that, calculate difference date for each diagnosiscode and put nan value if difference is negative value.
df['Diff B'] = df.apply(lambda x: x['B'] - x['A'] if x['B'] - x['A'] > timedelta(0) else np.nan, axis=1)
df['Diff C'] = df.apply(lambda x: x['C'] - x['A'] if x['C'] - x['A'] > timedelta(0) else np.nan, axis=1)
df['Diff D'] = df.apply(lambda x: x['D'] - x['A'] if x['D'] - x['A'] > timedelta(0) else np.nan, axis=1)
alternatives
mask_B = df["B"] - df["A"] > timedelta(0)
df.loc[mask_B ,'Diff B'] = df['B'] - df2['A']
mask_C = df["C"] - df["A"] > timedelta(0)
df.loc[mask_C ,'Diff C'] = df['C'] - df['A']
mask_D = df["D"] - df["A"] > timedelta(0)
df.loc[mask_D ,'Diff D'] = df['D'] - df['A']
Then, melt table into desired output.
result = pd.melt(df, id_vars='ID', value_vars=['Diff B', 'Diff C', 'Diff D']).dropna()
Thank you