python pandas dataframe algorithm panel-data

Find time difference of first instance of a value and first instance of another value in dataframe

I have an already sorted dataframe of panel data below, where a certain diagnosisCode is mapped to a positive or false value in hasDisease. I have 2 arrays which indicate whether a code is positive or negative

positiveDiagnosisCodes = [A] # corresponds to hasDisease = 1
negativeDiagnosisCodes = [B, C, D] # corresponds to hasDisease = 0

ID	Date	diagnosisCode	hasDisease
2	01/01/2020	A	1
2	02/01/2020	A	1
2	03/01/2020	B	0
2	04/01/2020	B	0
2	07/01/2020	A	1
3	02/01/2020	B	0
3	03/01/2020	A	1
3	07/01/2020	A	1
3	11/01/2020	B	0
3	15/01/2020	A	1
3	18/01/2020	A	1
3	19/01/2020	A	1
3	20/01/2020	B	0
3	21/01/2020	C	0
3	22/01/2020	A	1
3	25/01/2020	A	1
3	26/01/2020	D	0
3	28/01/2020	D	0

I am interested in finding the time it takes for a patient show signs of no disease. Specifically, I want to find the time difference between the

the earliest timestamp where hasDisease = 1 and
*the earliest timestamp where hasDisease = 0

*However, this is only valid for negativeDiagnosisCodes not previously recorded.

Hence in the example above, the time difference (in days for simplicity, but actual data will be in hours) of ID 2 is 2 days, since the earliest timestamp where Disease = 1 is 01/01/2020 and the earliest timestamp where Disease = 0 is 03/01/2020.

For ID 3, there are 2 results - 18 days and 23 days. We take the time difference of 21/01/2020 and 03/01/2020, where the patient is first diagnosed with code C. We also take the time difference of 26/01/2020 and 03/01/2020, where the patient is first diagnosed with code D. We ignore the entries on 11/01/2020 and 20/01/2020 since the patient was already diagnosed with code B on 02/01/2020.

I hope to get an output like this:

ID	Time Difference	diagnosisCode
2	2 days	B
3	18 days	C
3	23 days	D

How can I achieve this efficiently? I have tried using pandas groupby() but I am stuck on the aggregation part. Do I need to use more complex data structures like graphs?

Solution

First create dataframe using group by to find earliest date for each ID and DiagnosisCode.

import pandas as pd
import numpy as np
from datetime import timedelta 
df = df.groupby(['ID', 'diagnosisCode'])[['Date']].min().unstack(1).reset_index()
df = df.droplevel(0, axis = 1)
df.rename(columns={ '': "ID" }, inplace = True)

After that, calculate difference date for each diagnosiscode and put nan value if difference is negative value.

df['Diff B'] = df.apply(lambda x: x['B'] - x['A'] if x['B'] - x['A'] > timedelta(0) else np.nan, axis=1)
df['Diff C'] = df.apply(lambda x: x['C'] - x['A'] if x['C'] - x['A'] > timedelta(0) else np.nan, axis=1)
df['Diff D'] = df.apply(lambda x: x['D'] - x['A'] if x['D'] - x['A'] > timedelta(0) else np.nan, axis=1)

alternatives

mask_B = df["B"] - df["A"] > timedelta(0) 
df.loc[mask_B ,'Diff B'] = df['B'] - df2['A'] 
mask_C = df["C"] - df["A"] > timedelta(0) 
df.loc[mask_C ,'Diff C'] = df['C'] - df['A'] 
mask_D = df["D"] - df["A"] > timedelta(0) 
df.loc[mask_D ,'Diff D'] = df['D'] - df['A']

Then, melt table into desired output.

result = pd.melt(df, id_vars='ID', value_vars=['Diff B', 'Diff C', 'Diff D']).dropna()

Thank you