pythonpandasdataframealgorithmpanel-data

Find time difference of first instance of a value and first instance of another value in dataframe


I have an already sorted dataframe of panel data below, where a certain diagnosisCode is mapped to a positive or false value in hasDisease. I have 2 arrays which indicate whether a code is positive or negative

positiveDiagnosisCodes = [A] # corresponds to hasDisease = 1
negativeDiagnosisCodes = [B, C, D] # corresponds to hasDisease = 0
ID Date diagnosisCode hasDisease
2 01/01/2020 A 1
2 02/01/2020 A 1
2 03/01/2020 B 0
2 04/01/2020 B 0
2 07/01/2020 A 1
3 02/01/2020 B 0
3 03/01/2020 A 1
3 07/01/2020 A 1
3 11/01/2020 B 0
3 15/01/2020 A 1
3 18/01/2020 A 1
3 19/01/2020 A 1
3 20/01/2020 B 0
3 21/01/2020 C 0
3 22/01/2020 A 1
3 25/01/2020 A 1
3 26/01/2020 D 0
3 28/01/2020 D 0

I am interested in finding the time it takes for a patient show signs of no disease. Specifically, I want to find the time difference between the

  1. the earliest timestamp where hasDisease = 1 and
  2. *the earliest timestamp where hasDisease = 0

*However, this is only valid for negativeDiagnosisCodes not previously recorded.


Hence in the example above, the time difference (in days for simplicity, but actual data will be in hours) of ID 2 is 2 days, since the earliest timestamp where Disease = 1 is 01/01/2020 and the earliest timestamp where Disease = 0 is 03/01/2020.

For ID 3, there are 2 results - 18 days and 23 days. We take the time difference of 21/01/2020 and 03/01/2020, where the patient is first diagnosed with code C. We also take the time difference of 26/01/2020 and 03/01/2020, where the patient is first diagnosed with code D. We ignore the entries on 11/01/2020 and 20/01/2020 since the patient was already diagnosed with code B on 02/01/2020.


I hope to get an output like this:

ID Time Difference diagnosisCode
2 2 days B
3 18 days C
3 23 days D

How can I achieve this efficiently? I have tried using pandas groupby() but I am stuck on the aggregation part. Do I need to use more complex data structures like graphs?


Solution

  • First create dataframe using group by to find earliest date for each ID and DiagnosisCode.

    import pandas as pd
    import numpy as np
    from datetime import timedelta 
    df = df.groupby(['ID', 'diagnosisCode'])[['Date']].min().unstack(1).reset_index()
    df = df.droplevel(0, axis = 1)
    df.rename(columns={ '': "ID" }, inplace = True)
    

    After that, calculate difference date for each diagnosiscode and put nan value if difference is negative value.

    df['Diff B'] = df.apply(lambda x: x['B'] - x['A'] if x['B'] - x['A'] > timedelta(0) else np.nan, axis=1)
    df['Diff C'] = df.apply(lambda x: x['C'] - x['A'] if x['C'] - x['A'] > timedelta(0) else np.nan, axis=1)
    df['Diff D'] = df.apply(lambda x: x['D'] - x['A'] if x['D'] - x['A'] > timedelta(0) else np.nan, axis=1)
    

    alternatives

    mask_B = df["B"] - df["A"] > timedelta(0) 
    df.loc[mask_B ,'Diff B'] = df['B'] - df2['A'] 
    mask_C = df["C"] - df["A"] > timedelta(0) 
    df.loc[mask_C ,'Diff C'] = df['C'] - df['A'] 
    mask_D = df["D"] - df["A"] > timedelta(0) 
    df.loc[mask_D ,'Diff D'] = df['D'] - df['A']
    

    Then, melt table into desired output.

    result = pd.melt(df, id_vars='ID', value_vars=['Diff B', 'Diff C', 'Diff D']).dropna()
    

    Thank you