pythonpandaslifelines

First occurrence of a specific value in a row (prepping for survival analysis) - python


enter image description here

I have the following data (see attached - easier this way). I am trying to find the first occurrence of the value 0 for each customer ID. Then, I plan to use code similar to below to create a Kaplan-Meier curve:

    from lifelines import KaplanMeierFitter

## Example Data 
durations = [5,6,6,2.5,4,4]
event_observed = [1, 0, 0, 1, 1, 1]

## create a kmf object
kmf = KaplanMeierFitter() 

## Fit the data into the model
kmf.fit(durations, event_observed,label='Kaplan Meier Estimate')

## Create an estimate
kmf.plot(ci_show=False) ## ci_show is meant for Confidence interval, since our data set is too tiny, thus i am not showing it.

(this code is from here).

What' the simplest way to do this? Note that I want to ignore the NAs: I have plenty of them and there's no getting around that.

Thanks!


Solution

  • I'm gonna assume that all rows contain at least one non-NaN value.

    One thing we'd have to do first is just ensure that we operate on a dataframe where there is indeed a zero; we can accomplish this with min.

    This will give us a series, and we just have to select on the rows that contain zero:

    df.loc[min_series == 0]
    

    Then, we can use idxmin:

    df.idxmin(1, skipna=True)
    

    This should spit out the period on which the first 0 is encountered (we've guaranteed that all rows contain a 0).

    Then, this should give you what you're looking for!