pythonpandasdataframeconditional-statements

How to filter rows in a DataFrame based on a column condition


I have a large dataset in a pandas DataFrame that looks like this:

HR SBP DBP SepsisLabel PatientID
92 120 80 0 0
98 115 85 0 0
93 125 75 0 0
95 130 90 0 1
102 120 80 1 1
109 115 75 1 1
94 135 100 0 2
97 100 70 0 2
85 120 80 0 2
88 115 75 0 3
93 125 85 1 3
78 130 90 1 3
115 140 110 0 4
102 120 80 0 4
98 140 110 0 4

I want to create a new DataFrame that includes only the rows where SepsisLabel is 1, and therefore excludes rows with PatientIDs that do not have any SepsisLabel of 1.

For example, PatientIDs 0, 2, and 4 do not have any SepsisLabel of 1, so they should not appear in the new DataFrame. Only PatientIDs 1 and 3, which have at least one row with SepsisLabel of 1, should be included.

How can I achieve this in Python using pandas?

Here is my initial attempt, but I am not sure if it is correct or optimal:

# My initial attempt

filtered_df = df[df['SepsisLabel'] == 1]
result_df = df[df['PatientID'].isin(filtered_df['PatientID'].unique())]

Any guidance or improvements would be greatly appreciated.


Solution

  • Use GroupBy.transform with GroupBy.any for test if at least one True per groups and filtering by boolean indexing:

    df1 = df[df['SepsisLabel'].eq(1).groupby(df['PatientID']).transform('any')]
    

    Or filter all groups with 1 and filter them in Series.isin:

    df1 = df[df['PatientID'].isin(df.loc[df['SepsisLabel'].eq(1), 'PatientID'])]
    

    If small data or performance not important is possible use DataFrameGroupBy.filter:

    df1 = df.groupby('PatientID').filter(lambda x: x['SepsisLabel'].eq(1).any())
    

    print (df1)
         HR  SBP  DBP  SepsisLabel  PatientID
    3    95  130   90            0          1
    4   102  120   80            1          1
    5   109  115   75            1          1
    9    88  115   75            0          3
    10   93  125   85            1          3
    11   78  130   90            1          3