I have a large dataset in a pandas DataFrame that looks like this:
| HR  | SBP | DBP | SepsisLabel | PatientID |
|-----|-----|-----|-------------|-----------|
| 92  | 120 | 80  | 0           | 0         |
| 98  | 115 | 85  | 0           | 0         |
| 93  | 125 | 75  | 0           | 0         |
| 95  | 130 | 90  | 0           | 1         |
| 102 | 120 | 80  | 1           | 1         |
| 109 | 115 | 75  | 1           | 1         |
| 94  | 135 | 100 | 0           | 2         |
| 97  | 100 | 70  | 0           | 2         |
| 85  | 120 | 80  | 0           | 2         |
| 88  | 115 | 75  | 0           | 3         |
| 93  | 125 | 85  | 1           | 3         |
| 78  | 130 | 90  | 1           | 3         |
| 115 | 140 | 110 | 0           | 4         |
| 102 | 120 | 80  | 0           | 4         |
| 98  | 140 | 110 | 0           | 4         |
I want to create a new DataFrame that keeps every row belonging to a PatientID with at least one SepsisLabel of 1, and excludes all rows for PatientIDs that never have a SepsisLabel of 1.

For example, PatientIDs 0, 2, and 4 never have a SepsisLabel of 1, so none of their rows should appear in the new DataFrame. Only PatientIDs 1 and 3, which each have at least one row with SepsisLabel of 1, should be included (keeping their SepsisLabel 0 rows as well).
How can I achieve this in Python using pandas?
Here is my initial attempt, but I am not sure whether it is correct or optimal:

```python
# My initial attempt
filtered_df = df[df['SepsisLabel'] == 1]
result_df = df[df['PatientID'].isin(filtered_df['PatientID'].unique())]
```
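For reference, here is a self-contained version of the data and my attempt (the values are transcribed by hand from the table above), which does seem to keep only patients 1 and 3:

```python
import pandas as pd

# Sample data, transcribed from the table in the question
df = pd.DataFrame({
    'HR': [92, 98, 93, 95, 102, 109, 94, 97, 85, 88, 93, 78, 115, 102, 98],
    'SBP': [120, 115, 125, 130, 120, 115, 135, 100, 120, 115, 125, 130, 140, 120, 140],
    'DBP': [80, 85, 75, 90, 80, 75, 100, 70, 80, 75, 85, 90, 110, 80, 110],
    'SepsisLabel': [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    'PatientID': [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# My initial attempt: find the label-1 rows, then keep all rows for those patients
filtered_df = df[df['SepsisLabel'] == 1]
result_df = df[df['PatientID'].isin(filtered_df['PatientID'].unique())]
print(result_df['PatientID'].unique().tolist())  # [1, 3]
```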
Any guidance or improvements would be greatly appreciated.
Use GroupBy.transform with GroupBy.any to test whether at least one value per group is True, then filter with boolean indexing:

```python
df1 = df[df['SepsisLabel'].eq(1).groupby(df['PatientID']).transform('any')]
```
Or select the PatientID values of the rows where SepsisLabel is 1, and keep those patients with Series.isin:

```python
df1 = df[df['PatientID'].isin(df.loc[df['SepsisLabel'].eq(1), 'PatientID'])]
```
For small data, or if performance is not important, it is possible to use DataFrameGroupBy.filter:

```python
df1 = df.groupby('PatientID').filter(lambda x: x['SepsisLabel'].eq(1).any())
```
```python
print(df1)
     HR  SBP  DBP  SepsisLabel  PatientID
3    95  130   90            0          1
4   102  120   80            1          1
5   109  115   75            1          1
9    88  115   75            0          3
10   93  125   85            1          3
11   78  130   90            1          3
```