What I have:
df
Name |Vehicle
Dave |Car
Mark |Bike
Steve|Car
Dave |
Steve|
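If it helps, here is a minimal sketch of how the frame above can be built (I am assuming the blank Vehicle cells are real NaN values, not empty strings):

import pandas as pd
import numpy as np

# sample data; the missing Vehicle entries are NaN
df = pd.DataFrame({'Name': ['Dave', 'Mark', 'Steve', 'Dave', 'Steve'],
                   'Vehicle': ['Car', 'Bike', 'Car', np.nan, np.nan]})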
I want to drop duplicates from the Name column, but only if the corresponding value in the Vehicle column is null. I know I can use
df.drop_duplicates(subset=['Name'])
with either keep='first' or keep='last', but what I am looking for is a way to drop duplicates from the Name column only where the corresponding value in the Vehicle column is null. So basically, keep the Name if the Vehicle column is NOT null and drop the rest. If a name does not have a duplicate, then keep that row even if the corresponding value in Vehicle is null.
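For example, keep='last' keeps the wrong rows here, because the choice depends on row order rather than on whether Vehicle is null (a quick sketch on the data above, assuming the blanks are NaN):

print (df.drop_duplicates(subset=['Name'], keep='last'))
    Name  Vehicle
1   Mark     Bike
3   Dave      NaN
4  Steve      NaN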
Many Thanks
I think you need to chain 2 masks with bitwise OR (|), using Series.notna and Series.duplicated with keep=False, so that every occurrence of a repeated name is marked as a duplicate:
m1 = df['Vehicle'].notna()                 # rows where Vehicle is not null
m2 = ~df['Name'].duplicated(keep=False)    # rows whose Name occurs only once
df1 = df[m1 | m2]
print (df1)
    Name  Vehicle
0   Dave      Car
1   Mark     Bike
2  Steve      Car
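To see what the two masks do on this data (a quick check; duplicated(keep=False) marks every occurrence of a repeated name, so the ~ leaves True only for names that appear exactly once):

print (m1.tolist())
[True, True, True, False, False]
print (m2.tolist())
[False, True, False, False, False]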
If you want these operations done separately - first remove all rows with NaN in Vehicle and then remove the duplicates, which also avoids testing for duplicates among the NaN rows. Note that, unlike the mask solution, this also drops a name that appears only once but has NaN in Vehicle, so use it only if that is acceptable:
df2 = df.dropna(subset=['Vehicle']).drop_duplicates('Name')
print (df2)
    Name  Vehicle
0   Dave      Car
1   Mark     Bike
2  Steve      Car
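The two solutions differ only for a name that occurs once and has a null Vehicle - the mask version keeps such a row, while the dropna version removes it. A quick sketch with a hypothetical extra row Anna / NaN:

df.loc[5] = ['Anna', np.nan]               # hypothetical single name with no Vehicle

m1 = df['Vehicle'].notna()
m2 = ~df['Name'].duplicated(keep=False)
print (df[m1 | m2])
    Name  Vehicle
0   Dave      Car
1   Mark     Bike
2  Steve      Car
5   Anna      NaN

print (df.dropna(subset=['Vehicle']).drop_duplicates('Name'))
    Name  Vehicle
0   Dave      Car
1   Mark     Bike
2  Steve      Car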