I have a dataframe that looks like this
import pandas as pd
all_df = pd.DataFrame(
[
('A','Area1','AA','A B D E','A B','D E'),
('B','Area1','AA','A B D E','A B','D E'),
('C','Area2','BB','C','C','C'),
('E','Area1','CC','A B D E','A B','D E'),
('F','Area3','BB','F G','G','F')
],
columns = ['Name','Area','Type','Group','AA members','CC members']
)
Name Area Type Group AA members CC members
0 A Area1 AA A B D E A B D E
1 B Area1 AA A B D E A B D E
2 C Area2 BB C C C
3 E Area1 CC A B D E A B D E
4 F Area3 BB F G G F
The last row (row 4) is incorrect. Anything that is type BB should only have itself (F) in Group
, AA members
, CC members
. In other words, it should look like this:
4 F Area3 BB F F F
To do this I tried:
Check when Type is BB
and length of Group
is 2:
all_df.loc[all_df['Type']== 'BB']['Group'].str.split().str.len() == 2
Iterate over every row and to find the cases like this
Make a new df with all these rows and make the Group, AA members, CC members = Name
Drop the rows where that happens in all_df
.
Merge 3 back in to all_df
.
Is there a better pandas way to do this?
We can use boolean indexing to replace values here. The key here is that we don't really need to check if the Group value is correct or not; we can simply assign the Name to the latter three columns if Type is BB.
# identify rows where Type is BB
m = all_df['Type'] == 'BB'
# for Type BB rows, replace Group, AA members and CC members values by Name
all_df.loc[m, ['Group', 'AA members', 'CC members']] = all_df.loc[m, 'Name']
The above filters the rows where Type is BB and replaces the values in the relevant columns. Another way is to use mask
to replace values in the relevant columns.
cols = ['Group', 'AA members', 'CC members']
all_df[cols] = all_df[cols].mask(all_df['Type'] == 'BB', all_df['Name'], axis=0)
Both return the desired output:
Name Area Type Group AA members CC members
0 A Area1 AA A B D E A B D E
1 B Area1 AA A B D E A B D E
2 C Area2 BB C C C
3 E Area1 CC A B D E A B D E
4 F Area3 BB F F F