I'm writing a Pandas script to perform data manipulation on an excel file. First, I load two sheets into dataframes. One is the original data df
, the second is a sheet detailing replacements that need to be made in the original data replace
.
The script needs to do two things for each row of df
.
Replace each instance of 'Name'
in df
with 'NameReplace'
(working)
For the same rows in df
, replace a slice of the columns (specified by a list) with the values in the same slice of columns in replace
Reproducible Minimal Example of my current implementation:
import pandas
df = pandas.DataFrame([["John", None, None],["Phil", None, None],["John", None, None],["Bob", None, None]], columns=["Name", "Age", "Height"])
replace = pandas.DataFrame([["John", "Dom", 25, 175],["Phil", "Kevin", 56, 145],["Bob", "Michael", 33, 180]], columns=["Name", "NameReplace", "Age", "Height"])
detailsList = ["Age", "Height"]
for i, row in replace.iterrows():
df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList]
print(df)
Step 1) is working with this implementation, but the detailsList
columns in df
do not get populated.
The current output is
Name Age Height
0 Dom NaN NaN
1 Kevin NaN NaN
2 Dom NaN NaN
3 Michael NaN NaN
The desired output is
Name Age Height
0 Dom 25 175
1 Kevin 56 145
2 Dom 25 175
3 Michael 33 180
I've been trying for a while now, and cannot seem to make progress. I also don't really get why this doesn't work, so any insight there would be extra appreciated!
Note: Using detailsList
to specify the slice of columns is necessary, as in the real solution I am only operating on a specific slice of the full dataframe, unlike the example I've given.
The problem is the way that pandas tries to assign a series to a whole dataframe. Anyway, here's a simple fix that leads to the intended behavior, taking advantage that pandas does the correct thing when you assign with a numpy array rather than with a series.
for i, row in replace.iterrows():
df.loc[df['Name'] == row['Name'], 'Name'] = row['NameReplace']
df.loc[df['Name'] == row['NameReplace'], detailsList] = row[detailsList].values
Other optimizations:
df['Name'] == row['Name']
mask. In particular, you save some work withfor i, row in replace.iterrows():
mask = df['Name'] == row['Name']
df.loc[mask, 'Name'] = row['NameReplace']
df.loc[mask, detailsList] = row[detailsList].values
df = (df[['Name']].merge(replace, on = 'Name')
.drop(columns='Name')
.rename(columns={'NameReplace':'Name'}))
The catch with this approach is that the rows might end up reordered.
df = (replace.set_index('Name')
.loc[df['Name'].values]
.reset_index(drop=True)
.rename(columns={'NameReplace': 'Name'}))