Consider this df:
data = [{'name': 'Muhammad', 'age_x': 20, 'city_x': 'Karachi', 'age_y': 20, 'city_y': 'Karachi'},
{'name': 'Ali', 'age_x': 19, 'city_x': 'Lahore', 'age_y': 30, 'city_y': None},
{'name': 'Ahmed', 'age_x': 25, 'city_x': 'Islamabad', 'age_y': None, 'city_y': 'Islamabad'}]
df = pd.DataFrame(data)
cols = {'age_x':'age_y', 'city_x':'city_y'}
The idea is to compare multiple column pairs (age_x and age_y, city_x and city_y in this example) and return the column name(s) if the value is different. There are many columns to check that so would be good to use the cols dictionary in the solution. The column names to return are the ones with '_y'. So the expected result is:
name age_x city_x age_y city_y diff
0 Muhammad 20 Karachi 20.0 Karachi None
1 Ali 19 Lahore 30.0 None age_y, city_y
2 Ahmed 25 Islamabad NaN Islamabad age_y
Use apply
on axis=1
and a list comprehension derived from cols
:
df["diff"] = df.apply(
lambda row: ", ".join([y for x, y in cols.items() if row[x] != row[y]]) or None,
axis=1,
)
name age_x city_x age_y city_y diff
0 Muhammad 20 Karachi 20.0 Karachi None
1 Ali 19 Lahore 30.0 None age_y, city_y
2 Ahmed 25 Islamabad NaN Islamabad age_y
You can also get cols
from df
instead of typing it manually:
df = df.sort_index(axis=1) # sort columns if not already sorted
cols = {
x: y
for x, y in zip(df.filter(regex=r"_x$").columns, df.filter(regex=r"_y$").columns)
}
{'age_x': 'age_y', 'city_x': 'city_y'}