Left Dataframe MASTER =
Fname | Lname | Amount |
---|---|---|
John | Smith-Richards | |
David | O'Brien |
Right Dataframe ORDERS =
Fname | Lname | Amount |
---|---|---|
DAVID | OBRIEN | 36 |
john | smith richards | 11 |
Expected output Dataframe =
Fname | Lname | Amount |
---|---|---|
John | Smith-Richards | 11 |
David | O'Brien | 36 |
The tables MASTER and ORDERS are LEFT-OUTER joined with KEY = [Fname, Lname]
Both tables have the same columns but the rows are not in the same order. I want the join operation to ignore case and ignore characters like dashes, spaces and apostrophes (e.g. the tables would be joined on the values "johnsmithrichards" and "davidobrien")
(In practice, the actual tables have many more columns and the resulting output table must pull some values from the MASTER table and others from the ORDERS table, which is why I used mdfcols
and odfcols
respectively to only include those specific columns.)
My attempt at removing just spaces doesn't work:
key = ["Fname", "Lname"]
mdfkey = (mdf["First Name"].str.lower() + mdf["Last Name"].str.lower()).replace(' ', '')
mdfcols = ["Fname", "Lname"]
odfkey = (odf["First Name"].str.lower() + odf["Last Name"].str.lower()).replace(' ', '')
odfCols = ["Amount"]
outputdf = pd.merge(
mdf.loc[:, mdfcols],
odf.loc[:, odfCols],
how='left', left_on=mdfkey, right_on=odfkey)
Use Series.str.replace
for remove non alpha numeric values for both DataFrames, last remove helper column key_0
:
key = ["Fname", "Lname"]
mdfkey = (mdf["Fname"] + mdf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True).str.lower()
mdfcols = ["Fname", "Lname"]
odfkey = (odf["Fname"] + odf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True).str.lower()
odfCols = ["Amount"]
outputdf = pd.merge(
mdf.loc[:, mdfcols],
odf.loc[:, odfCols],
how='left', left_on=mdfkey, right_on=odfkey).drop('key_0', axis=1)
print (outputdf)
Fname Lname Amount
0 John Smith-Richards 11
1 David O'Brien 36
Solution with helper columns:
merging = ['merge']
key = ["Fname", "Lname"]
mdf['merge'] = ((mdf["Fname"] + mdf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True)
.str.lower())
mdfcols = ["Fname", "Lname"]
odf['merge'] = ((odf["Fname"] + odf["Lname"]).str.replace('[^a-zA-Z]', '', regex=True)
.str.lower())
odfCols = ["Amount"]
outputdf = pd.merge(
mdf.loc[:, mdfcols + merging],
odf.loc[:, odfCols + merging],
how='left', on=merging).drop(merging, axis=1)
print (outputdf)
Fname Lname Amount
0 John Smith-Richards 11
1 David O'Brien 36