I have a dataframe that has information about employee's employment info and I am trying to combine with another dataframe that has their Employee ID #.
df
Name SSN
Doe, John A XXXX-XX-1234
Doe, Jane B XXXX-XX-9876
Test, Example XXXX-XX-0192
Employee_Info
First_Name Last_Name SSN EmployeeID
John Doe 999-45-1234 12
JANE DOE 999-45-9876 13
Example Test 999-45-0192 14
My desired output is:
Name SSN EmployeeID
Doe, John A XXX-XX-1234 12
Doe, Jane B XXX-XX-9876 13
Test, Example XXX-XX-0192 14
The df
dataframe actually has the SSN masked except for the last 4 characters. Here is the code I have currently:
df['SSN_Last_4'] = df['SSN4'].str[-4:]
Employee_Info['SSN_Last_4'] = Employee_Info['SSN'].str[-4:]
df2 = pd.merge(df, Employee_Info, on='SSN', how='left')
However because some employees might have the same last 4 digits of SSN, I need to also match based on name. However the caveat is that the Name
in df
is the employee fullname (which might include middle initial) and the case might be different. My original idea was to split the Name on ,
and drop middle initial, and then convert all the name columns to be lowercase and modify the join. However I feel that there are better methods to join the data.
Another possible solution:
pd.merge(
df2.assign(
First_Name=df2['First_Name'].str.upper(),
Last_Name=df2['Last_Name'].str.upper()),
pd.concat([
df1,
df1['Name'].str.replace(r'\s\D$', '', regex=True)
.str.upper().str.split(', ', expand=True)], axis=1),
right_on=[1, 0], left_on=['First_Name', 'Last_Name'],
suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]
It first modifies df2
using assign
to create uppercase versions of the First_Name
and Last_Name
columns. Then, it constructs an extended version of df1
using concat
, where the Name
column is processed with str.replace()
to remove any trailing single-character initials and then split into separate first and last names using str.split(expand=True)
. The merge()
function is then applied, aligning the transformed name columns (First_Name
and Last_Name
) with the corresponding split names from df1
, using right_on=[1, 0]
and left_on=['First_Name', 'Last_Name']
. Finally, the output retains only the columns from df1
, along with the EmployeeID
column.
The following updates the solution to contemplate the case mentioned by the OP in a comment below:
pd.merge(
df2.assign(
First_Name=df2['First_Name'].str.upper(),
Last_Name=df2['Last_Name'].str.upper(),
aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')),
pd.concat([
df1.assign(aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')),
df1['Name'].str.replace(r'\s\D$', '', regex=True)
.str.upper().str.split(', ', expand=True)], axis=1),
right_on=[1, 0, 'aux'], left_on=['First_Name', 'Last_Name', 'aux'],
suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]
Output:
Name SSN EmployeeID
0 Doe, John A 999-45-1234 12
1 Doe, Jane B 999-45-9876 13
2 Test, Example 999-45-0192 14