pythonpandas

Pandas Join Two Series Based on Conditions


I have a dataframe that has information about employee's employment info and I am trying to combine with another dataframe that has their Employee ID #.

df

Name             SSN
Doe, John A      XXXX-XX-1234
Doe, Jane B      XXXX-XX-9876
Test, Example    XXXX-XX-0192

Employee_Info

First_Name    Last_Name            SSN     EmployeeID
      John          Doe    999-45-1234             12
      JANE          DOE    999-45-9876             13
   Example         Test    999-45-0192             14

My desired output is:

Name             SSN          EmployeeID
Doe, John A      XXX-XX-1234          12
Doe, Jane B      XXX-XX-9876          13
Test, Example    XXX-XX-0192          14

The df dataframe actually has the SSN masked except for the last 4 characters. Here is the code I have currently:

df['SSN_Last_4'] = df['SSN4'].str[-4:]
Employee_Info['SSN_Last_4'] = Employee_Info['SSN'].str[-4:]
df2 = pd.merge(df, Employee_Info, on='SSN', how='left')

However because some employees might have the same last 4 digits of SSN, I need to also match based on name. However the caveat is that the Name in df is the employee fullname (which might include middle initial) and the case might be different. My original idea was to split the Name on , and drop middle initial, and then convert all the name columns to be lowercase and modify the join. However I feel that there are better methods to join the data.


Solution

  • Another possible solution:

    pd.merge(
        df2.assign(
            First_Name=df2['First_Name'].str.upper(), 
            Last_Name=df2['Last_Name'].str.upper()), 
        pd.concat([
            df1, 
            df1['Name'].str.replace(r'\s\D$', '', regex=True)
            .str.upper().str.split(', ', expand=True)], axis=1), 
        right_on=[1, 0], left_on=['First_Name', 'Last_Name'], 
        suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]
    

    It first modifies df2 using assign to create uppercase versions of the First_Name and Last_Name columns. Then, it constructs an extended version of df1 using concat, where the Name column is processed with str.replace() to remove any trailing single-character initials and then split into separate first and last names using str.split(expand=True). The merge() function is then applied, aligning the transformed name columns (First_Name and Last_Name) with the corresponding split names from df1, using right_on=[1, 0] and left_on=['First_Name', 'Last_Name']. Finally, the output retains only the columns from df1, along with the EmployeeID column.


    The following updates the solution to contemplate the case mentioned by the OP in a comment below:

    pd.merge(
        df2.assign(
            First_Name=df2['First_Name'].str.upper(), 
            Last_Name=df2['Last_Name'].str.upper(), 
            aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')),
        pd.concat([
            df1.assign(aux=df2['SSN'].str.extract(r'.*\-(\d{4})$')), 
            df1['Name'].str.replace(r'\s\D$', '', regex=True)
            .str.upper().str.split(', ', expand=True)], axis=1), 
        right_on=[1, 0, 'aux'], left_on=['First_Name', 'Last_Name', 'aux'], 
        suffixes=['', '_y'])[df1.columns.to_list() + ['EmployeeID']]
    

    Output:

                Name          SSN  EmployeeID
    0    Doe, John A  999-45-1234          12
    1    Doe, Jane B  999-45-9876          13
    2  Test, Example  999-45-0192          14