pandasconcatenationtransformation

Concatenate values under into a new column


Consider this df:

enter image description here

    data = {'ID': [1071.0, 1072.0, nan, 1074.0, 1076.0, nan, nan, nan, 1077.0], 
'Name Type': ['Primary Name', 'Primary Name', 'Also Known As', 'Primary Name', 'Primary Name', 'Low Quality AKA', 'Low Quality AKA', 'Low Quality AKA', 'Primary Name'], 
'Surname': ['Brown', 'Red', 'R', 'Green', 'Purple', 'Pipi', 'Poopa', 'Peep', 'Orange']}

There are many more columns that have info in the rows with Primary Name but empty with akas. I need to concatenate values under each Primary Name - Surname if they are Low Quality AKA or Also Known as and achieve this dataframe:

enter image description here


Solution

  • This is not how I would recommend setting up the data. I would probably have two datasets with A matching ID and then Left Join the table. That being said, here is the code for your situation.

    import pandas as pd
    import numpy as np
    
    data = {'ID': [1071.0, 1072.0, np.nan, 1074.0, 1076.0, np.nan, np.nan, np.nan, 1077.0], 
            'Name Type': ['Primary Name', 'Primary Name', 'Also Known As', 'Primary Name', 'Primary Name', 'Low Quality AKA', 'Low Quality AKA', 'Low Quality AKA', 'Primary Name'], 
            'Surname': ['Brown', 'Red', 'R', 'Green', 'Purple', 'Pipi', 'Poopa', 'Peep', 'Orange']}
    
    df = pd.DataFrame(data)
    
    # Forward fill the IDs and keep only rows with 'Primary Name'
    df['ID'] = df['ID'].ffill()
    primary_names = df[df['Name Type'] == 'Primary Name'].copy()
    
    # Group by ID and aggregate the surnames for AKA names
    akas = df[df['Name Type'].str.contains('AKA')].groupby('ID')['Surname'].agg(lambda x: ';'.join(x)).reset_index()
    akas.rename(columns={'Surname': 'AKAs'}, inplace=True)
    
    # Combine data
    result = pd.merge(primary_names, akas, on='ID', how='left')
    
    # Put empty string in AKAs col
    result['AKAs'] = result['AKAs'].fillna('')
    
    # Select and reorder columns to match the desired output
    result = result[['ID', 'Name Type', 'Surname', 'AKAs']]
    
    print(result)