Can I get Python to compare a list of nicknames with a list of full names?


So first off I have a character data frame that has a column called name and contains the full name for 100+ people.

E.g. Name: Johnathan Jay Smith, Harold Robert Doe, Katie Holt.

Then I have a list of unique nicknames, e.g. [Mr. Doe, Aunt Katie, John].

It's important to note that they are not in the same order, and that not everyone with a nickname is in the full name list, and not everyone in the full name list is in the nickname list. I will be removing rows that don't have matching values at the end.

My question: is there a way I can get Python to read through these two lists item by item and match John with Johnathan Jay Smith, and so on for everyone that has a match? Basically, if the nickname appears as part of the full name, can I add a nickname column to my existing character data frame without doing this manually for over 100 people?

Thank you in advance, I don't even know where to start with this one!


Solution

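  • This is very straightforward and does not take spelling variants into account: a nickname matches a full name when any word of the nickname appears as a substring of any word of the name.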

    from itertools import product
    
    names = ['Johnathan Jay Smith', 'Harold Robert Doe', 'Katie Holt']
    nicknames = ["Mr. Doe", "Aunt Katie", "John"]
    
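    def match_nicknames(names, nicknames):
        # Split every full name and every nickname into its space-separated words.
        split_names = [n.split(' ') for n in names]
        split_nn = [n.split(' ') for n in nicknames]
        matches = []
        for name in split_names:
            # Pair each word of the full name with every (split) nickname.
            name_pairs = product(name, split_nn)
            # Keep a pair if any nickname word is a substring of the name word.
            matched = filter(lambda x: any(nn in x[0] for nn in x[1]), name_pairs)
            # filter() is lazy in Python 3 and always truthy, so no emptiness check is needed.
            matches += [(" ".join(name), " ".join(nn)) for _, nn in matched]
        return matches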
    
    match_nicknames(names, nicknames)
    >> [('Johnathan Jay Smith', 'John'),
        ('Harold Robert Doe', 'Mr. Doe'),
        ('Katie Holt', 'Aunt Katie')]
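
  • To get from the matches to the nickname column the question asks for, one option is to build a dictionary from full name to nickname. The sketch below is only one way to do it and makes a few assumptions that are not in the answer above: the helper name nickname_lookup is made up, it compares case-insensitively, it requires a name word to start with a nickname word (so "Al" does not match "Donald"), and it keeps only the first nickname found per name.

```python
names = ['Johnathan Jay Smith', 'Harold Robert Doe', 'Katie Holt']
nicknames = ["Mr. Doe", "Aunt Katie", "John"]

def nickname_lookup(names, nicknames):
    """Map each full name to the first nickname sharing a word prefix with it.

    Hypothetical helper: a nickname word matches a name word when the name
    word starts with it, ignoring case ("john" matches "Johnathan").
    """
    lookup = {}
    for name in names:
        name_words = [w.lower() for w in name.split()]
        for nickname in nicknames:
            nn_words = [w.lower() for w in nickname.split()]
            if any(nw.startswith(w) for nw in name_words for w in nn_words):
                lookup[name] = nickname
                break  # keep only the first matching nickname
    return lookup

lookup = nickname_lookup(names, nicknames)
print(lookup)
# {'Johnathan Jay Smith': 'John', 'Harold Robert Doe': 'Mr. Doe', 'Katie Holt': 'Aunt Katie'}

# With pandas, assuming the data frame is called df and the column "name":
#   df["nickname"] = df["name"].map(lookup)   # names without a match become NaN
#   df = df.dropna(subset=["nickname"])       # drop rows that have no nickname
```

    Nicknames that match nobody are simply never used, and names without a nickname are missing from the dictionary, which is what lets the unmatched rows be dropped at the end. Titles such as "Mr." or "Aunt" are still treated as ordinary words here, so on real data it may be worth stripping them first.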