I have a character data frame with a column called name that contains the full names of 100+ people,
e.g. Name: Johnathan Jay Smith, Harold Robert Doe, Katie Holt.
I also have a list of unique nicknames, e.g. [Mr. Doe, Aunt Katie, John].
It's important to note that they are not in the same order, and that not everyone with a nickname is in the full name list, and not everyone in the full name list is in the nickname list. I will be removing rows that don't have matching values at the end.
My question: is there a way to get Python to read through these two lists item by item and match John with Johnathan Jay Smith, and so on for everyone who has a match? Basically, if the nickname appears as part of the full name, can I add a nickname column to my existing character data frame without doing this manually for over 100 people?
Thank you in advance, I don't even know where to start with this one!
This is a very straightforward approach and does not take spelling variants into account. It simply checks whether any word of a nickname appears inside any word of a full name:
from itertools import product

names = ['Johnathan Jay Smith', 'Harold Robert Doe', 'Katie Holt']
nicknames = ["Mr. Doe", "Aunt Katie", "John"]

def match_nicknames(names, nicknames):
    # Split every full name and every nickname into its words
    splitted_names = [n.split(' ') for n in names]
    splitted_nn = [n.split(' ') for n in nicknames]
    matches = []
    for name in splitted_names:
        # Pair each word of the name with each (split) nickname
        name_pairs = product(name, splitted_nn)
        # Keep pairs where some nickname word is a substring of the name word
        matched = filter(lambda x: any(nn in x[0] for nn in x[1]), name_pairs)
        matches += [(" ".join(name), " ".join(nn)) for name_part, nn in matched]
    return matches

match_nicknames(names, nicknames)
>> [('Johnathan Jay Smith', 'John'),
    ('Harold Robert Doe', 'Mr. Doe'),
    ('Katie Holt', 'Aunt Katie')]
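
To get from these pairs to the nickname column asked about in the question, something along these lines might work. This is only a sketch: it assumes the data frame is a pandas DataFrame with a column called name, the `df` below is a made-up stand-in for the real one (including an extra person without a nickname), and `lookup` is just an illustrative variable name. The pairs are written out by hand so the snippet runs on its own; in practice they would come from match_nicknames.

```python
import pandas as pd

# Hypothetical stand-in for the existing character data frame
df = pd.DataFrame({"name": ["Johnathan Jay Smith", "Harold Robert Doe",
                            "Katie Holt", "Mary Ann Jones"]})

# Pairs in the shape returned by match_nicknames(names, nicknames)
matches = [("Johnathan Jay Smith", "John"),
           ("Harold Robert Doe", "Mr. Doe"),
           ("Katie Holt", "Aunt Katie")]

# Full name -> nickname lookup; if a name matched more than once,
# the last pair wins
lookup = dict(matches)

# Names that are not in the lookup get NaN in the new column
df["nickname"] = df["name"].map(lookup)

# Remove the rows that have no matching nickname
df = df.dropna(subset=["nickname"]).reset_index(drop=True)
```

Since the matching is a plain substring check, it is probably worth eyeballing the resulting column once, because a short nickname can also match inside an unrelated name.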