python nlp linguistics

How to parse names from raw text


I was wondering if anyone knew of any good libraries or methods of parsing names from raw text.

For example, let's say I've got these as examples: (note sometimes they are capitalized tuples, other times not)

James Vaynerchuck and the rest of the group will be meeting at 1PM. 
Sally Johnson, Jim White and brad burton. 
Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.

My first thought is to load some sort of part-of-speech tagger (like Python's NLTK) and tag all of the words, then strip out everything except the nouns and compare those nouns against a database of known words (i.e. a literal dictionary). If a noun isn't in the dictionary, assume it is a name.

Other thoughts would be to delve into machine learning, but that might be beyond the scope of what I need here.

Any thoughts, suggestions or libraries you could point me to would be very helpful.

Thanks!


Solution

  • I don't know why you think you need NLTK just to rule out dictionary words; a simple dictionary (which you might have installed somewhere like /usr/share/dict/words, or you can download one off the internet) is all you need:

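    import string

    # mypath is the path to your input text file
    with open('/usr/share/dict/words') as f:
        dictwords = {word.strip() for word in f}
    with open(mypath) as f:
        # strip punctuation so "Johnson," is looked up as "Johnson"
        words = (w.strip(string.punctuation) for line in f for w in line.split())
        names = [word for word in words
                 if word and word.lower() not in dictwords]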
    

    Your words list may include names, but if so, it will include them capitalized, so:

        dictwords = {word.strip() for word in f if word.islower()}
    

    Or, if you want to whitelist proper names instead of blacklisting dictionary words:

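    import string

    with open('/usr/share/dict/propernames') as f:
        namewords = {word.strip() for word in f}
    with open(mypath) as f:
        # strip punctuation so "Johnson," is looked up as "Johnson"
        words = (w.strip(string.punctuation) for line in f for w in line.split())
        names = [word for word in words if word.title() in namewords]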
    

    But this really isn't going to work. Look at "Jim White" from your example. His last name is obviously going to be in any dictionary, and his first name will be in many (as a short version of "jimmy", as a common romanization of the Arabic letter "jīm", etc.). "Mark" is also a common dictionary word. And the other way around, "Will" is a very common name even though you want to treat it as a word, and "Happiness" is an uncommon name, but at least a few people have it.

    So, to make this work even the slightest bit, you probably want to combine multiple heuristics.

    First, instead of a word being either always a name or never a name, give each word a probability of being used as a name in some relevant corpus: White may be a name 13.7% of the time, Mark 41.3%, Jim 99.1%, Happiness 0.1%, etc.

    Next, use capitalization. If a word is capitalized but is not the first word in a sentence, it's much more likely to be a name (how much more? I don't know; you'll need to test and tune for your particular input), and if it's lowercase, it's less likely to be a name.

    You could also bring in more context. For example, you have a lot of full names, so if something is a possible first name and it appears right next to something that's a common last name, it's more likely to be a first name.

    You could even try to parse the grammar (it's OK if you bail on some sentences; they just won't get any input from the grammar rule). If two adjacent words only work as part of a sentence when the second one is a verb, they're probably not a first and last name, even if that same second word could be a noun (and a name) in other contexts. And so on.
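
As a rough illustration of combining the first three heuristics (a per-word prior, capitalization, and neighboring names), here is a minimal sketch. The probability table, the tiny word list, and every multiplier are made-up assumptions for the example, not measured values; only the figures for White, Mark, Jim and Happiness are the illustrative ones mentioned above. The function names are hypothetical too.

```python
import string

# Hypothetical priors: chance that a word is used as a name in some corpus.
# Only white/mark/jim/happiness come from the text; the rest are invented.
NAME_PRIOR = {
    "white": 0.137, "mark": 0.413, "jim": 0.991, "happiness": 0.001,
    "james": 0.98, "sally": 0.95, "johnson": 0.90, "brad": 0.95,
    "burton": 0.80, "evan": 0.95, "will": 0.05,
}
# Tiny stand-in for a real dictionary such as /usr/share/dict/words.
COMMON_WORDS = {"and", "the", "rest", "of", "group", "be", "meeting",
                "at", "productivity", "blocks"}

# Tuning knobs; all of these are assumed values that need testing.
UNKNOWN_PRIOR = 0.5        # word found in neither table
COMMON_PRIOR = 0.01        # ordinary dictionary word
CAPITALIZED_BOOST = 4.0    # capitalized, but not at the start of a sentence
LOWERCASE_PENALTY = 0.25   # written in lowercase
NEIGHBOR_BOOST = 2.0       # sits next to a word that is likely a name
LIKELY = 0.5               # prior at which a neighbor counts as a likely name


def clean(token):
    """Strip surrounding punctuation; return '' for non-alphabetic tokens."""
    word = token.strip(string.punctuation)
    return word if word.isalpha() else ""


def prior(word):
    """Look up the base probability that a cleaned word is a name."""
    if not word:
        return 0.0
    key = word.lower()
    if key in NAME_PRIOR:
        return NAME_PRIOR[key]
    return COMMON_PRIOR if key in COMMON_WORDS else UNKNOWN_PRIOR


def name_probability(tokens, i):
    """Combine prior, capitalization and neighbor evidence for tokens[i]."""
    word = clean(tokens[i])
    p = prior(word)
    if p == 0.0:
        return 0.0
    odds = p / (1.0 - p)
    sentence_start = i == 0 or tokens[i - 1].endswith((".", "!", "?"))
    if word[0].isupper():
        if not sentence_start:
            odds *= CAPITALIZED_BOOST
    else:
        odds *= LOWERCASE_PENALTY
    neighbors = [clean(tokens[j]) for j in (i - 1, i + 1)
                 if 0 <= j < len(tokens)]
    if any(prior(n) >= LIKELY for n in neighbors):
        odds *= NEIGHBOR_BOOST
    return odds / (1.0 + odds)


def find_names(line, threshold=0.5):
    """Return the words in one line of text that score above the threshold."""
    tokens = line.split()
    return [clean(t) for i, t in enumerate(tokens)
            if name_probability(tokens, i) > threshold]


if __name__ == "__main__":
    for text in [
        "James Vaynerchuck and the rest of the group will be meeting at 1PM.",
        "Sally Johnson, Jim White and brad burton.",
        "Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.",
    ]:
        print(find_names(text))
```

With these invented numbers it picks out James Vaynerchuck, Sally Johnson, Jim White, brad burton, the first Mark, and Evan from your three examples, but it misses "angleman" and the second "Mark" (which starts a sentence, so capitalization tells it nothing). That is exactly the kind of thing you would need to tune against real data.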