python nltk text-chunking

NLTK - Replace chunks with specific word


I am working on NLP using nltk. I am using chunking to extract names of people. After chunking I want to replace the chunks with specific strings 'Male' or 'Female'.

My code is :

import nltk

with open('male_names.txt') as f1:
    male = [line.rstrip('\n') for line in f1]
with open('female_names.txt') as f2:
    female = [line.rstrip('\n') for line in f2]

with open("input.txt") as f:
    text = f.read()

words = nltk.word_tokenize(text)
tagged = nltk.pos_tag(words)
chunkregex = r"""Name: {<NNP>+}"""
chunkParser = nltk.RegexpParser(chunkregex)
chunked = chunkParser.parse(tagged)

for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Name'):
    chunk=[]
    for word, pos in subtree:
        chunk.append(word)
        temp = " ".join(chunk)
    **if temp in male:
        subtree = ('Male', pos)
    if temp in female:
        subtree = ('Female', pos)**
    print subtree

print chunked

My input data is :

Captain Jack Sparrow arrives in Port Royal in Jamaica to commandeer a ship. Despite rescuing Elizabeth Swann, the daughter of Governor Weatherby Swann, from drowning, he is jailed for piracy.

The current output is :

(S (Name Captain/NNP Jack/NNP Sparrow/NNP) arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG (Name Elizabeth/NNP Swann/NNP) ,/, the/DT daughter/NN of/IN (Name Governor/NNP Weatherby/NNP Swann/NNP) ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VBN for/IN piracy/NN ./.)

I want to replace the chunks with 'Male' or 'Female' which should give the output as :

(S Male/NNP arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG Female/NNP ,/, the/DT daughter/NN of/IN Male/NNP ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VBN for/IN piracy/NN ./.)

The bold part in the code is not doing what it's supposed to. The print subtree statement shows the changes but print chunked does not change.

What am I doing wrong, or is there another way to do this?
I am new to Python and nltk. Any help is appreciated.

male and female each contain a list of names, for example:

["Captain Jack Sparrow", "Governor Weatherby Swann", "Robin"]

["Elizabeth Swann", "Jenny"]


Solution

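  • If I understood your problem correctly: an NLTK subtree is a subclass of the normal Python list, so ordinary list operations work on it. In your loop, subtree = ('Male', pos) only rebinds the loop variable to a new tuple and leaves the tree itself untouched, which is why print subtree shows the change but print chunked does not. Assigning to the slice subtree[:] modifies the subtree in place instead. Try this snippet in place of the for loop in your code.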

    for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Name'):
        # Join the chunk's tokens back into the full name, e.g. "Captain Jack Sparrow".
        full_name = " ".join(word for word, pos in subtree)
        pos = subtree[-1][1]  # reuse the POS tag of the chunk's last token
        if full_name in male:
            # Slice assignment replaces the subtree's contents in place.
            subtree[:] = [("Male", pos)]
        elif full_name in female:
            subtree[:] = [("Female", pos)]
    

    output:

    > (S (Name Male/NNP) arrives/VBZ in/IN (Name Port/NNP Royal/NNP) in/IN (Name Jamaica/NNP) to/TO commandeer/VB a/DT ship/NN ./. Despite/IN rescuing/VBG (Name Female/NNP) ,/, the/DT daughter/NN of/IN (Name Male/NNP) ,/, from/IN drowning/VBG ,/, he/PRP is/VBZ jailed/VBN for/IN piracy/NN ./.)
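
    The output above still wraps each replacement in a Name chunk, while the question asks for a bare Male/NNP or Female/NNP leaf. One possible way to get that, sketched below, is to assign into the parent tree by index rather than into the subtree. To keep the sketch runnable without nltk or its tagger data, it uses a small hypothetical stand-in class, Node, that mimics only the two things relied on here: nltk.Tree subclasses list and has a label() method. The function name flatten_name_chunks is also made up for this example, and it only handles chunks that sit directly under the root, which is the shape produced by the single-rule grammar in the question.

```python
class Node(list):
    """Hypothetical stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def flatten_name_chunks(tree, male, female):
    """Replace each matching top-level 'Name' chunk with a single (word, tag) leaf."""
    for i, child in enumerate(tree):
        if not isinstance(child, list):
            continue  # a plain (word, tag) leaf, nothing to do
        if child.label() != "Name":
            continue
        full_name = " ".join(word for word, tag in child)
        tag = child[-1][1]  # reuse the POS tag of the chunk's last token
        if full_name in male:
            tree[i] = ("Male", tag)  # assign into the parent, not the loop variable
        elif full_name in female:
            tree[i] = ("Female", tag)
    return tree


male = {"Captain Jack Sparrow", "Governor Weatherby Swann", "Robin"}
female = {"Elizabeth Swann", "Jenny"}

# Hand-built fragment of the chunked sentence from the question.
sentence = Node("S", [
    Node("Name", [("Captain", "NNP"), ("Jack", "NNP"), ("Sparrow", "NNP")]),
    ("arrives", "VBZ"),
    ("in", "IN"),
    Node("Name", [("Port", "NNP"), ("Royal", "NNP")]),
    ("rescuing", "VBG"),
    Node("Name", [("Elizabeth", "NNP"), ("Swann", "NNP")]),
])

flatten_name_chunks(sentence, male, female)
```

    With nltk installed, the chunked tree from the question should be usable in place of sentence, since the function only relies on list indexing and label(). Chunks that match neither list, such as Port Royal, are left as they are.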