python-3.x pandas nlp nltk text-chunking

Python (NLTK) - more efficient way to extract noun phrases?


I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases from the training text so I can use them for feature construction later in the pipeline. I have managed to extract the kind of noun phrases I want, but I'm fairly new to NLTK, so I broke the problem down into one list comprehension per step, as you can see below.

But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I'm not seeing?

import nltk
import pandas as pd

# Raw string, so "\U" and "\t" are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']

# Defining a grammar & Parser
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)

tokens = [nltk.word_tokenize(i) for i in texts]

tag_list = [nltk.pos_tag(w) for w in tokens]

phrases = [chunkr.parse(sublist) for sublist in tag_list]

# label is a method in NLTK 3, so it has to be called
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')] for tree in phrases]

# Flatten the list of lists of lists of tuples that we've ended up with into just a list of lists of tuples

leaves = [tupls for sublists in leaves for tupls in sublists]

# Join the first two extracted terms of each chunk into one bigram

nounphrases = [unigram[0][0] + ' ' + unigram[1][0] for unigram in leaves]
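
The flatten-and-join steps can be checked on hand-made data without running the tagger. This is only a minimal sketch: the `leaves_per_text` list is made up for illustration and is assumed to have the same shape as the result of the nested `leaves` comprehension (one list of chunks per text, each chunk a list of `(token, tag)` tuples). Unlike the bigram line, it joins every token in a chunk rather than just the first two.

```python
from itertools import chain

# Hypothetical data: two texts, each with its own list of NP chunks.
leaves_per_text = [
    [[("machine", "NN"), ("learning", "NN")], [("text", "NN"), ("data", "NNS")]],
    [[("feature", "NN"), ("construction", "NN")]],
]

# Flatten one level: list of texts -> flat list of chunks.
chunks = list(chain.from_iterable(leaves_per_text))

# Join the tokens (not the tags) of each chunk into a single string.
noun_phrases = [" ".join(token for token, _tag in chunk) for chunk in chunks]

print(noun_phrases)  # ['machine learning', 'text data', 'feature construction']
```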

Solution

  • Take a look at "Why is my NLTK function slow when processing the DataFrame?". There's no need to iterate through all the rows multiple times if you don't need the intermediate steps.

    With ne_chunk and solution from

    [code]:

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk import RegexpParser
    from nltk import Tree
    import pandas as pd
    
    def get_continuous_chunks(text, chunk_func=ne_chunk):
        # Tokenize, tag and chunk in a single pass over the text
        chunked = chunk_func(pos_tag(word_tokenize(text)))
        continuous_chunk = []
        current_chunk = []
    
        # The trailing None flushes a chunk that ends the text
        for subtree in list(chunked) + [None]:
            if isinstance(subtree, Tree):
                # Adjacent chunks are merged into one phrase
                current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
            elif current_chunk:
                named_entity = " ".join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                # Reset even when the phrase is a duplicate
                current_chunk = []
    
        return continuous_chunk
    
    df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                               'Another bar foo Washington DC thingy with Bruce Wayne.']})
    
    df['text'].apply(get_continuous_chunks)
    

    [out]:

    0                   [New York]
    1    [Washington, Bruce Wayne]
    Name: text, dtype: object
    

    To use the custom RegexpParser:

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk import RegexpParser
    from nltk import Tree
    import pandas as pd
    
    # Defining a grammar & Parser
    NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
    chunker = RegexpParser(NP)
    
    def get_continuous_chunks(text, chunk_func=ne_chunk):
        # Tokenize, tag and chunk in a single pass over the text
        chunked = chunk_func(pos_tag(word_tokenize(text)))
        continuous_chunk = []
        current_chunk = []
    
        # The trailing None flushes a chunk that ends the text
        for subtree in list(chunked) + [None]:
            if isinstance(subtree, Tree):
                # Adjacent chunks are merged into one phrase
                current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
            elif current_chunk:
                named_entity = " ".join(current_chunk)
                if named_entity not in continuous_chunk:
                    continuous_chunk.append(named_entity)
                # Reset even when the phrase is a duplicate
                current_chunk = []
    
        return continuous_chunk
    
    
    df = pd.DataFrame({'text':['This is a foo, bar sentence with New York city.', 
                               'Another bar foo Washington DC thingy with Bruce Wayne.']})
    
    
    df['text'].apply(lambda sent: get_continuous_chunks(sent, chunker.parse))
    

    [out]:

    0                  [bar sentence, New York city]
    1    [bar foo Washington DC thingy, Bruce Wayne]
    Name: text, dtype: object