Tags: python, nlp, spacy, named-entity-extraction

Extract Named Entities using spaCy and a Python lambda


I am using the following code to extract named entities with a lambda:

df['Place'] = df['Text'].apply(lambda x: [entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'])

and

df['Text'].apply(lambda x: ([entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'] or [''])[0])

This works for a few hundred records, but with thousands of records it takes practically forever. Can someone help me optimize this line of code?


Solution

  • You can speed this up by:

    1. Calling nlp.pipe on the whole list of documents, so spaCy processes them in batches instead of one call per row.
    2. Disabling the pipeline components you do not need.

    Try:

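    import pandas as pd
    import spacy
    
    # Load the model without the components that NER does not need
    nlp = spacy.load("en_core_web_md", disable=["tagger", "parser"])
    
    df = pd.DataFrame({"Text": ["this is a text about Germany", "this is another about Trump"]})
    
    texts = df["Text"].to_list()
    ents = []
    # nlp.pipe processes the texts in batches rather than one nlp(x) call per row
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ == "GPE":
                ents.append(ent)
    
    print(ents)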
    

    Output:

    [Germany]
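
The loop above collects all GPE entities into one flat list, whereas the question wants one result per row. A minimal sketch of how the nlp.pipe output could be mapped back onto the DataFrame follows. The helper name `add_place_columns`, the `FirstPlace` column name, and the `batch_size` value are illustrative assumptions, not part of the original answer. It relies on nlp.pipe yielding one Doc per input text in input order.

```python
import pandas as pd


def add_place_columns(df, nlp, text_col="Text", label="GPE", batch_size=256):
    """Hypothetical helper: return a copy of df with per-row entity columns.

    `nlp` is any loaded spaCy pipeline (or an object with a compatible
    .pipe method); `batch_size` is an assumed value worth tuning.
    """
    # Missing values become empty strings so every row is a valid text
    texts = df[text_col].fillna("").astype(str).tolist()

    places = []
    # One Doc comes back per text, in order, so results line up with the rows
    for doc in nlp.pipe(texts, batch_size=batch_size):
        places.append([ent.text for ent in doc.ents if ent.label_ == label])

    out = df.copy()
    # All matching entities per row, as in the first lambda of the question
    out["Place"] = pd.Series(places, index=out.index)
    # First matching entity or '', as in the second lambda of the question
    out["FirstPlace"] = [p[0] if p else "" for p in places]
    return out
```

With the pipeline loaded as in the answer, `df = add_place_columns(df, nlp)` would fill both columns in a single pass over the texts.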