spaCy - return nouns without the grammatical articles


In spaCy, when we request the noun chunks, the grammatical articles and similar leading words (e.g. "the", "one", "a") are also included:

import spacy

nlp_en = spacy.load('en_core_web_sm') # v3.7.1
doc = nlp_en('The man has cars, houses and one dog')
nouns = [chunk.text for chunk in doc.noun_chunks]
print(nouns) # ['The man', 'cars', 'houses', 'one dog']

Is there a way to get ['man', 'cars', 'houses', 'dog']?

It should work for every language, so stripping a hand-picked list of words is not a solution.


Solution

  • I had to check whether the first word of each chunk is a determiner; now it seems to work:

    import spacy
    
    nlp_en = spacy.load('en_core_web_sm') # v3.7.1
    doc = nlp_en('The third great man has cars, houses and a first dog')
    
    nouns = []
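    for chunk in doc.noun_chunks:
        # drop the first token only if it is a determiner (universal POS tag 'DET')
        # and the chunk has more than one token, so a lone determiner is kept as-is
        if chunk[0].pos_ == 'DET' and len(chunk) > 1:
            noun_ = chunk[1:].text  # slice the Span instead of splitting the string
        else:
            noun_ = chunk.text
        nouns.append(noun_)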
            
    print(nouns) # ['third great man', 'cars', 'houses', 'first dog']
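
The check above only looks at the first token, so modifiers such as "third great" stay in the result, and words that the model tags as something other than DET (numerals like "one" are typically tagged NUM) are not removed. Two variations might be worth trying; this is a minimal sketch, not a definitive answer. `strip_determiners`, `head_nouns` and `demo` are hypothetical helper names introduced here, and `demo` assumes the `en_core_web_sm` model is installed. The first helper drops every DET token in a chunk, the second uses `Span.root` to keep only the head noun of each chunk, which is closer to the `['man', 'cars', 'houses', 'dog']` asked for in the question.

```python
def strip_determiners(chunk):
    """Return the chunk text without any token tagged DET.

    `chunk` is any sequence of token-like objects exposing `.pos_` and
    `.text_with_ws`; a spaCy Span (e.g. a noun chunk) satisfies this.
    """
    kept = (tok.text_with_ws for tok in chunk if tok.pos_ != "DET")
    return "".join(kept).strip()


def head_nouns(doc):
    """Return only the syntactic head of each noun chunk (Span.root)."""
    return [chunk.root.text for chunk in doc.noun_chunks]


def demo():
    # Assumption: spaCy v3 and the en_core_web_sm model are installed.
    import spacy

    nlp_en = spacy.load("en_core_web_sm")
    doc = nlp_en("The third great man has cars, houses and a first dog")

    # Chunks with all determiners removed, modifiers kept
    print([strip_determiners(chunk) for chunk in doc.noun_chunks])
    # Head nouns only, without determiners or modifiers
    print(head_nouns(doc))
```

Calling `demo()` prints both variants for the example sentence. Because `pos_` holds the Universal POS tag, the `DET` comparison itself does not depend on the language, but `doc.noun_chunks` needs a pipeline with a dependency parser and is only implemented for some languages, so the "every language" requirement may not be fully met.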