[SOLVED] Spacy - return nouns without the grammatical articles

In spaCy, when we request the noun chunks, leading words such as articles and numerals (e.g. "the", "a", "one") are included in the result:

import spacy

nlp_en = spacy.load('en_core_web_sm') # v3.7.1
doc = nlp_en('The man has cars, houses and one dog')
nouns = [chunk.text for chunk in doc.noun_chunks]
print(nouns) # ['The man', 'cars', 'houses', 'one dog']

Is there a way to get ['man', 'cars', 'houses', 'dog']?

It should work for any language, so stripping a hard-coded list of words is not a solution.

Solution

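I had to check whether the first word of each chunk is a determiner (pos_ == 'DET'). Since pos_ holds the coarse-grained Universal POS tag, the same check should carry over to pipelines for other languages. It seems to work now: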

import spacy

nlp_en = spacy.load('en_core_web_sm') # v3.7.1
doc = nlp_en('The third great man has cars, houses and a first dog')

nouns = []
for noun in doc.noun_chunks:
    # Drop the first token if it is a determiner. Slicing the span
    # (instead of splitting the text) avoids an IndexError on one-word chunks.
    if noun[0].pos_ == 'DET' and len(noun) > 1:
        noun_ = noun[1:].text
    else:
        noun_ = noun.text
    nouns.append(noun_)

print(nouns) # ['third great man', 'cars', 'houses', 'first dog']
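Note that the check above only removes determiners, so a chunk like "one dog" from the original question would probably keep "one", which is normally tagged NUM rather than DET. Below is a minimal sketch of two variations, not a definitive implementation. The helper names strip_leading, head_noun and demo are made up for illustration, and the choice to drop NUM as well as DET is an assumption about what is wanted. The first helper skips every leading token whose pos_ is in a configurable set; the second returns only the head of the chunk via the span's root attribute, which is what the output requested in the question looks like.

```python
def strip_leading(chunk, drop_pos=frozenset({"DET", "NUM"})):
    """Return the chunk text without leading tokens tagged with a POS in drop_pos.

    `chunk` is expected to be a spaCy Span (e.g. an item of doc.noun_chunks).
    At least one token is always kept, so a one-word chunk is returned as is.
    """
    tokens = list(chunk)
    start = 0
    while start < len(tokens) - 1 and tokens[start].pos_ in drop_pos:
        start += 1
    # text_with_ws preserves the original spacing between tokens
    return "".join(tok.text_with_ws for tok in tokens[start:]).strip()


def head_noun(chunk):
    """Return only the syntactic head of the chunk (Span.root)."""
    return chunk.root.text


def demo():
    # Imported here so the helpers above do not require spaCy at import time.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this model is installed
    doc = nlp("The man has cars, houses and one dog")
    print([strip_leading(chunk) for chunk in doc.noun_chunks])
    print([head_noun(chunk) for chunk in doc.noun_chunks])
```

Calling demo() with the model installed prints both variants for the sentence from the question. The exact tags, and therefore the output, depend on the model. Also, noun_chunks is only available for languages whose pipeline provides a noun chunk iterator and a dependency parse, so "every language" is limited by that.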