In spaCy, when we request the noun chunks, the leading determiners and numerals (e.g. "the", "one", "a") are also included:
import spacy
nlp_en = spacy.load('en_core_web_sm') # v3.7.1
doc = nlp_en('The man has cars, houses and one dog')
nouns = [chunk.text for chunk in doc.noun_chunks]
print(nouns) # ['The man', 'cars', 'houses', 'one dog']
Is there a way to get ['man', 'cars', 'houses', 'dog']?
It should work for every language, so stripping a hard-coded list of words is not a solution.
I had to check whether the first word of each chunk is a determiner. Now it seems to work:
import spacy
nlp_en = spacy.load('en_core_web_sm') # v3.7.1
doc = nlp_en('The third great man has cars, houses and a first dog')
nouns = []
for noun in doc.noun_chunks:
    # check if the first token of the chunk is a determiner
    if noun[0].pos_ == 'DET' and len(noun) > 1:
        noun_ = noun[1:].text  # drop the first token (slicing avoids an IndexError on one-word chunks)
    else:
        noun_ = noun.text
    nouns.append(noun_)
print(nouns) # ['third great man', 'cars', 'houses', 'first dog']
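Two things the snippet above does not cover: in the Universal POS scheme numerals such as "one" are tagged NUM rather than DET, so they are not stripped, and the adjectives stay in the chunk. A minimal sketch of two alternatives follows, assuming the goal is either the bare head noun or the chunk without its leading determiners and numerals. The helper names `head_noun`, `strip_leading` and `demo` are hypothetical. `Span.root` and `Token.text_with_ws` are existing spaCy attributes, but `noun_chunks` is only implemented for some languages, so "every language" really means every language whose pipeline provides noun chunks.

```python
def head_noun(chunk):
    """Return the syntactic head of a noun chunk (a spaCy Span)."""
    return chunk.root.text


def strip_leading(chunk, drop_pos=("DET", "NUM")):
    """Return the chunk text without leading tokens whose POS is in drop_pos.

    drop_pos is an assumption: extend it (e.g. with "ADJ") as needed.
    """
    tokens = list(chunk)
    start = 0
    # Skip leading determiners/numerals, but always keep the last token.
    while start < len(tokens) - 1 and tokens[start].pos_ in drop_pos:
        start += 1
    # Rebuild the text from the remaining tokens, preserving original spacing.
    return "".join(tok.text_with_ws for tok in tokens[start:]).strip()


def demo():
    # Imported lazily so the helpers above have no import-time dependency.
    import spacy

    nlp_en = spacy.load("en_core_web_sm")
    doc = nlp_en("The man has cars, houses and one dog")
    print([head_noun(chunk) for chunk in doc.noun_chunks])
    print([strip_leading(chunk) for chunk in doc.noun_chunks])
```

Calling `demo()` prints both variants for the sentence from the question. Because both helpers rely only on POS tags and the dependency parse, not on a word list, they need no per-language stop words.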