The code below tokenises the text and identifies the part of speech of each token.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import wordnet as wn
#nltk.download()
text = "Natural language processing is fascinating"
# tokenise the sentence
words = word_tokenize(text)
print(words)
# identify the part of speech (noun, verb, etc.) of each word
for w in words:
    tmp = wn.synsets(w)[0].pos()
    print(w, ":", tmp)
The output is:
['Natural', 'language', 'processing', 'is', 'fascinating']
Natural : n
language : n
processing : n
is : v
fascinating : v
where n is noun and v is verb.
Could a Python expert please advise me how to format the output so that it looks like this:
nouns = ["natural", "language", "processing"]
verbs = ["is", "fascinating"]
I need help changing the output format. I assume it takes some additional Python code to collect the words into lists.
You can achieve it this way:
# Lists to store parts of speech
nouns = []
verbs = []
for w in words:
    synsets = wn.synsets(w)
    if synsets:
        pos = synsets[0].pos()
        if pos == 'n':
            nouns.append(w.lower())
        elif pos == 'v':
            verbs.append(w.lower())
Full solution:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
# Make sure the necessary NLTK data is available
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')
text = "Natural language processing is fascinating"
# Tokenize the text
words = word_tokenize(text)
# Lists to store parts of speech
nouns = []
verbs = []
for w in words:
    synsets = wn.synsets(w)
    if synsets:
        pos = synsets[0].pos()
        if pos == 'n':
            nouns.append(w.lower())
        elif pos == 'v':
            verbs.append(w.lower())
print(f"nouns = {nouns}")
print(f"verbs = {verbs}")
Output:
nouns = ['natural', 'language', 'processing']
verbs = ['is', 'fascinating']
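If you later need more categories than nouns and verbs, the same idea generalises to a dictionary keyed by part of speech. This is only a minimal sketch under a few assumptions: `group_by_pos` and `POS_NAMES` are hypothetical names introduced here, and the `tagged` list is hard-coded from the output shown in the question instead of being computed with WordNet, so the snippet runs without any NLTK data. The single-letter codes are the ones WordNet uses (`n`, `v`, `a`, `s`, `r`).

```python
from collections import defaultdict

# WordNet's single-letter POS codes mapped to readable list names.
# 's' is WordNet's "satellite adjective" code, grouped with adjectives here.
POS_NAMES = {
    "n": "nouns",
    "v": "verbs",
    "a": "adjectives",
    "s": "adjectives",
    "r": "adverbs",
}


def group_by_pos(tagged_words):
    """Group (word, pos_code) pairs into lowercase word lists keyed by POS name."""
    groups = defaultdict(list)
    for word, pos in tagged_words:
        # Skip codes that are not in the mapping instead of raising KeyError.
        name = POS_NAMES.get(pos)
        if name is not None:
            groups[name].append(word.lower())
    return dict(groups)


# Assumption: these pairs are copied from the output in the question,
# not produced by wn.synsets() in this snippet.
tagged = [
    ("Natural", "n"),
    ("language", "n"),
    ("processing", "n"),
    ("is", "v"),
    ("fascinating", "v"),
]

grouped = group_by_pos(tagged)
for name, group in grouped.items():
    print(f"{name} = {group}")
```

In the real script you would build `tagged` inside the loop from `synsets[0].pos()`, keeping the `if synsets:` check so that words WordNet does not know are skipped.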