Is there a way to find the position of the words with pos-tag 'NN' and 'VB' in a sentence in Python?
Example of sentences in a CSV file: "Man walks into a bar." "Cop shoots his gun." "Kid drives into a ditch"
You can find the positions of tokens with certain PoS tags in a text using one of the existing NLP frameworks, such as spaCy or NLTK. Once the text is processed, you iterate over the tokens, check whether each token's PoS tag is one you are looking for, and then read the start/end position of that token in your text.
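Independently of the framework, the idea described above can be sketched in plain Python. This is only a minimal illustration: `find_tagged_positions` is a hypothetical helper name, and the `tagged` list is hand-written to stand in for the output of a real tagger, so the tags shown are assumptions rather than actual model predictions.

```python
def find_tagged_positions(text, tagged_tokens, wanted=("NN", "VB")):
    """Return (word, start, end, tag) for tokens whose tag starts with a wanted prefix.

    tagged_tokens is a list of (word, tag) pairs in the order they appear in text.
    """
    results = []
    cursor = 0  # Search from the end of the previous token so repeated words map correctly
    for word, tag in tagged_tokens:
        start = text.find(word, cursor)
        if start == -1:
            continue  # Token text was altered by the tokenizer; skip it
        end = start + len(word)
        cursor = end
        if tag.startswith(tuple(wanted)):
            results.append((word, start, end, tag))
    return results


# Hand-written (word, tag) pairs standing in for a tagger's output (assumed tags)
tagged = [("Man", "NN"), ("walks", "VBZ"), ("into", "IN"),
          ("a", "DT"), ("bar", "NN"), (".", ".")]
print(find_tagged_positions("Man walks into a bar.", tagged))
```

Matching on tag prefixes means that "NN" also covers tags like NNS and "VB" covers VBZ or VBD, which is usually what you want when asking for nouns and verbs.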
Spacy
Using spacy, the code to implement what you want would be something like this:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Man walks into a bar.")  # Your text here

words = []
for token in doc:
    # token.pos_ is the coarse universal tag; token.tag_ holds the fine-grained one (e.g. NN, VBZ)
    if token.pos_ == "NOUN" or token.pos_ == "VERB":
        start = token.idx  # Start position of token (character offset)
        end = token.idx + len(token)  # End position = start + len(token)
        words.append((token.text, start, end, token.pos_))
print(words)
In short, I build a new document from the string, iterate over all the tokens, and keep only those whose PoS tag is VERB or NOUN. Finally, I add the token info to a list for further processing. I strongly recommend that you read the spaCy tutorials for more information.
NLTK
With NLTK I think it is pretty straightforward too, using the NLTK tokenizer and PoS tagger. The rest is almost analogous to how we do it with spaCy.
What I'm not sure about is the best way to get the start/end positions of each token. Note that for this I am using the WhitespaceTokenizer().span_tokenize() method, which yields a (start, end) tuple for each token. Because it splits on whitespace only, punctuation stays attached to the preceding word (for example "bar."). Maybe there is a simpler and more NLTK-like way of doing it.
import nltk
from nltk.tokenize import WhitespaceTokenizer

# The tagger model has to be downloaded once with nltk.download(...);
# the exact resource name depends on your NLTK version.

text = "Man walks into a bar."  # Your text here
tokenizer = WhitespaceTokenizer()
tokens_positions = list(tokenizer.span_tokenize(text))  # Spans with start/end positions: [(0, 3), (4, 9), ... ]
tokens = tokenizer.tokenize(text)  # List of strings: ["Man", "walks", "into", ... ]
tagged = nltk.pos_tag(tokens)  # Run the part-of-speech tagger: [(word, tag), ... ]

# Iterate over each tagged token together with its span
words = []
for (word, tag), (start, end) in zip(tagged, tokens_positions):
    # Penn Treebank tags are fine-grained: NN, NNS, ... for nouns and VB, VBZ, ... for verbs
    if tag.startswith("NN") or tag.startswith("VB"):
        words.append((start, end, tag))
print(words)
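If the punctuation attached to tokens such as "bar." is a problem, one possible alternative is to compute tokens and spans in a single pass with the standard `re` module. This is only a rough sketch and not an NLTK feature: `token_spans` is a hypothetical helper, and the pattern is a simplification that splits every punctuation character into its own token (so contractions like "don't" are broken apart).

```python
import re

# Words (runs of letters/digits/underscore) or single punctuation characters
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def token_spans(text):
    """Return a list of (token, start, end) tuples with character offsets into text."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]


spans = token_spans("Man walks into a bar.")
print(spans)
```

The token strings (the first element of each tuple) can then be passed to the tagger as a list, and the offsets stay aligned with the tags by position.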
I hope this works for you!