I am using NLTK chunking and I want to capture the string that was matched by my rule. For example,
here is my input:
The stocks show 67% rise, last year it was 12% fall
I want to capture
67% rise
and 12% fall
POS tagging the above sentence gives
('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
Now, I came up with a simple rule
Stat: {<CD><NN>(<NN>+|<VBN>|<JJ>)?}
which works well and captures
('67', 'CD'), ('%', 'NN'), ('rise', 'NN')
('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
Now I want to extract the exact strings that were captured. So I want
67% rise
and 12% fall
I tried
current = []
for word, tag in subtree.leaves():
    current.append(word)
print(' '.join(current))
but I get
67 % rise
and 12 % fall
Notice the space between % and the numbers. This is logically correct but not the desired output. I want the exact string, as I want to know the starting and ending indices of the captured strings.
How can I achieve this?
(Horribly hacky) For your example strings and tags:
s = ('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
a = (('67', 'CD'), ('%', 'NN'), ('rise', 'NN'))
c = 'The stocks show 67% rise, last year it was 12% fall'
Edit: As a list comprehension:
>>> c[min((c.index(i[0]) for i in a)):max((c.index(i[0]) for i in a)) + [len(i[0]) for i in a][-1]]
'67% rise'
Find the position at which each word occurs in your input sentence, and record the length of each word.
Check what positions the words of your desired chunk have within your sample sentence. (Edit: removed the if, as it was unnecessary.)
position = []
lengths = []
for wordtag in a:
    # wordtag is a (word, tag) pair; find where the word starts in the sentence
    print(wordtag, c.index(wordtag[0]), wordtag[0], len(wordtag[0]))
    position.append(c.index(wordtag[0]))
    lengths.append(len(wordtag[0]))
> ('67', 'CD') 16 67 2
> ('%', 'NN') 18 % 1
> ('rise', 'NN') 20 rise 4
print(position)
print(lengths)
> [16, 18, 20]
> [2, 1, 4]
Slice your input sentence according to the minimum and maximum position of your desired tags. You add lengths[-1] so that the slice includes the full length of the last word, rise.
valid = c[min(position):max(position) + lengths[-1]]
print(valid)
> 67% rise
You can then generalise this for any list of sentences and part of speech tags.
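One caveat when generalising: c.index always returns the first match, so for the second chunk the % would resolve to offset 18 rather than 45. Below is a rough sketch of one way around that, scanning the sentence left to right and remembering where the previous token ended. The names token_spans and chunk_text are made up for this example. It assumes every token appears verbatim in the sentence (a tokenizer that rewrites quotes would break it) and that you already know the token indices of each chunk, for instance by counting leaves while walking the chunk tree.

```python
def token_spans(sentence, tokens):
    """Return a (start, end) character span for each token, scanning left to right."""
    spans = []
    cursor = 0  # never search before the end of the previous token
    for token in tokens:
        start = sentence.find(token, cursor)
        if start == -1:
            raise ValueError("token %r not found after offset %d" % (token, cursor))
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


def chunk_text(sentence, spans, first, last):
    """Return (start, end, text) for the tokens first..last (inclusive)."""
    start = spans[first][0]
    end = spans[last][1]
    return start, end, sentence[start:end]


sentence = 'The stocks show 67% rise, last year it was 12% fall'
tagged = [('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'),
          ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'),
          ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'),
          ('%', 'NN'), ('fall', 'NN')]

tokens = [word for word, tag in tagged]
spans = token_spans(sentence, tokens)

# The two chunks cover token indices 3..5 and 11..13 in this sentence.
print(chunk_text(sentence, spans, 3, 5))    # (16, 24, '67% rise')
print(chunk_text(sentence, spans, 11, 13))  # (43, 51, '12% fall')
```

The spans give you the starting and ending indices directly, and slicing the original sentence keeps whatever spacing it had, so no space appears between the number and the %.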