
NLTK: How to access the chunked string


I am using NLTK chunking and I want to capture the string that was matched by my rule. For example:

Here is my input

The stocks show 67% rise, last year it was 12% fall

I want to capture

67% rise and 12% fall

POS tagging the above sentence gives

('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')

I came up with a simple rule

Stat: {<CD><NN>(<NN>+|<VBN>|<JJ>)?}

which works well and captures

('67', 'CD'), ('%', 'NN'), ('rise', 'NN')

('12', 'CD'), ('%', 'NN'), ('fall', 'NN')

Now I want to extract the exact strings that were captured. So, I want

67% rise and 12% fall

I tried

current=[]
for word,tag in subtree.leaves():
    current.append(word)
print(' '.join(current))

but I get

67 % rise and 12 % fall

Notice the space between the numbers and %. This is logically correct but not the desired output. I want the exact string because I need the start and end indices of the captured strings.

How can I achieve this?


Solution

  • (Horribly hacky) For your example strings and tags:

    s = ('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'), ('67', 'CD'), ('%', 'NN'), ('rise', 'NN'), (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'), ('12', 'CD'), ('%', 'NN'), ('fall', 'NN')
    a = (('67', 'CD'), ('%', 'NN'), ('rise', 'NN'))
    c = 'The stocks show 67% rise, last year it was 12% fall'
    

    Edit: As a single expression:

    >>> c[min(c.index(i[0]) for i in a):max(c.index(i[0]) for i in a) + len(a[-1][0])]
    '67% rise'
    

    Find the position at which each word occurs in your input sentence, and record the length of each word.

    Check what positions the words of your desired chunk have within the sample sentence. (Edit: removed the if, as it was unnecessary.)

    position=[]
    lengths=[]
    
    for wordtag in a:
        print(wordtag, c.index(wordtag[0]), wordtag[0], len(wordtag[0]))
        position.append(c.index(wordtag[0]))
        lengths.append(len(wordtag[0]))
    
    > ('67', 'CD') 16 67 2
    > ('%', 'NN') 18 % 1
    > ('rise', 'NN') 20 rise 4
    
    print(position)
    print(lengths)
    
    > [16, 18, 20]
    > [2, 1, 4]
    

    Slice the input sentence from the minimum to the maximum position of the desired words. Adding lengths[-1] extends the slice by the length of the last word, rise.

    valid = c[min(position):max(position) + lengths[-1]]
    print(valid)
    
    > 67% rise
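
Since the question asks for start and end indices, the same idea can return them directly. The following is only a sketch and not part of the original answer: chunk_span and its search_from parameter are hypothetical names. It assumes the chunk's words appear in the sentence in order, and it looks for each word starting from the end of the previous one, because str.index without a start offset returns the first occurrence (the % of the second chunk would otherwise be found at position 18).

```python
def chunk_span(sentence, chunk, search_from=0):
    """Return (start, end) character offsets of a chunk within sentence.

    chunk is a sequence of (word, tag) pairs in sentence order.
    search_from (hypothetical parameter) is where the search begins.
    """
    cursor = search_from
    start = None
    for word, _tag in chunk:
        pos = sentence.index(word, cursor)  # raises ValueError if absent
        if start is None:
            start = pos  # offset of the first word of the chunk
        cursor = pos + len(word)  # continue after the word just found
    return start, cursor


c = 'The stocks show 67% rise, last year it was 12% fall'
first = (('67', 'CD'), ('%', 'NN'), ('rise', 'NN'))
second = (('12', 'CD'), ('%', 'NN'), ('fall', 'NN'))

start1, end1 = chunk_span(c, first)
# Start the second search where the first chunk ended.
start2, end2 = chunk_span(c, second, search_from=end1)
print(start1, end1, c[start1:end1])
print(start2, end2, c[start2:end2])
```

A word that also occurs earlier in the sentence, outside the chunk, can still mislead the search unless search_from is set past it.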
    

    You can then generalise this for any list of sentences and part of speech tags.
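
One possible way to generalise, sketched under assumptions: align every token to the sentence once, then read each chunk's span off the token offsets. token_offsets and chunk_spans are hypothetical helper names, not NLTK functions. The sketch assumes the tokens appear in the sentence in order and unmodified, and that the chunker output iterates as a sequence whose items are either (word, tag) tuples or one-level chunk subtrees that themselves iterate over (word, tag) tuples, as the nltk.Tree returned by RegexpParser.parse does. A plain nested list stands in for that tree here so the example runs without NLTK.

```python
def token_offsets(sentence, words):
    """Map each word to its (start, end) offsets, scanning left to right."""
    offsets = []
    cursor = 0
    for word in words:
        start = sentence.index(word, cursor)  # ValueError if not found
        cursor = start + len(word)
        offsets.append((start, cursor))
    return offsets


def chunk_spans(sentence, parsed):
    """Yield (start, end, text) for each one-level chunk in parsed."""
    words = []
    chunk_ranges = []  # (first token index, last token index exclusive)
    for node in parsed:
        if isinstance(node, tuple):
            # An unchunked (word, tag) token.
            words.append(node[0])
        else:
            # Assumed to be a chunk: an iterable of (word, tag) tuples.
            first = len(words)
            words.extend(word for word, _tag in node)
            chunk_ranges.append((first, len(words)))
    offsets = token_offsets(sentence, words)
    for first, last in chunk_ranges:
        start, end = offsets[first][0], offsets[last - 1][1]
        yield start, end, sentence[start:end]


c = 'The stocks show 67% rise, last year it was 12% fall'
# Stand-in for the chunker output: nested lists mark the Stat chunks.
parsed = [
    ('The', 'DT'), ('stocks', 'NNS'), ('show', 'VBP'),
    [('67', 'CD'), ('%', 'NN'), ('rise', 'NN')],
    (',', ','), ('last', 'JJ'), ('year', 'NN'), ('it', 'PRP'), ('was', 'VBD'),
    [('12', 'CD'), ('%', 'NN'), ('fall', 'NN')],
]
spans = list(chunk_spans(c, parsed))
for span in spans:
    print(span)
```

Aligning all tokens first means a repeated word such as % is matched at its own occurrence rather than the first one. The alignment raises ValueError if the tokenizer rewrote a token so that it no longer appears verbatim in the sentence.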