
Python NLTK - counting occurrence of word in brown corpora based on returning top results by tag


I'm trying to return the most frequent words from a corpus for specific tags. I can get the tags and the words themselves to return fine, but I can't get the counts to appear in the output.

import itertools
import collections
import nltk 
from nltk.corpus import brown

words = brown.words()

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

tagdictNNS = findtags('NNS', nltk.corpus.brown.tagged_words())

This prints the tags and their words fine:

for tag in sorted(tagdictNNS):
    print tag, tagdictNNS[tag]

I have managed to return the count of every NNS-based word using this:

pluralLists = tagdictNNS.values()
pluralList = list(itertools.chain(*pluralLists)) 
for s in pluralList:
    sincident = words.count(s)
    print s
    print sincident

That returns everything.

Is there a better way of inserting the occurrence count into the dict tagdictNNS[tag]?

edit 1:

pluralLists = tagdictNNS.values()[:5]
pluralList = list(itertools.chain(*pluralLists))

This returns them in size order from the for s loop, but it's still not the right way to do it.

edit 2: updated dictionaries so they actually search for NNS plurals.


Solution

  • I might not understand, but given your tagdictNNS:

    >>> new = {}
    >>> for k, v in tagdictNNS.items():
    ...     new[k] = len(tagdictNNS[k])
    ...
    >>> new
    {'NNS$-TL-HL': 1, 'NNS-HL': 5, 'NNS$-HL': 4, 'NNS-TL': 5, 'NNS-TL-HL': 5, 'NNS+MD': 2, 'NNS$-NC': 1, 'NNS-TL-NC': 1, 'NNS$-TL': 5, 'NNS': 5, 'NNS$': 5, 'NNS-NC': 5}
    

    Then you can do something like:

    >>> from operator import itemgetter
    >>> sorted(new.items(), key=itemgetter(1), reverse=True)[:2]
    [('NNS-HL', 5), ('NNS-TL', 5)]
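
If the goal is to keep each word's occurrence count next to the word inside the per-tag dict, one possible approach is to count words per tag in a single pass and store (word, count) pairs. The following is only a sketch under some assumptions: it is written for Python 3, it uses the standard library's collections.Counter rather than NLTK so it runs without the corpus installed, and both the function name top_words_by_tag and the sample list are hypothetical stand-ins (the sample plays the role of nltk.corpus.brown.tagged_words()).

```python
from collections import Counter, defaultdict


def top_words_by_tag(tag_prefix, tagged_text, n=5):
    """Map each tag starting with tag_prefix to its n most frequent (word, count) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1  # one pass over the data, no repeated words.count()
    return {tag: counter.most_common(n) for tag, counter in counts.items()}


# Hypothetical stand-in for nltk.corpus.brown.tagged_words()
sample = [
    ("years", "NNS"), ("people", "NNS"), ("years", "NNS"),
    ("dogs", "NNS"), ("years", "NNS"), ("people", "NNS"),
    ("children's", "NNS$"), ("house", "NN"),
]

top_nns = top_words_by_tag("NNS", sample, n=2)
for tag in sorted(top_nns):
    print(tag, top_nns[tag])
```

With the real corpus you would pass nltk.corpus.brown.tagged_words() instead of sample. Note that these counts are per tag, whereas words.count(s) in the question counts a word across the whole corpus regardless of tag. In NLTK 3, FreqDist is a Counter subclass, so calling cfd[tag].most_common(5) inside findtags should give similar (word, count) pairs, though that depends on the NLTK version installed.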