
Converting POS tags from TextBlob into WordNet-compatible inputs


I'm using Python with NLTK and TextBlob for some text analysis. It's interesting that you can pass a POS to WordNet to make your search for synonyms more specific, but unfortunately the tags produced by both NLTK and TextBlob aren't "compatible" with the kind of input that WordNet expects for its synset lookup.

For example, wn.synsets() requires that the POS you give it be one of n, v, a, r, like so:

wn.synsets("dog", pos="n")  # pos must be one of "n", "v", "a", "r"

But the standard POS tags from upenn_treebank look like

JJ, VBD, VBZ, etc.

Does anyone know of a good way to convert between the two, besides brute force?


Solution

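  • If TextBlob is using the Penn Treebank (PTB) tagset, then the first character of the POS tag is enough to pick the WordNet POS tag: N* maps to 'n', V* to 'v', J* (JJ, JJR, JJS) to 'a', and RB* to 'r'. Tags that start with anything else have no WordNet counterpart.

    The WordNet POS tagset is 'n' = noun, 'v' = verb, 'a' = adjective, 's' = satellite adjective and 'r' = adverb.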

    try:

    >>> from nltk import word_tokenize, pos_tag
    >>> from nltk.corpus import wordnet as wn
    >>> text = 'this is a pos tagset in some foo bar paradigm'
    >>> pos_tag(word_tokenize(text))
    [('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pos', 'NN'), ('tagset', 'NN'), ('in', 'IN'), ('some', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('paradigm', 'NN')]
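    >>> ptb_to_wn = {'N': 'n', 'V': 'v', 'J': 'a', 'R': 'r'}  # first PTB letter -> WordNet POS
    >>> for tok, pos in pos_tag(word_tokenize(text)):
    ...     wn_pos = ptb_to_wn.get(pos[0])  # None for DT, IN, etc.
    ...     if wn_pos:
    ...         wn.synsets(tok, wn_pos)
    ... 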
    [Synset('be.v.01'), Synset('be.v.02'), Synset('be.v.03'), Synset('exist.v.01'), Synset('be.v.05'), Synset('equal.v.01'), Synset('constitute.v.01'), Synset('be.v.08'), Synset('embody.v.02'), Synset('be.v.10'), Synset('be.v.11'), Synset('be.v.12'), Synset('cost.v.01')]
    [Synset('polonium.n.01'), Synset('petty_officer.n.01'), Synset('po.n.03'), Synset('united_states_post_office.n.01')]
    []
    []
    [Synset('barroom.n.01'), Synset('bar.n.02'), Synset('bar.n.03'), Synset('measure.n.07'), Synset('bar.n.05'), Synset('prevention.n.01'), Synset('bar.n.07'), Synset('bar.n.08'), Synset('legal_profession.n.01'), Synset('stripe.n.05'), Synset('cake.n.01'), Synset('browning_automatic_rifle.n.01'), Synset('bar.n.13'), Synset('bar.n.14'), Synset('bar.n.15')]
    [Synset('paradigm.n.01'), Synset('prototype.n.01'), Synset('substitution_class.n.01'), Synset('paradigm.n.04')]
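
To make the conversion reusable, the mapping can be wrapped in a small function. The following is only a sketch: `ptb_to_wordnet` and `PTB_PREFIX_TO_WN` are hypothetical names introduced here, not part of NLTK or TextBlob, and it assumes the tagger emits Penn Treebank tags. It matches on the two-letter prefix rather than the first character, so that RP (particle) is not treated as an adverb.

```python
# Hypothetical helper, not part of NLTK or TextBlob.
# Assumes Penn Treebank tags; keys are two-letter tag prefixes.
PTB_PREFIX_TO_WN = {
    "NN": "n",  # NN, NNS, NNP, NNPS -> noun
    "VB": "v",  # VB, VBD, VBG, VBN, VBP, VBZ -> verb
    "JJ": "a",  # JJ, JJR, JJS -> adjective
    "RB": "r",  # RB, RBR, RBS -> adverb
}


def ptb_to_wordnet(tag):
    """Return 'n', 'v', 'a' or 'r' for a Penn Treebank tag, else None."""
    if not tag:
        return None
    return PTB_PREFIX_TO_WN.get(tag[:2].upper())


# Example input in the (token, tag) shape that pos_tag() returns.
tagged = [("this", "DT"), ("is", "VBZ"), ("quick", "JJ"),
          ("quickly", "RB"), ("bar", "NN")]
converted = [(tok, ptb_to_wordnet(tag)) for tok, tag in tagged]
print(converted)
# [('this', None), ('is', 'v'), ('quick', 'a'), ('quickly', 'r'), ('bar', 'n')]
```

Tokens that come back with None can be skipped, or looked up with wn.synsets(tok) and no POS restriction. TextBlob's `tags` property also yields (word, tag) pairs, so the same function should apply there as long as the tagger in use produces PTB tags.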