Tags: python, python-2.7, encoding, nlp, treetagger

TreeTagger error: "Must use *unicode* string as text to tag"


Following the instructions on TreeTagger's website, I created a directory and downloaded the specified files. I then installed treetaggerwrapper and, going by its documentation, tried to tag some text as follows:

In [40]:
import treetaggerwrapper
tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
tags = tagger.TagText("This is a very short text to tag.")
print tags

This produced the following warnings and error:

WARNING:TreeTagger:Abbreviation file not found: english-abbreviations
WARNING:TreeTagger:Processing without abbreviations file.
ERROR:TreeTagger:Must use *unicode* string as text to tag, not <type 'str'>.

---------------------------------------------------------------------------
TreeTaggerError                           Traceback (most recent call last)
<ipython-input-40-37b912126580> in <module>()
      1 import treetaggerwrapper
      2 tagger = treetaggerwrapper.TreeTagger(TAGLANG='en')
----> 3 tags = tagger.TagText("This is a very short text to tag.")
      4 print tags

/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in TagText(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, encoding, errors)
   1236         return self.tag_text(text, numlines=numlines, tagonly=tagonly,
   1237                  prepronly=prepronly, tagblanks=tagblanks, notagurl=notagurl,
-> 1238                  notagemail=notagemail, notagip=notagip, notagdns=notagdns)
   1239 
   1240     # --------------------------------------------------------------------------

/usr/local/lib/python2.7/site-packages/treetaggerwrapper.pyc in tag_text(self, text, numlines, tagonly, prepronly, tagblanks, notagurl, notagemail, notagip, notagdns, nosgmlsplit)
   1302             # Raise exception now, with an explicit message.
   1303             logger.error("Must use *unicode* string as text to tag, not %s.", type(text))
-> 1304             raise TreeTaggerError("Must use *unicode* string as text to tag.")
   1305 
   1306         if isinstance(text, six.text_type):

TreeTaggerError: Must use *unicode* string as text to tag.

Where can I download the abbreviation files for English and Spanish, and how do I install treetaggerwrapper correctly?


Solution

  • The method only accepts unicode strings. Add a u prefix to the string literal to make it a unicode string:

    tags = tagger.TagText(u"This is a very short text to tag.")
    

    In Python 2, the plain literal "This is a very short text to tag." has type str; with the u prefix it has type unicode:

    In [12]: type("This is a very short text to tag.")
    Out[12]: str
    
    In [13]: type(u"This is a very short text to tag.")
    Out[13]: unicode
    

    If the str comes from another source, such as a file, you need to decode it using the encoding it was written in:

    In [15]: s = "This is a very short text to tag."
    
    In [16]: type(s)
    Out[16]: str
    
    In [17]: type(s.decode("utf-8"))
    Out[17]: unicode
    

    The tagging scripts can be downloaded from the TreeTagger website.
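
The decoding step described above can be wrapped in a small helper, so that text is always unicode before it reaches the tagger. This is only a sketch: to_unicode is a hypothetical name and not part of treetaggerwrapper, and UTF-8 is assumed as the default encoding. It relies on bytes being an alias for str in Python 2, so the same code also runs on Python 3.

```python
def to_unicode(text, encoding="utf-8"):
    """Return text as a unicode string, decoding byte strings if needed.

    Hypothetical helper, not part of treetaggerwrapper.
    The default encoding (UTF-8) is an assumption; pass the real one
    if the bytes came from a file written in another encoding.
    """
    # On Python 2, `bytes` is the same type as `str`.
    if isinstance(text, bytes):
        return text.decode(encoding)
    # Already a unicode string: return it unchanged.
    return text


raw = b"This is a very short text to tag."  # byte string (str on Python 2)
text = to_unicode(raw)
print(type(text))

# With the tagger object from the question, the call would then be:
# tags = tagger.TagText(text)
```

Passing an already-decoded string through the helper is harmless, so it can be applied to every input regardless of where it came from.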