
Chunking with Python-Treetaggerwrapper


TreeTagger can do POS tagging as well as text chunking, which means grouping tokens into verbal and nominal chunks, as in this German example:

$ echo 'Das ist ein Test.' | cmd/tagger-chunker-german
    reading parameters ...
    tagging ...
     finished.
<NC>
Das PDS die
</NC>
<VC>
ist VAFIN   sein
</VC>
<NC>
ein ART eine
Test    NN  Test
</NC>
.   $.  .

I'm trying to do this with treetaggerwrapper in Python (since it's faster than calling TreeTagger directly), but I can't figure out how it's done. The documentation refers to chunking as preprocessing, so I tried this:

tags = tagger.tag_text(u"Dieser Satz ist ein Satz.", prepronly=True)

But the output is just a list of the words with no information added. I'm starting to think that what the wrapper calls chunking is something different from what the actual tagger calls chunking, but maybe I'm just missing something? Any help would be appreciated.


Solution

  • The original poster's assumption is correct. treetaggerwrapper (as of version 2.2.4) uses "chunking" to mean merely "preprocessing of text", and it does not wrap TreeTagger's own chunking capability. From treetaggerwrapper.py:

    • Manage preprocessing of text (chunking) in place of external Perl scripts as in base TreeTagger installation, thus avoid starting Perl each time a piece of text must be tagged.

    Inspecting cmd/tagger-chunker-german, however, shows that getting chunks and tags takes a chain of operations that calls TreeTagger three times:

    $ echo 'Das ist ein Test.' | cmd/tree-tagger-german | perl -nae 'if ($#F==0){print} else {print "$F[0]-$F[1]\n"}' | bin/tree-tagger lib/german-chunker.par -token -sgml -eps 0.00000001 -hyphen-heuristics -quiet | cmd/filter-chunker-output-german.perl | bin/tree-tagger -quiet -token -lemma -sgml lib/german-utf8.par

    whereas treetaggerwrapper's tagging command (shown in tagcmdlist) is a single call (after its own preprocessing of the text) to:

    bin/tree-tagger -token -lemma -sgml -quiet -no-unknown lib/german-utf8.par
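
    The perl one-liner in the chain above rewrites each "word tag lemma" line as "word-tag" before the chunker parameter file is applied. A minimal Python sketch of that step and of the second TreeTagger call might look like the following. The names to_chunker_input and run_chunker and the treetagger_dir parameter are made up for illustration and are not part of treetaggerwrapper; the flags are copied from the shell chain, and the subprocess call is an untested assumption about how one might invoke the binary.

```python
import os
import subprocess


def to_chunker_input(tagged_lines):
    """Rewrite 'word tag lemma' lines as 'word-tag', like the perl one-liner.

    Lines with a single field (e.g. SGML tags) are passed through unchanged;
    blank lines are skipped in this sketch.
    """
    out = []
    for line in tagged_lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            out.append(fields[0])
        else:
            out.append(fields[0] + "-" + fields[1])
    return out


def run_chunker(chunker_input, treetagger_dir):
    """Second TreeTagger call of the chain (untested sketch).

    treetagger_dir is a hypothetical parameter: the TreeTagger install
    directory containing bin/ and lib/. Flags are copied from the shell chain.
    """
    cmd = [
        os.path.join(treetagger_dir, "bin", "tree-tagger"),
        os.path.join(treetagger_dir, "lib", "german-chunker.par"),
        "-token", "-sgml", "-eps", "0.00000001",
        "-hyphen-heuristics", "-quiet",
    ]
    result = subprocess.run(
        cmd,
        input="\n".join(chunker_input) + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout.splitlines()
```

    The lines returned by tagger.tag_text() could be fed to to_chunker_input(). The chunker output would still have to pass through cmd/filter-chunker-output-german.perl and the final tagging pass, as in the original chain.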


    The point of entry to extend it for chunking is the line

    "tagparfile": "german-utf8.par",

    where you would define something like

    "chunkingparfile": "german-chunker.par",

    and issue an additional call to TreeTagger with this other parfile following the tagger-chunker-german operation chain. You'd then probably still have to copy some extra logic from cmd/filter-chunker-output-german.perl though.
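
    Once the chain produces output like the example in the question, grouping the tokens by chunk is the easy part. The sketch below assumes flat, non-nested chunk tags such as <NC> and <VC> and three whitespace-separated fields per token line; parse_chunks is a hypothetical helper name, not something treetaggerwrapper provides.

```python
import re

_OPEN = re.compile(r"^<(\w+)>$")
_CLOSE = re.compile(r"^</(\w+)>$")


def parse_chunks(lines):
    """Group (word, tag, lemma) tokens by the chunk tag that encloses them.

    Returns a list of (label, tokens) pairs; label is None for tokens that
    appear outside any chunk. Nested chunks are not handled in this sketch.
    """
    chunks = []
    label = None
    tokens = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        opened = _OPEN.match(line)
        if opened or _CLOSE.match(line):
            # A chunk boundary: store what has been collected so far.
            if tokens:
                chunks.append((label, tokens))
                tokens = []
            label = opened.group(1) if opened else None
            continue
        fields = line.split(None, 2)
        if len(fields) == 3:
            tokens.append(tuple(fields))
    if tokens:
        chunks.append((label, tokens))
    return chunks
```

    For the "Das ist ein Test." output shown in the question, this yields one pair per chunk plus a final pair labeled None for the sentence-final period.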