python nlp pos-tagger morphological-analysis apertium

Apertium + Python: POS-tagger not providing surface form


I'm trying to POS-tag Italian sentences with Apertium's tagger. According to the Apertium GitHub page, the output should include the surface form alongside the morphological analysis, but I only get the analysis. I need the surface form too, and I cannot infer it: the tagger does not necessarily tag one token at a time, so I cannot simply tokenize the original sentence and loop over it or zip it with the tagger's output.

According to the GitHub page:

In [1]: import apertium
In [2]: tagger = apertium.Tagger('ita')
In [3]: tagger.tag('gatti')
Out[3]: [gatti/gatto<n><m><pl>]

What I got:

In [1]: import apertium
In [2]: tagger = apertium.Tagger('ita')
In [3]: tagger.tag('gatti') # 'gatti' is the surface form
Out[3]: [gatto<n><m><pl>]

How can I get the surface form? If I passed one token at a time this would not be a problem, since I would know what the token is. With a whole sentence, though, I cannot know how the tagger groups the input into units.


Solution

  • By default, creating a tagger for language ita looks for /usr/share/apertium/modes/ita-tagger.mode. This is a shell script that chains several Apertium commands. The Italian tagger script happens to be configured not to include surface forms: its apertium-tagger call is missing the -p option.

    A quick and dirty solution is to open the file with sudo vim /usr/share/apertium/modes/ita-tagger.mode (or sudo nano, or whichever editor you prefer) and add -p to the end of the last command, so that the file looks like this:

    lt-proc -w '/usr/share/apertium/apertium-ita/ita.automorf.bin' | cg-proc '/usr/share/apertium/apertium-ita/ita.rlx.bin' | apertium-tagger -g $2 '/usr/share/apertium/apertium-ita/ita.prob' -p
    

    Then create the tagger again with tagger = apertium.Tagger('ita').


    A sudo-less solution is to copy the mode file, edit the copy, and add its location to the search path; see https://github.com/apertium/apertium-python#installing-more-modes-from-other-language-data
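The sudo-less route can be scripted. The following is a minimal sketch, not a tested recipe: the helper names add_surface_flag and write_patched_mode are made up for illustration, and it assumes the apertium-tagger call is the last command on its line (as in the Italian mode file shown above) and that the library looks for mode files in a modes/ subdirectory of each registered path. Check the linked README section before relying on either assumption.

```python
from pathlib import Path


def add_surface_flag(mode_text: str) -> str:
    """Append -p to any line that calls apertium-tagger and lacks it.

    Assumes apertium-tagger is the last command in the pipeline on that
    line, so appending to the end of the line reaches the right command.
    """
    patched = []
    for line in mode_text.splitlines():
        # Compare whole tokens so that a path containing "-p" is not mistaken for the flag.
        if "apertium-tagger" in line and "-p" not in line.split():
            line = line.rstrip() + " -p"
        patched.append(line)
    return "\n".join(patched) + "\n"


def write_patched_mode(src: Path, dest_root: Path) -> Path:
    """Write a patched copy of src to dest_root/modes/ and return its path.

    The modes/ subdirectory layout is an assumption about how the
    library discovers mode files.
    """
    modes_dir = dest_root / "modes"
    modes_dir.mkdir(parents=True, exist_ok=True)
    dest = modes_dir / src.name
    dest.write_text(add_surface_flag(src.read_text()))
    return dest


# Hypothetical usage (the local directory name is an example only):
#
#   local = Path.home() / "apertium-local"
#   write_patched_mode(Path("/usr/share/apertium/modes/ita-tagger.mode"), local)
#
#   import apertium
#   apertium.append_pair_path(str(local))  # call shown in the linked README section
#   tagger = apertium.Tagger("ita")
```

Note that the copied script still refers to the data files under /usr/share/apertium/apertium-ita/ by absolute path, so only the mode file itself is duplicated. Whether the newly added path takes precedence over the system-wide ita-tagger mode is something to verify on your installation.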