I'm trying to POS-tag some Italian sentences with Apertium's tagger. According to the Apertium GitHub page, the output should include the surface form in addition to the morphological analysis, but I only get the analysis. I want the surface form as well. I cannot infer it, because the tagger doesn't necessarily tag single tokens, so I cannot simply tokenize the original sentence and loop over it or zip it with the tagger's output.
According to the GitHub page:
In [1]: import apertium
In [2]: tagger = apertium.Tagger('ita')
In [3]: tagger.tag('gatti')
Out[3]: [gatti/gatto<n><m><pl>]
What I got:
In [1]: import apertium
In [2]: tagger = apertium.Tagger('ita')
In [3]: tagger.tag('gatti') # 'gatti' is the surface form
Out[3]: [gatto<n><m><pl>]
How can I get the surface form? If I provided one token at a time, this would not be a problem, since I would know what the token is. But for a whole sentence I cannot know how the tagger splits the input into units.
By default, when you create a tagger for the language ita, it looks for /usr/share/apertium/modes/ita-tagger.mode. This is a shell script that calls various Apertium commands. The command in the Italian tagger script happens to be configured not to include surface forms (it's missing the -p option).
A quick and dirty solution is to run sudo vim /usr/share/apertium/modes/ita-tagger.mode (or sudo nano, or whatever your editor is) and add -p to the end of the last command, so the file looks like:
lt-proc -w '/usr/share/apertium/apertium-ita/ita.automorf.bin' | cg-proc '/usr/share/apertium/apertium-ita/ita.rlx.bin' | apertium-tagger -g $2 '/usr/share/apertium/apertium-ita/ita.prob' -p
Then run tagger = apertium.Tagger('ita') again.
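With -p in place, each unit should come back in the surface/analysis shape shown on the GitHub page (gatti/gatto<n><m><pl>). Here is a minimal sketch for pulling the two parts apart. It assumes that calling str() on each unit returned by tagger.tag() gives text in that shape, possibly wrapped in ^...$, which I have not verified for every version. split_unit is a hypothetical helper name, not part of apertium-python.

```python
import re
from typing import Tuple


def split_unit(unit: str) -> Tuple[str, str]:
    """Split 'surface/analysis' text into (surface, analysis).

    Hypothetical helper: assumes the text form of a tagged unit is
    'gatti/gatto<n><m><pl>', optionally wrapped in '^' and '$'.
    """
    text = unit.strip()
    # Drop the optional ^...$ delimiters of the Apertium stream format.
    if text.startswith("^") and text.endswith("$"):
        text = text[1:-1]
    # Split on the first slash that is not escaped with a backslash.
    parts = re.split(r"(?<!\\)/", text, maxsplit=1)
    if len(parts) == 1:
        # No surface form present (tagger ran without -p).
        return "", parts[0]
    return parts[0], parts[1]


surface, analysis = split_unit("gatti/gatto<n><m><pl>")
print(surface, analysis)
```

If the surface part comes back empty, the mode file is probably still being run without -p.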
A sudo-less solution would be to copy the mode file, edit the copy, and add it to the search path; see https://github.com/apertium/apertium-python#installing-more-modes-from-other-language-data
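The "edit the copy" step can be scripted. The following is a rough sketch that only does the text edit: it appends -p to the line containing the apertium-tagger command, as in the manual fix above. add_surface_flag is a hypothetical helper name, and the sample text is the Italian mode line quoted earlier. Where you save the patched copy and how you register it are left to the README section linked above.

```python
def add_surface_flag(mode_text: str) -> str:
    """Return mode-file text with -p appended to the apertium-tagger line.

    Hypothetical helper: assumes apertium-tagger is the last command on
    its line, as in the ita-tagger.mode file shown above.
    """
    patched = []
    for line in mode_text.splitlines():
        # Only touch the tagger line, and only if -p is not already there.
        if "apertium-tagger" in line and "-p" not in line.split():
            line = line.rstrip() + " -p"
        patched.append(line)
    result = "\n".join(patched)
    # Preserve a trailing newline if the original text had one.
    if mode_text.endswith("\n"):
        result += "\n"
    return result


original = (
    "lt-proc -w '/usr/share/apertium/apertium-ita/ita.automorf.bin' | "
    "cg-proc '/usr/share/apertium/apertium-ita/ita.rlx.bin' | "
    "apertium-tagger -g $2 '/usr/share/apertium/apertium-ita/ita.prob'\n"
)
print(add_surface_flag(original))
```

In practice you would read the text from /usr/share/apertium/modes/ita-tagger.mode, write the result to a directory you own, and then point apertium-python at that directory as described in the linked README. Running the function twice is harmless, since it skips lines that already carry -p.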