[SOLVED] Apertium translator. Is there a way to get the original phrase

Is there a way in the Apertium translator to get the original phrase alongside its translation?

I.e., something like:

phrase: {
  original: { Hola, buenos días},
  translated: {Hello, good morning}
}

I need this in order to build a mechanism for improving the translations.

Solution

If you're sending a corpus through the command-line interface, e.g.

xzcat corpus.sme.xz | sed 's/$/ ./' | apertium -f html-noent sme-nob > translated.nob.mt

then you can try simply

xzcat corpus.sme.xz | paste - translated.nob.mt

afterwards to get each input line next to its output line, separated by a tab. That's assuming you want to split things on newlines. The sed appends a full stop to every line to ensure words aren't moved across newlines (rules tend not to move words across sentence boundaries).

This will be fast, but it's a bit hacky and there are many edge cases.
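One edge case that is easy to guard against: paste pairs lines purely by position, so if the translated file ends up with a different number of lines than the input, every pair after the mismatch is wrong. A minimal sketch of a check, assuming both files are uncompressed on disk (paste_checked is a made-up helper name, not part of apertium):

```shell
# Hypothetical helper: paste two files side by side only if they have
# the same number of lines; otherwise print a warning and return non-zero.
paste_checked() {
  # $(( ... )) strips any padding that some wc implementations add
  src_lines=$(( $(wc -l < "$1") ))
  mt_lines=$(( $(wc -l < "$2") ))
  if [ "$src_lines" -ne "$mt_lines" ]; then
    echo "line count mismatch: $1 has $src_lines, $2 has $mt_lines" >&2
    return 1
  fi
  paste "$1" "$2"
}
```

With the corpus decompressed to corpus.sme first, this would be called as paste_checked corpus.sme translated.nob.mt. It only catches differing line counts, not the other edge cases.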

If you want more control, one way would be to install the JSON API locally and send one request at a time.

If you've got a recent Debian/Ubuntu (or are using one of the apertium repos), you can get the API with

sudo apt install apertium-apy
sudo systemctl start apertium-apy   # start it right now
sudo systemctl enable apertium-apy  # let it start on next boot

And then you can translate like this:

$ echo 'Jeg liker ikke ansjos' | curl --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno'
{"responseDetails": null, "responseData": {"translatedText": "Eg likar ikkje ansjos"}, "responseStatus": 200}

(or from JavaScript with standard AJAX requests; there are some docs at http://wiki.apertium.org/wiki/Apertium-apy/Debian and http://wiki.apertium.org/wiki/Apertium-apy#Usage )

Note that apertium-apy by default serves the pairs that are in /usr/share/apertium/modes; if you start it manually (instead of through systemctl) you can point it at a different path.

If you want to produce the JSON format you had in your example, the easiest way would be to use jq (sudo apt install jq). Passing the original in with --arg lets jq do the escaping, so quotes or backslashes in the input don't break the filter, e.g.

$ orig="Jeg liker ikke ansjos"
$ echo "$orig" \
  | curl -Ss --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno' \
  | jq --arg orig "$orig" '{phrase: {original: $orig, translated: .responseData.translatedText}}'
{
  "phrase": {
    "original": "Jeg liker ikke ansjos",
    "translated": "Eg likar ikkje ansjos"
  }
}

or on a corpus:

xzcat corpus.nob.xz | while IFS= read -r orig; do
  echo "$orig" \
    | curl -Ss --data-urlencode 'q@-' 'localhost:2737/translate?langpair=nob|nno' \
    | jq --arg orig "$orig" '{phrase: {original: $orig, translated: .responseData.translatedText}}'
done

(A simple test of 500 lines showed this taking 23.7s wall clock time while the paste version took 5.5s.)
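If the speed of the paste version matters but you still want the JSON shape, one option is to convert paste's tab-separated output with jq afterwards. This is only a sketch, not something timed on a real corpus: pairs_to_json is a made-up name, and it assumes that no input or output line contains a literal tab character.

```shell
# Hypothetical helper: read "original<TAB>translated" lines on stdin and
# print one compact JSON object per line.
# Assumes neither column contains a tab character.
pairs_to_json() {
  # -R reads each line as a raw string, -c prints one object per line
  jq -R -c 'split("\t") | {phrase: {original: .[0], translated: .[1]}}'
}
```

It would sit at the end of the earlier pipeline, i.e. xzcat corpus.sme.xz | paste - translated.nob.mt | pairs_to_json. Since jq builds the strings itself, quotes in the text are escaped correctly.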