machine-translation

Translating parts of sentences based on their context


I am working on an application that needs to be able to translate parts of sentences. The problem is that if I send the parts to a translation API like Google Translate, the translations often don't make sense in the context they occurred in. Example:

He leaves the building

If I translate leaves to any destination language, I will probably get a result in the sense of "leaves of a tree", which of course makes no sense in the example. So the translation needs to take context into account. If I expand the text to translate to He leaves, I get the correct translation of He leaves. However, I lose the translation of leaves on its own, which is the word I am looking for.

Does anyone have any idea how I should approach this? Keep in mind that the Google Translate API is a paid API, so I would like to minimize the number of translations I request from it.


Solution

  • You are right that translating words without sentence context is hopeless.

    (Whereas translating sentences without paragraph context can work most of the time, for many content types.)

    Luckily, the Google Translate API, like the Chrome integration and most machine translation APIs, is smart about HTML tags (the default parameters include format=html).

    So one good option is to wrap the word or phrase you are interested in with an HTML tag such as <span>.

    You can try this in the API console.

    It should be easy to parse the contents of the HTML tag back out of the translation. You can then lemmatise the result.

    Note 1:
    The consumer-facing standalone Google Translate UI does not expose this option. To try it, you must translate via the API console or programmatically, or translate pages with Chrome.

    Note 2:
    There are some nuances because words do not inherently map 1:1 between languages. Sometimes one word becomes two, and occasionally a word has no representation at all in the target language.

    1:2 example:
    en: He <span>left</span> the building.
    it: Ha <span>lasciato</span> l'edificio.
    [Arguably the ha should also be included.]

    1:0 example:
    en: How <span>are</span> you?
    ru: Как вы?
    [to be is usually dropped in Russian.]
    en: How are <span>you</span>?
    it: Come stai?
    [Pronouns are often dropped in Italian.]

    2:1 example:
    en: He is always <span>screwing</span> things up.
    it: Sempre <span>spiegazza</span> le cose.
    [English, like some other languages, has separable verbs. The actual input here is "to screw up", not "to screw".]

    Handling these cases takes some work on your side, but the mismatch is itself very useful information. In any case, it is easy to process lasciato to get the correct lemma, lasciare.

    See cloud.google.com/translate/docs/reference/rest for more parameter documentation.
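The wrap-and-parse step described above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation. The helper names (wrap_target, extract_target, translate_in_context) are hypothetical, and the translate argument is a stand-in for whatever function you use to call the translation API in HTML mode. It assumes the API returns the translated sentence as an HTML string with the <span> preserved, and it returns None for the 1:0 case where the tag does not survive.

```python
import html
import re
from typing import Callable, Optional

# Matches the first <span>...</span> pair in the translated HTML.
_SPAN_RE = re.compile(r"<span[^>]*>(.*?)</span>", re.IGNORECASE | re.DOTALL)


def wrap_target(sentence: str, start: int, end: int) -> str:
    """Return the sentence as HTML with sentence[start:end] wrapped in <span>."""
    # Escape &, < and > so the plain text is valid HTML; quotes are left alone.
    return (
        html.escape(sentence[:start], quote=False)
        + "<span>"
        + html.escape(sentence[start:end], quote=False)
        + "</span>"
        + html.escape(sentence[end:], quote=False)
    )


def extract_target(translated_html: str) -> Optional[str]:
    """Return the text inside the first <span>, or None if it is missing or empty."""
    match = _SPAN_RE.search(translated_html)
    if match is None:
        return None  # 1:0 case: the word has no counterpart in the translation
    text = html.unescape(match.group(1)).strip()
    return text or None


def translate_in_context(
    sentence: str,
    start: int,
    end: int,
    translate: Callable[[str], str],  # assumption: HTML string in, HTML string out
) -> Optional[str]:
    """Translate the whole sentence once and return only the marked part."""
    return extract_target(translate(wrap_target(sentence, start, end)))
```

For "He leaves the building", wrap_target(sentence, 3, 9) produces "He <span>leaves</span> the building", which is what gets sent to the API. The whole sentence is translated in a single request, so the marked word keeps its context, and the extracted text can then be passed to a lemmatiser.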