We are using Azure Cognitive Search to index various documents, e.g. Word or PDF files, which are stored in Azure Blob Storage. We would like to be able to translate the extracted content of non-English documents and store the translation result into a dedicated field in the index.
Currently the built-in Text Translation cognitive skill accepts up to 50,000 characters of input, while our documents can contain up to 1 MB of text. According to the documentation it's possible to split the text into chunks with the built-in Split Skill, but there seems to be no skill that merges the translated chunks back together. Our goal is to have all the extracted text translated and stored in a single index field of type Edm.String, not an array.
Is there any way to translate large text blocks when indexing, other than creating a custom Cognitive Skill via Web API for that purpose?
Yes, the Merge Skill will actually do this. Define the skill in your skillset as shown below. The "text" and "offsets" inputs to this skill are optional; use "itemsToInsert" to specify the text you want to merge together (point its source at your translation output). Use insertPreTag and insertPostTag if you want to insert, for example, a space before or after each merged section.
{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "description": "Merge text back together",
  "context": "/document",
  "insertPreTag": "",
  "insertPostTag": "",
  "inputs": [
    {
      "name": "itemsToInsert",
      "source": "/document/translation_output/*/text"
    }
  ],
  "outputs": [
    {
      "name": "mergedText",
      "targetName": "merged_text_field_in_your_index"
    }
  ]
}
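For context, a complete split → translate → merge chain might look like the sketch below. This is an illustration, not a drop-in skillset: the names "pages", "translated_text", and "merged_translated_text", the target language code "en", and the page length are assumptions you would adjust for your index.

```json
{
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "description": "Split extracted content into chunks the translation skill can accept",
      "context": "/document",
      "textSplitMode": "pages",
      "maximumPageLength": 50000,
      "inputs": [
        { "name": "text", "source": "/document/content" }
      ],
      "outputs": [
        { "name": "textItems", "targetName": "pages" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.TranslationSkill",
      "description": "Translate each chunk (runs once per page)",
      "context": "/document/pages/*",
      "defaultToLanguageCode": "en",
      "inputs": [
        { "name": "text", "source": "/document/pages/*" }
      ],
      "outputs": [
        { "name": "translatedText", "targetName": "translated_text" }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "description": "Merge the translated chunks back into one string",
      "context": "/document",
      "insertPreTag": "",
      "insertPostTag": " ",
      "inputs": [
        { "name": "itemsToInsert", "source": "/document/pages/*/translated_text" }
      ],
      "outputs": [
        { "name": "mergedText", "targetName": "merged_translated_text" }
      ]
    }
  ]
}
```

You would then map "/document/merged_translated_text" to your Edm.String index field via the indexer's outputFieldMappings.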