I have an app where I index data in OpenSearch. The data model is largely defined by the end users rather than by me, so when a user says they have a "string" field, I index it as both a text and a keyword field, because I don't know whether it's a short, enum-style string or long-form text. My field mappings look like:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
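(For reference, a mapping like this can be applied to every user-defined string field with a dynamic template roughly like the following sketch; the index name user-data and the template name strings_as_text_and_keyword are just placeholders for illustration.)

PUT /user-data
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_text_and_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    ]
  }
}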
The problem arises when a user then supplies long-form text, and I get errors like:
Document contains at least one immense term in field="example_field.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.
I've tried setting ignore_above like so:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 20000
}
}
}
But it looks like this doesn't actually prevent immense terms in keyword fields.
Ideally, my app would distinguish between short and long text fields so that I didn't have to index everything as both text and keyword. But since that isn't the case right now, is there a way to limit the maximum length of the keyword field without also limiting the text field?
The problem was that I had set ignore_above too high:
"example_field": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 20000
}
}
}
That will work for English text, but not for, say, Chinese text: ignore_above sets a limit in characters, while Lucene's internal limit of 32766 is in bytes, and a single character can take more than one byte when encoded. A safe setting is ignore_above: 8000. A Unicode character is at most 4 bytes in UTF-8, and 8000 * 4 = 32000 bytes, which is still below the limit.
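Putting that together, a corrected mapping could look like the sketch below (the index name user-data is a placeholder). The keyword sub-field simply skips values longer than 8000 characters, so they won't be available for exact matching or aggregations on example_field.keyword, but the text field still indexes the full content:

PUT /user-data
{
  "mappings": {
    "properties": {
      "example_field": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 8000
          }
        }
      }
    }
  }
}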