elasticsearch, opensearch

How can I handle large keyword fields in OpenSearch?


I have an app where I index data in OpenSearch. The data model is largely defined by the end users, not by me. So when a user says they have a "string" field, I don't know whether it holds short, enum-style strings or long-form text, and I index it as both a text and a keyword field in OpenSearch. My field mappings look like:

"example_field": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword"
    }
  }
}
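
For context, here is roughly how the two sub-fields get used (the field name comes from the mapping above; the query and aggregation values are just illustrative). Full-text search goes against the analyzed text field, while exact matching and aggregations go against the keyword sub-field:

{
  "query": {
    "match": { "example_field": "some search terms" }
  },
  "aggs": {
    "distinct_values": {
      "terms": { "field": "example_field.keyword" }
    }
  }
}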

The problem arises when a user then supplies long-form text, and I get errors like:

Document contains at least one immense term in field="example_field.keyword" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms.

I've tried setting ignore_above like so:

"example_field": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 20000
    }
  }
}

But it looks like this doesn't actually prevent immense terms in keyword fields.

Ideally, my app would distinguish between short and long text fields so that I didn't have to index everything as both text and keyword. But since that isn't the case today, is there a way to limit the maximum length of the keyword field, but not of the text field?


Solution

  • The problem was that I had set ignore_above too high:

    "example_field": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 20000
        }
      }
    }
    

    That will work for English text, but not for, e.g., Chinese text: ignore_above sets a limit in characters, while Lucene's internal limit of 32766 is in bytes. An ASCII character is 1 byte in UTF-8, but a Chinese character is 3 bytes, so a 20000-character value can take up to 60000 bytes and still exceed the limit.

    So a safe limit is ignore_above: 8000: UTF-8 encodes any Unicode character in at most 4 bytes, and 8000 * 4 = 32000 bytes, which stays below the 32766-byte limit.
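
    The fixed mapping is then the same as before, with only the limit lowered:

    "example_field": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword",
          "ignore_above": 8000
        }
      }
    }

    With this in place, values longer than 8000 characters are still indexed and searchable via the text field; they are simply skipped for the keyword sub-field instead of triggering the immense-term error.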