
Which Elasticsearch string datatype should I use if only the exists filter is used?


I'm using Elasticsearch version 6.8. I want to store an identifier (a string with a combination of letters, numbers, and possibly whitespace). The only filter I will use on that field will be the exists filter (I will check if the value is set). What is the best option here, to use the keyword type or a text type? For the text type I can probably set

  "norms": false,
  "index_options": "freqs"

to reduce the index size.

The documentation states that the keyword type is the best option for "structured" content like this, but since the number of distinct values is huge (it's an ID), I'm afraid this would take a lot of disk space.

I have an index with millions of records so I want to keep the disk usage low for this field. Which option is the best regarding the disk space, and what is the performance impact?


Solution

  • Since you don't want to search on the values of this field or run aggregations on them, you should store this field as keyword with doc_values disabled.

    "fieldName": { 
        "type":       "keyword",
        "doc_values": false
    }
    

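    Disabling doc_values saves disk space, because the on-disk columnar structure used for sorting and aggregations is no longer built for this field. The field is still indexed, so the exists filter keeps working.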

    Fields mapped as text do not have doc_values enabled and could use less space, but they are analyzed and could take up space in memory.

    If you don't care about the value of the field at all, you can even replace it with a short constant string or a single digit during ingestion, depending on how you ingest your data.
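To make the suggestions above concrete, here is a minimal sketch that only builds the request bodies as plain Python dicts and does not talk to a cluster. The field name `identifier`, the type name `_doc`, and the helper names are assumptions made for illustration. The mapping is nested under a type name because that is the default shape for index creation in 6.x. The last helper sketches the "replace the value with a constant" idea; where that step would run in your ingestion pipeline is left open.

```python
import json

FIELD = "identifier"  # hypothetical field name, stands in for "fieldName"
DOC_TYPE = "_doc"     # assumed type name; 6.x mappings are nested under a type


def build_index_body(field=FIELD, doc_type=DOC_TYPE):
    """Body for PUT /<index>: keyword field with doc_values disabled."""
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "keyword", "doc_values": False}
                }
            }
        }
    }


def build_exists_search(field=FIELD):
    """Body for GET /<index>/_search: keep documents where the field is set."""
    return {"query": {"bool": {"filter": {"exists": {"field": field}}}}}


def to_placeholder(doc, field=FIELD, placeholder="1"):
    """Return a copy of doc whose field value, if set, is a short constant."""
    out = dict(doc)
    if out.get(field) is not None:
        out[field] = placeholder
    return out


if __name__ == "__main__":
    print(json.dumps(build_index_body(), indent=2))
    print(json.dumps(build_exists_search(), indent=2))
    print(to_placeholder({"identifier": "AB 123", "title": "example"}))
```

The printed JSON can be sent with any HTTP client or pasted into the Kibana console. Comparing index size with and without the placeholder step (for example via the `_cat/indices` API) on a sample of your data is the most reliable way to see how much space each option actually saves.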