Tags: sorting, elasticsearch, elasticsearch-aggregation, elasticsearch-6

Elasticsearch - Sort results of Terms aggregation by key string length


I am querying Elasticsearch with a terms aggregation to find the first N unique values of a string field foo, where the value contains the substring bar and the document matches some other constraints.

Currently I am able to sort the results by the key string alphabetically:

{
  "query": {other constraints},
  "aggs": {
    "my_values": {
      "terms": {
        "field": "foo.raw",
        "include": ".*bar.*",
        "order": {"_key": "asc"},
        "size": N
      }
    }
  }
}

This gives results like

{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,   
      "sum_other_doc_count": 145,           
      "buckets": [                        
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        },
        {
          "key": "z_bar_z",
          "doc_count": 1
        }
      ]
    }
  }
}

How can I change the order option so that the buckets are sorted by the length of the key string (the value of foo), giving results like this:

{
  ...
  "aggregations": {
    "my_values": {
      "doc_count_error_upper_bound": 0,   
      "sum_other_doc_count": 145,           
      "buckets": [                        
        {
          "key": "z_bar_z",
          "doc_count": 1
        },
        {
          "key": "aa_bar_aa",
          "doc_count": 1
        },
        {
          "key": "iii_bar_iii",
          "doc_count": 1
        }
      ]
    }
  }
}

This is desired because a shorter string is closer to the search substring, so it is considered a 'better' match and should appear earlier in the results than a longer string. Any alternative way to sort the buckets by how similar they are to the original substring would also be helpful.

I need the sorting to occur in ES so that I only have to load the top N results from ES.


Solution

  • I worked out a way to do this. I added a max sub-aggregation to each bucket that uses a script to calculate the length of the key string. I was then able to sort by this computed length first and by the key itself second, so that keys of the same length are sorted alphabetically.

    {
      "query": {other constraints},
      "aggs": {
        "my_values": {
          "terms": {
            "field": "foo.raw",
            "include": ".*bar.*",
            "order": [
              {"key_length": "asc"},
              {"_key": "asc"}
            ],
            "size": N
          },
          "aggs": {
            "key_length": {
              "max": {"script": "doc['foo.raw'].value.length()" }
            }
          }
        }
      }
    }
    

    This gave me results like

    {
      ...
      "aggregations": {
        "my_values": {
          "doc_count_error_upper_bound": 0,   
          "sum_other_doc_count": 145,           
          "buckets": [                        
            {
              "key": "z_bar_z",
              "doc_count": 1
            },
            {
              "key": "aa_bar_aa",
              "doc_count": 1
            },
            {
              "key": "dd_bar_dd",
              "doc_count": 1
            },
            {
              "key": "bbb_bar_bbb",
              "doc_count": 1
            }
          ]
        }
      }
    }
    

    which is what I wanted.
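
For anyone building this request from code, here is a minimal Python sketch of the same idea. It is an illustration, not part of the original answer: the helper names build_request and expected_order are hypothetical, the "query" part with the other constraints is left for the caller to add, and the substring is assumed to contain no regex metacharacters. The second helper only emulates the intended bucket order locally (length first, then key), which can be handy for checking results; it assumes ASCII keys, where Python's len and string comparison agree with what the script and _key ordering produce.

```python
import json


def build_request(substring, size, field="foo.raw"):
    """Build a terms aggregation ordered by key length, then by key.

    Hypothetical helper: the caller still has to add a "query" entry
    with the other constraints before sending the body.
    """
    return {
        "aggs": {
            "my_values": {
                "terms": {
                    "field": field,
                    # Assumes `substring` has no regex metacharacters.
                    "include": ".*" + substring + ".*",
                    "order": [{"key_length": "asc"}, {"_key": "asc"}],
                    "size": size,
                },
                "aggs": {
                    # Every document in a bucket has the same key, so
                    # max simply yields the length of that key.
                    "key_length": {
                        "max": {"script": "doc['" + field + "'].value.length()"}
                    }
                },
            }
        }
    }


def expected_order(keys, substring, size):
    """Emulate the bucket order locally: shortest key first, ties by key."""
    matching = {k for k in keys if substring in k}
    return sorted(matching, key=lambda k: (len(k), k))[:size]


if __name__ == "__main__":
    print(json.dumps(build_request("bar", 4), indent=2))
    keys = ["aa_bar_aa", "iii_bar_iii", "z_bar_z", "dd_bar_dd", "bbb_bar_bbb"]
    print(expected_order(keys, "bar", 4))
```

With the sample keys above and a size of 4, the local emulation gives z_bar_z, aa_bar_aa, dd_bar_dd, bbb_bar_bbb, the same order as the buckets shown in the answer.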