
Elasticsearch Multi-term aggregations to retrieve duplicates


In my Elasticsearch index I have duplicate docs: several documents share the same values in fields that are supposed to be unique.

To fix them, I first have to find them, so I'm using a terms aggregation with min_doc_count=2. The problem is that I can only get it to work with a single key, not with a pair of keys. This works:

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "receipts": {
         "terms": {
            "field": "key1",
            "min_doc_count": 2,
            "size": 1000000
          }
      }
  }
}

I'd like to match on two fields simultaneously, but how do I add a second field, key2?

Any idea?

I tried a multi-terms aggregation, like this (I don't know if the syntax is correct):

GET /my_index/_search
{
   "size": 0,
   "aggs": {
      "receipts": {
          "multi_terms": {
            "terms": [
              {
                "field": "key1" 
              }, 
              {
                "field": "key2"
              }
            ],
            "min_doc_count": 2,
            "size": 1000000
       }
   }
  }
}

but I get this error:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "parsing_exception",
        "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
        "line" : 5,
        "col" : 26
      }
    ],
    "type" : "parsing_exception",
    "reason" : "Unknown aggregation type [multi_terms] did you mean [rare_terms]?",
    "line" : 5,
    "col" : 26,
    "caused_by" : {
      "type" : "named_object_not_found_exception",
      "reason" : "[5:26] unknown field [multi_terms]"
    }
  },
  "status" : 400
}


Solution

  • You can also do this with a script:

    GET /docs/_search
    {
      "size": 0,
      "aggs": {
        "receipts": {
          "terms": {
            "script": "doc['key1'].value + '_' + doc['key2'].value",
            "min_doc_count": 2,
            "size": 1000000
          }
        }
      }
    }
    

    Be aware that a script-based terms aggregation can be noticeably slower than a plain terms aggregation on a single field, because the script has to run for every matching document.

    Here are some sample documents:

    POST docs/_doc
    {
      "key1": 1,
      "key2": 2
    }
    POST docs/_doc
    {
      "key1": 1,
      "key2": 2
    }
    POST docs/_doc
    {
      "key1": 2,
      "key2": 1
    }
    

    and the result of the query above:

    {
      "took": 6,
      "timed_out": false,
      "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
      },
      "hits": {
        "total": {
          "value": 3,
          "relation": "eq"
        },
        "max_score": null,
        "hits": []
      },
      "aggregations": {
        "receipts": {
          "doc_count_error_upper_bound": 0,
          "sum_other_doc_count": 0,
          "buckets": [
            {
              "key": "1_2",
              "doc_count": 2
            }
          ]
        }
      }
    }
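
If you need to run this from code rather than the console, the request body and the response handling could be sketched roughly as below. This is only an illustration in plain Python: the helper names `build_duplicate_query` and `duplicate_keys` are made up for this example, the aggregation name `receipts` and the `_` separator are taken from the query above, and actually sending the body to your cluster is left out.

```python
def build_duplicate_query(fields, separator="_", min_doc_count=2, size=1000000):
    """Build a terms aggregation whose key is a script concatenating the fields.

    Hypothetical helper: it only assembles the request body shown above.
    """
    # Produces e.g.: doc['key1'].value + '_' + doc['key2'].value
    source = f" + '{separator}' + ".join(f"doc['{name}'].value" for name in fields)
    return {
        "size": 0,
        "aggs": {
            "receipts": {
                "terms": {
                    "script": source,
                    "min_doc_count": min_doc_count,
                    "size": size,
                }
            }
        },
    }


def duplicate_keys(response, agg_name="receipts"):
    """Map each combined key returned by the aggregation to its doc_count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Aggregation part of the sample response shown above.
sample_response = {
    "aggregations": {
        "receipts": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{"key": "1_2", "doc_count": 2}],
        }
    }
}

body = build_duplicate_query(["key1", "key2"])
print(body["aggs"]["receipts"]["terms"]["script"])
print(duplicate_keys(sample_response))  # {'1_2': 2}
```

One thing to keep in mind with this approach: if the key fields are strings that can themselves contain the separator, two different pairs could produce the same combined key, so pick a separator that cannot occur in your data.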