Tags: elasticsearch, pyelasticsearch, elasticsearch-dsl

How to do a bucket aggregation on a multi-value field in Elasticsearch


Say each document in my Elasticsearch index is a blog post with only two fields, title and tags. The title field is just a string, while tags is a multi-value field.

If I have three documents like this:

title      tags
"blog1"    [A,B,C]
"blog2"    [A,B]
"blog3"    [B,C]

I would like to bucket by the unique values across all tags. How can I get results like the ones below, with one bucket per tag (three in this example), each listing the titles of the posts that carry that tag? Or are there any more efficient alternatives?

{A: ["blog1", "blog2"]}
{B: ["blog1", "blog2", "blog3"]}
{C: ["blog1", "blog3"]}

It would be nice if someone could provide an answer using the Elasticsearch Python API.


Solution

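  • You can use a terms aggregation on the tags field with a top_hits sub-aggregation inside it. The terms aggregation creates one bucket per distinct tag value, so a post with several tags lands in several buckets, and top_hits returns the matching documents for each bucket. This assumes tags is mapped as a keyword field; with the default dynamic mapping for strings, aggregate on tags.keyword instead. With the following query, you'll get the expected results for the three example posts.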

    {
        "size": 0,
        "aggs": {
            "tags": {
                "terms": { 
                    "field": "tags" 
                },
                "aggs": {
                    "top_titles": {
                        "top_hits": {
                            "_source": ["title"]
                        }
                    }
                }
            }
        }
    }
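One caveat about the defaults: a terms aggregation returns only the 10 most frequent buckets, and top_hits returns only the 3 top hits per bucket. That happens to cover the three-post example, but a larger index needs explicit size values. A minimal sketch of a helper that builds the same body with both limits exposed follows; the name build_tags_query and its parameters are hypothetical, not part of any library.

```python
def build_tags_query(field="tags", max_tags=10, titles_per_tag=3):
    """Build the search body for 'titles grouped by tag'.

    max_tags       -> "size" of the terms aggregation (how many tag buckets)
    titles_per_tag -> "size" of the top_hits sub-aggregation (hits per bucket)
    Both defaults mirror Elasticsearch's own defaults.
    """
    return {
        "size": 0,  # skip the regular hits; only the aggregations are needed
        "aggs": {
            "tags": {
                "terms": {"field": field, "size": max_tags},
                "aggs": {
                    "top_titles": {
                        "top_hits": {
                            "size": titles_per_tag,
                            "_source": ["title"],
                        }
                    }
                },
            }
        },
    }
```

For example, build_tags_query(max_tags=50, titles_per_tag=20) asks for up to 50 tags with up to 20 titles each. If every title for every tag is needed on a large index, a plain search sorted or filtered by tag may scale better than raising these limits.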
    

    Using this with Python is straightforward:

    from elasticsearch import Elasticsearch
    # assumes a node running locally; adjust the URL for your cluster
    client = Elasticsearch("http://localhost:9200")
    
    response = client.search(
        index="my-index",
        body={
            "size": 0,
            "aggs": {
                "tags": {
                    "terms": {
                        "field": "tags"
                    },
                    "aggs": {
                        "top_titles": {
                            "top_hits": {
                                "_source": ["title"]
                            }
                        }
                    }
                }
            }
        }
    )
    
    # iterate over the tag buckets
    for bucket in response['aggregations']['tags']['buckets']:
        tag = bucket['key'] # => A, B, C
        # collect the titles for this tag; keep the bucket in its own
        # variable so the tag name does not overwrite it
        for hit in bucket['top_titles']['hits']['hits']:
            title = hit['_source']['title'] # => blog1, blog2, ...
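To get exactly the {tag: [titles]} shape from the question, the loop can be folded into a small function. This is only a sketch: titles_by_tag is a hypothetical helper name, and sample_response is hand-written data trimmed to the fields the function reads, not output captured from a real cluster.

```python
def titles_by_tag(response):
    """Map each tag bucket to the titles returned by its top_hits sub-aggregation."""
    result = {}
    for bucket in response["aggregations"]["tags"]["buckets"]:
        hits = bucket["top_titles"]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"]["title"] for hit in hits]
    return result


def _hits(*titles):
    # Helper for the sample only: wrap titles in the top_hits response layout.
    return {"hits": {"hits": [{"_source": {"title": t}} for t in titles]}}


# Hand-written stand-in for the search response of the three example posts.
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                {"key": "B", "doc_count": 3, "top_titles": _hits("blog1", "blog2", "blog3")},
                {"key": "A", "doc_count": 2, "top_titles": _hits("blog1", "blog2")},
                {"key": "C", "doc_count": 2, "top_titles": _hits("blog1", "blog3")},
            ]
        }
    }
}

grouped = titles_by_tag(sample_response)
print(grouped)
```

The same function can be called on the value returned by client.search(...), since it only relies on dictionary-style access. Each list holds at most as many titles as the top_hits size allows.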