Say each document in my Elasticsearch index is a blog post with only two fields, title and tags. The title field is a plain string, while tags is a multi-valued field.
If I have three documents like this:
title tags
"blog1" [A,B,C]
"blog2" [A,B]
"blog3" [B,C]
I would like to bucket by the unique values of all possible tags. How can I get results like the ones below, with one bucket per tag (three in this case), each listing the matching titles? Or are there more efficient alternatives?
{A: ["blog1", "blog2"]}
{B: ["blog1", "blog2", "blog3"]}
{C: ["blog1", "blog3"]}
It would be nice if someone could provide an answer using the Elasticsearch Python API.
You can simply use a terms aggregation on the tags field with a nested top_hits sub-aggregation. With the following query, you'll get the expected results.
{
  "size": 0,
  "aggs": {
    "tags": {
      "terms": {
        "field": "tags"
      },
      "aggs": {
        "top_titles": {
          "top_hits": {
            "_source": ["title"]
          }
        }
      }
    }
  }
}
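One thing to keep in mind: without an explicit size, the terms aggregation returns only the 10 most frequent tags and top_hits returns only 3 hits per bucket. That is enough for the three sample documents, but probably not for a real index. Below is a minimal sketch of the same request body with both sizes set explicitly. The function name build_tags_query and the default values 50 and 10 are hypothetical, chosen only for illustration, and the sketch assumes tags is mapped as a keyword field.

```python
from typing import Any, Dict


def build_tags_query(terms_size: int = 50, hits_size: int = 10) -> Dict[str, Any]:
    """Build the terms + top_hits request body with explicit sizes.

    terms_size: maximum number of tag buckets to return (default in ES is 10).
    hits_size:  maximum number of titles per tag bucket (default in ES is 3).
    Both default values here are illustrative assumptions.
    """
    return {
        "size": 0,  # skip the regular search hits, only aggregations are needed
        "aggs": {
            "tags": {
                "terms": {"field": "tags", "size": terms_size},
                "aggs": {
                    "top_titles": {
                        "top_hits": {"_source": ["title"], "size": hits_size}
                    }
                },
            }
        },
    }


query = build_tags_query()
```

The resulting dict can be passed as the request body in place of the literal query. If a tag can match a very large number of posts, top_hits is not a good fit for listing all of them, and a separate filtered search per tag may be more appropriate.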
Using this with Python is straightforward:
from elasticsearch import Elasticsearch

client = Elasticsearch()
response = client.search(
    index="my-index",
    body={
        "size": 0,
        "aggs": {
            "tags": {
                "terms": {
                    "field": "tags"
                },
                "aggs": {
                    "top_titles": {
                        "top_hits": {
                            "_source": ["title"]
                        }
                    }
                }
            }
        }
    }
)
# parse the tags
for bucket in response['aggregations']['tags']['buckets']:
    tag = bucket['key']  # => A, B, C
    # parse the titles for the tag (read from the bucket, not from the key string)
    for hit in bucket['top_titles']['hits']['hits']:
        title = hit['_source']['title']  # => blog1, blog2, ...
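To get the exact tag-to-titles shape asked for in the question, the loop above can be folded into a small function. This is a sketch only: titles_by_tag and _bucket are hypothetical names, and sample_response is a hand-written stand-in that contains just the fields the parser reads, not real server output, so the snippet runs without a cluster.

```python
from typing import Any, Dict, List


def titles_by_tag(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each tag bucket key to the titles found in its top_hits."""
    result: Dict[str, List[str]] = {}
    for bucket in response["aggregations"]["tags"]["buckets"]:
        hits = bucket["top_titles"]["hits"]["hits"]
        result[bucket["key"]] = [hit["_source"]["title"] for hit in hits]
    return result


def _bucket(key: str, titles: List[str]) -> Dict[str, Any]:
    # Hypothetical helper: builds a bucket holding only the fields read above.
    return {
        "key": key,
        "doc_count": len(titles),
        "top_titles": {
            "hits": {"hits": [{"_source": {"title": t}} for t in titles]}
        },
    }


# Hand-written stand-in for the search response of the three sample posts.
sample_response = {
    "aggregations": {
        "tags": {
            "buckets": [
                _bucket("B", ["blog1", "blog2", "blog3"]),
                _bucket("A", ["blog1", "blog2"]),
                _bucket("C", ["blog1", "blog3"]),
            ]
        }
    }
}

grouped = titles_by_tag(sample_response)
print(grouped)
```

With a live cluster, the same function should accept the value returned by client.search, since it only uses key-based access on the response.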