I am facing an issue with an aggregation query in Elasticsearch. The issue is as follows:
I am asking for two different aggs in the same query. The first is “show me the doc counts for subject.label for these specific values,” and the second is “show me the doc counts for the 5 most common values within subject.label.”
The query I tried:
POST my_index/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "selected": {
      "terms": {
        "field": "subject.label",
        "include": [ "Buddhist art" ],
        "order": { "_count": "desc" },
        "size": 5
      }
    },
    "subject": {
      "terms": {
        "field": "subject.label",
        "order": { "_count": "desc" },
        "size": 5
      }
    }
  }
}
I got the following result:
{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "userFacets" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Buddhist art",
          "doc_count" : 12
        }
      ]
    },
    "subject" : {
      "doc_count_error_upper_bound" : 11,
      "sum_other_doc_count" : 1005,
      "buckets" : [
        {
          "key" : "Architecture",
          "doc_count" : 88
        },
        {
          "key" : "Painting",
          "doc_count" : 80
        },
        {
          "key" : "Berkeley (Calif.)",
          "doc_count" : 25
        },
        {
          "key" : "Buddhist art",
          "doc_count" : 11
        },
        {
          "key" : "Church architecture",
          "doc_count" : 10
        }
      ]
    }
  }
}
How can the value for Buddhist art be 12 in the first agg result and 11 in the second? It's a single agg query on a single index. (There are, in fact, 12 docs with a subject.label value of “Buddhist art”.)
(I took this example from another post: https://discuss.elastic.co/t/different-aggregation-count-for-the-same-value/324566)
Thank you.
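The terms aggregation returns approximate results. In the ES response you shared I can see the following, which means some documents may have been left out of the count for the sake of search speed: each shard returns only its top shard_size terms, so a term that ranks low on one shard is missing from that shard's contribution.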
"subject" : {
  "doc_count_error_upper_bound" : 11,
  "sum_other_doc_count" : 1005,
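To make that check concrete, here is a minimal Python sketch, assuming the search response has already been parsed into a dict. The helper name `may_be_approximate` is hypothetical; the two sample dicts copy the values from the response in the question.

```python
def may_be_approximate(agg: dict) -> bool:
    """Return True when the terms aggregation reports a non-zero error bound."""
    # doc_count_error_upper_bound > 0 means at least one shard may have held
    # back documents for a term, so the merged doc_count values can be too low.
    return agg.get("doc_count_error_upper_bound", 0) > 0


# Values copied from the two aggregations in the response above.
selected = {"doc_count_error_upper_bound": 0, "sum_other_doc_count": 0}
subject = {"doc_count_error_upper_bound": 11, "sum_other_doc_count": 1005}

print(may_be_approximate(selected))  # False
print(may_be_approximate(subject))   # True
```

The first aggregation reports an error bound of 0, which is consistent with its count of 12 being the real one.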
You can increase the shard_size to get more accurate results.
"subject": {
  "terms": {
    "field": "subject.label",
    "order": { "_count": "desc" },
    "size": 5,
    "shard_size": 10000
  }
}
shard_size can be between 1 and 2147483647.
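If you build the request in code, the same aggregation can be assembled as a plain dict. This is only a sketch: `terms_agg` and `default_shard_size` are hypothetical helper names, and the default of (size * 1.5 + 10) is the value given in the Elasticsearch documentation for the terms aggregation.

```python
import json

MAX_SHARD_SIZE = 2147483647  # upper limit mentioned above


def default_shard_size(size: int) -> int:
    """Default shard_size when none is given: (size * 1.5 + 10)."""
    return int(size * 1.5 + 10)


def terms_agg(field: str, size: int, shard_size=None) -> dict:
    """Build a terms aggregation body, optionally with an explicit shard_size."""
    body = {"field": field, "order": {"_count": "desc"}, "size": size}
    if shard_size is not None:
        if not 1 <= shard_size <= MAX_SHARD_SIZE:
            raise ValueError("shard_size must be between 1 and 2147483647")
        body["shard_size"] = shard_size
    return {"terms": body}


aggs = {"subject": terms_agg("subject.label", size=5, shard_size=10000)}
print(default_shard_size(5))  # 17
print(json.dumps(aggs, indent=2))
```

With size 5, each shard returns only its top 17 terms by default, which is why a term that is rare on one shard can lose documents there.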
From official documentation:
Even with a larger shard_size value, doc_count values for a terms aggregation may be approximate. As a result, any sub-aggregations on the terms aggregation may also be approximate. sum_other_doc_count is the number of documents that didn’t make it into the top size terms. If this is greater than 0, you can be sure that the terms agg had to throw away some buckets, either because they didn’t fit into size on the coordinating node or they didn’t fit into shard_size on the data node.
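To see how this produces 12 in one aggregation and 11 in the other, here is a toy Python simulation of the merge step. It is not Elasticsearch code: the function `sharded_terms` is hypothetical, and the two shards and their counts are invented for illustration so that the totals resemble the ones in the question.

```python
from collections import Counter


def sharded_terms(shards, size, shard_size, include=None):
    """Merge per-shard top-`shard_size` term counts and return the top `size`."""
    merged = Counter()
    for shard in shards:
        # An include filter is applied on the shard, before the top-N cut.
        candidates = {t: c for t, c in shard.items()
                      if include is None or t in include}
        # Each shard reports only its own top `shard_size` terms.
        for term, count in Counter(candidates).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Invented data: "Buddhist art" is common on shard 1 but rare on shard 2.
shards = [
    {"Architecture": 50, "Painting": 40, "Buddhist art": 11},
    {"Architecture": 38, "Painting": 40, "Sculpture": 5, "Buddhist art": 1},
]

# Shard 2 cuts "Buddhist art" (rank 4 of 4), so one document goes missing.
print(sharded_terms(shards, size=3, shard_size=3))
# [('Architecture', 88), ('Painting', 80), ('Buddhist art', 11)]

# With an include filter, the term is the only candidate on every shard.
print(sharded_terms(shards, size=3, shard_size=3, include={"Buddhist art"}))
# [('Buddhist art', 12)]

# A larger shard_size also recovers the missing document.
print(sharded_terms(shards, size=3, shard_size=4))
# [('Architecture', 88), ('Painting', 80), ('Buddhist art', 12)]
```

The include filter in the first aggregation removes the competition on each shard, so nothing is cut; the second aggregation needs a larger shard_size to get the same count.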