I am using the significant terms aggregation, which returns the top significant terms along with their doc_count and bg_count, using the following query:
{
"query" : {
"terms" : {"user_id": ["x"]}
},
"aggregations" : {
"word_cloud" : {
"significant_terms": {
"field" : "transcript.results.alternatives.words.word.keyword",
"size": 200
}
}
},
"size": 0
}
If I take a term returned by the significant terms aggregation and run a match phrase query for that term, I get a different number of hits than the doc_count reported by the aggregation.
Match phrase query:
{
"query": {
"bool": {
"must": [
{
"match_phrase": {
"preprocess_data.results.alternatives.transcript": "<term>"
}
},
{
"match_phrase": {
"user_id": "x"
}
}
]
}
},
"from": 0,
"size": 22
}
The field preprocess_data.results.alternatives.transcript has the following mapping:
{
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
I am unable to explain the difference in document count between the aggregation and the match phrase search. Please help.
This behaviour occurs because the doc_count is computed by summing the figures returned from each shard of your index, and these figures can be approximate in the case of the significant terms aggregation. Quoting the Elasticsearch documentation:
The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and as such may be:
- low if certain shards did not provide figures for a given term in their top sample
- high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. However, the size and shard size settings covered in the next section provide tools to help control the accuracy levels.
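As a sketch of how to trade some speed for accuracy, you can raise shard_size on your aggregation so each shard reports figures for more candidate terms before the results are merged (the value 1000 below is just an illustrative choice, not a recommendation):
{
"query" : {
"terms" : {"user_id": ["x"]}
},
"aggregations" : {
"word_cloud" : {
"significant_terms": {
"field" : "transcript.results.alternatives.words.word.keyword",
"size": 200,
"shard_size": 1000
}
}
},
"size": 0
}
The larger the shard_size, the closer the summed doc_count values get to the exact per-term counts, at the cost of more work on each shard.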