I'm trying to fine-tune a "more like this" query so that it works on very similar documents (formalized announcements where most of the text is "template", so only certain paragraphs are important).
So I would like to know, for a selected document and "max_query_terms": 20, which terms are selected. An explained query only shows the terms that were actually found in the retrieved documents, not the whole set of twenty tokens.
I understood that the set of terms is selected a priori by comparing the reference document to the index, in order to build a single "match" query, but when I browse the explained hits I see more than 20 tokens...
If I use ngrams, for example, does max_query_terms apply to the tokens of the analyzed text, or to terms BEFORE analysis, i.e. taking 20 words and THEN applying my filters (stopwords, elisions, ngrams, etc.) to this set?
Is there a way, through REST or the API, to retrieve the match query generated by the MLT algorithm?
You have to use the validate API in combination with explain to see which terms Elasticsearch has selected.
GET /imdb/movies/_validate/query?explain=true
{
  "query": {
    "more_like_this": {
      "like": {
        "_id": "88247"
      }
    }
  }
}
Response:
{
  ...
  "explanations": [
    {
      "index": "imdb",
      "valid": true,
      "explanation": "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 plot:cyborg^1.0863208 ... )~2)) -ConstantScore(_uid:movies#88247))->cache(_type:movies)"
    }
  ]
}
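To compare the selected terms against max_query_terms, the explanation string can be split into its individual clauses. The following is only a rough sketch in Python: the helper name extract_mlt_terms and the regular expression are my own assumptions, not part of Elasticsearch, and the pattern only handles the "field:term^boost" shape shown in the response above (a clause printed without a "^boost" suffix would be skipped, and the exact string format may differ between versions).

```python
import re

# Assumed pattern: matches "field:term^boost" clauses as they appear in the
# explanation string above. This is not an official Elasticsearch format spec.
TERM_RE = re.compile(r"(\w+):([^\s()^]+)\^([0-9.]+)")


def extract_mlt_terms(explanation):
    """Return (field, term, boost) tuples found in an explanation string.

    Hypothetical helper for illustration; clauses without an explicit
    boost are not matched by TERM_RE.
    """
    return [(field, term, float(boost))
            for field, term, boost in TERM_RE.findall(explanation)]


# The (truncated) explanation copied from the response above.
explanation = (
    "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 "
    "plot:cyborg^1.0863208 ... )~2)) -ConstantScore(_uid:movies#88247))"
    "->cache(_type:movies)"
)

terms = extract_mlt_terms(explanation)
for field, term, boost in terms:
    print(field, term, boost)
print("terms found:", len(terms))
```

Run against a full, untruncated explanation, the length of the returned list can be compared with the max_query_terms value you set.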
Please see this discussion and this pull request for more details.