[SOLVED] Can I know the selected terms of a "more like this" query

I'm trying to fine-tune a "more like this" query so that it works on very similar documents (formalized announcements where most of the text is a template, so only certain paragraphs are important).

So, given a selected document and "max_query_terms": 20, I would like to know which terms are selected. An explained query only shows which of those terms are actually found in the retrieved documents, not the whole set of twenty tokens.

As I understand it, the set of terms is selected a priori by comparing the reference document to the index, in order to build a single "match" query. But when I browse the explained hits, I see more than 20 tokens...

If I use ngrams, for example, does max_query_terms apply to the tokens of the analyzed text, or to the terms BEFORE analysis, i.e. taking 20 words and THEN applying my filters (stopwords, elisions, ngrams, etc.) to that set?

Is there a way, through REST or the API, to retrieve the match query generated by the MLT algorithm?

Solution

You have to use the validate API in combination with explain to see which terms Elasticsearch has selected.

GET /imdb/movies/_validate/query?explain=true
{
  "query": {
    "more_like_this": {
      "like": {
        "_id": "88247"
      }
    }
  }
}

Response:

{
   ...
   "explanations": [
      {
         "index": "imdb",
         "valid": true,
         "explanation": "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 plot:cyborg^1.0863208 ... )~2)) -ConstantScore(_uid:movies#88247))->cache(_type:movies)"
      }
   ]
}
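To count the selected terms and compare them with max_query_terms, the explanation string can be split into its field:term^boost parts. The following is only a minimal sketch: the helper name selected_terms and the regular expression are my own, and it assumes the explanation keeps the format shown in the response above, which may differ between Elasticsearch versions.

```python
import re

# "field:term" optionally followed by "^boost", as printed in the explanation.
TERM_RE = re.compile(r"(\w+):([^\s()^]+)(?:\^(\d+(?:\.\d+)?))?")


def selected_terms(explanation):
    """Return (field, term, boost) tuples found in an explanation string.

    Hypothetical helper, not part of any Elasticsearch client.
    """
    terms = []
    for field, term, boost in TERM_RE.findall(explanation):
        # Assumption: fields starting with "_" (_uid, _type) belong to the
        # clauses that exclude the source document or filter by type, so
        # they are not terms picked by more_like_this.
        if field.startswith("_"):
            continue
        # A clause printed without "^boost" is treated as boost 1.0.
        terms.append((field, term, float(boost) if boost else 1.0))
    return terms


# The (truncated) explanation from the response above.
explanation = (
    "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 "
    "plot:cyborg^1.0863208 ... )~2)) "
    "-ConstantScore(_uid:movies#88247))->cache(_type:movies)"
)

for field, term, boost in selected_terms(explanation):
    print(field, term, boost)
```

With the full, untruncated explanation, len(selected_terms(explanation)) gives the number of terms the query was actually built from.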

Please see this discussion and this pull request for more details.