elasticsearch morelikethis

Can I know the selected terms of a "more like this" query?


I'm trying to fine-tune a "more like this" query so that it works on very similar documents (formalized announcements where most of the text is "template", so only certain paragraphs are important).

So, given a selected document and "max_query_terms": 20, I would like to know which terms are selected. An explained query only shows which of those terms are actually found in the retrieved documents, not the whole set of twenty tokens.

I understood that the set of terms is selected a priori by comparing the reference document to the index, in order to build a single "match" query. But as I browse the explained hits, I see more than 20 tokens...

If I use ngrams, for example, does max_query_terms apply to the tokens of the analyzed text? Or to the terms BEFORE analysis, i.e. taking 20 words and THEN applying my filters (stopwords, elisions, ngrams, etc.) to this set?

Is there a way, through REST or the API, to retrieve the match query generated by the MLT algorithm?


Solution

  • Use the validate API (_validate/query) together with explain=true to see which terms Elasticsearch has selected.

    GET /imdb/movies/_validate/query?explain=true
    {
      "query": {
        "more_like_this": {
          "like": {
            "_id": "88247"
          }
        }
      }
    }
    

    Response:

    {
       ...
       "explanations": [
          {
             "index": "imdb",
             "valid": true,
             "explanation": "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 plot:cyborg^1.0863208 ... )~2)) -ConstantScore(_uid:movies#88247))->cache(_type:movies)"
          }
       ]
    }
    

    Please see this discussion and this pull request for more details.
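
To compare the selected terms against max_query_terms, it can help to pull them out of the explanation string programmatically. The following is only a minimal sketch: extract_mlt_terms is a hypothetical helper, and it assumes the explanation keeps the field:term^boost shape shown in the response above. That string is a rendering of the rewritten Lucene query, so its exact format may differ between Elasticsearch versions.

```python
import re

# Matches `field:term^boost` pairs as they appear in the sample explanation.
# Treating the boost as optional is an assumption, not documented behaviour.
TERM_PATTERN = re.compile(r"(\w+):([^\s()^]+)(?:\^([0-9.]+))?")


def extract_mlt_terms(explanation):
    """Return (field, term, boost) tuples found in an explanation string.

    Internal fields such as _uid and _type are skipped, since they come
    from the filters wrapped around the query rather than from MLT itself.
    """
    terms = []
    for field, term, boost in TERM_PATTERN.findall(explanation):
        if field.startswith("_"):
            continue
        # Assume a missing boost means the default boost of 1.0.
        terms.append((field, term, float(boost) if boost else 1.0))
    return terms


if __name__ == "__main__":
    # Explanation copied from the sample response (still elided with "...").
    sample = (
        "filtered((((title:terminator^3.71334 plot:kyle^1.0604408 "
        "plot:cyborg^1.0863208 ... )~2)) "
        "-ConstantScore(_uid:movies#88247))->cache(_type:movies)"
    )
    for field, term, boost in extract_mlt_terms(sample):
        print(field, term, boost)
```

Counting the returned tuples on a full, non-elided explanation gives the number of terms the query was actually built from, which can then be checked against max_query_terms.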