ruby-on-rails elasticsearch morelikethis

Elasticsearch validate API: explaining the query terms that more_like_this uses against a single field, and highlighting the matched terms


I have an index, "document_texts", that holds the plain text extracted from Word and PDF documents. It is built on a Rails stack: the ActiveModel class is DocumentText, and I use the Elasticsearch Rails gems for the model and the API. I want to match similar Word or PDF documents based on their document text.

I have been able to match documents against each other by using

response = DocumentText.search \
  query: {
      filtered: {
          query: {
              more_like_this: {
                  ids: ["12345"]
              }
          }
      }
  }

But I want to see how the result set was matched: which query terms were used to match the documents?

Using the elasticsearch API gem I can do the following

 client = Elasticsearch::Client.new log: true

 client.indices.validate_query index: 'document_texts',
    explain: true,
    body: {
      query: {
          filtered: {
              query: {
                  more_like_this: {
                      ids: ['12345']
                  }
              }
          }
      }
   }

But I get this in response

{"valid":true,"_shards":{"total":1,"successful":1,"failed":0},"explanations":[{"index":"document_texts","valid":true,"explanation":"+(like:null -_uid:document_text#12345)"}]}

I would like to find out how the query was built. It uses up to 25 terms for the matching; what were those 25 terms, and how can I get them from the query?

I'm not sure if it's possible, but I would like to get the 25 terms chosen by Elasticsearch's analyzer and then rerun the query with boost values of my choosing on those terms.

I also want to highlight these terms in the document text, so I tried this:

response = DocumentText.search \
  from: 0, size: 25,
  query: {
      filtered: {
          query: {
              more_like_this: {
                  ids: ["12345"]
              }
          },
          filter: {
              bool: {
                  must: [
                      {match: { documentable_type: model}}
                  ]
              }
          }

      }
  },
  highlight: {
    pre_tags: ["<tag1>"],
    post_tags: ["</tag1>"],
    fields: {
        doc_text: {
                type_name: {
                content: {term_vector: "with_positions_offsets"}
            }
        }
    }
  }

But this fails to produce any highlights; I think I was being rather hopeful. I believe this should be possible, and I would be keen to hear from anyone who has done it or knows the best approach. Any ideas?


Solution

  • For anyone else out there, this is an easy way to show the terms used for the query. It doesn't solve the highlighting issue, but it does return the terms used in the MLT matching process. The stop words and the other settings are included just for illustration.

      curl -XGET 'http://localhost:9200/document_texts/document_text/_validate/query?rewrite=true' -d '
      {
          "query": {
              "filtered": {
                  "query": {
                      "more_like_this": {
                          "ids": ["12345"],
                          "min_term_freq": 1,
                          "max_query_terms": 50,
                          "stop_words": ["this","of"]
                      }
                  }
              }
          }
      }'
    

    https://github.com/elastic/elasticsearch-ruby/pull/359

    Once this pull request is merged, the same call should be easier to make from the Ruby client:

    client.indices.validate_query index: 'document_texts',
      rewrite: true,
      explain: true,
      body: {
        query: {
            filtered: {
                query: {
                    more_like_this: {
                        ids: ['10538']
                    }
                }
            }
        }
     }
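
    Once the rewritten explanation is available, the terms can be pulled out of it and reused. The following is only a rough sketch, not a tested recipe: it assumes the explanation comes back as a Lucene-style string of "field:term" pairs (the exact format depends on the Elasticsearch version), that the text field is named doc_text as in the highlight attempt in the question, and the helper names extract_terms, boosted_query and highlight_terms are made up for illustration. The sample string is hypothetical, not real output.

```ruby
# Hypothetical sample of a rewritten explanation; the real string may look
# different depending on the Elasticsearch version.
SAMPLE_EXPLANATION =
  '+((doc_text:contract doc_text:tenant doc_text:lease) -_uid:document_text#12345)'

# Collect every term that appears as "field:term" for the given field.
def extract_terms(explanation, field)
  explanation.scan(/\b#{Regexp.escape(field)}:([^\s()^]+)/).flatten.uniq
end

# Build a bool query with one term clause per extracted term.
# Terms missing from `boosts` fall back to a boost of 1.0.
def boosted_query(terms, field, boosts = {})
  clauses = terms.map do |term|
    { term: { field => { value: term, boost: boosts.fetch(term, 1.0) } } }
  end
  { query: { bool: { should: clauses } } }
end

# Client-side fallback for highlighting: wrap whole-word matches in tags.
# Analyzed terms may be lowercased or stemmed, so they will not always
# match the original text exactly.
def highlight_terms(text, terms, pre_tag = '<tag1>', post_tag = '</tag1>')
  return text if terms.empty?

  pattern = /\b(#{terms.map { |t| Regexp.escape(t) }.join('|')})\b/i
  text.gsub(pattern) { "#{pre_tag}#{$1}#{post_tag}" }
end

terms = extract_terms(SAMPLE_EXPLANATION, 'doc_text')
query = boosted_query(terms, 'doc_text', 'lease' => 3.0)
puts terms.inspect
puts query.inspect
puts highlight_terms('The Lease names a tenant.', terms)
```

    The hash returned by boosted_query has the same shape as the search bodies above, so it could presumably be passed to DocumentText.search in the same way; the boost values here are placeholders to be chosen per term.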