I have an Elasticsearch index, "document_texts", that effectively holds the plain text converted from Word or PDF documents. It is built on a Rails stack: the ActiveModel is DocumentText, and I use the elasticsearch-rails gems for the model and the API. I want to be able to match similar Word documents or PDFs based on the document text.
I have been able to match documents against each other by using:
response = DocumentText.search \
  query: {
    filtered: {
      query: {
        more_like_this: {
          ids: ["12345"]
        }
      }
    }
  }
But I want to see HOW the result set was produced: what query terms were used to match the documents?
Using the elasticsearch-api gem I can do the following:
client = Elasticsearch::Client.new log: true
client.indices.validate_query index: 'document_texts',
  explain: true,
  body: {
    query: {
      filtered: {
        query: {
          more_like_this: {
            ids: ['12345']
          }
        }
      }
    }
  }
But I get this in response:
{"valid":true,"_shards":{"total":1,"successful":1,"failed":0},"explanations":[{"index":"document_texts","valid":true,"explanation":"+(like:null -_uid:document_text#12345)"}]}
I would like to find out how the query was built. It uses up to 25 terms for the matching; what were those 25 terms, and how can I get them from the query?
I'm not sure if it's possible, but I would like to get the 25 terms selected by Elasticsearch's analyzer and then reapply the query with boosted values on the terms of my choice.
I also want to highlight these terms in the document text. I tried this:
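To make the "reapply with boosted values" idea concrete, here is a minimal sketch, assuming the terms have already been obtained somehow. It builds a bool/should query with one boosted term clause per term. The helper name boosted_terms_query, the example terms, and the boost values are all hypothetical; the field name doc_text is taken from the highlight attempt in this post.

```ruby
# Hypothetical helper (not part of any gem): turns a { term => boost } hash
# into a bool/should query made of boosted term clauses.
def boosted_terms_query(field, boosts)
  clauses = boosts.map do |term, boost|
    { term: { field => { value: term, boost: boost } } }
  end
  { query: { bool: { should: clauses } } }
end

# Example terms and boosts are made up for illustration.
body = boosted_terms_query('doc_text', { 'contract' => 3.0, 'invoice' => 1.0 })

# The resulting hash could then be handed to the model, e.g.:
#   response = DocumentText.search(body)
```

Term clauses are used here rather than match clauses on the assumption that the terms are already analyzed tokens, so they should not be run through the analyzer a second time.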
response = DocumentText.search \
  from: 0, size: 25,
  query: {
    filtered: {
      query: {
        more_like_this: {
          ids: ["12345"]
        }
      },
      filter: {
        bool: {
          must: [
            {match: { documentable_type: model}}
          ]
        }
      }
    }
  },
  highlight: {
    pre_tags: ["<tag1>"],
    post_tags: ["</tag1>"],
    fields: {
      doc_text: {
        type_name: {
          content: {term_vector: "with_positions_offsets"}
        }
      }
    }
  }
But this fails to produce anything; I think I was being rather hopeful. I know that this should be possible, but I would be keen to know if anyone has done this, or what the best approach is. Any ideas?
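One likely problem with the attempt above is the shape of the highlight block: term_vector is a mapping setting, and the nested type_name/content keys look like a mapping rather than highlight options. A minimal sketch of the usual shape, where each entry under fields is a field name with an (often empty) options hash, might look like this. The helper name highlighted_search_body is hypothetical, and whether a more_like_this query by ids yields highlights at all may depend on the Elasticsearch version, so treat this as a starting point rather than a confirmed fix.

```ruby
# Hypothetical helper: wraps any query hash with a highlight block for a
# single field, using the custom tag from the attempt above.
def highlighted_search_body(query, field, tag: 'tag1')
  {
    query: query,
    highlight: {
      pre_tags: ["<#{tag}>"],
      post_tags: ["</#{tag}>"],
      fields: { field => {} } # field name => highlight options (none here)
    }
  }
end

mlt = { more_like_this: { ids: ['12345'] } }
body = highlighted_search_body(mlt, 'doc_text')

# The body could then be passed to the model, e.g.:
#   response = DocumentText.search(body)
```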
For anyone else out there: validating the query with rewrite=true gives an easy way to show the terms used for the query. It doesn't solve the highlighting issue, but it does give the terms used in the MLT matching process. Some stop words and a few other settings are included just to show that they can be used.
curl -XGET 'http://localhost:9200/document_texts/document_text/_validate/query?rewrite=true' -d '
{
  "query": {
    "filtered": {
      "query": {
        "more_like_this": {
          "ids": ["12345"],
          "min_term_freq": 1,
          "max_query_terms": 50,
          "stop_words": ["this","of"]
        }
      }
    }
  }
}'
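Once the rewritten explanation comes back, the terms still have to be pulled out of the explanation string. Below is a minimal sketch of doing that in Ruby, assuming the rewritten explanation lists the terms as field:term pairs in Lucene query syntax. The sample string is made up to illustrate that assumed shape (only the _uid clause is taken from the response shown earlier), and the helper name mlt_terms is hypothetical.

```ruby
# Hypothetical helper: extracts field:term pairs from an explanation string
# and groups the terms by field. The _uid exclusion clause and the
# unrewritten "like:null" placeholder are skipped.
def mlt_terms(explanation, skip_fields: %w[_uid like])
  explanation.scan(/([\w.]+):([^\s()^~]+)/)
             .reject { |field, _term| skip_fields.include?(field) }
             .group_by(&:first)
             .transform_values { |pairs| pairs.map(&:last).uniq }
end

# Made-up sample in the assumed shape of a rewritten explanation.
sample = '+((doc_text:contract doc_text:invoice doc_text:payment)~1 ' \
         '-_uid:document_text#12345)'
terms = mlt_terms(sample)

# With a real response parsed into a Hash, the string would come from:
#   response['explanations'].first['explanation']
```

The regular expression stops a term at whitespace, parentheses, `^`, or `~`, so a boost or minimum-should-match suffix would not end up in the term itself.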
https://github.com/elastic/elasticsearch-ruby/pull/359
Once this pull request is merged, this should be easier to do from the Ruby client:
client.indices.validate_query index: 'document_texts',
  rewrite: true,
  explain: true,
  body: {
    query: {
      filtered: {
        query: {
          more_like_this: {
            ids: ['10538']
          }
        }
      }
    }
  }