I'm trying wrap my mind around how the more like this query works, and I seem to be missing something. I read the documentation, but the ES documentation is often somewhat...lacking.
The goal is to be able to limit results by term frequency, as attempted here.
So I set up a simple index, including term vectors for debugging, then added two simple docs.
DELETE /test_index
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"text": {
"type": "string",
"term_vector": "yes"
}
}
}
}
}
PUT /test_index/doc/1
{
"text": "apple, apple, apple, apple, apple"
}
PUT /test_index/doc/2
{
"text": "apple, apple"
}
When I look at the termvectors I see what I expect:
GET /test_index/doc/1/_termvector
...
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_version": 1,
"found": true,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 2,
"doc_count": 2,
"sum_ttf": 7
},
"terms": {
"apple": {
"term_freq": 5
}
}
}
}
}
GET /test_index/doc/2/_termvector
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_version": 1,
"found": true,
"term_vectors": {
"text": {
"field_statistics": {
"sum_doc_freq": 2,
"doc_count": 2,
"sum_ttf": 7
},
"terms": {
"apple": {
"term_freq": 2
}
}
}
}
}
When I run the following query with "min_term_freq": 1
I get back both docs:
POST /test_index/_search
{
"query": {
"more_like_this": {
"fields": [
"text"
],
"like_text": "apple",
"min_term_freq": 1,
"percent_terms_to_match": 1,
"min_doc_freq": 1
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.5816214,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 0.5816214,
"_source": {
"text": "apple, apple, apple, apple, apple"
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 0.5254995,
"_source": {
"text": "apple, apple"
}
}
]
}
}
But if I increase "min_term_freq"
to 2 (or more) I get nothing, though I would expect both documents to be returned:
POST /test_index/_search
{
"query": {
"more_like_this": {
"fields": [
"text"
],
"like_text": "apple",
"min_term_freq": 2,
"percent_terms_to_match": 1,
"min_doc_freq": 1
}
}
}
...
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
Why? What am I missing?
If I want to set up a query that would only return the document in which "apple"
appears 5 times, but not the one in which it appears 2 times, is there a better way?
Here is the code, for convenience:
http://sense.qbox.io/gist/341f9f77a6bd081debdcaa9e367f5a39be9359cc
The min term frequency and min doc frequency are actually applied on the input before doing the MLT. Which means as you have only one occurrence of apple in your input text , apple was never qualified for MLT as min term frequency is set to 2. If you change your input to "apple apple" as below , things will work -
POST /test_index/_search
{
"query": {
"more_like_this": {
"fields": [
"text"
],
"like_text": "apple apple",
"min_term_freq": 2,
"percent_terms_to_match": 1,
"min_doc_freq": 1
}
}
}
Same goes for min doc frequency too. Apple is found in atleast 2 document , so min_doc_freq upto 2 will qualify apply from input text for MLT operations.