How to exclude a field from being searched in Elasticsearch 6.1?


I have an index with multiple fields. I want to filter documents by the presence of a search string in any field except one, user_comments. The search request I am running is:

{
    "from": offset,
    "size": limit,
    "_source": [
      "document_title"
    ],
    "query": {
      "function_score": {
        "query": {
          "bool": {
            "must":
            {
              "query_string": {
                "query": "#{query}"
              }
            }
          }
        }
      }
    }
}

However, the query string searches all the fields, so it also returns documents whose only match is in the user_comments field. I want to run the query against all the fields except user_comments. The whitelist is very long and the field names are dynamic, so it is not feasible to list the whitelisted fields using the fields parameter, like this:

"query_string": {
  "query": "#{query}",
  "fields": [
    "document_title",
    "field2"
  ]
}

Can anybody please suggest an idea on how to exclude a field from being searched?


Solution

  • There is a way to make it work. It's not pretty, but it will do the job. You can achieve your goal by combining three things: the boost and multi-field type parameters of query_string, a bool query to combine the scores, and min_score to drop the unwanted hits:

    POST my-query-string/doc/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "query": "#{query}",
                "type": "most_fields",
                "boost": 1
              }
            },
            {
              "query_string": {
                "fields": [
                  "comments"
                ],
                "query": "#{query}",
                "boost": -1
              }
            }
          ]
        }
      },
      "min_score": 0.00001
    }
    

    So what happens under the hood?

    Let's assume you have the following set of documents:

    PUT my-query-string/doc/1
    {
      "title": "Prodigy in Bristol",
      "text": "Prodigy in Bristol",
      "comments": "Prodigy in Bristol"
    }
    PUT my-query-string/doc/2
    {
      "title": "Prodigy in Birmigham",
      "text": "Prodigy in Birmigham",
      "comments": "And also in Bristol"
    }
    PUT my-query-string/doc/3
    {
      "title": "Prodigy in Birmigham",
      "text": "Prodigy in Birmigham and Bristol",
      "comments": "And also in Cardiff"
    }
    PUT my-query-string/doc/4
    {
      "title": "Prodigy in Birmigham",
      "text": "Prodigy in Birmigham",
      "comments": "And also in Cardiff"
    }
    

    When searching for "Bristol", you would like to see only documents 1 and 3, but your original query will return 1, 2 and 3.

    In Elasticsearch, search results are sorted by relevance _score: the higher the score, the better the match.

    So let's try to boost down the "comments" field so that its contribution to the relevance score is cancelled out. We can do this by combining two queries with should and giving the second one a negative boost:

    POST my-query-string/doc/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "query": "Bristol"
              }
            },
            {
              "query_string": {
                "fields": [
                  "comments"
                ],
                "query": "Bristol",
                "boost": -1
              }
            }
          ]
        }
      }
    }
    

    This will give us the following output:

    {
      "hits": {
        "total": 3,
        "max_score": 0.2876821,
        "hits": [
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "3",
            "_score": 0.2876821,
            "_source": {
              "title": "Prodigy in Birmigham",
              "text": "Prodigy in Birmigham and Bristol",
              "comments": "And also in Cardiff"
            }
          },
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "2",
            "_score": 0,
            "_source": {
              "title": "Prodigy in Birmigham",
              "text": "Prodigy in Birmigham",
              "comments": "And also in Bristol"
            }
          },
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "1",
            "_score": 0,
            "_source": {
              "title": "Prodigy in Bristol",
              "text": "Prodigy in Bristol",
              "comments": "Prodigy in Bristol",
              "discount_percent": 10
            }
          }
        ]
      }
    }
    

    Document 2 got penalized, but so did document 1, although it is a desired match for us. Why did this happen?

    Here's how Elasticsearch computed _score in this case:

    _score = max(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

    Document 1 matches the comments:"Bristol" part, and that match also happens to be its best score. The max term and the subtracted term are therefore equal, so according to our formula the resulting score is 0.

    What we would actually like is for the first clause (the one over "all" fields) to score higher when more fields match.

    Can we boost query_string matching more fields?

    We can: query_string in multi-field mode has a type parameter that does exactly that. The query will look like this:

    POST my-query-string/doc/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "type": "most_fields",
                "query": "Bristol"
              }
            },
            {
              "query_string": {
                "fields": [
                  "comments"
                ],
                "query": "Bristol",
                "boost": -1
              }
            }
          ]
        }
      }
    }
    

    This will give us the following output:

    {
      "hits": {
        "total": 3,
        "max_score": 0.57536423,
        "hits": [
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "1",
            "_score": 0.57536423,
            "_source": {
              "title": "Prodigy in Bristol",
              "text": "Prodigy in Bristol",
              "comments": "Prodigy in Bristol",
              "discount_percent": 10
            }
          },
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "3",
            "_score": 0.2876821,
            "_source": {
              "title": "Prodigy in Birmigham",
              "text": "Prodigy in Birmigham and Bristol",
              "comments": "And also in Cardiff"
            }
          },
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "2",
            "_score": 0,
            "_source": {
              "title": "Prodigy in Birmigham",
              "text": "Prodigy in Birmigham",
              "comments": "And also in Bristol"
            }
          }
        ]
      }
    }
    

    As you can see, the undesired document 2 is at the bottom and has a score of 0. Here's how the score was computed this time:

    _score = sum(title:"Bristol", text:"Bristol", comments:"Bristol") - comments:"Bristol"

    So documents matching "Bristol" in any field were still selected, but the contribution of comments:"Bristol" was cancelled out, and only documents matching title:"Bristol" or text:"Bristol" ended up with a _score > 0.
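
To make the two formulas concrete, here is a small Python sketch that replays them on the four sample documents. It is a simplified model, not how Elasticsearch computes scores internally: it assumes every matching field contributes the same score (0.2876821, the value seen in the outputs above), and the names FIELD_SCORE, MATCHES and combined_score are made up for this illustration.

```python
# Per-field match score, taken from the sample outputs above.
# Assumption: every matching field scores the same.
FIELD_SCORE = 0.2876821

# Fields containing "Bristol" in each sample document.
MATCHES = {
    1: {"title", "text", "comments"},
    2: {"comments"},
    3: {"text"},
    4: set(),  # no match at all; Elasticsearch would not return it
}


def combined_score(matched, combine):
    """Score of the all-fields clause (combined with `combine`)
    minus the score of the negatively boosted comments clause."""
    field_scores = [FIELD_SCORE for _ in matched]
    all_fields = combine(field_scores) if field_scores else 0.0
    penalty = FIELD_SCORE if "comments" in matched else 0.0
    return all_fields - penalty


# Default mode: best field wins (max), so document 1 drops to 0.
best_fields = {doc: combined_score(m, max) for doc, m in MATCHES.items()}

# most_fields mode: per-field scores are summed, so document 1 keeps
# the contribution of title and text after the penalty is subtracted.
most_fields = {doc: combined_score(m, sum) for doc, m in MATCHES.items()}

for doc in sorted(MATCHES):
    print(doc, best_fields[doc], most_fields[doc])
```

Under this model, document 2 scores 0 in both modes, document 3 keeps its single-field score, and document 1 is only rescued by the summing mode, which is the behaviour the two result listings show.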

    Can we filter out those results with undesired score?

    Yes, we can, using min_score:

    POST my-query-string/doc/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "query_string": {
                "query": "Bristol",
                "type": "most_fields",
                "boost": 1
              }
            },
            {
              "query_string": {
                "fields": [
                  "comments"
                ],
                "query": "Bristol",
                "boost": -1
              }
            }
          ]
        }
      },
      "min_score": 0.00001
    }
    

    This will work (in our case) since the score of a document will be 0 if and only if "Bristol" matched the "comments" field only and did not match any other field.

    The output will be:

    {
      "hits": {
        "total": 2,
        "max_score": 0.57536423,
        "hits": [
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "1",
            "_score": 0.57536423,
            "_source": {
              "title": "Prodigy in Bristol",
              "text": "Prodigy in Bristol",
              "comments": "Prodigy in Bristol",
              "discount_percent": 10
            }
          },
          {
            "_index": "my-query-string",
            "_type": "doc",
            "_id": "3",
            "_score": 0.2876821,
            "_source": {
              "title": "Prodigy in Birmigham",
              "text": "Prodigy in Birmigham and Bristol",
              "comments": "And also in Cardiff"
            }
          }
        ]
      }
    }
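
If the request is assembled in application code, building it as a data structure and serializing it avoids pasting the user's search string into a JSON template by hand. The following Python sketch shows one way to do that; build_search_body and its parameters are hypothetical names chosen for this example, and the body it produces is the same as the request above. As far as I know, negative boosts are rejected by Elasticsearch 7.0 and later, so treat this as applicable to 6.x only.

```python
import json


def build_search_body(query, excluded_field, min_score=0.00001):
    """Build the bool/should request that cancels out `excluded_field`.

    Hypothetical helper: the structure mirrors the request shown above.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # All fields; scores of matching fields are summed.
                    {
                        "query_string": {
                            "query": query,
                            "type": "most_fields",
                            "boost": 1,
                        }
                    },
                    # Same query on the excluded field, with a negative boost.
                    {
                        "query_string": {
                            "fields": [excluded_field],
                            "query": query,
                            "boost": -1,
                        }
                    },
                ]
            }
        },
        # Drop hits whose score was cancelled down to 0.
        "min_score": min_score,
    }


body = build_search_body("Bristol", "comments")
print(json.dumps(body, indent=2))
```

The serialized output can be sent as the body of POST my-query-string/doc/_search; json.dumps takes care of escaping quotes in the search string.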
    

    Can it be done in a different way?

    Sure. I wouldn't actually advise relying on _score tweaking, since it is a fairly complex matter.

    I would advise fetching the existing mapping and constructing the list of fields to run the query against beforehand. This will make the code much simpler and more straightforward.
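
A minimal Python sketch of that idea follows. It assumes the mapping has already been fetched (for example with GET my-query-string/_mapping) and has the Elasticsearch 6.x shape of index name, then "mappings", then type name, then "properties". The sample mapping_response below is hand-written for illustration rather than real output, and searchable_fields and SEARCHABLE_TYPES are hypothetical names.

```python
# Field types a textual query string can reasonably be run against.
# Assumption: only text and keyword fields should be searched.
SEARCHABLE_TYPES = {"text", "keyword"}


def searchable_fields(properties, excluded, prefix=""):
    """Collect full field paths from a mapping's "properties" section,
    skipping excluded paths and non-textual field types."""
    names = []
    for name, spec in properties.items():
        path = prefix + name
        if path in excluded:
            continue
        if "properties" in spec:
            # Object field: recurse into its sub-fields (dotted paths).
            names.extend(
                searchable_fields(spec["properties"], excluded, path + ".")
            )
        elif spec.get("type") in SEARCHABLE_TYPES:
            names.append(path)
    return sorted(names)


# Hand-written sample in the assumed 6.x response shape.
mapping_response = {
    "my-query-string": {
        "mappings": {
            "doc": {
                "properties": {
                    "title": {"type": "text"},
                    "text": {"type": "text"},
                    "comments": {"type": "text"},
                    "discount_percent": {"type": "long"},
                }
            }
        }
    }
}

properties = mapping_response["my-query-string"]["mappings"]["doc"]["properties"]
fields = searchable_fields(properties, excluded={"comments"})
print(fields)
```

The resulting list goes into the fields parameter of a plain query_string query, so no negative boost or min_score is needed. Because the field names are dynamic, the list would have to be rebuilt (or cached and refreshed) whenever the mapping changes.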

    Original solution proposed in the answer (kept for history)

    Originally it was proposed to use this kind of query with exactly the same intent as the solution above:

    POST my-query-string/doc/_search
    {
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "must": {
                "query_string": {
                  "fields" : ["*", "comments^0"],
                  "query": "#{query}"
                }
              }
            }
          }
        }
      },
      "min_score": 0.00001
    }
    

    The only problem is that if the index contains any numeric fields, this part:

    "fields": ["*"]
    

    raises an error, since a textual query string cannot be applied to a numeric field.