elasticsearch

Elasticsearch query_string containing phrases with wildcards gives unexpected results


I'm working on a large index containing descriptions of companies.

Query

GET company_descriptions/_search
{
  "query": {
    "query_string": {
      "query": "(\"Blockchain\" or \"block-chain\" OR \"Blockchain?\" OR \"Block-chain?\" OR \"Distributed Ledger\" OR \"Distributed Ledger?\") AND (\"viruses\" OR \"virus\")",
      "fields": ["services.no_case_sensitive", "description.no_case_sensitive"]
    }
  }
}

The query is intended to find documents that discuss both computer viruses (the second group, after the AND) and blockchain technology (the first group, before the AND). Note that some terms in the first group are phrases containing a wildcard.

Question

Among other hits, the query returns this one:

{
    "_index": "company_descriptions",
    "_id": "123456",
    "_score": 20.032307,
    "_source": {
        "description": "To relieve needs of persons who are HIV positive or are suffering from aids or blood borne viruses and their families and/or carers and to advance the education of the public in the treatment and prevention of HIV and Aids and blood borne viruses.",
        "services": null
    }
},

which contains the word viruses but none of the blockchain-related terms. Setting aside the question of whether wildcards work correctly inside phrases, I would expect the query to be evaluated against this document as follows:

  1. the first part of the query, before the AND, evaluates to false because the document contains no blockchain-related term
  2. the second part of the query evaluates to true because the document contains the word viruses
  3. since 1. and 2. are combined with AND, the overall query should evaluate to false, so the document should not be retrieved.

What is wrong with this reasoning? Why is the document retrieved?

Additional information

Mapping

{
   "properties": {
        
        "description": {
            "type": "text",
            
            "fields": {
                "no_case_sensitive": {
                    "type": "text",
                    "analyzer": "NO_CASE_SENSITIVE",
                    "search_analyzer": "NO_CASE_SENSITIVE"
                },
                "case_sensitive": {
                    "type": "text",
                    "analyzer": "CASE_SENSITIVE",
                    "search_analyzer": "CASE_SENSITIVE"
                }
            }
        },
    
        "services": {
            "type": "text",
    
           "fields": {
                "no_case_sensitive": {
                    "type": "text",
                    "analyzer": "NO_CASE_SENSITIVE",
                    "search_analyzer": "NO_CASE_SENSITIVE"
                },
                "case_sensitive": {
                    "type": "text",
                    "analyzer": "CASE_SENSITIVE",
                    "search_analyzer": "CASE_SENSITIVE"
                }
            }
        }
        
        
    }
}

Analyzers

{
    "settings": {
        "analysis": {
            "analyzer": {
                "NO_CASE_SENSITIVE": {
                    "type": "custom",
                    "stopwords": [],
                    "filter": [
                        "lowercase"
                    ],
                    "tokenizer": "standard"
                },
                
                "CASE_SENSITIVE": {
                    "type": "custom",
                    "stopwords": [],
                    "filter": [],
                    "tokenizer": "standard"
                }
            }
        }
    }
}

Solution

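  • The first or should be in uppercase. query_string only treats AND, OR and NOT as boolean operators when they are written in uppercase; a lowercase or is parsed as an ordinary search term, and that term occurs in the text (HIV positive or are). Since terms without an explicit operator are combined with the default operator, which is OR unless default_operator is changed, that single match is enough to satisfy the first group.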

    I simplified your example a bit:

    DELETE company_descriptions
    PUT company_descriptions
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "NO_CASE_SENSITIVE": {
              "type": "custom",
              "filter": [
                "lowercase"
              ],
              "tokenizer": "standard"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "analyzer": "NO_CASE_SENSITIVE"
          }
        }
      }
    }
    
    POST company_descriptions/_doc
    {
      "description": "To relieve needs of persons who are HIV positive or are suffering from aids or blood borne viruses and their families and/or carers and to advance the education of the public in the treatment and prevention of HIV and Aids and blood borne viruses."
    }
    
    GET company_descriptions/_search
    {
      "query": {
        "query_string": {
          "query": "(\"Blockchain\" OR \"block-chain\" OR \"Blockchain?\" OR \"Block-chain?\" OR \"Distributed Ledger\" OR \"Distributed Ledger?\") AND (\"viruses\" OR \"virus\")",
          "fields": ["description"]
        }
      }
    }
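
    One way to check how a query string is actually parsed is the validate API with explain=true, which reports the Lucene query that Elasticsearch builds from it. The request below is a sketch that sends the original query (with the lowercase or) to the simplified index created above. I have not included a response here; the expectation is that the explanation contains a plain term clause for or on the description field next to the phrase clauses.

    ```json
    # Original query, lowercase "or" left in on purpose,
    # sent to the validate API instead of _search
    GET company_descriptions/_validate/query?explain=true
    {
      "query": {
        "query_string": {
          "query": "(\"Blockchain\" or \"block-chain\" OR \"Blockchain?\" OR \"Block-chain?\" OR \"Distributed Ledger\" OR \"Distributed Ledger?\") AND (\"viruses\" OR \"virus\")",
          "fields": ["description"]
        }
      }
    }
    ```

    Sending the same request with OR in uppercase and comparing the two explanations should show that extra clause disappearing.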