I'm working on a large index containing descriptions of companies, and I run the following query against it:
GET company_descriptions/_search
{
"query": {
"query_string": {
"query": "(\"Blockchain\" or \"block-chain\" OR \"Blockchain?\" OR \"Block-chain?\" OR \"Distributed Ledger\" OR \"Distributed Ledger?\") AND (\"viruses\" OR \"virus\")",
"fields": ["services.no_case_sensitive", "description.no_case_sensitive"]
}
}
}
The query is intended to find documents that talk about computer viruses (second group, after the AND) and blockchain technology (first group, before the AND) at the same time. Note that some terms in the first group are phrases containing a wildcard.
Among other hits, the query returns this result:
{
"_index": "company_descriptions",
"_id": "123456",
"_score": 20.032307,
"_source": {
"description": "To relieve needs of persons who are HIV positive or are suffering from aids or blood borne viruses and their families and/or carers and to advance the education of the public in the treatment and prevention of HIV and Aids and blood borne viruses.",
"services": null
}
},
which contains the word viruses, but none of the other search terms. Setting aside the question of whether wildcards work correctly inside phrases, I would interpret the execution of the query on this document as follows: the first group matches nothing, the second group matches only on viruses, so the AND of the two groups should be false and the document should not be returned.
What is wrong with this reasoning? Why is the document retrieved?
The index mapping is:
{
"properties": {
"description": {
"type": "text",
"fields": {
"no_case_sensitive": {
"type": "text",
"analyzer": "NO_CASE_SENSITIVE",
"search_analyzer": "NO_CASE_SENSITIVE"
},
"case_sensitive": {
"type": "text",
"analyzer": "CASE_SENSITIVE",
"search_analyzer": "CASE_SENSITIVE"
}
}
},
"services": {
"type": "text",
"fields": {
"no_case_sensitive": {
"type": "text",
"analyzer": "NO_CASE_SENSITIVE",
"search_analyzer": "NO_CASE_SENSITIVE"
},
"case_sensitive": {
"type": "text",
"analyzer": "CASE_SENSITIVE",
"search_analyzer": "CASE_SENSITIVE"
}
}
}
}
}
The analyzers are defined in the index settings:
{
"settings": {
"analysis": {
"analyzer": {
"NO_CASE_SENSITIVE": {
"type": "custom",
"stopwords": [],
"filter": [
"lowercase"
],
"tokenizer": "standard"
},
"CASE_SENSITIVE": {
"type": "custom",
"stopwords": [],
"filter": [],
"tokenizer": "standard"
}
}
}
}
}
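I think the first or should be in uppercase (OR). query_string only recognizes boolean operators written in capitals, so the lowercase or is treated as an ordinary search term, and that term occurs in the text: "HIV positive or are". Since terms without an explicit operator are combined with the default operator (OR), the first group matches on that word alone, and the AND with the second group (which matches viruses) succeeds.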
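One way to check this is the validate API with explain=true, which reports how Elasticsearch rewrites the query string into a Lucene query. The request below is only a sketch: it assumes the index and fields from the question still exist, and the query string is shortened for illustration. If the lowercase or is parsed as a term, it should show up in the explanation as a clause on each field rather than as an operator.

```
# Ask Elasticsearch how it parses the query string, without running the search.
# The shortened query is illustrative; the lowercase "or" is kept on purpose.
GET company_descriptions/_validate/query?explain=true
{
  "query": {
    "query_string": {
      "query": "(\"Blockchain\" or \"block-chain\") AND (\"viruses\" OR \"virus\")",
      "fields": ["services.no_case_sensitive", "description.no_case_sensitive"]
    }
  }
}
```

Running the same request with OR in capitals and comparing the two explanations should make the difference visible.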
I simplified your example a bit:
DELETE company_descriptions
PUT company_descriptions
{
"settings": {
"analysis": {
"analyzer": {
"NO_CASE_SENSITIVE": {
"type": "custom",
"filter": [
"lowercase"
],
"tokenizer": "standard"
}
}
}
},
"mappings": {
"properties": {
"description": {
"type": "text",
"analyzer": "NO_CASE_SENSITIVE"
}
}
}
}
POST company_descriptions/_doc
{
"description": "To relieve needs of persons who are HIV positive or are suffering from aids or blood borne viruses and their families and/or carers and to advance the education of the public in the treatment and prevention of HIV and Aids and blood borne viruses."
}
GET company_descriptions/_search
{
"query": {
"query_string": {
"query": "(\"Blockchain\" OR \"block-chain\" OR \"Blockchain?\" OR \"Block-chain?\" OR \"Distributed Ledger\" OR \"Distributed Ledger?\") AND (\"viruses\" OR \"virus\")",
"fields": ["description"]
}
}
}
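To see which terms actually produce a match, one option is to run the original lowercase variant against this simplified index with highlighting turned on. This is a sketch, not part of the original requests: the query string is shortened for illustration and the highlight block is an addition. If the explanation above holds, the highlighted fragments should mark the word or alongside viruses.

```
# Same simplified index, but with the lowercase "or" from the question.
# The highlight section asks Elasticsearch to mark the matching terms
# in the description field of each hit.
GET company_descriptions/_search
{
  "query": {
    "query_string": {
      "query": "(\"Blockchain\" or \"block-chain\") AND (\"viruses\" OR \"virus\")",
      "fields": ["description"]
    }
  },
  "highlight": {
    "fields": {
      "description": {}
    }
  }
}
```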