We have an index with a keyword field. A document may have keywords such as ['cheesecake', 'cinnamon roll'].
If the input text contains the word 'cheesecake', there is no problem. But if the input text is something like 'Today I have eaten a cinnamon roll', nothing matches. We think the problem is that the input text is tokenized into single words, so neither 'cinnamon' nor 'roll' matches our keyword 'cinnamon roll' (and we don't want them to: only 'cinnamon roll' should match the keyword 'cinnamon roll').
How could we solve that? We thought of using shingles, but we couldn't find the proper way to do it. Only the input search text needs to be tokenized.
This is our current query:
GET /food-suggestion/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keywords": {
              "query": "cinnamon roll",
              "analyzer": "standard",
              "operator": "or"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "languageId": 1
          }
        },
        {
          "term": {
            "webId": 2
          }
        }
      ]
    }
  }
}
Index mapping:
Field         Type
description   Text
id            Integer
keywords      Keyword
languageId    Integer
foodId        Long
title         Text
webId         Integer
This is a document of the index:
{
  "description": "Bla bla bla",
  "keywords": [
    "cinnamon roll",
    "crema catalana",
    "cheesecake"
  ],
  "languageId": 1,
  "foodId": 13,
  "title": "Sample title",
  "webId": 2
}
You are thinking in the right direction. You can use the shingle token filter to solve this issue.
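You can create the analyzer as shown below. The shingle sizes are options of the shingle token filter, not of the analyzer, so they are set on a custom filter. min_shingle_size cannot be lower than 2 (single words are still emitted because output_unigrams defaults to true), max_shingle_size should cover the longest keyword counted in words, and the gap between the two sizes is limited by index.max_shingle_diff (3 by default). Analysis settings can only be changed on a closed index, so close the index before this request and reopen it afterwards.
PUT test/_settings
{
  "settings": {
    "analysis": {
      "filter": {
        "keyword_shingle": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 4
        }
      },
      "analyzer": {
        "standard_shingle": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "keyword_shingle"
          ]
        }
      }
    }
  }
}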
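To see which terms a shingle analyzer of this kind sends to the keyword field, here is a rough pure-Python approximation. It is only a sketch: the function name shingles and its default sizes are assumptions made for illustration, the regular expression is a crude stand-in for the standard tokenizer, and Elasticsearch emits the tokens in a different order.

```python
import re

def shingles(text, min_size=2, max_size=3, output_unigrams=True):
    """Approximate a standard tokenizer + lowercase + shingle filter chain."""
    # Crude stand-in for the standard tokenizer: lowercase, split on non-word characters.
    words = [w for w in re.split(r"\W+", text.lower()) if w]
    # Single words are kept when output_unigrams is true (the filter's default).
    tokens = list(words) if output_unigrams else []
    # Add every run of min_size..max_size consecutive words, joined by a space.
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens

tokens = shingles("Today I have eaten a cinnamon roll")
print("cinnamon roll" in tokens)  # True
print(tokens)
```

Because 'cinnamon roll' is one of the emitted terms, it can be compared as a whole against the stored keyword, while 'cinnamon' and 'roll' on their own only match keywords that are exactly those words.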
You can use the query below to get the desired result:
GET test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keywords": {
              "query": "cinnamon roll",
              "analyzer": "standard_shingle"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "languageId": 1
          }
        },
        {
          "term": {
            "webId": 2
          }
        }
      ]
    }
  }
}
The query above will not return the document if you pass only 'cinnamon' or only 'roll' as the query text.
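A few things to consider:
The keywords field must hold its values in lowercase, because a keyword field is case sensitive and the analyzer lowercases the search text.
The shingle filter also emits single words by default (output_unigrams is true), so one-word keywords such as 'cheesecake' still match.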
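The case-sensitivity point can be illustrated with a small Python sketch. Both helper names (normalize_keywords, keyword_matches) are hypothetical and exist only for this example; the comparison imitates the exact matching of a keyword field rather than calling Elasticsearch. A lowercase normalizer on the keyword field would be another way to get the same effect inside the index.

```python
def normalize_keywords(keywords):
    """Trim and lowercase keywords before indexing (hypothetical helper)."""
    return [k.strip().lower() for k in keywords]

def keyword_matches(stored_keywords, query_tokens):
    """Exact, case-sensitive comparison, imitating term lookups on a keyword field."""
    return sorted(set(stored_keywords) & set(query_tokens))

# Tokens as the analyzer would emit them for "cinnamon roll": already lowercased.
query_tokens = ["cinnamon", "roll", "cinnamon roll"]
raw = ["Cinnamon Roll", "Cheesecake"]

print(keyword_matches(raw, query_tokens))                      # []
print(keyword_matches(normalize_keywords(raw), query_tokens))  # ['cinnamon roll']
```

Keywords stored with capital letters never equal the lowercased query terms, so normalizing them once at indexing time keeps both sides comparable.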