My goal is to return the same results whether I search by the symbol or by its HTML-encoded version.
Example Queries:
# searching with symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello®",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}
# html symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello&reg;",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}
I've tried a couple of different things.
I added synonyms, but the two queries still produced different results.
#######################################
# Synonyms
# Symbols
#######################################
&trade;, ™
&reg;, ®
I also created a char_filter that replaces special characters, so that both queries would at least search for "Hello". But that comes with its own set of issues that are out of scope for what I am trying to achieve.
"char_filter": {
  "specialCharactersFilter": {
    "type": "pattern_replace",
    "pattern": "[^A-Za-z0-9]",
    "replacement": " "
  }
}
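To make the effect of that pattern concrete, here is a rough Python emulation of the replacement. It is only a sketch: the helper names are hypothetical, and the whitespace split plus lowercasing is a stand-in for the real analyzer, not what Elasticsearch does internally.

```python
import re

# Same character class as the pattern_replace char filter above.
SPECIAL_CHARS = re.compile(r"[^A-Za-z0-9]")

def replace_special_chars(text: str) -> str:
    """Replace every character that is not an ASCII letter or digit with a space."""
    return SPECIAL_CHARS.sub(" ", text)

def rough_tokens(text: str) -> list:
    """Assumption: lowercase + whitespace split approximates the tokenizer."""
    return replace_special_chars(text).lower().split()

symbol_tokens = rough_tokens("Hello®")
entity_tokens = rough_tokens("Hello&reg;")

print(symbol_tokens)  # ['hello']
print(entity_tokens)  # ['hello', 'reg']
```

Under these assumptions the symbol collapses to a single term, while the encoded form leaves the entity name behind as an extra term, which is one way the two queries can still diverge.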
I'd appreciate any suggestions for alternative ways to achieve this goal, ideally a solution that covers more than ® and ™.
What you are looking for is the html_strip char filter, which works not just for these two symbols but for a broad range of HTML entities.
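As a loose illustration of the entity-decoding idea, the sketch below uses Python's standard `html.unescape`. This is an analogy only: it is not the Lucene implementation behind html_strip, it does not strip tags, and the helper name `decode_entities` is hypothetical.

```python
import html

def decode_entities(text: str) -> str:
    """Decode HTML character references, e.g. '&reg;' becomes '®'."""
    return html.unescape(text)

pairs = [("Hello&reg;", "Hello®"), ("&trade;", "™")]
for encoded, symbol in pairs:
    decoded = decode_entities(encoded)
    # After decoding, the encoded form and the symbol form are the same string.
    print(repr(encoded), "->", repr(decoded), decoded == symbol)
```

Once both forms decode to the same text before tokenization, the rest of the analysis chain sees identical input for either query.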
Working example
Index mapping with html strip char filter
PUT 71622637
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
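One way to check what the analyzer emits for each form is the Analyze API on the index (`GET 71622637/_analyze`). The sketch below only builds the request bodies and prints them as JSON; it does not contact a cluster, and the helper name `build_analyze_body` is hypothetical.

```python
import json

def build_analyze_body(text: str) -> dict:
    """Build an _analyze request body that runs my_analyzer on the given text."""
    return {
        "analyzer": "my_analyzer",
        "text": text,
    }

for sample in ["Hello®", "Hello&reg;"]:
    # Each printed body can be sent with: GET 71622637/_analyze
    print(json.dumps(build_analyze_body(sample), ensure_ascii=False))
```

Comparing the token lists returned for the two bodies shows directly whether the symbol and its encoded form are analyzed the same way.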
Index a sample document whose title contains only the ™ symbol.
PUT 71622637/_doc/1
{
  "title": "™"
}
Search on its HTML-encoded version
GET 71622637/_search
{
  "query": {
    "match": {
      "title": "&trade"
    }
  }
}
And search result
"hits": [
  {
    "_index": "71622637",
    "_id": "1",
    "_score": 0.89701396,
    "_source": {
      "title": "™"
    }
  }
]
Similarly, search on the trademark symbol itself
GET 71622637/_search
{
  "query": {
    "match": {
      "title": "™"
    }
  }
}
And search result
"hits": [
  {
    "_index": "71622637",
    "_id": "1",
    "_score": 0.89701396,
    "_source": {
      "title": "™"
    }
  }
]