elasticsearch filter analyzer elasticsearch-query stemming

Elastic: Treat symbol and html encoded symbol the same during search


My goal is to return the same results whether searching by the symbol or by its HTML-encoded version.

Example Queries:

# searching with symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello®",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

# html symbol
GET my-test-index/_search
{
  "query": {
    "bool": {
      "must": {
        "simple_query_string": {
          "query": "Hello&reg;",
          "analyzer": "english_syn",
          "fields": [
            "AllContent"
          ]
        }
      }
    }
  }
}

I've tried a couple of different things.

First, I added synonyms, but the two queries still produced different results.

#######################################
# Synonyms
# Symbols
#######################################
&trade;, ™
&reg;, ®

I also created a char_filter that replaces special characters, so that both queries would at least search for "Hello". But that comes with its own set of issues, which are out of scope for what I am trying to achieve.

"char_filter": {
    "specialCharactersFilter": {
        "type": "pattern_replace",
        "pattern": "[^A-Za-z0-9]",
        "replacement": " "
    }
}

I would appreciate feedback on any alternatives for achieving this goal, ideally a solution that covers more than just ® and ™.


Solution

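  • What you are looking for is the HTML strip character filter (html_strip). It strips HTML elements and replaces HTML entities with their decoded characters, so it covers a broad range of HTML-encoded symbols rather than just these two.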

    Working example

    Index mapping with html strip char filter

    PUT 71622637
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "char_filter": [
                            "html_strip"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {
                    "type": "text",
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
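
    To check what the analyzer emits for a given input, one option is the _analyze API. This is a minimal sketch, assuming the index was created with the settings above under the name 71622637 (the name used in the other requests of this answer); the sample text is only illustrative.

    GET 71622637/_analyze
    {
        "analyzer": "my_analyzer",
        "text": "Hello&reg;"
    }

    Running it once with "Hello&reg;" and once with "Hello®" and comparing the returned tokens shows whether both forms are reduced to the same terms.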
    

    Index a sample document whose title contains only the trademark symbol (™).

    PUT 71622637/_doc/1
    {
        "title": "™"
    }
    
    

    Search for its HTML-encoded version

    GET 71622637/_search
    {
        "query" :{
            "match" : {
                "title" : "&trade"
            }
        }
    }
    
    And search result
    
    "hits": [
                {
                    "_index": "71622637",
                    "_id": "1",
                    "_score": 0.89701396,
                    "_source": {
                        "title": "™"
                    }
                }
            ]
    

    Similarly, search for the trademark symbol itself

    GET 71622637/_search
    {
        "query" :{
            "match" : {
                "title" : "™"
            }
        }
    }
    
    And search result
    
    "hits": [
                {
                    "_index": "71622637",
                    "_id": "1",
                    "_score": 0.89701396,
                    "_source": {
                        "title": "™"
                    }
                }
            ]
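
    To carry this over to the setup in the question, the same char filter would need to be part of the analyzer applied to AllContent, both at index time and at search time. The following is only a sketch: the question does not show how english_syn is defined, so the tokenizer and the token filter list here are placeholders, and the synonym filter is left out.

    PUT my-test-index
    {
        "settings": {
            "analysis": {
                "analyzer": {
                    "english_syn": {
                        "type": "custom",
                        "char_filter": [
                            "html_strip"
                        ],
                        "tokenizer": "standard",
                        "filter": [
                            "lowercase"
                        ]
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "AllContent": {
                    "type": "text",
                    "analyzer": "english_syn"
                }
            }
        }
    }

    If the index already exists, documents indexed before the analyzer change would have to be reindexed for the new analysis to apply to them.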