elasticsearch elasticsearch-analyzers

Ignore a specific character during fuzzy searches with an Elasticsearch analyzer


I have a fuzzy search analyzer in Elasticsearch with the following index and documents:

PUT test_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 40      
    },
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "autocomplete"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      },
      "filter": {
        "autocomplete": {
          "type": "ngram",        
          "min_gram": 2,
          "max_gram": 40
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",            
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

PUT test_index/_doc/1
{ "title": "HRT 2018-BN18 N-SB" }

PUT test_index/_doc/2
{ "title": "GMC 2019-BN18 A-SB" }

How can I ignore the hyphen ('-') during my fuzzy search so that `GMC 2019-BN18 A-SB`, `gmc 2019`, `gmc 2019-BN18 A-SB`, and `GMC 2019-BN18 ASB` all yield the same document?

I tried to create another analyzer separately, but I am not sure how to apply multiple analyzers to the same field:

"settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "- => "
          ]
        }
      }
    }
  }
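(For reference, the effect of a `mapping` char filter like this can be previewed with the `_analyze` API without creating an index; the tokens shown in the comment are what the whitespace tokenizer should produce once the hyphens are stripped.)

    GET _analyze
    {
      "tokenizer": "whitespace",
      "char_filter": [
        {
          "type": "mapping",
          "mappings": [
            "- => "
          ]
        }
      ],
      "text": "GMC 2019-BN18 A-SB"
    }

    # Expected tokens: "GMC", "2019BN18", "ASB"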

Solution

  • You're on the right path; you just need to add that character filter to both analyzers so the hyphens are removed at both index time and search time:

    PUT test_index
    {
      "settings": {
        "index": {
          "max_ngram_diff": 40
        },
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": [
                "- => "
              ]
            }
          },
          "analyzer": {
            "autocomplete": {
              "char_filter": [
                "my_char_filter"
              ],
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "autocomplete"
              ]
            },
            "autocomplete_search": {
              "char_filter": [
                "my_char_filter"
              ],
              "tokenizer": "whitespace",
              "filter": [
                "lowercase"
              ]
            }
          },
          "filter": {
            "autocomplete": {
              "type": "ngram",
              "min_gram": 2,
              "max_gram": 40
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
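As a quick sanity check (assuming the index above has been recreated and the two documents re-indexed), you can inspect the tokens produced by the search analyzer and then confirm that the hyphen-free variant matches. With the char filter applied, the whitespace tokenizer plus lowercase filter should yield tokens like `gmc`, `2019bn18`, and `asb`:

    GET test_index/_analyze
    {
      "analyzer": "autocomplete_search",
      "text": "GMC 2019-BN18 A-SB"
    }

    # All of these query variants should now return document 2
    GET test_index/_search
    {
      "query": {
        "match": {
          "title": "GMC 2019-BN18 ASB"
        }
      }
    }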