elasticsearch

Elasticsearch English stemming not working correctly


I've added an English stemmer filter to a custom analyzer, but it doesn't seem to handle plurals where 'y' becomes 'ies'. For example, when I search for 'raspberry', the results never include 'raspberries'. I've tried both english and minimal_english, and I get the same result either way.

Here's the analyzer and settings:

  analysis: {
    analyzer: {
      custom_analyzer: {
        type: "custom",
        tokenizer: "standard",
        filter: ["lowercase", "english_stemmer"],
      },
    },
    filter: {
      english_stemmer: {
        type: "stemmer",
        language: "english",
      },
    },
  },

What am I doing wrong?


Solution

  • Although the english stemmer should handle the example you mentioned, you can also use the porter_stem filter, which is equivalent to a stemmer filter with language set to english.

    porter_stem in action:

    POST /_analyze
    {
      "tokenizer": "standard",
      "filter": ["porter_stem"],
      "text": ["raspberry", "raspberries"]
    }
    

    Response to the above request:

    {
      "tokens" : [
        {
          "token" : "raspberri",
          "start_offset" : 0,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 0
        },
        {
          "token" : "raspberri",
          "start_offset" : 10,
          "end_offset" : 21,
          "type" : "<ALPHANUM>",
          "position" : 101
        }
      ]
    }
    

    You can see that both raspberry and raspberries are stemmed to the same token, raspberri. A search for raspberry will therefore also match raspberries, and vice versa.

    Make sure that the field you are indexing into and searching against has its analyzer set to custom_analyzer (the name used in the settings from your question). If the field mapping does not reference the analyzer, the stemmer is never applied.
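
    One way to check this is to run the _analyze API against the field itself, so that whatever analyzer the mapping assigns to that field is the one used. This is only a sketch: my-index and my_field are placeholder names, not taken from your setup.

    # my-index and my_field are placeholders; substitute your own index and field
    GET /my-index/_analyze
    {
      "field": "my_field",
      "text": ["raspberry", "raspberries"]
    }

    If the two tokens come back different from each other, or unstemmed, the field is probably not using custom_analyzer.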

    Working example:

    Mapping:

    PUT test
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "custom_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "english_stemmer"
              ]
            }
          },
          "filter": {
            "english_stemmer": {
              "type": "stemmer",
              "language": "english"
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "field1": {
            "type": "text",
            "analyzer": "custom_analyzer"
          }
        }
      }
    }
    

    Indexing:

    PUT test/_doc/1
    {
      "field1": "raspberries"
    }
    
    PUT test/_doc/2
    {
      "field1": "raspberry"
    }
    

    Search:

    GET test/_search
    {
      "query": {
        "match": {
          "field1": {
            "query": "raspberry"
          }
        }
      }
    }
    

    Response:

    {
      "took" : 0,
      "timed_out" : false,
      "_shards" : {
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
      },
      "hits" : {
        "total" : {
          "value" : 2,
          "relation" : "eq"
        },
        "max_score" : 0.18232156,
        "hits" : [
          {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1",
            "_score" : 0.18232156,
            "_source" : {
              "field1" : "raspberries"
            }
          },
          {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2",
            "_score" : 0.18232156,
            "_source" : {
              "field1" : "raspberry"
            }
          }
        ]
      }
    }
    

    You can also have a look at another stemmer, kstem.
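
    To compare, the same kind of _analyze request can be run with the built-in kstem filter in place of porter_stem. This is a sketch only; the lowercase filter is added here on the assumption that kstem should receive lowercased tokens.

    # kstem swapped in for porter_stem; lowercase added as an assumption
    POST /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase", "kstem"],
      "text": ["raspberry", "raspberries"]
    }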