elasticsearch stemming spanish

Wrong behavior in stemming for non-English languages?


I am working on a project with Spanish text. In short, none of the Spanish stemmers I have found in the documentation (there are only two: snowball and the normal one) gives me good results. For example:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ 
    {
      "type": "snowball",
      "language": "spanish"
    }
  ],
  "text": "alimento, alimentacion"
}

The previous query returns the following:

{
  "tokens" : [
    {
      "token" : "aliment",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "alimentacion",
      "start_offset" : 10,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Clearly "alimento" and "alimentacion" should have the same root. Is there a way to use a different stemmer?


Solution

  • As has been mentioned, alimentación (with the accent) is properly stemmed to aliment, while alimentacion (without it) is not. Nevertheless, I found a page in the official Elasticsearch documentation that describes how to define custom stemming rules.

    In your case, you just need to add a new filter just before the stemmer:

    "filter": {
      "custom_stems": {
        "type": "stemmer_override",
        "rules": [
          "alimentacion => aliment"
        ]
      }
    }
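
To make the ordering concrete, here is a minimal sketch of index settings that place the override filter before the stemmer. The index name my-spanish-index, the analyzer name spanish_custom and the filter name spanish_snowball are placeholders chosen for illustration, not names from the question. The snowball settings are copied from the request above.

```json
PUT /my-spanish-index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stems": {
          "type": "stemmer_override",
          "rules": [
            "alimentacion => aliment"
          ]
        },
        "spanish_snowball": {
          "type": "snowball",
          "language": "spanish"
        }
      },
      "analyzer": {
        "spanish_custom": {
          "tokenizer": "standard",
          "filter": [
            "custom_stems",
            "spanish_snowball"
          ]
        }
      }
    }
  }
}
```

Assuming the index is created this way, the analyzer can be checked with the same text as in the question. Both tokens should then come out as aliment, since the override rule handles alimentacion and protects it from further stemming, while snowball handles alimento as before.

```json
GET /my-spanish-index/_analyze
{
  "analyzer": "spanish_custom",
  "text": "alimento, alimentacion"
}
```

Each unaccented variant needs its own rule, so this approach suits a short list of known problem words better than a whole vocabulary.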