[SOLVED] Wrong behavior in stemming for non-English languages?

I am working on a project with Spanish text. In short, neither of the two Spanish stemmers I found in the documentation (snowball and the regular stemmer) gives me good results. For example:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [ 
    {
      "type": "snowball",
      "language": "spanish"
    }
  ],
  "text": "alimento, alimentacion"
}

The previous query returns the following:

{
  "tokens" : [
    {
      "token" : "aliment",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "alimentacion",
      "start_offset" : 10,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

Clearly, "alimento" and "alimentacion" should have the same root. Is there a way to use other stemmers?

Solution

As has been mentioned, the accented form alimentación is stemmed properly to aliment, but the unaccented alimentacion is not, because the Spanish Snowball rules match the accented suffix "-ación". Nevertheless, I found this link in the official Elasticsearch documentation that lets you define custom stemming patterns.

In your case, you only need to define a new filter and place it just before the stemmer:

"filter": {
  "custom_stems": {
    "type": "stemmer_override",
    "rules": [
      "alimentacion => aliment"
    ]
  }
}
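
To show where that fragment could sit, here is a minimal sketch of an index-creation body, assuming you use a custom analyzer. The names spanish_custom and spanish_snowball are hypothetical and only there for illustration; custom_stems and its rule are the ones from above. The part that matters is the order of the analyzer's "filter" array: stemmer_override marks the tokens it rewrites as keywords, so it has to come before the stemmer for the stemmer to leave them alone.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stems": {
          "type": "stemmer_override",
          "rules": [
            "alimentacion => aliment"
          ]
        },
        "spanish_snowball": {
          "type": "snowball",
          "language": "spanish"
        }
      },
      "analyzer": {
        "spanish_custom": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "custom_stems",
            "spanish_snowball"
          ]
        }
      }
    }
  }
}
```

With settings along these lines, analyzing "alimento, alimentacion" with the spanish_custom analyzer should produce aliment for both tokens. Any other unaccented form that the stemmer misses would need its own rule in the "rules" list.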