I am working on a project with Spanish text. In short, none of the stemmers I have found in the documentation for Spanish give me good results (there are only two: Snowball and the default one). To give an example with Snowball:
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "snowball",
      "language": "spanish"
    }
  ],
  "text": "alimento, alimentacion"
}
That request returns the following:
{
  "tokens" : [
    {
      "token" : "aliment",
      "start_offset" : 0,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "alimentacion",
      "start_offset" : 10,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}
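For completeness, here is a sketch of the equivalent test with the built-in stemmer filter (the "default" one mentioned above); it gives me similarly poor results:
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stemmer",
      "language": "spanish"
    }
  ],
  "text": "alimento, alimentacion"
}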
Clearly "alimento" and "alimentacion" should have the same root. Is there a way to find other stemmers that handle this?
As has been mentioned, "alimentación" (with the accent) gets properly stemmed to "aliment", while "alimentacion" does not. Nevertheless, I found this link in the official Elasticsearch documentation for the stemmer_override token filter, which lets you define custom stemming rules.
In your case, you just need to add a new filter right before the stemmer:
"filter": {
"custom_stems": {
"type": "stemmer_override",
"rules": [
"alimentacion => aliment"
]
}
}
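For context, here is a sketch of how this could be wired into the index settings, with the stemmer_override filter placed before the Snowball stemmer inside a custom analyzer (the index, analyzer, and filter names below are just placeholders):

PUT my_spanish_index
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stems": {
          "type": "stemmer_override",
          "rules": [
            "alimentacion => aliment"
          ]
        },
        "spanish_snowball": {
          "type": "snowball",
          "language": "spanish"
        }
      },
      "analyzer": {
        "my_spanish_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "custom_stems",
            "spanish_snowball"
          ]
        }
      }
    }
  }
}

You can then verify the behaviour with the _analyze API; with the rule above, both terms should come out as "aliment":

POST my_spanish_index/_analyze
{
  "analyzer": "my_spanish_analyzer",
  "text": "alimento, alimentacion"
}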