I'm using the phonetic analysis plugin for Elasticsearch: https://github.com/elastic/elasticsearch-analysis-phonetic
When I create the index, I define a custom filter with the following settings:
soundex: {
type: "phonetic",
encoder: "metaphone",
replace: "true"
}
This works, but it produces Metaphone tokens with a maximum length of 4 characters, which adds too much noise to my search results. For example, I get KNTR for both "contraceptive" and "control" (it's medical data).
According to "Unexpected results from Metaphone algorithm", the underlying Java API provides a setMaxCodeLen method. How do I set this when configuring the filter in Elasticsearch?
I'd like to do something like:
soundex: {
type: "phonetic",
encoder: "metaphone",
replace: "true",
maxcodelen: 8
}
But so far I've been unable to determine whether it's possible to configure the encoder to increase the maximum length of the encoded tokens. Is this possible? If so, how?
I don't think it's possible to configure this out of the box. However, I've checked the plugin's source code, and it looks easy to add what you are asking for.
In PhoneticTokenFilterFactory.java you will see:
this.maxcodelength = 0;
this.replace = settings.getAsBoolean("replace", true);
As you can guess, the replace parameter can be configured, but maxcodelength is always set to 0. So you can replace that line with something like:
this.maxcodelength = settings.getAsInt("maxcodelen", 0);
I named the new property "maxcodelen" because it's the name you use in your example.
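To illustrate the pattern, here is a minimal, self-contained sketch of reading an optional integer setting with a default and applying it to an encoder only when it is set. Everything in it is hypothetical: StubEncoder is a stand-in for an encoder with a setMaxCodeLen method (such as the one mentioned in the question), and the Map-based getAsInt only mimics settings.getAsInt. I have not verified that the plugin passes maxcodelength on to the metaphone encoder, so a line like the setMaxCodeLen call below may also be needed in the factory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these classes belong to the plugin.
class MaxCodeLenSketch {

    // Stand-in for an encoder that exposes setMaxCodeLen.
    // The default of 4 is an assumption matching the 4-character codes seen in the question.
    static class StubEncoder {
        private int maxCodeLen = 4;

        void setMaxCodeLen(int maxCodeLen) {
            this.maxCodeLen = maxCodeLen;
        }

        int getMaxCodeLen() {
            return maxCodeLen;
        }
    }

    // Mimics settings.getAsInt(key, defaultValue): parse the value if present, else fall back.
    static int getAsInt(Map<String, String> settings, String key, int defaultValue) {
        String raw = settings.get(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    // Reads "maxcodelen" and applies it only when it was set to a positive value,
    // so 0 keeps the encoder's own default.
    static StubEncoder buildEncoder(Map<String, String> settings) {
        int maxcodelength = getAsInt(settings, "maxcodelen", 0);
        StubEncoder encoder = new StubEncoder();
        if (maxcodelength > 0) {
            encoder.setMaxCodeLen(maxcodelength);
        }
        return encoder;
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        System.out.println(buildEncoder(settings).getMaxCodeLen()); // prints 4

        settings.put("maxcodelen", "8");
        System.out.println(buildEncoder(settings).getMaxCodeLen()); // prints 8
    }
}
```

Treating 0 as "not set" keeps existing indices working unchanged when the new property is omitted.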
Then compile the plugin and install the modified build from a local file (check how to install local plugins).
If everything works and you feel like it, send a pull request :)