elasticsearch stormcrawler elasticsearch-analyzers

How to stop storing special characters in content while indexing


This is a sample document with the following points: Pharmaceutical Marketing Building – responsibilities.  Mass. – Aug. 13, 2020 –Â

How can I remove special characters or non-ASCII Unicode characters from the content while indexing? I'm using Elasticsearch 7.x and StormCrawler 1.17.


Solution

  • This looks like incorrect charset detection. You could normalise the content before indexing by writing a custom parse filter and removing the unwanted characters there.
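
As a rough illustration of the clean-up step, here is a minimal sketch in plain Java. The class name `ContentCleaner` and the method `clean` are hypothetical and are not part of StormCrawler. It assumes you really do want to discard every non-ASCII character, which also drops legitimate accented letters, so it treats the symptom rather than the charset detection itself.

```java
// Hypothetical helper, not part of StormCrawler: strips non-ASCII characters
// from extracted text before it is sent to the indexer.
final class ContentCleaner {

    private ContentCleaner() {
        // static utility, no instances
    }

    static String clean(String text) {
        if (text == null) {
            return "";
        }
        // Replace every character outside the ASCII range with a space,
        // so that words on either side of it are not glued together.
        String asciiOnly = text.replaceAll("[^\\p{ASCII}]", " ");
        // Collapse the runs of whitespace left behind and trim the ends.
        return asciiOnly.replaceAll("\\s+", " ").trim();
    }
}
```

In a custom parse filter, a helper like this would presumably be applied to the extracted text of the document inside the filter's `filter` method, before the indexing bolt sees it. Applied to the sample from the question, `ContentCleaner.clean(...)` yields `Building responsibilities. Mass. Aug. 13, 2020`.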