solr solrnet

Solr stop words do not seem to work: stop words are removed at index time, but not at query time in a proximity search


I am using Solr 8.2.0. I am trying to configure proximity search, but Solr doesn't seem to remove the stop words from the query.

    <fieldType name="psearch" class="solr.TextField" positionIncrementGap="100" multiValued="true">
      <analyzer type="index">
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.ClassicTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      </analyzer>
    </fieldType>

I have listed the stop words in the stopwords.txt file in the config directory. At index time Solr removes these words, as the indexed terms for the field show.

I also checked the Analysis tab, and the stop words are removed there as well.

And here is the field :

    <field name="pSearchField" type="psearch" indexed="true" stored="true" multiValued="false"/>
    <copyField source="example" dest="pSearchField"/>

However, when I search with a proximity of 1, 2, or 3, no results are returned.


Solution

  • This is a known problem in Solr 5 and later, which no longer rewrites the position of each token when the stop filter removes a word. The issue, along with a few suggestions for how to fix it, is tracked in SOLR-6468.

    The easiest solution is to introduce a MappingCharFilterFactory, but I'm skeptical of it because it changes characters anywhere inside a string (i.e. "to" => "" would also affect "veto", not just "to"). This can possibly be handled with multiple PatternReplaceCharFilterFactory instances instead.
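The concern about substring mappings can be illustrated outside Solr with plain Java. This is only a sketch: the class and method names are hypothetical, `String.replace` stands in for a "to" => "" character mapping, and the `\b` word-boundary regex stands in for the kind of pattern a PatternReplaceCharFilterFactory might be given. Neither method touches Solr itself.

```java
import java.util.regex.Pattern;

public class StopWordStripDemo {

    // Naive substring removal, analogous to a char mapping of "to" => "".
    // It also removes "to" when it occurs inside a longer word.
    static String stripSubstring(String text, String stopWord) {
        return text.replace(stopWord, "");
    }

    // Whole-word removal: \b anchors the match to word boundaries,
    // so "to" inside "veto" is left alone.
    static String stripWholeWord(String text, String stopWord) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(stopWord) + "\\b");
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "veto to go";
        // Brackets make the leftover whitespace visible.
        System.out.println("[" + stripSubstring(text, "to") + "]"); // prints "[ve  go]"
        System.out.println("[" + stripWholeWord(text, "to") + "]"); // prints "[veto  go]"
    }
}
```

Either way the stop word disappears before tokenization, so the tokenizer never sees it and no position gap is recorded for it.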

    Another option shown in the thread for the ticket is to use a custom filter that rewrites the position data for each token:

    package filters;

    import java.io.IOException;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.analysis.util.TokenFilterFactory;

    // Factory that lets the filter be referenced from an analyzer chain.
    public class RemoveTokenGapsFilterFactory extends TokenFilterFactory {

        public RemoveTokenGapsFilterFactory(Map<String, String> args) {
            super(args);
        }

        @Override
        public TokenStream create(TokenStream input) {
            return new RemoveTokenGapsFilter(input);
        }

    }

    // Resets the position increment of every token to 1, closing the gaps
    // that removed stop words leave behind.
    final class RemoveTokenGapsFilter extends TokenFilter {

        private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

        public RemoveTokenGapsFilter(TokenStream input) {
            super(input);
        }

        @Override
        public final boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Note: this also flattens tokens stacked at the same position
            // (increment 0), such as synonyms.
            posIncrAtt.setPositionIncrement(1);
            return true;
        }
    }
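To make the effect of the filter concrete, here is a small simulation in plain Java with no Lucene dependency. It is a sketch under stated assumptions: the class and method names are hypothetical, and the position arithmetic only mimics how a stop filter that preserves increments would number the surviving tokens, compared with forcing every increment to 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PositionGapDemo {

    // Positions of surviving tokens when each removed stop word still
    // advances the position counter (gaps are preserved).
    static List<Integer> positionsWithGaps(List<String> tokens, Set<String> stopWords) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        int increment = 0;
        for (String token : tokens) {
            increment++;
            if (stopWords.contains(token)) {
                continue; // token dropped, its increment carries over
            }
            position += increment;
            positions.add(position);
            increment = 0;
        }
        return positions;
    }

    // Positions of surviving tokens when every increment is forced to 1,
    // which is what the custom filter above does.
    static List<Integer> positionsWithoutGaps(List<String> tokens, Set<String> stopWords) {
        List<Integer> positions = new ArrayList<>();
        int position = -1;
        for (String token : tokens) {
            if (stopWords.contains(token)) {
                continue;
            }
            position += 1;
            positions.add(position);
        }
        return positions;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("quick", "fox", "of", "the", "forest");
        Set<String> stopWords = new HashSet<>(Arrays.asList("of", "the"));
        System.out.println(positionsWithGaps(tokens, stopWords));    // prints [0, 1, 4]
        System.out.println(positionsWithoutGaps(tokens, stopWords)); // prints [0, 1, 2]
    }
}
```

With gaps, "fox" and "forest" sit three positions apart; without them they are adjacent. Assuming the compiled classes are on Solr's classpath, the factory would presumably be referenced as `<filter class="filters.RemoveTokenGapsFilterFactory"/>` after the StopFilterFactory in both the index and query analyzers, followed by a reindex.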
    

    As far as I know, there is currently no perfect built-in solution to this issue.