In our Solr-based search, we started out using phrase queries. For example, when the user types
blue dress
then the Solr query will be
title:"blue dress" OR description:"blue dress"
We now want to remove stop words. Using the default StopFilterFactory, the query
the blue dress
will match documents containing "blue dress" or "the blue dress".
However, when typing
blue the dress
then it does not match documents containing "blue dress".
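One plausible explanation, sketched below as a simplified model and not as Lucene's actual code, is that the stop filter drops the word but keeps the position it occupied. Under that assumption, the phrase "blue the dress" asks for "dress" two positions after "blue", which a document containing "blue dress" cannot satisfy. The class name, the stop-word set and the whitespace tokenization are all made up for illustration, and the sketch ignores the n-gram and stemming filters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of phrase matching after stop-word removal.
// NOT Lucene code: STOP_WORDS and both methods are hypothetical.
class PhrasePositionSketch {
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    // Splits on whitespace and replaces each stop word with null,
    // so the remaining terms keep their original positions.
    static String[] analyze(String text) {
        String[] slots = text.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < slots.length; i++) {
            if (STOP_WORDS.contains(slots[i])) {
                slots[i] = null; // a "hole" left by the removed stop word
            }
        }
        return slots;
    }

    // True if the query terms occur in the document at the same relative positions.
    // A hole in the query matches anything; a hole in the document matches no term.
    static boolean phraseMatches(String query, String document) {
        String[] q = analyze(query);
        String[] d = analyze(document);
        for (int shift = -q.length; shift < d.length; shift++) {
            boolean ok = true;
            boolean sawTerm = false;
            for (int i = 0; i < q.length; i++) {
                if (q[i] == null) {
                    continue;
                }
                sawTerm = true;
                int j = i + shift;
                if (j < 0 || j >= d.length || !q[i].equals(d[j])) {
                    ok = false;
                    break;
                }
            }
            if (ok && sawTerm) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatches("the blue dress", "blue dress")); // true
        System.out.println(phraseMatches("blue the dress", "blue dress")); // false
    }
}
```

In this model a leading stop word is harmless because only relative positions matter, while a stop word in the middle of the phrase leaves a gap that the document must also have.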
I am starting to wonder if we shouldn't instead only search using single terms. That is, convert the above user search into
title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress
I am a bit reluctant to do this, though, as it seems to duplicate the work of the StandardTokenizerFactory.
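To make the single-term idea concrete, here is a minimal sketch of what that client-side conversion could look like. The class and method names are hypothetical, and the whitespace split is exactly the kind of naive tokenization I am worried about: it does no escaping of query syntax characters and knows nothing about the analyzers configured in Solr.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: turns raw user input into field:term clauses joined by OR.
class NaiveTermQuery {
    static String build(String userInput, String... fields) {
        // Naive whitespace split; Solr's own tokenizer would behave differently.
        String[] terms = userInput.trim().split("\\s+");
        List<String> clauses = new ArrayList<>();
        for (String field : fields) {
            for (String term : terms) {
                if (!term.isEmpty()) {
                    clauses.add(field + ":" + term);
                }
            }
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("the blue dress", "title", "description"));
    }
}
```

For the input "the blue dress" this produces the same query string as the one written out above.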
Here is my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
The title and the description fields are both of type text_general.
Is single-term search the standard way of searching in Solr? Are we exposing ourselves to problems (performance issues, maybe) by tokenising the words before calling Solr? Maybe thinking in terms of single terms vs. phrases is just wrong and we should leave it to the user to decide?
Although the initial approach might work if the query were split into multiple title:term clauses, this is error-prone (the tokens might be split in the wrong places) and duplicates, probably badly, the work done by the built-in tokenizer.
The right approach is to keep the initial query as-is and rely on the Solr configuration to handle it properly. This makes sense, but the difficulty was that I wanted to specify the fields to search in. It turns out that the default query parser, the one known as LuceneQParserPlugin, has no parameter for that: its df parameter names a single default field, so searching several fields means writing field:term clauses into the query itself. (Confusingly, there is a parameter called fl, for Field List, but it specifies the returned fields, not the fields to search in.)
For completeness, it is possible to simulate a list of fields to search in by using the copyField configuration in schema.xml to copy them all into one field. I find this neither elegant nor flexible enough.
The elegant solution is to use the ExtendedDisMax query parser, aka edismax. With it, we can maintain the query as is, and fully leverage the configuration in the schema. In our case, it looks like this:
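import org.apache.solr.client.solrj.SolrQuery;

SolrQuery solrQuery = new SolrQuery();
solrQuery.set("defType", "edismax");       // use the ExtendedDisMax query parser
solrQuery.set("q", query);                 // the raw user input, e.g. "blue the dress"
solrQuery.set("qf", "description title");  // the fields to search in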
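For anyone not going through SolrJ, the same three parameters can be sent as plain request parameters. The sketch below only builds the URL-encoded query string with the standard library; the class name is hypothetical, and the /solr/mycore/select path printed in main is an assumed core name, not something from our setup.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper that builds the edismax request parameters by hand.
class EdismaxRequestSketch {
    static String queryString(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");      // select the ExtendedDisMax parser
        params.put("q", userQuery);            // the user input, passed through untouched
        params.put("qf", "description title"); // the fields to search in

        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            joined.add(entry.getKey() + "=" + encode(entry.getValue()));
        }
        return joined.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // "mycore" is a placeholder core name.
        System.out.println("/solr/mycore/select?" + queryString("blue the dress"));
    }
}
```

The point is that the user's text goes into q unchanged, and the choice of fields lives entirely in qf.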
According to this page:
(e)Dismax generally makes the best first choice query parser for user facing Solr applications
It would have helped if this had indeed been the default choice.