I'm stuck on the following issue. I have a text field that stores bed and bath information. At index time I store values like 2b 3bt for 2 beds and 3 baths respectively. I need to support queries like "2 beds 3 baths", "beds 2 3 baths", "2 bed rooms 3 baths", "2bd 3bth", and so on.
To achieve this, I use a text field with the text_general type, defined as below:
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)((\d\.?\d{0,2})\s*(bed\s*rooms|bed\s*room|beds|bed|bdr|bd|br|b)|(bed\s*rooms|bed\s+room|beds|bed|bdr|bd|br|b)\s*(\d\.?\d{0,2}))" replacement="$2$5b" />
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?i)((\d\.?\d{0,2})\s*(bath\s*rooms|bath\s*room|baths|bath|bth|bt|bh|ba)|(bath\s*rooms|bath\s*room|baths|bath|bth|bt|bh|ba)\s*(\d\.?\d{0,2}))" replacement="$2$5bt" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I tried Solr queries through the admin interface, and it works for almost all combinations. The exception is input with spaces between the number and the unit, like "6 beds 6 baths" or "6 bed room 6 bath room"; "6beds 6baths", on the other hand, gets me correct results. Here is the URL with the parameters that I pass to Solr for this query:
/solr/select?q=6b+6ba&wt=xml&indent=true&q.op=AND
I checked the Solr admin analysis interface for each of these cases and found no difference at all. Since the analysis phase produces the same output, I expected both queries to behave the same way. Can anyone explain why these two queries behave differently?
This is what I see in the Solr admin analysis interface for the two queries in question (PRCF = PatternReplaceCharFilter, ST = StandardTokenizer, TF = TrimFilter, SF = StopFilter, LCF = LowerCaseFilter):
For input: 6 beds 6 bath room
PRCF 6b 6bath room
PRCF 6b 6bt
ST 6b | 6bt
TF 6b | 6bt
SF 6b | 6bt
LCF 6b | 6bt
For input: 6b 6bt
PRCF 6b 6bt
PRCF 6b 6bt
ST 6b | 6bt
TF 6b | 6bt
SF 6b | 6bt
LCF 6b | 6bt
Sample inputs and outputs: here are some sample inputs that I tried with the field definition above. Note: (#) is just the serial number and is not part of the input.
(1) 2beds 3baths Fresno
(2) 3baths 2beds Fresno
(3) Fresno 2bedroom 3bathroom
(4) beds2 3baths Fresno
(5) beds2 bathrooms3 Fresno
All of the above work fine. These inputs are still a problem with the current field definition:
(6) 2 beds 3 baths Fresno
(7) 2 bed rooms 3 baths Fresno
(8) Fresno 2 bed room 3 baths
(9) Fresno 3baths 2 bed rooms
The output that I expect after the analysis phase, in the same serial number order, is as below (since a listing with 2 beds and 3 baths is indexed as 2b 3bt):
(1) 2b 3bt Fresno
(2) 3bt 2b Fresno
(3) Fresno 2b 3bt
(4) 2b 3bt Fresno
(5) 2b 3bt Fresno
(6) 2b 3bt Fresno
(7) 2b 3bt Fresno
(8) Fresno 2b 3bt
(9) Fresno 3bt 2b
Up to this point I think I'm doing fine, since I can generate exactly this output during analysis, which I confirmed through the Solr admin Analysis interface. The real issue is that the query fetches correct search results only for inputs #1 to #5; for inputs #6 to #9 I don't get any results.
This is the query that I try for input #1, i.e. 2beds 3baths Fresno:
/solr/collection1/select?q=2beds+3baths+Fresno&wt=xml&indent=true&q.op=AND
And this one for #6, i.e. 2 beds 3 baths Fresno:
/solr/collection1/select?q=2+beds+3+baths+Fresno&wt=xml&indent=true&q.op=AND
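One possible explanation, stated here as an assumption rather than something confirmed in this thread: the standard query parser splits the query string on whitespace before handing each chunk to the field's query analyzer, while the admin Analysis page runs the whole string through the analyzer in one pass. The sketch below imitates both paths in plain Java using the two charFilter patterns from the field type. The class and method names are hypothetical, and it only mimics the charFilter step, not Solr itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class WhitespaceSplitDemo {
    // The two query-time charFilter patterns, copied from the field type.
    static final Pattern BED = Pattern.compile(
        "(?i)((\\d\\.?\\d{0,2})\\s*(bed\\s*rooms|bed\\s*room|beds|bed|bdr|bd|br|b)"
        + "|(bed\\s*rooms|bed\\s+room|beds|bed|bdr|bd|br|b)\\s*(\\d\\.?\\d{0,2}))");
    static final Pattern BATH = Pattern.compile(
        "(?i)((\\d\\.?\\d{0,2})\\s*(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)"
        + "|(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)\\s*(\\d\\.?\\d{0,2}))");

    // Apply the bed replacement, then the bath replacement, in charFilter order.
    static String applyCharFilters(String text) {
        String afterBed = BED.matcher(text).replaceAll("$2$5b");
        return BATH.matcher(afterBed).replaceAll("$2$5bt");
    }

    // Whole string in one pass, as the Analysis page appears to do.
    static String analyzeWhole(String query) {
        return applyCharFilters(query);
    }

    // Assumption: the query parser splits on whitespace first, so each
    // chunk reaches the charFilters on its own.
    static List<String> analyzePerChunk(String query) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            chunks.add(applyCharFilters(chunk));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(analyzeWhole("6 beds 6 baths"));
        System.out.println(analyzePerChunk("6 beds 6 baths"));
        System.out.println(analyzePerChunk("6beds 6baths"));
    }
}
```

Run as is, this prints `6b 6bt`, then `[6, beds, 6, baths]`, then `[6b, 6bt]`. If the assumption holds, the spaced query never produces the terms 6b and 6bt at search time, which with q.op=AND would explain the empty result set.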
The final solution that I applied is as below.
I removed the PatternReplaceCharFilterFactory entries for bed and bath from the query-time analyzer and did a similar pattern replacement on the input text in my servlet.
So now, for the following input text
2 beds 3 baths Fresno
in my servlet code I convert it to
2b 3bt Fresno
This is what I then pass on to Solr, and it now works fine.
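A minimal sketch of such a pre-processing step, assuming the two charFilter patterns are reused unchanged in Java. The class name, the method names and the way the URL is assembled are hypothetical; this is an illustration, not the exact servlet code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Pattern;

class BedBathNormalizer {
    // Same patterns that were removed from the query-time analyzer.
    private static final Pattern BED = Pattern.compile(
        "(?i)((\\d\\.?\\d{0,2})\\s*(bed\\s*rooms|bed\\s*room|beds|bed|bdr|bd|br|b)"
        + "|(bed\\s*rooms|bed\\s+room|beds|bed|bdr|bd|br|b)\\s*(\\d\\.?\\d{0,2}))");
    private static final Pattern BATH = Pattern.compile(
        "(?i)((\\d\\.?\\d{0,2})\\s*(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)"
        + "|(bath\\s*rooms|bath\\s*room|baths|bath|bth|bt|bh|ba)\\s*(\\d\\.?\\d{0,2}))");

    // Rewrite the user's text into the indexed form, e.g. "2b 3bt".
    static String normalize(String userInput) {
        String afterBed = BED.matcher(userInput).replaceAll("$2$5b");
        return BATH.matcher(afterBed).replaceAll("$2$5bt");
    }

    // Build the select URL in the same shape as the queries shown earlier.
    static String buildSelectUrl(String userInput) {
        try {
            String q = URLEncoder.encode(normalize(userInput), "UTF-8");
            return "/solr/collection1/select?q=" + q + "&wt=xml&indent=true&q.op=AND";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("2 beds 3 baths Fresno"));
        System.out.println(buildSelectUrl("Fresno 3baths 2 bed rooms"));
    }
}
```

Because the replacement runs on the full input string before Solr sees it, the spaces between number and unit no longer matter: by the time the query arrives, each bed or bath expression is already a single whitespace-free term.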
Here is the modified fieldType definition for the text_general field:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>