solr, coldfusion, coldfusion-10, cfsearch, cfindex

Does HTMLStripCharFilterFactory @ Solr 3.4 strip out html for returned fields?


I'm using CF10, which should be running Solr 3.4 according to corporatezen.com/2013/11/updating-solr-engine-coldfusion. I added <charFilter class="solr.HTMLStripCharFilterFactory"/> to <fieldType name="text">, but the summary field in the search results still includes HTML. Any idea why?

<field name="summary" type="text" indexed="false" stored="true" required="false" />

http://localhost:8985/solr/test/admin/schema.jsp shows:

Field: summary
Field Type: TEXT
Properties: Tokenized, Stored
Schema: Tokenized, Stored
Position Increment Gap: 100

Index Analyzer: org.apache.solr.analysis.TokenizerChain

  Char Filters:
    org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24 }

  Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

  Filters:
    org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true enablePositionIncrements: true luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 1 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 1 }
    org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24 }

Query Analyzer: org.apache.solr.analysis.TokenizerChain

  Char Filters:
    org.apache.solr.analysis.HTMLStripCharFilterFactory args:{luceneMatchVersion: LUCENE_24 }

  Tokenizer Class: org.apache.solr.analysis.WhitespaceTokenizerFactory

  Filters:
    org.apache.solr.analysis.SynonymFilterFactory args:{synonyms: synonyms.txt expand: true ignoreCase: true luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.StopFilterFactory args:{words: stopwords.txt ignoreCase: true luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.WordDelimiterFilterFactory args:{splitOnCaseChange: 1 generateNumberParts: 1 catenateWords: 0 luceneMatchVersion: LUCENE_24 generateWordParts: 1 catenateAll: 0 catenateNumbers: 0 }
    org.apache.solr.analysis.LowerCaseFilterFactory args:{luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.EnglishPorterFilterFactory args:{protected: protwords.txt luceneMatchVersion: LUCENE_24 }
    org.apache.solr.analysis.RemoveDuplicatesTokenFilterFactory args:{luceneMatchVersion: LUCENE_24 }


Solution

  • You need to distinguish between the stored value and the indexed value of a field. The filter you added alters the tokens that go into Solr's index for searching, but not the stored value that is returned for display.

    Solr keeps two versions of a field*. One is the stored version: the original text, in your case with the HTML included. The other is the indexed version, to which the whole analysis chain (char filters, tokenizer, token filters) has been applied.

    When you perform a search, the index is used to look up which documents match. When the results are displayed, the stored version is what gets returned to you.


    * This applies only if the field is declared with both stored="true" and indexed="true".
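
To make the stored/indexed split concrete, here is a minimal Python sketch. It is a toy model, not Solr code: `ToyIndex`, `strip_html`, and the sample document are all hypothetical names invented for illustration, and the real analysis chain does far more than this.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Rough stand-in for what an HTML-stripping char filter does."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


class ToyIndex:
    """Hypothetical model: an inverted index plus a separate stored-value map."""

    def __init__(self):
        self.inverted = {}  # token -> set of doc ids (the "indexed" side)
        self.stored = {}    # doc id -> original text (the "stored" side)

    def add(self, doc_id, text):
        # The stored value is kept verbatim; no analysis touches it.
        self.stored[doc_id] = text
        # Only the tokens that go into the index see the HTML stripping.
        for token in strip_html(text).lower().split():
            self.inverted.setdefault(token, set()).add(doc_id)

    def search(self, term):
        # Matching uses the index; the returned values come from the stored map.
        ids = self.inverted.get(term.lower(), set())
        return [self.stored[i] for i in sorted(ids)]


idx = ToyIndex()
idx.add(1, "<p>Hello <b>World</b></p>")
hits = idx.search("world")
```

In this sketch `hits` contains the original markup even though the match was made against stripped tokens, which mirrors the behaviour described above. If the displayed value should be free of tags, one option is to strip the markup yourself before sending the document to Solr, for example by passing `strip_html(text)` to `add` in this toy model.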