ruby-on-railssolrsunspotsunspot-railssunspot-solr

How to index big content in Solr-5.2.1?


We have content like more than 32KB, We unable to index the content

Please refer below logs

Rails log:

RSolr::Error::Http: RSolr::Error::Http - 400 Bad Request Error: {'responseHeader'=>{'status'=>400,'QTime'=>13},'error'=>{'msg'=>'Exception writing document id Article 872cc4f7-8731-4049-b889-85a040edb543 to the index; possible analysis error.','code'=>400}}

Solr Log:

INFO  - 2015-11-04 15:00:30.772; [   collection] org.apache.solr.update.processor.LogUpdateProcessor; [collection] webapp=/solr path=/update params={wt=ruby} {} 0 27

ERROR - 2015-11-04 15:00:30.779; [   collection] org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: Exception writing document id Article 872cc4f7-8731-4049-b889-85a040edb543 to the index; possible analysis error.

. . .

 Caused by: java.lang.IllegalArgumentException: Document contains at least one immense term in field="content_textv" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[60, 112, 62, 83, 109, 97, 108, 108, 32, 97, 110, 100, 32, 77, 101, 100, 105, 117, 109, 32, 83, 99, 97, 108, 101, 32, 69, 110, 116, 101]...'

Content Field Type:

 <field name="content_textv" type="strings"/>

....

 <fieldType name="strings" class="solr.StrField" multiValued="true" sortMissingLast="true"/>

How to index big contents?


Solution

  • Instead of solr.StrField use solr.TextField. Create a new fieldType like -

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="false">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      </analyzer>
    </fieldType>
    

    than you can use that fieldtype as -

    <field name="content_textv" type="text_general" indexed="true" stored="false" multiValued="true"/>