
How does SOLR Cell add document content?

SOLR has a module called Cell. It uses Tika to extract content from documents and index it with SOLR.

From the sources, I conclude that Cell places the raw extracted document text into a field called "content". The field is indexed by SOLR, but not stored. When you query for documents, "content" doesn't come up.

My SOLR instance has no schema of my own (I left the default schema in place).

I'm trying to implement a similar kind of behavior using the default UpdateRequestHandler (POST to /solr/corename/update). The POST request goes:

<add commitWithin="60000">
    <doc>
        <field name="content">lorem ipsum</field>
        <field name="id">123456</field>
        <field name="someotherfield_i">17</field>
    </doc>
</add>

With documents added in this manner, the content field is indexed and stored. It's present in query results. I don't want it to be; it's a waste of space.

What am I missing about the way Cell adds documents?


  • The Cell code does add the extracted text to the document under the name content, but a built-in field-mapping rule then renames content to _text_. In the default schemaless configuration, _text_ is declared with stored="false".

    The rule is applied by the following line in SolrContentHandler.addField():

    String name = findMappedName(fname);

    The params object carries the mapping fmap.content=_text_, which tells Cell to rename the content field to _text_. It comes from corename\conf\solrconfig.xml, where by default there's the following fragment:

    <requestHandler name="/update/extract"
                  class="solr.extraction.ExtractingRequestHandler" >
      <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="fmap.meta">ignored_</str>
        <str name="fmap.content">_text_</str> <!-- This one! -->
      </lst>
    </requestHandler>

    Meanwhile, in corename\conf\managed-schema there's a line:

    <field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>

    And that's the whole story.
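
    To get the same effect through the plain /update handler, one option is to send the text under _text_ directly instead of content. This is only a sketch: it assumes the default schema still defines _text_ as shown above, and the field values are the placeholders from the question.

    ```xml
    <add commitWithin="60000">
        <doc>
            <!-- _text_ is indexed but not stored in the default schema,
                 so it is searchable but absent from query results -->
            <field name="_text_">lorem ipsum</field>
            <field name="id">123456</field>
            <field name="someotherfield_i">17</field>
        </doc>
    </add>
    ```

    Documents that were already posted with a content field keep their stored copy until they are re-added, and the auto-created content field itself stays in the schema unless you remove it.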