Solr Cloud: How to disable document (pdf, office) metadata as fields


I am new to Solr. I am using Solr 7.3.1 in SolrCloud mode and trying to index PDF and Office documents using Solr's content extraction.

I created a collection with
bin\solr create -c tsindex -s 2 -rf 2

My SolrJ code looks like this:

public static void main(String[] args) {
    System.out.println("Solr Indexer");
    final String solrUrl = "http://localhost:8983/solr/tsindex/";
    HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
    String filename="C:\\iSampleDocs\\doc-file.doc";    
    ContentStreamUpdateRequest solrRequest = new ContentStreamUpdateRequest("/update/extract");
    try {
        solrRequest.addFile(new File(filename), "application/msword");
        solrRequest.setParam("litral.ts_ref", "ts-456123");
        //solrRequest.setParam("defaultField", "text");

        solrRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        NamedList<Object> result= solr.request(solrRequest);
        System.out.println(result);

    } catch (IOException e) {
        e.printStackTrace();
    } catch (SolrServerException e) {
        e.printStackTrace();
    }
}

I am running into multiple issues:

  1. Although I have created the field ts_ref as text_general in the Solr Admin UI, this field does not get set at all.

  2. My goal is to index the complete document, including its metadata, in one field, and then set a couple more fields that reference the document in another system, e.g. the ts_ref field. What actually happens is that Solr extracts the metadata of the files and creates a separate field for each metadata value.

I have tried disabling the data-driven schema functionality with
bin\solr config -c tsindex -zkHost localhost:9983 -property update.autoCreateFields -value false

When I uncomment the line solrRequest.setParam("defaultField", "text"); from the start, no separate fields are created for the extracted metadata. But as soon as I comment this line out and upload files, the metadata ends up in separate fields again, even if I uncomment the line again afterwards.


Solution

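    1. There is a typo in "litral.ts_ref": it is missing an "e". The parameter should be "literal.ts_ref", which is why ts_ref never gets set.
    2. You can ignore all the metadata fields by using the uprefix parameter together with a dynamic field that matches it. See the Solr documentation, which shows exactly that case.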
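
As a rough illustration of both points, here is a minimal sketch of the request parameters involved. It uses only the JDK so it can run without a Solr instance; the class name ExtractParams and its build method are hypothetical helpers, not part of SolrJ. It assumes the schema defines a dynamic field such as ignored_* whose type is neither indexed nor stored, and that the extracted body should go into a field called "text" (the field name used in the question). Note that uprefix only applies to fields that are not already defined in the schema, so metadata fields that were auto-created earlier would likely have to be removed first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractParams {

    // Hypothetical helper: collects the parameters for an /update/extract request.
    static Map<String, String> build(String tsRef) {
        Map<String, String> params = new LinkedHashMap<>();
        // Correct spelling is "literal.", not "litral."
        params.put("literal.ts_ref", tsRef);
        // Prefix every field that is NOT defined in the schema, so that it
        // matches the (assumed) ignored_* dynamic field and is dropped.
        params.put("uprefix", "ignored_");
        // Assumption: map Tika's "content" field to the schema field "text".
        params.put("fmap.content", "text");
        return params;
    }

    public static void main(String[] args) {
        // In the SolrJ code from the question, each entry would be applied
        // with solrRequest.setParam(key, value) before sending the request.
        build("ts-456123").forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

In the question's code, this amounts to replacing the single setParam call with one setParam call per entry in the map.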