I am new to Solr and using Solr 7.3.1 in solr cloud mode and trying to index pdf, office documents in solr, using contentextraction in solr.
I created a collection with
bin\solr create -c tsindex -s 2 -rf 2
in SolrJ my code looks like
public static void main(String[] args) {
System.out.println("Solr Indexer");
final String solrUrl = "http://localhost:8983/solr/tsindex/";
HttpSolrClient solr = new HttpSolrClient.Builder(solrUrl).build();
String filename="C:\\iSampleDocs\\doc-file.doc";
ContentStreamUpdateRequest solrRequest = new ContentStreamUpdateRequest("/update/extract");
try {
solrRequest.addFile(new File(filename), "application/msword");
solrRequest.setParam("litral.ts_ref", "ts-456123");
//solrRequest.setParam("defaultField", "text");
solrRequest.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
NamedList<Object> result= solr.request(solrRequest);
System.out.println(result);
} catch (IOException e) {
e.printStackTrace();
}catch ( SolrServerException e) {
e.printStackTrace();
}
}
I am getting multiple issues
Although I have created field ts_ref
as text_general
in Solr Admin UI, this field does not get set at all.
My goal is to index the complete document including its metadata in one field and then set couple of more fileds refrencing document in another system like e.g. ts_ref field. But what actually happens is the solr extracts the metadata of files and create seperate fileds for each metadata value.
I have tried disabling data driven schema functionality
by bin\solr config -c tsindex -zkHost localhost:9983 -property update.autoCreateFields -value false
When I uncomment line solrRequest.setParam("defaultField", "text");
from beginning, there is not separate fields for all metadata extracted, but as soon as I comment this line and upload the files, the meta data are again in separate fields afterwards (even if I uncomment its again).