I want to use Solr with PDF files, but I don’t know how to configure solrconfig.xml and schema.xml. What should I write in those files? The aim is full-text search with features such as synonyms or a spell checker. (I use Solr on Windows, and in the future I will use the SolrNet API.) Thank you!
You would use Tika to extract text from a PDF file.
Tika extracts metadata from the PDF document, such as title, author, and so on. As such, your schema should include fields for title and author.
Tika extracts the body of the PDF document to the content field, so your schema should also include a content field.
Once Tika is configured, you issue an HTTP POST to Solr, specifying the PDF file you wish to index:
curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"
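Since the question mentions Windows, where cmd.exe does not treat single quotes as quoting, the same request can be sketched in Python. This is a minimal sketch, not the only way to do it: it assumes the techproducts core and port from the curl example, and it sends the PDF as the raw request body with a Content-Type header rather than as a multipart form upload like curl -F. The function names build_extract_request and index_pdf are made up for illustration.

```python
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumption: same host, port, and core as the curl example above.
SOLR_CORE_URL = "http://localhost:8983/solr/techproducts"


def build_extract_request(pdf_bytes: bytes, doc_id: str,
                          core_url: str = SOLR_CORE_URL) -> Request:
    """Build (but do not send) a POST to the /update/extract handler."""
    query = urlencode({"literal.id": doc_id, "commit": "true"})
    return Request(
        f"{core_url}/update/extract?{query}",
        data=pdf_bytes,  # raw PDF bytes as the request body
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )


def index_pdf(path: str, doc_id: str) -> int:
    """Send the PDF at `path` to Solr and return the HTTP status code."""
    request = build_extract_request(Path(path).read_bytes(), doc_id)
    with urlopen(request) as response:
        return response.status
```

With Solr running, index_pdf("example/exampledocs/solr-word.pdf", "doc1") would index the same file as the curl command.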
If you need to map the fields Tika generates (title, author, content) to different fields in your Solr index, you can use the fmap feature:
fmap.content=text
would map Tika's extracted content field to Solr's text field.