pdf solr full-text-search apache-tika solr-cell

Indexing PDFs with page numbers in Solr


I'm indexing PDFs with Solr using the ExtractingRequestHandler. I would like to display the page number along with hits in a document, e.g. "term foo was found in bar.pdf on pages 2, 3 and 5."

Is it possible to include page numbers in the query result like this?


Solution

  • It would require some development effort, but you could achieve this by indexing each page of each document as a separate Solr document and then using field collapsing to group the page hits for each document.

    Note: when this answer was first written, field collapsing was only available in nightly builds. It has since been released in Solr 3.3, and more updates are expected in the next big version (Solr 4.0).
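    The page-per-document approach can be sketched roughly as follows. This is only an illustration, not a tested integration: it assumes the text of each page has already been extracted by some other means, and the field names (id, doc_id, page, text) and helper functions are made up for this example. The group, group.field and group.limit parameters are Solr's result grouping (field collapsing) parameters; the grouping field generally needs to be a single-valued, indexed string field.

```python
from urllib.parse import urlencode


def build_page_docs(doc_id, page_texts):
    """Turn one PDF into one Solr document per page.

    page_texts is a list of strings, one per page (assumed to be extracted
    elsewhere). Field names here are hypothetical and must match your schema.
    """
    return [
        {"id": f"{doc_id}_p{n}", "doc_id": doc_id, "page": n, "text": text}
        for n, text in enumerate(page_texts, start=1)
    ]


def build_grouped_query(term, max_pages=100):
    """Build a query string that groups page hits by their parent document."""
    return urlencode({
        "q": f"text:{term}",
        "group": "true",           # enable result grouping
        "group.field": "doc_id",   # one group per original PDF
        "group.limit": max_pages,  # page hits returned per group
        "fl": "doc_id,page",
        "wt": "json",
    })


def pages_by_document(grouped_response, group_field="doc_id"):
    """Map each document to the sorted page numbers that matched."""
    groups = grouped_response["grouped"][group_field]["groups"]
    return {
        g["groupValue"]: sorted(d["page"] for d in g["doclist"]["docs"])
        for g in groups
    }
```

    The list returned by build_page_docs could be posted to Solr's update handler, and the string from build_grouped_query appended to the select URL. pages_by_document then yields something like bar.pdf mapped to pages 2, 3 and 5, which is enough to render the message from the question.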