Tags: pdf, indexing, ms-office, orchardcms, orchardcms-1.6

Extending Orchard's search/indexing module to search in uploaded Word, Excel, PowerPoint and PDF files


Apparently the following module indexes only the content items of an Orchard site:

http://docs.orchardproject.net/Documentation/Search-and-indexing

If I upload a DOC, XLS, PPT or PDF file, its content won't be added to the index.

Is there an out-of-the-box way to include those contents, or do I have to extend the indexing mechanism?

If the latter is true, any hints are welcome on how to do that. Thank you!

EDIT: by 'uploading a file', I mean the standard media upload to the /Media folder.


Solution

  • It's not available out of the box, but you can implement it yourself, especially with the upcoming Orchard 1.7, which will turn uploaded media files into content items.

    There are a few extension points for this; the OnIndexing<T> content handler method is the easiest and most straightforward to use. This is where you should extract keywords from the file and add them to the index. Look at the existing implementations for examples.

    As for keyword extraction, I used iTextSharp for PDFs and the Open XML SDK 2.0 for Office documents (although the latter works only with the newer XML-based formats: DOCX, PPTX and XLSX). For the legacy binary Office formats (DOC, XLS, PPT) you'd need some other library; there are plenty of those on the web.
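
To illustrate why the XML-based formats are the easy case: a DOCX file is a ZIP package whose main text lives in word/document.xml. The sketch below is not Orchard or Open XML SDK code; it is a language-neutral illustration in Python (standard library only) of the extraction step whose output would then be handed to the indexer. The names extract_docx_text and build_sample_docx are made up for this example, and the sample package is a bare-minimum stand-in rather than a file Word would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx package.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(data: bytes) -> str:
    """Return the plain text of a .docx file, one line per non-empty paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more <w:t> runs.
        text = "".join(node.text or "" for node in paragraph.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def build_sample_docx() -> bytes:
    """Hypothetical helper: build a tiny in-memory package for the demo.

    It contains only word/document.xml, so it is enough for the extractor
    above but is not a complete document that Word would open.
    """
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:t>Quarterly </w:t></w:r><w:r><w:t>report</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Orchard search</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_docx_text(build_sample_docx()))
```

The legacy DOC, XLS and PPT files are binary rather than zipped XML, so this approach does not apply to them, which is why a separate library is needed for those formats.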