I have an HDFS archive that stores a variety of documents (PDF, MS Word, PPT, CSV, etc.). I would like to build a platform using Elasticsearch to search the files' text contents. I know I can use the es-hadoop plugin to index data from HDFS into ES. What are the best ways to extract the textual data from the documents stored in HDFS and index it?
Any help would be appreciated.
I did a lot of searching, and this is the list of methods I've found so far.
Here's the overall integrations/plugins page: https://www.elastic.co/guide/en/elasticsearch/plugins/master/integrations.html
Here's the replacement for the mapper attachment plugin, the Ingest Attachment plugin: https://www.elastic.co/guide/en/elasticsearch/plugins/current/ingest-attachment.html
A posting on how to use it: https://qbox.io/blog/index-attachments-files-elasticsearch-mapper
Here's a discussion of the pros and cons of using Ingest vs. FSCrawler (dadoonet is an Elastic developer): https://discuss.elastic.co/t/mapper-attachment-plugin-vs-pre-parsing-and-extracting-content-from-binary-files/73764/10
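To make the Ingest Attachment route concrete, here is a minimal sketch of the two JSON bodies involved: a pipeline with a single `attachment` processor, and a document whose file content is base64-encoded in the field that processor reads. The pipeline name, the `filename` field, and the sample bytes are my own placeholders; in practice the bytes would come from whatever HDFS client you use, and the bodies would be sent to `PUT _ingest/pipeline/<name>` and to an index request with `?pipeline=<name>`.

```python
import base64
import json


def attachment_pipeline(field="data"):
    """Body for PUT _ingest/pipeline/<name>: one attachment processor reading `field`."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw_bytes, filename, field="data"):
    """Body for an index request sent with ?pipeline=<name>.

    The attachment processor expects the file content as a base64 string.
    `filename` is a hypothetical extra field kept for display in search results.
    """
    return {
        "filename": filename,
        field: base64.b64encode(raw_bytes).decode("ascii"),
    }


# Placeholder bytes standing in for a file read from HDFS.
sample_bytes = b"Quarterly report: revenue grew."

pipeline_body = json.dumps(attachment_pipeline(), indent=2)
document_body = json.dumps(attachment_document(sample_bytes, "report.txt"))

print(pipeline_body)
print(document_body)
```

Note that base64 inflates each file by about a third, and the extraction work runs on the Elasticsearch ingest nodes, which is one of the trade-offs raised in the discussion linked above.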
Here is FSCrawler, the file system crawler (a standalone tool rather than an Elasticsearch plugin): https://github.com/dadoonet/fscrawler
Here is the Ambar document search system; they have a community GitHub repository with open source code:
https://ambar.cloud/
https://github.com/RD17/ambar
https://blog.ambar.cloud/ingesting-documents-pdf-word-txt-etc-into-elasticsearch/
They seem to use two database server types (MongoDB and Redis); I'm not sure why yet.
Here is Apache Tika, which Ingest and Ambar both use (and which also offers OCR through Tesseract, which I have heard Ingest does not support): http://tika.apache.org/1.16/
Also, in Ingest's usage of Tika, only a subset of file types is supported: https://discuss.elastic.co/t/full-list-of-supported-document-formats-by-es/81149
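If those limits push you toward pre-parsing with Tika yourself, a rough sketch of that route looks like the following. It assumes a standalone tika-server listening on its default port 9998 (its `/tika` endpoint accepts a `PUT` with the raw file and returns plain text when asked via the `Accept` header), and it builds an Elasticsearch bulk request body from the extracted text. The index name, the field names, and the use of the HDFS path as the document ID are my own placeholders, and older Elasticsearch versions also expect a `_type` in the action line.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: tika-server on its default port


def tika_request(raw_bytes, tika_url=TIKA_URL):
    """Build the PUT request that asks tika-server for plain-text extraction."""
    return urllib.request.Request(
        tika_url,
        data=raw_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(raw_bytes, tika_url=TIKA_URL):
    """Send the file to tika-server and return the extracted text.

    Needs a running tika-server, so it is not called in this sketch.
    """
    with urllib.request.urlopen(tika_request(raw_bytes, tika_url)) as resp:
        return resp.read().decode("utf-8")


def bulk_body(docs, index="documents"):
    """Build a newline-delimited bulk body from (path, text) pairs.

    Each document takes two lines: an action line, then the source line.
    The body must end with a trailing newline.
    """
    lines = []
    for path, text in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": path}}))
        lines.append(json.dumps({"path": path, "content": text}))
    return "\n".join(lines) + "\n"


# Hypothetical extracted text, standing in for extract_text(...) results.
print(bulk_body([("/archive/report.pdf", "Quarterly report: revenue grew.")]))
```

The resulting body would be posted to the `_bulk` endpoint with `Content-Type: application/x-ndjson`. Doing the extraction outside Elasticsearch keeps the parsing load off the cluster and lets you enable Tika features such as OCR.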
I hope the above saves other developers time, and that anyone who finds more options will comment below.
Thanks!