html, elasticsearch, pdf, stormcrawler

Setting up StormCrawler and ElasticSearch to crawl our website's HTML files and PDF documents


We are using StormCrawler and ElasticSearch to crawl our website. We followed the documentation for using ElasticSearch with StormCrawler. When we search in Kibana we get back results for the HTML files, but not the content or links of the PDF files. How do we set up StormCrawler to crawl our website's HTML and PDF files and store their content in Elastic? What configuration changes do we need to make? Does this have something to do with the outlinks settings? Is there documentation that tells us how to set up StormCrawler and ElasticSearch to crawl HTML and PDF documents?


Solution

  • You are probably looking at the 'content' index in Kibana, but you should also check the 'status' index; that is where the PDF documents will show up. A quick look at the logs would also have told you that the PDFs are being fetched but that the parser is skipping them: in the status index they have an ERROR status with a message mentioning 'content-type checking'.

    So, how do you fix it? Add the Tika module as a Maven dependency and follow the steps in its README. This way the PDF documents get redirected to the Tika ParserBolt, which can extract text and metadata from them, and they should then be indexed correctly into the 'content' index. A sketch of the wiring is shown below.
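    Here is a minimal sketch of that wiring in a Java topology, assuming the class names from the Tika module (RedirectionBolt and its ParserBolt, published at the time of writing as the storm-crawler-tika artifact under the com.digitalpebble.stormcrawler groupId). The component names ("jsoup", "shunt", "tika", "indexer") and the overall layout follow the usual StormCrawler examples, not your specific topology:

    ```java
    import org.apache.storm.topology.TopologyBuilder;

    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt;
    import com.digitalpebble.stormcrawler.tika.ParserBolt;
    import com.digitalpebble.stormcrawler.tika.RedirectionBolt;

    public class CrawlTopologySketch {

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // ... spout, URL partitioner, fetcher and sitemap bolts as in the
            // standard Elasticsearch example topology ...

            // JSoup keeps parsing the HTML pages as before
            builder.setBolt("jsoup", new JSoupParserBolt())
                   .localOrShuffleGrouping("sitemap");

            // RedirectionBolt passes parsed HTML straight through and sends
            // anything left unparsed (e.g. PDFs) on a separate "tika" stream
            builder.setBolt("shunt", new RedirectionBolt())
                   .localOrShuffleGrouping("jsoup");

            // Tika's ParserBolt extracts text and metadata from the PDFs
            builder.setBolt("tika", new ParserBolt())
                   .localOrShuffleGrouping("shunt", "tika");

            // The indexer listens to both branches, so HTML and PDF content
            // both end up in the 'content' index
            builder.setBolt("indexer", new IndexerBolt())
                   .localOrShuffleGrouping("shunt")
                   .localOrShuffleGrouping("tika");

            // ... status updater bolt, configuration and topology submission
            // omitted for brevity ...
        }
    }
    ```

    If you define the topology with Flux rather than Java, the same bolts and streams go into the YAML file instead. You may also need to set jsoup.treat.non.html.as.error to false in the configuration (if I remember the key correctly) so that the JSoup bolt hands non-HTML documents on to the redirection bolt instead of marking them as errors.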