apache-stormstormcrawler

Applying different parsefilters to each domain in the same topology


I am trying to crawl different websites (e-commerce websites) and extract specific information from the pages of each website (i.e. product price, quantity, date of publication, etc.). My question is: how to configure the parsing since each website has a different HTML layout which means I need different Xpaths for the same item depending on the website? Can we add multiple parser bolts in the topology for each website? If yes, how can we assign different parsefilters.json files to each parser bolt?


Solution

  • You need #586. At the moment there is no way to do it but to put all your XPATH expressions regardless of the site you want to use them on in the parsefilters.json.

    You can't assign different parsefilters.json to the various instances of a bolt.

    UPDATE however you could have multiple XpathFilters sections within the parseFilters.json. Each could cover a specific source, however, there is currently no way of constraining which source a parse filter gets applied to. You could extend XPathFilter so that it takes some extra config e.g. regular expression a URL must match in order to be applied. That would work quite nicely I think.

    I've recently added JsoupFilters which will be in the next release. These should be useful for your use case but that still doesn't solve the issue that you need an implementation of the filter that organizes the resources per host. It shouldn't be too hard to implement taking the URL filter one as a example and would also make a very nice contribution to the project.