I would like to be able to crawl very specific sub-directories of a given website.
For example: on the website www.world.com there may be multiple sub-directories such as /world or /bye. These in turn may contain multiple pages, e.g. /world/new. Let's assume that these pages themselves contain links to other pages which may not be in the same sub-directory (e.g. /world/new has a link to /bye/new).
What I would like to accomplish is to crawl the contents of every page under /world/ and only those pages.
Would it be a good idea to ignore any outgoing link unless it also belongs to the same sub-directory? I feel like a lot of the pages would not be reached that way, because they would not be linked to directly. For example, /world/new/ has a link to /bye/new, which in turn has a link to /world/next. This would cause the crawler to never reach the /world/next page (if I am understanding it correctly).
The alternative would be to crawl the entire website and then filter the content by URL after the crawl, which would make the job itself significantly larger than it needs to be.
Does StormCrawler have any configuration which could be used to make this simpler? Or maybe there is a better approach to this problem?
Thank you.
You've described the two possible approaches in your question. The easiest would be to use the URL filters and restrict the crawl to the area of the site that you are interested in, but as you pointed out, you might miss some content. The alternative is indeed more expensive, as you'd have to crawl the whole site and then filter as part of the indexing step; for this, you could add a simple parse filter which creates a key/value in the metadata for URLs that are in the section of interest and use that key as the value of indexer.md.filter.
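For the first approach, the restriction would typically live in the regex URL filter. As a minimal sketch, assuming the standard RegexURLFilter is declared in urlfilters.json and reads its patterns from a text file (the file name and exact package depend on your StormCrawler version):

    # regex filter file referenced by RegexURLFilter, e.g. default-regex-filters.txt
    # keep anything under /world/ on this host
    +^https?://(www\.)?world\.com/world/
    # reject everything else
    -.

For the second approach, the parse filter would be a small custom class which flags pages in the section of interest. The sketch below assumes the com.digitalpebble.stormcrawler packages and a made-up key name inWorldSection; double-check the filter() signature against the ParseFilter class in the version you are using:

    import java.net.MalformedURLException;
    import java.net.URL;

    import org.w3c.dom.DocumentFragment;

    import com.digitalpebble.stormcrawler.parse.ParseFilter;
    import com.digitalpebble.stormcrawler.parse.ParseResult;

    public class WorldSectionFilter extends ParseFilter {

        @Override
        public void filter(String url, byte[] content, DocumentFragment doc, ParseResult parse) {
            try {
                // mark documents whose path lies under /world/
                if (new URL(url).getPath().startsWith("/world/")) {
                    parse.get(url).getMetadata().setValue("inWorldSection", "true");
                }
            } catch (MalformedURLException e) {
                // leave documents with unparsable URLs unmarked
            }
        }
    }

The class would then be registered in parsefilters.json, and the indexing bolt told to keep only documents carrying that key, e.g. with indexer.md.filter: "inWorldSection=true" in the crawler configuration.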
Of course, if the site provides sitemaps, you'd know about all the URLs it contains in advance and in that case you'd be able to rely on the URL filter alone.
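If sitemaps exist, it might also be worth letting the crawler discover them from robots.txt; a sketch, assuming the sitemap parsing bolt is part of your topology (check the key name against your version's crawler-conf.yaml):

    # crawler-conf.yaml
    # detect sitemaps advertised in robots.txt and use them as a source of URLs
    sitemap.discovery: true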