
Find the link trace to a list of websites in Heritrix


I have recently been working with the Heritrix web crawler at my company, and after a while of searching and testing I can't find a way to solve our need.

We want to run Heritrix automatically from cron every day to crawl a list of web pages, and check whether any link on those pages points to a site on our list of domains. The difficult part, which I can't find a way to do, is logging the whole trace that leads to a link pointing to one of our domains.

The job's log file stores every link with some information, but not the trace. For example, we could run a script when the job is done that greps for brazzers, which is a domain in the list; if it finds "brazzers" in the crawl log, it should write the result to another log with the whole trace from start to end:

2015-10-25T20:18:58.369Z 200 91 http://cdn1.ads.brazzers.com/robots.txt XLEP http://cdn1.ads.brazzers.com/ text/plain #021 20151025201857643+726 sha1:CPA63O5POU3CVLCH3VDDIMBJCCWRVLPC - -

Is it possible to do this, or is there another way? I feel very stupid with this stuff, and I am not very good at programming.

Thank you very much in advance

Enrique.


Solution

  • Actually, there is a way to analyse the final log of the crawl job once it finishes. Thanks to a response from a Heritrix developer (https://groups.yahoo.com/neo), I now have the rule for getting the trace of a web link:

    The fourth field of a line in the crawl.log is the URI that was downloaded. The sixth field of the line tells you the URI that referred (directly preceded) the downloaded URI given in the fourth field. So generally, if you find "ourdomain" in the fourth field of a line, then you take the URI in the sixth field of that line and look for that as a fourth field in the crawl.log, you can find its referrer and follow back in this pattern until you hit a seed URI. You should know when you get to a seed URI because the sixth field will have a "-" instead of a URI (the discovery path given in the fifth field will also be a "-").

    In this way you can get the particular path that this crawl instance took from the seed to "ourdomain", though there may be multiple other paths existing that the crawler did not take in this instance.
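
As a rough illustration of the back-tracing rule described above, here is a minimal Python sketch. It assumes the crawl.log fields are separated by whitespace, that the fourth field is the downloaded URI and the sixth its referrer (as stated in the developer's answer), and that a plain substring match on the URI is good enough to spot "ourdomain". The function names `parse_crawl_log`, `trace_to_seed` and `find_traces` are hypothetical and not part of Heritrix.

```python
def parse_crawl_log(lines):
    """Map each downloaded URI (4th field) to its referrer (6th field)."""
    referrers = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        uri, referrer = fields[3], fields[5]
        # Keep only the first log entry seen for a given URI.
        referrers.setdefault(uri, referrer)
    return referrers


def trace_to_seed(uri, referrers):
    """Follow referrers back from `uri` until a seed (referrer "-") is hit."""
    path = [uri]
    seen = {uri}
    while True:
        # A URI missing from the log is treated like a seed: stop there.
        referrer = referrers.get(path[-1], "-")
        if referrer == "-" or referrer in seen:  # seed reached, or a loop
            break
        path.append(referrer)
        seen.add(referrer)
    path.reverse()  # present the path from seed to target
    return path


def find_traces(lines, domain):
    """Return one seed-to-target trace per logged URI containing `domain`."""
    referrers = parse_crawl_log(lines)
    return [trace_to_seed(uri, referrers) for uri in referrers if domain in uri]
```

To use it after a job finishes, one could pass the opened crawl.log file object as `lines` and write each returned trace to a separate log. The `seen` set is only a guard against referrer loops; it is an addition of this sketch, not something the developer's rule mentions.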

    With this, one way to sort the lines of the log file and build the web link trace is to write a snippet, for example in PHP, that follows the rules given.