java web-crawler heritrix

Is Heritrix 3.2.0 able to crawl AJAX-based web sites?


Is it possible to crawl AJAX-based web sites using Heritrix 3.2.0?


Solution

  • If you intend to make a "copy" of an AJAX website, clearly not: Heritrix does not execute JavaScript, so it cannot reproduce what the browser builds at run time.

    If you want to grab some data by analysing the content of the website, you can customize the crawler with an Extractor that determines which URLs to follow. On most websites you can easily guess the URLs that are relevant to your case without having to interpret the JavaScript. The AJAX callbacks are then crawled and passed to the Processor chain. By default, their responses are stored in the archive files.

    Making your own Extractor looks like this:

        import java.io.IOException;

        import org.archive.io.ReplayCharSequence;
        import org.archive.modules.CrawlURI;
        import org.archive.modules.extractor.ContentExtractor;
        import org.archive.modules.extractor.Hop;
        import org.archive.modules.extractor.LinkContext;

        public class MyExtractor extends ContentExtractor {

            // Run this extractor on every fetched URI; narrow this down if needed.
            @Override
            protected boolean shouldExtract(CrawlURI uri) {
                return true;
            }

            @Override
            protected boolean innerExtract(CrawlURI curi) {
                try {
                    ReplayCharSequence cs = curi.getRecorder().getContentReplayCharSequence();
                    // ... analyse the page content cs as a CharSequence
                    //     and derive the String uri you want to follow ...

                    // decide you want to crawl some page with url [uri] :
                    addOutlink(curi, uri, LinkContext.NAVLINK_MISC, Hop.NAVLINK);
                } catch (IOException e) {
                    // The recorded content could not be read: do not report extraction as finished.
                    return false;
                }
                return true;
            }
        }

    Compile it, put the jar file in the heritrix/lib directory, and insert a bean referring to MyExtractor in the fetchProcessors chain: basically, duplicate the extractorHtml line in the crawl job cxml file and point the copy at MyExtractor.
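
The "analyse the page content" step in the extractor above is left open. As a rough sketch only, it could be a plain-Java helper that scans the content for quoted strings that look like AJAX endpoints. The class name AjaxUrlFinder, the method findCandidateUrls and the /api/ prefix are all assumptions made for illustration; the pattern has to be adapted to whatever URLs the target site actually calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AjaxUrlFinder {

    // Hypothetical pattern: single- or double-quoted strings starting with /api/.
    // Replace the prefix with whatever the target site uses for its AJAX calls.
    private static final Pattern AJAX_URL =
            Pattern.compile("[\"'](/api/[^\"'\\s]+)[\"']");

    /** Returns every quoted /api/... path found in the content, in order of appearance. */
    public static List<String> findCandidateUrls(CharSequence content) {
        List<String> urls = new ArrayList<>();
        Matcher m = AJAX_URL.matcher(content);
        while (m.find()) {
            urls.add(m.group(1)); // group 1 is the path without the surrounding quotes
        }
        return urls;
    }

    public static void main(String[] args) {
        String page = "<script>$.get('/api/items?page=2', render);"
                + " fetch(\"/api/user/42\");</script>";
        for (String url : findCandidateUrls(page)) {
            System.out.println(url);
        }
    }
}
```

Because the helper takes a CharSequence, the cs obtained in innerExtract can be passed to it directly, and each returned string would then take the place of uri in the addOutlink call. Duplicates are not filtered here.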