java, multithreading, optimization, web-scraping, crawler4j

Improving performance of crawler4j


I need to write a web scraper that scrapes around 1M websites and saves their title, description and keywords into one big output file (containing the scraped URL and the related words). The URLs should be extracted from a big input file.

I ran Crawler4j on the 1M-URL file and started the web crawler using this: controller.start(MyCrawler.class, 20). 20 is an arbitrary number. Each crawler passes the resulting words into a blocking queue, and a single thread writes these words and the URL to the file. I used one writer thread in order not to synchronize on the file. I set the crawl depth to 0 (I only need to crawl my seed list).

After running this overnight I had only downloaded around 200K URLs. I'm running the scraper on one machine with a wired connection. Since most of the URLs are on different hosts, I don't think the politeness delay matters here.

EDIT

I tried starting Crawler4j with the non-blocking start, but it just blocked. My Crawler4j version is 4.2. This is the code I'm using:

CrawlConfig config = new CrawlConfig();
List<Header> headers = Arrays.asList(
        new BasicHeader("Accept", "text/html,text/xml"),
        new BasicHeader("Accept-Language", "en-gb, en-us, en-uk")
);
config.setDefaultHeaders(headers);
config.setCrawlStorageFolder(crawlStorageFolder);
config.setMaxDepthOfCrawling(0);
config.setUserAgentString("testcrawl");
config.setIncludeBinaryContentInCrawling(false);
config.setPolitenessDelay(10);

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

BlockingQueue<String> urlsQueue = new ArrayBlockingQueue<>(400);
controller = new CrawlController(config, pageFetcher, robotstxtServer);

ExecutorService executorService = Executors.newSingleThreadExecutor();
Runnable writerThread = new FileWriterThread(urlsQueue, crawlStorageFolder, outputFile);

executorService.execute(writerThread);

controller.startNonBlocking(() -> {
    return new MyCrawler(urlsQueue);
}, 4);

File file = new File(urlsFileName);
try (BufferedReader br = new BufferedReader(new FileReader(file))) {
    String url;
    while ((url = br.readLine()) != null) {
        controller.addSeed(url);
    }
}
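
FileWriterThread itself is not shown here; for context, this is a minimal sketch of what such a single-writer consumer could look like (the class name and constructor arguments mirror the call above, everything else is an assumption):

import java.io.*;
import java.util.concurrent.BlockingQueue;

// Sketch only: drains the queue and appends each record to the output file.
public class FileWriterThread implements Runnable {
    private final BlockingQueue<String> queue;
    private final File outputFile;

    public FileWriterThread(BlockingQueue<String> queue, String crawlStorageFolder, String outputFile) {
        this.queue = queue;
        this.outputFile = new File(crawlStorageFolder, outputFile);
    }

    @Override
    public void run() {
        try (BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile, true))) {
            while (!Thread.currentThread().isInterrupted()) {
                writer.write(queue.take()); // blocks until a crawler hands over a line
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore interrupt status and stop
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}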

EDIT 1 - This is the code for MyCrawler

public class MyCrawler extends WebCrawler {
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public static final String DELIMETER = "||||";
    private final StringBuilder buffer = new StringBuilder();
    private final BlockingQueue<String> urlsQueue;

    public MyCrawler(BlockingQueue<String> urlsQueue) {
        this.urlsQueue = urlsQueue;
    }

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches();
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData parseData = (HtmlParseData) page.getParseData();
            String html = parseData.getHtml();
            String title = parseData.getTitle();

            Document document = Jsoup.parse(html);
            buffer.append(url.replaceAll("[\n\r]", "")).append(DELIMETER).append(title);
            Elements descriptions = document.select("meta[name=description]");
            for (Element description : descriptions) {
                if (description.hasAttr("content"))
                    buffer.append(description.attr("content").replaceAll("[\n\r]", ""));
            }

            Elements elements = document.select("meta[name=keywords]");
            for (Element element : elements) {
                String keywords = element.attr("content").replaceAll("[\n\r]", "");
                buffer.append(keywords);
            }
            buffer.append("\n");
            String urlContent = buffer.toString();
            buffer.setLength(0);
            urlsQueue.add(urlContent);
        }
    }

    private boolean isSuccessful(int statusCode) {
        return 200 <= statusCode && statusCode < 400;
    }
}

So I have two questions:

  1. Can someone suggest any other way to make this process take less time? Maybe by tuning the number of crawler threads? Maybe some other optimizations? I'd prefer a solution that doesn't require several machines, but if you think that's the only way to go, could someone suggest how to do that? Maybe a code example?
  2. Is there any way to make the crawler start working on some URLs and keep adding more URLs during the crawl? I've looked at crawler.startNonBlocking, but it doesn't seem to work very well.

Thanks in advance


Solution

  • crawler4j is by default designed to run on one machine. From the field of web crawling, we know that web-crawler performance depends primarily on four resources: disk I/O, RAM, CPU and network bandwidth.

    The optimal number of threads depends on your hardware setup, so a single machine caps the achievable throughput and more machines will result in higher throughput; a thread-count sketch follows below. The next hard limitation is network bandwidth: if you are not attached via a high-speed connection, this will be the bottleneck of your approach.
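
    As a rough starting point (the multiplier here is an assumption to tune empirically, not a measured recommendation), the crawler-thread count can be derived from the core count, since the work is mostly IO-bound:

    // Sketch: pick a thread count relative to the available cores and tune from there.
    int numberOfCrawlers = Runtime.getRuntime().availableProcessors() * 8; // assumed multiplier
    controller.startNonBlocking(() -> new MyCrawler(urlsQueue), numberOfCrawlers);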

    Moreover, crawler4j is not designed to load such a huge seed file by default. This is because crawler4j respects crawler politeness: before the crawl starts, every seed point is checked for a robots.txt file, which can take quite a bit of time (see the sketch below).
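
    If politeness and robots.txt handling are not required for your use case (a policy decision, not a given), the per-seed robots.txt lookups can be switched off via RobotstxtConfig, for example:

    // Sketch: disable robots.txt fetching so seeds are not checked one by one.
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
    robotstxtConfig.setEnabled(false); // crawler4j then skips robots.txt lookups entirely
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);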

    Adding seeds after the crawl has started is possible and should work if the crawl is started in non-blocking mode. However, it can take a while until the URLs are processed.

    For a multi-machine setup you can take a look at Apache Nutch. However, Nutch is a bit difficult to learn.

    EDIT:

    After reproducing your setup, I can answer your question about adding seed pages dynamically.

    Starting the crawler in this manner

    controller.startNonBlocking(() -> {
        return new MyCrawler(urlsQueue);
    }, 4);
    

    will invoke the run() method of every crawler thread. Investigating this method, we find a call named frontier.getNextURLs(50, assignedURLs);, which is responsible for taking unseen URLs from the frontier in order to process them. In this method, we find a so-called waitingList, which causes the thread to wait. Since notifyAll is never invoked on waitingList until the controller is shut down, the threads never pick up new URLs.

    To overcome this issue, you have two possible solutions:

    1. Add at least one URL per thread as a seed point before starting; the deadlock situation will then not occur. After starting the threads in non-blocking mode, you can add further seeds as you like.

      controller.addSeed("https://www.google.de");
      
      controller.startNonBlocking(() -> {
          return new MyCrawler(urlsQueue);
      }, 4);
      
      controller.addSeed("https://www.google.de/test");
      
      controller.waitUntilFinish();
      
    2. Fork the GitHub project and adapt the code of Frontier.java so that the waitingList.notifyAll() method can be invoked from the CrawlController after seed pages are dynamically added, as sketched below.
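
       A rough sketch of that kind of change (the method name wakeUpWaitingList and the exact call site are assumptions; the real Frontier internals may differ):

      // In a fork of Frontier.java (sketch only):
      public void wakeUpWaitingList() {
          synchronized (waitingList) {
              waitingList.notifyAll(); // let idle crawler threads re-check the frontier
          }
      }

      // In CrawlController, after seeds have been added dynamically (sketch only):
      public void addSeedAndNotify(String pageUrl) {
          addSeed(pageUrl);
          frontier.wakeUpWaitingList(); // wake threads blocked in getNextURLs(...)
      }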