Tags: java, multithreading, web-crawler, forkjoinpool

How to manage a crawler URL frontier?


Guys,

I have the following code to add visited links in my crawler. After extracting links, I have a for loop that iterates over the individual href tags.

After I have visited a link and opened it, I add the URL to a visited-link collection variable, defined as:

private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());

The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs: if I don't terminate the crawler, the set will keep growing day by day, and eventually it will cause memory issues. What options do I have to clear or refresh this variable without creating inconsistency across threads?

Thanks in advance!


Solution

  • If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.

    Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.

    See https://github.com/crawler-commons/url-frontier
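    If pulling in a full frontier service is more than you need, a minimal in-process alternative for the memory-growth part of the question is to cap the visited set's size and evict the oldest entries once the cap is reached. The sketch below is an illustration under my own assumptions (the class name `BoundedVisitedSet`, the cap value, and insertion-order eviction are choices I made, not anything from the question or from the URL Frontier project):

    ```java
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    // Sketch: a visited-URL set with a hard size cap. Once the cap is
    // exceeded, the oldest entry (by insertion order) is evicted, so
    // memory use stays bounded no matter how long the crawler runs.
    public class BoundedVisitedSet {
        private final Set<String> visited;

        public BoundedVisitedSet(final int maxEntries) {
            // LinkedHashMap.removeEldestEntry gives a simple bounded map;
            // newSetFromMap exposes it as a Set, and synchronizedSet makes
            // it safe to share across crawler threads.
            Map<String, Boolean> backing =
                new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > maxEntries;
                    }
                };
            this.visited = Collections.synchronizedSet(Collections.newSetFromMap(backing));
        }

        /** Returns true if the URL had not been seen before (and records it). */
        public boolean markVisited(String url) {
            return visited.add(url);
        }

        public int size() {
            return visited.size();
        }

        public static void main(String[] args) {
            BoundedVisitedSet set = new BoundedVisitedSet(3); // cap of 3 for demo
            set.markVisited("http://a.example");
            set.markVisited("http://b.example");
            set.markVisited("http://c.example");
            set.markVisited("http://d.example"); // evicts http://a.example
            System.out.println(set.size());      // stays at the cap: 3
        }
    }
    ```

    The trade-off is that an evicted URL can be crawled again later; if re-crawling old URLs is unacceptable, a probabilistic structure such as a Bloom filter (constant memory, no false negatives for "not visited") or an external store is the usual next step.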