Tags: java, multithreading, web-crawler, forkjoinpool

How to manage a crawler URL frontier?


Guys,

I have the following code to add visited links in my crawler. After extracting links, I have a for loop that iterates over the individual href tags.

After I have visited a link and opened it, I add the URL to a visited-link collection variable, defined as:

private final Collection<String> urlFrontier = Collections.synchronizedSet(new HashSet<String>());

The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs: if I don't terminate the crawler, the set will keep growing day by day, and eventually it will cause memory issues. What options do I have to clear or refresh this variable without creating inconsistency across threads?

Thanks in advance!


Solution

  • If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.

    Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.

    See https://github.com/crawler-commons/url-frontier
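    If pulling in a full frontier service is more than you need, a minimal in-process alternative for the memory-growth part of the question is to cap the visited set's size and evict the oldest entries once the cap is reached. The sketch below is an illustration under my own assumptions (the class name `BoundedVisitedSet`, the cap value, and insertion-order eviction are choices I made, not anything from the question or from the URL Frontier project):

    ```java
    import java.util.Collections;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    // Sketch: a visited-URL set with a hard size cap. Once the cap is
    // exceeded, the oldest entry (by insertion order) is evicted, so
    // memory use stays bounded no matter how long the crawler runs.
    public class BoundedVisitedSet {
        private final Set<String> visited;

        public BoundedVisitedSet(final int maxEntries) {
            // LinkedHashMap.removeEldestEntry gives a simple bounded map;
            // newSetFromMap exposes it as a Set, and synchronizedSet makes
            // it safe to share across crawler threads.
            Map<String, Boolean> backing =
                new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                        return size() > maxEntries;
                    }
                };
            this.visited = Collections.synchronizedSet(Collections.newSetFromMap(backing));
        }

        /** Returns true if the URL had not been seen before (and records it). */
        public boolean markVisited(String url) {
            return visited.add(url);
        }

        public int size() {
            return visited.size();
        }

        public static void main(String[] args) {
            BoundedVisitedSet set = new BoundedVisitedSet(3); // cap of 3 for demo
            set.markVisited("http://a.example");
            set.markVisited("http://b.example");
            set.markVisited("http://c.example");
            set.markVisited("http://d.example"); // evicts http://a.example
            System.out.println(set.size());      // stays at the cap: 3
        }
    }
    ```

    The trade-off is that an evicted URL can be crawled again later; if re-crawling old URLs is unacceptable, a probabilistic structure such as a Bloom filter (constant memory, no false negatives for "not visited") or an external store is the usual next step.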