Hi all,
I have the following code to record visited links in my crawler. After extracting the links, I have a for loop that goes through each individual href tag.
After I have opened and visited a link, I add its URL to a visited-links collection, declared as follows:
private final Collection<String> urlForntier = Collections.synchronizedSet(new HashSet<String>());
The crawler implementation is multithreaded. Suppose I have visited 100,000 URLs: if I never terminate the crawler, the collection will keep growing day by day. Will that cause memory issues? What options do I have to refresh the variable without creating inconsistency across threads?
Thanks in advance!
If your crawlers are any good, managing the crawl frontier quickly becomes difficult, slow and error-prone.
Luckily, you don't need to write this yourself: just write your crawlers to consume the URL Frontier API and plug in an implementation that suits you.
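If you would rather keep the visited set in memory for now, one option is to cap its size so it cannot grow without limit. The following is only a minimal sketch, not the URL Frontier API: the class name BoundedVisitedSet, the create method and the maxEntries parameter are made up for illustration. It wraps a LinkedHashMap that drops its oldest entry once the cap is exceeded, and exposes it as a synchronized Set, so it can stand in for the HashSet in the question.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class BoundedVisitedSet {

    // Hypothetical helper: returns a thread-safe set holding at most
    // maxEntries URLs. When the cap is exceeded, the URL that was
    // inserted earliest is evicted.
    static Set<String> create(final int maxEntries) {
        Map<String, Boolean> backing = new LinkedHashMap<String, Boolean>() {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                // Called by the map after each new insertion.
                return size() > maxEntries;
            }
        };
        // newSetFromMap requires an empty map; synchronizedSet makes each
        // individual call (add, contains, size) atomic across threads.
        return Collections.synchronizedSet(Collections.newSetFromMap(backing));
    }

    public static void main(String[] args) {
        Set<String> visited = create(2);
        String[] urls = {
            "http://example.com/a",
            "http://example.com/b",
            "http://example.com/a", // duplicate, skipped
            "http://example.com/c"  // pushes the set past the cap
        };
        for (String url : urls) {
            // add() returns false if the URL is already present, so the
            // check and the mark happen in a single atomic call.
            if (visited.add(url)) {
                System.out.println("crawl " + url);
            }
        }
        System.out.println("tracked: " + visited.size());
    }
}
```

The trade-off is that an evicted URL is forgotten, so the crawler may fetch it again later. Using the return value of add() instead of a separate contains() followed by add() avoids two threads both deciding to crawl the same URL.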