javamultithreadingconcurrentmodificationforkjoinpool

ForkPoolTask.join() ConcurrentModificationException


I'm getting a very strange error here that I can't quite understand. As it stands I'm making a crawler program that uses ForkJoinTasks to scan for links on the page and then crawl those as well. However, when I use .join() to collect the tasks, I sometimes get ConcurrentModificationExceptions pointing to that line.

Spider[] newSpiders = crawlForURLs("a", "href").stream()
        .filter(url -> SpiderUtils.isLinkURLValid(url))
        .map(url -> new Spider(url).fork()).toArray(Spider[]::new);

for (Spider c : newSpiders) {
    Image[] crawledImages = c.join();
}

The line in question is c.join(). I tried surrounding it in a ReentrantLock just to see what would happen but it made no difference.

Full stack trace: https://pastebin.com/qCJte65Q

Lines of code are: at com.eulerity.hackathon.imagefinder.Spider.compute (Spider.java:85) (Image[] crawledImages = c.join();)

at com.eulerity.hackathon.imagefinder.ImageFinder.doPost (ImageFinder.java:45) images = SpiderUtils.commonPool.invoke(baseSpider);

I honestly can't figure out why this is the case, but if you need to see another part of my code please don't hesitate to ask. Thank you!


Solution

  • The problem, I believe, is in your SpiderUtils class. More specifically, this method:

    public static boolean isLinkURLValid(URI url) {
        return url.getHost().equalsIgnoreCase(baseUrl.getHost()) &&
                !containsURL(linksCrawled, url) &&
                !containsURL(linksInProgress, url) &&
                !url.getPath().contains(".");
    }
    

    And this method:

    public static boolean containsURL(ArrayList<URI> urls, URI left) {
        return urls.stream().anyMatch(right -> urlsMatch(left, right));
    }
    

    Both these methods are not synchronized, which means multiple threads can execute them at the same time. This is likely not a problem for the containsURL(...) method. Given what it does, it has no reason to be synchronized (neither it, nor the urlsMatch(...) method, rely on any external state). The problem is with how the isLinkURLValid(...) method passes both linksCrawled and linksInProgress to the containsURL(...) method.

    These list variables are static, and thus the same instances are passed to the isLinkURLValid(...) method on different threads. This would be okay, if all you did was stream the elements of these lists (i.e., only read from the lists). However, elsewhere in your code you also add elements to these lists (i.e., you write to the lists). These writes are performed in synchronized methods, but that doesn't matter because you're reading the lists without synchronizing on the same object. So, your code is vulnerable to ConcurrentModificationExceptions (you cannot modify a non-concurrent collection while iterating over it).

    I believe the solution would be to simply make the isLinkURLValid(...) method synchronized, but I'm not 100% positive as I haven't tested it.