What I'm doing: I'm going through a table of companies in a database... each company has a text description field, and inside that field there can be a number of hyperlinks (rarely more than 4). What I want to do is test these links, using curl, for a "bad" response (typically 404, but anything non-200 will be of interest).
Incidentally, this is applicable to Java as much as Groovy, no doubt, and persons of either persuasion might be interested to know that the underlying thread pool class used here by GPars (Groovy parallelism) is ForkJoinPool.
Having gathered up these URLs using a Matcher with the Pattern /(https?:.*?)\)/, I get a map descripURLs of "url" --> "name of company".
Then I use withPool with a large capacity (because of the intrinsic latency of waiting for responses, obviously), like so:
import java.util.concurrent.atomic.AtomicInteger
import static groovyx.gpars.GParsPool.withPool

startMillis = System.currentTimeMillis()
AtomicInteger nRequest = new AtomicInteger()
AtomicInteger nResponsesReceived = new AtomicInteger()
poolObject = null
resultP = withPool( 50 ){ pool ->
    poolObject = pool
    descripURLs.eachParallel{ url, name ->
        int localNRequest = nRequest.incrementAndGet()
        Process process = checkURL( url )
        def response
        try {
            //// with the next line TIME PASSES in this Thread...
            response = process.text
        } catch( Exception e ) {
            System.err.println "$e"
        }
        // NB this line doesn't appear to make much difference
        process.destroyForcibly()
        nResponses = nResponsesReceived.incrementAndGet()
        int nRequestsNowMade = nRequest.get()
        if( response?.trim() != '200' ) {
            println "\n*** request $localNRequest BAD RESPONSE\nname $name url $url\nresponse |$response|" +
                "\n$nRequestsNowMade made, outstanding ${nRequestsNowMade - nResponses}"
            // NB the following line may of course not be printed immediately after the above line, due to parallelism
            println "\nprocess poolSize $pool.poolSize, queuedTaskCount $pool.queuedTaskCount," +
                " queuedSubmissionCount? $pool.queuedSubmissionCount"
        }
        println "time now ${System.currentTimeMillis() - startMillis}, activeThreadCount $pool.activeThreadCount"
    }
    println "END OF withPool iterations"
    println "pool $pool class ${pool.class.simpleName}, activeThreadCount $pool.activeThreadCount"
    pool.shutdownNow()
}
println "resultP $resultP class ${resultP.class.simpleName}"
println "pool $poolObject class ${poolObject.class.simpleName}"
println "pool shutdown? $poolObject.shutdown"
def checkURL( url ) {
    def process = "curl -LI $url -o /dev/null -w '%{http_code}\n' -s".execute()
    // this appears necessary... otherwise potentially you can have processes hanging around forever
    process.waitForOrKill( 8000 ) // 8 s to get a response
    process.addShutdownHook{
        println "shutdown on url $url"
    }
    process
}
What I observe with a 50-thread pool as above is that 500 URLs take about 20 s to complete. I've experimented with smaller and larger pools: 100 seems to make no difference, 25 seems slower, and 10 takes more like 40 s to complete. Timings are also remarkably consistent from run to run for the same pool size.
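For reference, the pool-size comparison was done along these lines (a simplified sketch of the timing harness, not the exact code I ran):

    // time the same batch of checks for each candidate pool size (the sizes mentioned above)
    [ 10, 25, 50, 100 ].each { size ->
        def t0 = System.currentTimeMillis()
        withPool( size ){
            descripURLs.eachParallel{ url, name ->
                try { checkURL( url ).text } catch( Exception e ) { }    // ignore failures, only timings matter here
            }
        }
        println "pool size $size: ${System.currentTimeMillis() - t0} ms for ${descripURLs.size()} URLs"
    }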
What I don't understand is that the Processes' shutdown hooks only run at the very end of the closure... for all 500 Processes! This is not to say that 500 actual processes are hanging around on the machine: using Task Manager I can see that the number of curl.exe processes at any one time is relatively small.
At the same time I observe from the printlns that the active thread count starts at 50 but then declines throughout the run, reaching 3 (typically) by the end. AND YET... I can also observe that the final requests are only being added very near the end of the run.
This leads me to wonder whether the thread pool is in some way being "clogged up" by the "unfinished business" of these "zombie" Processes... I would expect the final requests (of the 500 made) to be made well before the end of the run. Is there any way I can shut down these Processes earlier?
Neither Java nor Groovy supports an addShutdownHook method on child Process instances.
The only addShutdownHook method that Java supports is on the Runtime instance. This adds a hook to run at JVM shutdown.
Groovy adds a convenience addShutdownHook() to the Object class so that you don't have to write Runtime.getRuntime().addShutdownHook(..), but this changes nothing about the underlying mechanism: these hooks are only executed at JVM shutdown.
Because the closures that you add with process.addShutdownHook most probably keep references to the process instance, these will be kept alive until JVM shutdown (the Java objects, that is, not the OS processes).
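To see the mechanism in action, here is a minimal sketch (it assumes curl is on the PATH and uses an illustrative URL): the hook registered via process.addShutdownHook only fires when the JVM exits, long after the child process has terminated, and its closure keeps the process object reachable until then.

    // the receiver is a Process, but Groovy's Object.addShutdownHook simply delegates to
    // Runtime.getRuntime().addShutdownHook(new Thread(closure)) -- i.e. a JVM-level hook
    def process = 'curl -sI https://example.com'.execute()
    process.addShutdownHook {
        // runs only at JVM shutdown; capturing 'process' keeps the Java object
        // (not the OS process) reachable until then
        println "JVM exiting; the long-finished curl exited with ${process.exitValue()}"
    }
    process.waitFor()
    println "child process already terminated: exitValue = ${process.exitValue()}"
    // the "JVM exiting..." line appears only after this script (and the JVM) ends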