What I'm doing: I'm going through a table of companies in a database... each company has a text description field, and inside that field there can be a number of hyperlinks (rarely more than 4). What I want to do is test these links, using curl, for a "bad" response (typically 404, but anything non-200 will be of interest).
Incidentally, this is applicable to Java as much as Groovy, no doubt, and persons of either persuasion might be interested to know that the underlying thread pool class used here by GPars (Groovy parallelism) is ForkJoinPool.
Having gathered up these URLs using a Matcher with the Pattern /(https?:.*?)\)/, I get a map descripURLs of "url" --> "name of company".
Then I use withPool with a large capacity (because of the intrinsic latency of waiting for responses, obviously), like so:
import java.util.concurrent.atomic.AtomicInteger
import static groovyx.gpars.GParsPool.withPool

startMillis = System.currentTimeMillis()
AtomicInteger nRequest = new AtomicInteger()
AtomicInteger nResponsesReceived = new AtomicInteger()
poolObject = null
resultP = withPool( 50 ){ pool ->
    poolObject = pool
    descripURLs.eachParallel{ url, name ->
        int localNRequest = nRequest.incrementAndGet()
        Process process = checkURL( url )
        def response
        try {
            //// with the next line TIME PASSES in this Thread...
            response = process.text
        } catch( Exception e ) {
            System.err.println "$e"
        }
        // NB this line doesn't appear to make much difference
        process.destroyForcibly()
        nResponses = nResponsesReceived.incrementAndGet()
        int nRequestsNowMade = nRequest.get()
        if( response?.trim() != '200' ) {
            println "\n*** request $localNRequest BAD RESPONSE\nname $name url $url\nresponse |$response|" +
                "\n$nRequestsNowMade made, outstanding ${nRequestsNowMade - nResponses}"
            // NB the following line may of course not be printed immediately after the above line, due to parallelism
            println "\nprocess poolSize $pool.poolSize, queuedTaskCount $pool.queuedTaskCount," +
                " queuedSubmissionCount? $pool.queuedSubmissionCount"
        }
        println "time now ${System.currentTimeMillis() - startMillis}, activeThreadCount $pool.activeThreadCount"
    }
    println "END OF withPool iterations"
    println "pool $pool class ${pool.class.simpleName}, activeThreadCount $pool.activeThreadCount"
    pool.shutdownNow()
}
println "resultP $resultP class ${resultP.class.simpleName}"
println "pool $poolObject class ${poolObject.class.simpleName}"
println "pool shutdown? $poolObject.shutdown"
def checkURL( url ) {
    def process = "curl -LI $url -o /dev/null -w '%{http_code}\n' -s".execute()
    // this appears necessary... otherwise potentially you can have processes hanging around forever
    process.waitForOrKill( 8000 ) // 8 s to get a response
    process.addShutdownHook{
        println "shutdown on url $url"
    }
    process
}
What I observe with a 50-thread pool as above is that 500 URLs take about 20 s to complete. I've experimented with smaller and larger pools: 100 seems to make no difference, 25 seems slower, and 10 takes more like 40 s to complete. Timings are also remarkably consistent from run to run for the same pool size.
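For reference, the pool-size comparison was done along these lines (a simplified sketch of the timing harness, not the exact code I ran):

    // time the same batch of checks for each candidate pool size (the sizes mentioned above)
    [ 10, 25, 50, 100 ].each { size ->
        def t0 = System.currentTimeMillis()
        withPool( size ){
            descripURLs.eachParallel{ url, name ->
                try { checkURL( url ).text } catch( Exception e ) { }    // ignore failures, only timings matter here
            }
        }
        println "pool size $size: ${System.currentTimeMillis() - t0} ms for ${descripURLs.size()} URLs"
    }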
What I don't understand is that the Processes' shutdown hooks only run at the very end of the closure... for all 500 Processes! This is not to say that 500 actual processes are hanging around on the machine: using Task Manager I can see that the number of curl.exe processes at any one time is relatively small.
At the same time I observe from the printlns that the active thread count starts at 50 but then declines throughout the run, reaching 3 (typically) by the end. AND YET... I can also observe that the final requests are only being added very near the end of the run.
This leads me to wonder whether the thread pool is in some way being "clogged up" by the "unfinished business" of these "zombie" Processes... I would expect the final requests (of the 500 made) to be made well before the end of the run. Is there any way I can shut down these Processes earlier?
Neither Java nor Groovy supports an addShutdownHook method on child Process instances.
The only addShutdownHook method that Java supports is on the Runtime instance. This adds a hook to run at JVM shutdown.
Groovy adds a convenience addShutdownHook() to the Object class so that you don't have to write Runtime.getRuntime().addShutdownHook(..), but this changes nothing about the underlying mechanism: these hooks are only executed at JVM shutdown.
Because the closures that you add with process.addShutdownHook most probably keep references to the process instance, these will be kept alive until JVM shutdown (the Java objects, that is, not the OS processes).
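To see the mechanism in action, here is a minimal sketch (it assumes curl is on the PATH and uses an illustrative URL): the hook registered via process.addShutdownHook only fires when the JVM exits, long after the child process has terminated, and its closure keeps the process object reachable until then.

    // the receiver is a Process, but Groovy's Object.addShutdownHook simply delegates to
    // Runtime.getRuntime().addShutdownHook(new Thread(closure)) -- i.e. a JVM-level hook
    def process = 'curl -sI https://example.com'.execute()
    process.addShutdownHook {
        // runs only at JVM shutdown; capturing 'process' keeps the Java object
        // (not the OS process) reachable until then
        println "JVM exiting; the long-finished curl exited with ${process.exitValue()}"
    }
    process.waitFor()
    println "child process already terminated: exitValue = ${process.exitValue()}"
    // the "JVM exiting..." line appears only after this script (and the JVM) ends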