I am trying to OCR some images with kraken. I wrote a console command for this. It was slow, so I combined it with GNU parallel:
find temp/ -name '*.tif' -or -name '*.jpg' | parallel -j4 kraken -i {} {}.html binarize segment ocr -h
It works fine when I run it in the terminal. When I start it from Java (Eclipse), execution stops after 30 images. It never terminates, and it leaves defunct processes behind.
String command = "find temp/ -name '*.tif' -or -name '*.jpg' | parallel -j4 kraken -i {} {}.html binarize segment ocr -h";
Process p = Runtime.getRuntime().exec(new String[]{"/bin/bash","-c",command});
boolean success = p.waitFor() == 0;
I tried several configurations (more memory for both Eclipse and the executed process, fewer threads), but nothing helped.
Does anyone have an idea how to avoid the defunct processes, or how the execution can be resumed?
Almost certainly, the problem is that you're not consuming the output of the process. Its output buffer fills up, and the process then stalls on its next write.
Try:
String command = "find temp/ -name '*.tif' -or -name '*.jpg' | parallel -j4 kraken -i {} {}.html binarize segment ocr -h";
Process p = Runtime.getRuntime().exec(new String[]{"/bin/bash","-c",command});
InputStream is = p.getInputStream();
// is.skip(Long.MAX_VALUE); Doesn't work
while (is.read() != -1) { } // consume all process output
p.waitFor();
A complete solution would also process the error stream. This can be done by starting a separate thread which reads/skips the input from the error stream.
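As a rough sketch of that idea, the stderr stream can be drained on a background thread while the calling thread drains stdout. The class name DrainBothStreams and the helper names drain and runAndDiscardOutput are made up for this illustration, and the sketch simply throws all output away, which also hides any error messages kraken prints.

```java
import java.io.IOException;
import java.io.InputStream;

public class DrainBothStreams {

    // Hypothetical helper: reads and discards everything until end of stream.
    static void drain(InputStream in) throws IOException {
        byte[] buffer = new byte[8192];
        while (in.read(buffer) != -1) {
            // intentionally empty: the data is thrown away
        }
    }

    // Hypothetical helper: runs a bash command, discards stdout and stderr,
    // and returns the exit code of the shell.
    static int runAndDiscardOutput(String command) throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(new String[]{"/bin/bash", "-c", command});

        // Drain stderr on its own thread so neither pipe can fill up
        // while we are blocked reading the other one.
        Thread stderrDrainer = new Thread(() -> {
            try (InputStream err = p.getErrorStream()) {
                drain(err);
            } catch (IOException ignored) {
                // the stream was closed; nothing left to read
            }
        });
        stderrDrainer.start();

        // Drain stdout on the calling thread.
        try (InputStream out = p.getInputStream()) {
            drain(out);
        }

        int exitCode = p.waitFor();
        stderrDrainer.join();
        return exitCode;
    }

    public static void main(String[] args) throws Exception {
        String command = "find temp/ -name '*.tif' -or -name '*.jpg' | parallel -j4 kraken -i {} {}.html binarize segment ocr -h";
        System.out.println("exit code: " + runAndDiscardOutput(command));
    }
}
```

If the error output is not needed separately, ProcessBuilder.redirectErrorStream(true) merges stderr into stdout, so a single read loop is enough and no extra thread is required.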
(Alternatively, you could redirect the output to /dev/null in the bash command itself.)
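A minimal sketch of the /dev/null variant, assuming the output is not needed at all: the command is wrapped in a brace group so that the redirection applies to the whole pipeline (including find), not only to the last command in it. The class name RedirectToDevNull and the helpers discardingOutput and run are hypothetical names chosen for this example.

```java
import java.io.IOException;

public class RedirectToDevNull {

    // Hypothetical helper: wraps the command in { ...; } so that the
    // redirection covers every command in the pipeline.
    static String discardingOutput(String command) {
        return "{ " + command + "; } > /dev/null 2>&1";
    }

    // Hypothetical helper: runs the wrapped command through bash.
    // Nothing is written to the pipes, so no reader is needed on the Java side.
    static int run(String command) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("/bin/bash", "-c", discardingOutput(command)).start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        String command = "find temp/ -name '*.tif' -or -name '*.jpg' | parallel -j4 kraken -i {} {}.html binarize segment ocr -h";
        System.out.println("exit code: " + run(command));
    }
}
```

The trade-off is that any diagnostics from kraken or parallel are lost. Redirecting to a log file instead of /dev/null keeps them available without requiring a reader on the Java side.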