javagoogle-cloud-dataflowgzipinputstream

Dataflow GZIP TextIO ZipException: too many length or distance symbols


Using TextIO.Read transform with a large collection of compressed text files (1000+ files, sizes between 100MB and 1.5GB), we sometimes get the following error:

java.util.zip.ZipException: too many length or distance symbols at
java.util.zip.InflaterInputStream.read(InflaterInputStream.java:164) at
java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117) at
java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at
java.io.BufferedInputStream.read(BufferedInputStream.java:345) at
java.io.FilterInputStream.read(FilterInputStream.java:133) at
java.io.PushbackInputStream.read(PushbackInputStream.java:186) at 
com.google.cloud.dataflow.sdk.runners.worker.TextReader$ScanState.readBytes(TextReader.java:261) at 
com.google.cloud.dataflow.sdk.runners.worker.TextReader$TextFileIterator.readElement(TextReader.java:189) at 
com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.computeNextElement(FileBasedReader.java:265) at 
com.google.cloud.dataflow.sdk.runners.worker.FileBasedReader$FileBasedIterator.hasNext(FileBasedReader.java:165) at 
com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:169) at 
com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:118) at 
com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:66) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:204) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:151) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:118) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:139) at 
com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:124) at
java.util.concurrent.FutureTask.run(FutureTask.java:266) at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at
java.lang.Thread.run(Thread.java:745)

Searching online for the same ZipException, only lead to this reply:

Zip file errors often happen when the hot deployer attempts to deploy an application before it is fully copied to the deploy directory. This is fairly common if it takes several seconds to copy the file. The solution is to copy the file to a temporary directory on the same disk partition as the application server, and then move the file to the deploy directory.

Did anybody else run into a similar exception? Or anyway we can fix this problem?


Solution

  • Looking at the code that produces the error message it seems to be a problem with zlib library (which is used by JDK) not supporting the format of gzip files that you have.

    It looks to be the following bug in zlib: Codes for reserved symbols are rejected even if unused.

    Unfortunately there's probably little we can do to help other than suggest producing these compressed file using another utility.

    If you can produce a small example gzip file that we could use to reproduce the issue, we might be able to see if it is possible to work around somehow, but I wouldn't rely on this to succeed.