See updated question in edit section below
I'm trying to decompress large (~300M) GZIPed files from Amazon S3 on the fly using GZIPInputStream but it only outputs a portion of the file; however, if I download to the filesystem before decompression then GZIPInputStream will decompress the entire file.
How can I get GZIPInputStream to decompress the entire HTTPInputStream and not just the first part of it?
see update in the edit section below
I suspected a HTTP problem except that no exceptions are ever thrown, GZIPInputStream returns a fairly consistent chunk of the file each time and, as far as I can tell, it always breaks on a WET record boundary although the boundary it picks is different for each URL (which is very strange as everything is being treated as a binary stream, no parsing of the WET records in the file is happening at all.)
The closest question I could find is GZIPInputStream is prematurely closed when reading from s3 The answer to that question was that some GZIP files are actually multiple appended GZIP files and GZIPInputStream doesn't handle that well. However, if that is the case here why would GZIPInputStream work fine on a local copy of the file?
Below is a piece of sample code that demonstrates the problem I am seeing. I've tested it with Java 1.8.0_72 and 1.8.0_112 on two different Linux computers on two different networks with similar results. I expect the byte count from the decompressed HTTPInputStream to be identical to the byte count from the decompressed local copy of the file, but the decompressed HTTPInputStream is much smaller.
OutputTesting URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 87894 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 1772936 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 89217 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Sample Code
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;
public class GZIPTest {
public static void main(String[] args) throws Exception {
// Our three test files from CommonCrawl
URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");
/*
* Test the URLs and display the results
*/
test(url0, "testfile0.wet");
System.out.println("------");
test(url40, "testfile40.wet");
System.out.println("------");
test(url500, "testfile500.wet");
}
public static void test(URL url, String testGZFileName) throws Exception {
System.out.println("Testing URL "+url.toString());
// First directly wrap the HTTPInputStream with GZIPInputStream
// and count the number of bytes we read
// Go ahead and save the extracted stream to a file for further inspection
System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
int bytesFromGZIPDirect = 0;
URLConnection urlConnection = url.openConnection();
FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
// FIRST TEST - Decompress from HTTPInputStream
GZIPInputStream gzipishttp = new GZIPInputStream(urlConnection.getInputStream());
byte[] buffer = new byte[1024];
int bytesRead = -1;
while ((bytesRead = gzipishttp.read(buffer, 0, 1024)) != -1) {
bytesFromGZIPDirect += bytesRead;
directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
}
gzipishttp.close();
directGZIPOutStream.close();
// Now save the GZIPed file locally
System.out.println("Testing saving to file before decompression");
int bytesFromGZIPFile = 0;
ReadableByteChannel rbc = Channels.newChannel(url.openStream());
FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
outputStream.close();
// SECOND TEST - decompress from FileInputStream
GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
buffer = new byte[1024];
bytesRead = -1;
while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
bytesFromGZIPFile += bytesRead;
}
gzipis.close();
// The Results - these numbers should match but they don't
System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
}
}
Closed Stream and associated Channel in demonstration code as per comment by @VGR.
UPDATE:
The problem does seem to be something specific to the file. I pulled the Common Crawl WET archive down locally (wget), uncompressed it (gunzip 1.8), then recompressed it (gzip 1.8) and re-uploaded to S3 and the on-the-fly decompression then worked fine. You can see the test if you modify the sample code above to include the following lines:
// Original file from CommonCrawl hosted on S3
URL originals3 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
// Recompressed file hosted on S3
URL rezippeds3 = new URL("https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
test(originals3, "originalhost.txt");
test(rezippeds3, "rezippedhost.txt");
URL rezippeds3 points to the WET archive file that I downloaded, decompressed and recompressed, then re-uploaded to S3. You will see the following output:
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 7212400 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file originals3.txt
-----
Testing URL https://s3-us-west-1.amazonaws.com/com.jeffharwell.commoncrawl.gziptestbucket/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file rezippeds3.txt
As you can see once the file was recompressed I was able to stream it through GZIPInputStream and get the entire file. The original file still shows the usual premature end of the decompression. When I downloaded and uploaded the WET file without recompressing it I got the same incomplete streaming behavior so it was definitely the recompression that fixed it. I also put both files, the original and the recompressed, onto a traditional Apache web server and was able to replicate the results, so S3 doesn't seem to have anything to do with the problem.
So. I have a new question.
Why would a FileInputStream behave differently than a HTTPInputStream when reading the same content. If it is the exact same file why does:
new GZIPInputStream(urlConnection.getInputStream());
behave any differently than
new GZIPInputStream(new FileInputStream("./test.wet.gz"));
?? Isn't an input stream just an input stream??
It turns out that InputStreams can vary quite a bit. In particular they differ in how they implement the .available() method. For example ByteArrayInputStream .available() returns the number of bytes remaining in the InputStream. However, HTTPInputStream .available() returns the number of bytes available for reading before a blocking IO request needs to be made to refill the buffer. (See the Java Docs for more info)
The problem is that GZIPInputStream uses the output of .available() to determine if there might be an additional GZIP file available in the InputStream after it finishes decompressing a complete GZIP file. Here is line 231 from OpenJDK source file GZIPInputStream.java method readTrailer().
if (this.in.available() > 0 || n > 26) {
If the HTTPInputStream read buffer empties right at the boundary of two concatenated GZIP files GZIPInputStream calls .available(), which responds with a 0 as it would need to go out to the network to refill the buffer, and so GZIPInputStream treats the file as complete and closes prematurely.
The Common Crawl .wet archives are hundreds of megabytes of small concatenated GZIP files and so eventually the HTTPInputStream buffer will empty right at the end of one of the concatenated GZIP files and GZIPInputStream will close prematurely. This explains the problem demonstrated in the question.
This GIST contains a patch to jdk8u152-b00 revision 12039 and two jtreg tests that remove the (in my humble opinion) incorrect reliance on .available().
If you cannot patch the JDK a work around is to make sure that available() always returns > 0 which forces GZIPInputStream to always check for another GZIP file in the stream. Unfortunately HTTPInputStream is private so you cannot subclass it directly, instead extend InputStream and wrap the HTTPInputStream. The below code demonstrates this work around.
Here is the output showing that when the HTTPInputStream is wrapped as discussed GZIPInputStream will produces identical results when reading the concatenated GZIP from a file and directly from HTTP.
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 448974935 bytes from HTTP->GZIP
Read 448974935 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile0.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 451171329 bytes from HTTP->GZIP
Read 451171329 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile40.wet
------
Testing URL https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz
Testing HTTP Input Stream direct to GZIPInputStream
Testing saving to file before decompression
Read 453183600 bytes from HTTP->GZIP
Read 453183600 bytes from HTTP->file->GZIP
Output from HTTP->GZIP saved to file testfile500.wet
Here is the demonstration code from the question modified with an InputStream wrapper.
import java.net.*;
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.nio.channels.*;
public class GZIPTest {
// Here is a wrapper class that wraps an InputStream
// but always returns > 0 when .available() is called.
// This will cause GZIPInputStream to always make another
// call to the InputStream to check for an additional
// concatenated GZIP file in the stream.
public static class AvailableInputStream extends InputStream {
private InputStream is;
AvailableInputStream(InputStream inputstream) {
is = inputstream;
}
public int read() throws IOException {
return(is.read());
}
public int read(byte[] b) throws IOException {
return(is.read(b));
}
public int read(byte[] b, int off, int len) throws IOException {
return(is.read(b, off, len));
}
public void close() throws IOException {
is.close();
}
public int available() throws IOException {
// Always say that we have 1 more byte in the
// buffer, even when we don't
int a = is.available();
if (a == 0) {
return(1);
} else {
return(a);
}
}
}
public static void main(String[] args) throws Exception {
// Our three test files from CommonCrawl
URL url0 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00009-ip-10-31-129-80.ec2.internal.warc.wet.gz");
URL url40 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698540409.8/wet/CC-MAIN-20161202170900-00040-ip-10-31-129-80.ec2.internal.warc.wet.gz");
URL url500 = new URL("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2016-50/segments/1480698541142.66/wet/CC-MAIN-20161202170901-00500-ip-10-31-129-80.ec2.internal.warc.wet.gz");
/*
* Test the URLs and display the results
*/
test(url0, "testfile0.wet");
System.out.println("------");
test(url40, "testfile40.wet");
System.out.println("------");
test(url500, "testfile500.wet");
}
public static void test(URL url, String testGZFileName) throws Exception {
System.out.println("Testing URL "+url.toString());
// First directly wrap the HTTP inputStream with GZIPInputStream
// and count the number of bytes we read
// Go ahead and save the extracted stream to a file for further inspection
System.out.println("Testing HTTP Input Stream direct to GZIPInputStream");
int bytesFromGZIPDirect = 0;
URLConnection urlConnection = url.openConnection();
// Wrap the HTTPInputStream in our AvailableHttpInputStream
AvailableInputStream ais = new AvailableInputStream(urlConnection.getInputStream());
GZIPInputStream gzipishttp = new GZIPInputStream(ais);
FileOutputStream directGZIPOutStream = new FileOutputStream("./"+testGZFileName);
int buffersize = 1024;
byte[] buffer = new byte[buffersize];
int bytesRead = -1;
while ((bytesRead = gzipishttp.read(buffer, 0, buffersize)) != -1) {
bytesFromGZIPDirect += bytesRead;
directGZIPOutStream.write(buffer, 0, bytesRead); // save to file for further inspection
}
gzipishttp.close();
directGZIPOutStream.close();
// Save the GZIPed file locally
System.out.println("Testing saving to file before decompression");
ReadableByteChannel rbc = Channels.newChannel(url.openStream());
FileOutputStream outputStream = new FileOutputStream("./test.wet.gz");
outputStream.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
// Now decompress the local file and count the number of bytes
int bytesFromGZIPFile = 0;
GZIPInputStream gzipis = new GZIPInputStream(new FileInputStream("./test.wet.gz"));
buffer = new byte[1024];
while((bytesRead = gzipis.read(buffer, 0, 1024)) != -1) {
bytesFromGZIPFile += bytesRead;
}
gzipis.close();
// The Results
System.out.println("Read "+bytesFromGZIPDirect+" bytes from HTTP->GZIP");
System.out.println("Read "+bytesFromGZIPFile+" bytes from HTTP->file->GZIP");
System.out.println("Output from HTTP->GZIP saved to file "+testGZFileName);
}
}