java java-stream bufferedreader compression gzipinputstream

Reading a Gzip file of an Unknown length in Java


I have a requirement to download a file from S3 which is in .gz format. I can very well do that:

BufferedInputStream bufferedInputStream = new BufferedInputStream( new GZIPInputStream(fileObj.getObjectContent()));

Now, to read the content of this file I might have to do something like this:

    int n;
    byte[] buffer = new byte[1024];
    while ((n = bufferedInputStream.read(buffer)) != -1) {
    }

However I do not know the size of my original .gz file.

It might be said that I could get the size from some API of the aws-s3-sdk. But still, I think there must be a better way.

Also, I need to do this decompression really fast. Is there any equivalent of parallel streaming which I can perform on a GZIPInputStream?


Solution

  • I have a requirement to download a file from S3 which is in .gz format. I can very well do that

    BufferedInputStream bufferedInputStream = new BufferedInputStream(new
    GZIPInputStream(fileObj.getObjectContent()));
    

    First of all, GZIPInputStream doesn't take the file content as a constructor argument, but rather an input stream over the file.

    Second, you don't necessarily need a BufferedInputStream, because you can already buffer your reads yourself by passing a byte array to the GZIPInputStream.read(byte[]) method it inherits from its parent class.
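To illustrate both points, here is a minimal, self-contained sketch: it writes a small sample .gz file to the temp directory (standing in for the downloaded S3 object), then opens it by passing a plain FileInputStream straight to the GZIPInputStream constructor, with no BufferedInputStream in between:

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipOpenDemo {

    // Decompress a .gz file: the constructor takes the file's
    // InputStream, not the file content itself.
    public static String readGzipFile(Path path) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new FileInputStream(path.toFile()))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192]; // read(byte[]) already buffers for us
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a small sample .gz file in the temp directory.
        Path path = Files.createTempFile("sample", ".gz");
        try (GZIPOutputStream out = new GZIPOutputStream(Files.newOutputStream(path))) {
            out.write("hello gzip".getBytes("UTF-8"));
        }
        System.out.println(readGzipFile(path)); // prints "hello gzip"
        Files.delete(path);
    }
}
```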

    Thirdly, you do not need to know the size of a gzip file (or any other file) when reading it in Java. This is precisely what the xxxInputStream family of classes is all about: you just need to know where to start reading, not where the data ends.

    So your code will look like:

        int megabytesCount = 10;
        try (GZIPInputStream gzipInputStream = new GZIPInputStream(yourFileInputStream)) {
            byte[] buffer = new byte[megabytesCount * 1024 * 1024];
            int bytesRead;
            while ((bytesRead = gzipInputStream.read(buffer)) != -1) {
                // do something with your buffer and its current size bytesRead
            }
        } catch (IOException e) {
            // handle the exception
        }
    

    The read loop pulls chunks of bytes from your file, up to the size of your buffer array. Each call may read less than that maximum or exactly the maximum; you don't know in advance. What you do know is that the number of bytes actually read is stored in your variable bytesRead. If bytesRead != -1, you have read some data from the file; only when bytesRead == -1 do you know you have reached the end of the file. That is why you don't need to know the actual size of your file: just open the file (or download it from aws-s3) and start reading.
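Putting it together, here is a runnable sketch of the same unknown-length read loop. An in-memory ByteArrayInputStream stands in for the S3 object stream (an assumption made just so the example is self-contained):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipLoopDemo {

    // Gzip a string in memory (stands in for the .gz object on S3).
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes("UTF-8"));
        }
        return bytes.toByteArray();
    }

    // Read the gzip stream to the end without knowing its size up front:
    // the loop simply stops when read() returns -1.
    public static String gunzip(byte[] gzipped) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead); // bytesRead may be < buffer.length
            }
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        String original = "some content of unknown length";
        System.out.println(gunzip(gzip(original)).equals(original)); // prints "true"
    }
}
```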

    Also, I need to do this decompression really fast. Is there any equivalent of parallel streaming which I can perform on a GZIPInputStream?

    Unzipping a *.gz file through a GZIPInputStream should be fast enough if you size your buffer properly. For example, for a 1 GB file with megabytesCount = 10, you only need to access the file about 100 times.

    If you want to go faster (and your program's memory allows it), set megabytesCount = 100, and you only need about 10 accesses.
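The effect of the buffer size on the number of read() calls can be checked directly. A rough sketch (the exact counts depend on the data and the JDK, so only the relative order matters):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBufferDemo {

    // Count how many read() calls it takes to drain a gzip stream
    // with a given buffer size.
    public static int countReads(byte[] gzipped, int bufferSize) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            byte[] buffer = new byte[bufferSize];
            int reads = 0;
            while (in.read(buffer) != -1) {
                reads++;
            }
            return reads;
        }
    }

    public static void main(String[] args) throws IOException {
        // ~1 MB of highly compressible data as a stand-in for the real file.
        byte[] data = new byte[1024 * 1024];
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        }
        byte[] gzipped = bytes.toByteArray();

        int small = countReads(gzipped, 1024);        // 1 KB buffer
        int large = countReads(gzipped, 1024 * 1024); // 1 MB buffer
        System.out.println(small > large); // larger buffer => fewer reads
    }
}
```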

    Parallel streaming brings nothing here, because gzip decompression has to consume the data one chunk after the other.