java, amazon-web-services, amazon-s3, aws-lambda, gzipinputstream

OOM when trying to process s3 file


I am trying to use the code below to download and read data from a file in S3, but it goes OOM, exactly while reading the file. The compressed file in S3 is 22 MB; downloaded and decompressed through the browser it is 650 MB. Yet when I monitor the process with VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone please guide me so that I can find the reason for the high memory usage? Thanks.

    public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        GZIPInputStream gzis = null;
        try {
            gzis = new GZIPInputStream(in);
            InputStreamReader reader = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(reader);
            double mb = 0;
            String readed;
            int i = 0;
            while ((readed = br.readLine()) != null) {
                mb = mb + readed.getBytes().length / (1024 * 1024);
                i++;
                if (i % 100 == 0) { System.out.println(mb); }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        } finally {
            closeStreams(gzis, in);
        }
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)

Monitoring (VisualVM heap graphs; screenshots omitted)


Solution

  • This is a theory, but I can't think of any other reasons why your example would OOM.

    Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes on a single line.

    Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

    Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very long line, that StringBuffer is going to get very large.

    Each time the StringBuffer fills up, it grows by allocating a new backing array roughly twice the size of the old one and copying the characters across (that is the Arrays.copyOf call at the top of the stack trace). The other thing to note is that this 2 x N char array has to be a single contiguous heap node.
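
    To make the growth pattern concrete, here is a small, standalone sketch (not from the original answer) that appends characters one at a time and prints the StringBuffer capacity each time it grows; every growth step allocates a fresh contiguous char[] and copies the old contents across:

        public class BufferGrowth {
            public static void main(String[] args) {
                StringBuffer sb = new StringBuffer();
                int lastCapacity = sb.capacity();
                for (int i = 0; i < 1_000_000; i++) {
                    sb.append('a');
                    if (sb.capacity() != lastCapacity) {
                        // Each growth allocates a new contiguous backing char[] and
                        // copies the old contents into it (the Arrays.copyOf in the trace).
                        System.out.println(lastCapacity + " -> " + sb.capacity());
                        lastCapacity = sb.capacity();
                    }
                }
            }
        }

    On a typical JDK 8 this prints 16 -> 34, 34 -> 70, 70 -> 142, and so on.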

    Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
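
    If you want to check this theory at a smaller scale, here is a hedged, self-contained repro sketch (not part of the original post; the sizes are made up) that gzips one very long line into a byte array and reads it back exactly the way the failing code does. readLine() cannot return until it has buffered the whole line, so the peak heap usage is driven by the longest line, not by the 22 MB compressed input:

        import java.io.*;
        import java.util.zip.*;

        public class LongLineRepro {
            public static void main(String[] args) throws IOException {
                int lineChars = 50_000_000;   // ~50 million chars; the real file was ~650 million bytes

                // Build a gzip stream whose uncompressed content is ONE long line of 'a's.
                ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                    byte[] chunk = new byte[64 * 1024];
                    java.util.Arrays.fill(chunk, (byte) 'a');
                    for (int written = 0; written < lineChars; written += chunk.length) {
                        gz.write(chunk, 0, Math.min(chunk.length, lineChars - written));
                    }
                    gz.write('\n');
                }

                // Read it back the way the failing unzip() does.
                try (BufferedReader br = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))))) {
                    String line = br.readLine();   // the whole line is materialized as one String
                    System.out.println("line length = " + line.length() + " chars, roughly "
                            + (2L * line.length() / (1024 * 1024)) + " MB as a char[] on JDK 8");
                }
            }
        }

    Scale lineChars up towards 650 million and run with a small -Xmx, and it should eventually fail with the same Arrays.copyOf / StringBuffer.append frames at the top of the stack.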


    So what is the solution?

    It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

        public static String unzip(InputStream in) 
                throws IOException, CompressorException, ArchiveException {
            System.out.println("Unzipping.............");
            try (
                GZIPInputStream gzis = new GZIPInputStream(in);
                InputStreamReader reader = new InputStreamReader(gzis);
                BufferedReader br = new BufferedReader(reader);
            ) {
                int ch;
                long i = 0;
                while ((ch = br.read()) >= 0) {
                     i++;
                     if (i % (100 * 1024 * 1024) == 0) {
                         System.out.println(i / (1024 * 1024));
                     }
                }
            } catch (IOException e) {
                e.printStackTrace();
                LOG.error("Invoked AWSUtils getS3Content : json ", e);
            }
            return null;   // nothing meaningful to return here; kept so the original String signature compiles
        }
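
    For completeness, here is a hedged sketch (not from the original answer; the bucket name, key and class name are placeholders) of how the fixed method could be fed straight from S3 using the AWS SDK for Java v1, so the compressed object is streamed rather than buffered, which also matters inside a Lambda where the heap is limited:

        import java.io.InputStream;

        import com.amazonaws.services.s3.AmazonS3;
        import com.amazonaws.services.s3.AmazonS3ClientBuilder;
        import com.amazonaws.services.s3.model.S3Object;

        public class S3UnzipCaller {
            public static void main(String[] args) throws Exception {
                AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
                // "my-bucket" and "path/to/file.gz" are placeholders, not values from the question.
                try (S3Object object = s3.getObject("my-bucket", "path/to/file.gz");
                     InputStream body = object.getObjectContent()) {
                    unzip(body);   // the unzip(...) from the answer above, assumed to be in scope
                }
            }
        }

    Either way, the important point stands: with the character-by-character loop the heap footprint is bounded by the reader and inflater buffers, not by the length of the longest line in the file.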