java, amazon-web-services, amazon-s3, aws-lambda, gzipinputstream

OOM when trying to process s3 file


I am trying to use the code below to download and read data from a file in S3, but it goes OOM, exactly while reading the file. The compressed file in S3 is 22 MB; downloaded and decompressed through the browser it is 650 MB. Yet when I monitor the process with VisualVM, the memory consumed while uncompressing and reading is more than 2 GB. Can anyone please guide me so that I can find the reason for the high memory usage? Thanks.

    public static String unzip(InputStream in) throws IOException, CompressorException, ArchiveException {
        System.out.println("Unzipping.............");
        GZIPInputStream gzis = null;
        try {
            gzis = new GZIPInputStream(in);
            InputStreamReader reader = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(reader);
            double mb = 0;
            String readed;
            int i = 0;
            while ((readed = br.readLine()) != null) {
                mb = mb + readed.getBytes().length / (1024 * 1024);
                i++;
                if (i % 100 == 0) { System.out.println(mb); }
            }
        } catch (IOException e) {
            e.printStackTrace();
            LOG.error("Invoked AWSUtils getS3Content : json ", e);
        } finally {
            closeStreams(gzis, in);
        }
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:596)
    at java.lang.StringBuffer.append(StringBuffer.java:367)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.kpmg.rrf.utils.AWSUtils.unzip(AWSUtils.java:917)

Monitoring (VisualVM heap graphs; screenshots omitted)


Solution

  • This is a theory, but I can't think of any other reasons why your example would OOM.

    Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes on a single line.

    Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

    Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very long line, that StringBuffer is going to get very large.

    Each time the StringBuffer fills up, it grows by allocating a new backing array roughly twice the size of the old one and copying the characters across (that is the Arrays.copyOf call at the top of the stack trace). The other thing to note is that this 2 x N char array has to be a single contiguous heap node.
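
    To make the growth pattern concrete, here is a small, standalone sketch (not from the original answer) that appends characters one at a time and prints the StringBuffer capacity each time it grows; every growth step allocates a fresh contiguous char[] and copies the old contents across:

        public class BufferGrowth {
            public static void main(String[] args) {
                StringBuffer sb = new StringBuffer();
                int lastCapacity = sb.capacity();
                for (int i = 0; i < 1_000_000; i++) {
                    sb.append('a');
                    if (sb.capacity() != lastCapacity) {
                        // Each growth allocates a new contiguous backing char[] and
                        // copies the old contents into it (the Arrays.copyOf in the trace).
                        System.out.println(lastCapacity + " -> " + sb.capacity());
                        lastCapacity = sb.capacity();
                    }
                }
            }
        }

    On a typical JDK 8 this prints 16 -> 34, 34 -> 70, 70 -> 142, and so on.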

    Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
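
    If you want to check this theory at a smaller scale, here is a hedged, self-contained repro sketch (not part of the original post; the sizes are made up) that gzips one very long line into a byte array and reads it back exactly the way the failing code does. readLine() cannot return until it has buffered the whole line, so the peak heap usage is driven by the longest line, not by the 22 MB compressed input:

        import java.io.*;
        import java.util.zip.*;

        public class LongLineRepro {
            public static void main(String[] args) throws IOException {
                int lineChars = 50_000_000;   // ~50 million chars; the real file was ~650 million bytes

                // Build a gzip stream whose uncompressed content is ONE long line of 'a's.
                ByteArrayOutputStream compressed = new ByteArrayOutputStream();
                try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                    byte[] chunk = new byte[64 * 1024];
                    java.util.Arrays.fill(chunk, (byte) 'a');
                    for (int written = 0; written < lineChars; written += chunk.length) {
                        gz.write(chunk, 0, Math.min(chunk.length, lineChars - written));
                    }
                    gz.write('\n');
                }

                // Read it back the way the failing unzip() does.
                try (BufferedReader br = new BufferedReader(new InputStreamReader(
                        new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))))) {
                    String line = br.readLine();   // the whole line is materialized as one String
                    System.out.println("line length = " + line.length() + " chars, roughly "
                            + (2L * line.length() / (1024 * 1024)) + " MB as a char[] on JDK 8");
                }
            }
        }

    Scale lineChars up towards 650 million and run with a small -Xmx, and it should eventually fail with the same Arrays.copyOf / StringBuffer.append frames at the top of the stack.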


    So what is the solution?

    It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

        public static String unzip(InputStream in) 
                throws IOException, CompressorException, ArchiveException {
            System.out.println("Unzipping.............");
            try (
                GZIPInputStream gzis = new GZIPInputStream(in);
                InputStreamReader reader = new InputStreamReader(gzis);
                BufferedReader br = new BufferedReader(reader);
            ) {
                int ch;
                long i = 0;
                while ((ch = br.read()) >= 0) {
                     i++;
                     if (i % (100 * 1024 * 1024) == 0) {
                         System.out.println(i / (1024 * 1024));
                     }
                }
            } catch (IOException e) {
                e.printStackTrace();
                LOG.error("Invoked AWSUtils getS3Content : json ", e);
            }
            return null;   // nothing meaningful to return here; kept so the original String signature compiles
        }
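
    For completeness, here is a hedged sketch (not from the original answer; the bucket name, key and class name are placeholders) of how the fixed method could be fed straight from S3 using the AWS SDK for Java v1, so the compressed object is streamed rather than buffered, which also matters inside a Lambda where the heap is limited:

        import java.io.InputStream;

        import com.amazonaws.services.s3.AmazonS3;
        import com.amazonaws.services.s3.AmazonS3ClientBuilder;
        import com.amazonaws.services.s3.model.S3Object;

        public class S3UnzipCaller {
            public static void main(String[] args) throws Exception {
                AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
                // "my-bucket" and "path/to/file.gz" are placeholders, not values from the question.
                try (S3Object object = s3.getObject("my-bucket", "path/to/file.gz");
                     InputStream body = object.getObjectContent()) {
                    unzip(body);   // the unzip(...) from the answer above, assumed to be in scope
                }
            }
        }

    Either way, the important point stands: with the character-by-character loop the heap footprint is bounded by the reader and inflater buffers, not by the length of the longest line in the file.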