Tags: java, json, inputstream

How to read a big JSON string value as a stream in Java


I'm receiving a JSON DTO containing a Base64-encoded file in one of its fields.

That file can be quite big (100 MiB+, don't ask), and I'm trying to read it out of the JSON as a stream to reduce the memory load. The JSON itself has only three fields and is otherwise quite small.

Note that I'm already able to read the JSON itself as a stream and iterate over its tokens, but I haven't been able to retrieve the value itself as a stream.

The basic JsonParser.getText(writer) does a .getText() which loads everything into memory, and every solution out there seems to load the whole value into memory at some point (or maybe my google-fu isn't up to par).


Solution

  • Yes, with Jackson's JsonParser you can use methods that don't load the whole value into memory:

    Unfortunately, with the Gson library there's no way to stream into a writer, because its reader interface only provides a nextString() method.
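
    For comparison, this is roughly all Gson's streaming API offers (a sketch; the file path is assumed): nextString() materializes the entire value as a single String, so there is nowhere to plug in a Writer or OutputStream.

    try (var reader = new com.google.gson.stream.JsonReader(
            new java.io.FileReader("src/main/resources/sample-json.json"))) {
        reader.beginObject();
        while (reader.hasNext()) {
            String name = reader.nextName();
            String value = reader.nextString(); // the whole field value lands in memory here
        }
        reader.endObject();
    }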

    Using readBinaryValue

    You can use a method that reads the Base64 text, decodes it to binary, and streams the result into your own OutputStream:

    parser.readBinaryValue(Base64Variants.MIME, jsonOutputStream);
    

    and then extend OutputStream, tweaking the implementation to consume the decoded bytes however you want:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.Base64;

    import lombok.Getter;

    public class JsonOutputStream extends OutputStream {
        // readBinaryValue() hands over already-decoded binary bytes; buffer them here
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream(3);
        private final Base64.Encoder encoder = Base64.getMimeEncoder();

        @Getter
        private int count = 0;

        @Override
        public void write(int b) throws IOException {
            buffer.write(b);
            if (buffer.size() == 3) { // 3 binary bytes re-encode to exactly 4 Base64 chars, no padding
                flushBuffer();
            }
        }

        @Override
        public void close() throws IOException {
            flushBuffer(); // encode any trailing partial chunk (this one may carry padding)
        }

        public void resetCount() {
            count = 0;
        }

        private void flushBuffer() {
            if (buffer.size() == 0) return;
            byte[] encoded = encoder.encode(buffer.toByteArray());
            count += encoded.length; // e.g. track how many re-encoded characters passed through
            buffer.reset();
        }
    }
    

    JsonParser uses its own default Base64 variant; if you want the standard one, you have to choose Base64Variants.MIME explicitly. A URL-safe variant (Base64Variants.MODIFIED_FOR_URL) is also available.

    And the good thing about this option is that you don't have to tweak anything, because it doesn't accumulate segments the way getText does. It allocates a fixed-size buffer, configured more or less like private final static int[] BYTE_BUFFER_LENGTHS = new int[] { 8000, 8000, 2000, 2000 }; (because Base64 chunks are always 4 characters long and may carry padding, there is some adjustment with - 3). Then, whenever it needs more room, it simply flushes the buffer to your stream.

    Therefore, it never loads the whole string into memory, and you don't have to do any custom tweaking.
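
    For illustration, here is a minimal end-to-end sketch of this approach; the field name "file", the sample path, and the output file name are assumptions, so adjust them to your payload. The decoded bytes go straight to disk without the full value ever being held in memory.

    import java.io.FileOutputStream;
    import java.io.FileReader;

    import com.fasterxml.jackson.core.Base64Variants;
    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonToken;

    public class BinaryFieldExtractor {
        public static void main(String[] args) throws Exception {
            var factory = new JsonFactory();
            try (var parser = factory.createParser(new FileReader("src/main/resources/sample-json.json"));
                 var out = new FileOutputStream("decoded.bin")) {
                while (parser.nextToken() != null) {
                    // when we reach the Base64 field's value, stream-decode it into the file
                    if (parser.getCurrentToken() == JsonToken.VALUE_STRING
                            && "file".equals(parser.getCurrentName())) {
                        parser.readBinaryValue(Base64Variants.MIME, out);
                    }
                }
            }
        }
    }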

    Using JsonParser.getText(Writer w)

    "The basic JsonParser.getText(writer) does a .getText which loads everything into memory"

    Yes, you're right, but you can tweak it. Every implementation, such as ReaderBasedJsonParser or UTF8DataInputJsonParser, calls a helper method inside getText, as in this snippet:

    if (t == JsonToken.VALUE_STRING) {
        if (_tokenIncomplete) {
            _tokenIncomplete = false;
            _finishString(); // only strings can be incomplete
        }
        // ... (remainder of getText)
    }
    

    I debugged it, and for string tokens _tokenIncomplete is always true, as intended. Inside the method _finishString2, called by _finishString, there is a nice loop that copies the JSON string input into a segmented buffer. Each segment has a size that can be configured from the factory via the context passed to the parser's constructor. The default value configured in TextBuffer is

    final static int MAX_SEGMENT_LEN = 0x10000;
    

    and a new segment is created in this code fragment:

    // Need more room?
    if (outPtr >= outBuf.length) {
        outBuf = _textBuffer.finishCurrentSegment();
        outPtr = 0;
    }
    

    So instead of finishing the current segment, we can simply write its contents to our writer and reset the buffer, releasing the memory each time a new chunk is read. That way we never load the whole string into memory, only MAX_SEGMENT_LEN characters at most.

    For example, you can extend ReaderBasedJsonParser and write your own factory for it, analogous to JsonFactory:

    public class ChunkReaderBasedJsonParser extends ReaderBasedJsonParser {
        private boolean finishedStringBuffer = false;
    
        public ChunkReaderBasedJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st, char[] inputBuffer, int start, int end, boolean bufferRecyclable) {
            super(ctxt, features, r, codec, st, inputBuffer, start, end, bufferRecyclable);
        }
    
        public ChunkReaderBasedJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st) {
            super(ctxt, features, r, codec, st);
        }
    
        @Override
        public int getText(Writer writer) throws IOException {
            JsonToken t = _currToken;
            if (t == JsonToken.VALUE_STRING) {
                int total = 0;
                while (_tokenIncomplete) {
                    _tokenIncomplete = getNextStringChunk();
                    total += _textBuffer.contentsToWriter(writer);
                    _textBuffer.resetWithEmpty();
                }
                return total;
            }
            if (t == JsonToken.FIELD_NAME) {
                String n = _parsingContext.getCurrentName();
                writer.write(n);
                return n.length();
            }
            if (t != null) {
                if (t.isNumeric()) {
                    return _textBuffer.contentsToWriter(writer);
                }
                char[] ch = t.asCharArray();
                writer.write(ch);
                return ch.length;
            }
            return 0;
        }
    
        /**
         *
         * @return true if the token is still incomplete, false if it is complete
         */
        private boolean getNextStringChunk() throws IOException {
            finishedStringBuffer = false;
            _finishString();
            return finishedStringBuffer;
        }
    
        @Override
        protected void _finishString2() throws IOException {
            char[] outBuf = _textBuffer.getCurrentSegment();
            int outPtr = _textBuffer.getCurrentSegmentSize();
            final int[] codes = _icLatin1; // the Latin-1 input-code table inherited from ReaderBasedJsonParser
            final int maxCode = codes.length;
    
            while (true) {
                // we ran out of buffer?
                if (outPtr >= outBuf.length) {
                    // record what has been copied so far, so the caller can flush this chunk
                    _textBuffer.setCurrentLength(outPtr);
                    finishedStringBuffer = true;
                    return;
                }
    
                if (_inputPtr >= _inputEnd) {
                    if (!_loadMore()) {
                        _reportInvalidEOF(": was expecting closing quote for a string value",
                                JsonToken.VALUE_STRING);
                    }
                }
                char c = _inputBuffer[_inputPtr++];
                int i = c;
                if (i < maxCode && codes[i] != 0) {
                    if (i == INT_QUOTE) {
                        break;
                    } else if (i == INT_BACKSLASH) {
                        /* Although chars outside of BMP are to be escaped as
                         * an UTF-16 surrogate pair, does that affect decoding?
                         * For now let's assume it does not.
                         */
                        c = _decodeEscaped();
                    } else if (i < INT_SPACE) {
                        _throwUnquotedSpace(i, "string value");
                    } // anything else?
                }
    
                // Ok, let's add char to output:
                outBuf[outPtr++] = c;
            }
            _textBuffer.setCurrentLength(outPtr);
        }
    }
    

    I debugged it with a big string, and it does load it in segments, though in a rather unexpected way. There is probably still a bug somewhere, but it's good scaffolding to begin with. There is also some problem with the read string length: it reads about 100 kB instead of 350 kB. Maybe it has something to do with MAX_SEGMENT_LEN, though I don't think so; I guess it's an encoding issue. Then you use it like this, with your own factory method (sketched below the snippet):

    var factory = new ChunkJsonFactory();
    try (var parser = factory.createParser(new FileReader("src/main/resources/sample-json.json"))) {
        while (parser.nextToken() != null) {
            if (parser.getCurrentToken().isScalarValue()) {
                parser.getText(writer);
            }
        }
    }
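
    ChunkJsonFactory only needs to override the parser-creation hook. Here is a minimal sketch, assuming Jackson 2.x, where the protected _createParser(Reader, IOContext) method builds the Reader-based parser (the exact makeChild signature differs between minor versions):

    import java.io.IOException;
    import java.io.Reader;

    import com.fasterxml.jackson.core.JsonFactory;
    import com.fasterxml.jackson.core.JsonParser;
    import com.fasterxml.jackson.core.io.IOContext;

    public class ChunkJsonFactory extends JsonFactory {
        @Override
        protected JsonParser _createParser(Reader r, IOContext ctxt) throws IOException {
            // same wiring as the default JsonFactory, but returning the chunked parser;
            // on older Jackson 2.x, use _rootCharSymbols.makeChild(_factoryFeatures) instead
            return new ChunkReaderBasedJsonParser(ctxt, _parserFeatures, r, _objectCodec,
                    _rootCharSymbols.makeChild());
        }
    }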
    

    Inheriting Writer

    For example, if you want to count occurrences of a given character inside the string, you can extend Writer with the following code:

    import java.io.IOException;
    import java.io.Writer;

    import lombok.Getter;
    import lombok.Setter;

    public class JsonWriter extends Writer {

        @Getter
        @Setter
        private int count;

        @Setter
        private char letterToCount = 'a';

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            for (int i = off; i < off + len; i++) {
                if (cbuf[i] == letterToCount) {
                    count++;
                }
            }
        }

        // Writer is abstract; these two must be implemented for the class to compile
        @Override
        public void flush() throws IOException {
            // nothing is buffered, so nothing to flush
        }

        @Override
        public void close() throws IOException {
            // nothing to release
        }
    }
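
    Wired together with the parser from before (a sketch reusing the names from the snippets above):

    var writer = new JsonWriter();
    writer.setLetterToCount('a');

    var factory = new ChunkJsonFactory();
    try (var parser = factory.createParser(new FileReader("src/main/resources/sample-json.json"))) {
        while (parser.nextToken() != null) {
            if (parser.getCurrentToken().isScalarValue()) {
                parser.getText(writer); // each chunk is pushed through JsonWriter as it is read
            }
        }
    }
    System.out.println("Occurrences of 'a': " + writer.getCount());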
    

    I created a sample repository here.