I'm receiving a JSON DTO containing a b64 file in one of its fields.
That file can be quite big (100MiB+ - don't ask) and I'm trying to read it as a stream out of the JSON to reduce the load on memory. The JSON itself has only three fields, so apart from that value it's quite small.
Note that I'm already able to read the JSON itself as a stream and iterate over its tokens, but I haven't been able to retrieve the value itself as a stream.
The basic JsonParser.getText(writer) just does a getText() internally, which loads everything into memory, and every solution out there seems to load the whole value into memory at some point (or maybe my google-fu isn't up to par).
Yes, JsonParser has methods that don't load the whole JSON value into memory:
JsonParser.readBinaryValue(OutputStream out)
JsonParser.getText(Writer w)
Unfortunately, with the GSON library there's no way to stream the value to a writer, because its JsonReader interface only provides a nextString() method.
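For comparison, a minimal GSON sketch (using com.google.gson.stream.JsonReader; the file name is made up) that shows why it can't stream the value:
try (var reader = new JsonReader(new FileReader("payload.json"))) {
    reader.beginObject();
    while (reader.hasNext()) {
        reader.nextName();
        // nextString() is the only accessor for string values, so the
        // whole 100MiB+ field materializes as a single String in memory
        String value = reader.nextString();
    }
    reader.endObject();
}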
readBinaryValue
You can use this method, which reads the base64 text, decodes it to binary and streams it into your own OutputStream:
parser.readBinaryValue(Base64Variants.MIME, jsonOutputStream);
Then extend OutputStream and tweak the implementation to handle the bytes the way you want:
public class JsonOutputStream extends OutputStream {
    @Getter
    private int count = 0;

    @Override
    public void write(int b) throws IOException {
        // readBinaryValue hands us already-decoded binary bytes, so there
        // is no base64 left to handle here; as an example we just count
        // the bytes, but this is where you would forward them to a file,
        // a digest, another stream, etc.
        count++;
    }

    public void resetCount() {
        count = 0;
    }
}
Note that JsonParser uses its own default base64 variant; if you want the standard MIME one you have to pass Base64Variants.MIME explicitly. There is also a URL-safe variant (Base64Variants.MODIFIED_FOR_URL).
And the good thing about this option is that you don't have to tweak anything: unlike the getText option it doesn't accumulate segments. It allocates a byte buffer whose length is configured more or less like
private final static int[] BYTE_BUFFER_LENGTHS = new int[] { 8000, 8000, 2000, 2000 };
(because of the padding in b64, where chunks are always 4 characters long, there is some tweaking with - 3). Then, whenever it needs more room, it simply flushes the buffer to your stream. Therefore it never loads the whole string into memory and you don't have to do any custom tweaking.
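Putting it together, a minimal sketch of the parse loop (assuming Jackson 2.x and that the base64 field is called "data"; the file names are made up, imports from com.fasterxml.jackson.core and java.io):
try (JsonParser parser = new JsonFactory().createParser(new File("payload.json"));
     OutputStream out = new FileOutputStream("decoded.bin")) {
    while (parser.nextToken() != null) {
        if (parser.getCurrentToken() == JsonToken.FIELD_NAME
                && "data".equals(parser.getCurrentName())) {
            parser.nextToken(); // advance to the VALUE_STRING token
            // decodes the base64 text and streams the binary straight to out
            parser.readBinaryValue(Base64Variants.MIME, out);
        }
    }
}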
JsonParser.getText(Writer w)
The basic JsonParser.getText(writer) does a getText which loads everything into memory
Yes, you're right, but you can tweak it. Every implementation, like ReaderBasedJsonParser or UTF8DataInputJsonParser, calls a helper method inside getText, as in this snippet:
if (t == JsonToken.VALUE_STRING) {
    if (_tokenIncomplete) {
        _tokenIncomplete = false;
        _finishString(); // only strings can be incomplete
    }
    ...
}
and I debugged it: for string tokens _tokenIncomplete is always true, as intended. Then, inside the method _finishString2 called by _finishString, there is a nice loop that copies the JSON's string input into a segmented buffer. Each segment has a size that can be configured from the factory via the context passed to the parser's constructor. The default value configured in TextBuffer is
final static int MAX_SEGMENT_LEN = 0x10000;
and the new segment is created in the code fragment
// Need more room?
if (outPtr >= outBuf.length) {
    outBuf = _textBuffer.finishCurrentSegment();
    outPtr = 0;
}
so instead of finishing the current segment we can simply write its contents to our writer and reset the buffer, releasing the memory each time a new chunk is read. That way we never load the whole string into memory, only MAX_SEGMENT_LEN characters at most.
For example, you can inherit from ReaderBasedJsonParser and write your own factory method for it, analogously to the JsonFactory:
public class ChunkReaderBasedJsonParser extends ReaderBasedJsonParser {
private boolean finishedStringBuffer = false;
public ChunkReaderBasedJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st, char[] inputBuffer, int start, int end, boolean bufferRecyclable) {
super(ctxt, features, r, codec, st, inputBuffer, start, end, bufferRecyclable);
}
public ChunkReaderBasedJsonParser(IOContext ctxt, int features, Reader r, ObjectCodec codec, CharsToNameCanonicalizer st) {
super(ctxt, features, r, codec, st);
}
@Override
public int getText(Writer writer) throws IOException {
JsonToken t = _currToken;
if (t == JsonToken.VALUE_STRING) {
int total = 0;
            // pull the string in segment-sized chunks, flushing each chunk
            // to the writer and releasing the buffer before reading the next
            while (_tokenIncomplete) {
_tokenIncomplete = getNextStringChunk();
total += _textBuffer.contentsToWriter(writer);
_textBuffer.resetWithEmpty();
}
return total;
}
if (t == JsonToken.FIELD_NAME) {
String n = _parsingContext.getCurrentName();
writer.write(n);
return n.length();
}
if (t != null) {
if (t.isNumeric()) {
return _textBuffer.contentsToWriter(writer);
}
char[] ch = t.asCharArray();
writer.write(ch);
return ch.length;
}
return 0;
}
/**
*
* @return true if the token is still incomplete, false if it is complete
*/
private boolean getNextStringChunk() throws IOException {
finishedStringBuffer = false;
_finishString();
return finishedStringBuffer;
}
@Override
protected void _finishString2() throws IOException {
char[] outBuf = _textBuffer.getCurrentSegment();
int outPtr = _textBuffer.getCurrentSegmentSize();
final int[] codes = INPUT_CODES_LATIN1;
final int maxCode = codes.length;
while (true) {
                // Ran out of room in the current segment? Commit what we
                // wrote so far (otherwise those characters are silently
                // dropped) and hand control back so getText can flush it.
                if (outPtr >= outBuf.length) {
                    _textBuffer.setCurrentLength(outPtr);
                    finishedStringBuffer = true;
                    return;
                }
if (_inputPtr >= _inputEnd) {
if (!_loadMore()) {
_reportInvalidEOF(": was expecting closing quote for a string value",
JsonToken.VALUE_STRING);
}
}
char c = _inputBuffer[_inputPtr++];
int i = c;
if (i < maxCode && codes[i] != 0) {
if (i == INT_QUOTE) {
break;
} else if (i == INT_BACKSLASH) {
/* Although chars outside of BMP are to be escaped as
* an UTF-16 surrogate pair, does that affect decoding?
* For now let's assume it does not.
*/
c = _decodeEscaped();
} else if (i < INT_SPACE) {
_throwUnquotedSpace(i, "string value");
} // anything else?
}
// Ok, let's add char to output:
outBuf[outPtr++] = c;
}
_textBuffer.setCurrentLength(outPtr);
}
}
I debugged it for a big string and it does load it in segments. One pitfall with the read string length: at first it read about 100kb instead of 350kb, which I took for an encoding issue, but the culprit seems to be that returning early from _finishString2 without committing the segment length silently drops the partially filled segment; that's why the snippet above calls _textBuffer.setCurrentLength(outPtr) before the early return. Treat the whole thing as scaffolding to begin with. You then use it like this, together with your own factory method:
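ChunkJsonFactory isn't shown above; a minimal sketch, assuming Jackson 2.x (the exact _createParser signature and CharsToNameCanonicalizer wiring differ slightly between minor versions), could look like this:
public class ChunkJsonFactory extends JsonFactory {
    @Override
    protected JsonParser _createParser(Reader r, IOContext ctxt) throws IOException {
        // same wiring JsonFactory uses for ReaderBasedJsonParser,
        // just substituting the chunking subclass
        return new ChunkReaderBasedJsonParser(ctxt, _parserFeatures, r, _objectCodec,
                _rootCharSymbols.makeChild(_factoryFeatures));
    }
}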
var factory = new ChunkJsonFactory();
try (var parser = factory.createParser(new FileReader("src/main/resources/sample-json.json"))) {
    while (parser.nextToken() != null) {
        if (parser.getCurrentToken().isScalarValue()) {
            parser.getText(writer);
        }
    }
}
Writer
For example, if you want to count the occurrences of a character inside the string, you can extend Writer and use the following code:
public class JsonWriter extends Writer {
    @Getter
    @Setter
    private int count;
    @Setter
    private char letterToCount = 'a';

    @Override
    public void write(char[] cbuf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            if (cbuf[i] == letterToCount) {
                count++;
            }
        }
    }

    // Writer is abstract, so these two must be implemented as well;
    // nothing is buffered here, so they are no-ops
    @Override
    public void flush() throws IOException {
    }

    @Override
    public void close() throws IOException {
    }
}
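Hooked up to the parser loop above (hypothetical wiring, counting occurrences of 'A'):
var writer = new JsonWriter();
writer.setLetterToCount('A');
// ... run the ChunkJsonFactory parser loop from above with this writer ...
System.out.println("occurrences: " + writer.getCount());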
I created a sample repository here.