java string java-8

CharsetDecoder used with an InputStreamReader in Java 8 onwards doesn't seem to honor CodingErrorAction.REPLACE


The issue can be seen in Java 8 onwards. A CharsetDecoder is explicitly configured with CodingErrorAction.REPLACE; I would expect the replacement character \uFFFD to be applied when decoding malformed input.

When the CharsetDecoder is used with an InputStream, the replacement character is not applied to malformed input, especially when the malformed bytes are towards the end of the byte stream.

Is this the expected outcome when using REPLACE with a CharsetDecoder?

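import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class DecoderReplaceTest {
    public static void main(String[] args) throws IOException {
        final String z = "髙"; // the character the byte sequence below encodes
        Charset charset = Charset.forName("x-windows-iso2022jp");

        // Decoder used directly on a ByteBuffer
        CharsetDecoder charsetDecoderForStr = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // Identically configured decoder, handed to an InputStreamReader
        CharsetDecoder charsetDecoderForStrm = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        // The encoded character followed by a lone trailing ESC (27), which is malformed
        ByteBuffer buffer = ByteBuffer.allocate(21);
        byte[] bytes = {27, 36, 66, 124, 98, 27, 40, 66, 27};
        buffer.put(bytes);

        System.out.println("Using new String without Decoder" + new String(bytes, charset).trim());
        System.out.println("Using Decoder " + charsetDecoderForStr.decode((ByteBuffer) buffer.flip()).toString().trim());

        // Only a single read(char[]) call is made on the reader
        Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charsetDecoderForStrm);
        char[] chars = new char[10];
        reader.read(chars);
        System.out.println("Using StreamDecoder " + new String(chars).trim());
    }
}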


Using new String without Decoder髙�    // uses default decoder
Using Decoder 髙�                      // uses decoder with CodingErrorAction.REPLACE
Using StreamDecoder 髙                 // uses decoder with CodingErrorAction.REPLACE

The StreamDecoder output is missing the replacement character "\uFFFD".


Solution

  • When the CharsetDecoder is used with an InputStream, the replacement character is not applied to malformed input, especially when the malformed bytes are towards the end of the byte stream.

    Your example does not demonstrate that. What it shows is that the replacement character is not read from the reader on the first read. But neither Reader in general nor InputStreamReader in particular makes any promises about how much data is transferred on any given invocation of read(char[]), other than that, for an array of length greater than 0, it will block until at least one character is transferred, the end of the data is reached, or an error occurs.

    You observe one character being transferred on the first read(char[]), which is consistent with the specification. You do nothing further to check the state of the reader, but I expect that you would find, as I do, that if you perform another read then the first and only character it transfers is the replacement character you expected. That is, you will get exactly the same data from your InputStreamReader / CharsetDecoder combo as you do by the other methods, provided you make sure to read all the data it provides.
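
To see this for yourself, one option is to log what each read(char[]) call delivers until the reader reports the end of the data. The following is only a sketch: the class name ReadCallTrace and the helper readChunks are made up for illustration, and how the characters are split across calls is an implementation detail that may differ between JDK versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

public class ReadCallTrace {

    // Hypothetical helper: collects the characters delivered by each
    // read(char[]) call, stopping only when read returns -1.
    public static List<String> readChunks(Reader reader) {
        List<String> chunks = new ArrayList<>();
        char[] buf = new char[10];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                chunks.add(new String(buf, 0, n));
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Same bytes and decoder configuration as in the question.
        byte[] bytes = {27, 36, 66, 124, 98, 27, 40, 66, 27};
        CharsetDecoder decoder = Charset.forName("x-windows-iso2022jp").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            int call = 1;
            for (String chunk : readChunks(reader)) {
                System.out.println("read #" + call++ + " -> "
                        + chunk.length() + " char(s): " + chunk);
            }
        }
    }
}
```

If the reader behaves as described above, the trace should show more than one successful read, with the replacement character arriving in a later call than the first.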

    It is not safe to interpret Reader.read(char[]) reading fewer characters than the length of the array as an indication that the end of the character stream has been reached. There are particular cases where that is a reliable test, but also plenty of cases where it isn't. The best and safest way to know you've reached the end of the data is to observe one of the Reader.read() methods telling you so by returning -1.
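
In practice that usually means draining the reader in a loop rather than relying on a single read. A minimal sketch, assuming you want the whole decoded content as a String (ReaderDrain and readFully are illustrative names, not part of the JDK):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public final class ReaderDrain {

    // Hypothetical helper: appends everything the reader provides and
    // treats only a return value of -1 as the end of the data.
    public static String readFully(Reader reader) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[4]; // deliberately small; short reads are handled anyway
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked only to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Same bytes and decoder configuration as in the question.
        byte[] bytes = {27, 36, 66, 124, 98, 27, 40, 66, 27};
        CharsetDecoder decoder = Charset.forName("x-windows-iso2022jp").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);

        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decoder)) {
            System.out.println("Using StreamDecoder " + readFully(reader).trim());
        }
    }
}
```

Building the String from the accumulated characters, rather than from the whole char[10] array as in the question, also avoids padding the result with unused '\u0000' slots that then have to be trimmed away.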