java utf-8 character-encoding charset

Can UTF-8 content be malformed in Java?


I am trying to create a test case in Java to test:

decoder.onMalformedInput(CodingErrorAction.REPLACE);
decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

I need some characters in the UTF_8 Charset that can be used to test them.


Solution

  • tl;dr

    UTF-8 can represent every character in Unicode, which covers virtually any character in use on earth.

    If you're asking whether a sample of UTF-8 content could be malformed, yes it can. Lay down some bits in a way that violates the rules described in Wikipedia. I assume this would trigger your onMalformedInput but I’ve not tried it.
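One way to check that assumption is sketched below. It feeds two byte sequences that break the UTF-8 rules to a decoder configured as in the question: a lone 0xFF byte, which never occurs in well-formed UTF-8, and a 0xC3 lead byte that is not followed by a continuation byte. The class and method names (MalformedUtf8Demo, decodeReplacing, isRejected) are made up for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MalformedUtf8Demo {

    // Decodes bytes as UTF-8, substituting the decoder's replacement
    // string (U+FFFD by default) for input it cannot decode.
    public static String decodeReplacing(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Returns true if a strict decoder (REPORT) refuses the bytes.
    public static boolean isRejected(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return false;
        } catch (CharacterCodingException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // 0xFF is never valid anywhere in UTF-8.
        byte[] invalidByte = {'a', (byte) 0xFF, 'b'};
        // 0xC3 starts a two-byte sequence, but 'b' is not a continuation byte.
        byte[] missingContinuation = {'a', (byte) 0xC3, 'b'};

        System.out.println(decodeReplacing(invalidByte));
        System.out.println(decodeReplacing(missingContinuation));
        System.out.println(isRejected(invalidByte));
    }
}
```

In a unit test, comparing the decoded string against "a\uFFFDb" is a simple way to confirm that the REPLACE action was applied.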

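    Most possible code points in Unicode have not been assigned to any character. Some of those are set aside for “private use” (Klingon, etc.). And some of those are reserved for future use. Perhaps UTF-8 encoded text containing any of those reserved-for-future-use code points would trigger your onUnmappableCharacter, but I’ve not tried it. Keep in mind that a UTF-8 decoder works from bit patterns and does not need to know whether a code point has been assigned, so it may simply decode them without complaint.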
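The untried idea about reserved code points can be probed with a small sketch like the one below. It strictly decodes the UTF-8 bytes 0xCD 0xB8, which encode U+0378, a code point chosen here on the assumption that it is still unassigned (it is as of Unicode 15). For contrast, it also shows a case where an unmappable-character error does occur: encoding a non-ASCII character with a US-ASCII encoder. All names (UnassignedCodePointDemo, strictDecode, asciiEncoderReportsUnmappable) are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

public class UnassignedCodePointDemo {

    // Strictly decodes UTF-8; returns null if the decoder reports any error.
    public static String strictDecode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // True if strictly encoding the text as US-ASCII reports an
    // unmappable character.
    public static boolean asciiEncoderReportsUnmappable(String text) {
        try {
            StandardCharsets.US_ASCII.newEncoder()
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(text));
            return false;
        } catch (UnmappableCharacterException e) {
            return true;
        } catch (CharacterCodingException e) {
            return false; // some other coding error
        }
    }

    public static void main(String[] args) {
        // UTF-8 encoding of U+0378 (assumed unassigned).
        byte[] unassigned = {(byte) 0xCD, (byte) 0xB8};
        String decoded = strictDecode(unassigned);

        System.out.println("decoded without error: " + (decoded != null));
        System.out.println("known to this JDK: " + Character.isDefined(0x0378));
        System.out.println("ASCII encoder rejects é: "
                + asciiEncoderReportsUnmappable("\u00E9"));
    }
}
```

If the strict decode succeeds, an unassigned code point is not a useful trigger for onUnmappableCharacter on a UTF-8 decoder, and malformed bytes remain the practical way to exercise the error handling.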

    Details

    any character which is not included in UTF_8 Charset

    You are conflating two different things:

    Unicode is a character set which seeks to represent the characters of all living and most academically-significant dead languages. Unicode is a superset of all other character sets combined. As of Unicode 15.0, 149,186 characters have been defined. Each character is assigned a code point number in a range from zero to just over a million (U+0000 to U+10FFFF).

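    UTF-8 is a character encoding that uses one to four octets to represent each code point. UTF-8 can represent any of the over one million possible code points assignable by Unicode. The only exclusion is the surrogate range, U+D800 to U+DFFF, which is reserved for UTF-16 and will never be assigned to characters.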

    So, you would be hard-pressed to find any character used by almost any people on earth that is not already listed in Unicode. And all of those characters can be encoded in UTF-8.
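To make the split between code point and encoding concrete, here is a small sketch that asks Java how many octets UTF-8 needs for a few code points. The sample code points and the names Utf8Lengths and utf8Length are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {

    // Number of octets UTF-8 uses for a single code point.
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x0041,   // LATIN CAPITAL LETTER A
            0x00E9,   // LATIN SMALL LETTER E WITH ACUTE
            0x20AC,   // EURO SIGN
            0x1F600,  // GRINNING FACE
            0x10FFFF  // highest possible code point
        };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d octet(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The counts range from one octet for ASCII up to four for code points beyond U+FFFF.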