java string unicode codepages cp1250

Java: convert a standard String to CP1250 with only one byte for every char


I need to convert a standard String to CP1250 with exactly one byte per char, so that, for example, the Polish char 'ł' is encoded as 0xB3 rather than as a two-byte Unicode sequence. When I try something like this:

byte[] array = "ała".getBytes();
s = new String(array, 0, array.length, Charset.forName("CP1250"));

and then call s.getBytes(), it returns more bytes than there are letters, and 'ł' takes 2 bytes, as in Unicode. I need to convert any String to bytes that exactly match the CP1250 codes listed here: https://pl.wikipedia.org/wiki/Windows-1250#Tablica_kod.C3.B3w


Solution

  • Calling getBytes() with no argument encodes the String using Java's default charset, whatever that happens to be (it could be UTF-8 or something else; it is configurable). You then decode those bytes back into a String while telling the decoder that they are CP1250, which they might not be, so the resulting String can be corrupted. Either way, you end up with a String again, which is not what you are asking for. Calling s.getBytes() afterwards encodes with the default charset once more, which is why 'ł' still takes two bytes if that default is UTF-8.

    You need to tell getBytes() that you want the bytes to be encoded as CP1250, e.g.:

    byte[] array = "ała".getBytes("CP1250");
    

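    Or, passing a Charset object, which avoids the checked UnsupportedEncodingException that the String overload declares: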

    byte[] array = "ała".getBytes(Charset.forName("CP1250"));
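As a rough illustration of the difference, here is a minimal sketch that encodes the same text as CP1250 and as UTF-8 and prints the bytes in hex. The class name Cp1250Demo and the helpers toCp1250 and toHex are made up for this example, and it assumes the runtime provides the windows-1250 charset (full JDK installations normally do). The 'ł' is written as \u0142 so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1250Demo {
    // Assumption: windows-1250 (alias CP1250) is available in this runtime.
    public static final Charset CP1250 = Charset.forName("windows-1250");

    // Hypothetical helper: encode text as CP1250, one byte per mappable char.
    public static byte[] toCp1250(String text) {
        return text.getBytes(CP1250);
    }

    // Hypothetical helper: format bytes as space-separated uppercase hex.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 because Java bytes are signed.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u0142a"; // "ała"

        byte[] cp1250 = toCp1250(text);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(toHex(cp1250)); // 61 B3 61
        System.out.println(toHex(utf8));   // 61 C5 82 61

        // Decoding with the same charset restores the original text.
        System.out.println(new String(cp1250, CP1250).equals(text)); // true
    }
}
```

Note that the byte array is the result you want; turning it back into a String and calling getBytes() without a charset would re-encode it with the platform default.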