Tags: java, string, encoding, character-encoding, windows-1251

String.getBytes("8bit encoding") returns 2 bytes per symbol


I'm trying to understand character encoding for Strings in Java. I'm working on Windows 10, where the default character encoding is windows-1251. That is an 8-bit character encoding, so it should use 1 byte per symbol. So when I call getBytes() on a String with 6 symbols, I expect an array of 6 bytes. But the following code snippet returns 12 instead of 6.

"Привет".getBytes("windows-1251").length // returns 12

At first, I thought that the first byte of each character must be zero. But both bytes belonging to each character have non-zero values. Could anyone explain what I'm missing here, please?

Here is an example of how I tested it:

import java.nio.charset.Charset;

public class Foo
{
    public static void main(String[] args) throws Exception
    {
        System.out.println(Charset.defaultCharset().displayName());
        String s = "Привет";
        System.out.println("bytes count in windows-1251: " + s.getBytes("windows-1251").length);
        printBytes(s.getBytes("windows-1251"), "windows-1251");
    }
    
    public static void printBytes(byte[] array, String name) {
        for (int k = 0; k < array.length; k++) {
            System.out.println(name + "[" + k + "] = " + "0x" +
                byteToHex(array[k]));
        }
    }

    // Returns the two-digit hex representation of byte b
    public static String byteToHex(byte b) {
        char[] hexDigit = {
            '0', '1', '2', '3', '4', '5', '6', '7',
            '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
        };
        char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
        return new String(array);
    }
}

The result is:

windows-1251
bytes count in windows-1251: 12
windows-1251[0] = 0xd0
windows-1251[1] = 0x9f
windows-1251[2] = 0xd1
windows-1251[3] = 0x80
windows-1251[4] = 0xd0
windows-1251[5] = 0xb8
windows-1251[6] = 0xd0
windows-1251[7] = 0xb2
windows-1251[8] = 0xd0
windows-1251[9] = 0xb5
windows-1251[10] = 0xd1
windows-1251[11] = 0x82

But what I expected is:

windows-1251
bytes count in windows-1251: 6
windows-1251[0] = 0xcf
windows-1251[1] = 0xf0
windows-1251[2] = 0xe8
windows-1251[3] = 0xe2
windows-1251[4] = 0xe5
windows-1251[5] = 0xf2

Solution

  • It looks like perhaps you had a UTF-8 encoded source file when you compiled?

    HexFormat.of().formatHex("Привет".getBytes("UTF-8"))
    ==> "d09fd180d0b8d0b2d0b5d182"
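
    If that is what happened, the 12 bytes can be reproduced without the compiler at all. Here is a minimal sketch of that round trip (the class name MojibakeDemo is only illustrative):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.util.HexFormat;

    public class MojibakeDemo {
        public static void main(String[] args) {
            Charset cp1251 = Charset.forName("windows-1251");

            // The UTF-8 source file stores the literal "Привет" as 12 bytes.
            byte[] utf8Bytes = "Привет".getBytes(StandardCharsets.UTF_8);

            // If the compiler decodes those bytes as windows-1251, each byte
            // becomes its own character: a 12-character mojibake string.
            String mojibake = new String(utf8Bytes, cp1251);

            // Re-encoding that string as windows-1251 reproduces the original
            // 12 bytes, which is exactly the output you are seeing.
            byte[] roundTrip = mojibake.getBytes(cp1251);

            System.out.println(mojibake.length());                   // 12
            System.out.println(HexFormat.of().formatHex(roundTrip)); // d09fd180d0b8d0b2d0b5d182
        }
    }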
    

    If I save your code in my UTF-8 editor and compile+run:

    java -Dfile.encoding=UTF-8 Foo.java
    UTF-8
    bytes count in windows-1251: 6
    windows-1251[0] = 0xcf
    windows-1251[1] = 0xf0
    windows-1251[2] = 0xe8
    windows-1251[3] = 0xe2
    windows-1251[4] = 0xe5
    windows-1251[5] = 0xf2
    

    Whereas this matches your output if I compile+run that UTF-8 file with your default encoding:

    java -Dfile.encoding=windows-1251 Foo.java
    windows-1251
    bytes count in windows-1251: 12
    windows-1251[0] = 0xd0
    windows-1251[1] = 0x9f
    windows-1251[2] = 0xd1
    windows-1251[3] = 0x80
    windows-1251[4] = 0xd0
    windows-1251[5] = 0xb8
    windows-1251[6] = 0xd0
    windows-1251[7] = 0xb2
    windows-1251[8] = 0xd0
    windows-1251[9] = 0xb5
    windows-1251[10] = 0xd1
    windows-1251[11] = 0x82
    

    If I change my editor charset to windows-1251 then the output is as expected:

    java -Dfile.encoding=windows-1251 Foo.java
    windows-1251
    bytes count in windows-1251: 6
    windows-1251[0] = 0xcf
    windows-1251[1] = 0xf0
    windows-1251[2] = 0xe8
    windows-1251[3] = 0xe2
    windows-1251[4] = 0xe5
    windows-1251[5] = 0xf2
    

    EDIT

    For simplicity above I've used the `java Foo.java` "compile and launch" mode, but for normal separate compilation the important steps are to match `javac` to the character encoding of the source files, and `java` to whatever encoding you want the app to use:

    javac -encoding TheJavaFilesCharSet {JavaFiles}
    java  -Dfile.encoding=AnyOrDefaultCharSet {ClassWithMain}
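
    If you are unsure which encodings the running JVM actually picked, you can print the relevant properties. This is just an illustrative check (native.encoding exists since JDK 17, and from JDK 18 onward the default charset is UTF-8 regardless of the OS locale):

    import java.nio.charset.Charset;

    public class CharsetCheck {
        public static void main(String[] args) {
            // Charset used when no explicit charset is passed to String/getBytes.
            System.out.println("defaultCharset : " + Charset.defaultCharset());
            // Value the JVM resolved for -Dfile.encoding (if it was set).
            System.out.println("file.encoding  : " + System.getProperty("file.encoding"));
            // Encoding of the host OS / console (JDK 17 and later).
            System.out.println("native.encoding: " + System.getProperty("native.encoding"));
        }
    }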
    

    As mentioned in the comments, it's worth using HexFormat; it is immutable and therefore safe to assign to a static field, used either directly or via a user-friendly debugging output method:

    private static final HexFormat HEX = HexFormat.ofDelimiter(", ").withPrefix("0x").withUpperCase();
    public static String formatHex(byte[] arr) {
        return "new byte[/*"+arr.length+"*/] {"+HEX.formatHex(arr)+"}";
    }
    
    HEX.formatHex(new byte[]{1,2,3});
    ==> "0x01, 0x02, 0x03"
    
    formatHex(new byte[]{1,2,3});
    ==> "new byte[/*3*/] {0x01, 0x02, 0x03}"
    

    The latter is helpful if you want to cut/paste definitions back into test cases.
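
    Going the other way, HexFormat.parseHex turns such hex text back into bytes. As a small illustrative snippet (the class name ParseBack is made up), this rebuilds the six windows-1251 bytes you expected and decodes them to the original text:

    import java.nio.charset.Charset;
    import java.util.HexFormat;

    public class ParseBack {
        public static void main(String[] args) {
            // The six windows-1251 bytes expected in the question, as plain hex.
            byte[] bytes = HexFormat.of().parseHex("cff0e8e2e5f2");
            // Decoding them with windows-1251 yields the original string.
            System.out.println(new String(bytes, Charset.forName("windows-1251"))); // Привет
        }
    }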