Tags: java, character-encoding, iso-8859-1, windows-1252

Character encoding confusion on Windows


I have a simple Java program that takes in hex and converts it to ASCII. Using Java 8, I compiled the following:

import java.nio.charset.Charset;
import java.util.Scanner;

public class Main 
{
    public static void main(String[] args) 
    {
        System.out.println("Charset: " + Charset.defaultCharset());
        Scanner in = new Scanner(System.in);
        System.out.print("Type a HEX string: ");
        String s = in.nextLine();
        String asciiStr = new String();
        
        //  Split the string into an array
        String[] hexes = s.split(":");
        
        //  For each hex
        for (String hex : hexes) {
            //  Translate the hex to ASCII
            System.out.print(" " + Integer.parseInt(hex, 16) + "|" + (char)Integer.parseInt(hex, 16));
            asciiStr += ((char) Integer.parseInt(hex, 16));
        }
        
        System.out.println("\nthe ASCII string is " + asciiStr);
        
        in.close();
    }
}

I am passing the hex string C0:A8:96:FE to the program. My main concern is the 0x96 value, because it is defined as a control character (characters in the range 128-159).

The output when I run the program without any JVM flags is the following:

Charset: windows-1252
Type a HEX string: C0:A8:96:FE
 192|À 168|¨ 150|? 254|þ
the ASCII string is À¨?þ

The output when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding is the following:

Charset: ISO-8859-1
Type a HEX string: C0:A8:96:FE
 192|À 168|¨ 150|– 254|þ
the ASCII string is À¨–þ

I'm wondering why, when the character encoding is set to ISO-8859-1, I get the extra Windows-1252 characters for the 128-159 range? These characters shouldn't be defined in ISO-8859-1 but should be defined in Windows-1252, yet it appears to be backwards here. In ISO-8859-1, I would think the 0x96 character is supposed to come out as a blank character, but that is not the case. Instead, the Windows-1252 encoding does this, when it should properly encode it as a – (an en dash). Any help here?


Solution

  • tl;dr

    My guess: While the default Charset of your JVM may be "windows-1252", your System.out is actually using Unicode.

    You said:

    when I use the JVM flag -Dfile.encoding=ISO-8859-1 to set the character encoding

    My experiments below lead me to suspect that whatever you were doing did not actually affect the character set used by System.out. I believe that in both your runs, when you thought your System.out was using "windows-1252" or "ISO-8859-1", your System.out was in fact using Unicode, likely UTF-8.

    I wish I knew how to get the Charset of System.out.
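
    For what it is worth, newer JDKs give a direct answer to that wish: Java 18 added a charset method to PrintStream, apparently as part of the JEP 400 work. A minimal sketch, assuming you are running Java 18 or later:

            // Java 18+ only; this method does not exist on earlier JDKs.
            System.out.println( "Charset of System.out: " + System.out.charset() );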

    This behavior might change in the future, with a proposal (JEP 400) to use UTF-8 by default across platforms.

    Details

    Actually, you are asking about Unicode rather than ASCII. ASCII has only 128 characters.

    You said:

    My main concern is the 0x96 value, because it is defined as a control character (characters in the range of 128 - 159).

    Actually, that range of control characters starts at 127 in Unicode (and ASCII), not 128. Code point 127 is the DELETE character. So 127-159 are control characters.
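
    You can check those ranges with the Character class itself. A quick sanity check (plain java.lang.Character calls, nothing specific to your program):

            // Code points 0-31 and 127-159 are the ISO control characters.
            System.out.println( Character.isISOControl( 0x1F ) );  // true  (a C0 control)
            System.out.println( Character.isISOControl( 0x7F ) );  // true  (DELETE)
            System.out.println( Character.isISOControl( 0x96 ) );  // true  (a C1 control, START OF GUARDED AREA)
            System.out.println( Character.isISOControl( 0xC0 ) );  // false (LATIN CAPITAL LETTER A WITH GRAVE)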

    First, let’s split your input string of hex codes.

            final List < String > hexInputs = List.of( "C0:A8:96:FE".split( ":" ) );
            System.out.println( "hexInputs = " + hexInputs );
    

    When run.

    hexInputs = [C0, A8, 96, FE]
    

    Now convert each hex string into an integer. We use that integer as a Unicode code point.

    Rather than rely on some default character encoding, let's explicitly set the Charset of our System.out. I'm no expert on this, but some web-searching found the code below where we wrap System.out in a new PrintStream while setting a Charset by its name. I could not find a way to get the Charset of a PrintStream, so I asked.

    UTF-8

            // UTF-8
            System.out.println( "----------|  UTF-8  |--------------------------" );
            try
            {
                PrintStream printStream = new PrintStream( System.out , true , StandardCharsets.UTF_8.name() ); // "UTF-8".
    
                for ( String hexInput : hexInputs )
                {
                    int codePoint = Integer.parseInt( hexInput , 16 );
                    String string = Character.toString( codePoint );
                    printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
                }
            }
            catch ( UnsupportedEncodingException e )
            {
                e.printStackTrace();
            }
    

    When run.

    ----------|  UTF-8  |--------------------------
    hexInput: C0 = codePoint: 192 = string: [À] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
    hexInput: A8 = codePoint: 168 = string: [¨] = isLetter: false = name: DIAERESIS
    hexInput: 96 = codePoint: 150 = string: [] = isLetter: false = name: START OF GUARDED AREA
    hexInput: FE = codePoint: 254 = string: [þ] = isLetter: true = name: LATIN SMALL LETTER THORN
    
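    By the way, the seemingly empty [] for code point 150 is expected: U+0096 is a C1 control character, so most consoles draw nothing visible for it. Here is a small sketch of the bytes that UTF-8 actually produces for these four code points (plain String#getBytes, not tied to the code above):

            // Each of these code points is above 127, so UTF-8 needs two bytes per character.
            for ( int codePoint : new int[] { 0xC0 , 0xA8 , 0x96 , 0xFE } )
            {
                byte[] bytes = Character.toString( codePoint ).getBytes( StandardCharsets.UTF_8 );
                StringBuilder hex = new StringBuilder();
                for ( byte b : bytes ) { hex.append( String.format( " %02X" , b & 0xFF ) ); }
                System.out.println( "Code point " + codePoint + " -> UTF-8 bytes:" + hex );
            }

    When run.

    Code point 192 -> UTF-8 bytes: C3 80
    Code point 168 -> UTF-8 bytes: C2 A8
    Code point 150 -> UTF-8 bytes: C2 96
    Code point 254 -> UTF-8 bytes: C3 BE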

    Windows-1252

    Next, we do the same, but setting "windows-1252" as the Charset of our wrapped System.out. Before doing the wrapping, we verify that such a character encoding is actually available on the current JVM.

            // windows-1252
            System.out.println( "----------|  windows-1252  |--------------------------" );
    
            // Verify windows-1252 charset is available on the current JVM.
            String windows1252CharSetName = "windows-1252";
            boolean isWindows1252CharsetAvailable = Charset.availableCharsets().keySet().contains( windows1252CharSetName );
            if ( isWindows1252CharsetAvailable )
            {
                System.out.println( "isWindows1252CharsetAvailable = " + isWindows1252CharsetAvailable );
            } else
            {
                System.out.println( "FAIL - No charset available for name: " + windows1252CharSetName );
            }
    
            try
            {
                PrintStream printStream = new PrintStream( System.out , true , windows1252CharSetName );
    
                for ( String hexInput : hexInputs )
                {
                    int codePoint = Integer.parseInt( hexInput , 16 );
                    String string = Character.toString( codePoint );
                    printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
                }
            }
            catch ( UnsupportedEncodingException e )
            {
                e.printStackTrace();
            }
    

    When run.

    ----------|  windows-1252  |--------------------------
    isWindows1252CharsetAvailable = true
    hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
    hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
    hexInput: 96 = codePoint: 150 = string: [?] = isLetter: false = name: START OF GUARDED AREA
    hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN
    
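    Those funky question marks are the classic symptom of a mismatch: the wrapped PrintStream turns À into the single windows-1252 byte 0xC0, but if the console on the receiving end decodes the stream as UTF-8, that lone byte is invalid and is displayed as U+FFFD REPLACEMENT CHARACTER. A rough simulation of that mismatch, assuming the console really is decoding UTF-8:

            // Encode with windows-1252, then (mis)interpret the resulting bytes as UTF-8.
            byte[] windows1252Bytes = "À".getBytes( Charset.forName( "windows-1252" ) );  // A single byte, 0xC0.
            String asSeenByUtf8Console = new String( windows1252Bytes , StandardCharsets.UTF_8 );
            System.out.println( asSeenByUtf8Console );  // U+FFFD, the funky question mark �.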

    Latin-1

    And we can try Latin-1 as well, producing yet a different result.

            // ISO-8859-1
            System.out.println( "----------|  Latin-1  |--------------------------" );
    
            // Verify that the Latin-1 charset is available on the current JVM.
            String latin1CharsetName = "ISO-8859-1"; // Also known as "Latin-1".
            boolean isLatin1CharsetNameAvailable = Charset.availableCharsets().keySet().contains( latin1CharsetName );
            if ( isLatin1CharsetNameAvailable )
            {
                System.out.println( "isLatin1CharsetNameAvailable = " + isLatin1CharsetNameAvailable );
            } else
            {
                System.out.println( "FAIL - No charset available for name: " + latin1CharsetName );
            }
    
            try
            {
                PrintStream printStream = new PrintStream( System.out , true , latin1CharsetName );
    
                for ( String hexInput : hexInputs )
                {
                    int codePoint = Integer.parseInt( hexInput , 16 );
                    String string = Character.toString( codePoint );
                    printStream.println( "hexInput: " + hexInput + " = codePoint: " + codePoint + " = string: [" + string + "] = isLetter: " + Character.isLetter( codePoint ) + " = name: " + Character.getName( codePoint ) );
                }
            }
            catch ( UnsupportedEncodingException e )
            {
                e.printStackTrace();
            }
    

    When run.

    ----------|  Latin-1  |--------------------------
    isLatin1CharsetNameAvailable = true
    hexInput: C0 = codePoint: 192 = string: [�] = isLetter: true = name: LATIN CAPITAL LETTER A WITH GRAVE
    hexInput: A8 = codePoint: 168 = string: [�] = isLetter: false = name: DIAERESIS
    hexInput: 96 = codePoint: 150 = string: [�] = isLetter: false = name: START OF GUARDED AREA
    hexInput: FE = codePoint: 254 = string: [�] = isLetter: true = name: LATIN SMALL LETTER THORN
    

    Conclusion

    So you can see that when hard-coding the Charset of our wrapped System.out, we do indeed see a difference. With UTF-8, we get actual characters [À], [¨], [], [þ] whereas with windows-1252 we get three funky question mark characters and one regular question mark, [�], [�], [?], [�]. Remember that we added the square brackets in our code.

    This behavior of my code matches my expectations, and apparently meets yours as well. Two of those four code points, 0xC0 and 0xFE, are letters in Unicode, as Character.isLetter reports, yet under the windows-1252 and Latin-1 wrapped streams they do not come out as readable letters at all. The only mysterious thing to me is that hex 96 (decimal 150) gets three different representations: nothing visible under UTF-8, a plain question mark under windows-1252, and a funky question mark under Latin-1.
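
    If it helps, one plausible explanation for that difference (consistent with the suspicion that the console is decoding UTF-8): Java's windows-1252 encoder has no byte for the control code point U+0096 (in that charset the byte 0x96 means EN DASH), so the PrintStream substitutes a plain question mark, whereas ISO-8859-1 does map U+0096 to the single byte 0x96, which is then invalid as UTF-8 and shows up as the funky replacement glyph. A quick check with String#getBytes (using java.util.Arrays only for display):

            // How each charset encodes the single code point U+0096.
            String s = Character.toString( 0x96 );
            System.out.println( Arrays.toString( s.getBytes( Charset.forName( "windows-1252" ) ) ) ); // [63]        -> the encoder substituted '?'.
            System.out.println( Arrays.toString( s.getBytes( StandardCharsets.ISO_8859_1 ) ) );       // [-106]      -> the lone byte 0x96.
            System.out.println( Arrays.toString( s.getBytes( StandardCharsets.UTF_8 ) ) );            // [-62, -106] -> the valid pair 0xC2 0x96.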

    Conclusion: Your System.out is not using the Charset that you think it is using. I suspect that while your JVM's default Charset may be named "windows-1252", your System.out is actually using the Unicode character set, likely with UTF-8 encoding.
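
    One way to test that suspicion is to hand the console raw bytes, bypassing any Charset conversion in Java, and see what it draws. A rough sketch (the byte pair 0xC3 0x80 is the UTF-8 encoding of À):

            // Write raw bytes straight to standard out, with no char-to-byte conversion involved.
            byte[] rawBytes = { (byte) 0xC3 , (byte) 0x80 };
            System.out.write( rawBytes , 0 , rawBytes.length );
            System.out.flush();
            System.out.println();
            // A console decoding UTF-8 shows À; one decoding windows-1252 shows two characters, Ã followed by €.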


    Note to the reader: If you are unfamiliar with character sets and character encoding, I recommend the fun, easy-to-read post, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).