java, character-encoding, windows-1252

How do I write the character at code point 0x80 to a file in Windows-1252?


I am trying to write bytes to a file in the windows-1252 charset. The example below, writing the raw bytes of a float to a file, is similar to what I'm doing in my actual program.

In the example given, I am writing the raw bytes of 1.0f to test.txt. As the bit pattern of 1.0f is 3f 80 00 00, I expect to get ?€(NUL)(NUL): according to the Windows-1252 Wikipedia article, 0x3f should correspond to '?', 0x80 should correspond to '€', and 0x00 is NUL. Everything goes fine until I actually try to write to the file; at that point, I get a java.nio.charset.UnmappableCharacterException on the console, and after the program stops on that exception the file contains only a single '?'. The full console output is shown below the code.

It looks like Java considers the codepoint 0x80 unmappable in the windows-1252 codepage. However, this doesn't seem right – all the codepoints should map to actual characters in that codepage. The problem is definitely with the codepoint 0x80, as if I try with 0.5f (3f 00 00 00) it is happy to write ?(NUL)(NUL)(NUL) into the file, and does not throw the exception. Experimenting with other codepages doesn't seem to work either; looking at key encodings supported by the Java language here, only the UTF series will not give me an exception, but due to their encoding they don't give me codepoint 0x80 in the actual file.

I'm going to try just using bytes instead so I don't have to worry about string encoding, but is anyone able to tell me why my code below gives me the exception it does?

Code:

import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharsetTest {
    public static void main(String[] args) {
        float max = 1.0f;
        System.out.println("Checking " + max);
        String stringFloatFormatHex = String.format("%08x", Float.floatToRawIntBits(max));
        System.out.println(stringFloatFormatHex);
        byte[] bytesForFile = javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);
        String stringForFile = new String(bytesForFile);
        System.out.println(stringForFile);

        String charset = "windows-1252";
        try {
            Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset));
            output.write(stringForFile);
            output.close();
        } catch (IOException e) {
            System.err.println(e.getMessage());
            e.printStackTrace();
        }
    }
}

Console output:

Checking 1.0
3f800000
?�  
Input length = 1
java.nio.charset.UnmappableCharacterException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:282)
    at sun.nio.cs.StreamEncoder.implWrite(StreamEncoder.java:285)
    at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:125)
    at java.io.OutputStreamWriter.write(OutputStreamWriter.java:207)
    at java.io.BufferedWriter.flushBuffer(BufferedWriter.java:129)
    at java.io.BufferedWriter.close(BufferedWriter.java:265)
    at CharsetTest.main(CharsetTest.java:21)

Solution

  • Edit: The problem is the statement String stringForFile = new String(bytesForFile);, just below the DatatypeConverter call. Because I was constructing a string without providing a charset, it used my default charset, which is UTF-8. A lone 0x80 byte is not a valid UTF-8 sequence, so the decoder replaced it with U+FFFD, the replacement character (the � in the first console output). Nothing is thrown at that point; the exception only appears when the writer tries to encode U+FFFD as windows-1252, which has no mapping for it. This doesn't happen in the refactored code further down because that version (following Johannes Kuhn's suggestion in the comments) passes a charset to the String constructor.
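To see the effect of the charset passed to the String constructor in isolation, here is a minimal sketch. The class name DecodeDemo and its decode helper are made up for illustration, and UTF-8 is passed explicitly to stand in for a UTF-8 platform default. It decodes the same four bytes both ways, prints the second character of each result, and asks a windows-1252 encoder whether it could write each string back out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Raw big-endian bytes of 1.0f (bit pattern 0x3f800000).
    static final byte[] BYTES = {0x3f, (byte) 0x80, 0x00, 0x00};
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Hypothetical helper: decodes BYTES with the given charset.
    // This constructor replaces malformed input with the charset's replacement string.
    static String decode(Charset cs) {
        return new String(BYTES, cs);
    }

    public static void main(String[] args) {
        String viaUtf8 = decode(StandardCharsets.UTF_8);
        String viaCp1252 = decode(CP1252);

        // Show the character that byte 0x80 turned into under each charset.
        System.out.printf("UTF-8 decode:        U+%04X%n", (int) viaUtf8.charAt(1));
        System.out.printf("windows-1252 decode: U+%04X%n", (int) viaCp1252.charAt(1));

        // Ask a windows-1252 encoder whether each string can be written back out.
        System.out.println("UTF-8 result encodable:        " + CP1252.newEncoder().canEncode(viaUtf8));
        System.out.println("windows-1252 result encodable: " + CP1252.newEncoder().canEncode(viaCp1252));
    }
}
```

The UTF-8 decode yields U+FFFD in the second position, which the windows-1252 encoder rejects, while the windows-1252 decode yields U+20AC (€), which round-trips.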

    Johannes Kuhn's suggestion about the String(byte[]) constructor gave me some good clues. I've ended up with the following code, which appears to work: it prints the € symbol to the console and writes it to test.txt. That confirms that byte 0x80 does map to a character in the windows-1252 codepage.

    My initial guess at why this code works when the other didn't was the conversion in javax.xml.bind.DatatypeConverter.parseHexBinary(stringFloatFormatHex);, since that looked like the main difference. As the edit above explains, the difference that actually matters is the charset passed to the String constructor.

    Anyway, the code below works (and I don't even have to turn it into a string; I can write the bytes to a file with FileOutputStream fos = new FileOutputStream("test.txt"); fos.write(bytes); fos.close();), so I'm happy with this one.

    Code:

    import java.io.IOException;
    import java.io.Writer;
    import java.nio.ByteBuffer;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    
    public class BytesCharsetTest {
        public static void main(String[] args) {
            float max = 1.0f;
            System.out.println("Checking " + max);
            int convInt = Float.floatToRawIntBits(max);
            byte[] bytes = ByteBuffer.allocate(4).putInt(convInt).array();
    
            String charset = "windows-1252";
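            // try-with-resources closes (and flushes) the writer even if write() throws.
            try (Writer output = Files.newBufferedWriter(Paths.get("test.txt"), Charset.forName(charset))) {
                // Decode with an explicit charset so byte 0x80 becomes '€' (U+20AC).
                String stringForFile = new String(bytes, Charset.forName(charset));
                System.out.println(stringForFile);

                output.write(stringForFile);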
            } catch (IOException e) {
                System.err.println(e.getMessage());
                e.printStackTrace();
            }
        }
    }
    

    Console output:

    Checking 1.0
    ?€  
    
    Process finished with exit code 0
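
The byte-only route mentioned above (skipping the String entirely) can be sketched as follows. This is only an illustration: the class name RawBytesDemo and the floatToBytes helper are invented here, and Files.write is used in place of FileOutputStream for brevity. Because no charset is involved, no encoding or decoding step can alter the bytes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class RawBytesDemo {
    // Hypothetical helper: returns the four big-endian bytes of the float's raw bit pattern.
    static byte[] floatToBytes(float value) {
        return ByteBuffer.allocate(4).putInt(Float.floatToRawIntBits(value)).array();
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get("test.txt");

        // Write the bytes as-is; no Writer, so no charset conversion happens.
        Files.write(path, floatToBytes(1.0f));

        // Read the file back and print each byte in hex to confirm what was stored.
        for (byte b : Files.readAllBytes(path)) {
            System.out.printf("%02x ", b);
        }
        System.out.println();
    }
}
```

Opening test.txt in an editor set to windows-1252 should then show the € for the 0x80 byte, since the interpretation happens only when the file is viewed.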