Java: Problems with ZipOutputStream and UTF-8 encoding


I am having trouble getting a ZipOutputStream to write XML files in UTF-8 under certain circumstances.

Here's the relevant code:

    public void saveQuest(File selectedFile) {
        // try-with-resources closes the stream (and finishes the archive) even if writing fails
        try (ZipOutputStream out = new ZipOutputStream(
                new FileOutputStream(selectedFile + ".zip"), StandardCharsets.UTF_8)) {

            out.putNextEntry(new ZipEntry(".data"));
            out.write(quest.getConfigAsBytes());
            for (String scene : quest.getSceneNames()) {
                out.putNextEntry(new ZipEntry(scene + ".xml"));
                out.write(quest.getSceneSource(scene).getBytes());
            }
            out.flush();

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

The code zips all the XML files plus a data file, and the contents come out UTF-8 encoded, but only as long as I run it from my Eclipse IDE. As soon as I package it as a runnable jar and run it outside of Eclipse, the files are no longer written in UTF-8.

Information that might be helpful:

The code is from an old Maven project, and I have set the compiler level to Java 1.8. As I have no real experience with Maven, I do not know whether something else is going wrong there.

This is my first Stack Overflow question and, as you can probably see, I'm not all that experienced. If I have forgotten to provide any essential information, please let me know.


Solution

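  • You're calling getBytes() without specifying the encoding, which is most likely the source of your problem. The StandardCharsets.UTF_8 you pass to the ZipOutputStream constructor only controls how entry names and comments are encoded; the entry contents are exactly the bytes you hand to write().

    Avoid the no-argument String.getBytes() (and new String(byte[]), for that matter): both silently use the platform default encoding, which on Java 8 is derived from the operating system locale and isn't guaranteed to be what you expect, as you've found out. Eclipse typically launches programs with its configured workspace encoding, which would explain why the output is only correct inside the IDE.

    Change quest.getSceneSource(scene).getBytes() to quest.getSceneSource(scene).getBytes(StandardCharsets.UTF_8), and fix quest.getConfigAsBytes() so that it returns UTF-8 bytes rather than bytes in whatever the default encoding happens to be.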
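
    To illustrate the point, here is a minimal, self-contained sketch that zips some text with an explicit charset and reads it back. The class name Utf8ZipSketch, its two helper methods and the sample entry are made up for the example and are not part of your project; it also writes to memory instead of a file to keep it short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not taken from the question's code base.
class Utf8ZipSketch {

    /** Zips name -> text pairs, encoding every entry body explicitly as UTF-8. */
    static byte[] zipAsUtf8(Map<String, String> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // The charset passed here only governs entry names and comments.
        try (ZipOutputStream out = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, String> file : files.entrySet()) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                // Explicit charset: independent of -Dfile.encoding and the OS locale.
                out.write(file.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    /** Reads one entry back and decodes it as UTF-8; returns null if it is missing. */
    static String readUtf8Entry(byte[] zip, String name) throws IOException {
        try (ZipInputStream in = new ZipInputStream(
                new ByteArrayInputStream(zip), StandardCharsets.UTF_8)) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                if (entry.getName().equals(name)) {
                    ByteArrayOutputStream body = new ByteArrayOutputStream();
                    byte[] chunk = new byte[4096];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        body.write(chunk, 0, n);
                    }
                    return new String(body.toByteArray(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> files = new LinkedHashMap<>();
        // Unicode escapes keep the source file itself encoding-neutral.
        String xml = "<scene title=\"\u00dcber das Schlo\u00df\"/>";
        files.put("scene1.xml", xml);
        byte[] zip = zipAsUtf8(files);
        // The round trip is lossless regardless of the platform default charset.
        System.out.println(xml.equals(readUtf8Entry(zip, "scene1.xml"))); // prints "true"
    }
}
```

    If you want to confirm the diagnosis, printing java.nio.charset.Charset.defaultCharset() once from Eclipse and once from the runnable jar should show two different values on your machine.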