java, utf-8, apache-commons, utf-16, document-conversion

How does Apache commons IO convert my XML header from UTF-8 to UTF-16?


I’m using Java 6. I have an XML template, which begins like so

<?xml version="1.0" encoding="UTF-8"?>

However, I notice when I parse and output it with the following code (using Apache Commons-io 2.4) …

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream("my-template.xml");

    try
    {
        byte[] data = org.apache.commons.io.IOUtils.toByteArray( in );
        InputSource src = new InputSource(new StringReader(new String(data)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(src);
    }
    finally
    {
        in.close();
    }

The first line is output as

<?xml version="1.0" encoding="UTF-16"?>

What do I need to do when parsing/outputting the file so that the header encoding will remain “UTF-8”?

Edit: Per the suggestion given, I changed my code to

    Document doc = null;
    InputStream in = this.getClass().getClassLoader().getResourceAsStream(name);

    try
    {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        doc = builder.parse(in);
    }
    finally
    {
        in.close();
    }

But even though the first line of my input template file is

<?xml version="1.0" encoding="UTF-8"?>

when I output the document as a String, it produces

<?xml version="1.0" encoding="UTF-16"?>

as a first line. Here's what I use to output the "doc" object as a string ...

private String getDocumentString(Document doc)
{
    DOMImplementationLS domImplementation = (DOMImplementationLS)doc.getImplementation();
    LSSerializer lsSerializer = domImplementation.createLSSerializer();
    return lsSerializer.writeToString(doc);  
}

Solution

  • Turns out that when I changed my Document -> String method to

    private String getDocumentString(Document doc)
    {
        String ret = null;
        DOMSource domSource = new DOMSource(doc);
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer transformer;
        try
        {
            transformer = tf.newTransformer();
            transformer.transform(domSource, result);
            ret = writer.toString();
        }
        catch (TransformerConfigurationException e)
        {
            e.printStackTrace();
        }
        catch (TransformerException e)
        {
            e.printStackTrace();
        }
        return ret;
    }
    

    the header kept encoding="UTF-8" and was no longer output as encoding="UTF-16".
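
For what it's worth, the UTF-16 declaration most likely comes from `LSSerializer.writeToString` rather than from Commons IO or the parser: the DOM Load and Save specification defines `writeToString` as producing output in the encoding of the string type, which is UTF-16. If you would rather keep using `LSSerializer`, a minimal sketch is to serialize through an `LSOutput` with the encoding set explicitly. The class name `LsOutputSketch` and the `roundTrip` helper are made up for illustration, and this assumes the JDK's default DOM implementation, which is what the cast in the question already relies on.

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical class name, used only for this sketch.
class LsOutputSketch {

    // Same job as getDocumentString in the question, but the encoding
    // written into the XML declaration is requested explicitly.
    static String getDocumentString(Document doc) {
        DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
        LSSerializer lsSerializer = domImplementation.createLSSerializer();

        LSOutput lsOutput = domImplementation.createLSOutput();
        lsOutput.setEncoding("UTF-8");          // encoding to declare in the header

        StringWriter writer = new StringWriter();
        lsOutput.setCharacterStream(writer);    // collect the result as characters

        lsSerializer.write(doc, lsOutput);      // instead of writeToString(doc)
        return writer.toString();
    }

    // Hypothetical helper: parse an XML string and serialize it again.
    static String roundTrip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        return getDocumentString(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<?xml version=\"1.0\" encoding=\"UTF-8\"?><root/>"));
    }
}
```

Note that the declaration only describes how the text is supposed to be encoded. If the returned String is later written to a file or socket, the bytes still need to be produced with UTF-8 for the header to be truthful.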