java xml utf-16 utf-16le

Remove illegal xml characters from UTF-16LE encoded file


I have a Java application that parses an XML file encoded in UTF-16LE. Parsing fails because the file contains illegal XML characters. My solution is to read the file into a Java String, remove the illegal characters, and then parse the result. This works 99% of the time, but there are slight differences between the input and the output of this process. I think they are caused not by removing the illegal characters but by the conversion from UTF-16LE to Java's internal UTF-16 String.

    BufferedReader reader = null;
    String fileText = ""; //stored as UTF-16
    try {
        reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
        for (String line; (line = reader.readLine()) != null; ) {
            fileText += line;
        }
    } catch (Exception ex) {
        logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
    } finally {
        if (reader != null) {
            reader.close();
        }
    }

    //code to remove illegal chars from string here, irrelevant to problem

    ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
    Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);

Do characters get changed or lost when going from UTF-16LE to UTF-16? Is there a way to do this in Java that ensures the output is exactly the same as the input?


Solution

  • Certainly one problem is that readLine throws away the line ending.

    You would need to do something like:

           fileText += line + "\r\n";
    

    Otherwise XML attributes, DTD entities, or other tokens could get glued together where at least one whitespace character is required. You also do not want text content that contains a line break to be altered.

    Performance (speed and memory) can be improved by using a StringBuilder instead of repeated String concatenation:

           StringBuilder fileText = new StringBuilder();
           ... fileText.append(line).append("\r\n");
           ... fileText.toString();
    

    Then there might be a problem with the first character of the file, which is sometimes a redundant BOM character (U+FEFF). The UTF-16LE decoder does not strip it, so remove it from the first line yourself:

    line = line.replace("\uFEFF", "");
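
As an alternative to readLine, the whole stream can be read in fixed-size chunks, so the original line endings are never touched and nothing has to be re-appended. The following is a minimal sketch under the assumption that the input really is UTF-16LE. The class name Utf16LeText and the method readAll are hypothetical names chosen for illustration; they are not part of the question's code. Only a single leading BOM is dropped, and any U+FEFF later in the text is left alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class Utf16LeText {

    /** Reads the whole stream as UTF-16LE, keeping every line ending as-is. */
    public static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE)) {
            char[] buffer = new char[8192];
            int n;
            // read(char[]) returns the number of chars read, or -1 at end of stream
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        // Drop a single leading BOM only; the UTF-16LE decoder does not strip it.
        if (text.length() > 0 && text.charAt(0) == '\uFEFF') {
            text.deleteCharAt(0);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input with a BOM and mixed line endings (illustrative data only)
        String xml = "\uFEFF<a>\r\n<b/>\n</a>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_16LE);
        String text = readAll(new ByteArrayInputStream(bytes));
        System.out.println(text.equals("<a>\r\n<b/>\n</a>")); // prints true
    }
}
```

The returned String can then be cleaned of illegal characters and passed on with getBytes("UTF-16LE") as in the question.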