java arrays unicode utf-8 apache-commons

StringEscapeUtils not handling utf-8


I have a string like this

String incoming = "<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";

and I'm escaping it using the StringEscapeUtils

import org.apache.commons.text.StringEscapeUtils;
String escaped = StringEscapeUtils.escapeJava(incoming);

The result is

<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>

I've tried converting it to UTF-8 by getting the bytes, but that doesn't work. Is there any way to fix this?

Here's what I tried:

String s = new String(escaped.getBytes("UTF-8"), "UTF-8");

I've also tried a different library to escape the text, but it still doesn't work.


Solution

  • I'm assuming that you want the characters such as single quote, double quote and backslash in your input String to be escaped, but you want the Greek characters to remain unchanged.

    Unfortunately, StringEscapeUtils.escapeJava() translates every character with a Unicode value above 0x7f into its Unicode escape. For example, your sample data shows that the Greek letter tau (τ) is escaped to \u03C4 in the String returned by StringEscapeUtils.escapeJava(). I don't know why escapeJava() does this. Its Javadoc states "Escapes the characters in a String using Java String rules.", but I couldn't find a formal definition of "Java String rules".
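
    To make that rule concrete, here is a minimal plain-Java sketch that mimics only the behaviour described above: every char above 0x7f becomes a backslash-u escape with four uppercase hex digits. The class and method names are hypothetical, this is not the library's implementation, and it deliberately ignores the quote, backslash and control-character escaping that escapeJava() also performs.

```java
final class NonAsciiEscapeSketch {

    // Hypothetical helper (not part of Commons Text): rewrites each char above 0x7f
    // as a backslash, the letter u, and four uppercase hex digits. All other chars
    // are copied unchanged.
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c > 0x7f) {
                out.append(String.format("\\u%04X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "M" followed by the Greek small letter epsilon, as in the sample data.
        String sample = "M\u03B5";
        // Prints M, then a backslash, then u03B5.
        System.out.println(escapeNonAscii(sample));
    }
}
```

    Seen this way, the escaped output is still an ordinary String made of ASCII characters, which is why re-encoding it with getBytes("UTF-8") changes nothing: there are no non-ASCII characters left to encode.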

    A simple way to remove the Unicode escapes from the string returned by StringEscapeUtils.escapeJava() is to call the translate() method of the UnicodeUnescaper class. Its Javadoc describes the class as follows:

    "Translates escaped Unicode values of the form \u+\d\d\d\d back to Unicode. It supports multiple 'u' characters and will work with or without the +."

    So calling UnicodeUnescaper.translate() on the escaped string returns a String in which the quotes remain escaped, but the Unicode escapes are converted back to the Greek characters.

    The code is straightforward. Using your data:

    import org.apache.commons.text.StringEscapeUtils;
    import org.apache.commons.text.translate.UnicodeUnescaper;
    
    void convert() {
        String incoming = "<html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";
        String escaped = StringEscapeUtils.escapeJava(incoming); 
        String greekChars = new UnicodeUnescaper().translate(escaped);
    
        System.out.println("incoming:   " + incoming); 
        System.out.println("escaped:    " + escaped);    // Quotes are escaped, and Greek characters are converted to Unicode escapes.
        System.out.println("greekChars: " + greekChars); // Quotes remain escaped, but Unicode escapes are converted back to Greek characters.
    }
    

    This is the output from the println() calls:

    run:
    incoming:   <html> <head></head> <body>  <p><span style="font-family: Arial;">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
    escaped:    <html> <head></head> <body>  <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>
    greekChars: <html> <head></head> <body>  <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
    BUILD SUCCESSFUL (total time: 0 seconds)
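
    If you would rather not depend on UnicodeUnescaper for this step, the same idea can be sketched with java.util.regex alone. This is only an approximation of the form quoted from the Javadoc (a backslash, one or more 'u' characters, an optional '+', four hex digits). The class and method names are hypothetical, and it assumes Java 9 or later for the StringBuilder overloads of Matcher.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescapeSketch {

    // One backslash, one or more 'u', an optional '+', then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+\\+?([0-9A-Fa-f]{4})");

    // Hypothetical helper: replaces each matched escape with the char it encodes
    // and leaves every other character, including escaped quotes, untouched.
    static String unescapeUnicode(String escaped) {
        Matcher m = UNICODE_ESCAPE.matcher(escaped);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or backslash in the decoded char.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Escaped quotes plus one escaped Greek letter (iota).
        String escaped = "style=\\\"x\\\" T\\u03B9";
        System.out.println(unescapeUnicode(escaped));
    }
}
```

    One caveat of this sketch: it does not check whether the backslash in front of the 'u' is itself escaped, so input that originally contained a literal backslash followed by "u" and four hex digits would be decoded as well.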
    

    Notes: