I have a string like this
String incoming = "<html> <head></head> <body> <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";
and I'm escaping it using the StringEscapeUtils
import org.apache.commons.text.StringEscapeUtils;
String escaped = StringEscapeUtils.escapeJava(incoming);
The result is
<html> <head></head> <body> <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>
I've tried converting it to UTF-8 by getting the bytes, but that doesn't work. Is there any way I could get it fixed?
Here's what I tried:
String s = new String(escaped.getBytes("UTF-8"), "UTF-8");
I've also tried a different library to escape the text, but it still doesn't work.
I'm assuming that you want the characters such as single quote, double quote and backslash in your input String
to be escaped, but you want the Greek characters to remain unchanged.
Unfortunately, StringEscapeUtils.escapeJava() will translate any text characters with a Unicode value > 0x7f to their Unicode escape equivalents. For example, your sample data shows that the Greek letter tau (τ) is escaped to \u03C4 in the String returned by StringEscapeUtils.escapeJava(). I don't know why escapeJava() does this. Its Javadoc states "Escapes the characters in a String using Java String rules." but I couldn't find a formal definition of "Java String rules".
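This also explains why re-encoding the escaped string as UTF-8 has no effect. After escaping, \u03C4 is no longer one Greek letter but six ordinary ASCII characters, and a UTF-8 round trip leaves ASCII untouched. A minimal sketch using only the JDK (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

class EscapeRoundTripDemo {
    // Encodes the string to UTF-8 bytes and decodes those bytes again.
    static String utf8RoundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Six literal characters: backslash, u, 0, 3, C, 4.
        String escapedTau = "\\u03C4";
        // The compiler turns this escape into the single letter tau.
        String realTau = "\u03C4";

        System.out.println(escapedTau.length());                          // 6
        System.out.println(realTau.length());                             // 1
        System.out.println(utf8RoundTrip(escapedTau).equals(escapedTau)); // true
    }
}
```

The round trip returns exactly the same six characters, so the escapes have to be parsed and converted back rather than re-encoded.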
A simple way to remove the Unicode escapes in the string returned by StringEscapeUtils.escapeJava() is to call the translate() method of the UnicodeUnescaper class:
Translates escaped Unicode values of the form \u+\d\d\d\d back to Unicode. It supports multiple 'u' characters and will work with or without the +.
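So calling UnicodeUnescaper.translate() will return a String in which the escaped quotes remain escaped, but Unicode escapes are converted back to the characters they represent. For example, \u03C4 will be changed to τ.
The code is straightforward. Using your data: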
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.UnicodeUnescaper;

void convert() {
    String incoming = "<html> <head></head> <body> <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>";
    String escaped = StringEscapeUtils.escapeJava(incoming);
    String greekChars = new UnicodeUnescaper().translate(escaped);
    System.out.println("incoming: " + incoming);
    System.out.println("escaped: " + escaped); // Quotes are escaped, and Greek characters are converted to Unicode escapes.
    System.out.println("greekChars: " + greekChars); // Quotes remain escaped, but Unicode escapes are converted back to Greek characters.
}
This is the output from the println() calls:
run:
incoming: <html> <head></head> <body> <p><span style="font-family: Arial;">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
escaped: <html> <head></head> <body> <p><span style=\"font-family: Arial;\">\u0395\u03C5\u03C7\u03B1\u03C1\u03B9\u03C3\u03C4\u03CE (eff-kha-ri-STOE) T\u03B9 \u03BA\u03B1\u03BD\u03B5\u03AF\u03C2 (tee-KAH-nis)? M\u03B5 \u03C3\u03C5\u03B3\u03C7\u03C9\u03C1\u03B5\u03AF\u03C4\u03B5.</span></p> </body></html>
greekChars: <html> <head></head> <body> <p><span style=\"font-family: Arial;\">Ευχαριστώ (eff-kha-ri-STOE) Tι κανείς (tee-KAH-nis)? Mε συγχωρείτε.</span></p> </body></html>
BUILD SUCCESSFUL (total time: 0 seconds)
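If it helps to see what the unescaping step amounts to, here is a rough JDK-only sketch of the same idea using a regular expression. It is not the library's implementation, just an illustration of the pattern described in the Javadoc (a backslash, one or more 'u' characters, an optional '+', then four hex digits). The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleUnicodeUnescape {
    // A backslash, one or more 'u', an optional '+', then four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+\\+?([0-9a-fA-F]{4})");

    // Replaces every matching escape with the character it encodes.
    // Anything else, including escaped quotes, is copied through unchanged.
    static String unescape(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against decoded '$' or backslash characters.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "<span style=\\\"font-family: Arial;\\\">T\\u03B9</span>";
        System.out.println(unescape(escaped));
    }
}
```

In this sketch the escaped quotes around the style attribute are left as they are, and only the Unicode escape is turned back into a Greek letter. For real use I would still prefer UnicodeUnescaper, since it is already written and tested.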
Notes:
The code uses the package org.apache.commons.text.translate for UnicodeUnescaper. Older deprecated versions exist in org.apache.commons.lang3.text.translate. Apache Commons Text, currently at version 1.8, is available from its download page.
The code uses UnicodeUnescaper.translate() to fix the mess created by StringEscapeUtils.escapeJava(). There may be other approaches that are cleaner (by using an alternative to StringEscapeUtils.escapeJava()), but this way seems to work fine for your data.
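As one possible alternative that avoids the escape-then-unescape round trip, you could write a small escaper yourself. The sketch below is an assumption about what you want, not a replacement for escapeJava(): it escapes double quotes, backslashes and control characters, and copies everything else (including Greek letters) through unchanged. Single quotes are left alone in this sketch. The class and method names are invented for the example.

```java
class MinimalJavaEscaper {
    // Escapes double quotes, backslashes and control characters only.
    // Characters above 0x7f are appended unchanged.
    static String escapeKeepingUnicode(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                case '\b': out.append("\\b");  break;
                case '\f': out.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become four-digit hex escapes.
                        out.append(String.format("\\u%04X", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String incoming = "<span style=\"font-family: Arial;\">T\u03B9</span>";
        System.out.println(escapeKeepingUnicode(incoming));
    }
}
```

Whether this is actually cleaner depends on how closely the output has to match what escapeJava() produces for other input, so treat it as a starting point.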