java  html-escape-characters  accent-sensitive

Java StringEscapeUtils.escapeHtml4 as regular text


My goal is to keep the special (accented) letters of a message as regular text after running it through StringEscapeUtils.escapeHtml4. Text example:

<html>
<body>
<p>éô</p>
</body>
</html>

I want all the HTML tags to be escaped, but not the text itself, which here is: éô

Code example:

String original = "<html><head><\\head><body>éô";
System.out.println("original: " + original);

String translated = StringEscapeUtils.escapeHtml4(original);
System.out.println("translated: " + translated);

Output:

original: <html><head><\head><body>éô
translated: &lt;html&gt;&lt;head&gt;&lt;\head&gt;&lt;body&gt;&eacute;&ocirc;

I expect to get: &lt;html&gt;&lt;head&gt;&lt;\head&gt;&lt;body&gt;éô


Solution

  • I think I found the solution, which is described here: Escape HTML in Languages with Accented Letters

    The idea is to create a custom escaping translator that uses only two lookup translators:

    public static final CharSequenceTranslator ESCAPE_HTML4_CUSTOM =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_ESCAPE()),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE())
            );
    

    The original StringEscapeUtils.ESCAPE_HTML4 uses three lookup translators. The extra one, ISO8859_1_ESCAPE, is what turns é and ô into &eacute; and &ocirc;:

    public static final CharSequenceTranslator ESCAPE_HTML4 =
            new AggregateTranslator(
                    new LookupTranslator(EntityArrays.BASIC_ESCAPE()),
                    new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE()),
                    new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE())
            );
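
To sanity-check the expected behavior without the Commons Lang dependency, here is a minimal sketch that escapes only the four characters I understand BASIC_ESCAPE to cover (", &, <, >) and leaves accented letters alone. The class BasicHtmlEscaper and the method escapeBasic are hypothetical names of mine, not part of the library, and this does not reproduce what HTML40_EXTENDED_ESCAPE adds, so treat it as an illustration rather than a replacement for the translator above.

```java
public class BasicHtmlEscaper {

    // Hypothetical helper (not a Commons Lang API): escapes only the four
    // basic markup characters and copies every other character, including
    // accented letters, through unchanged.
    public static String escapeBasic(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '"': out.append("&quot;"); break;
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e9\u00f4 are the e-acute and o-circumflex from the question,
        // written as Unicode escapes so the source file encoding does not matter.
        String original = "<html><head><\\head><body>\u00e9\u00f4";
        System.out.println("original: " + original);
        System.out.println("translated: " + escapeBasic(original));
    }
}
```

With the custom translator from the answer, the equivalent call would be ESCAPE_HTML4_CUSTOM.translate(original), which should give the same result for this input while also handling the extended HTML 4.0 entities.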