java dom encoding xhtml ampersand

Prevent re-encoding ampersands using Node's setTextContent method


Background

I am converting straight quotes into curled quotes and apostrophes within an XHTML document. Given a document with straight quotes (" and '), some pre-processing converts them to their curled, semantic equivalents (&ldquo;, &rdquo;, &lsquo;, &rsquo;, and &apos;). Typically, the ’ character is used for both closing single quotes (&rsquo;) and apostrophes (&apos;), but this loses the semantic distinction, which I'd like to keep by using the entity instead, for subsequent translation to TeX (e.g., \quote{outer \quote{we’re inside quotes} outer}). Thus:

Markdown -> XHTML (straight) -> XHTML (curled) -> TeX

The code uses Java's built-in document object model (DOM) classes.

Problem

Calling Node's setTextContent method double-encodes every ampersand, resulting in:

&amp;ldquo;I reckon, I&amp;apos;m &amp;apos;bout dat.&amp;rdquo;
&amp;ldquo;Elizabeth Davenport;&amp;rdquo; she said &amp;lsquo;Elizabeth&amp;rsquo; to be dignified, &amp;ldquo;and really my father owns the place.&amp;rdquo;

Rather than:

&ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;
&ldquo;Elizabeth Davenport;&rdquo; she said &lsquo;Elizabeth&rsquo; to be dignified, &ldquo;and really my father owns the place.&rdquo;

Disabling and enabling by setting the processing instruction didn't seem to work.
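
The double-encoding can be reproduced in isolation. The following is a minimal sketch, assuming the document is serialized with the JDK's default identity Transformer; the class name DoubleEncodingDemo and the render method are hypothetical and not part of the original code:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical demo class: shows how setTextContent treats "&" as literal text.
final class DoubleEncodingDemo {
  /** Wraps the text in a <p> element and returns the serialized document. */
  static String render( final String text ) {
    try {
      final var doc = DocumentBuilderFactory
        .newInstance().newDocumentBuilder().newDocument();
      final var p = doc.createElement( "p" );
      doc.appendChild( p );

      // The argument is plain character data, not markup, so "&ldquo;"
      // is stored as eight literal characters beginning with "&".
      p.setTextContent( text );

      final var transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );

      final var writer = new StringWriter();
      transformer.transform( new DOMSource( doc ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    // The serializer escapes the literal "&", so the output contains
    // "&amp;ldquo;" instead of "&ldquo;".
    System.out.println( render( "&ldquo;I reckon&rdquo;" ) );
  }
}
```

The escaping happens at serialization time: the DOM holds the ampersand as an ordinary character, and the serializer must write it as &amp; to keep the output well-formed.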

Code

Here's the code to walk a tree:

  public static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) {
    assert document != null;
    assert consumer != null;

    try {
      final var expr = lookupXPathExpression( xpath );
      final var nodes = (NodeList) expr.evaluate( document, NODESET );

      if( nodes != null ) {
        for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
          consumer.accept( nodes.item( i ) );
        }
      }
    } catch( final Exception ex ) {
      clue( ex );
    }
  }
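
The walk method relies on helpers that the question does not show. As a rough sketch only, lookupXPathExpression and NODESET might look something like the following; the caching strategy, the class name XPathHelpers, and the choice to rethrow as an unchecked exception are all assumptions, not the original implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

// Hypothetical helper class; the original helpers are not shown in the question.
final class XPathHelpers {
  /** Assumed to be a static import of XPathConstants.NODESET in the original. */
  static final QName NODESET = XPathConstants.NODESET;

  /** Compiled expressions keyed by their XPath source text. */
  private static final Map<String, XPathExpression> CACHE =
    new ConcurrentHashMap<>();

  /**
   * Returns a compiled expression, compiling it on first use. Note that
   * XPathExpression instances are not thread-safe, so a shared cache like
   * this one is only appropriate for single-threaded use.
   */
  static XPathExpression lookupXPathExpression( final String xpath ) {
    return CACHE.computeIfAbsent( xpath, key -> {
      try {
        return XPathFactory.newInstance().newXPath().compile( key );
      } catch( final XPathExpressionException ex ) {
        throw new IllegalArgumentException( "Invalid XPath: " + key, ex );
      }
    } );
  }
}
```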

Here's the code that replaces the quotes with curled equivalents:

walk(
  xhtml,
  "//*[normalize-space( text() ) != '']",
  node -> node.setTextContent( sConverter.apply( node.getTextContent() ) )
);

Where xhtml is the Document and sConverter curls quotes.

Question

How would you instruct the DOM to accept &apos; and friends without re-encoding the ampersand?

Solution

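  • Change the pre-processing to replace straight quotes with Unicode characters, not with entity references. Named entities such as &ldquo; are defined by HTML; they are not valid in XML unless a DTD declares them. Because setTextContent treats its argument as literal text, any & in that text is escaped as &amp; when the document is serialized.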
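
One way to apply this without rewriting the converter is to translate its entity names into characters just before calling setTextContent. The sketch below is illustrative only: the class UnicodeQuotes, its methods, and the entity-to-character table are hypothetical. In particular, mapping &apos; to the plain ' character (so that apostrophes stay distinct from the closing quote ’ for the later TeX step) is an assumption about what the pipeline needs, not something the answer specifies.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

// Hypothetical post-processing step: entity names in, Unicode characters out.
final class UnicodeQuotes {
  /** Assumed mapping; the apostrophe deliberately stays a straight quote. */
  private static final Map<String, String> ENTITIES = Map.of(
    "&ldquo;", "\u201C",  // left double quotation mark
    "&rdquo;", "\u201D",  // right double quotation mark
    "&lsquo;", "\u2018",  // left single quotation mark
    "&rsquo;", "\u2019",  // right single quotation mark
    "&apos;", "'"         // apostrophe, kept distinct from U+2019
  );

  /** Replaces each known entity name with its character. */
  static String toUnicode( final String text ) {
    var result = text;
    for( final var entry : ENTITIES.entrySet() ) {
      result = result.replace( entry.getKey(), entry.getValue() );
    }
    return result;
  }

  /** Sets the converted text on a <p> element and serializes it as UTF-8. */
  static String render( final String text ) {
    try {
      final var doc = DocumentBuilderFactory
        .newInstance().newDocumentBuilder().newDocument();
      final var p = doc.createElement( "p" );
      doc.appendChild( p );
      p.setTextContent( toUnicode( text ) );

      final var transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
      transformer.setOutputProperty( OutputKeys.ENCODING, "UTF-8" );

      final var writer = new StringWriter();
      transformer.transform( new DOMSource( doc ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }
}
```

Since no ampersands remain in the text content, the serializer has nothing to escape. In the question's code, the lambda passed to walk would then call something like node.setTextContent( toUnicode( sConverter.apply( node.getTextContent() ) ) ), or sConverter itself could emit the characters directly.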