I'm converting straight quotes into curled quotes and apostrophes within an XHTML document. Given a document with straight quotes (" and '), some pre-processing converts the straight quotes to their curled, semantic equivalents (&ldquo;, &rdquo;, &lsquo;, &rsquo;, and &apos;). Typically, the curled character ’ is used for both closing single quotes (&rsquo;) and apostrophes (&apos;), but this loses the semantic distinction, which I'd like to keep by using the entity instead, for subsequent translation to TeX (e.g., \quote{outer \quote{we’re inside quotes} outer}). Thus:
Markdown -> XHTML (straight) -> XHTML (curled) -> TeX
The code uses Java's built-in document object model (DOM) classes. Calling Node's setTextContent method double-encodes any ampersand, resulting in:
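
    &amp;ldquo;I reckon, I&amp;apos;m &amp;apos;bout dat.&amp;rdquo;
    &amp;ldquo;Elizabeth Davenport;&amp;rdquo; she said &amp;lsquo;Elizabeth&amp;rsquo; to be dignified, &amp;ldquo;and really my father owns the place.&amp;rdquo;

Rather than:

    &ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;
    &ldquo;Elizabeth Davenport;&rdquo; she said &lsquo;Elizabeth&rsquo; to be dignified, &ldquo;and really my father owns the place.&rdquo;
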
Disabling and re-enabling output escaping by setting the processing instruction didn't seem to work.
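The double-encoding can be reproduced in isolation. The following is a minimal sketch, not the code from the question: the class name DoubleEncodingDemo and its helper methods are hypothetical, and it assumes the document is serialized with the JDK's default Transformer. It shows that text passed to setTextContent is treated as literal characters, so the serializer escapes the ampersand of an entity-like string, whereas a Unicode character has no ampersand to escape.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

// Hypothetical demo class; not part of the question's code base.
final class DoubleEncodingDemo {
  /** Serializes a document to a string, omitting the XML declaration. */
  static String serialize( final Document doc ) {
    try {
      final var transformer = TransformerFactory.newInstance().newTransformer();
      transformer.setOutputProperty( OutputKeys.OMIT_XML_DECLARATION, "yes" );
      final var writer = new StringWriter();
      transformer.transform( new DOMSource( doc ), new StreamResult( writer ) );
      return writer.toString();
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  /** Builds a document with a single p element holding the given text. */
  static String render( final String text ) {
    try {
      final var doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .newDocument();
      final var p = doc.createElement( "p" );
      // The DOM stores these characters literally; nothing is parsed.
      p.setTextContent( text );
      doc.appendChild( p );
      return serialize( doc );
    } catch( final Exception ex ) {
      throw new IllegalStateException( ex );
    }
  }

  public static void main( final String[] args ) {
    // The ampersand is escaped on output, giving &amp;ldquo; and &amp;rdquo;.
    System.out.println( render( "&ldquo;I reckon&rdquo;" ) );
    // Unicode quotes contain no ampersand, so nothing is double-encoded.
    System.out.println( render( "\u201CI reckon\u201D" ) );
  }
}
```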
Here's the code to walk a tree:

    public static void walk(
      final Document document, final String xpath,
      final Consumer<Node> consumer ) {
      assert document != null;
      assert consumer != null;

      try {
        final var expr = lookupXPathExpression( xpath );
        final var nodes = (NodeList) expr.evaluate( document, NODESET );

        if( nodes != null ) {
          for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
            consumer.accept( nodes.item( i ) );
          }
        }
      } catch( final Exception ex ) {
        clue( ex );
      }
    }

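For anyone wanting to run the walk on its own, here is a self-contained sketch. It rests on assumptions: lookupXPathExpression is taken to compile (and perhaps cache) the expression, so a plain XPathFactory call stands in for it; NODESET is taken to be XPathConstants.NODESET; and clue is replaced by rethrowing. The class WalkSketch and the parse and names helpers are hypothetical and exist only for the demonstration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-alone version of the question's walk method.
final class WalkSketch {
  /** Parses an XML string into a DOM document. */
  static Document parse( final String xml ) {
    try {
      return DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse( new InputSource( new StringReader( xml ) ) );
    } catch( final Exception ex ) {
      throw new IllegalArgumentException( ex );
    }
  }

  /** Passes every node matching the XPath expression to the consumer. */
  static void walk(
    final Document document, final String xpath,
    final Consumer<Node> consumer ) {
    try {
      // Assumption: stands in for the question's lookupXPathExpression.
      final var expr = XPathFactory.newInstance().newXPath().compile( xpath );
      final var nodes =
        (NodeList) expr.evaluate( document, XPathConstants.NODESET );

      for( int i = 0, len = nodes.getLength(); i < len; i++ ) {
        consumer.accept( nodes.item( i ) );
      }
    } catch( final Exception ex ) {
      // Assumption: the question's clue( ex ) reports the error instead.
      throw new IllegalStateException( ex );
    }
  }

  /** Collects the names of all elements matched by the expression. */
  static List<String> names( final String xml, final String xpath ) {
    final var result = new ArrayList<String>();
    walk( parse( xml ), xpath, node -> result.add( node.getNodeName() ) );
    return result;
  }

  public static void main( final String[] args ) {
    // Only the first p has non-blank text, so only it is visited.
    System.out.println( names(
      "<body><p>\"Hi,\" she said.</p><p> </p></body>",
      "//*[normalize-space( text() ) != '']" ) );
  }
}
```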
Here's the code that replaces the quotes with curled equivalents:

    walk(
      xhtml,
      "//*[normalize-space( text() ) != '']",
      node -> node.setTextContent( sConverter.apply( node.getTextContent() ) )
    );

Where xhtml is the Document and sConverter curls quotes.
How would you instruct the DOM to accept &apos; and friends without re-encoding the ampersand?
Change the pre-processing to replace straight quotes with Unicode characters, not with invalid XML entities. Those entities are defined by HTML and are not valid XML.
&ldquo; should be “ or \u201C if written as a Java literal
&rdquo; should be ” or \u201D if written as a Java literal
&lsquo; should be ‘ or \u2018 if written as a Java literal
&rsquo; should be ’ or \u2019 if written as a Java literal
&apos; should be '
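If changing the converter itself is inconvenient, the mapping above could be applied to its output before the text reaches the DOM. This is only a sketch under that assumption: the class EntityDecoder and its decode method are hypothetical names, and it handles just the five entities listed.

```java
// Hypothetical helper; maps the five entity names to characters.
final class EntityDecoder {
  /** Replaces the named quote entities with their Unicode characters. */
  static String decode( final String text ) {
    return text
      .replace( "&ldquo;", "\u201C" )  // left double quotation mark
      .replace( "&rdquo;", "\u201D" )  // right double quotation mark
      .replace( "&lsquo;", "\u2018" )  // left single quotation mark
      .replace( "&rsquo;", "\u2019" )  // right single quotation mark
      .replace( "&apos;", "'" );       // apostrophe stays U+0027
  }

  public static void main( final String[] args ) {
    System.out.println( decode( "&ldquo;I reckon, I&apos;m &apos;bout dat.&rdquo;" ) );
  }
}
```

In the question's lambda this would wrap the converter call, for example node.setTextContent( EntityDecoder.decode( sConverter.apply( node.getTextContent() ) ) ). Because apostrophes remain U+0027 while closing single quotes become U+2019, the two stay distinguishable for the later TeX step.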