html xml xslt

Why does xsltproc convert accented characters into hexadecimal entities?


I have the following HTML file, input.html, from which I want to extract XML fragments:

<!DOCTYPE html>
<div>Text with ó</div>

I apply this XSLT stylesheet, named stylesheet.xsl:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" />

  <xsl:template match="div">
    <tag attribute="{child::text()}"></tag>
  </xsl:template>

</xsl:stylesheet>

When I run xsltproc stylesheet.xsl input.html, I want to get the following result:

<?xml version="1.0"?>
<tag attribute="Text with ó"/>

but instead the accented character in the attribute is written as a hexadecimal entity:

<?xml version="1.0"?>
<tag attribute="Text with &#xF3;"/>

How can I avoid these hexadecimal entities without having to translate every possible entity back, as described in "XSL: how do I keep xsltproc from tampering with an escaped HTML string in an attribute value?"


Solution

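  • Add encoding="UTF-8" to your xsl:output instruction. When no output encoding is declared, xsltproc escapes non-ASCII characters such as ó as numeric character references; with an explicit UTF-8 encoding it can write them literally.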
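
Applied to the stylesheet from the question, the fix might look like the sketch below. It assumes nothing else needs to change: the only difference from the original stylesheet.xsl is the encoding attribute on xsl:output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Declaring the output encoding lets the serializer emit non-ASCII
       characters directly instead of escaping them as &#x...; references. -->
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />

  <xsl:template match="div">
    <!-- The attribute value template copies the div's text into the attribute. -->
    <tag attribute="{child::text()}"></tag>
  </xsl:template>

</xsl:stylesheet>
```

Running xsltproc stylesheet.xsl input.html again should then write the ó literally in the attribute value. The XML declaration in the output is also expected to carry encoding="UTF-8", so a consumer that compares the output byte for byte may need to account for that.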