How do I round-trip XML character references with XML::LibXML?


XML file snippet:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<copyright-statement>Copyright &#x00A9;The authors 2024.</copyright-statement>
</root>

Perl code snippet:

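use strict;
use warnings;
use XML::LibXML;
use File::Slurp;

# Parse the XML file named on the command line.
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
binmode $fh;
my $dom = XML::LibXML->load_xml(IO => $fh);
my $root = $dom->documentElement();

# Write the text of the first <copyright-statement> element to a file.
my @cp = $root->findnodes('//copyright-statement');
write_file('file_name.txt', $cp[0]->textContent);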

Desired output: Copyright &#x00A9;The authors 2024.

Actual output: Copyright ©The authors 2024.

I am parsing an XML file that may contain multiple entities. I want to change some XML attributes, values, node names, etc. and save the file again. But when I do so, the HTML entities get decoded automatically. I want to keep the entities intact (the same as in the input file). What should I change in the Perl code?


Solution

  • From the point of view of XML, &#x00A9; and © are the same thing. In fact, &#x00A9; is not an entity but a character reference (see the specification).

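    The parser resolves character references as it reads the document, so the DOM holds only the character itself and no record of how it was written in the source. There's no way to enforce the behaviour you requested. And there shouldn't be, as the two forms should appear identical to any other tool that respects the specification.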
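To illustrate the point, here is a minimal sketch that parses both spellings and compares the resulting text. It also shows one possible workaround if the output merely has to be ASCII-safe: re-escaping non-ASCII characters after extraction. The helper name `escape_non_ascii` and the element name `c` are made up for this example. Note that the helper escapes every non-ASCII character, not only those that were written as references in the input, since that information is gone after parsing.

```perl
use strict;
use warnings;
use Encode qw(encode);
use XML::LibXML;

# Two documents that differ only in how U+00A9 is written:
# once as a character reference, once as the literal character (UTF-8 bytes).
my $with_ref  = '<root><c>Copyright &#x00A9;The authors 2024.</c></root>';
my $with_char = encode('UTF-8', "<root><c>Copyright \x{A9}The authors 2024.</c></root>");

my $text_a = XML::LibXML->load_xml(string => $with_ref)->findvalue('/root/c');
my $text_b = XML::LibXML->load_xml(string => $with_char)->findvalue('/root/c');

# After parsing, both forms yield the same string.
print "same text\n" if $text_a eq $text_b;

# Hypothetical helper: rewrite every non-ASCII character as a
# hexadecimal character reference, padded to at least four digits.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord $1)/ge;
    return $text;
}

print escape_non_ascii($text_a), "\n";
```

The padding to four hex digits is only a guess at the style used in the input file; any tool that follows the specification treats &#xA9;, &#x00A9; and &#169; alike.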