perl xml-libxml

Prevent entity encoding when tidying HTML with XML::LibXML


I'm using the following code to tidy a snippet of untidy HTML.

    perl -Mutf8 -MXML::LibXML -E'
    my $filename = "1.html";
    # Open the file as raw bytes and let the parser handle decoding.
    open my $fh, "<", $filename or die "Cannot open $filename: $!";
    binmode $fh;
    my $dom = XML::LibXML->load_html(
        IO              => $fh,
        recover         => 1,
        suppress_errors => 1,
        huge            => 10000000,
    );
    say $dom->toString();
    ' > tidy.html

The untidy HTML (missing the closing </p> tag):

1.html:

    <p>aΩ<span>test</span>

As you can see, the <p> element contains one special character, Ω. After tidying, the Ω is encoded as &#xCE;&#xA9;, as shown in the tidied HTML:

tidy.html:

    <html><body><p>a&#xCE;&#xA9;<span>test</span></p></body></html>

Can I keep Ω in its original form, instead of its encoded form, in the tidied output?

Or are there any alternative ways to tidy HTML that won't encode special characters?


Solution

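  • The problem is not quite what you think.

    The output does not contain an encoded Ω. &#xCE;&#xA9; is two separate characters, Î (U+00CE) and © (U+00A9). Your input file is in UTF-8, where Ω is stored as the two bytes CE A9. Since the file declares no encoding, the HTML parser falls back to Latin-1 (ISO-8859-1), the historical default for HTML, and reads each byte as a character of its own. To make it work, you need to declare the correct encoding in the document, e.g.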

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
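To make the byte-level misreading concrete, here is a minimal sketch. It first decodes the two bytes of Ω under both encodings, then parses the sample snippet again. It assumes you would rather tell the parser about the encoding than edit the file, and uses the `encoding` option of `load_html` for that; the variable names are only illustrative. How the serialized output represents Ω still depends on the document's encoding, so the sketch checks the parsed text rather than the output of toString().

```perl
use strict;
use warnings;
use Encode qw(decode);
use XML::LibXML;

# The UTF-8 encoding of U+03A9 (Ω) is the two bytes 0xCE 0xA9.
my $bytes = "\xCE\xA9";

# Read as Latin-1, each byte becomes its own character.
my $as_latin1 = decode('ISO-8859-1', $bytes);
# Read as UTF-8, the two bytes form a single character.
my $as_utf8 = decode('UTF-8', $bytes);

printf "as Latin-1: %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $as_latin1;
# prints "as Latin-1: U+00CE U+00A9"
printf "as UTF-8:   %s\n",
    join ' ', map { sprintf 'U+%04X', ord } split //, $as_utf8;
# prints "as UTF-8:   U+03A9"

# The snippet from the question, as raw UTF-8 bytes.
my $html = "<p>a\xCE\xA9<span>test</span>";

# Assumption: state the input encoding via the parser option
# instead of adding a <meta> tag to the file.
my $dom = XML::LibXML->load_html(
    string          => $html,
    encoding        => 'UTF-8',
    recover         => 1,
    suppress_errors => 1,
);

# The DOM now holds a single Ω character in the <p> text.
my $text = $dom->findvalue('//p');
binmode STDOUT, ':encoding(UTF-8)';
print "text of <p>: $text\n";
```

The same `encoding` option can be passed alongside `IO => $fh` in the original one-liner, which avoids touching 1.html at all.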