I'm using the following code to tidy a snippet of untidy HTML.
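perl -Mutf8 -MXML::LibXML -E'
    my $filename = "1.html";
    # Read raw bytes; the parser decides how to decode them.
    open my $fh, "<", $filename or die "Cannot open $filename: $!";
    binmode $fh;
    my $dom = XML::LibXML->load_html(
        IO              => $fh,
        recover         => 1,   # keep going past malformed markup
        suppress_errors => 1,
        huge            => 10000000,
    );
    say $dom->toString();
' > tidy.html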
The untidy HTML (missing the closing </p> tag):
1.html:
<p>aΩ<span>test</span>
As you can see, there's one special character, Ω, in the <p> tag. After tidying, the Ω is encoded as Ω, as follows (tidied HTML):
tidy.html:
<html><body><p>aΩ<span>test</span></p></body></html>
Can I keep Ω in its original form, instead of its encoded form, in the tidy output?
Or are there any alternative ways to tidy the HTML that won't encode special characters?
The problem is not quite what you think.
The HTML parser treats the input as Latin-1, as specified by the standard, but your input file is really in UTF-8, so the two bytes that encode Ω are read as two separate characters. To make it work, you need to declare the correct encoding in the document, e.g.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>