
HTML::Tidy on Windows newline issue


When I use HTML::Tidy on Windows to clean the output of HTML::Element's as_HTML method, I get the wrong type of newline. If I don't specify the newline in the HTML::Tidy constructor, my lines are terminated by CRCRLF. If I specify 'LF', I get CRLF, and if I specify 'CRLF', I get the original CRCRLF. I suspect this is a bug in the HTML Tidy library. It is easy enough to work around by explicitly asking for Unix line endings and getting DOS line endings out, which pretty much any decent editor can handle on any platform.

Per the answer below, I resolved the issue by setting binmode ':raw:utf8' on the output handle to disable \n translation:

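use IO::File;
use HTML::Tidy;

# $tree is the HTML::Element tree built earlier in the script.
my $output = IO::File->new($ARGV[1], 'w') or die "Cannot open $ARGV[1]: $!";
$output->binmode(':raw:utf8');   # no \n translation; write the output as UTF-8
print $output HTML::Tidy->new( { wrap => 80,
                                 indent => 'auto',
                                 'wrap-attributes' => 'yes',
                               }
                             )->clean($tree->as_HTML());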

This seems like a pretty generic problem, but I can't find any real mention of others hitting it, aside from complaints about general bugginess in the HTML Tidy library. Has anyone dealt with this issue, and can anyone confirm that it is a library bug? I'd be surprised if it were, since the library has been around for ages, and I want to confirm before filing a report.

Edit: I updated the code to show the filehandle creation. The issue can be resolved by setting the filehandle's binmode to raw, but then I run into problems with Unicode in the HTML content. Is there a way to resolve it without introducing other issues?

Edit 2: I should note that I originally saw this as an HTML::Tidy issue because printing $tree->as_HTML() directly to the filehandle, with any binmode, produced the correct EOL characters. The issue only appeared once I passed the HTML::Element output through HTML::Tidy.
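For what it's worth, the doubled CR can be reproduced without HTML::Tidy at all, which suggests the filehandle rather than the library. The sketch below is only an illustration under stated assumptions: it pushes Perl's :crlf layer onto an in-memory handle to mimic a Windows text-mode handle, and uses a hypothetical string that already ends its lines with CRLF as a stand-in for the tidied HTML.

```perl
use strict;
use warnings;

# Hypothetical stand-in for tidied output that already uses CRLF line endings.
my $tidied = "<p>one</p>\r\n<p>two</p>\r\n";

# Mimic a Windows text-mode handle on any platform by applying the
# :crlf layer explicitly to an in-memory filehandle.
my $buffer = '';
open(my $fh, '>:crlf', \$buffer) or die "Cannot open in-memory handle: $!";
print {$fh} $tidied;
close($fh) or die "close failed: $!";

# Make the control characters visible before printing.
(my $visible = $buffer) =~ s/\r/<CR>/g;
$visible =~ s/\n/<LF>\n/g;
print $visible;
```

The :crlf layer rewrites every LF as CRLF on output, so a string that already contains CRLF ends up with CR CR LF. That matches what I saw: asking for 'LF' gave CRLF on disk, and asking for 'CRLF' gave CRCRLF.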


Solution

  • Try putting the output filehandle into binary mode:

    binmode($output);
    

    I had a similar issue with the Template Toolkit output.
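If the content contains non-ASCII characters, a plain binmode leaves the handle expecting bytes, so it may help to add an encoding layer on top of :raw. This is a minimal sketch, not a confirmed fix for the setup in the question: the in-memory handle with :crlf stands in for a Windows text-mode handle, and $html is a made-up sample string.

```perl
use strict;
use warnings;

# Made-up sample: CRLF line ending plus characters outside ASCII.
my $html = "<p>caf\x{e9} \x{263a}</p>\r\n";

# Start with the :crlf layer applied, as a Windows text-mode handle would have.
my $buffer = '';
open(my $out, '>:crlf', \$buffer) or die "Cannot open in-memory handle: $!";

# :raw turns off newline translation; :encoding(UTF-8) then encodes
# characters to bytes on output, so wide characters are handled.
binmode($out, ':raw:encoding(UTF-8)') or die "binmode failed: $!";
print {$out} $html;
close($out) or die "close failed: $!";

# Report the size and the number of CR bytes that reached the buffer.
printf "%d bytes written, CR count: %d\n", length($buffer), ($buffer =~ tr/\r//);
```

With both layers in place, the line ending passes through unchanged and the text is written as UTF-8 bytes.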