utf-8 raku malformed

perl6 Malformed UTF-8 causes program crash


I am trying to download a webpage, then analyze it with a regex, and then fetch the files the regex discovers. I have two questions:

(1) I use wget to download the webpages and files, using these lines:

my $webPage = "onePage";
my $result = run <<wget -O $webPage $aSite>>, :out, :err;

where $webPage is the output file from wget. Question: is there any perl6 equivalent to wget? I used the module URI::FetchFile from the perl6 website; it gets some files, but it cannot get webpages.

(2) The $webPage downloaded by wget sometimes has malformed UTF-8 characters, which cause my program to crash. When I do

cat onePage

from the shell, those malformed UTF-8 characters show up as a blob, and this command causes the same error as my program:

cat onePage | perl6 -ne '.say;'

and the error output from perl6 is

Malformed UTF-8
  in block <unit> at -e line 1

and on the terminal or shell, one of the malformed UTF-8 chars shows up as a blob like this:

h�lt

and if I try to remove non-printable chars, the result is that I miss a huge number of links to files:

$tmpLine ~~ s/<-[print]>//; # this causes my program to miss many files

How do I best handle these malformed UTF-8 chars or any malformed unicodes or even malformed control chars?


Solution

  • Any perl6 equivalent to wget?

    There are several. HTTP::UserAgent is now considered the more up-to-date option, but you can also use LWP::Simple. (A short sketch using it follows after this list.)

  • How do I best handle these malformed UTF-8 chars or any malformed unicodes or even malformed control chars?

    You might want to try the UTF8-C8 encoding (see the second sketch below). But it's probably not a problem if you obtain the page directly from the perl6 program.

    Crashes, however, are a completely different thing. The best thing to do is to create a Rakudo issue.
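
Here is a minimal sketch of a wget replacement using HTTP::UserAgent; the URL is a placeholder standing in for the question's $aSite, and error handling is kept to the bare minimum:

use HTTP::UserAgent;

my $aSite   = 'http://example.com/';   # placeholder URL, stands in for $aSite above
my $webPage = 'onePage';               # same output file name as in the question

my $ua       = HTTP::UserAgent.new;
my $response = $ua.get($aSite);

if $response.is-success {
    $webPage.IO.spurt($response.content);   # save the page, much like wget -O
}
else {
    die $response.status-line;
}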
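
And a sketch of reading a file that wget has already written, using the utf8-c8 encoding so that byte sequences that are not valid UTF-8 are round-tripped as synthetic characters instead of throwing "Malformed UTF-8"; the file name onePage and the idea of scanning for links are taken from the question:

my $webPage = 'onePage';

# slurp with utf8-c8: invalid bytes are preserved as synthetic
# codepoints instead of causing an exception
my $text = $webPage.IO.slurp(:enc<utf8-c8>);

for $text.lines -> $line {
    # apply the link-extracting regex here; a trivial stand-in:
    say $line if $line ~~ / 'href=' /;
}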