I am trying to download a webpage, analyze it with regexes, and then download the files the regexes discover. I have 2 questions:
(1) I use wget to download the webpages and files, using this line:

my $webPage = "onePage";
my $result = run <<wget -O $webPage $aSite>>, :out, :err;

where $webPage is the output file from wget. Question: is there any Perl 6 equivalent to wget? I used the module URI::FetchFile from the Perl 6 modules website; it gets some files, but it cannot get webpages.
(2) The $webPage downloaded by wget sometimes contains malformed UTF-8 characters, which cause my program to crash. When I do

cat onePage

from the shell, those malformed UTF-8 characters show up as a blob, and this command causes the same error as my program:

cat onePage | perl6 -ne '.say;'

The error output from perl6 is

Malformed UTF-8
in block <unit> at -e line 1

and on the terminal, one of the malformed UTF-8 characters shows as a blob like this:

h�lt

If I try to remove non-printing characters, the result is that I miss a huge number of links to files:

$tmpLine ~~ s/<-[print]>//; # this causes my program to miss many files

How do I best handle these malformed UTF-8 characters, or any malformed Unicode or even malformed control characters?
Any perl6 equivalent to wget?
There are several. HTTP::UserAgent is now considered the most up-to-date, but you can also use LWP::Simple.
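As a minimal sketch (assuming HTTP::UserAgent is installed, e.g. via `zef install HTTP::UserAgent`, and using a placeholder URL), fetching and saving a page looks like this:

```raku
# Sketch: fetch a page with HTTP::UserAgent and save it,
# roughly equivalent to: wget -O onePage $aSite
use HTTP::UserAgent;

my $ua       = HTTP::UserAgent.new;
my $response = $ua.get('https://example.com');   # placeholder URL

if $response.is-success {
    spurt 'onePage', $response.content;          # write the body to a file
}
else {
    die $response.status-line;                   # e.g. "404 Not Found"
}
```

LWP::Simple offers an even shorter `get($url)` form if you only need the body.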
How do I best handle these malformed UTF-8 characters, or any malformed Unicode or even malformed control characters?

You might want to try the UTF8-C8 encoding. But it's probably not a problem if you fetch the page directly from your Perl 6 program.
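A minimal sketch of the UTF8-C8 approach: read the downloaded file as raw bytes and decode with utf8-c8, which represents invalid byte sequences as synthetic codepoints instead of throwing "Malformed UTF-8" (the href-matching regex here is just an illustration):

```raku
# Sketch: decode a possibly-malformed file with utf8-c8 so decoding never dies.
my $bytes = slurp 'onePage', :bin;        # Buf of raw bytes, no decoding yet
my $text  = $bytes.decode('utf8-c8');     # invalid sequences become synthetics

for $text.lines -> $line {
    # Regex matching still works on the utf8-c8 text; malformed spots
    # simply appear as odd characters rather than crashing the program.
    say $0 if $line ~~ / 'href="' (<-["]>+) '"' /;
}
```

You can get the same effect at read time with `slurp 'onePage', :enc<utf8-c8>`.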
Crashes, however, are a completely different thing. The best course of action is to open a Rakudo issue so the developers can look into it.