perl

How do I download txt web content using Perl


I am trying to download data from this data page. I have tried a number of scripts I googled. On the data page I have to select the countries I want, one at a time. The one script which gets close to what I want is:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $file = 'Zamb.txt';
getstore($url, $file);

However, this script gives me the HTML page, not the data. Is it possible to download just the data? I would also be open to doing it in PHP.


Solution

  • The link returns text wrapped in HTML. The simplest approach is to use HTML::TreeBuilder and HTML::FormatText to extract a text-only version.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use HTML::TreeBuilder;
    use HTML::FormatText;
    
    
    my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
    my $text = HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000)->format(HTML::TreeBuilder->new_from_url($url));
    
    my $file = 'Zamb.txt';
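    # Fail loudly instead of silently writing nothing if the file cannot be opened.
    open(my $fh, '>', $file) or die "Cannot open $file: $!";
    print $fh $text;
    close($fh) or die "Cannot close $file: $!";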
    

    This is the content of Zamb.txt afterwards.

     $ cat Zamb.txt
    ##########################################################
    # Query made at 02/29/2020 18:15:54 UTC
    ##########################################################
    
    ##########################################################
    # latest SYNOP reports from Zambia before 02/29/2020 18:15:54 UTC
    ##########################################################
    202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201
                       333 5//// 85850 83080=
    

    My php fu isn't up to date, but for PHP, I think you can use the following:

    <?php
    $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
    $content = strip_tags(file_get_contents($url));
    echo substr($content, strpos($content, '###############'));
    

    Note: file_get_contents can only fetch URLs when the allow_url_fopen setting in php.ini is enabled, so YMMV.

    However, on the same page there is a note:

    NOTE: If you want to get simply files with synop reports in CSV format without HTML tags consider to use the binary getsynop

    This would get you the same data in an easy-to-use format:

    $ wget "https://www.ogimet.com/cgi-bin/getsynop?begin=$(date +%Y%m%d0000)&state=Zambia" -o /dev/null -O - | tail -1
    67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
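
    Since the question was about Perl, here is a minimal sketch of how one such line could be split into fields once it has been downloaded. The sub name parse_synop_line and the field labels (station, timestamp, report) are hypothetical; the column order of station, year, month, day, hour, minute, report is only inferred from the sample line above, not from any getsynop documentation.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Split one getsynop line into named fields.
    # ASSUMPTION: the column order is station, year, month, day, hour,
    # minute, report, as inferred from the sample output above.
    sub parse_synop_line {
        my ($line) = @_;
        chomp $line;

        # A limit of 7 keeps any commas inside the report text intact.
        my ($station, $year, $month, $day, $hour, $minute, $report)
            = split /,/, $line, 7;

        # Fewer than seven fields: not a data line.
        return unless defined $report;

        return {
            station   => $station,
            timestamp => sprintf('%04d-%02d-%02d %02d:%02d',
                                 $year, $month, $day, $hour, $minute),
            report    => $report,
        };
    }

    my $sample = '67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 '
               . '10310 20168 3//// 48/// 85201 333 5//// 85850 83080=';
    my $rec = parse_synop_line($sample);
    print "$rec->{station} $rec->{timestamp}\n";
    ```

    Run on the sample line, this prints the station number followed by the observation time, and the raw SYNOP report stays available in $rec->{report}.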