[SOLVED] How do I download txt web content using Perl

I am trying to download data from this data page. I have tried a number of scripts I found by searching. On the data page I have to select the countries I want, one at a time. The script that comes closest to what I want is:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $file = 'Zamb.txt';
getstore($url, $file);

However, this script gives me the HTML page, not the data. Is it possible to download just the data? I would also be open to doing it in PHP.

Solution

The link returns text wrapped in HTML. The simplest approach is to use HTML::TreeBuilder and HTML::FormatText to get a text-only version.

#!/usr/bin/perl

use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::FormatText;


my $url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
my $text = HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000)->format(HTML::TreeBuilder->new_from_url($url));

my $file = 'Zamb.txt';
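# Write the formatted text to the file, failing loudly if that is not possible
open(my $fh, '>', $file) or die "Cannot open $file: $!";
print $fh $text;
close($fh) or die "Cannot close $file: $!";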

HTML::TreeBuilder->new_from_url($url) - downloads and parses the HTML
HTML::FormatText->new(leftmargin=>0, rightmargin=>100000000000) - initializes the HTML formatter; the right margin is set to a very large value to prevent wrapping

This is the content of Zamb.txt afterwards.

 $ cat Zamb.txt
##########################################################
# Query made at 02/29/2020 18:15:54 UTC
##########################################################

##########################################################
# latest SYNOP reports from Zambia before 02/29/2020 18:15:54 UTC
##########################################################
202002291200 AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201
                   333 5//// 85850 83080=
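
To see why the right margin matters, here is a small offline sketch that formats the same HTML twice, once with a right margin of 72 (the formatter's documented default) and once with the very large value used above. The HTML string and the html_to_text helper are made up for illustration; the real ogimet.com markup may differ.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: format an HTML string as plain text
# using the given right margin.
sub html_to_text {
    my ($html, $rightmargin) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => $rightmargin)
                               ->format($tree);
    $tree->delete;    # free the parse tree
    return $text;
}

# Count the non-empty lines in a block of text.
sub count_lines {
    my ($text) = @_;
    return scalar grep { /\S/ } split /\n/, $text;
}

# Made-up HTML with one long paragraph (an assumption, not the real page).
my $html = '<html><body><p>'
         . join(' ', ('AAXX 29124 67855') x 10)
         . '</p></body></html>';

printf "lines with right margin 72:   %d\n", count_lines(html_to_text($html, 72));
printf "lines with huge right margin: %d\n", count_lines(html_to_text($html, 100000000000));
```

With the narrow margin the paragraph is split over several lines, while the huge margin keeps it on a single line, which is what you want when each report should stay intact.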

My PHP-fu isn't up to date, but for PHP I think you can use the following:

<?php
$url = 'https://www.ogimet.com/ultimos_synops2.php?lang=en&estado=Zamb&fmt=txt&Send=Send';
$content = strip_tags(file_get_contents($url));
echo substr($content, strpos($content, '###############'));

Note: file_get_contents can only fetch URLs when the allow_url_fopen setting is enabled in php.ini, so YMMV.

However, on the same page there is a note:

NOTE: If you want to get simply files with synop reports in CSV format without HTML tags consider to use the binary getsynop

This would get you the same data in an easy-to-use format:

$ wget "https://www.ogimet.com/cgi-bin/getsynop?begin=$(date +%Y%m%d0000)&state=Zambia" -o /dev/null -O - | tail -1
67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 20168 3//// 48/// 85201 333 5//// 85850 83080=
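
Since the question asked for Perl, here is a rough sketch of the same getsynop request without wget. It uses the core HTTP::Tiny module (https requests additionally need IO::Socket::SSL installed). The helper names and the field names (station, year, ...) are my own guesses based on the sample line above, not something documented on the site, and the start time is computed in UTC, which is also an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTTP::Tiny;            # core module since Perl 5.14
use POSIX qw(strftime);

# Build a getsynop URL for one state. The parameter names (begin, state)
# come from the wget command above; $begin defaults to 00:00 of the
# current UTC day (assumption). State names containing spaces or other
# special characters would need URL-escaping first.
sub getsynop_url {
    my ($state, $begin) = @_;
    $begin //= strftime('%Y%m%d0000', gmtime);
    return "https://www.ogimet.com/cgi-bin/getsynop?begin=$begin&state=$state";
}

# Split one CSV line into named fields. Only the first six commas are
# treated as separators, so the report itself stays in one piece.
# The field names are guesses based on the sample line.
sub parse_synop_line {
    my ($line) = @_;
    chomp $line;
    my %rec;
    @rec{qw(station year month day hour minute report)} = split /,/, $line, 7;
    return \%rec;
}

# Fetch all reports for one state and return them as a list of hash refs.
sub fetch_synops {
    my ($state) = @_;
    my $res = HTTP::Tiny->new->get(getsynop_url($state));
    die "Request failed: $res->{status} $res->{reason}\n" unless $res->{success};
    # Keep only lines that look like "<station>,..." and parse each one.
    return map { parse_synop_line($_) } grep { /^\d+,/ } split /\n/, $res->{content};
}

# Offline demonstration using the sample line shown above.
my $sample = '67855,2020,02,29,12,00,AAXX 29124 67855 42775 51401 10310 '
           . '20168 3//// 48/// 85201 333 5//// 85850 83080=';
my $rec = parse_synop_line($sample);
printf "%s %s-%s-%s %s:%s\n", @{$rec}{qw(station year month day hour minute)};
# prints: 67855 2020-02-29 12:00
```

To download the live data, call fetch_synops('Zambia') and loop over the returned records. For several countries, call it once per state name; which state names the site accepts is something to check on the data page.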