I am trying to submit a form on http://bioinfo.noble.org/TrSSP/
and want to extract the result.
My query data looks like this
>ATCG00270
MTIALGKFTKDEKDLFDIMDDWLRRDRFVFVGWSGLLLFPCAYFALGGWFTGTTFVTSWYTHGLASSYLEGCNFLTAAVSTPANSLAHSLLLLWGPEAQGDFTRWCQLGGLWAFVALHGAFALIGFMLRQFELARSVQLRPYNAIAFSGPIAVFVSVFLIYPLGQSGWFFAPSFGVAAIFRFILFFQGFHNWTLNPFHMMGVAGVLGAALLCAIHGATVENTLFEDGDGANTFRAFNPTQAEETYSMVTANRFWSQIFGVAFSNKRWLHFFMLFVPVTGLWMSALGVVGLALNLRAYDFVSQEIRAAEDPEFETFYTKNILLNEGIRAWMAAQDQPHENLIFPEEVLPRGNAL
My script looks like this
use strict;
use warnings;
use File::Slurp;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
my $sequence = $ARGV[0];
$mech->get( 'http://bioinfo.noble.org/TrSSP' );
$mech->submit_form( fields => { 'query_file' => $sequence, }, );
print $mech->content;
#sleep (10);
open( my $out, '>', 'out.txt' ) or die "Cannot open out.txt: $!";
my @links = $mech->find_all_links();
print {$out} "\n", $_->url for @links;
print $mech->content gives a result like this
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>The job is running, please wait...</title>
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="interface/style.css" type="text/css">
</head>
<body>
<table width="90%" align="center" border="0" cellpadding="0" cellspacing="0" class="table1">
<tr align="center">
<td width="50"> </td>
<td></td>
<td> </td>
</tr>
<tr align="left" height="30" valign="middle">
<td width="30"> </td>
<td bgColor="#CCCCFF"> Your sequences have been submitted to backend pipeline, please wait for result:</td>
<td width="30"> </td>
</tr>
<tr align="left">
<td> </td>
<td>
<br><br><font color="#0000FF"><strong>
</strong></font>
<BR><BR><BR><BR><BR><BR><br><br><BR><br><br><hr>
If you don't want to wait online, please copy and keep the following link to retrieve your result later:<br>
<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
</td>
<td> </td>
</tr>
</table>
</body>
</html>
I want to extract this link
http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763
and download the result when the job is completed. But find_all_links() only recognizes the relative URL /TrSSP/?sessionid=1492434554474809 as a link.
We don't know how long this backend process is going to take. If it's minutes, you could have your program wait. Even if it's hours, waiting is reasonable.
In a browser, the page is going to refresh on its own. There are two auto-refresh mechanisms implemented in the response you are showing.
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
The JavaScript setTimeout takes its delay argument in milliseconds, so this will fire after 9 seconds.
There is also a meta tag that tells the browser to auto-refresh:
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
Here, the 4 in the content attribute means 4 seconds, so this one would fire earlier.
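Incidentally, there is a second place to find that URL besides the page body: LWP (which WWW::Mechanize is built on) parses http-equiv meta tags in the head by default and exposes them as response headers. A minimal sketch, assuming the default parse_head behavior and the 4;url=... format shown above:

use URI;

# after the initial submit_form call
if ( my $refresh = $mech->response->header('Refresh') ) {
    # typical format: 4;url=/TrSSP/?sessionid=1492435151653763
    my ($relative) = $refresh =~ m{url=(\S+)}i;

    # resolve the relative link against the page's base URL
    my $url = URI->new_abs( $relative, $mech->base )->as_string;
    print "Result will appear at $url\n";
}

Resolving with URI->new_abs also addresses the complaint above: it turns the relative /TrSSP/?sessionid=... that find_all_links() reports into the absolute http://bioinfo.noble.org/... form.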
Of course we also don't know how long they keep the session around. It might be a safe approach to reload that page every ten seconds (or more often, if you want).
You can do that by building a simple while loop and checking if the refresh is still in the response.
# do the initial submit here
...

# grab the result URL from the confirmation page;
# in this case, a regex on HTML is fine
my ($url) = $mech->content =~ m{<strong>(\Qhttp://bioinfo.noble.org/TrSSP/?sessionid=\E\d+)</strong>};
die "Could not find the session URL in the response\n" unless defined $url;
print "Waiting for $url\n";

while (1) {
    $mech->get($url);
    last unless $mech->content =~ m/refresh/;
    sleep 10; # or whatever number of seconds
}

# process the final response ...
We first submit the data. We then extract the URL that you're supposed to call until they are done processing. Since this is a pretty straightforward document, we can safely use a pattern match. The URL is always the same, and it's clearly marked with the <strong> tag. In general it's not a good idea to use regex to parse HTML, but we're not really parsing, we are just screen-scraping a single value. The \Q and \E are the same as quotemeta and make sure that we don't have to escape the . and ? in the URL, which is easier to read than having a bunch of backslashes in the pattern.
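To make that equivalence concrete, here is a tiny standalone demo (nothing beyond core Perl is assumed):

use strict;
use warnings;

my $prefix = 'http://bioinfo.noble.org/TrSSP/?sessionid=';

# quotemeta backslash-escapes every non-word character
print quotemeta($prefix), "\n";
# prints: http\:\/\/bioinfo\.noble\.org\/TrSSP\/\?sessionid\=

# \Q...\E applies the same escaping inside a pattern, so both
# of these matches succeed
my $escaped = quotemeta $prefix;
my $link    = 'http://bioinfo.noble.org/TrSSP/?sessionid=123';
print "equivalent\n" if $link =~ m{\Q$prefix\E\d+}
                     && $link =~ m{$escaped\d+};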
The script will sleep for ten seconds after every attempt before trying again. Once the response no longer contains the refresh marker, it breaks out of the endless loop, so you can put the processing of the actual response that has the data you wanted after that loop.
It might make sense to add some output into the loop so you can see that it's still running.
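For example, a variant of the loop above with a progress counter and a safety limit (the cap of 360 attempts, roughly an hour at ten-second intervals, is an arbitrary choice, not anything the service requires):

my $attempts = 0;
while (1) {
    $mech->get($url);
    last unless $mech->content =~ m/refresh/;
    $attempts++;
    print "Still waiting ($attempts attempts so far)...\n";
    die "Giving up after $attempts attempts\n" if $attempts >= 360;
    sleep 10;
}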
Note that this really needs to keep running until the job is done. Don't stop the process.