
Extracting Links in Perl using TreeBuilder


I'm working on a script to extract a bunch of information into one HTML file. I'm having some difficulty extracting ONLY a specific set of links from the page in question, however.

Here is a rough structure of the site. There are some other headings and paragraphs in between the innercontent div and what I'm showing below.

<div id="innercontent">
<h1>Download here</h1>
<a href="website.pdf"><img src="stuff"></a>
</div>

Now there are multiple links found in the div ID "innercontent," so I'm looking to find a way to either match a string or otherwise to only get the links I want. Keep in mind all of the links I'm looking to grab will be .pdf, so perhaps that can be of some help. I'm pretty sure TreeBuilder can handle this based on the research I've done. Here are two methods I'm trying. I'd prefer to solve it using the first.

# link to pdf of transcript
for ( $mech->look_down(_tag => 'a') ) {
  next unless $_->as_trimmed_text =~ m/pdf/;
  say $_->as_HTML;
}

my @links = $mech->links();
for my $link ( @links ) {
  print $link->url;
}

I realize the latter method is just going to search the entire page for links, but I'm including it just in case that method is more efficient, or if both of these methods can be combined.

Any help or advice would be greatly appreciated!


Solution

  • WWW::Mechanize can extract links based on several attributes, such as the text displayed for the link, the link's URL, or its id.

    For your specific example, you'd fetch the links that are pdfs with:

    my @links = $mech->find_all_links(url_regex => qr/\.pdf$/);

    and then do whatever you need with the resulting array.

    See the WWW::Mechanize documentation for find_all_links; the find_link entry lists the available options.
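
If you would rather stay with the TreeBuilder approach from the question, here is a minimal sketch of how it might look. It assumes HTML::TreeBuilder is installed and uses made-up sample markup (the extra about.html and other.pdf links are hypothetical, added only to show the filtering). The idea is to find the innercontent div first, then match on the href attribute instead of the link text, since an image-only link has no text for as_trimmed_text to match.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;

# Hypothetical sample markup, modelled on the structure in the question.
my $html = <<'HTML';
<div id="innercontent">
<h1>Download here</h1>
<a href="website.pdf"><img src="stuff"></a>
<a href="about.html">About</a>
</div>
<a href="other.pdf">Outside the div</a>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Restrict the search to the div, so links elsewhere on the page are ignored.
my $div = $tree->look_down(_tag => 'div', id => 'innercontent')
    or die "no innercontent div found\n";

# Match on the href attribute; look_down accepts a regex as an attribute value.
my @pdf_links = map { $_->attr('href') }
    $div->look_down(_tag => 'a', href => qr/\.pdf$/i);

say for @pdf_links;

$tree->delete;    # release the tree's memory
```

With a WWW::Mechanize object you could presumably pass $mech->content to new_from_content in place of the inline sample. Note that the hrefs come back exactly as written in the page, so relative ones would still need resolving against the page URL.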