
WWW::Mechanize Extraction Help - PERL


I'm trying to automate the extraction of a transcript found on a website. The entire transcript sits between dl tags, since the site formatted the interview as a description list. The script below lets me search the site and extract the text in plain-text format, but I actually want it to include everything between the dl tags, meaning the dd's, dt's, etc. This will allow us to develop our own CSS for the interview.

Something to note about the page is that there are line-break (br) tags inserted at various points in the interview. Some tools we've tried that extract information from web pages using tag pairings had a problem with this, since they only grab the information up to the break. Just something to keep in mind if you point me in a different direction. Here's what I have so far:

#!/usr/bin/perl -w

use strict;
use WWW::Mechanize;
use WWW::Mechanize::TreeBuilder;

my $mech = WWW::Mechanize->new();
WWW::Mechanize::TreeBuilder->meta->apply($mech);
$mech->get("http://millercenter.org/president/clinton/oralhistory/madeleine-k-albright");

# find all <dl> tags
my @list = $mech->find('dl');

foreach ( @list ) {
    print $_->as_text();
}

If there is a tool that essentially prints what I have, only this time as HTML, please let me know of it!


Solution

  • Your code is fine. Just change the as_text() call to as_HTML(), and the output will include the HTML tags as well as the text.
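
To illustrate the suggestion above, here is a minimal sketch that runs offline. It makes a few assumptions: it parses a small, made-up HTML snippet with HTML::TreeBuilder directly instead of fetching the real page through WWW::Mechanize, and it passes three optional arguments to as_HTML() that the answer does not mention. As far as I know, as_HTML() leaves out optional closing tags such as </dt> and </dd> by default, so the empty hash reference asks for every closing tag to be written, which may be easier to style with your own CSS.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the transcript page, with a <br> inside a <dd>.
my $html = <<'HTML';
<html><body>
<dl>
<dt>Interviewer</dt>
<dd>First line<br>second line</dd>
</dl>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In list context, find() returns every element with the given tag name.
my @fragments;
for my $dl ( $tree->find('dl') ) {
    # Arguments to as_HTML():
    #   '<>&' : only encode these characters as entities
    #   '  '  : indent nested elements with two spaces
    #   {}    : treat no end tag as optional, so </dt> and </dd> are emitted
    push @fragments, $dl->as_HTML( '<>&', '  ', {} );
}

print @fragments;

# Free the parse tree explicitly once it is no longer needed.
$tree->delete;
```

In the original script the same call would replace as_text() inside the foreach loop, that is, print $_->as_HTML( '<>&', '  ', {} ). Because the whole dl element is serialized, the br tags are just part of the output and should not cut the extraction short.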