perl web-scraping find

Check if page contains specific word


How can I check whether a page contains a specific word? For example, I want to return true or false depending on whether the page contains the word "candybar". Note that "candybar" is sometimes enclosed in tags (candybar) and sometimes not. How do I accomplish this?

Here is my code for grabbing the site (I just don't know how to search through it):

#!/usr/bin/perl -w

use utf8;

use RPC::XML;
use RPC::XML::Client;
use Data::Dumper;
use Encode;
use Time::HiRes qw(usleep);

print "Content-type:text/html\n\n";

use LWP::Simple; 

my $pageURL = "http://example.com";

# get() returns undef if the request fails
my $simplePage = get($pageURL);

if (defined $simplePage && $simplePage =~ m/candybar/) {
    print "it's there!";
}

Solution

  • I'd suggest that you use some kind of parser if you're looking for words in HTML or anything else that's tagged in a known way (XML, for example). I use HTML::TokeParser, but there are many parsing modules on CPAN.

    I've left the explanation of the returns from the parser as comments, in case you use this parser. This is extracted from a live program that I use to machine translate the text in web pages, so I've taken out some bits and pieces.

    The comment above about checking the status and content of what LWP returns is very sensible too: if the website is offline, you need to know that.

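    use strict;
    use warnings;
    use HTML::TokeParser;

    # HTML file to scan; assumed here to be passed on the command line
    my $file = shift or die "Usage: $0 file.html\n";

    open( my $fh, "<:encoding(UTF-8)", $file ) || die "Can't open $file : $!";

    my $p = HTML::TokeParser->new($fh) || die "Can't parse $file";

    $p->empty_element_tags(1);    # recognise XML-style empty tags such as <br />
    my $found = 0;                # set to 1 once the word is seen
    while ( my $token = $p->get_token ) {
        # Token layouts returned by get_token:
        #["S",  $tag, $attr, $attrseq, $text]
        #["E",  $tag, $text]
        #["T",  $text, $is_data]
        #["C",  $text]
        #["D",  $text]
        #["PI", $token0, $text]
        # unpack the token directly; for a text token the first two
        # elements are its type and its text
        my ( $type, $string ) = @$token;
        # ["T",  $text, $is_data] : rule for text
        if ( $type eq 'T' && $string =~ /\bcandybar\b/ ) {
            $found = 1;
            last;
        }
    }
    close $fh;
    print $found ? "true\n" : "false\n";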
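
To tie the two points above together (check what LWP returns, then look only at text tokens), here is a minimal sketch. The helper names page_contains_word and fetch_html are made up for illustration and are not part of any module. It assumes that a whole-word, case-sensitive match is what you want, and it uses LWP::UserAgent rather than LWP::Simple so that the response status can be inspected.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use LWP::UserAgent;

# Hypothetical helper: return 1 if $word occurs as a whole word in any
# text token of $html, otherwise 0. Tag names and attribute values are
# not searched.
sub page_contains_word {
    my ( $html, $word ) = @_;
    my $p = HTML::TokeParser->new( \$html )
        or die "Can't parse HTML";
    while ( my $token = $p->get_token ) {
        my ( $type, $text ) = @$token;
        next unless $type eq 'T';    # text tokens: ["T", $text, $is_data]
        return 1 if $text =~ /\b\Q$word\E\b/;
    }
    return 0;
}

# Hypothetical helper: fetch a page and return its decoded content,
# or undef if the request did not succeed (site offline, 404, ...).
sub fetch_html {
    my ($url) = @_;
    my $ua  = LWP::UserAgent->new( timeout => 10 );
    my $res = $ua->get($url);
    return $res->is_success ? $res->decoded_content : undef;
}
```

You would call it along the lines of `my $html = fetch_html("http://example.com");` and then, if `$html` is defined, test `page_contains_word( $html, "candybar" )`. Note that text inside script or style elements also arrives as text tokens, so this sketch does not distinguish visible text from embedded code, and a word split across tags (such as candy followed by bar in a separate element) will not match.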