java regex web-scraping google-app-engine

How to perform web scraping to find specific linked pages in Java on Google App Engine?


I need to retrieve text from a remote web site that does not provide an RSS feed.

What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/), and that the link text contains "Invoices Report".

For example:

<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>

So, I need to find all of the links on the main page that match this pattern, and then retrieve all of the text on those pages that sits inside the <div class="invoice-body"> element.

Are there Java tools that help with this, and is there anything specific to Google App Engine for Java that can be used to do it?


Solution

  • Check out http://code.google.com/appengine/docs/java/urlfetch/overview.html

    You can use the URL Fetch service (on App Engine, java.net.URL connections go through it) to read www.example.com/index.html line by line, and use a regular expression to look for "Invoices Report".

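    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    URL url = new URL("http://www.example.com/index.html");
    // Assumption: the page is UTF-8; change the charset name if it is not.
    BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            // Placeholder: test the line against your pattern and collect the link.
            checkLineForTextAndAddLinkOrWhatever(line);
        }
    } finally {
        reader.close(); // release the connection even if reading fails
    }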
    

    If a link can span multiple lines, matching line by line will miss it. In that case, read the whole response into a single string and run the regular expression against that.
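
As a rough sketch of the multi-line case, something like the following could work. The class and method names (InvoiceLinkFinder, readAll, findReportLinks, extractInvoiceText) are made up for illustration and are not part of any App Engine API. It also assumes that href values are double-quoted, that the link text contains no nested tags, and that the invoice-body div contains no nested <div>. Regular expressions are fragile on real HTML, so treat this as a starting point only.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
final class InvoiceLinkFinder {

    // Captures the href of anchors whose text contains "Invoices Report".
    // Negated character classes also match newlines, so the tag or its
    // text may span several lines. Assumes double-quoted href values.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>([^<]*Invoices Report[^<]*)</a>",
            Pattern.CASE_INSENSITIVE);

    // Captures the content of <div class="invoice-body">.
    // Assumes there is no nested <div> inside it.
    private static final Pattern BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Reads everything from the reader into one string, so that matches
    // are not limited to a single line.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Returns the href of every matching link, in page order.
    static List<String> findReportLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Returns the text of the invoice-body div with tags removed and
    // whitespace collapsed, or null if the div is not found.
    static String extractInvoiceText(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1)
                .replaceAll("<[^>]+>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }
}
```

You would pass the reader from the snippet above to readAll, call findReportLinks on the result, then fetch each returned URL the same way and hand its content to extractInvoiceText. If the pages turn out to be messier than these assumptions allow, a real HTML parser is likely a better fit than regular expressions.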