I need to retrieve text from a remote web site that does not provide an RSS feed.
What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) with a link that contains the text "Invoices Report".
For example:
<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>
So, I need to find all of the links on the main page that match this pattern and then retrieve all of the text on those pages that sits inside the <div class="invoice-body"> element.
Are there Java tools that help with this and is there anything specifically for Google App Engine for Java that can be used to do this?
Check out http://code.google.com/appengine/docs/java/urlfetch/overview.html
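You can use the URL Fetch service to read www.example.com/index.html line by line, and use a regular expression to look for "Invoices Report". On App Engine, connections opened through java.net.URL go through URL Fetch, so the standard classes are enough: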
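import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

URL url = new URL("http://www.example.com/index.html");
// try-with-resources closes the reader even if a read fails
try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // placeholder: inspect the line and record any matching link
        checkLineForTextAndAddLinkOrWhatever(line);
    }
}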
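The placeholder checkLineForTextAndAddLinkOrWhatever could be filled in along these lines. This is only a sketch: the class name InvoiceLinkFinder, the method findInvoiceLinks, and the regular expression are my own illustration, and the pattern assumes the href is double-quoted and the link text contains no nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceLinkFinder {
    // Matches <a ... href="URL" ...>...Invoices Report...</a>; group 1 is the URL.
    // [^<]* keeps the match inside one anchor's text, so nested tags are not handled.
    private static final Pattern INVOICE_LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>[^<]*Invoices Report[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href of every anchor whose text contains "Invoices Report". */
    public static List<String> findInvoiceLinks(String html) {
        List<String> links = new ArrayList<String>();
        Matcher m = INVOICE_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String sample = "<a href=\"http://www.example.com/data/invoices/2010/10/"
                + "invoices-report---tuesday-october-12.html\">"
                + "Invoices Report - Tuesday, October 12</a>";
        System.out.println(findInvoiceLinks(sample));
    }
}
```

The method accepts either a single line or a whole page as a string, so the same pattern can be reused however the page is read.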
If a link can span more than one line, reading line by line will miss it. In that case, read the whole response into a single string before matching.
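For the second half of the question, one possible approach is to read each linked page into a single string and pull out the invoice-body div with a regular expression. The sketch below is an illustration, not a robust HTML parser: the names InvoiceBodyExtractor, readAll, and extractInvoiceBody are made up for this example, and it assumes the div is written exactly as <div class="invoice-body"> and contains no nested div elements.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvoiceBodyExtractor {
    // DOTALL lets '.' match line breaks, so the div may span several lines.
    // The reluctant (.*?) stops at the first </div>, so nested divs are cut short.
    private static final Pattern INVOICE_BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Reads everything from the reader into one string, keeping line breaks. */
    public static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Returns the text inside the first invoice-body div, or null if absent. */
    public static String extractInvoiceBody(String html) {
        Matcher m = INVOICE_BODY.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Crude tag stripping: drops markup, leaves entities such as &amp; as-is.
        return m.group(1).replaceAll("<[^>]+>", "").trim();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the network stream in this demo.
        String page = readAll(new StringReader(
                "<html><body>\n<div class=\"invoice-body\">\n<p>Total: 42</p>\n</div>\n</body></html>"));
        System.out.println(extractInvoiceBody(page));
    }
}
```

Against the real site, the StringReader would be replaced by an InputStreamReader wrapped around url.openStream() for each invoice link found on the main page.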