I want to implement a very simple web crawler in Java, and I have found this library, crawler4j: http://code.google.com/p/crawler4j/
I need a crawler that does the following:
Start from a URL that I specify and check whether the current page contains a specific word, such as a person's name or a company name (I specify this word too).
If the word is found, save the current page's URL in a database.
So no semantic analysis is needed, only syntactic analysis: the crawler just has to match the page content against the tokens I specify.
I would like to know whether this token search (checking whether a word is contained in the current page) is already implemented by crawler4j's abstract class WebCrawler,
or whether I have to implement it myself.
As noted by user1887511, it is dead simple to implement yourself. Adapted from here.
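// The word to look for in each crawled page.
static String wordToFind = "...";

@Override
public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        // Plain text of the page, with the HTML markup stripped.
        String text = htmlParseData.getText();
        if (text.contains(wordToFind)) {
            // saveToDB is a method you write yourself; it is not part of crawler4j.
            saveToDB(page.getWebURL().getURL());
        }
    }
}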
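Note that a plain substring check is case-sensitive and also matches inside longer words. If that matters for names, the check could be moved into a small helper like the sketch below. KeywordMatcher is a hypothetical class name of my own, not part of crawler4j, and it only uses java.util.regex from the standard library. "Whole word" here is an assumption: the match must not be directly preceded or followed by a letter or digit.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of crawler4j): case-insensitive, whole-word search.
final class KeywordMatcher {
    private final Pattern pattern;

    KeywordMatcher(String word) {
        // Pattern.quote makes the word literal, so characters like '+' or '.' are safe.
        // The lookarounds reject a match that touches a letter or digit on either side.
        this.pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(word) + "(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns true if the word occurs in the text as a standalone token.
    boolean matches(String text) {
        return text != null && pattern.matcher(text).find();
    }
}
```

Inside visit you would then build one matcher for the word and call matcher.matches(htmlParseData.getText()) in place of the substring check.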