[SOLVED] Some information about pattern matching in a Java web crawler using the crawler4j library


I want to implement a very simple web crawler in Java, and I have found this library: crawler4j: http://code.google.com/p/crawler4j/

I need a crawler that does the following:

Start from a URL (specified by me) and recognize whether the current page contains a specific word, such as a person's name or a company name (this word is also specified by me).

If the word is found, the current page's URL has to be saved in a database.

So there is no semantic analysis, only syntactic analysis: the crawler has to try to match the web page content against some tokens specified by me.

I would like to know whether this token search (finding out whether a word is contained in the current page) is a feature already implemented by the abstract class WebCrawler of crawler4j, or whether I have to implement it myself.

Solution

As noted by user1887511, this is dead simple to implement yourself by overriding visit in your WebCrawler subclass. Adapted from here.

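  import edu.uci.ics.crawler4j.crawler.Page;
  import edu.uci.ics.crawler4j.parser.HtmlParseData;

  // The members below go inside your class that extends WebCrawler.
  static String wordToFind = "...";

  @Override
  public void visit(Page page) {
      // Only HTML pages expose extracted plain text.
      if (page.getParseData() instanceof HtmlParseData) {
          HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
          String text = htmlParseData.getText();
          // Case-sensitive substring check; saveToDB is your own method.
          if (text.contains(wordToFind)) {
              saveToDB(page.getWebURL().getURL());
          }
      }
  }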
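
One thing to keep in mind: the substring check above is case-sensitive and also fires inside longer words (searching for "Acme" would hit "Acmeology" but miss "ACME"). If that matters for names, a small helper along these lines might work as a drop-in for the check. This is only a sketch: WordMatcher and its matches method are hypothetical names of mine, not part of crawler4j, and it assumes the word starts and ends with a letter or digit (the \b boundary does not behave as you might expect for something like "C++").

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: case-insensitive whole-word matching.
final class WordMatcher {
    private final Pattern pattern;

    WordMatcher(String word) {
        // Pattern.quote makes the word a literal; \b anchors it at word boundaries.
        this.pattern = Pattern.compile(
                "\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns true if the text contains the word as a whole word, ignoring case.
    boolean matches(String text) {
        return text != null && pattern.matcher(text).find();
    }
}
```

In visit you would then build the matcher once (for example as a static field created from wordToFind) and call matcher.matches(text) in place of the substring check.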