Tags: java, pattern-matching, web-crawler, crawler4j

Pattern matching in a Java web crawler using the crawler4j library


I want to implement a very simple web crawler in Java, and I have found this library: crawler4j: http://code.google.com/p/crawler4j/

I need a crawler that does the following:

Start from a URL that I specify and check whether the current page contains a specific word, such as a person's name or a company name (the word is also specified by me).

If the word is found, the URL of the current page has to be saved in a database.

So there is no semantic analysis, only syntactic analysis: the crawler just has to match the web page content against a token that I specify.

I would like to know whether this token search (checking if a word is contained in the current page) is a feature already implemented by the abstract class WebCrawler of crawler4j, or whether I have to implement it myself.


Solution

  • As noted by user1887511, this is dead simple to implement yourself by overriding visit(Page) in your WebCrawler subclass. Adapted from here.

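      // Word to search for in each crawled page
      static String wordToFind = "...";

      @Override
      public void visit(Page page) {
          // Only HTML pages carry extractable text
          if (page.getParseData() instanceof HtmlParseData) {
              HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
              String text = htmlParseData.getText();
              // Plain, case-sensitive substring match
              if (text.contains(wordToFind)) {
                  // saveToDB is your own helper, not part of crawler4j
                  saveToDB(page.getWebURL().getURL());
              }
          }
      }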
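
The substring check above is case-sensitive and also fires when the word is only part of a longer word. If that matters for names or company names, the matching step could be pulled into a small helper, roughly like the sketch below. This is only an illustration under assumptions: TokenMatcher and its matches method are hypothetical names, not part of crawler4j, and the "whole word" rule used here (no letter or digit directly before or after the token) is one possible choice among several.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j: case-insensitive,
// whole-token matching against the plain text of a page.
final class TokenMatcher {
    private final Pattern pattern;

    TokenMatcher(String token) {
        // Pattern.quote makes the token match literally (so "C++" is safe).
        // The lookarounds reject matches that are glued to a letter or digit.
        this.pattern = Pattern.compile(
                "(?<![\\p{L}\\p{N}])" + Pattern.quote(token) + "(?![\\p{L}\\p{N}])",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Returns true if the token occurs somewhere in pageText as a whole token.
    boolean matches(String pageText) {
        return pageText != null && pattern.matcher(pageText).find();
    }
}
```

Inside visit, the condition would then become something like matcher.matches(htmlParseData.getText()), with matcher created once from wordToFind rather than on every page.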