java summarization boilerpipe

How to get the main content of an article from HTML using boilerpipe?


I am trying to extract the main content of an article from an HTML page using boilerpipe.

I downloaded the latest jars from here.

I am trying to use the following code:

String article = "";
try {
    article = ArticleExtractor.INSTANCE.getText(url);   
    System.out.println("Article ++++ >>" + article);    
} catch (BoilerpipeProcessingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

But this returns an empty string for every URL. Can anyone help me with this?


Solution

  • Have you tried passing the HTML itself instead of the URL? There may also be a problem with how your URL strings are formatted.
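One way to try the "pass the HTML itself" suggestion is to download the page yourself and hand the resulting string to boilerpipe. As far as I know, `getText` has overloads for both a `String` (treated as HTML markup) and a `java.net.URL`, so if `url` in the question is a `String`, the extractor may be parsing the address text as if it were the page, which would explain the near-empty result. The sketch below uses only the JDK; the names `HtmlFetcher`, `readAll`, and `fetchHtml` are made up for illustration, and the UTF-8 charset and the `User-Agent` header are assumptions that may not suit every site. The boilerpipe call is left as a comment so the snippet compiles without the jars.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of boilerpipe.
final class HtmlFetcher {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    // Downloads the page at the given address.
    // Assumption: the page is UTF-8 encoded; a real client should check the response headers.
    static String fetchHtml(String address) throws IOException {
        URLConnection conn = new URL(address).openConnection();
        // Assumption: some servers reject requests without a browser-like User-Agent.
        conn.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (InputStream in = conn.getInputStream()) {
            return readAll(in, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java HtmlFetcher <url>");
            return;
        }
        String html = fetchHtml(args[0]);
        System.out.println("Fetched " + html.length() + " characters");
        // With the boilerpipe jars on the classpath, pass the markup itself:
        // String article = ArticleExtractor.INSTANCE.getText(html);
        // Or pass a java.net.URL rather than a String:
        // String article = ArticleExtractor.INSTANCE.getText(new URL(args[0]));
    }
}
```

If the fetched length is zero or very small, the problem is in the download step (bad address, redirect, blocked request) rather than in boilerpipe.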