jsoup html-to-text

Transform TXT containing HTML to plain text


I am trying to find a tool that converts a TXT file containing HTML to plain text while keeping the formatting, such as lists.

I have been able to find this http://jsoup.org/apidocs/org/jsoup/examples/HtmlToPlainText.html which works perfectly. The only problem is that it reads a URL, not a file. I tried making some changes to the code, but without success.

Can someone point me in the right direction on how to have it read my TXT file as input?


Solution

  • You can start by looking at the source code of the example program: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java

    It is pretty easy to load the HTML from a file instead of a URL, because jsoup can parse a string directly.

    Example

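    import java.io.File;
    import java.util.Scanner;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    String fileName = "YOURFILE.htm";
    String content;
    // try-with-resources closes the scanner even if reading throws
    try (Scanner scanner = new Scanner(new File(fileName))) {
        // "\\A" matches start of input, so next() returns the whole file
        content = scanner.useDelimiter("\\A").next();
    }
    Document doc = Jsoup.parse(content);
    // do whatever with the jsoup document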
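
As an alternative to the Scanner approach, the file can also be read in one call with java.nio.file.Files. The following is only a minimal sketch: the class name HtmlFileReader and the method readAll are hypothetical helpers (not part of jsoup), and it assumes the file is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of jsoup: reads a whole file into a String.
final class HtmlFileReader {

    // Assumes the file is UTF-8 encoded; change the charset if yours differs.
    static String readAll(Path path) {
        try {
            byte[] bytes = Files.readAllBytes(path);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java HtmlFileReader <file>");
            return;
        }
        String content = readAll(Paths.get(args[0]));
        // Pass content to Jsoup.parse(content), as in the example above.
        System.out.println(content.length() + " characters read");
    }
}
```

The returned string can be handed to Jsoup.parse(content) exactly as before. jsoup also has a Jsoup.parse(File, String charsetName) overload that reads the file itself, which may be simpler if you do not need the raw string.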