I am trying to find a tool that converts a TXT file containing HTML into plain text while keeping the formatting, such as lists and so on.
I found this example, http://jsoup.org/apidocs/org/jsoup/examples/HtmlToPlainText.html, which works perfectly. The only problem is that it reads from a URL, not from a file. I tried making some changes to the code, but without success.
Can someone point me in the right direction on how to have it read my TXT file as input?
You can start by looking at the source code of the example program: https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/examples/HtmlToPlainText.java
Loading the HTML from a file instead of a URL is straightforward: read the file into a string, then let jsoup parse that string.
Example
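import java.io.File;
import java.util.Scanner;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String fileName = "YOURFILE.htm";
String content;
try (Scanner scanner = new Scanner(new File(fileName))) { // closed automatically, even on error
    content = scanner.useDelimiter("\\A").next(); // "\\A" = start of input, so next() returns the whole file
}
Document doc = Jsoup.parse(content);
// do whatever you need with the jsoup Document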
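If you would rather not depend on Scanner and the platform default encoding, the file can also be read with java.nio. This is only a minimal sketch: the class name HtmlFileLoader and the method readHtml are made up for illustration, and it assumes the file is encoded as UTF-8.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class HtmlFileLoader {
    // Hypothetical helper: reads the whole file into a String.
    // Assumption: the file is UTF-8; change the charset if yours differs.
    static String readHtml(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Take the file name from the command line, or fall back to a placeholder.
        Path file = Paths.get(args.length > 0 ? args[0] : "YOURFILE.txt");
        String html = readHtml(file);
        System.out.println("Read " + html.length() + " characters from " + file);
    }
}
```

The returned string can be passed to Jsoup.parse(html) exactly as in the snippet above. jsoup also offers a Jsoup.parse(File, String charsetName) overload that reads and decodes the file for you, which may be the shorter route if you do not need the raw string.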