My application downloads a certain website as HTML file the first time it is started. The HTML file is very messy ofcourse, so I want to clean it with HtmlCleaner, so that I can then parse it with Jsoup. But how do I get a new cleaned html item after it was cleaned?
I did some research and this is all i could find:
HtmlCleaner htmlCleaner = new HtmlCleaner();
TagNode root = htmlCleaner.clean(url);
HtmlCleaner.getInnerHtml(root);
String html = "<" + root.getName() + ">" + htmlCleaner.getInnerHtml(root) + "</" + root.getName() + ">";
But I can't see where in this code does it write to a new file? If it doesn't, how do I implement it so that the old file will be deleted and the new cleaned html file will be created?
you can do something like following:
HtmlCleaner cleaner = new HtmlCleaner();
final String siteUrl = "http://www.themoscowtimes.com/";
TagNode node = cleaner.clean(new URL(siteUrl));
// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
node , "cleaned.xml", "utf-8"
);
or
// serialize to html file
SimpleHtmlSerializer serializer = new SimpleHtmlSerializer(htmlCleaner.getProperties());
serializer.writeToFile(node, "c:/temp/cleaned.html");