I want to remove ONLY the HTML tags from text with Jsoup. I used the solution from here (my previous question about Jsoup). After some testing, however, I found that Jsoup throws a Java heap OutOfMemoryError on some large HTML documents, though not on all of them. For example, it fails on a 2 MB HTML file with 10,000 lines. The exception is thrown on the last line of the method (NOT on Jsoup.parse):
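    public String StripHtml(String html) {
        // Unescape entity-encoded angle brackets so they are parsed as tags
        html = html.replace("&lt;", "<").replace("&gt;", ">");
        // getAllStandardHtmlTags() is a helper (not shown) returning the standard tag names
        String[] tags = getAllStandardHtmlTags();
        Document thing = Jsoup.parse(html);
        for (String tag : tags) {
            for (Element elem : thing.getElementsByTag(tag)) {
                // Move the element's children up to its parent, then drop the element itself
                elem.parent().insertChildren(elem.siblingIndex(), elem.childNodes());
                elem.remove();
            }
        }
        return thing.html(); // OutOfMemoryError is thrown here
    }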
Is there a way to fix it?
After a lot of searching on Google and several attempts to implement an HTML stripper myself, my solution is to use Solr's HTMLStripCharFilter class, modified so that escapedTags is replaced by a blacklist of the standard HTML tags.
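For anyone who wants to see the blacklist idea without pulling in Solr, here is a minimal sketch using only java.util.regex. It is NOT the HTMLStripCharFilter code and not a real HTML parser: the class name TagStripper, the method strip, and the abbreviated tag list are all hypothetical, and it assumes tags contain no '<' or '>' inside attribute values. It makes a single pass over the input and builds no DOM, so memory use stays proportional to the input size.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Locale;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class TagStripper {
        // Hypothetical, abbreviated blacklist; a real one would list every standard HTML tag.
        private static final Set<String> STANDARD_TAGS = new HashSet<>(Arrays.asList(
                "html", "head", "body", "div", "span", "p", "a", "b", "i", "br",
                "table", "tr", "td"));

        // Matches an opening, closing, or self-closing tag and captures its name.
        // Assumption: attribute values never contain '<' or '>'.
        private static final Pattern TAG =
                Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)\\b[^<>]*>");

        static String strip(String html) {
            Matcher m = TAG.matcher(html);
            StringBuilder out = new StringBuilder(html.length());
            int last = 0;
            while (m.find()) {
                // Copy the text between the previous tag and this one unchanged.
                out.append(html, last, m.start());
                String name = m.group(1).toLowerCase(Locale.ROOT);
                if (!STANDARD_TAGS.contains(name)) {
                    // Not a blacklisted tag: keep it verbatim.
                    out.append(html, m.start(), m.end());
                }
                last = m.end();
            }
            // Copy whatever follows the last tag.
            out.append(html, last, html.length());
            return out.toString();
        }
    }

Calling TagStripper.strip("<p>Hello <b>world</b></p>") gives back the text without the p and b tags, while something like <custom> is left alone because it is not in the blacklist. Comments, script bodies, and text such as x<b && c>d are not handled correctly by this sketch, which is one reason a dedicated filter like HTMLStripCharFilter is the safer choice.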