I'm new to jsoup and I'm having some difficulty working with non-HTML elements (scripts). I have the following HTML:
<$if not dcSnippet$>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>
<$endif$>
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
<$if not dcSnippet$>
</body>
</html>
<$endif$>
The application used to display this knows what to do with those <$if not dcSnippet$> and <$endif$> statements. However, when I simply parse the text with jsoup, the < and > are encoded as &lt; and &gt; and the HTML is reorganized, so it no longer executes or displays properly. Like so:
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>&lt;$if not dcSnippet$&gt;
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
&lt;$endif$&gt;
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
&lt;$if not dcSnippet$&gt;
&lt;$endif$&gt;
</body></html>
My end goal is to add some CSS and JS includes and modify a couple of the element attributes. That part isn't really a problem; I have that much worked out. The problem is I don't know how to preserve the non-HTML elements and keep the formatting in the same place as the original. My solution so far is to strip the non-HTML lines before parsing, make my changes, and then put those lines back as text afterwards.
This works for now, as long as the placement of the non-HTML is predictable, and so far it is. But I want to know if there's a better way to do this so I don't have to 'clean' the HTML first, then manually re-introduce what I removed later. Here's the gist of my code (hopefully I didn't miss too many declarations):
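import java.io.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Entities.EscapeMode;

// Directive lines are set aside here; only the last match of each kind is kept.
String thisLine;
String ifStatement = "";
String endifStatement = "";
StringBuilder tempHtml = new StringBuilder();

// Read the file, separating the <$if ...$> / <$endif$> lines from the HTML.
BufferedReader br = new BufferedReader(new FileReader(inputFile));
while ((thisLine = br.readLine()) != null) {
    if (thisLine.matches(".*<\\$if.*\\$>")) {
        ifStatement = thisLine + "\n";
    } else if (thisLine.matches(".*<\\$endif\\$>")) {
        endifStatement = thisLine + "\n";
    } else {
        tempHtml.append(thisLine).append("\n");
    }
}
br.close();

// Jsoup.parse(String, String) takes a base URI as its second argument, not a charset.
Document doc = Jsoup.parse(tempHtml.toString());
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);
Element head = doc.head();
Element body = doc.body();
Element firstDiv = body.select("div").first();

[... perform my element and attribute inserts ...]

// Re-insert the inner directives as text; jsoup escapes them on output.
body.prependText("\n" + endifStatement);
body.appendText("\n" + ifStatement);

// Wrap the document in the outer directives and undo the escaping of < and >.
String fullHtml = ifStatement
        + doc.toString().replace("&lt;", "<").replace("&gt;", ">")
        + "\n" + endifStatement;

BufferedWriter htmlWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
htmlWriter.write(fullHtml);
htmlWriter.flush();
htmlWriter.close();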
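For reference, the line-filtering step on its own could look roughly like the sketch below. It uses only the standard library, and it keeps every directive line in document order instead of overwriting a single variable. The class name DirectiveSplitter and its fields are made up for illustration, and the pattern assumes each directive sits alone on its line, as in the sample file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: separates template directive lines from the HTML lines. */
final class DirectiveSplitter {
    // A line made up solely of <$if ...$> or <$endif$> (surrounding whitespace allowed).
    private static final Pattern DIRECTIVE =
            Pattern.compile("\\s*<\\$(if\\s.*|endif)\\$>\\s*");

    final List<String> directives = new ArrayList<>(); // directive lines, in document order
    final StringBuilder html = new StringBuilder();    // every other line, newline-terminated

    static DirectiveSplitter split(List<String> lines) {
        DirectiveSplitter result = new DirectiveSplitter();
        for (String line : lines) {
            if (DIRECTIVE.matcher(line).matches()) {
                result.directives.add(line);
            } else {
                result.html.append(line).append('\n');
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DirectiveSplitter s = split(List.of(
                "<$if not dcSnippet$>",
                "<html>",
                "<$endif$>",
                "<div>text</div>"));
        System.out.println(s.directives.size()); // prints 2
        System.out.print(s.html);                // the two HTML lines
    }
}
```

The html buffer is what would be handed to Jsoup.parse, and the directives list is what would have to be put back afterwards. It still relies on the placement of the directives being predictable.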
Thanks so much for any help or input!
The problem is I don't know how to preserve the non-HTML elements and keep the formatting in the same place as the original.
Jsoup is an HTML parser. The "HTML file" you give it doesn't contain HTML. It's more a template file written in an HTML-like language.
As a result, Jsoup will consider this template file as an invalid HTML file at best. This is why all non-HTML elements get escaped.
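To make the escaping less mysterious: under the HTML tokenization rules, a < only opens markup when it is followed by a letter, /, ! or ?. In <$if ...$> the < is followed by $, so the whole directive is read as ordinary text, and text is escaped when the document is written back out. The following is a toy illustration of that rule in plain Java, not jsoup's actual code; the class and method names are invented for the example.

```java
/** Toy model of why "<$if ...$>" ends up as text. Not jsoup's implementation. */
final class TagOpenDemo {

    /** True if the '<' at index i can begin markup (a tag, comment, or declaration). */
    static boolean startsMarkup(String s, int i) {
        if (s.charAt(i) != '<' || i + 1 >= s.length()) {
            return false;
        }
        char next = s.charAt(i + 1);
        return next == '/' || next == '!' || next == '?'
                || (next >= 'a' && next <= 'z')
                || (next >= 'A' && next <= 'Z');
    }

    /** Escapes text content the way an HTML serializer typically does. */
    static String escapeText(String text) {
        // '&' must be handled first so the entities added below are not escaped again.
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        String directive = "<$if not dcSnippet$>";
        System.out.println(startsMarkup(directive, 0)); // prints false
        System.out.println(startsMarkup("<div>", 0));   // prints true
        System.out.println(escapeText(directive));      // prints &lt;$if not dcSnippet$&gt;
    }
}
```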
To achieve what you need, you would have to write your own custom template parser. Jsoup does provide some generic classes that would make this task quite easy.
However, by design, those generic classes are reserved for internal use only.
This leaves us with four options: