I am using JTidy to process XHTML documents, and I now have one containing a <video> element, which JTidy strips out. Here is the code:
import org.w3c.dom.Node;
import org.w3c.tidy.Tidy;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import static java.nio.charset.StandardCharsets.UTF_8;

public class Test {
    public static void main(String[] args) throws Exception {
        // Set up a JTidy instance
        Tidy tidy = new Tidy();
        tidy.setInputEncoding("UTF8");

        // The following make no difference to the output
        // whether they are present or not, or whether the
        // parameters are changed from true to false or vice versa
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        tidy.setXHTML(true);
        tidy.setDropEmptyParas(false);
        tidy.setTrimEmptyElements(false);

        // Process XHTML from a string
        String xml = "<div>\n"
                + " Video goes here:<br/>\n"
                + " <video width='640' height='480'>\n"
                + " <source src='foo.mp4' type='video/mp4'/>\n"
                + " </video>\n"
                + "</div>";
        byte[] bytes = xml.getBytes(UTF_8);
        InputStream in = new ByteArrayInputStream(bytes);
        Node node = tidy.parseDOM(in, null).getDocumentElement();

        // Display the resulting Node as a sanity check
        tidy.pprint(node, System.out);
    }
}
For the example HTML fragment used in the above code, the relevant part of the output is this:
<div>Video goes here:<br /> </div>
I have been told (below) that <video> is an HTML5 tag which is not valid in XHTML (?), so I have tried using tidy.setXHTML(false) and it makes no difference. I have tried adding <!DOCTYPE html> at the start. I have tried removing all the tidy.setXXX() configuration calls. None of these things (in any combination) makes any difference. The only thing that works is to use <embed> instead of <video>, but (a) this is deprecated, (b) I have to replace the <video> tag with <embed> before I parse it, and (c) it doesn't have all the features that <video> does.
So, what can I do to parse a document which contains a video?
Is this an XHTML problem, or just a problem with JTidy, and if the latter is there an alternative I can use?
Or is there a table somewhere of allowed tags for JTidy that I can patch?
And if so, do I need to add all the new HTML5 tags to this table?
Any advice gratefully received...
Good news: I believe you just need to update the version of JTidy you're using.
The first hit I found when looking for JTidy was the old SourceForge site, where the latest version (r938) was released in 2009. With that version, I can reproduce your problem - so I suspect that's the version you're using.
However, there's a GitHub repository which is more up-to-date (last commit in June 2024). The latest release there is version 1.0.5, released in September 2023... and with your exact code, the warning goes away and the <video> tag is preserved (whether you have setXHTML(true) or setXHTML(false), interestingly).
So basically, update to 1.0.5 and that should fix the problem.
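If you want to confirm which JTidy build your program is actually picking up before and after the upgrade, something like the sketch below may help. It only uses the standard library and looks up org.w3c.tidy.Tidy by name, so it also runs (and says so) when JTidy is not on the classpath at all. The class name JTidyVersionCheck and the helper describe are made up for illustration, and whether the JTidy jar declares an Implementation-Version in its manifest is an assumption I have not verified, which is why the code falls back to "not declared". As far as I know, the newer builds are published on Maven Central as com.github.jtidy:jtidy, while the old SourceForge build is net.sf.jtidy:jtidy:r938, but please check the coordinates yourself.

```java
import java.security.CodeSource;

public class JTidyVersionCheck {

    /**
     * Hypothetical helper: reports where a class was loaded from and which
     * version its jar manifest declares, if any.
     */
    static String describe(String className) {
        try {
            Class<?> cls = Class.forName(className);

            // Implementation-Version comes from the jar manifest and may be absent.
            Package pkg = cls.getPackage();
            String version = (pkg == null) ? null : pkg.getImplementationVersion();

            // The code source is null for classes loaded by the bootstrap loader.
            CodeSource src = cls.getProtectionDomain().getCodeSource();
            String location = (src == null || src.getLocation() == null)
                    ? "unknown location"
                    : src.getLocation().toString();

            return className + " loaded from " + location
                    + ", manifest version: "
                    + (version == null ? "not declared" : version);
        } catch (ClassNotFoundException e) {
            return className + " is not on the classpath";
        }
    }

    public static void main(String[] args) {
        // Assumes the JTidy entry point is org.w3c.tidy.Tidy, as in the question.
        System.out.println(describe("org.w3c.tidy.Tidy"));
    }
}
```

The jar file name in the reported location usually carries the version, so if it still points at an r938 jar after you change the dependency, the old build is probably being pulled in from somewhere else on the classpath.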