So I've got a large number of XML files. For years they've caused trouble because the people who write them do so by hand, so errors naturally creep in. It's high time we got around to validating them and giving feedback on what's wrong when someone tries to use these XML files.
I'm using the SAX parser and getting a list of errors.
Below is my code:
BookValidationErrorHandler errorHandler = new BookValidationErrorHandler();
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);
factory.setNamespaceAware(true);
SchemaFactory schemaFactory =
    SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
factory.setSchema(schemaFactory.newSchema(
    new Source[] {new StreamSource("test.xsd")}));
javax.xml.parsers.SAXParser parser = factory.newSAXParser();
org.xml.sax.XMLReader reader = parser.getXMLReader();
reader.setErrorHandler(errorHandler);
reader.parse(new InputSource("bad.xml"));
The first couple of errors are always:
Line Number: 2: Document is invalid: no grammar found.
Line Number: 2: Document root element "credits", must match DOCTYPE root "null".
We can't possibly go and edit the thousands of XML files that need to be checked.
Is there anything I can easily add to the front of the source to prevent this? Is there a way to tell the parser to ignore these DTD-related errors? I'm not even sure what the grammar one means, though I sort of understand the second one.
Setting setValidating(true) requests DTD validation and causes a failure if no DTD exists. If you only want schema validation and not DTD validation, then use setValidating(false). From the Javadoc for setValidating():
To use modern schema languages such as W3C XML Schema or RELAX NG instead of DTD, you can configure your parser to be a non-validating parser by leaving the setValidating(boolean) method false, then use the setSchema(Schema) method to associate a schema to a parser.
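For illustration, here is a minimal sketch of the question's setup with that one change applied. It is not the asker's actual code: the inline XSD string stands in for test.xsd, and CollectingErrorHandler is a hypothetical replacement for BookValidationErrorHandler that just records each report. The input is read from strings so the example runs on its own; with real files you would pass the file names to StreamSource and InputSource as in the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

class SchemaOnlyValidation {

    // Hypothetical inline schema standing in for test.xsd.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='credits'><xs:complexType><xs:sequence>"
      + "<xs:element name='name' type='xs:string' maxOccurs='unbounded'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Hypothetical stand-in for BookValidationErrorHandler: records every report.
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<>();

        private void note(String level, SAXParseException e) {
            messages.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
        }

        @Override public void warning(SAXParseException e) { note("warning", e); }
        @Override public void error(SAXParseException e) { note("error", e); }
        @Override public void fatalError(SAXParseException e) { note("fatal", e); }
    }

    // Validates xml against xsd and returns the collected messages.
    static List<String> validate(String xsd, String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);      // the fix: do not request DTD validation
        factory.setNamespaceAware(true);

        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setSchema(
            schemaFactory.newSchema(new StreamSource(new StringReader(xsd))));

        CollectingErrorHandler handler = new CollectingErrorHandler();
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.messages;
    }

    public static void main(String[] args) throws Exception {
        // Misspelled child element, and no DOCTYPE anywhere in the document.
        String badXml = "<credits><nmae>Ann</nmae></credits>";
        for (String message : validate(XSD, badXml)) {
            System.out.println(message);
        }
    }
}
```

With validation left off, the "grammar" the parser was complaining about (a DTD) is no longer required, so the only reports that reach the error handler should be the schema violations themselves, and the XML files can stay as they are.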