I am looking at brainstorming my problem here, not sure if this will trigger loads of shut down or not!
Simplified: I have a system that reads an XML file and loads it into a database.
The XML has a schema with the following:
<?XML version="1.0" encoding="UTF-8"?>
The culprit field has the following schema excerpt:
<xsd:simpleType name="title">
.....
<xsd:restriction base="xsd:string">
<xsd:minLength value="1"/>
<xsd:maxLength value="2000"/>
</xsd:restriction>
The schema is UTF-8 compliant, so should support 2000 UTF-8 characters whether they are single or double-byte or multiple bytes.
The XML schema already does a character length check, as defined in the excerpt above.
The problem is sometimes the XSD validates successfully, but the database insert fails, crashes the server with DB error when some multi-byte UTF-8 characters occur in the 'title' XML field.
The database 'title' column is defined as `varchar(2000)`
When the database insert operation fails, ops need to manually reduce the length of the XML field and re-process XML file to fix it.
I have been researching about :
Could the solution be doing a string byte count check which matches the character count?
I can do a string.getBytes("UTF-8").length in Java, but how would that match the <xsd:maxLength value="2000"/>
in the XSD and the varchar(2000)
?
What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in XSD. And that the XML data is successfully inserted into the DB as long as XSD is conformed to?
Am I right in assuming a <xsd:maxLength value="2000"/>
in the XSD matches the varchar(2000)
column definition ?
The schema is UTF-8 compliant
Not exactly, but I think I know what you mean. The XML declaration that you quoted is not specifying anything about the XML instance documents that match this schema. It is simply saying that the XSD itself (i.e. the XML document with root tag <xs:schema>) uses UTF-8 as its character encoding.
XML Schema never concerns itself with the raw bytes of the XML document. It is the XML info set that is being validated. So the maxLength facet on the simple type is saying that you can have up to 2000 characters in this field. As you rightly point out, the actual length in bytes could easily exceed 2000 characters, but the XML processor will not know or care.
sometimes the XSD validates successfully, but the database insert fails
I agree with lunatikz - the most likely explanation is that the DB is incorrectly configured.
Could the solution be doing a string byte count check which matches the character count?
No, that would be fixing the wrong problem. The problem is probably in the database, not in your Java code.
What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in XSD.
I don't think you need to do anything to ensure that. Your XML validator is already checking that for you, and it's probably working just fine.
And that the XML data is successfully inserted into the DB as long as XSD is conformed to?
Configure the DB or its table/column definition so that it stops trying to interpret the input using a single-byte character encoding.
Am I right in assuming a <xsd:maxLength value="2000"/> in the XSD matches the varchar(2000) column definition ?
Yes, both are specifying a field with up to 2000 characters. But the database interprets the word 'character' in a different way from the XML processor.