c++ xml-parsing libxml2 xml-formatting

libxml2 does not pretty-print if the XML contains whitespace?


I have the following sample XML. If I feed it to libxml2 without any formatting or whitespace between the elements, it pretty-prints fine when I call xmlNodeDump() with the format argument set to 1:

const char *xml= "<root><a>test</a></root>";

However, if I preformat it or put whitespace between the elements, libxml2 does not pretty-print it. For example:

const char *xml =
"<root>"
"  <a>"
"    test"
"  </a>"
"</root>";

Then I call the function to read it like this:

#define MY_PARSER_OPTIONS (XML_PARSE_RECOVER | XML_PARSE_NOENT | XML_PARSE_DTDLOAD | XML_PARSE_DTDATTR | XML_PARSE_HUGE)
...

doc = xmlReadDoc((const xmlChar *) xml, NULL, NULL, MY_PARSER_OPTIONS);
...
xmlNodePtr root = xmlDocGetRootElement(doc);
xmlBufferPtr buf = xmlBufferCreate();
xmlNodeDump(buf, doc, root, 0, 1);

The output is not formatted.

When the input contains spaces, the output is:

<root>  <a>    test  </a></root>

When the input contains no spaces, the output is:

<root>
  <a>test</a>
</root>

Is this a bug in libxml2? How can I get it to pretty-print/format correctly? Neither input produces an error.

UPDATE:

With minimal reproducible example:

https://github.com/totszwai/libxml2-troubleshoot1

As the example shows, when the input contains whitespace between elements, libxml2 does not reformat it.



Solution

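  • Pass the XML_PARSE_NOBLANKS option when calling xmlReadDoc(). Without it, the whitespace between elements is kept as text nodes, and the serializer does not re-indent around them. Per the libxml2 documentation: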

    Remove some text nodes containing only whitespace from the result document. Which nodes are removed depends on DTD element declarations or a conservative heuristic. The reindenting feature of the serialization code relies on this option to be set when parsing.