c++ xml xerces-c

Xerces-c SaxParser issues


I am using xerces-c to parse an XML file but I am getting some strange results.

I create my own DocumentHandler (derived from HandlerBase) and override:

void characters(const XMLCh* const chars, const unsigned int length);

This way I receive notification of character data inside an element.

To parse a file I create a parser, create an input source, create my handler, and call parse.

SAXParser* lp_parser = new SAXParser();

XMLCh* lp_fileName = XMLString::transcode("myfile.xml");
LocalFileInputSource l_fileBuf(lp_fileName);
XMLString::release(&lp_fileName);

MyHandler l_handler;

lp_parser->setDocumentHandler((DocumentHandler *)&l_handler);

lp_parser->parse(l_fileBuf);

delete lp_parser;

The problem is that characters([...]) is not only called with the actual character data. For each tag it is also called (sometimes several times) with a run of spaces and a newline as the character data.

For example, <Tag>Value</Tag> yields at least two calls to characters([...]): one where the data is 'Value' and one or more where the data is something like '     \n                     '

The XML file itself doesn't contain these characters. I have used xerces-c to parse XML like this many times without any problems, although this is the first time I have used a LocalFileInputSource (I usually use a MemBufInputSource).

Any ideas?


Solution

  • I had a similar problem with SAX2XMLReader. What I understood is that with SAX parsers it is up to the developer to keep track of where the parser currently is in the XML structure.

    These additional calls to characters() may be for other elements in the file, or for ignorable whitespace between elements, such as the indentation and newlines used to format the file.

    Depending on the length of the data, characters() may also be called several times for the same element, and it is up to you to concatenate the data you receive on each call.

    So what I would do is detect the start and end of <Tag> with the startElement() and endElement() callbacks. That way you can discard any calls to characters() that arrive after you have received endElement() for your tag.
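
The bookkeeping described above could be sketched roughly as follows. This is only an illustration of the idea, not Xerces-C code: TagTextCollector and demoCollect are hypothetical names, and the callbacks take std::string instead of XMLCh* so that the example compiles without the library. In a real handler you would put the same logic into your HandlerBase overrides and transcode the XMLCh data yourself.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a SAX handler. The method names mirror the SAX
// callbacks, but the std::string parameters are an assumption made to keep
// this sketch independent of Xerces-C types.
class TagTextCollector {
public:
    explicit TagTextCollector(std::string tagName) : target_(std::move(tagName)) {
        assert(!target_.empty());  // a tag name to watch for is required
    }

    // Start buffering when the element we care about opens.
    void startElement(const std::string& name) {
        if (name == target_) {
            inside_ = true;
            buffer_.clear();
        }
    }

    // May be called several times per element, so append rather than assign.
    // Anything delivered outside the target element (for example the
    // whitespace between tags) is dropped.
    void characters(const std::string& chunk) {
        if (inside_) {
            buffer_ += chunk;
        }
    }

    // The text is only complete once the element closes.
    void endElement(const std::string& name) {
        if (inside_ && name == target_) {
            values_.push_back(buffer_);
            inside_ = false;
        }
    }

    const std::vector<std::string>& values() const { return values_; }

private:
    std::string target_;
    std::string buffer_;
    std::vector<std::string> values_;
    bool inside_ = false;
};

// Simulates the callback sequence a parser might produce for an indented
// <Root><Tag>Value</Tag></Root> document, with "Value" split into two chunks.
std::vector<std::string> demoCollect() {
    TagTextCollector handler("Tag");
    handler.startElement("Root");
    handler.characters("\n    ");   // indentation before <Tag>
    handler.startElement("Tag");
    handler.characters("Val");      // first chunk of the text
    handler.characters("ue");       // second chunk of the same text
    handler.endElement("Tag");
    handler.characters("\n");       // newline after </Tag>
    handler.endElement("Root");
    return handler.values();
}
```

With this approach the whitespace-only calls never reach your data, because they arrive while the handler is not inside <Tag>. Note that this simple version would also collect text from any child elements nested inside the target element; if that matters you would need to track nesting depth as well.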