common-lisp stp

How to parse XML with cxml and stp containing ampersand


I want to parse the following XML-Code:

(cxml:parse "<BEGIN><URL>www.some.de/url?some=data&bad=stuff</URL></BEGIN>" (stp:make-builder))

This results in:

 #<CXML:WELL-FORMEDNESS-VIOLATION "~A" {1003C5E163}>

because '&' is an XML special character. But if I use &amp; instead, the result is:

(cxml:parse "<BEGIN><URL>www.some.de/url?some=data&amp;bad=stuff</URL></BEGIN>" (stp:make-builder))
=>#.(CXML-STP-IMPL::DOCUMENT
   :CHILDREN '(#.(CXML-STP:ELEMENT
                  #| :PARENT of type DOCUMENT |#
                  :CHILDREN '(#.(CXML-STP:ELEMENT
                                 #| :PARENT of type ELEMENT |#
                                 :CHILDREN '(#.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "www.some.de/url?some=data")
                                             #.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "&")
                                             #.(CXML-STP:TEXT
                                                #| :PARENT of type ELEMENT |#
                                                :DATA "bad=stuff"))
                                 :LOCAL-NAME "URL"))
                  :LOCAL-NAME "BEGIN")))

This is not what I expected: there should be only one CXML-STP:TEXT child, with DATA "www.some.de/url?some=data&bad=stuff".

How can I fix this wrong(?) behavior?


Solution

  • This behavior, although not very convenient, is present in many other XML parsers as well. The reason is probably to allow parsing arbitrary XML entities and applying user-defined rules to them, although it may simply be a by-product of the parser implementation. I have not been able to confirm which.

    For the SAX variant of the parser I came to the following approach:

    ;; Note: SWITCH is the macro from the Alexandria library.
    (defclass my-sax (sax:sax-parser-mixin)
      ((title :accessor title :initform nil)
       (tag :accessor tag :initform nil)
       (text :accessor text :initform "")))

    (defmethod sax:start-element ((sax my-sax) namespace-uri local-name
                                  qname attributes)
      (declare (ignore namespace-uri qname attributes))
      ;; Remember the current element and start with an empty text buffer.
      (with-slots (tag text) sax
        (setf tag local-name
              text "")))

    (defmethod sax:characters ((sax my-sax) data)
      ;; One run of text may arrive in several chunks, so append each chunk.
      (with-slots (title tag text) sax
        (switch (tag :test 'string=)
          ("text"  (setf text (concatenate 'string text data)))
          ("title" (setf title data)))))

    (defmethod sax:end-element ((sax my-sax) namespace-uri local-name qname)
      (declare (ignore namespace-uri qname))
      (with-slots (title tag text) sax
        (when (string= "text" local-name)
          ;; The element is complete: process (text sax) here.
          )))
    

    That is, I collect the text in sax:characters and process it in sax:end-element. In STP you can probably get away with even less by simply concatenating the neighboring text nodes.
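
    For the STP case, here is a minimal sketch of that concatenation, assuming cxml and cxml-stp are loaded (for example with (ql:quickload '(:cxml :cxml-stp))). The helper name url-text is made up for illustration. It relies on stp:string-value, which returns the joined data of all text nodes below a node, so the split into three text children no longer matters:

    ```lisp
    ;; Hypothetical helper, not part of cxml-stp.
    (defun url-text (xml-string)
      "Parse XML-STRING and return the full text of the root's URL child."
      (let* ((document (cxml:parse xml-string (stp:make-builder)))
             (root (stp:document-element document))
             ;; Find the first child element named URL (no namespace).
             (url (stp:find-child-if (stp:of-name "URL") root)))
        ;; STRING-VALUE concatenates the data of every text node under URL.
        (stp:string-value url)))

    (url-text "<BEGIN><URL>www.some.de/url?some=data&amp;bad=stuff</URL></BEGIN>")
    ;; expected: "www.some.de/url?some=data&bad=stuff"
    ```

    This leaves the tree itself unchanged. If you need a single text node in the tree, you would still have to merge the children yourself.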