Tags: xml, scala, apache-spark, xml-parsing

How to determine invalid XML strings in Apache Spark and Scala


I have a user-defined function that looks like this:

  import org.apache.spark.sql.functions.udf

  case class bodyresults(text: String, code: String)

  val bodyudf = udf { (body: String) =>
    // Wrap the fragment in an explicit <body> tag and declare the HTML
    // entities that appear in the input, then parse it as XML
    val xmlElems = scala.xml.XML.loadString(
      s"""<?xml version="1.0" encoding="utf-8"?>
         |<!DOCTYPE body [<!ENTITY nbsp "&#160;"> <!ENTITY ndash "&#8211;"> <!ENTITY mdash "&#8212;">]>
         |<body>${body}</body>""".stripMargin)

    // Extract the contents of the <code> elements inside the body
    val code = (xmlElems \\ "body" \\ "code").text
    // The remaining body text, with the code removed
    val text = (xmlElems \\ "body").text.replace(code, "")
    bodyresults(text, code)
  }

This function splits the input string into a code part and a text part.
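For context, the UDF might be applied to a column like this; the DataFrame name df and its column name body are assumptions for illustration:

  import org.apache.spark.sql.functions.col

  // bodyudf returns a case class, so "parsed" is a struct column
  val parsed = df.withColumn("parsed", bodyudf(col("body")))
  val split  = parsed.select(col("parsed.text").as("text"), col("parsed.code").as("code"))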

The input strings look like this:

<p>I want to use a track-bar to change a form's opacity.</p>
<p>This is my code:</p>
 <pre><code>decimal trans = trackBar1.Value / 5000;
this.Opacity = trans;
 </code></pre>
<p>When I build the application, it gives the following error:</p>

This DataFrame column contains 100,000 records, and some of the input strings are not well-formed XML, which causes errors during parsing.

I want to check whether or not the input strings are well-formed XML.

If they are not, I want to assign code = "0" and text = "0" so that I can filter those rows out later.

I have spent a lot of time trying to achieve this with regular expressions but have not been able to.

Can someone please suggest a way to do this?


Solution

  • There is no regular expression that will distinguish well-formed XML documents from strings that are not well-formed XML documents.

    (It's a shame you spent a lot of time trying: well-formed XML permits arbitrarily deep element nesting, so the set of well-formed documents is not a regular language, and computer science theory tells you the problem is insoluble with regular expressions...)

    The practical way to achieve this is to submit the input to an XML parser, and catch the exception if the parsing fails.
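
    A minimal sketch of that approach, reusing the bodyresults case class and parsing logic from the question; the name safeBodyUdf is made up for illustration, and the "0" fallback values are the ones requested above:

    import scala.util.Try
    import org.apache.spark.sql.functions.udf

    val safeBodyUdf = udf { (body: String) =>
      Try {
        val xmlElems = scala.xml.XML.loadString(
          s"""<?xml version="1.0" encoding="utf-8"?>
             |<!DOCTYPE body [<!ENTITY nbsp "&#160;"> <!ENTITY ndash "&#8211;"> <!ENTITY mdash "&#8212;">]>
             |<body>${body}</body>""".stripMargin)
        val code = (xmlElems \\ "body" \\ "code").text
        val text = (xmlElems \\ "body").text.replace(code, "")
        bodyresults(text, code)
      }.getOrElse(bodyresults("0", "0")) // parsing failed: not well-formed XML
    }

    Try catches the parser's exception (a SAXParseException for malformed input) and falls back to the sentinel values, so rows where both fields are "0" can simply be filtered out downstream.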