rubyxmlxmppxmpp4r

Using Ruby, how can I confirm that an XML snippit is valid?


As some of you make know, I'm working on XMPP (Jabber) integration for the StackOverflow chat system, as an XMPP component written in Ruby using the xmpp4r package.

I'm struggling with one issue (well, many issues, but one issue at the moment :-) I am taking the JSON feed from the chat and extracting the HTML for the messages. I am using The Ruby TidyHTML bindings to convert the HTML from the JSON fed to XHTML, so that I can send it as an XMPP message -- since XMPP messages are just XML, converting the HTML to XHTMl should make it valid XML which I can just stick into the <message> stanza.

For most messages, it works great!

My Mind Is Blown

However for other messages, it completely chokes -- the XMPP server closes the stream and the script grinds to a halt. (And rchern and others in The Tavern get upset. Well, maybe not upset, but they laugh at me. This makes me sad!)

I am almost certain that what's happing is, for some reason or other, the messages are not valid XML, and so the XMPP server is closing the connection because it encounters a parse error in the XML stream from the Ruby component. Here's an example of one such message:

<message to='jeswah@smart-safe-secure.com/Token' type='groupchat' xmlns='jabber:client'><body>&lt;div class=&quot;onebox ob-message&quot;&gt;&lt;a class=&quot;roomname&quot; href=&quot;/transcript/message/263372#263372&quot;&gt;&lt;span title=&quot;2010-11-04 19:20:23Z&quot;&gt;1 hour ago&lt;/span&gt;&lt;/a&gt;, by &lt;span class=&quot;user-name&quot;&gt;Fosco&lt;/span&gt; &lt;br/&gt;&lt;div class=&quot;quote&quot;&gt;&lt;div class=&quot;room-mini&quot;&gt;&lt;div class=&quot;room-mini-header&quot;&gt;&lt;h3&gt;&lt;img class=&quot;small-site-logo&quot; title=&quot;Gaming&quot; alt=&quot;Gaming&quot; width=&quot;16&quot; height=&quot;16&quot; src=&quot;http://sstatic.net/gaming/img/favicon.ico&quot; /&gt;&amp;nbsp;&lt;span class=&quot;room-name&quot;&gt;&lt;a href=&quot;http://chat.stackexchange.com/rooms/28/minecraft-talk&quot;&gt;Minecraft Talk&lt;/a&gt;&lt;/span&gt;&lt;/h3&gt;&lt;div class=&quot;room-mini-description&quot;&gt;Everything Minecraft, including classic and survival mode&lt;/div&gt;&lt;/div&gt;&lt;div class=&quot;room-current-user-count&quot; title=&quot;current users&quot;&gt;9&lt;/div&gt;&lt;div class=&quot;mspark&quot; style=&quot;height:25px;width:205px&quot;&gt;
&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:13px;left:0px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:9px;left:8px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:2px;left:16px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:8px;left:24px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:1px;left:32px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:1px;left:56px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:0px;left:64px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:0px;left:88px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:0px;left:96px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:11px;left:104px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:7px;left:112px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:7px;left:120px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:25px;left:128px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:14px;left:136px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:4px;left:144px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:7px;left:152px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:19px;left:160px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:19px;left:168px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:12px;left:176px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar&quot; style=&quot;width:8px;height:11px;left:184px;&quot;&gt;&lt;/div&gt;&lt;div class=&quot;mspbar now&quot; style=&quot;height:25px;left:154px;&quot;&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;clear-both&quot;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;</body><html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><div class="onebox ob-message"><a class="roomname" href="/transcript/message/263372#263372"><span title="2010-11-04 19:20:23Z">1 hour ago</span></a>, by <span class="user-name">Fosco</span><br />
<div class="quote">
<div class="room-mini"><div class="room-mini-header">
<h3><img class="small-site-logo" title="Gaming" alt="Gaming" width="16" height="16" src="http://sstatic.net/gaming/img/favicon.ico" />&nbsp;<span class="room-name"><a href="http://chat.stackexchange.com/rooms/28/minecraft-talk">Minecraft Talk</a></span></h3>
<div class="room-mini-description">Everything Minecraft, including classic and survival mode</div>
</div>
<div class="room-current-user-count" title="current users">9</div>
<div class="mspark" style="height:25px;width:205px">
<div class="mspbar" style="width:8px;height:13px;left:0px;"></div>
<div class="mspbar" style="width:8px;height:9px;left:8px;"></div>
<div class="mspbar" style="width:8px;height:2px;left:16px;"></div>
<div class="mspbar" style="width:8px;height:8px;left:24px;"></div>
<div class="mspbar" style="width:8px;height:1px;left:32px;"></div>
<div class="mspbar" style="width:8px;height:1px;left:56px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:64px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:88px;"></div>
<div class="mspbar" style="width:8px;height:0px;left:96px;"></div>
<div class="mspbar" style="width:8px;height:11px;left:104px;"></div><div class="mspbar" style="width:8px;height:7px;left:112px;"></div><div class="mspbar" style="width:8px;height:7px;left:120px;"></div><div class="mspbar" style="width:8px;height:25px;left:128px;"></div><div class="mspbar" style="width:8px;height:14px;left:136px;"></div>
<div class="mspbar" style="width:8px;height:4px;left:144px;"></div>
<div class="mspbar" style="width:8px;height:7px;left:152px;"></div>
<div class="mspbar" style="width:8px;height:19px;left:160px;"></div>
<div class="mspbar" style="width:8px;height:19px;left:168px;"></div><div class="mspbar" style="width:8px;height:12px;left:176px;"></div>
<div class="mspbar" style="width:8px;height:11px;left:184px;"></div>
<div class="mspbar now" style="height:25px;left:154px;"></div>
</div>
<div class="clear-both"></div>
</div>
</div>
</div>
</body></html></message>

(This message happened to be a quote of a oneboxed link to a chat room)

Here was the error Ruby gave me:

IOError: stream closed
/usr/lib/ruby/1.8/xmpp4r/stream.rb:594:in `empty?'
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:153:in `empty?'
/usr/lib/ruby/1.8/rexml/parsers/baseparser.rb:193:in `pull'
/usr/lib/ruby/1.8/rexml/parsers/sax2parser.rb:92:in `parse'
/usr/lib/ruby/1.8/xmpp4r/streamparser.rb:79:in `parse'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:75:in `start'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `initialize'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `new'
/usr/lib/ruby/1.8/xmpp4r/stream.rb:72:in `start'
/usr/lib/ruby/1.8/xmpp4r/connection.rb:119:in `start'
/usr/lib/ruby/1.8/xmpp4r/component.rb:70:in `start'
/usr/lib/ruby/1.8/xmpp4r/connection.rb:77:in `connect'
/usr/lib/ruby/1.8/xmpp4r/component.rb:47:in `connect'
./classes/SOXMPP_Bridge.rb:20:in `initialize'
./soxmpp.rb:81:in `new'
./soxmpp.rb:81

Finally, my question!

Given that sending invalid XML to the XMPP server kicks me off, is there any way using Ruby I can validate (and, preferably, correct) the XML before sending it to the XMPP server? Most likely, correcting it will be a matter of my writing additional code for each case where Tidy isn't producing valid XML, but I'd at least like to stop the script from crashing. So, how can I validate the XML before sending it to the XMPP server?


Solution

  • The actual error in this case is your &nbsp;. According to XEP-0071, section 8, point 5:

    Section 11.1 of XMPP Core stipulates that character entities other than the five general entities defined in Section 4.6 of the XML specification (i.e., &lt;, &gt;, &amp;, &apos;, and &quot;) MUST NOT be sent over an XML stream. Therefore implementations of XHTML-IM MUST NOT include predefined XHTML 1.0 entities such as &nbsp; -- instead, implementations MUST use the equivalent character references as specified in Section 4.1 of the XML specification (even in non-obvious places such as URIs that are included in the 'href' attribute).

    So this issue is about more than just generating well-formed XML, which is a pre-requisite. You'll also want to ensure that you're only using XHTML from the approved set in section 6.

    In short, you need to read XEP-0071.