ruby escaping xml-entities

Converting escaped XML entities back into UTF-8


So I've got this UTF-8 string in an XML file:

Horrible place. ☠☠☠

And when I feed it to an external application, the funny characters come back escaped as XML entities:

Horrible place. &#x2620;&#x2620;&#x2620;

In Ruby, how do I convert that string back to UTF-8? There's probably a really easy solution for this, but I can't find anything in the standard libraries; e.g. CGI.unescapeHTML (which works nicely for things like &gt;) seems to ignore them completely.

ree-1.8.7-2010.02 > CGI.unescapeHTML('&gt;')
 => ">"
ree-1.8.7-2010.02 > CGI.unescapeHTML('&#x2620;')
 => "&#x2620;"

Solution

  • Well, since it's XML encoded I'd go for an XML parser:

    require 'nokogiri'
    
    frag = 'Horrible place. &#x2620;&#x2620;&#x2620;'
    doc = Nokogiri::XML.fragment(frag)
    puts doc.text
    # >> Horrible place. ☠☠☠
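
If pulling in Nokogiri isn't an option, numeric character references like these can also be decoded with a plain regular expression. This is only a minimal sketch, assuming Ruby 1.9 or later and that only numeric references (decimal like `&#9760;` or hex like `&#x2620;`) need handling. `decode_numeric_entities` is a hypothetical helper name, not a library method:

```ruby
# Hypothetical helper (not part of any library): replaces numeric character
# references such as &#9760; or &#x2620; with the UTF-8 character they name.
# Named entities like &gt; are deliberately left untouched.
def decode_numeric_entities(str)
  str.gsub(/&#(?:x([0-9a-fA-F]+)|(\d+));/) do
    # $1 holds the hex digits, $2 the decimal digits; only one of them matches.
    codepoint = $1 ? $1.hex : $2.to_i
    # The 'U' directive packs an integer code point into a UTF-8 string.
    [codepoint].pack('U')
  end
end

puts decode_numeric_entities('Horrible place. &#x2620;&#x2620;&#x2620;')
# >> Horrible place. ☠☠☠
```

Unlike the parser approach, this does no validation: a reference to a code point outside the Unicode range will raise, and anything that isn't a numeric reference passes through as-is, so for real XML a parser is still the safer choice.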