I need to clean up various Word 'smart' characters in user input, including but not limited to the following:
– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK
Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?
The HTMLEntities gem will decode the entities to UTF-8.
You could use iconv to transliterate to the closest ASCII equivalents, or use simple gsub or tr calls. James Gray has some blog posts about converting between various character sets that show how to do the transliterations.
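require 'htmlentities'

chars = [
  '&ndash;', # EN DASH
  '&lsquo;', # LEFT SINGLE QUOTATION MARK
  '&rsquo;'  # RIGHT SINGLE QUOTATION MARK
]

decoder = HTMLEntities.new('expanded')

# For each entity, print: entity => decoded UTF-8 character => ASCII stand-in => encoding.
# tr pads the shorter replacement set with its last character, so both quotes map to '.
chars.each do |c|
  puts "#{ c } => #{ decoder.decode(c) } => #{ decoder.decode(c).tr('–‘’', "-'") } => #{ decoder.decode(c).encoding }"
end
# >> &ndash; => – => - => UTF-8
# >> &lsquo; => ‘ => ' => UTF-8
# >> &rsquo; => ’ => ' => UTF-8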
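
If the input already contains the literal characters rather than entities, the plain gsub route mentioned above can be sketched with nothing but core Ruby. The table `SMART_CHARS` and the helper `asciify` are hypothetical names made up for this illustration, and the mapping is only a starting point, not a complete list of what Word produces. Unlike tr, a hash passed to gsub can map one character to several, which helps with things like the ellipsis.

```ruby
# Hypothetical mapping table (an assumption, not an exhaustive list);
# extend it with whatever characters show up in your input.
SMART_CHARS = {
  "\u2013" => "-",   # EN DASH
  "\u2018" => "'",   # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'",   # RIGHT SINGLE QUOTATION MARK
  "\u201C" => '"',   # LEFT DOUBLE QUOTATION MARK
  "\u201D" => '"',   # RIGHT DOUBLE QUOTATION MARK
  "\u2026" => "..."  # HORIZONTAL ELLIPSIS
}.freeze

# One regexp that matches any key of the table.
SMART_CHARS_RE = Regexp.union(SMART_CHARS.keys)

# Hypothetical helper: returns a copy of text with each mapped
# character replaced by its ASCII stand-in; other characters pass through.
def asciify(text)
  text.gsub(SMART_CHARS_RE, SMART_CHARS)
end

puts asciify("\u2018quoted\u2019 \u2013 done\u2026")
# >> 'quoted' - done...
```

Characters that are not in the table are left untouched, so the result is only guaranteed to be ASCII if the table covers everything in the input.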