ruby ms-word special-characters smart-quotes

Cleaning up 'smart' characters from Word in Ruby


I need to clean up various Word 'smart' characters in user input, including but not limited to the following:

– EN DASH
‘ LEFT SINGLE QUOTATION MARK
’ RIGHT SINGLE QUOTATION MARK

Are there any Ruby functions or libraries for mapping these into their ASCII (near-) equivalents, or do I really need to just do a bunch of manual gsubs?


Solution

  • The HTMLEntities gem will decode the entities to UTF-8.

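    From there you could use iconv to transliterate to the closest ASCII equivalents, or use simple gsub or tr calls. (Iconv was removed from the standard library in Ruby 2.0, so on newer Rubies it has to be installed as the iconv gem.) James Gray has written several blog posts about converting between character sets that show how to do the transliteration.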

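    require 'htmlentities'
    
    # The input as HTML entities, which is what the decoder expects.
    chars = [
      '&ndash;', # EN DASH
      '&lsquo;', # LEFT SINGLE QUOTATION MARK
      '&rsquo;'  # RIGHT SINGLE QUOTATION MARK
    ]
    
    decoder = HTMLEntities.new('expanded')
    chars.each do |c|
      decoded = decoder.decode(c)
      # tr pads the shorter replacement set with its last character,
      # so both curly quotes map to a straight apostrophe.
      puts "#{ c } => #{ decoded } => #{ decoded.tr('–‘’', "-'") } => #{ decoded.encoding }"
    end
    
    # >> &ndash; => – => - => UTF-8
    # >> &lsquo; => ‘ => ' => UTF-8
    # >> &rsquo; => ’ => ' => UTF-8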
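
If the input already arrives as UTF-8 rather than as entities, the "simple gsub" route mentioned above needs no gem at all. Here is a minimal sketch, assuming a hand-maintained mapping; the names SMART_CHARS, SMART_CHARS_RE and asciify_smart_chars are made up for illustration, and the list of characters is only a starting point, not a complete inventory of what Word can produce.

```ruby
# Hypothetical mapping of Word "smart" characters to ASCII near-equivalents.
# Extend it with whatever else shows up in your input.
SMART_CHARS = {
  "\u2013" => "-",   # EN DASH
  "\u2014" => "--",  # EM DASH
  "\u2018" => "'",   # LEFT SINGLE QUOTATION MARK
  "\u2019" => "'",   # RIGHT SINGLE QUOTATION MARK
  "\u201C" => '"',   # LEFT DOUBLE QUOTATION MARK
  "\u201D" => '"',   # RIGHT DOUBLE QUOTATION MARK
  "\u2026" => "..."  # HORIZONTAL ELLIPSIS
}.freeze

# One regexp that matches any key of the mapping.
SMART_CHARS_RE = Regexp.union(SMART_CHARS.keys)

# Returns a copy of text with every mapped character replaced.
# Characters that are not in the mapping are left untouched.
def asciify_smart_chars(text)
  text.gsub(SMART_CHARS_RE, SMART_CHARS)
end

puts asciify_smart_chars("\u201CIt\u2019s 1990\u20132000\u2026\u201D")
```

Passing the hash as the second argument to gsub does all the replacements in a single pass, so it stays one call no matter how many characters you add, and unlike tr it can map one character to several (the ellipsis, for example).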