json ruby parsing utf-8 ucs2

How to read json encoded in ibm437 in Ruby


I have a json file that has the following data in it:

{"help":true}

The platform is Windows 2016. When I open the text file in Notepad++, the encoding shows as UCS-2 LE BOM, but when I use Ruby to display the encoding, it reports IBM437. When I try to parse the JSON, it fails with the following error:

ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

My code is as follows:

require 'json'
def current_options
    dest='C:/test.json'
    file = File.read(dest)
    if(File.exist?(dest)) 
      p file.encoding
      p file
      @data_hash ||= JSON.parse(file)
      return @data_hash
    else
      return {}
    end
end

p current_options

And the output looks like this:

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:IBM437>
"\xFF\xFE{\x00\"\x00h\x00e\x00l\x00p\x00\"\x00:\x00t\x00r\x00u\x00e\x00}\x00"
Traceback (most recent call last):
        3: from ./ruby.rb:20:in `<main>'
        2: from ./ruby.rb:13:in `current_options'
        1: from C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse'
C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

If I use Notepad++ to change the encoding from UCS-2 LE BOM to UTF-8 and then parse the file with my code, it works without issues. The problem is that another application manages this file and creates it in that encoding.

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:IBM437>
"{\"help\":true}"
{"help"=>true}

I tried specifying the encoding and forcing it to use utf-8 but it still fails:

require 'json'
def current_options
    dest='C:/test.json'
    file = File.read(dest,:external_encoding => 'ibm437',:internal_encoding => 'utf-8')
    if(File.exist?(dest)) 
      p file.encoding
      p file
      @data_hash ||= JSON.parse(file)
      return @data_hash
    else
      return {}
    end
end

p current_options

This outputs:

PS C:\> & "C:\ruby\bin\ruby.exe" .\ruby.rb #this is the file that contains my above code
#<Encoding:UTF-8>
"\u00A0\u25A0{\u0000\"\u0000h\u0000e\u0000l\u0000p\u0000\"\u0000:\u0000t\u0000r\u0000u\u0000e\u0000}\u0000"
Traceback (most recent call last):
        3: from ./ruby.rb:20:in `<main>'
        2: from ./ruby.rb:13:in `current_options'
        1: from C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse'
C:/ruby/lib/ruby/2.5.0/json/common.rb:156:in `parse': 765: unexpected token at ' ■{' (JSON::ParserError)

I am not sure how to parse this file. Any suggestions? Thank you.


Solution

  • Your file really is in UCS-2 LE with a BOM, so Notepad++ is telling you the truth.

    Ruby does not attempt to figure out the encoding, as far as I know. When you do this:

    file = File.read(dest)
    if(File.exist?(dest)) 
        p file.encoding
    

    What you see is not an encoding that Ruby has deduced from the contents of the file. Rather, it is the OS default locale encoding, which Ruby applies as a label to whatever it reads. On US OEM installs of Windows, that default is IBM437, the original DOS code page. The actual encoding of the file plays no part in it.

    You should be able to convert the file to UTF-8 by supplying :external_encoding => 'utf-16' in place of 'ibm437' (keeping :internal_encoding => 'utf-8'), since the BOM provides the endianness information.
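
If you would rather not depend on the default encoding at all, another option is to read the raw bytes and transcode them yourself. The following is a minimal sketch, not a definitive implementation. It assumes the file is always little-endian UTF-16, as the FF FE bytes in the question's dump indicate. read_utf16le_json is a hypothetical helper name, and the sample bytes are written to a temporary file instead of C:/test.json so the snippet runs anywhere.

```ruby
require 'json'
require 'tempfile'

# Hypothetical helper (not part of any library): reads a file that is
# assumed to be UTF-16LE, with or without a BOM, and returns the parsed JSON.
def read_utf16le_json(path)
  raw = File.binread(path)                       # raw bytes, no locale guessing
  text = raw.force_encoding(Encoding::UTF_16LE)  # relabel only; bytes are unchanged
            .encode(Encoding::UTF_8)             # actually transcode to UTF-8
  text = text.sub(/\A\uFEFF/, '')                # drop the BOM character if present
  JSON.parse(text)
end

# Recreate the bytes shown in the question: FF FE followed by UTF-16LE text.
sample = "\xFF\xFE".b + '{"help":true}'.encode(Encoding::UTF_16LE).b

Tempfile.create('test.json') do |f|
  f.binmode
  f.write(sample)
  f.flush
  p read_utf16le_json(f.path)
end
```

Note the difference between the two calls: force_encoding only changes the label on the string, which is effectively what the IBM437 default did to your data, while encode converts the bytes. After transcoding, the BOM becomes the single character U+FEFF, which the JSON parser may reject, so the sketch removes it before parsing.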