ruby-on-rails ruby gb2312

Read a GB2312-encoded page using Ruby


I am trying to parse a GB2312-encoded page (http://news.qq.com/a/20140824/015032.htm).

I have not reached the parsing part yet. I am only opening and reading the page, and that already raises an error.

This is my code:

require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read

And this is the error:

Encoding::InvalidByteSequenceError: "\x8B" on GB2312

I am using Ruby 2.0.0p247

Any solution?


Solution

  • I don't know exactly why this happens when calling .read, but you can work around it if you are using Nokogiri. Just pass the file object directly to Nokogiri without calling .read:

    require 'open-uri'
    require 'nokogiri'

    # Pass the IO object to Nokogiri instead of calling .read on it
    file = open("http://news.qq.com/a/20140824/015032.htm")
    document = Nokogiri(file)
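
If you do need the page as a string rather than a Nokogiri document, one possible alternative is to take the raw bytes and transcode them yourself, replacing any byte sequence that is not valid in the declared encoding. The sketch below is only an illustration: to_utf8 is a hypothetical helper name, not part of open-uri or Nokogiri, and it assumes the body really is GB2312 text with a few stray bytes. It will not help if the response is something else entirely (for example compressed data).

```ruby
# Hypothetical helper (not from the original answer): treat the raw bytes as
# the declared encoding and transcode to UTF-8, replacing anything that is
# invalid or has no UTF-8 equivalent instead of raising.
def to_utf8(raw, source_encoding = "GB2312")
  raw.dup.force_encoding(source_encoding)
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Sample bytes standing in for a downloaded page body:
# "\xD6\xD0\xCE\xC4" is a valid two-character GB2312 sequence.
sample = "\xD6\xD0\xCE\xC4".b
puts to_utf8(sample)

# A stray "\x8B" (as in the error above) no longer raises.
puts to_utf8("abc\x8Bdef".b).valid_encoding?
```

With open-uri you would pass the bytes from the response in place of sample. Pages labelled GB2312 sometimes contain characters from the larger GBK or GB18030 sets, so passing "GB18030" as the second argument may be worth trying if many characters come out as "?".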