I am trying to parse a GB2312-encoded page (http://news.qq.com/a/20140824/015032.htm).
I have not reached the parsing step yet; the error occurs as soon as I open and read the page.
This is my code:
require 'open-uri'
open("http://news.qq.com/a/20140824/015032.htm").read
And this is the error:
Encoding::InvalidByteSequenceError: "\x8B" on GB2312
I am using Ruby 2.0.0p247
Any solution?
I don't know exactly why this happens when calling .read, but you can work around it if you are using Nokogiri. Just pass the file object directly to Nokogiri without calling .read:
require 'open-uri'
require 'nokogiri'
# Pass the IO object straight to Nokogiri instead of calling .read on it
file = open("http://news.qq.com/a/20140824/015032.htm")
document = Nokogiri(file)
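
If you need the page as a plain string rather than a Nokogiri document, one option is to transcode the bytes yourself and replace anything that is not valid GB2312. The following is only a minimal sketch, assuming you already have the body as a string of raw bytes; gb2312_to_utf8 is a hypothetical helper name, and the sample bytes are made up for illustration, not taken from the page.

```ruby
# Hypothetical helper (not from the original post): treat the given bytes
# as GB2312 and transcode them to UTF-8. Bytes that are invalid in GB2312,
# or that have no UTF-8 mapping, are replaced instead of raising
# Encoding::InvalidByteSequenceError.
def gb2312_to_utf8(bytes)
  bytes.dup.force_encoding(Encoding::GB2312)
       .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
end

# Illustrative input: the GB2312 bytes for U+4F60 U+597D.
clean = "\xC4\xE3\xBA\xC3".b
text = gb2312_to_utf8(clean)

# Same bytes with a stray \x8B in the middle, similar to the byte
# named in the error message. This no longer raises.
dirty = gb2312_to_utf8("\xC4\xE3\x8B\xBA\xC3".b)
```

Replacing invalid bytes hides the problem rather than explaining it, so it is worth checking the result with valid_encoding? and looking at how much of the text was replaced before relying on it.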