html ruby nokogiri open-uri css-parsing

Why does OpenURI return different HTML content from the original source?


I'm trying to extract the contents of a style element from an HTML page using OpenURI and Nokogiri.

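require 'open-uri'
require 'nokogiri'
require 'css_parser'

# URI.open replaces the bare open, which no longer opens URLs on Ruby 3.0+
url  = URI.open('https://google.com')
html = Nokogiri::HTML(url)
css  = CssParser::Parser.new
# Feed the text of the style element with id "gstyle" to the CSS parser
css.add_block!(html.search('style#gstyle').text)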

This returns nil, but the HTML of the Google page contains id="gstyle".

  1. Why is the HTML that OpenURI returns for the Google page different from the original source?
  2. How can I find the style#gstyle tag?
  3. Why does Firebug see the correct HTML document while OpenURI does not?

Solution

  • Google renders its page differently for different clients based on the User-Agent string, which is the main clue the server has about what kind of client is requesting the page. By default, open-uri identifies itself as "Ruby". If you visit with what is clearly an automated script, you will not get the same page as you would with a browser.

    Try this:

    url = URI.open('https://google.com', "User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36")
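
To see what the server actually receives, here is a minimal sketch that needs no access to Google. It assumes Ruby 2.5 or later (for URI.open) and starts a throwaway local server, written only for this illustration, that replies with whatever User-Agent header the client sent. The names server, default_seen and browser_seen are hypothetical.

```ruby
require 'open-uri'
require 'socket'

# Tiny local HTTP server (illustration only): it echoes back the
# User-Agent header of each request as the response body.
server = TCPServer.new('127.0.0.1', 0) # port 0 lets the OS pick a free port
port   = server.addr[1]

Thread.new do
  loop do
    client = server.accept
    agent  = nil
    # Read request headers until the blank line that terminates them.
    while (line = client.gets) && line != "\r\n"
      agent = line.split(':', 2)[1].strip if line =~ /\Auser-agent:/i
    end
    body = agent.to_s
    client.write "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n" \
                 "Content-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n#{body}"
    client.close
  end
end

url = "http://127.0.0.1:#{port}/"
browser_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) ' \
                'AppleWebKit/537.36 (KHTML, like Gecko) ' \
                'Chrome/46.0.2490.80 Safari/537.36'

# Same URL, requested once with open-uri's defaults and once with a
# browser-like User-Agent passed as a header option.
default_seen = URI.open(url).read
browser_seen = URI.open(url, 'User-Agent' => browser_agent).read

puts "default request was seen as: #{default_seen}"
puts "custom request was seen as:  #{browser_seen}"
```

The two printed lines differ, and that difference is all a server such as Google has to go on when choosing which variant of the page to send.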