ruby parsing hpricot

Parsing an HTML table using Hpricot (Ruby)


I am trying to parse an HTML table using Hpricot, but I am stuck: I cannot select the table element with a specific id from the page.

Here is my Ruby code:

require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new

page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')

form = page.forms.find {|f| f.name == 'form1'}
form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
page = agent.submit(form, form.buttons[2])

doc = Hpricot(page.body)

puts doc.to_html # Here the doc contains the full HTML page

puts doc.search("//table[@id='gvw_offices']").first # This is nil

Can anyone help me identify what's wrong with this?


Solution

  • Mechanize uses Hpricot internally (it is Mechanize's default parser). What's more, it passes Hpricot calls such as search and at through to the parser, so you don't have to parse the page body yourself:

    require 'rubygems'
    require 'mechanize'
    
    # You don't really need this if you don't use Hpricot directly
    require 'hpricot'
    
    agent = WWW::Mechanize.new
    
    page = agent.get('http://www.indiapost.gov.in/pin/pinsearch.aspx')
    
    form = page.forms.find {|f| f.name == 'form1'}
    form.fields.find {|f| f.name == 'ddl_state'}.options[1].select
    page = agent.submit(form, form.buttons[2])
    
    puts page.parser.to_html # page.parser returns the hpricot parser
    
    puts page.at("//table[@id='gvw_offices']") # This passes through to hpricot
    

    Also note that page.search("foo").first is equivalent to page.at("foo").
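
Once the table element is selected, the rows still have to be read out of it. The following is a minimal offline sketch of that step, not the answer's own method: it swaps in REXML from Ruby's standard library instead of Hpricot so it runs without the network, and it assumes a made-up, well-formed SAMPLE_HTML snippet with hypothetical office and pincode columns (the real result page may be laid out differently). REXML::XPath.match followed by .first and REXML::XPath.first play the roles of search(...).first and at(...).

```ruby
require 'rexml/document'

# Hypothetical stand-in for the result page; the real markup may differ.
SAMPLE_HTML = <<~HTML
  <html>
    <body>
      <table id="other"><tr><td>ignored</td></tr></table>
      <table id="gvw_offices">
        <tr><th>Office</th><th>Pincode</th></tr>
        <tr><td>Office A</td><td>000001</td></tr>
        <tr><td>Office B</td><td>000002</td></tr>
      </table>
    </body>
  </html>
HTML

doc   = REXML::Document.new(SAMPLE_HTML)
xpath = "//table[@id='gvw_offices']"

# "Collect all matches, then take the first" versus "ask for the first match".
via_search = REXML::XPath.match(doc, xpath).first
via_at     = REXML::XPath.first(doc, xpath)

# Turn every row into an array of cell strings; the header row has no
# <td> cells, so it yields an empty array and is dropped.
rows = REXML::XPath.match(via_at, "tr").map do |tr|
  REXML::XPath.match(tr, "td").map { |td| td.text.to_s.strip }
end.reject(&:empty?)

puts via_search.attributes['id']
puts via_at.attributes['id']
rows.each { |office, pincode| puts "#{office}: #{pincode}" }
```

REXML only accepts well-formed markup, so on a real page a forgiving HTML parser (the one Mechanize hands back from page.parser) is the better fit; the row-walking idea stays the same.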