ruby nokogiri mechanize hpricot

Parsing Data in Frames with Mechanize and Ruby


I'm trying to scrape the following website, because its XML is malformed and does not contain all of the data I need:

http://www.cafebonappetit.com/menu/your-cafe/pitzer

When I fetch the document with Mechanize, however, I only get:

{meta_refresh}
{title "Collins  | Claremont McKenna Cafés | Café Bon Appétit"}
{iframes}
{frames}
{links
 #<Mechanize::Page::Link "Welcome" "http://www.cafebonappetit.com/">
 #<Mechanize::Page::Link "Our Approach" "javascript://">
 #<Mechanize::Page::Link
 "Kitchen Principles"
 "http://www.cafebonappetit.com/our-approach/kitchen-principles">
 .....
 }

What I actually need is the content of the menu tables, which I guess are in iframes. Any thoughts on how to get at it?

Thanks!


Solution

  • Here's a simple Mechanize + Nokogiri script that scrapes the menu items.

    require 'mechanize'   # also loads Nokogiri, which backs page.search
    require 'pp'
    
    agent = Mechanize.new
    url   = "http://www.cafebonappetit.com/menu/your-cafe/pitzer"
    page  = agent.get(url)
    
    # Grab each daily menu: a table directly inside div#menu-items
    page.search('div#menu-items > table.my-day-menu-table').each do |menu|
      # The date is the link in the div immediately before the table
      day = menu.xpath('preceding-sibling::div[1]/a').text.strip
      puts day
      # Collect the menu items: one string per row, joining the bold cells
      fare = menu.xpath('tr').map do |item|
        item.xpath('td/strong').map(&:text).join(": ")
      end
      pp fare
    end
    

    Result (excerpt):

    Sunday, May 6th, 2012
    ["Brunch",
     "chef's table: custom omelet bar",
     "main plate: chicken sanchez",
     "meatless chicken and sauce",
     "options: banana pancakes",
     "stocks: beed barley",
     "vegetable minestrone",
     "Lunch",
     "main plate: steamed broccoli",
     "Dinner",
     "chef's table: pasta bar",
     "farm to fork: sauteed rainbow chard",
     "options: mozzarella sticks",
     "ovens: pizza bar",
     "main plate: roasted herb chicken",
     "baked ziti pasta",
     "steamed carrots and parsnips"]