HTML / Script Scraping Google Map using Hpricot (Ruby on Rails)


I am having a problem scraping the code I need in order to extract information for a web mashup I'm creating.

Basically, I am trying to scrape code from:

http://yellowpages.com.mt/Meranti-Ltd-In-Malta-Gozo;/Hair-Accessories;Hijjhkikke=Hiojhhfokje.aspx

This is just one of the pages I will need to scrape, so I cannot simply feed the program the code I need directly =/.

When I scrape the page using the following code (in Hpricot)

puts open(ypUrl, 'User-Agent'=>'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2') { |f| Hpricot(f) }

I notice that instead of the part of the code I need, I only see the script reference, namely

<script type="text/javascript" src="http://maps.google.com/maps?file=api&amp;v=2&amp;sensor=false&amp;key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ"></script><title>

Beautimport Ltd (Balmain Hair Extensions) in Malta | Yellow Pages (Malta) Ltd | YellowPages.com.mt

This is also what I see when I view the source in Firefox. However, when I hover over the elements in Firebug, I am able to get an XPath, which unfortunately does not work because the script reference stays as it is in what Hpricot receives (I'm not sure if I'm explaining this correctly). I need all the code that the script generates on the page, which so far is only viewable in Firebug, so that I can extract the following (taken from Firebug by hovering over the Google icon on the map):

<a title="Click to see this area on Google Maps" href="http://maps.google.com/maps?ll=35.88805,14.46627&spn=0.006988,0.015922&z=16&key=ABQIAAAA8JYIIyGmC1BLOU85GKKkPRSNQenRT-s-Gs-9sYb3ZSBhRRTdcRTMq3zWEID1E35uXl9bdQKIPQIjNQ&sensor=false&mapclient=jsapi&oi=map_misc&ct=api_logo" target="_blank">

which gives the following XPath (// denotes a tbody). But as I mentioned, since Hpricot is not giving me the entire code, the XPath is pretty useless because it can't reach the element!

/html/body/form/table//tr/td/div/table[2]//tr[2]/td[2]/div/div[2]/table//tr/td/div/div[2]/a

This way I would be able to extract the longitude and latitude, which I really need for my project. I really don't know how to go about this another way using Hpricot, as it's not giving me all the code I need. Any help would be greatly appreciated.


Solution

  • This was a fun one. It can be done, but it's going to take more than Hpricot. While sniffing the traffic, I noticed that a web service is called to populate the latitude and longitude. Here's what you can do to get to that information:

    Scrape the site like you normally do, but look for a call to the LoadMapByDetail JavaScript function. The line will look something like:

    <script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>
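
    As a rough sketch, the ID could be pulled out of that line with a plain regular expression. The helper name extract_detail_id is made up for illustration, and the sample string is just the line shown above; the real pages may format the call differently.

```ruby
# Sample markup copied from the line above; in practice this would be
# the full page body returned by open-uri.
html = "<script type='text/javascript'>LoadMapByDetail(1668154, 0, 1)</script>"

# Hypothetical helper (not part of Hpricot or the site): returns the first
# argument of LoadMapByDetail(...) as a string, or nil if no call is found.
def extract_detail_id(html)
  match = html.match(/LoadMapByDetail\(\s*(\d+)/)
  match && match[1]
end

detail_id = extract_detail_id(html)
puts detail_id  # prints 1668154
```

    The ID is returned as a string, which is the form the web service call takes it in.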
    

    Parse the ID out and call the web service. This will end up looking something like:

    require 'rubygems'
    require 'hpricot' 
    require 'open-uri' 
    require 'soap/wsdlDriver'
    
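    WSDL_URL = "http://yellowpages.com.mt/Web_Service/SearchMap.asmx?WSDL"
    # Build a SOAP client from the service's WSDL
    soap = SOAP::WSDLDriverFactory.new(WSDL_URL).create_rpc_driver
    # Request the coordinates for the ID parsed from the page
    response = soap.GetCoordByDetail(:mainDetailID => '1668154', :type => '1')
    soap.reset_stream
    # Print each returned value (latitude, then longitude)
    response.getCoordByDetailResult.anyType.each { |x| puts x.anyType }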
    

    You'll see the latitude and longitude in the output:

    35.88805
    14.46627
    

    Hope this helps. Good luck!