python web-scraping lxml scraperwiki

Scraperwiki + lxml. How to get the href attribute of a child of an element with a class?


The page whose URL contains 'alpha' has many links (hrefs) that I would like to collect from 20 different pages and append to the end of the base URL (second-to-last line). The hrefs are in a table: the td has the class mys-elastic mys-left, and the a element inside it carries the href attribute. Any help would be greatly appreciated, as I have been working on this for about a week.

for i in range(1, 11):
    # The HTML scraper for the 20 pages that list all the exhibitors
    url = 'http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=' + str(i) + '#GotoResults'
    print url
    list_html = scraperwiki.scrape(url)
    root = lxml.html.fromstring(list_html)
    href_element = root.cssselect('td.mys-elastic mys-left a')

    for element in href_element:
        # Convert HTML to lxml object
        href = href_element.get('href')
        print href

        page_html = scraperwiki.scrape('http://ahr13.mapyourshow.com' + href)
        print page_html

Solution

  • No need to muck about with JavaScript - it's all there in the HTML:

    import scraperwiki
    import lxml.html
    
    html = scraperwiki.scrape('http://ahr13.mapyourshow.com/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page=1')
    
    root = lxml.html.fromstring(html)
    # get the links
    hrefs = root.xpath('//td[@class="mys-elastic mys-left"]/a')
    
    for href in hrefs:
       print 'http://ahr13.mapyourshow.com' + href.attrib['href']
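
    Two details in the question's code are worth spelling out. In a CSS selector a space means "descendant", so 'td.mys-elastic mys-left a' looks for a <mys-left> element inside the td; chaining the classes as 'td.mys-elastic.mys-left a' targets a td carrying both. Also, get('href') has to be called on each element in the loop, not on the list returned by cssselect. Below is a rough sketch of how the snippet above might be extended to all the listing pages. It is written for Python 3, and the names extract_links, collect_links and fetch are hypothetical helpers, not part of scraperwiki or lxml; the page count of 20 is taken from the question.

    ```python
    import lxml.html

    BASE_URL = 'http://ahr13.mapyourshow.com'
    LIST_PATH = '/5_0/exhibitor_results.cfm?alpha=%40&type=alpha&page='


    def extract_links(html, base_url=BASE_URL):
        """Return absolute URLs for each <a> directly inside a matching <td>."""
        root = lxml.html.fromstring(html)
        # Same XPath as above: the class attribute must equal this exact string.
        anchors = root.xpath('//td[@class="mys-elastic mys-left"]/a')
        # Skip anchors without an href so the concatenation never sees None.
        return [base_url + a.get('href') for a in anchors if a.get('href')]


    def collect_links(fetch, first_page=1, last_page=20):
        """Gather links from every listing page.

        `fetch` is any callable mapping a URL to HTML text (an assumption of
        this sketch); scraperwiki.scrape fits that shape.
        """
        links = []
        for page in range(first_page, last_page + 1):
            links.extend(extract_links(fetch(BASE_URL + LIST_PATH + str(page))))
        return links
    ```

    Calling collect_links(scraperwiki.scrape) should then return the exhibitor URLs from pages 1 to 20, assuming every page uses the same table markup. Passing the fetch function in as an argument keeps the parsing testable without hitting the site.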