python web-scraping css-selectors lxml scraperwiki

ScraperWiki Python Loop Issue


I'm building a scraper on ScraperWiki using Python, but I'm not getting the results I expect. My code is based on the basic example in ScraperWiki's docs and looks very similar to it, so I'm not sure where the problem is. I get the title and URL of the first document on the page, but the loop doesn't return any of the remaining documents. Any advice is appreciated!

import scraperwiki
import requests
import lxml.html

html = requests.get("http://www.store.com/us/a/productDetail/a/910271.htm").content
dom = lxml.html.fromstring(html)

for entry in dom.cssselect('.downloads'):
    document = {
        'title': entry.cssselect('a')[0].text_content(),
        'url': entry.cssselect('a')[0].get('href')
    }
    print document

Solution

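  • The selector .downloads matches only the container element, so your loop runs once and cssselect('a')[0] picks out just the first link. You need to iterate over the a tags inside the div with class downloads instead: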

    for entry in dom.cssselect('.downloads a'):
        document = {
            'title': entry.text_content(),
            'url': entry.get('href')
        }
        print document
    

    Prints:

    {'url': '/webassets/kpna/catalog/pdf/en/1012741_4.pdf', 'title': 'Rough In/Spec Sheet'}
    {'url': '/webassets/kpna/catalog/pdf/en/1012741_2.pdf', 'title': 'Installation and Care Guide with Service Parts'}
    {'url': '/webassets/kpna/catalog/pdf/en/1204921_2.pdf', 'title': 'Installation and Care Guide without Service Parts'}
    {'url': '/webassets/kpna/catalog/pdf/en/1011610_2.pdf', 'title': 'Installation Guide without Service Parts'}