python, web-scraping, lxml

lxml: length of web-parsed content


I scrape web pages with lxml in Python. To get the number of table rows, I currently fetch them all and then call len() on the result. That feels wasteful; is there another way to get the row count (it varies from page to page) for further scraping?

import lxml.html

doc = None
try:
    doc = lxml.html.parse('url')
except IOError:
    # the page could not be fetched; leave doc as None
    pass

if doc is not None:
    # get all of the table's rows in one query
    tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
    table = []
    # iterate over the rows, skipping the first two (XPath positions are 1-based)
    for i in range(3, len(tr) + 1):
        # re-query the document for this row's cells
        table += doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr[%s]/td" % i)

Solution

  • You can use the tr elements you already matched as the starting point and simply iterate over them as you would over a Python list:

    tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
    for row in tr[3:]:
        table += row.findall('td')
    

    The above uses .findall() to grab all contained td elements, but you could use further .xpath() calls if you need more control.
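
    For example, to pull out the cell text rather than the elements themselves, a relative .xpath() call per row works the same way; a minimal sketch, assuming each row holds plain td cells:

    table = []
    for row in tr[3:]:
        # relative XPath: only the td children of this row
        table += row.xpath('td')
    # text_content() flattens any markup nested inside a cell
    cells = [td.text_content().strip() for td in table]

    And if the row count itself is all you were after, XPath's count() function evaluates to a number directly, so the intermediate Python list is never built:

    row_count = int(doc.xpath("count(/html/body/div[1]/div[1]/table[1]/tbody/tr)"))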