I scrape web pages with lxml in Python. To get the number of table rows, I currently fetch them all and call len() on the result. That feels wasteful — is there another way to get the row count (it varies from page to page) for further scraping?
    import lxml.html

    doc = ''
    try:
        doc = lxml.html.parse('url')
    except SkipException:
        pass

    if doc:
        buf = ''
        # get the total number of rows in the table
        tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
        table = []
        # iterate over the table rows, limited to the max number
        for i in range(3, len(tr)):
            # get the row's content
            table += doc.xpath("body/div[1]/div[1]/table[1]/tbody/tr[%s]/td" % i)
You can use the tr elements you already matched as the starting point and simply iterate over them, just as you would with a Python list:
    tr = doc.xpath("/html/body/div[1]/div[1]/table[1]/tbody/tr")
    for row in tr[3:]:
        table += row.findall('td')
The above uses .findall() to grab all the td elements contained in each row, but you could make further .xpath() calls on the row elements if you need more control.
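For example, a relative .xpath() call evaluated against a row element lets you pick out specific cells or just their text. A minimal sketch — the inline table below is a stand-in, since the real URL and markup aren't shown in the question:

```python
import lxml.html

# stand-in markup; the question's real page is not available here
html = """
<table><tbody>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</tbody></table>
"""
doc = lxml.html.fromstring(html)

rows = doc.xpath("//table[1]/tbody/tr")
for row in rows:
    # relative XPath: the text of this row's second td only
    second_cell = row.xpath("td[2]/text()")
    print(second_cell)
```

Because the expression has no leading slash, it is evaluated relative to each row rather than the whole document, so you never have to rebuild the absolute path with an index as in the original loop.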