[SOLVED] Most efficient way to count nodes using XPath in Python

In Python, how could I count the nodes using XPath? For example, using this webpage and this code:

from lxml import html, etree
import requests
url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
r = requests.get(url)
tree = html.fromstring(r.content)
count = tree.xpath('count(//*[@id="body"])')
print(count)

It prints 1, but the page has 5 div nodes. Why is that, and how do I count them correctly?

Solution

It prints 1 (lxml returns XPath numbers as floats, so you may see 1.0) because the expression counts the elements whose id is "body", and there is just one such element in the HTML file you are fetching.

I downloaded the file and verified this is the case. E.g.:

$ curl -O http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals

Grabs a file 587-islam-is-dominated-by-radicals

$ grep --count 'id="body"' 587-islam-is-dominated-by-radicals

Answers 1. Just to be extra sure, I hand-searched in the file as well, using vi. Just the one!

Perhaps you are looking for another div node? One with a different id?
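
If the goal was to count the div elements inside that node rather than the node itself, the XPath has to step into it. Here is a minimal sketch. It uses made-up sample markup, not the real page, so the class names and the resulting numbers are assumptions that apply only to this sample:

```python
from lxml import html

# Hypothetical stand-in markup; the real page's structure may differ.
SAMPLE = """
<html><body>
  <div id="body">
    <div class="a"><div class="nested"></div></div>
    <div class="b"></div>
    <p>text</p>
  </div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Counts the elements that carry id="body" (what the original query does)
matches = tree.xpath('count(//*[@id="body"])')

# Counts only the div elements that are direct children of that element
children = tree.xpath('count(//*[@id="body"]/div)')

# Counts every div nested anywhere below that element
descendants = tree.xpath('count(//*[@id="body"]//div)')

# Same as the child count, but done in Python on the list of matched nodes
via_len = len(tree.xpath('//*[@id="body"]/div'))

print(matches, children, descendants, via_len)  # 1.0 2.0 3.0 2
```

Use / when you only want direct children and // when nested divs should count too. count() comes back as a float, while len() on the node list gives an int.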

Update: By the way, XPath and HTML/XML parsing in general can be pretty challenging to work with. There is a lot of bad data out there and a lot of complex markup, compounded by the complexity of the retrieval, parsing, and traversal process. You will probably run your tests and trials many times, and they will go a lot faster if you do not "hit the net" for every one of them. Cache the live results instead. Raw code looks something like this:

from lxml import html
import requests

filepath = "587-islam-is-dominated-by-radicals"
try:
    # Read bytes so lxml can work out the page's encoding itself
    with open(filepath, "rb") as f:
        contents = f.read()
    print("(reading cached copy)")
except IOError:
    url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
    print("(getting file from the net; please stand by)")
    r = requests.get(url)
    contents = r.content
    # Save the download so the next run reads the cached copy
    with open(filepath, "wb") as f:
        f.write(contents)
tree = html.fromstring(contents)
count = tree.xpath('count(//*[@id="body"])')
print(count)

But you can simplify a lot of that by using a generic caching front-end to requests, such as requests-cache. Happy parsing!