In Python, how could I count the nodes using XPath? For example, using this webpage and this code:
from lxml import html, etree
import requests
url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
r = requests.get(url)
tree = html.fromstring(r.content)
count = tree.xpath('count(//*[@id="body"])')
print(count)
It prints 1. But it has 5 div nodes.
Can you explain this, and show me how to do it correctly?
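It prints 1 (or 1.0, since XPath's count() returns a number and lxml hands it back as a float) because there is just one element with id="body" in the HTML file you are fetching. The expression //*[@id="body"] selects the elements that carry that id; it does not descend into their children, so any div nodes inside are not counted.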
I downloaded the file and verified this is the case. E.g.:
$ curl -O http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals
Grabs a file 587-islam-is-dominated-by-radicals
$ grep --count 'id="body"' 587-islam-is-dominated-by-radicals
Answers 1. Just to be extra sure, I hand-searched in the file as well, using vi. Just the one!
Perhaps you are looking for another div node? One with a different id?
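If what you actually want is the number of div nodes inside that element, the path has to step into it. Here is a minimal sketch on a made-up HTML snippet (the markup below is an assumption for illustration, not the structure of the real page), comparing the element itself, its direct div children, and all of its div descendants:

```python
from lxml import html

# Hypothetical markup: one element with id="body" holding five child divs,
# one of which contains a further nested div.
sample = """
<html><body>
<div id="body">
  <div>one</div>
  <div>two</div>
  <div>three</div>
  <div>four</div>
  <div><div>nested</div></div>
</div>
</body></html>
"""

tree = html.fromstring(sample)

# Counts the elements that carry id="body" themselves.
matched = tree.xpath('count(//*[@id="body"])')

# Counts only the div elements that are direct children of that element.
children = tree.xpath('count(//*[@id="body"]/div)')

# Counts every div anywhere below that element, nested ones included.
descendants = tree.xpath('count(//*[@id="body"]//div)')

print(matched)      # 1.0
print(children)     # 5.0
print(descendants)  # 6.0
```

Which of the last two you need depends on whether nested div nodes should count. If you would rather have an integer, take len() of the node list, e.g. len(tree.xpath('//*[@id="body"]/div')).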
Update: By the way, XPath and other HTML/XML parsing is pretty challenging to work with. A lot of bad data out there, and a lot of complex markup, times the complexity of the retrieval, parsing, and traversal process. You will probably be running your tests and trials a lot of times. It will be a lot faster if you do not "hit the net" for every one of them. Cache the live results. Raw code looks something like this:
from lxml import html, etree
import requests
filepath = "587-islam-is-dominated-by-radicals"
try:
    # Reuse the local copy if one has already been saved.
    with open(filepath, "rb") as f:
        contents = f.read()
    print("(reading cached copy)")
except IOError:
    url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
    print("(getting file from the net; please stand by)")
    r = requests.get(url)
    contents = r.content
    # Save the page so that later runs can skip the network.
    with open(filepath, "wb") as f:
        f.write(contents)
tree = html.fromstring(contents)
count = tree.xpath('count(//*[@id="body"])')
print(count)
But you can simplify a lot of that by using a generic caching front-end to requests, such as requests-cache. Happy parsing!