python xpath xml.etree

XPath text() does not get the text of a link node


from lxml import etree
import requests
htmlparser = etree.HTMLParser()
f = requests.get('https://rss.orf.at/news.xml')
# Without the '\ufeff' prefix this fails with: "ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration."
tree = etree.fromstring('\ufeff'+f.text, htmlparser)
print(tree.xpath('//item/title/text()'))  # <- this produces a list of titles
print(tree.xpath('//item/link/text()'))  # <- this does NOT produce a list of links. Why?

This is a bit of a mystery to me, and maybe I'm overlooking something simple: the XPath '//item/link/text()' returns only an empty list, while '//item/title/text()' works exactly as expected. Does the <link> node serve some special purpose? I can select all of them with '//item/link'; I just can't get the text() selector to work on them.


Solution

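  • You're using etree.HTMLParser to parse an XML document. I suspect this was an attempt to deal with XML namespacing, but I think it's the wrong solution. Treating the XML document as HTML is most likely the source of your problem: in HTML, <link> is a void element that cannot have content, so the HTML parser closes it immediately and the URL ends up as text following the element rather than inside it.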
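The effect can be reproduced without the network. The following is a minimal sketch, assuming a tiny hand-written stand-in for the feed (the `SAMPLE` string and its example.com URL are made up for illustration). It shows where the HTML parser puts the URL:

```python
from lxml import etree

# Hypothetical one-item stand-in for the feed; the real document is much larger.
SAMPLE = (
    "<item><title>Example title</title>"
    "<link>https://example.com/story/1</link></item>"
)

tree = etree.fromstring(SAMPLE, etree.HTMLParser())
link = tree.xpath('//item/link')[0]

# The HTML parser treats <link> as an empty element, so it has no text child.
print(tree.xpath('//item/link/text()'))
print(link.text)

# The URL is attached as the "tail" of the element, i.e. the text that follows it.
print(link.tail)

# If the HTML parser had to be kept, the following text node could be selected instead.
print(tree.xpath('//item/link/following-sibling::text()[1]'))
```

Reading `.tail` is only a workaround; switching to the XML parser avoids the issue altogether.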

    If we use the XML parser instead, everything pretty much works as expected.

    First, if we look at the root element, we see that it sets a default namespace:

    <rdf:RDF
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
      xmlns:orfon="http://rss.orf.at/1.0/"
      xmlns="http://purl.org/rss/1.0/"
    >
    

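    That means when we see an item element in the document, it's actually an "item in the http://purl.org/rss/1.0/ namespace" element. An unprefixed name in an XPath 1.0 expression always refers to an element in no namespace, so a plain //item matches nothing once the document is parsed as XML. We need to provide the namespace information in our XPath queries by passing in a namespaces dictionary and using a namespace prefix on the element names, like this: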

    >>> tree.xpath('//rss:item', namespaces={'rss': 'http://purl.org/rss/1.0/'})
    [<Element {http://purl.org/rss/1.0/}item at 0x7f0497000e80>, ...]
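To see the difference between prefixed and unprefixed names side by side, here is a small sketch using an inline sample rather than the live feed. The `SAMPLE` document is a made-up, cut-down imitation of the feed's structure, and the prefix `rss` is an arbitrary choice; only the namespace URI has to match the document.

```python
from lxml import etree

# Hypothetical, cut-down imitation of the feed with the same default namespace.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item>
    <title>Example title</title>
    <link>https://example.com/story/1</link>
  </item>
</rdf:RDF>"""

tree = etree.fromstring(SAMPLE)

# The prefix name is our own choice; the URI must match the document's xmlns value.
NS = {'rss': 'http://purl.org/rss/1.0/'}

unprefixed = tree.xpath('//item')                   # looks for <item> in no namespace
prefixed = tree.xpath('//rss:item', namespaces=NS)  # looks for <item> in the RSS namespace

print(len(unprefixed), len(prefixed))
print(prefixed[0].tag)
```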
    

    Your first XPath expression (//item/title/text()) becomes:

    >>> tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
    ['Amnesty dokumentiert Kriegsverbrechen', ..., 'Moskauer Börse startet abgeschirmten Handel']
    

    And your second XPath expression (//item/link/text()) becomes:

    >>> tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'})
    ['https://orf.at/stories/3255477/', ..., 'https://orf.at/stories/3255384/']
    

    This makes the code look like:

    from lxml import etree
    import requests
    f = requests.get('https://rss.orf.at/news.xml')
    tree = etree.fromstring(f.content)
    print(tree.xpath('//rss:item/rss:title/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
    print(tree.xpath('//rss:item/rss:link/text()', namespaces={'rss': 'http://purl.org/rss/1.0/'}))
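If the titles and links need to stay paired, one option is to iterate over the items and read both children from each one. This is only a sketch: the helper name `titles_and_links` and the inline `SAMPLE` document are hypothetical, and the namespace dictionary is defined once so it doesn't have to be repeated in every call.

```python
from lxml import etree

NS = {'rss': 'http://purl.org/rss/1.0/'}

def titles_and_links(xml_bytes):
    """Return a list of (title, link) tuples, one per <item> (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)
    pairs = []
    for item in root.xpath('//rss:item', namespaces=NS):
        # findtext() returns the text of the first matching child, or None if absent.
        title = item.findtext('rss:title', namespaces=NS)
        link = item.findtext('rss:link', namespaces=NS)
        pairs.append((title, link))
    return pairs

# Made-up two-item document standing in for f.content.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item><title>First</title><link>https://example.com/story/1</link></item>
  <item><title>Second</title><link>https://example.com/story/2</link></item>
</rdf:RDF>"""

for title, link in titles_and_links(SAMPLE):
    print(title, link)
```

With the real feed, `titles_and_links(f.content)` would take the place of the sample call.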
    

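    Note that by using f.content (a byte string) instead of f.text (a unicode string), we avoid the Unicode parsing error entirely: the parser reads the encoding from the XML declaration itself, so the '\ufeff' workaround is no longer needed.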
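The str versus bytes behaviour can be checked in isolation. A minimal sketch, assuming a made-up one-element document in `DOC` (the flag `str_input_ok` is just for illustration):

```python
from lxml import etree

# Hypothetical document with an encoding declaration, held as a Python str.
DOC = '<?xml version="1.0" encoding="UTF-8"?><root><title>Börse</title></root>'

try:
    etree.fromstring(DOC)  # str input that carries an encoding declaration
    str_input_ok = True
except ValueError as exc:
    str_input_ok = False
    print(exc)

# The same document as bytes parses fine; lxml decodes it using the declaration.
root = etree.fromstring(DOC.encode('utf-8'))
print(root.findtext('title'))
```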