xpath, scrapy, mailing-list, scrapy-shell

Scrapy can't identify "tbody" and "ul" elements as listed by Firebug


I am trying to extract every title of this mailing list while registering how many replies each thread has.

According to Firebug, the XPath to the <ul> that contains all the titles is:

/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul

However, if I paste this directly in Scrapy Shell, it will yield an empty list:

scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
response.xpath('/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul')
[]

After some trial and error (I couldn't find anything in the documentation on how to list the immediate sub-elements of a given Selector; please let me know if you know of a way), I figured out that the "tbody" elements did not match in the XPath. By removing them, I was able to navigate as far as /td:

almost_email_threads = response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td')

However, if I attempt now to reach "ul" it will not work:

almost_email_threads.xpath('/ul')
[]

Now, what confuses me the most is that running:

response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td//ul')

will give me the ul's, but not in the same order as they appear on the website: some threads are skipped and others come out in a different order. Furthermore, it seems impossible to count the number of replies per thread.

What am I missing here? It's been a while since I've used Scrapy, but I don't recall it being this hard to figure out, and for whatever reason no tutorials come up on either Bing or Google for me.


Solution

  • I have never used Firebug, but looking at the HTML page you refer to, I'd say that the following XPath expression will give you all top-level threads:

    //li[not(ancestor::li) and ./a/@name]
    

    Starting from each of these list elements, you then count its descendant list items to get the number of replies in that thread.

    Using the Scrapy shell, this results in:

    > scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
    In [1]: threads = response.xpath('//li[not(ancestor::li) and ./a/@name]')
    In [2]: for thread in threads:
       ...:     print thread, len(thread.xpath('descendant::li'))
    <Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="0" href="0">Testing</a> <em'> 0
    <Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="1" href="1">full disclosure'> 4
    <Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="3" href="3">The Death Of TC'> 1
    <Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="7" href="7">Re: Announcing '> 24
    [...]
    

    Regarding your question on how to list all sub-elements from a given selector, you just need to realize that the result of running an XPath query on a selector is a SelectorList where each list element implements the Selector interface. So you can simply use XPath again to e.g. list all the children:

    In [3]: thread.xpath('child::*')
    Out[3]: 
    [<Selector xpath='child::*' data=u'<a name="309" href="309">it\'s all about '>,
     <Selector xpath='child::*' data=u'<em>Florin Andrei (Jul 31)</em>'>,
     <Selector xpath='child::*' data=u'<ul>\n<li><a name="313" href="313">it\'s a'>]