python pyquery

PyQuery: how to get all the text of an element, including the text after its child nodes


Here is the code:

from pyquery import PyQuery

content = '''<td field="exceptions"><div style="white-space:normal;height:auto;" \
class="datagrid-cell datagrid-cell-c2-exceptions">Traceback (most recent call last):<br>\
  File "./crawler.py", line 381, in &lt;module&gt;<br>   \
   crawler.start()<br>  File "./crawler.py", line 153, in start<br> \
      raise RemoteTransportException(e)<br>RemoteTransportException: \
      This socket is already used by another greenlet: &lt;bound method Waiter.\
      switch of &lt;gevent.hub.Waiter object at 0x7f64d499d6e0&gt;&gt;<br></div></td>'''
pq = PyQuery(content)

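# Loop variable renamed to `div` so it no longer overwrites the `content` string.
for div in pq('td div'):
    print(div.text)  # prints only: Traceback (most recent call last):


for div in pq('td div'):
    for sub in div.getchildren():
        print(sub.text)  # each child is a <br>, which has no text of its own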


# Traceback (most recent call last):
# None
# None
# None
# None
# None
# None

As shown above, I want to get the full content of the td div element, which should be:

Traceback (most recent call last):
File "./crawler.py", line 381, in <module>
crawler.start()
File "./crawler.py", line 153, in start
raise RemoteTransportException(e)
RemoteTransportException: This socket is already used by another greenlet: <bound method Waiter.switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>>

But I only get Traceback (most recent call last):. How can I get all the text inside td div when it contains child tags?


Solution

  • You could use BeautifulSoup instead:

    import bs4
    soup = bs4.BeautifulSoup(content, 'html.parser')
    soup.find('td').find('div').text
    u'Traceback (most recent call last):  File "./crawler.py", line 381, in <module>      crawler.start()  File "./crawler.py", line 153, in start       raise RemoteTransportException(e)RemoteTransportException:       This socket is already used by another greenlet: <bound method Waiter.      switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>>'
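
    As for why the original loop prints None: the elements PyQuery hands back follow the ElementTree model, where text that comes after a child tag is stored in that child's .tail rather than in the parent's .text. Below is a minimal sketch of that model using the standard library's xml.etree.ElementTree, which exposes the same .text, .tail and itertext() attributes. The fragment is a shortened, hypothetical stand-in for the markup in the question, with <br/> self-closed so the XML parser accepts it; it is an illustration, not a drop-in replacement for the PyQuery code.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the <td><div> markup in the question
# (hypothetical sample; <br/> is self-closed so the stdlib XML parser accepts it).
fragment = (
    '<td><div>Traceback (most recent call last):<br/>'
    'File "./crawler.py", line 381, in &lt;module&gt;<br/>'
    'crawler.start()<br/></div></td>'
)

div = ET.fromstring(fragment).find('div')

# .text only covers the text before the first child element.
print(div.text)

# Each <br/> has no text of its own; whatever follows it is stored in .tail.
for br in div:
    print(repr(br.text), repr(br.tail))

# itertext() walks .text and .tail in document order, so joining it
# recovers the whole content of the div (without line breaks).
full_text = ''.join(div.itertext())
print(full_text)

# Rebuild one line per <br/> by pairing div.text with each non-empty tail.
lines = [div.text] + [br.tail for br in div if br.tail]
print('\n'.join(lines))
```

    With the real page, the same idea should carry over to the elements yielded by pq('td div'): read each child's .tail, or join itertext(), instead of reading .text on the <br> nodes.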