Here is the code:
from pyquery import PyQuery
content = '''<td field="exceptions"><div style="white-space:normal;height:auto;" \
class="datagrid-cell datagrid-cell-c2-exceptions">Traceback (most recent call last):<br>\
File "./crawler.py", line 381, in <module><br> \
crawler.start()<br> File "./crawler.py", line 153, in start<br> \
raise RemoteTransportException(e)<br>RemoteTransportException: \
This socket is already used by another greenlet: <bound method Waiter.\
switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>><br></div></td>'''
pq = PyQuery(content)
for div in pq('td div'):
    print(div.text)  # prints only: Traceback (most recent call last):
for div in pq('td div'):
    for sub in div.getchildren():
        print(sub.text)
# Traceback (most recent call last):
# None
# None
# None
# None
# None
# None
As you can see, I want to get the full content of the td div element. It should be:
Traceback (most recent call last):
File "./crawler.py", line 381, in <module>
crawler.start()
File "./crawler.py", line 153, in start
raise RemoteTransportException(e)
RemoteTransportException: This socket is already used by another greenlet: <bound method Waiter.switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>>
But I only got the first line, Traceback (most recent call last):
So how can I get all the text in a td div that has child tags in it?
You could use BeautifulSoup instead:
import bs4
soup = bs4.BeautifulSoup(content)
soup.find('td').find('div').text
u'Traceback (most recent call last): File "./crawler.py", line 381, in <module> crawler.start() File "./crawler.py", line 153, in start raise RemoteTransportException(e)RemoteTransportException: This socket is already used by another greenlet: <bound method Waiter. switch of <gevent.hub.Waiter object at 0x7f64d499d6e0>>'
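As for why the original loop prints only the first line: in the ElementTree model (which lxml, and therefore PyQuery, follows), an element's .text holds only the text before its first child, and text after a child such as <br> is stored on that child as its .tail. Here is a minimal sketch of that behaviour using the standard library's xml.etree.ElementTree on a simplified, well-formed stand-in for the markup (the snippet variable is made up for illustration; the real HTML would need an HTML parser such as lxml.html). With PyQuery itself, calling pq('td div').text() should likewise collect the nested text.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the <td><div>...</div></td> markup
# (hypothetical sample; the real page is HTML, not XML).
snippet = (
    '<td><div>Traceback (most recent call last):<br/>'
    'File "./crawler.py", line 153, in start<br/>'
    'raise RemoteTransportException(e)<br/></div></td>'
)

div = ET.fromstring(snippet).find('div')

# .text holds only the text before the first child element.
first_chunk = div.text

# Text that follows each <br/> is stored on that <br/> as its .tail.
tails = [child.tail for child in div]

# itertext() yields .text and every non-empty .tail in document order.
lines = [chunk.strip() for chunk in div.itertext() if chunk.strip()]
full_text = '\n'.join(lines)
print(full_text)
```

Joining the chunks with a newline gives one line per <br>, which is closer to the layout asked for in the question than a single space-separated string.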