
PyQuery: how to select the second matching tag in HTML (indexing, nth-child)


a='''
<p id="A" class="hello beauty"></p>
<v id="XXX" c=1234>
<p id="B" class="beauty"></p>
<v id="YYY" c=5678>
<p id="C" class="beauty" ></p>
<p id="D" class="beauty" ></p>'''

from pyquery import PyQuery
html = PyQuery(a)

Question 1

I am trying to get the value of c from the second <v> tag (5678):

html('v')[1].attr('c')

This raises an error: 'HtmlElement' object has no attribute 'attr'

So how can I do that?

Question 2

While trying to solve the first question, I ran into another problem.

html('p:nth-child(1)').attr('id')

I get 'A'

html('p:nth-child(2)').attr('id')

I get 'D'

html('p:nth-child(3)').attr('id')

I get nothing

Where are 'B' and 'C'?

I expected:

html('p:nth-child(2)').attr('id') to return 'B'

html('p:nth-child(3)').attr('id') to return 'C'

html('p:nth-child(4)').attr('id') to return 'D'

But that is not what happens.


Solution

  • You've run into a common source of confusion, one that also comes up often with jQuery.

    html is a PyQuery object, and so is html('v'). Indexing it with [1], however, gives you a plain lxml HtmlElement, which has no PyQuery methods. To call PyQuery methods on that element, you need to wrap it in PyQuery again. So for Question 1, rewrite the call like this:

    PyQuery(html('v')[1]).attr('c')
    

    As for your second question, adding the PyQuery wrapper will not get you the desired result. The reason becomes visible if you call html.html() to see the markup as it was actually parsed:

    '<p id="A" class="hello beauty"/>\n<v id="XXX" c="1234">\n<p id="B" class="beauty"/>\n<v id="YYY" c="5678">\n<p id="C" class="beauty"/>\n<p id="D" class="beauty"/></v></v>'
    

    Notice that this is not your original markup, but a modified version that the parser has repaired into a well-formed tree. Because your <v> tags are never closed, the parser closes them wherever it sees fit, in this case at the very end. Formatted, it looks like this:

    <p id="A" class="hello beauty"/>
    <v id="XXX" c="1234">
      <p id="B" class="beauty"/>
      <v id="YYY" c="5678">
        <p id="C" class="beauty"/>
        <p id="D" class="beauty"/>
      </v>
    </v>
    

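    Here you can see that no <p> is the 3rd or 4th child of its parent: A is a first child at the top level, B is the first child of the outer <v>, and C and D are the first and second children of the inner <v>. Accordingly, the following return nothing: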

    PyQuery(html('p:nth-child(3)')).attr('id')
    PyQuery(html('p:nth-child(4)')).attr('id')
    

    What you're trying to do can instead be achieved with:

    PyQuery(html('p')[1]).attr('id')
    PyQuery(html('p')[2]).attr('id')
    PyQuery(html('p')[3]).attr('id')
    

    Notice that each of these indices is one less than the position you want, because they are list indices and therefore 0-based.

    Something that one might find confusing is that PyQuery(html('p:nth-child(2)')).attr('id') actually returns 'D'. This is because the corresponding <p> is the second child within the innermost <v>. Here's a page where one can get a better feeling for how nth-child works.