python web-scraping beautifulsoup find siblings

Find next siblings until a certain one using beautifulsoup


The webpage is something like this:

<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>

<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>

How can I find each section together with the articles inside it? That is, after finding an h2, how do I collect its next siblings until the next h2?

If the webpage looked like this (which is normally the case):

<div>
<h2>section1</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

<div>
<h2>section2</h2>
<p>article</p>
<p>article</p>
<p>article</p>
</div>

I can write code like:

for section in soup.findAll('div'):
    ...
    for post in section.findAll('p'):
        ...

But what should I do with the first webpage if I want to get the same result?


Solution

  • I think you can do something like this:

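    from bs4 import Tag

    # Assumes `soup` is the already-parsed page, e.g.
    # soup = BeautifulSoup(html, "html.parser")
    for section in soup.find_all('h2'):
        # Walk the nodes that follow this <h2> at the same level.
        for node in section.next_siblings:
            if not isinstance(node, Tag):
                continue  # skip the whitespace text nodes between tags
            if node.name != 'p':
                break  # reached the next <h2> (or any other non-<p> tag)
            print(node.string)
        print("*****")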
    

    Given:

    <h2>section1</h2>
    <p>article1</p>
    <p>article2</p>
    <p>article3</p>
    
    <h2>section2</h2>
    <p>article4</p>
    <p>article5</p>
    <p>article6</p>
    

    Output:

    article1
    article2
    article3
    *****
    article4
    article5
    article6
    *****
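
If you need the grouped result rather than printed output (closer to what the nested `div` loop in the question gives you), the same sibling walk can collect the articles per heading. The following is only a sketch under some assumptions: the function name `group_sections` is made up for illustration, the built-in `html.parser` is used, and the walk stops at the next `h2` (as the question asks) instead of at the first non-`p` tag.

```python
from bs4 import BeautifulSoup, Tag


def group_sections(html):
    """Map each <h2> heading's text to the texts of the <p> tags that follow it.

    `group_sections` is a hypothetical helper name, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    for heading in soup.find_all("h2"):
        articles = []
        for node in heading.next_siblings:
            if not isinstance(node, Tag):
                continue  # skip whitespace/text nodes between tags
            if node.name == "h2":
                break  # the next section starts here
            if node.name == "p":
                articles.append(node.get_text())
        sections[heading.get_text()] = articles
    return sections


html = """
<h2>section1</h2>
<p>article1</p>
<p>article2</p>
<p>article3</p>

<h2>section2</h2>
<p>article4</p>
<p>article5</p>
<p>article6</p>
"""

for title, articles in group_sections(html).items():
    print(title, articles)
```

Note that two headings with identical text would overwrite each other in the dictionary; a list of `(heading, articles)` pairs avoids that if it matters for your page.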