python, python-3.x, web-scraping, beautifulsoup

bs4 `next_sibling` VS `find_next_sibling`


I'm struggling with next_sibling (and similarly with next_element). Used as attributes, they don't give me anything back, but find_next_sibling (or find_next) works.

According to the docs, find_next_sibling is built on next_siblings. What does next_sibling depend on, and why does it return nothing?

from bs4 import BeautifulSoup

html = """
<div class="......">
 <div class="one-ad-desc">
  <div class="one-ad-title">
   <a class="one-ad-link" href="www this is the URL!">
    <h5>
     Text needed
    </h5>
   </a>
  </div>
  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>
 </div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

for div in soup.find_all('div', class_="one-ad-title"):
    print('-> ', div.next_element)
    print('-> ', div.next_sibling)
    print('-> ', div.find_next_sibling())
    break

Output

->  

->  

->  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>

Solution

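  • The key difference is what each one is allowed to return. .next_sibling and .next_element return the very next node in the parse tree, whatever its type, while .find_next_sibling() and .find_next() skip over text nodes and return the next tag.

    Your HTML has line breaks and indentation between the tags, and Beautiful Soup keeps that whitespace as text nodes (NavigableString). So .next_sibling is the whitespace between the two <div> tags, and .next_element is the whitespace just inside the <div>, before the <a>. Both print as blank lines.

    Print the name of each result and you will see that the first two are not tags (a text node's .name is None):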

    from bs4 import BeautifulSoup
    
    html = """
    <div class="......">
      <div class="one-ad-desc">
        <div class="one-ad-title">
          <a class="one-ad-link" href="www this is the URL!">
            <h5>Text needed</h5>
          </a>
        </div>
        <div class="one-ad-desc">
          ...and some more needed text here!
        </div>
      </div>
    </div>"""
    
    soup = BeautifulSoup(html, 'lxml')
    
    for div in soup.find_all('div', class_="one-ad-title"):
        print('-> ', div.next_element.name)
        print('-> ', div.next_sibling.name)
        print('-> ', div.find_next_sibling().name)
    
    #output
    ->  None
    ->  None
    ->  div
    

    If you put the input on a single line with no whitespace between the tags, all three return tags:

    from bs4 import BeautifulSoup
    
    html = """
    <div class="......"><div class="one-ad-desc"><div class="one-ad-title"><a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a></div><div class="one-ad-desc">...and some more needed text here!</div></div></div>"""
    
    soup = BeautifulSoup(html, 'lxml')
    
    for div in soup.find_all('div', class_="one-ad-title"):
        print('-> ', div.next_element)
        print('-> ', div.next_sibling)
        print('-> ', div.find_next_sibling())
    

    Output:

    ->  <a class="one-ad-link" href="www this is the URL!"><h5>Text needed</h5></a>
    ->  <div class="one-ad-desc">...and some more needed text here!</div>
    ->  <div class="one-ad-desc">...and some more needed text here!</div>
    

    Note that "Text needed" is not in a sibling of your selected tag; it is inside one of its descendants. To select it, use print('-> ', div.find_next().text).
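
To make the whitespace visible, it can help to inspect the type of each node rather than printing it directly. The following is a minimal sketch, not part of the original answer. It assumes a trimmed copy of the HTML (the outer div is dropped and the href is a placeholder) and uses the built-in html.parser instead of lxml so it runs without extra dependencies. The helper first_tag_sibling is a hypothetical name that only illustrates how a tag-only lookup can be built on .next_siblings.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Trimmed version of the HTML from the question (placeholder href).
html = """
<div class="one-ad-desc">
  <div class="one-ad-title">
    <a class="one-ad-link" href="#"><h5>Text needed</h5></a>
  </div>
  <div class="one-ad-desc">
    ...and some more needed text here!
  </div>
</div>
"""

# Assumption: html.parser is used here instead of lxml.
soup = BeautifulSoup(html, "html.parser")
title = soup.find("div", class_="one-ad-title")

# The attributes return the adjacent node, here a whitespace-only string.
print(type(title.next_sibling).__name__, repr(title.next_sibling))
print(type(title.next_element).__name__, repr(title.next_element))


def first_tag_sibling(node):
    """Hypothetical helper: return the first following sibling that is a Tag."""
    for sibling in node.next_siblings:
        if isinstance(sibling, Tag):
            return sibling
    return None


# With no arguments, find_next_sibling() also skips the text nodes.
desc = title.find_next_sibling()
print(desc is first_tag_sibling(title))

# get_text(strip=True) removes the surrounding whitespace from the text.
print(title.get_text(strip=True))
print(desc.get_text(strip=True))
```

Using repr() shows the newline and indentation characters that print() renders as an apparently empty line.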