html python-3.x web-scraping beautifulsoup

Finding <div> with BeautifulSoup


I want to extract a block of text within a div tag. I've seen several posts discussing various div attributes, but the tag I want has no attributes: it's just <div>.

Below is an excerpt of the HTML. There are dozens of div tags above and below it, but this is the only one that is just <div>.

<div>
      <!-- Some text. -->
      <i>
       [Text I want block 1]
      </i>
      text I want 1
      <br/>
      text I want 2
      <br/>
      text I want 3
      <br/>
      <br/>
 </div>

However, any find method with "div" returns too much. I tried the following:

1) String and tag searches pick up every tag containing div

soup.find("div")

soup.div

2) Isolating the parent and then searching for div within it still returns too much.

divParent = soup.find("div", class_="col-xs-12 col-lg-8 text-center")
divParent.find("div")

Any ideas? div seems to be too common a tag/string to isolate.


Solution

  • Here is one way to do it:

    from bs4 import BeautifulSoup
    
    content='''
    <div>
          <!-- Some text. -->
          <i>
           [Text I want block 1]
          </i>
          text I want 1
          <br/>
          text I want 2
          <br/>
          text I want 3
          <br/>
          <br/>
     </div>
    '''
    
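    # "lxml" requires the lxml package; "html.parser" is the built-in alternative.
    soup = BeautifulSoup(content, "lxml")
    # Find each <i> nested in a <div>, then take the text of its parent <div>.
    data = ''.join(item.parent.text.strip() for item in soup.select('div i'))
    print(data)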
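
    Since the question says the target is the only div without attributes, another option is to match on that directly. The following is a minimal sketch, not a tested fix for the real page: the surrounding markup (the wrapper div, the "header" and "footer" divs) and the helper name is_bare_div are made up for illustration, and it uses "html.parser" so that lxml is not required.

    ```python
    from bs4 import BeautifulSoup

    # Hypothetical page: the wrapper and sibling divs carry attributes,
    # the target div carries none.
    html = """
    <div class="col-xs-12 col-lg-8 text-center">
      <div id="header">Header</div>
      <div>
        <!-- Some text. -->
        <i>[Text I want block 1]</i>
        text I want 1
        <br/>
        text I want 2
        <br/>
        text I want 3
        <br/>
        <br/>
      </div>
      <div class="footer">Footer</div>
    </div>
    """

    soup = BeautifulSoup(html, "html.parser")


    def is_bare_div(tag):
        # True only for <div> tags whose attribute dict is empty.
        return tag.name == "div" and not tag.attrs


    # find_all accepts a function and keeps the tags for which it returns True.
    bare_divs = soup.find_all(is_bare_div)
    target = bare_divs[0]

    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips whitespace-only nodes.
    lines = list(target.stripped_strings)
    print(lines)
    ```

    If the real page turns out to contain more than one attribute-less div, bare_divs will hold all of them and you would need a further condition (for example, checking that the div has an <i> child) to pick the right one.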