python html beautifulsoup

Beautiful Soup 4: extracting text only from a tag containing children tags


I've got this HTML snippet from a bigger document and I want to scrape the "$ 430000" string from the main div with class="title" only:

<div class="title">
 $ 430000
 <div class="container">
  <span class="price">
   $ 505000
  </span>
  <span class="discount">
   (-14.9%)
  </span>
  <div class="inner-container">
   <p class="text--bold">
    Discounted $ 75000
    <span class="discount">
     (-14.9%)
    </span>
   </p>
   <p>
    18/02/2010
   </p>
  </div>
 </div>
</div>

I know I could access the desired string through tag.stripped_strings and then take the first value from the generator:

tag = soup.find('div', {'class': 'title'})
print(next(tag.stripped_strings))

$ 430000

However, I am wondering if there is a BS4 attribute or method that targets only the text belonging directly to the <div class="title">, i.e. the "$ 430000" string. If I called tag.text I'd get

\n                    $ 430000\n                                                                    $ 505000(-14.9%)\n                                    Discounted $ 75.000(-14.9%)18/02/2021

Solution

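  • You may be looking for the .next_element attribute, which points to whatever was parsed immediately after the element you grabbed. In your snippet, the first thing after the opening <div class="title"> tag is the text node containing "$ 430000", so it would look something like this: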

    result = soup.find('div', class_='title').next_element.strip()
    # -> $ 430000
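
    A minimal, self-contained sketch of the above, assuming a trimmed-down copy of the snippet from the question and Python's built-in html.parser (the variable names are only illustrative). It also shows an alternative that is not part of the suggestion above: find_all(string=True, recursive=False) collects only the tag's direct text children, so it should not depend on the text being the very first node inside the tag.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the HTML from the question (assumption: same structure).
html = """
<div class="title">
 $ 430000
 <div class="container">
  <span class="price">
   $ 505000
  </span>
 </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", class_="title")

# .next_element is the node parsed right after the opening tag;
# here that is the text node holding the price.
via_next_element = tag.next_element.strip()
print(via_next_element)  # $ 430000

# Alternative: keep only strings that are direct children of the tag,
# ignoring text nested inside child tags.
own_text = "".join(tag.find_all(string=True, recursive=False)).strip()
print(own_text)  # $ 430000
```

    Note that .next_element only gives the wanted string when the text comes before any child tag; if the markup might start with a child element, the recursive=False variant is probably the safer choice.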