python beautifulsoup

Suggestions on get_text() in BeautifulSoup


I am using BeautifulSoup to parse some content from an HTML page.

I can extract the content I want from the HTML (i.e. the text contained in a span with the class myclass).

result = mycontent.find(attrs={'class':'myclass'})

I obtain this result:

<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>

If I try to extract the text using:

result.get_text()

I obtain:

Lorem ipsumdolor sit amet,consectetur...

As you can see, when the <br> tags are removed there is no spacing left between the pieces of text, so words on either side of a tag are concatenated.

How can I solve this issue?


Solution

  • If you are using bs4, you can iterate over the tag's strings and join them with a space:

    " ".join(result.strings)
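
For anyone who wants to try this end to end, here is a minimal sketch using the markup from the question. It assumes bs4 is installed and uses the built-in "html.parser"; the variable names (html, soup, joined, with_separator, stripped) are only illustrative. It also shows get_text() with its separator argument, which should give the same result without a manual join.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '<span class="myclass">Lorem ipsum<br/>dolor sit amet,<br/>consectetur...</span>'

soup = BeautifulSoup(html, "html.parser")
result = soup.find(attrs={"class": "myclass"})

# Option 1: join the individual text nodes with a space.
joined = " ".join(result.strings)

# Option 2: let get_text() insert the separator between text nodes.
with_separator = result.get_text(separator=" ")

# Optional: strip=True also trims whitespace around each piece
# and skips pieces that are empty after trimming.
stripped = result.get_text(separator=" ", strip=True)

print(joined)
print(with_separator)
print(stripped)
```

If the real page has newlines or indentation inside the span, the strip=True variant (or " ".join(result.stripped_strings)) is probably the one you want, since it avoids doubled spaces.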