I am now confused by something I thought I understood but, it turns out, had just been taking for granted.
One frequently encounters this type of for loop:
from bs4 import BeautifulSoup as bs

mystring = 'some string'
soup = bs(mystring, 'html.parser')
for elem in soup.find_all():
    pass  # do something with elem
What I hadn't paid much attention to is what elem actually is, until I ran into a version of this simplified string:
mystring = 'opening text<p>text one<BR> text two.<br></p>\
<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>\
<p>text six <span style="some style">text seven</span></p>\
<p>text 8. <span style="some other style">text nine</span></p>closing text'
I'm not sure anymore what I expected the output to be, but when I ran this code:
soup = bs(mystring, 'html.parser')  # re-parse, now with the longer string
counter = 1  # using 'normal' counting for simplification
for elem in soup.find_all():
    print('elem', counter, elem)
    counter += 1
The output was:
elem 1 <p>text one<br/> text two.<br/></p>
elem 2 <br/>
elem 3 <br/>
elem 4 <p align="right">text three<br> text four.</br></p>
elem 5 <br> text four.</br>
elem 6 <p class="myclass">text five. </p>
elem 7 <p>text six <span style="some style">text seven</span></p>
elem 8 <span style="some style">text seven</span>
elem 9 <p>text 8. <span style="some other style">text nine</span></p>
elem 10 <span style="some other style">text nine</span>
So bs4 + html.parser found 10 elements in the string. Their selection and presentation seemed unintuitive to me (for example, skipping 'opening text' and 'closing text'). Not only that, but the output of print(len(soup)) turned out to be 7!
So just to make sure, I swapped out html.parser for both lxml and html5lib. In both cases, not only was print(len(soup)) equal to 1, but the number of elems jumped up to 13! And, naturally, the extra elements were different. From the 4th elem through the end, both libraries were identical to html.parser. For the first three, however...

With html5lib you get:
elem 1 <html><head></head><body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem 2 <head></head>
elem 3 <body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
With lxml, on the other hand, you get:
elem 1 <html><body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem 2 <body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
elem 3 <p>opening text</p>
So what is the philosophy behind all this? Whose 'fault' is it? Is there a 'right' or 'wrong' answer? And, practically speaking, should I just follow one parser religiously, or is there a time and place for each?
Apologies for the length of the question.
First, the root object (in your case, the soup variable) is a BeautifulSoup object. You can think of it like the document object in a browser. In bs4, the BeautifulSoup object is derived from the Tag object, but it isn't really an "element" per se; it is more like the document.
When you call len on an element (or on the BeautifulSoup object), you get the number of nodes in the object's contents member. This can contain comments, document processing statements, text nodes, element nodes, etc.
A well formed document should have one root element, but comments and document processing statements are okay at the root level as well. In your case, with no comments and no processing statements, I would normally expect a length of 1.
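For example, here is a minimal sketch (with made-up markup) showing that len just counts the root-level nodes in contents, whatever their type:

from bs4 import BeautifulSoup

# a comment, an element, and bare text, all at the root level
soup = BeautifulSoup('<!-- a comment --><p>hi</p>trailing text', 'html.parser')
print(len(soup), len(soup.contents))  # 3 3
for node in soup.contents:
    print(type(node).__name__)  # Comment, Tag, NavigableString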
lxml and html5lib try to make sure you have a well-formed document: if they see multiple root elements, they will wrap everything in html and body tags to give you a single root element. Though, as mentioned before, you may still get a length > 1 if your document already has a proper root html element and also has comments or processing statements at the root level. Depending on the parser, it may also manipulate other content to satisfy whatever rules it enforces when handed weird, malformed HTML.
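A quick way to see the wrapping, assuming lxml and html5lib are installed alongside bs4:

from bs4 import BeautifulSoup

fragment = 'a<p>b</p><p>c</p>z'  # two root elements plus bare text
for parser in ('html.parser', 'lxml', 'html5lib'):
    soup = BeautifulSoup(fragment, parser)
    print(parser, len(soup), [type(n).__name__ for n in soup.contents])
# html.parser leaves four root nodes; lxml and html5lib wrap everything
# in a single <html> root, so their length is 1 (details of the wrapping
# can vary a little by parser version)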
html.parser, on the other hand, is very lenient. It doesn't try to correct what you are doing and just parses things as they are. In your case, it returns a weird document with multiple text nodes and multiple <p> elements at the root level. So when you call len on soup, you get a value much greater than 1.
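A quick check against the mystring from your question:

from bs4 import BeautifulSoup

soup = BeautifulSoup(mystring, 'html.parser')  # mystring as defined above
print(len(soup))  # 7
print([type(n).__name__ for n in soup.contents])
# ['NavigableString', 'Tag', 'Tag', 'Tag', 'Tag', 'Tag', 'NavigableString']
# i.e. 'opening text', the five <p> elements, and 'closing text'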
In general, the initial object returned by BeautifulSoup is the BeautifulSoup object. It may contain Tag nodes or NavigableString nodes (text), which come in various subtypes depending on whether they represent a comment, a document declaration, CDATA, or some other processing statement. NavigableStrings (and their subtypes) are not Tag nodes, but they are usually contained within the contents of a Tag or the BeautifulSoup object.
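This is also why find_all() skipped your 'opening text' and 'closing text': with no arguments it returns only Tag nodes. A small sketch:

from bs4 import BeautifulSoup

soup = BeautifulSoup('opening text<p>one</p>closing text', 'html.parser')
print(soup.find_all())  # [<p>one</p>] -- the bare text is not returned
# walking .descendants visits the NavigableStrings too
for node in soup.descendants:
    print(type(node).__name__, repr(str(node)))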
Depending on whether you favor leniency, speed, HTML5 correctness, XML support, etc., that may sway which parser you wish to use. You may also sometimes want a different parser for a very specific use case.
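One concrete illustration, assuming lxml is installed (bs4's 'xml' mode relies on it): the XML parser preserves tag case, while the HTML parsers lowercase tag names, which changes what a name search will match.

from bs4 import BeautifulSoup

xml_doc = '<Root><Item/></Root>'
print(BeautifulSoup(xml_doc, 'xml').find('Item'))          # <Item/>
print(BeautifulSoup(xml_doc, 'html.parser').find('Item'))  # None (names were lowercased)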