Tags: beautifulsoup, html-parsing, lxml, html5lib

What exactly is a BS4 'element', how are elements counted, and which parser gets to decide? Obviously confused


I am now confused by something I thought I understood but, it turns out, have been taking for granted.

One frequently encounters this type of for loop:

from bs4 import BeautifulSoup as bs
mystring = 'some string'
soup = bs(mystring, 'html.parser')
for elem in soup.find_all():
    ...  # do something with elem

What I hadn't paid much attention to is what elem actually is, until I ran into a version of this simplified string:

mystring = 'opening text<p>text one<BR> text two.<br></p>\
<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>\
<p>text six <span style="some style">text seven</span></p>\
<p>text 8. <span style="some other style">text nine</span></p>closing text'

I'm not sure anymore what I expected the output to be, but when I ran this code:

soup = bs(mystring, 'html.parser')
counter = 1  # using 'normal' counting for simplification
for elem in soup.find_all():
    print('elem ', counter, elem)
    counter += 1

The output was:

elem  1 <p>text one<br/> text two.<br/></p>
elem  2 <br/>
elem  3 <br/>
elem  4 <p align="right">text three<br> text four.</br></p>
elem  5 <br> text four.</br>
elem  6 <p class="myclass">text five. </p>
elem  7 <p>text six <span style="some style">text seven</span></p>
elem  8 <span style="some style">text seven</span>
elem  9 <p>text 8. <span style="some other style">text nine</span></p>
elem  10 <span style="some other style">text nine</span>

So bs4 + html.parser found 10 elements in the string. Their selection and presentation seemed unintuitive to me (for example, skipping opening text and closing text). Not only that, but the output of print(len(soup)) turned out to be 7!
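For reference, both counts can be reproduced in one self-contained snippet (same mystring as above, joined into a single literal):

```python
from bs4 import BeautifulSoup as bs

mystring = ('opening text<p>text one<BR> text two.<br></p>'
            '<p align="right">text three<br/> text four.</p><p class="myclass">text five. </p>'
            '<p>text six <span style="some style">text seven</span></p>'
            '<p>text 8. <span style="some other style">text nine</span></p>closing text')

soup = bs(mystring, 'html.parser')
print(len(soup.find_all()))  # number of tags found by find_all()
print(len(soup))             # number of top-level nodes in soup.contents
```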

So just to make sure, I swapped out html.parser for both lxml and html5lib. In both cases, not only was the output of print(len(soup)) now 1, but the number of elems jumped up to 13! And, naturally, the extra elements were different. From the 4th elem through the end, both libraries were identical to html.parser. For the first three, however...

With html5lib you get:

elem  1 <html><head></head><body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem  2 <head></head>
elem  3 <body>opening text<p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>

With lxml, on the other hand, you get:

elem  1 <html><body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body></html>
elem  2 <body><p>opening text</p><p>text one<br/> text two.<br/></p><p align="right">text three<br/> text four.</p><p class="myclass">text five. </p><p>text six <span style="some style">text seven</span></p><p>text 8. <span style="some other style">text nine</span></p>closing text</body>
elem  3 <p>opening text</p>

So what is the philosophy behind all this? Whose 'fault' is it? Is there a 'right' or 'wrong' answer? And, practically speaking, should I just follow religiously one parser or is there a time and place for each?

Apologies for the length of the question.


Solution

  • First, the root object (in your case, the soup variable) is a BeautifulSoup object. You can think of it like the document object in a browser. In Beautiful Soup, the BeautifulSoup object is derived from the Tag class, but it isn't really an "element" per se; it is more like the document.

    When you call len() on an element (or on a BeautifulSoup object), you get the number of nodes in the object's .contents list. This can include comments, processing instructions, text nodes, element nodes, etc.
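As a minimal sketch of that, here is the contents list of a small made-up fragment parsed with html.parser:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('leading text<p>hello</p><!-- a comment -->', 'html.parser')

# .contents holds every top-level node: text, elements, comments, ...
for node in soup.contents:
    print(type(node).__name__)   # NavigableString, Tag, Comment

# len() of the soup is simply the length of its .contents list
print(len(soup), len(soup.contents))
```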

    A well-formed document should have one root element, but comments and processing instructions are allowed at the root level as well. In your case, with no comments and no processing instructions, I would normally expect a length of 1.
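For example (another made-up fragment), a root-level comment sitting next to a proper html element still counts toward the length:

```python
from bs4 import BeautifulSoup

doc = '<!-- license header --><html><body><p>hi</p></body></html>'
soup = BeautifulSoup(doc, 'html.parser')

# One Comment node plus one root <html> element at the top level
print(len(soup))  # 2
```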

    lxml and html5lib try to ensure you have a well-formed document: if they see multiple root elements, they will wrap everything in html and body tags and give you a single root element. Though, as mentioned before, you may still get a length > 1 if your document already has a proper root html element and also has comments or processing instructions at the root level. Depending on the parser, it may manipulate other content to adhere to whatever rules it enforces when handed weird, malformed HTML.
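A quick way to see the wrapping (this sketch assumes lxml is installed):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('opening text<p>text one</p>closing text', 'lxml')

# lxml wraps the fragment in <html><body>...</body></html>,
# so there is exactly one root node
print(len(soup))              # 1
print(soup.contents[0].name)  # html
```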

    On the other hand, html.parser is very lenient. It doesn't try to correct what you are doing and just parses things as they are. In your case, it returns a weird document with multiple text nodes at the root level, along with multiple <p> elements at the root level. So when you call len() on soup, you get a value much greater than 1.
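Running the same fragment through html.parser shows the difference; nothing gets wrapped:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('opening text<p>text one</p>closing text', 'html.parser')

# Three top-level nodes (text, <p>, text) and no <html>/<body> added
print(len(soup))
print([type(n).__name__ for n in soup.contents])
```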

    In general, the initial object returned by Beautiful Soup is the BeautifulSoup object. It may contain Tag nodes (elements) or NavigableString nodes (text), the latter of which come in various subtypes depending on whether they are a comment, a document declaration, CDATA, or another processing instruction. NavigableStrings (and their subtypes) are not element nodes, but they are usually contained within the .contents of a Tag or BeautifulSoup object.
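In practice, the distinction matters when you walk .contents directly rather than using find_all(); a sketch:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

soup = BeautifulSoup('opening text<p>one</p><p>two</p>closing text', 'html.parser')

# Separate element nodes from text nodes at the root level
for node in soup.contents:
    if isinstance(node, Tag):
        print('tag:', node.name)
    elif isinstance(node, NavigableString):
        print('text:', node.strip())
```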

    Depending on whether you favor leniency, speed, HTML5 correctness, XML support, etc., it may sway which parser you wish to use. Also, you may sometimes wish to use other parsers for very specific use cases.