I'm generating some HTML with Python and BeautifulSoup4. At the end, I'd like to prettify the generated HTML. If I prettify as follows:
soup.prettify()
BeautifulSoup converts all the &nbsp; characters to spaces. Unfortunately, my webpage relies on having these &nbsp; characters. After some guidance, I realized that this can be overcome by supplying a formatter to prettify:
soup.prettify(formatter='html')
Unfortunately, when I do this, although the &nbsp; characters are preserved, BeautifulSoup encodes the Cyrillic (Russian) characters in my HTML as entities, making them unreadable to me. This puts the formatter='html' option off limits for me.
(formatter='minimal' and formatter=None also don't work; they leave the Cyrillic alone, but take away the &nbsp;.)
After looking at the BeautifulSoup docs, I realized you can specify your own custom formatter using BeautifulSoup's Formatter class. Unfortunately, I am unsure how this class works, and I have been unable to find documentation for it. Does anyone know if it's possible to write a custom formatter that tells BeautifulSoup to preserve &nbsp; characters (and leave my Cyrillic characters alone)? Or is there any documentation for how this class works exactly? There are some examples in that section of the BS documentation, but after reading them, I am still unclear on how to accomplish what I'm trying to do.
EDIT: I have found different documentation, which makes it much clearer. The custom formatter is just a function you pass to the 'formatter' arg (i.e. prettify(formatter=my_func), where my_func is a function you define on your own); it gets called once for every string and attribute value encountered, and prettify uses whatever the function returns as its output. I have experimented with writing my own formatter function, and I'm able to detect whether an &nbsp; is there, but I'm unsure what to return from the function so that prettify will output the &nbsp;. See 'Example 3' below for my dummy formatter that detects &nbsp;.
Here is a dummy example demonstrating the problem:
EXAMPLE 1: Using prettify without a formatter
from bs4 import BeautifulSoup
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify()
print("\nAfter prettify:\n{}".format(soup))
Output: the Cyrillic characters are fine, but the &nbsp; is converted to ordinary whitespace
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет, мир
</span>
EXAMPLE 2: Using prettify with formatter='html'
from bs4 import BeautifulSoup
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter='html')
print("\nAfter prettify:\n{}".format(soup))
Output: the &nbsp; is preserved, but the Cyrillic characters are converted to entities and become unreadable
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет, мир
</span>
EXAMPLE 3: Supplying a custom formatter. This is just a dummy formatter for the sake of the example, to detect whether an &nbsp; is there. What should I return from this function if I want the &nbsp; to be preserved? (P.S. It seems &nbsp; is parsed as \xa0, which is why I'm checking for it this way.)
def check_for_nbsp(text):
    # &nbsp; is parsed as \xa0, so look for that character
    if '\xa0' in text:
        return text + " <-- HAS"
    else:
        return text + " <-- DOESN'T HAVE"
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter=check_for_nbsp)
print("\nAfter prettify:\n{}".format(soup))
Output:
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет, мир <-- HAS
</span>
Is there a way to get the best of both worlds: preserve the &nbsp; AND the Cyrillic characters? Alternatively, is there a reliable Python package other than BeautifulSoup that prettifies HTML?
Here is a previous Stack Overflow question I posted about the mangled Cyrillic characters. That's what led me to understand I should remove the formatter='html' option; unfortunately, this removes the &nbsp; characters, which is equally problematic.
I was able to solve this problem. I discovered, in these docs, the EntitySubstitution class in the bs4.dammit module. It implements Beautiful Soup's standard formatters as class methods: the "html" formatter (which preserves &nbsp; characters) is EntitySubstitution.substitute_html. This lets you get that formatter's behavior, but then do extra things.
(P.S. &nbsp; is parsed by BeautifulSoup as \xa0.)
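A quick way to see which character this is, without Beautiful Soup, is the standard library's html.unescape, which decodes the entity to the same character (U+00A0). This is only an illustrative sketch; the variable names are made up for the example.

```python
import html
import unicodedata

# Decode the entity the way an HTML parser would.
decoded = html.unescape("Привет,&nbsp;мир")

# "Привет," is 7 characters long, so index 7 is the character
# that the &nbsp; entity turned into.
nbsp = decoded[7]

print(repr(nbsp))              # '\xa0'
print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(" " in decoded)          # False: there is no ordinary space in the string
```

This is why the formatter has to look for '\xa0' rather than for the literal text of the entity.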
Here is the code:
from bs4 import BeautifulSoup
from bs4.dammit import EntitySubstitution # don't miss this import statement!
def preserve_nbsp_and_ru(text):
    """Custom formatter: keep &nbsp; entities and leave Cyrillic untouched.

    prettify calls this function for every string and attribute value it
    encounters, and writes whatever it returns to the prettified output.

    Strategy:
    - Split the string on non-breaking spaces (U+00A0).
    - Return the text between them as is.
    - Run each non-breaking space through
      EntitySubstitution.substitute_html, which turns it back into &nbsp;.
    """
    # &nbsp; is parsed as \xa0 in BS: 'a\xa0b\xa0c' --> ['a', 'b', 'c']
    pieces = text.split('\xa0')
    # put the nbsp through the EntitySubstitution function, which preserves it
    nbsp = EntitySubstitution.substitute_html('\xa0')
    # re-join the untouched pieces, with an &nbsp; between each pair
    return nbsp.join(pieces)
hello = '<span>Привет,&nbsp;мир</span>'
soup = BeautifulSoup(hello, 'html.parser')
print("\nBefore prettify:\n{}".format(soup))
soup = soup.prettify(formatter=preserve_nbsp_and_ru)
print("\nAfter prettify:\n{}".format(soup))
Output:
Before prettify:
<span>Привет, мир</span>
After prettify:
<span>
Привет,&nbsp;мир
</span>
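One caveat with a function formatter like the one above: Beautiful Soup uses its return value verbatim, so any &, < or > in the text is no longer escaped. A possible variant, sketched here with only the standard library (the name escape_keep_nbsp is hypothetical, not part of Beautiful Soup), escapes those three characters first and then emits the entity for U+00A0:

```python
import html


def escape_keep_nbsp(text):
    """Escape &, < and >, emit &nbsp; for U+00A0, leave everything else alone."""
    # quote=False leaves quote characters unchanged; only &, < and > are escaped.
    escaped = html.escape(text, quote=False)
    # Cyrillic (and any other non-ASCII text) passes through untouched;
    # only the non-breaking space is turned back into an entity.
    return escaped.replace("\xa0", "&nbsp;")


print(escape_keep_nbsp("Привет,\xa0мир & <b>"))
# Привет,&nbsp;мир &amp; &lt;b&gt;
```

It would be passed the same way, as soup.prettify(formatter=escape_keep_nbsp). Whether the extra escaping matters depends on whether your text can contain those characters.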