python html encoding beautifulsoup lxml

'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>


I'm trying to parse a very long HTML file with lxml through BeautifulSoup. I know that the file's character encoding is UTF-8 with BOM, but whenever I run contents = f.read() I get the following error:

'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>

This is the first (and problematic) bit of my code:

from bs4 import BeautifulSoup

with open("doc.html", "r") as f:

    contents = f.read()

    soup = BeautifulSoup(contents, 'lxml')

    print(soup.h2)
    print(soup.head)
    print(soup.li)

This is the full traceback:

    UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-1-4805460879e0> in <module>
      3 with open("doc.html", "r") as f:
      4 
----> 5     contents = f.read()
      6 
      7     soup = BeautifulSoup(contents, 'lxml')

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 33222: character maps to <undefined>

Solution

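  • with open("doc.html", "r", encoding="UTF-8") as f should solve your issue. Without an explicit encoding, open() falls back to the platform default, which the traceback shows is cp1252 here, and cp1252 has no character assigned to byte 0x8d.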
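
Since the question says the file is UTF-8 with a BOM, a variant worth considering is the "utf-8-sig" codec, which decodes UTF-8 and also drops the leading BOM, whereas plain "UTF-8" leaves it in the string as U+FEFF. Below is a minimal sketch; read_html and the demo file are hypothetical names used only for illustration and are not part of the original code.

```python
import os
import tempfile


def read_html(path):
    """Read an HTML file as UTF-8, dropping a leading BOM if one is present.

    Hypothetical helper for illustration only.
    """
    # "utf-8-sig" strips the BOM when present and behaves like plain UTF-8 otherwise.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


# Demo on a throwaway file that starts with a UTF-8 BOM (EF BB BF).
sample = "<h2>caf\u00e9</h2>"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.html")
    with open(demo_path, "wb") as f:
        f.write(b"\xef\xbb\xbf" + sample.encode("utf-8"))

    # BOM removed by the "utf-8-sig" codec.
    contents = read_html(demo_path)

    # For comparison: plain UTF-8 decodes fine but keeps the BOM as U+FEFF.
    with open(demo_path, "r", encoding="utf-8") as f:
        with_bom = f.read()
```

The returned string can then be passed to BeautifulSoup(contents, 'lxml') exactly as in the question.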