python-3.x content-type web-development-server content-encoding

Python shows a different view of the content than the web browser


I have a Python inventory update script that runs nightly, pulling inventory from a website. It recently started failing, and on further investigation I found that the page source looks normal when I view it in a web browser (View Source), but looks very strange when I print it to the console with Python, which is breaking the script. Has anyone seen anything like this, or does anyone know what causes it?

The web browser shows this (url redacted):

<ul class='vnav vnav__subnav vnav--level2'>
<li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Folding Tables</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bookcases</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Printer Stands</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Computer Desks</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Office Chairs</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Filing Cabinets</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Letter Holders</a>
</li></ul>
</li>
<li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bathroom</a>
<ul class='vnav vnav__subnav vnav--level2'>
<li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bathroom Mirrors</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bathroom Sinks</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bathroom Cabinets</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bathroom Vanities</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Laundry Hampers</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Bath Towel Sets</a>
</li><li class='vnav__item'><a href='https://xx.htm' class='vnav__link'>Shower Curtains</a>
</li></ul>

But Python print() in a console shows this (URL redacted):

<ul class="vnav vnav__subnav vnav--level2">
<li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Folding Tables</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Bookcases</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Printer Stands</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Computer Desks</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Office Chairs</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Filing Cabinets</a>
</li><li class="vnav__item"><a class="vnav__link" href="https://xx.htm">Letter Holders</a>
</li></ul>
</li>
<li class="vnav__item"><a href="https:">/   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a   t   h   r   o   o   m   /   a   &gt;   
   u   l       c   l   a   s   s   =   '   v   n   a   v       v   n   a   v   _   _   s   u   b   n   a   v       v   n   a   v   -   -   l   e   v   e   l   2   '   &gt;   
   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a   t   h   r   o   o   m       M   i   r   r   o   r   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a   t   h   r       o   m       S   i   n   k   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a       h   r   o   o   m       C   a   b   i   n   e   t   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a       h   r   o   o   m       V   a   n   i   t   i   e   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   L   a   u   n       r   y       H   a   m   p   e   r   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   B   a   t   h       T   o   w   e   l       S   e   t   s   /   a   &gt;   
   /   l   i   &gt;   l   i       c   l   a   s   s   =   '   v   n   a   v   _   _   i   t   e   m   '   &gt;   a       h   r   e   f   =   '   h   t   t   p   s   :   /   /   x   x   .   h   t   m   '       c   l   a   s   s   =   '   v   n   a   v   _   _   l   i   n   k   '   &gt;   S   h   o   w   e           C   u   r   t   a   i   n   s   /   a   &gt;   
   /   l   i   &gt;   /   u   l   &gt;

In the web browser the content type is "text/html" and the encoding is "ISO-8859-1", but Python reports the encoding as "UTF-8". Also, in the Python console output, the entire remainder of the HTML appears with the extra spaces between characters, except for the very end, which goes back to normal (although it looks like the closing tags appear twice, which is a different issue):

 /   b   o   d   y   &gt;   
   /   h   t   m   l   &gt;   
</a></li></ul></div></nav></body></html>

Finally, if I try to decode using UTF-8 instead of ISO-8859-1, I get the following error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 74396: invalid continuation byte
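
For reference, the error above is what a strict UTF-8 decode produces when it meets a Latin-1 byte such as 0xe9 ("é" in ISO-8859-1) followed by plain ASCII. The sketch below reproduces that on a made-up byte string, not the real page. The helper name `decode_page` and the sample bytes are assumptions for illustration only and are not part of the original script.

```python
def decode_page(raw: bytes, declared: str = "iso-8859-1") -> str:
    """Decode raw page bytes, trying UTF-8 first.

    Hypothetical helper: falls back to the charset the server declared
    (ISO-8859-1 in the question) when the bytes are not valid UTF-8.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this cannot fail,
        # but it may still yield the wrong text if the guess is wrong.
        return raw.decode(declared)


# Made-up sample: 0xe9 is "é" in ISO-8859-1 but an incomplete sequence in UTF-8.
sample = b"caf\xe9 au lait"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict UTF-8 decode failed:", exc)

print(decode_page(sample))
```

Because the fallback never raises, it can hide a wrong charset guess, so it is worth inspecting the raw bytes when the decoded text looks odd.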

Solution

  • Never mind, figured it out.

    Top tip: when working in different virtualenvs, always make sure the Python version is the one you expect. I didn't check this originally, but as I was hopping back and forth between them, I decided to verify. The Python version I had assumed was being used wasn't. Once I made the switch... yup! Better.
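
One way to catch this kind of mismatch early is to have the script report which interpreter it is running under and stop if the version is not the expected one. This is a minimal sketch using only the standard library. The function names and the `(3, 0)` minimum are placeholders chosen for illustration, since the post does not say which versions were involved.

```python
import sys


def describe_interpreter() -> dict:
    """Collect basic facts about the interpreter running this script."""
    return {
        "executable": sys.executable,
        "version": tuple(sys.version_info[:3]),
        "prefix": sys.prefix,
        # Inside a venv, sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != sys.base_prefix,
    }


def require_version(minimum: tuple) -> None:
    """Raise RuntimeError if the running Python is older than `minimum`."""
    if sys.version_info[:len(minimum)] < minimum:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in minimum)
        raise RuntimeError(
            f"Python {wanted}+ expected, but running {found} "
            f"from {sys.executable}"
        )


print(describe_interpreter())
require_version((3, 0))  # placeholder minimum; adjust to the real requirement
```

Running this at the top of a nightly job makes the active interpreter show up in the logs, which is usually enough to spot a virtualenv pointing at the wrong Python.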