html python-3.4 urllib string-decoding

How do I decode an HTML file when opening it from a local file instead of the web?


I am trying to write an HTML parser, but during testing I do not want to query the website every time, so I saved the page locally as an HTML file.

For reading I use:

urltext = urllib.request.urlopen(urlfile).read().decode("utf-8")

From the website directly I get a correct string to parse, but when I open the file from my local PC it seems to be decoded incorrectly:

<span id="line845"></span>                          </span><span>&lt;<span class="start-tag">h2</span> <span class="attribute-name">class</span>="<a class="attribute-value">article-title</a>"&gt;</span><span>
<span id="line846"></span>                                          </span><span>&lt;<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline-intro</a>"&gt;</span><span>Intro:</span><span>&lt;/<span class="end-tag">span</span>&gt;</span><span> </span><span>&lt;<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline</a>"&gt;</span><span>Main text</span><span>&lt;/<span class="end-tag">span</span>&gt;</span><span></span><span>&lt;/<span class="end-tag">h2</span>&gt;</span><span>

Originally it should look like this:

<h2 class="article-title">
                                            <span class="headline-intro">Intro:</span> <span class="headline">Main Text</span></h2>

Any ideas what I am doing wrong?

Thanks

Kev


Solution

  • You downloaded the HTML file incorrectly, but your method of opening it looks correct.

    It sounds like you opened the web page's source code in your browser, copy-pasted that into LibreOffice, and used LibreOffice's "Save as HTML" feature. This won't work, because HTML is a plain-text markup format and LibreOffice is a rich-text word processor -- that means LibreOffice saves information like font, size, color, decorations, images, etc. right in the file. As a result, the original tags end up escaped (for example `&lt;h2`) and wrapped in extra `<span>` elements, so your parser sees them as text rather than markup.

    The "Save as HTML" feature in LibreOffice is meant to convert a normal document into a web page -- not to save HTML markup that you typed into the document.

    In order to download a document the proper way, find your browser's "save" functionality. In most browsers, you can just press Ctrl / Cmd + S. When you're finished, open the file in a plain-text editor (such as Notepad, Gedit, or TextEdit) to be sure it looks as expected.
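
    If you would rather create the test file from Python than from the browser, something like the following might work. It is only a rough sketch: the function names `save_page`, `load_page`, and `looks_like_source_view` are made up for this example, UTF-8 is assumed because that is what the question decodes with, and the `start-tag` check is just a heuristic based on the broken output shown in the question.

```python
import urllib.request
from pathlib import Path


def save_page(url, path):
    """Download the raw bytes behind `url` and write them to `path` unchanged."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    # Writing bytes (not text) keeps the file identical to what the server sent.
    Path(path).write_bytes(raw)


def load_page(path, encoding="utf-8"):
    """Read a saved page back and decode it the same way as the live response."""
    return Path(path).read_bytes().decode(encoding)


def looks_like_source_view(text):
    """Heuristic: escaped tags inside syntax-highlighting spans suggest that a
    rendered view of the source was saved instead of the page itself."""
    return 'class="start-tag"' in text and "&lt;" in text


# Example usage (the URL and file name are placeholders):
#   save_page("https://example.com/", "page.html")
#   urltext = load_page("page.html")
#   if looks_like_source_view(urltext):
#       print("This file does not contain the original markup.")
```

    Saving the bytes untouched means the local copy decodes exactly like the live response, so the parser can be tested offline against the same string. If you want to keep the `urlopen(...)` line from the question, a local file needs to be passed as a `file://` URL (for example via `Path("page.html").resolve().as_uri()`).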