I am trying to write an HTML parser, but during testing I do not want to query the website every time, so I saved the page locally as an HTML file.
For reading I use:
urltext = urllib.request.urlopen(urlfile).read().decode("utf-8")
When I read from the website directly I get a correct string to parse, but when I open the file from my local PC it seems to be decoded wrongly:
<span id="line845"></span> </span><span><<span class="start-tag">h2</span> <span class="attribute-name">class</span>="<a class="attribute-value">article-title</a>"></span><span>
<span id="line846"></span> </span><span><<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline-intro</a>"></span><span>Intro:</span><span></<span class="end-tag">span</span>></span><span> </span><span><<span class="start-tag">span</span> <span class="attribute-name">class</span>="<a class="attribute-value">headline</a>"></span><span>Main text</span><span></<span class="end-tag">span</span>></span><span></span><span></<span class="end-tag">h2</span>></span><span>
Originally it should look like this:
<h2 class="article-title">
<span class="headline-intro">Intro:</span> <span class="headline">Main Text</span></h2>
Any ideas what I am doing wrong?
Thanx
Kev
You downloaded the HTML file incorrectly, but your method of opening it looks correct.
It sounds like you opened the web page's source code in your browser, copy-pasted that into LibreOffice, and used LibreOffice's "Save as HTML" feature. This won't work, because HTML is a plain-text markup format and LibreOffice is a rich-text word processor -- that means LibreOffice saves information like font, size, color, decorations, images, etc. right in the file.
The "Save as HTML" feature in LibreOffice is meant to convert a normal document into a webpage -- not to save HTML markup that you typed into the document.
In order to download a document the proper way, find your browser's "save" functionality. In most browsers, you can just press Ctrl / Cmd + S. When you're finished, open the file in a plain-text editor (such as Notepad, Gedit, or TextEdit) to be sure it looks as expected.
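Alternatively, you could let Python save the page, so that the local copy holds exactly the bytes the server sent. The following is only a minimal sketch: the helper names save_page and load_page, the URL, and the file name are made up for illustration, and it assumes the page is UTF-8 encoded, as in your urlopen call.

```python
import urllib.request
from pathlib import Path


def save_page(url, path):
    """Download url and write the response bytes to path unchanged."""
    with urllib.request.urlopen(url) as response:
        Path(path).write_bytes(response.read())


def load_page(path, encoding="utf-8"):
    """Read a saved page back as text (assumes the page is UTF-8)."""
    return Path(path).read_bytes().decode(encoding)


# Hypothetical usage; the URL and file name are placeholders:
#   save_page("https://example.com/article", "article.html")  # run once
#   urltext = load_page("article.html")                       # run in every test
```

Because the bytes are written without any conversion, the string returned by load_page should contain the plain markup (for example <h2 class="article-title">) rather than the syntax-highlighted <span class="start-tag"> wrappers shown in the question.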