python-3.x htmltidy

Can't figure out how to invoke html5tidy from Python 3


For Python 3.5.

Can someone please point me to some documentation for using html5tidy with Python 3? I'm amazed that multiple searches don't return anything.

The docstring in html5tidy.py, as installed for Python 3, states:

"""
HTML5Tidy
=========

Simple wrapper around html5lib & lxml.etree to "tidy" html in the wild to
well-formed xml/html

Usage
-----

    >>> from html5tidy import tidy
    >>> tidy('some text')
    '<html><head/><body>some text</body></html>'

Dependencies
------------

* [html5lib](http://code.google.com/p/html5lib/)
* [lxml](http://lxml.de/)
"""

Okay, so I have all the pieces:

>>> import html5lib
>>> dir(html5lib)
['HTMLParser', '__all__', '__builtins__', '__cached__', [and so on]]
>>> 
>>> import lxml
>>> dir(lxml)
['__builtins__', '__cached__', '__doc__', '__file__', [and so on]]

But I note that dir(tidy) returns only double-underscore attributes:

>>> from html5tidy import tidy
>>> dir(tidy)
['__annotations__', '__call__', '__class__', [and so on...]'__subclasshook__']

So I read a file containing HTML into the string untidiedHTML.

>>> print(untidiedHTML)
<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="TH" style="" xmlns:ng="http://angularjs.org">
 <head ng-controller="DZHeadController">
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title ng-bind="service.title">
   What the Heck Is OAuth? - DZone Security
  </title>
  <link href="WhatIsOAuth0200_files/tranquility.css" rel="stylesheet" type="text/css"/>
 </head>
 <body class="tranquility" >
 ... and so on...

Then, per the html5tidy documentation, I try:

from html5tidy import tidy
tidiedHTML = tidy(untidiedHTML)

That produces:

Traceback (most recent call last):
  File "[path to my Python source file].py", line 50, in <module>
    tidiedHTML = tidy(untidiedHTML)
  File "/usr/local/lib/python3.5/dist-packages/html5tidy.py", line 61, in tidy
    parts = [parser.parse(src, encoding=encoding, parseMeta=parseMeta, useChardet=useChardet)]
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 289, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 130, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parseMeta'

I have NO idea what to do. I've searched for documentation that explains how to invoke html5tidy from Python 3 but I've come up empty...


Solution

  • That library is broken and/or doesn't work with Python 3.5. I installed it and ran into errors related to html5lib.HTMLParser; see https://github.com/aleray/html5tidy/blob/master/html5tidy.py#L57

    There's only one contributor, and the package has not been updated in 6 years.
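
    The traceback in the question points to a keyword mismatch: html5tidy passes parseMeta down the call chain, and the input stream in the installed html5lib rejects it. Here is a minimal sketch of that failure mode, using hypothetical stand-ins (UnicodeInputStream, parse, old_style_tidy, and describe_failure are illustrative names, not the real html5lib or html5tidy code):

```python
# Hypothetical stand-ins; these are NOT the real html5lib or html5tidy objects.
class UnicodeInputStream:
    """Mimics an input stream that takes no encoding-related keywords."""

    def __init__(self, source):
        self.source = source


def parse(source, **kwargs):
    """Mimics a parser that forwards every extra keyword to the stream."""
    return UnicodeInputStream(source, **kwargs)


def old_style_tidy(source):
    """Mimics a wrapper written against an older keyword set."""
    return parse(source, parseMeta=True)


def describe_failure(source):
    """Return the TypeError message raised by the outdated call, or None."""
    try:
        old_style_tidy(source)
    except TypeError as exc:
        return str(exc)
    return None


# The stale keyword travels down the chain and fails at the innermost call.
message = describe_failure("<p>some text</p>")
print(message)

# Calling the parser without the stale keyword succeeds.
stream = parse("<p>some text</p>")
print(stream.source)
```

    The error surfaces several frames below the wrapper because each layer forwards **kwargs untouched, which matches the shape of the traceback in the question. Whether the real library can be fixed just by dropping that argument is an assumption that would need checking against the installed html5lib version.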

    Your options are