Can't figure out how to invoke html5Tidy from Python 3

For Python 3.5.

Can someone please point me to some documentation for using html5tidy with Python 3? I'm amazed that multiple searches don't return anything.

In Python 3, the documentation in html5tidy.py states:

"""
HTML5Tidy
=========

Simple wrapper around html5lib & lxml.etree to "tidy" html in the wild to
well-formed xml/html

Usage
-----

    >>> from html5tidy import tidy
    >>> tidy('some text')
    '<html><head/><body>some text</body></html>'

Dependencies
------------

* [html5lib](http://code.google.com/p/html5lib/)
* [lxml](http://lxml.de/)

Okay, so I have all the pieces:

>>> import html5lib
>>> dir(html5lib)
['HTMLParser', '__all__', '__builtins__', '__cached__', [and so on]]
>>> 
>>> import lxml
>>> dir(lxml)
['__builtins__', '__cached__', '__doc__', '__file__', [and so on]]

BUT I note that dir(tidy) returns only double-underscore results:

>>> from html5tidy import tidy
>>> dir(tidy)
['__annotations__', '__call__', '__class__', [and so on...]'__subclasshook__']

So I open a file containing HTML as untidiedHTML.

>>> print(untidiedHTML)
<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="TH" style="" xmlns:ng="http://angularjs.org">
 <head ng-controller="DZHeadController">
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title ng-bind="service.title">
   What the Heck Is OAuth? - DZone Security
  </title>
  <link href="WhatIsOAuth0200_files/tranquility.css" rel="stylesheet" type="text/css"/>
 </head>
 <body class="tranquility" >
 ... and so on...

Then per the HTML5 tidy documentation I try:

from html5tidy import tidy
tidiedHTML = tidy(untidiedHTML)

That produces:

Traceback (most recent call last):
  File "[path to my Python source file].py", line 50, in <module>
    tidiedHTML = tidy(untidiedHTML)
  File "/usr/local/lib/python3.5/dist-packages/html5tidy.py", line 61, in tidy
    parts = [parser.parse(src, encoding=encoding, parseMeta=parseMeta, useChardet=useChardet)]
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 289, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 130, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parseMeta'

I have NO idea what to do. I've searched for documentation that explains how to invoke html5tidy from Python 3 but I've come up empty...

Perzan answered 16/5, 2018 at 22:28

That library is broken and/or doesn't work with Python 3.5. I installed it and ran into errors related to html5lib.HTMLParser: https://github.com/aleray/html5tidy/blob/master/html5tidy.py#L57

There's only one contributor, and the package has not been updated in 6 years.

Your options are:

  • fork the repo, fix the issues, and submit a pull request
  • extract the code you need and roll your own
  • find another library
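
For the roll-your-own route, here is a minimal sketch, assuming a reasonably recent html5lib (1.x) and lxml are both installed. The `tidy` function below is my own stand-in that mirrors the usage shown in the html5tidy docstring; it is not html5tidy's actual code. It avoids the `parseMeta` keyword from your traceback by letting html5lib handle the input on its own.

```python
import html5lib
from lxml import etree


def tidy(src):
    """Leniently parse HTML (str or bytes) and return well-formed markup as str.

    Hypothetical replacement for html5tidy.tidy, not the original code.
    """
    # Ask html5lib to build an lxml tree. No encoding-related keywords
    # (such as parseMeta) are passed, so str input is accepted as-is.
    tree = html5lib.parse(src, treebuilder="lxml", namespaceHTMLElements=False)
    # Serialize the lxml tree back to text.
    return etree.tostring(tree, encoding="unicode")


if __name__ == "__main__":
    print(tidy("some text"))
```

You would then call `tidiedHTML = tidy(untidiedHTML)` exactly as in your question. If you need pretty-printing or a different serialization (HTML rather than XML), `etree.tostring` has `pretty_print` and `method` arguments worth looking at.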
Strickler answered 17/5, 2018 at 1:35 Comment(2)
Wow -- thank you for that! I was amazed that numerous Web searches weren't returning results. You've explained why that's the case. - Perzan
Glad I could help! - Strickler
