Validate an HTML fragment using html5lib

I'm using Python and html5lib to check if a bit of HTML code entered on a form field is valid.

I tried the following code to test a valid fragment, but I'm getting an error I didn't expect:

>>> import html5lib
>>> from html5lib.filters import lint
>>> fragment = html5lib.parseFragment('<p><script>alert("Boo!")</script></p>')
>>> walker = html5lib.getTreeWalker('etree')
>>> [i for i in lint.Filter(walker(fragment))]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/xyz/html5lib-1.0b3-py2.7.egg/html5lib/filters/lint.py", line 28, in __iter__
    raise LintError(_("Tag name is not a string: %(tag)r") % {"tag": name})
LintError: Tag name is not a string: u'p'

What am I doing wrong?

My default encoding is utf-8:

>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
Jubbulpore answered 10/4, 2015 at 17:52

The lint filter doesn't attempt to validate HTML (uh, yeah, documentation is needed, badly… this is a large part of the reason there is no 1.0 release yet); it merely checks that the treewalker API is adhered to. Except it currently doesn't even do that, because it's broken by issue #172.
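To make "the treewalker API" a bit more concrete, here is a minimal sketch of what a treewalker emits for the fragment from the question. It assumes a recent html5lib on Python 3 and the default etree tree builder; the variable names are only illustrative. The lint filter is meant to check the shape of these tokens, not the validity of the HTML they came from.

```python
import html5lib

# Parse the fragment from the question and walk the resulting tree.
fragment = html5lib.parseFragment('<p><script>alert("Boo!")</script></p>')
walker = html5lib.getTreeWalker("etree")

# Each token is a plain dict; its "type" key says which kind of event it is
# (StartTag, EndTag, Characters, ...).
tokens = list(walker(fragment))
start_tags = [t["name"] for t in tokens if t["type"] == "StartTag"]
print(start_tags)
```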

html5lib doesn't attempt to provide any validator, as it's a lot of work to implement an HTML validator.

I'm unaware of any reasonably complete validator except for Validator.nu, though that is written in Java. It provides a web API which might be suitable for your purposes, however.
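A rough sketch of calling that web API from Python follows. Several things here are assumptions rather than anything stated above: the endpoint URL and the `out=json` parameter, the `User-Agent` string, and the helper names `wrap_fragment`, `build_request` and `check_fragment`, which are made up for illustration. The fragment is wrapped in a minimal document because the service checks whole pages.

```python
import json
import urllib.request

# Assumption: the public Validator.nu endpoint; out=json asks for JSON output.
CHECKER_URL = "https://validator.nu/?out=json"


def wrap_fragment(fragment):
    """Embed the fragment in a minimal document, since the checker expects a whole page."""
    return (
        '<!DOCTYPE html><html lang="en"><head><meta charset="utf-8">'
        "<title>fragment</title></head><body>" + fragment + "</body></html>"
    )


def build_request(fragment):
    """Build (but do not send) the POST request carrying the wrapped fragment."""
    return urllib.request.Request(
        CHECKER_URL,
        data=wrap_fragment(fragment).encode("utf-8"),
        headers={
            "Content-Type": "text/html; charset=utf-8",
            # Hypothetical client name; identify your own application here.
            "User-Agent": "fragment-checker-example/0.1",
        },
        method="POST",
    )


def check_fragment(fragment, timeout=10):
    """Send the request and return the messages the service reports as errors."""
    with urllib.request.urlopen(build_request(fragment), timeout=timeout) as response:
        report = json.load(response)
    return [m for m in report.get("messages", []) if m.get("type") == "error"]
```

Calling `check_fragment('<p><form></form></p>')` performs a network request, so for form validation you would probably want a timeout and a fallback for when the service is unreachable.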

Agra answered 1/5, 2015 at 18:0

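The "strict" parsing mode can be used to detect errors. With strict=True the parser raises a ParseError at the first parse error instead of recovering from it: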

>>> import html5lib
>>> html5parser = html5lib.HTMLParser(strict=True)
>>> html5parser.parseFragment('<p>Lorem <a href="/foobar">ipsum</a>')
<Element 'DOCUMENT_FRAGMENT' at 0x7f1d4a58fd60>
>>> html5parser.parseFragment('<p>Lorem </a>ipsum<a href="/foobar">')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Unexpected end tag (a). Ignored.
>>> html5parser.parseFragment('<p><form></form></p>')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Unexpected end tag (p). Ignored.
>>> html5parser.parseFragment('<option value="example" />')
Traceback (most recent call last):
  ...
html5lib.html5parser.ParseError: Trailing solidus not allowed on element option
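For a form field it may be handier to wrap this in a helper. The sketch below is one possible shape, not an official html5lib recipe: `first_parse_error` and `all_parse_errors` are made-up names, and the second function relies on the parser's `errors` list and the `E` message table from `html5lib.constants`, used here in non-strict mode to collect every error rather than stopping at the first.

```python
import html5lib
from html5lib.constants import E
from html5lib.html5parser import ParseError


def first_parse_error(fragment):
    """Return the first parse error message, or None if the fragment parses cleanly."""
    parser = html5lib.HTMLParser(strict=True)
    try:
        parser.parseFragment(fragment)
    except ParseError as exc:
        return str(exc)
    return None


def all_parse_errors(fragment):
    """Return every parse error message, letting the parser recover after each one."""
    parser = html5lib.HTMLParser(strict=False)
    parser.parseFragment(fragment)
    # parser.errors holds (position, error code, format variables) tuples.
    return [E[code] % datavars for _position, code, datavars in parser.errors]
```

Either helper only reports parse errors as defined by the parsing algorithm, so a fragment that passes has parsed without recovery; that is a narrower check than full conformance validation.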
Monochromatic answered 27/3, 2020 at 12:33