How to parse HTML with entities such as &nbsp; using the built-in ElementTree library in Python 2 and Python 3?

Sometimes you want to parse a reasonably well-formed HTML page without adding an extra dependency such as BeautifulSoup or lxml. The built-in ElementTree is a natural first choice: it is part of the standard library, it is fast (implemented in C), and it offers a much richer interface (such as XPath support) than the basic HTMLParser, which has limitations of its own.

ElementTree works until it encounters an entity such as &nbsp;, which it does not handle by default.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''
et = ET.fromstring(html)

Run it on Python 2 or Python 3 and you will see this error:

xml.etree.ElementTree.ParseError: undefined entity: line 7, column 38
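
To look at the failure from code rather than from a traceback, here is a minimal sketch. The one-line `snippet` input is a made-up stand-in for the page above, not part of the original example; it only assumes that an undefined entity with no DOCTYPE present raises `ET.ParseError`.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal input: one HTML entity that XML does not predefine,
# and no DOCTYPE that could declare it.
snippet = "<div>a&nbsp;b</div>"

message = None
position = None
try:
    ET.fromstring(snippet)
except ET.ParseError as err:
    # ParseError carries the expat message plus a (line, column) tuple.
    message = str(err)
    position = err.position

print(message)
print(position)
```

Catching `ET.ParseError` like this is also a cheap way to decide whether a page needs one of the entity workarounds at all.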

There are some Q&As out there, such as this one and that one. They suggest using ElementTree.XMLParser().parser.UseForeignDTD(True), but I cannot get that to work in Python 3.3 or Python 3.4:

$ python3.3
Python 3.3.5 (v3.3.5:62cf4e77f785, Mar  9 2014, 01:12:57) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import xml.etree.ElementTree as ET
>>> ET.XMLParser().parser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'xml.etree.ElementTree.XMLParser' object has no attribute 'parser'
>>> 
Pontius answered 24/2, 2016 at 1:14 Comment(0)

Inspired by this post, we can simply prepend a DOCTYPE declaration that defines the entity to the incoming raw HTML content, and then ElementTree works out of the box.

This works on Python 2.6, 2.7, 3.3, and 3.4.

import xml.etree.ElementTree as ET

html = '''<html>
    <div>Some reasonably well-formed HTML content.</div>
    <form action="login">
    <input name="foo" value="bar"/>
    <input name="username"/><input name="password"/>

    <div>It is not unusual to see &nbsp; in an HTML page.</div>

    </form></html>'''

magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
            <!ENTITY nbsp ' '>
            ]>'''  # You can define more entities here, if needed

et = ET.fromstring(magic + html)
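
If the page uses more entities than &nbsp;, the declarations can be generated instead of typed by hand. The following is only a sketch under a few assumptions: the helper names `build_doctype` and `parse_html` are made up for illustration, the entity table comes from the standard library (`html.entities` on Python 3, `htmlentitydefs` on Python 2), and the incoming markup does not already start with its own DOCTYPE or XML declaration.

```python
import xml.etree.ElementTree as ET

try:
    from html.entities import name2codepoint  # Python 3
except ImportError:
    from htmlentitydefs import name2codepoint  # Python 2

# XML already predefines these five entities, so they are not redeclared.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def build_doctype():
    """Return a DOCTYPE whose internal subset declares the HTML 4 entities."""
    declarations = "\n".join(
        # Each entity maps to a numeric character reference, e.g. &#160;
        '<!ENTITY %s "&#%d;">' % (name, codepoint)
        for name, codepoint in sorted(name2codepoint.items())
        if name not in XML_PREDEFINED
    )
    return "<!DOCTYPE html [\n%s\n]>" % declarations


def parse_html(markup):
    """Parse reasonably well-formed HTML (hypothetical helper)."""
    return ET.fromstring(build_doctype() + markup)


root = parse_html("<div>caf&eacute;&nbsp;&copy;</div>")
print(repr(root.text))
```

Unlike the hand-written `magic` string, this maps &nbsp; to the real no-break space (U+00A0) rather than a plain space; swap in whatever replacement text suits your use case.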
Pontius answered 24/2, 2016 at 1:14 Comment(2)
You can shorten this by omitting PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd". It still works without that bit. - Trilbie
I was even able to retain the escaped characters with <!ENTITY nbsp '&amp;nbsp;'> - Saraband

As another alternative, adding an entry to the parser's entity attribute worked for me:

parser.entity["nbsp"] = ' '
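
For context, here is a self-contained sketch of how that line might be used. It assumes `parser` is an `ET.XMLParser` instance and that the document carries a DOCTYPE with an external identifier; as far as I can tell, without such a DOCTYPE expat rejects the undefined entity before the `entity` dictionary is ever consulted. The sample `doc` string is made up for illustration.

```python
import xml.etree.ElementTree as ET

parser = ET.XMLParser()
# Map the entity name (without & and ;) to its replacement text.
parser.entity["nbsp"] = u"\u00a0"

# Assumption: the input has a DOCTYPE with a PUBLIC/SYSTEM identifier.
# The DTD itself is not downloaded by ElementTree.
doc = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    "<html><div>a&nbsp;b</div></html>"
)

root = ET.fromstring(doc, parser=parser)
print(repr(root.find("div").text))
```

A parser instance cannot be reused after it has finished a document, so create a fresh `XMLParser` and fill in `entity` again for each page.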
Comras answered 12/9, 2018 at 22:36 Comment(0)