I found HTMLParser for SAX and xml.dom.minidom for XML. My HTML is pretty well formed, so I don't need a very robust parser. Any suggestions?
Take a look at BeautifulSoup. It's popular and excellent at parsing HTML.
I would recommend lxml. I like BeautifulSoup, but there are maintenance issues generally and compatibility issues with the later releases. I've been happy using lxml.
Later: the best recommendations are to use lxml, html5lib, or BeautifulSoup 3.0.8. BeautifulSoup 3.1.x is meant for Python 3.x and is known to have problems with earlier Python versions, as noted on the BeautifulSoup website.
Ian Bicking has a good article on using lxml.
ElementTree is a further recommendation, but I have never used it.
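For anyone curious what the ElementTree route looks like, here is a minimal sketch using only the standard library. It assumes the markup is well-formed enough to pass as XML (closed tags, no undefined named entities); the sample string and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; ElementTree only accepts it because it is well-formed XML.
html_string = "<html><head><title>title</title></head><body><p>test</p><p>more</p></body></html>"

root = ET.fromstring(html_string)

# iter() walks the whole tree, so the nesting depth of <p> does not matter.
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)
```

On real-world HTML with unclosed tags this will raise a ParseError, which is why the other suggestions in this thread are usually preferred.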
2012-01-18: someone has come by and decided to downvote me and Bartosz because we recommended Python packages that are easily obtained but not part of the Python distribution. So for the highly literal StackOverflowers: "You can use xml.dom.minidom, but no one will recommend this over the alternatives."
BeautifulSoup and lxml are great, but they are not appropriate answers here, since the question is about builtins. Here is an example of using the builtin minidom module to parse an HTML string. Tested with CPython 3.5.2:
from xml.dom.minidom import parseString
html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p>test</p></body></html>
"""
# extract the text value of the document's <p> tag:
doc = parseString(html_string)
paragraph = doc.getElementsByTagName("p")[0]
content = paragraph.firstChild.data
print(content)
However, as indicated in Jesse Hogan's comment, this will fail on HTML entities not recognized by minidom. Here is an updated solution using the Python 3 html.parser module:
from html.parser import HTMLParser
html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p> test</p><div>not in p</div></body></html>
"""
class Parser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.in_p = []  # stack of currently open <p> tags

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p.append(tag)

    def handle_endtag(self, tag):
        # only pop if a <p> is actually open, so a stray </p> does not raise
        if tag == 'p' and self.in_p:
            self.in_p.pop()

    def handle_data(self, data):
        if self.in_p:
            print("<p> data :", data)
parser = Parser()
parser.feed(html_string)
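If you would rather get the text back than print it, the same idea can be sketched with a parser that collects into a list. The class name TextInP and the sample markup are hypothetical. As far as I know, convert_charrefs defaults to True since Python 3.5, so entities arrive in handle_data already decoded.

```python
from html.parser import HTMLParser


class TextInP(HTMLParser):
    """Collects text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs=True is the default on Python 3.5+
        self.depth = 0      # number of <p> elements currently open
        self.chunks = []    # text pieces seen while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        # Ignore a stray </p> that has no matching <p>.
        if tag == "p" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


parser = TextInP()
parser.feed("<html><body><p>caf&eacute;&nbsp;bar</p><div>not in p</div></body></html>")
parser.close()

# Join the pieces, since the parser may deliver text in more than one call.
text = "".join(parser.chunks)
print(repr(text))
```

Unlike the minidom version, this does not choke on named entities such as the non-breaking space used in the sample.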
To handle DOM objects, you can use HTMLDOM for python.
There is a trick using only Python 3 builtin functions (3.4+): use html.unescape to decode all HTML5 entities, then use html.escape to encode <>"& back to entities for the XML parser, leaving the other entities as Unicode characters in the string.
#! /usr/bin/python3
import re
import xml.dom.minidom
from html import escape, unescape
def minidom_parseHtml(text: str):
    "parse html text with non-xml html-entities as minidom"
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)
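A quick way to try the helper, with the function repeated so the snippet runs on its own. The sample markup and the plain_ok flag are made up for illustration. The point is only that plain parseString rejects an entity the XML parser does not know, while the pre-processed text parses.

```python
import re
import xml.dom.minidom
from html import escape, unescape
from xml.parsers.expat import ExpatError


def minidom_parseHtml(text: str):
    """Decode named entities, then re-escape only the XML special characters."""
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)


# Hypothetical sample mixing an HTML-only entity with XML's predefined ones.
sample = "<html><body><p>a&nbsp;b &amp; c &lt;d&gt;</p></body></html>"

try:
    xml.dom.minidom.parseString(sample)
    plain_ok = True
except ExpatError:
    plain_ok = False  # expat reports the nbsp entity as undefined

doc = minidom_parseHtml(sample)
content = doc.getElementsByTagName("p")[0].firstChild.data
print(plain_ok, repr(content))
```

Note that the regex only touches named entities; numeric references such as &#160; are already valid XML and pass through unchanged.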