Is there a built-in package to parse HTML into a DOM?

I

5

50

I found HTMLParser for SAX-style parsing and xml.dom.minidom for XML. My HTML is pretty well formed, so I don't need a very robust parser. Any suggestions?

Inescutcheon answered 6/5, 2010 at 15:6 Comment(1)

Could you accept velotron's answer please, since it's the one that solves the builtin requirement? meta.stackexchange.com/questions/120568/… – Belsen 12/4, 2018 at 13:39

L

16

Take a look at BeautifulSoup. It's popular and excellent at parsing HTML.

Laverne answered 6/5, 2010 at 15:10 Comment(7)

it's not built in if I'm not mistaken – Inescutcheon 6/5, 2010 at 15:12

No, it's not built-in. But you can easily install it using easy_install or just download from the website and put into PYTHONPATH. Whole BeautifulSoup is contained in a single file, so it's not much of a burden. – Laverne 6/5, 2010 at 15:17

BeautifulSoup is supposed to parse dirty HTML not "pretty well formed" one. – Hippocrene 5/1, 2015 at 11:20

I have added a working example of using the builtin xml.dom.minidom, which answers the original question. – Handbook 22/11, 2016 at 19:7

this is not a built-in package therefore this answer should be taken as a valid answer ! – Hartwig 25/4, 2020 at 20:56

@Hartwig you mean it should NOT be taken as the valid answer. – Sprout 27/1, 2021 at 23:28

Yes should « NOT » – Hartwig 28/1, 2021 at 6:40

F

31

I would recommend lxml. I like BeautifulSoup, but there are maintenance issues generally and compatibility issues with the later releases. I've been happy using lxml.

Later: the best recommendations are to use lxml, html5lib, or BeautifulSoup 3.0.8. BeautifulSoup 3.1.x is meant for python 3.x and is known to have problems with earlier python versions, as noted on the BeautifulSoup website.

Ian Bicking has a good article on using lxml.

ElementTree is a further recommendation, but I have never used it.

2012-01-18: someone has come by and decided to downvote me and Bartosz because we recommended python packages that are easily obtained but not part of the python distribution. So for the highly literal StackOverflowers: "You can use xml.dom.minidom, but no one will recommend this over the alternatives."

Francisco answered 6/5, 2010 at 15:57 Comment(2)

for what it's worth, i tried to parse some HTML using both ElementTree and xml minidom, and they both choked with parse errors in script tags (javascript)! – Adrianople 8/10, 2014 at 22:8

I just added an answer with a working example of xml.dom.minidom. In some situations, installing an external module is burdensome or impossible. Plus that is what the original question asked for. – Handbook 22/11, 2016 at 19:6

H

22

BeautifulSoup and lxml are great, but they are not appropriate answers here since the question is about builtins. Here is an example of using the builtin minidom module to parse an HTML string. Tested with CPython 3.5.2:

from xml.dom.minidom import parseString

html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p>test</p></body></html>
"""

# extract the text value of the document's <p> tag:
doc = parseString(html_string)
paragraph = doc.getElementsByTagName("p")[0]
content = paragraph.firstChild.data

print(content)

However, as indicated in Jesse Hogan's comment, this will fail on HTML entities not recognized by minidom. Here is an updated solution using the Python 3 html.parser module:

from html.parser import HTMLParser

html_string = """
<!DOCTYPE html>
<html><head><title>title</title></head><body><p>&nbsp;test</p><div>not in p</div></body></html>
"""

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()
        # Stack of currently open <p> tags; non-empty while inside a paragraph.
        self.in_p = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.in_p.append(tag)

    def handle_endtag(self, tag):
        # Guard against a stray </p> that has no matching opening tag.
        if tag == 'p' and self.in_p:
            self.in_p.pop()

    def handle_data(self, data):
        if self.in_p:
            print("<p> data :", data)

parser = Parser()
parser.feed(html_string)
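Since the question asks for a DOM rather than event callbacks, the same html.parser approach can be extended to build a small tree. The following is only a minimal sketch: the Node and TreeBuilder classes, the find_all helper, and the VOID_TAGS set are hypothetical names made up for illustration, not part of the standard library, and the sketch assumes reasonably well-formed input.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML (assumed subset, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node; not a standard-library class."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects, in document order
        self.text = ""      # text found directly inside this element

    def find_all(self, tag):
        """Yield every descendant element with the given tag name."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


builder = TreeBuilder()
builder.feed('<html><body><p>&nbsp;test</p><div>not in p</div></body></html>')
builder.close()

paragraphs = [p.text for p in builder.root.find_all("p")]
divs = [d.text for d in builder.root.find_all("div")]
print(paragraphs, divs)
```

Because HTMLParser converts character references by default in Python 3.5+, the &nbsp; entity arrives in handle_data as a non-breaking space instead of raising an error.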

Handbook answered 22/11, 2016 at 19:2 Comment(1)

This would raise an exception on common HTML entities such as &nbsp; or &reg;. – Rozella 16/5, 2018 at 5:10

O

3

To handle DOM objects, you can use HTMLDOM for python.

Ovation answered 19/4, 2014 at 14:3 Comment(0)

R

0

There is a trick using only Python 3 builtin functions (3.4+).

Use html.unescape to decode all HTML5 named entities into Unicode characters. Then use html.escape to re-encode <, >, & and quote characters as entities the XML parser understands, leaving every other character as plain Unicode in the string.

#! /usr/bin/python3
import re
import xml.dom.minidom
from html import escape, unescape

def minidom_parseHtml(text: str):
    """Parse HTML text containing non-XML HTML entities with minidom."""
    # Decode each named entity, then re-escape only the XML-special characters.
    textXML = re.sub(r"&\w+;", lambda x: escape(unescape(x.group(0))), text)
    return xml.dom.minidom.parseString(textXML)
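A short usage sketch of the function above, repeated here so the snippet runs on its own. The sample string is made up for illustration; it also shows that plain parseString rejects the same input, which is the failure this trick works around.

```python
import re
import xml.dom.minidom
from html import escape, unescape
from xml.parsers.expat import ExpatError


def minidom_parseHtml(text):
    # Decode each named entity, then re-escape only the XML-special characters.
    text_xml = re.sub(r"&\w+;", lambda m: escape(unescape(m.group(0))), text)
    return xml.dom.minidom.parseString(text_xml)


# Hypothetical sample input containing entities that XML does not define.
sample = "<html><body><p>&nbsp;test &amp; &reg;</p></body></html>"

# Plain minidom fails on &nbsp; and &reg; because XML does not know them.
try:
    xml.dom.minidom.parseString(sample)
    plain_ok = True
except ExpatError:
    plain_ok = False

doc = minidom_parseHtml(sample)
text = doc.getElementsByTagName("p")[0].firstChild.data
print(plain_ok, repr(text))
```

Note that this still requires the markup itself to be well-formed XML (closed tags, quoted attributes); it only addresses the entity problem.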

Reitareiter answered 15/5, 2023 at 1:8 Comment(0)
