Don't put html, head and body tags automatically, beautifulsoup

I'm using BeautifulSoup with html5lib, and it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there any option I can set to turn off this behavior?

Credential answered 11/2, 2013 at 22:33 Comment(2)
What are you actually trying to do? If you're trying to parse it as a fragment of a document (like innerHTML does), then you want a different API.Hardener
I created a feature request to update the docs. This issue should be explained in the porting docs. Feature Request: bugs.launchpad.net/beautifulsoup/+bug/1370364 Porting docs: crummy.com/software/BeautifulSoup/bs4/doc/#porting-code-to-bs4Indefensible
Z
55
In [35]: import bs4 as bs

In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>

This parses the HTML with Python's builtin HTML parser. Quoting the docs:

Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.


Alternatively, you could use the html5lib parser and just select the element after <body>:

In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')

In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
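
Since soup.body.next returns only the first node, here is a minimal sketch of how everything inside <body> could be returned as a string instead. The helper name fragment_html is hypothetical (it is not part of BeautifulSoup), and it relies on decode_contents(), which is undocumented. It falls back to the whole soup when the parser added no <body>, as html.parser does.

```python
from bs4 import BeautifulSoup


def fragment_html(markup, parser="html.parser"):
    """Return the parsed markup as a string, without any html/head/body wrapper.

    Hypothetical helper: not part of the BeautifulSoup API.
    """
    soup = BeautifulSoup(markup, parser)
    # html5lib and lxml add a <body>; html.parser does not, so fall back to the soup.
    root = soup.body if soup.body is not None else soup
    # decode_contents() renders the children of a tag without the tag itself.
    return root.decode_contents()


print(fragment_html("<h1>a</h1><h1>b</h1>"))
```

Passing 'html5lib' or 'lxml' as the second argument should give the same string for this input, but keep in mind that those parsers may still add other elements (such as tbody) inside the body.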
Zsa answered 11/2, 2013 at 22:45 Comment(9)
Note that this response is actually broken in case of multiple elements within the body. If you would have <h1>a</h1><h1>b</h1> it would only return <h1>a</h1>Exemption
One can do "soup.html.hidden=True (;) soup.head.hidden=True (;) soup.body.hidden=True" before printing and you get the desired result. Humongously inefficient, though, if you're iterating through a lot of cells in a table.Impunity
soup.body.next grabs the next element in the body, as the name suggests: Is there a way to grab everything inside the body "as is" (that is, not as text)? In other words, grab the complement of soup.body.decompose() Thanks.Dilemma
I'm going to throw this out there: body = soup.find('body') and body.findChildren(recursive=False)[0] (I had no luck with .unwrap())Dilemma
@Wolph: then use the children generator and join the output? "".join(map(str, soup.body.children)).Benjie
@PatrickT: Use soup.body.contents, which is a list of direct descendants (including text nodes). Or use iter(soup.body) and iterate over that. Or use list(soup.body) to produce a list of the direct descending nodes. They are all the same thing, really; list() and iter() call the __iter__ method, and the __iter__ method returns iter(self.contents).Benjie
@MartijnPieters I think it's just soup.children in this case. Unless you're parsing a full document you won't have a bodyExemption
@Wolph: you may want to read the question ;-) The whole point of the question is that a <html>, <head> and <body> tag have been added. That's because BeautifulSoup deals with whole documents, not with fragments, and html5lib (and lxml too, by the way) repair broken HTML documents by adding the required but missing elements back in.Benjie
@Exemption I don't get the same result...Audieaudience
S
8

This aspect of BeautifulSoup has always annoyed the hell out of me.

Here's how I deal with it:

# Parse the initial html-formatted string
soup = BeautifulSoup(html, 'lxml')

# Do stuff here

# Extract a string repr of the parsed HTML object, without the <html> or <body> tags
html = "".join([str(x) for x in soup.body.children])

A quick breakdown:

# Iterator object of all tags within the <body> tag (your html before parsing)
soup.body.children

# Turn each element into a string object, rather than a BS4.Tag object
# Note: inclusive of html tags
str(x)

# Get a List of all html nodes as string objects
[str(x) for x in soup.body.children]

# Join all the string objects together to recreate your original html
"".join()

I still don't like this, but it gets the job done. I run into this whenever I use BS4 to filter certain elements or attributes out of an HTML document and then need the result back as a string rather than as a parsed BS4 object.

Hopefully, the next time I Google this, I'll find my answer here.

Sikorsky answered 17/12, 2019 at 23:23 Comment(5)
This worked fine. The other answer with findChildren() seem to discard text directly inside the body and not in another tag.Principe
There's an undocumented function decode_contents() now, see my answer below, which is aimed to stop the annoyance I also had before.Keller
Why does it annoy you? Because BeautifulSoup is designed to deal with HTML documents, not fragments? Note that your approach gives a string, not a BeautifulSoup object tree.Benjie
@MartijnPieters It annoys me in situations where a fragment is parsed and the html and body tags are added to it, by default. WRT to the str vs BS4 parse object; did you feel my post didn't draw attention to that appropriately at the bottom?Sikorsky
You did, but the question never mentioned needing just the string.Benjie
A
7

Let's first create a soup sample:

soup = BeautifulSoup("<head></head><body><p>content</p></body>", "html5lib")

You can reach the children of html and body by attribute access, e.g. soup.body.<tag>:

# python3: get body's first child
print(next(soup.body.children))

# if first child's tag is rss
print(soup.body.rss)

You can also use unwrap() to remove body, head, and html:

soup.html.body.unwrap()
if soup.html.head is not None:
    soup.html.head.unwrap()
soup.html.unwrap()
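
The unwrap() calls above can be folded into one loop. This is only a sketch: the function name strip_wrappers is made up for illustration, and the example parses a complete document with html.parser so that it runs without html5lib or lxml installed. Unlike the string-based approaches, it leaves you with a BeautifulSoup object.

```python
from bs4 import BeautifulSoup


def strip_wrappers(soup):
    """Unwrap <html>, <head> and <body> in place, keeping their children.

    Hypothetical helper: not part of the BeautifulSoup API.
    """
    # find_all() returns a list up front, so modifying the tree while looping is safe.
    for tag in soup.find_all(["html", "head", "body"]):
        tag.unwrap()
    return soup


soup = BeautifulSoup(
    "<html><head></head><body><h1>a</h1>text<h1>b</h1></body></html>",
    "html.parser",
)
strip_wrappers(soup)
print(soup)
```

Note that anything a parser placed inside <head> ends up at the top level of the soup after unwrapping, which may or may not be what you want.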

If you are loading an XML file, bs4.diagnose.diagnose(data) will suggest the lxml-xml parser, which does not wrap your soup in html and body:

>>> BeautifulSoup('<foo>xxx</foo>', 'lxml-xml')
<foo>xxx</foo>
Alejoa answered 14/8, 2018 at 8:2 Comment(2)
soup.html.unwrap() then soup.body.unwrap() worked for me!Beaumarchais
No, parsing a document as XML will indeed not add required HTML elements. But treating HTML as XML introduces other issues, because XML is not HTML. Like in HTML, the tags <H1> and <h1> are the same thing, but in XML, where tag names are case-sensitive, they are not. So <H1>Foo</h1> is valid HTML, but not valid XML.Benjie
B
4

You may have misunderstood BeautifulSoup here. BeautifulSoup deals with whole HTML documents, not with HTML fragments. What you see is by design.

Without a <html> and <body> tag, your HTML document is broken. BeautifulSoup leaves it to the specific parser to repair such a document, and different parsers differ in how much they can repair. html5lib is the most thorough of the parsers, but you'll get similar results with the lxml parser (though lxml leaves out the <head> tag). The html.parser parser is the least capable: it can do some repair work, but it doesn't add back required but missing tags.

So this is a deliberate feature of the html5lib library: it fixes incomplete HTML, for example by adding back missing required elements.

There is no option for BeautifulSoup to treat the HTML you pass in as a fragment. At most you can 'break' the document and remove the <html> and <body> elements again with the standard BeautifulSoup tree manipulation methods.

E.g. using Element.replace_with() lets you replace the html element with your <h1> element:

>>> soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
>>> soup
<html><head></head><body><h1>FOO</h1></body></html>
>>> soup.html.replace_with(soup.body.contents[0])
<html><head></head><body></body></html>
>>> soup
<h1>FOO</h1>

Take into account however, that html5lib can add other elements to your tree too, such as tbody elements:

>>> BeautifulSoup(
...     '<table><tr><td>Foo</td><td>Bar</td></tr></table>', 'html5lib'
... ).table
<table><tbody><tr><td>Foo</td><td>Bar</td></tr></tbody></table>

The HTML standard states that a table should always have a <tbody> element, and if it is missing, a parser should treat the document as if the element is there anyway. html5lib follows the standard very, very closely.
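
To check which elements a given parser adds, one option is to render the same snippet with each parser you have installed. A rough sketch follows; the render_with helper is hypothetical, and parsers that are not installed are reported rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound

TABLE = "<table><tr><td>Foo</td></tr></table>"


def render_with(parser, markup=TABLE):
    """Return the markup as rendered by the given parser, or None if it is missing."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser library is not installed.
        return None


results = {p: render_with(p) for p in ("html.parser", "lxml", "html5lib")}
for parser, rendered in results.items():
    print(parser, "->", rendered if rendered is not None else "(not installed)")
```

With html.parser the table comes back unchanged; with html5lib you should see the added html, head, body and tbody elements described above.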

Benjie answered 11/2, 2013 at 22:42 Comment(2)
But if the data isn't meant to be a complete HTML document, it's not missing at all.Lois
@KefSchecter: html5lib expects complete HTML documents, not HTML fragments. So whatever the data was meant to be, html5lib is not the library to use for HTML fragments. Hell, BeautifulSoup is not the library to use, a BeautifulSoup object expects to handle complete documents, but it's up to the parser to repair things if not.Benjie
S
1

Yet another solution:

from bs4 import BeautifulSoup
soup = BeautifulSoup('<p>Hello <a href="http://google.com">Google</a></p><p>Hi!</p>', 'lxml')
# content handling example (just for example)
# replace Google with StackOverflow
for a in soup.find_all('a'):
    a['href'] = 'http://stackoverflow.com/'
    a.string = 'StackOverflow'
# Python 3: join the direct child tags of <body> (bare text nodes are skipped)
print(''.join(str(i) for i in soup.html.body.find_all(recursive=False)))
Schild answered 18/7, 2016 at 5:20 Comment(0)
B
0
html=str(soup)
html=html.replace("<html><body>","")
html=html.replace("</body></html>","")

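This removes the wrapping <html> and <body> tags from the string. A more robust version would first check for them with startswith() and endswith(). Note that the exact wrapper text depends on the parser; html5lib, for instance, also emits <head></head>.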
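
A sketch of the more careful version hinted at above, using plain string methods only. The function name strip_wrapper and its default prefix and suffix are assumptions for illustration; the defaults match the wrapper text used in the replace() calls, so pass a different prefix if your parser also emits <head></head>.

```python
def strip_wrapper(html, prefix="<html><body>", suffix="</body></html>"):
    """Remove the wrapper only when the string both starts and ends with it."""
    if html.startswith(prefix) and html.endswith(suffix):
        return html[len(prefix):len(html) - len(suffix)]
    # Leave anything else untouched instead of replacing matches in the middle.
    return html


print(strip_wrapper("<html><body><h1>FOO</h1></body></html>"))  # <h1>FOO</h1>
print(strip_wrapper("<h1>FOO</h1>"))                            # <h1>FOO</h1>
```

Unlike str.replace(), this never touches occurrences of the wrapper text that appear elsewhere in the document.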

Beginner answered 14/3, 2021 at 12:50 Comment(1)
Why string replacement? Why not just use str(soup.body.contents[0]) (for single-element bodies, or use looping and str.join()) or even simpler, soup.body.decode_contents()? The latter is, unfortunately, not documented but is the method that Element.decode() calls to render the contents when rendering itself, and Element.decode() is the base implementation for __str__.Benjie
A
-1

If you want it to look better, try this:

BeautifulSoup([contents you want to analyze], 'html.parser').prettify()

Antler answered 1/10, 2018 at 12:53 Comment(0)
K
-1

Since v4.0.1 there's a method decode_contents():

>>> BeautifulSoup('<h1>FOO</h1>', 'html5lib').body.decode_contents()
'<h1>FOO</h1>' 

More details in a solution to this question: https://mcmap.net/q/88849/-beautifulsoup-innerhtml

Update:

As rightfully noted by @MartijnPieters in the comments this way you'll still get some extra tags like tbody (in the tables) which you might or might not want.

Keller answered 9/7, 2020 at 17:51 Comment(6)
This doesn't answer the question though. Note that if you passed in a snippet with a table but no tbody tag, you'll find that html5lib has added one; '<table><tr><td>Foo</td></tr></table>' -> '<html><head></head><body><table><tbody><tr><td>Foo</td></tr></tbody></table></body></html>', and that might be unexpected too.Benjie
@MartijnPieters Yes, it might be unexpected, but it does answer the question. OP said nothing about tbody.Keller
@AntonyHatchkins: nor do they ask about getting the contents as a string: Is there any option that I can set, turn off this behavior ? is what they asked.Benjie
@MartijnPieters You're not reading it carefully enough: "it puts the html, head and body tags automatically" - this is the behavior they want to turn off.Keller
@AntonyHatchkins: I've read it just fine. Which is why I pointed out that there are other tags that are added automatically, not just those observed by the OP. Your solution doesn't turn those 'off'. Nor is it stated that the OP wanted a string. Maybe they wanted a BeautifulSoup object without those tags?Benjie
@MartijnPieters That's a useful addition, thanks. Yet it does not make the answer better or worse as it is not something OP originally asked for. He might as well be pleased with the addition of tbody tags. "Maybe they wanted a BeautifulSoup object without those tags?" - or maybe not. We do not know for sure. In the meanwhile maybe another reader is interested in strings particularly.Keller
T
-1

Here is how I do it:

a = BeautifulSoup('', 'html.parser')
a.append(a.new_tag('section'))
# this will give you <section></section>
Tiu answered 21/11, 2020 at 22:52 Comment(0)
