You may have misunderstood BeautifulSoup here. BeautifulSoup deals with whole HTML documents, not with HTML fragments. What you see is by design.
Without <html> and <body> tags, your HTML document is broken. BeautifulSoup leaves it to the specific parser to repair such a document, and parsers differ in how much they repair. html5lib is the most thorough of the parsers, but you'll get similar results with the lxml parser (although lxml leaves out the <head> tag). The html.parser parser is the least capable: it can do some repair work, but it doesn't add back required but missing tags.
So this is a deliberate feature of the html5lib library: it repairs deficient HTML, for example by adding back missing required elements.
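To see the difference between the parsers for yourself, a small comparison along these lines may help. This is only a sketch: the helper name parse_with_each is made up for illustration, and it assumes that lxml and html5lib may or may not be installed, so a missing parser is reported rather than treated as an error.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Parse the same markup with several parsers and collect the output.

    Hypothetical helper: returns a dict mapping parser name to the
    serialised tree, or None when that parser is not installed.
    """
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # BeautifulSoup raises this when the requested parser is missing.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, output in parse_with_each("<h1>FOO</h1>").items():
        print(f"{name:12} {output if output is not None else '(not installed)'}")
```

With all three parsers installed, you should see html.parser leave the fragment alone, lxml wrap it in html and body, and html5lib add an empty head as well.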
There is no option for BeautifulSoup to treat the HTML you pass in as a fragment. At most you can 'break' the document and remove the <html> and <body> elements again with the standard BeautifulSoup tree manipulation methods.
For example, replace_with() lets you replace the <html> element with your <h1> element:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<h1>FOO</h1>', 'html5lib')
>>> soup
<html><head></head><body><h1>FOO</h1></body></html>
>>> soup.html.replace_with(soup.body.contents[0])
<html><head></head><body></body></html>
>>> soup
<h1>FOO</h1>
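The replace_with() approach keeps only the first child of the body. If your fragment can have several top-level elements, one possible alternative is to unwrap the elements the parser added. The sketch below assumes the fragment itself contains no <html>, <head> or <body> tags of its own; parse_fragment is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup


def parse_fragment(markup, parser="html5lib"):
    """Parse markup, then strip the document wrapper the parser added.

    Hypothetical helper. Tag.unwrap() replaces a tag with its own
    children, so everything inside <head> and <body> is kept.
    """
    soup = BeautifulSoup(markup, parser)
    for name in ("head", "body", "html"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return soup


if __name__ == "__main__":
    # Requires html5lib to be installed for the default parser.
    print(parse_fragment("<h1>FOO</h1><p>bar</p>"))
```

Note that this only removes the wrapper; any other repairs the parser made inside the fragment stay in place.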
Take into account, however, that html5lib can add other elements to your tree too, such as tbody elements:
>>> BeautifulSoup(
... '<table><tr><td>Foo</td><td>Bar</td></tr></table>', 'html5lib'
... ).table
<table><tbody><tr><td>Foo</td><td>Bar</td></tr></tbody></table>
The HTML standard's parsing rules say that the rows of a table belong inside a <tbody> element, and if it is missing, a parser should treat the document as if the element were there anyway. html5lib follows the standard very closely.
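Because the presence of tbody depends on the parser, code that walks a table is more robust if it doesn't rely on the exact nesting. A minimal sketch, assuming a single, non-nested table in the markup (table_rows is a hypothetical helper name):

```python
from bs4 import BeautifulSoup


def table_rows(markup, parser="html.parser"):
    """Return the cell text of each row in the first table.

    Hypothetical helper. find_all() searches all descendants, so rows
    are found whether or not the parser inserted a <tbody>.
    """
    soup = BeautifulSoup(markup, parser)
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in soup.table.find_all("tr")
    ]


if __name__ == "__main__":
    print(table_rows("<table><tr><td>Foo</td><td>Bar</td></tr></table>"))
```

A selector such as table > tr, by contrast, would match with html.parser but not with html5lib, since html5lib places the row inside the added tbody.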