BeautifulSoup innerhtml?
H

8

79

Let's say I have a page with a div. I can easily get that div with soup.find().

Now that I have the result, I'd like to print the WHOLE innerhtml of that div: I need a string with ALL the HTML tags and text together, exactly like the string I'd get in JavaScript with obj.innerHTML. Is this possible?

Harney answered 13/11, 2011 at 16:26 Comment(0)
S
105

TL;DR

With BeautifulSoup 4, use element.encode_contents() if you want a UTF-8 encoded bytestring, or element.decode_contents() if you want a Python Unicode string. For example, an equivalent of the DOM's innerHTML property might look something like this:

def innerHTML(element):
    """Returns the inner HTML of an element as a UTF-8 encoded bytestring"""
    return element.encode_contents()
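
As a rough usage sketch of the two methods (the sample markup and variable names are made up for illustration, and it assumes Beautiful Soup 4 with the built-in html.parser), the only difference is the type of the result:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
soup = BeautifulSoup('<div id="outer"><p>caf\u00e9</p></div>', "html.parser")
div = soup.find("div", id="outer")

as_text = div.decode_contents()   # Python str (Unicode)
as_bytes = div.encode_contents()  # bytes, UTF-8 encoded by default

print(type(as_text).__name__, as_text)
print(type(as_bytes).__name__, as_bytes)
```

Decoding the bytestring as UTF-8 should give back the same string that decode_contents() returns.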

These functions aren't currently in the online documentation so I'll quote the current function definitions and the doc string from the code.

encode_contents - since 4.0.4

def encode_contents(
    self, indent_level=None, encoding=DEFAULT_OUTPUT_ENCODING,
    formatter="minimal"):
    """Renders the contents of this tag as a bytestring.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param encoding: The bytestring will be in this encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

See also the documentation on formatters; you'll most likely either use formatter="minimal" (the default) or formatter="html" (for html entities) unless you want to manually process the text in some way.
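
To make the formatter choice concrete, here is a minimal sketch comparing the two. The markup is invented for this example, and it assumes Beautiful Soup 4 with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
soup = BeautifulSoup("<p>caf\u00e9 &amp; <b>bar</b></p>", "html.parser")
p = soup.p

# "minimal" (the default) escapes only what is needed for valid markup.
minimal = p.decode_contents(formatter="minimal")

# "html" also converts characters to named HTML entities where one exists.
as_entities = p.decode_contents(formatter="html")

print(minimal)
print(as_entities)
```

With "minimal" the accented character stays as-is, while "html" writes it as a named entity; the ampersand is escaped in both cases.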

encode_contents returns an encoded bytestring. If you want a Python Unicode string then use decode_contents instead.


decode_contents - since 4.0.1

decode_contents does the same thing as encode_contents but returns a Python Unicode string instead of an encoded bytestring.

def decode_contents(self, indent_level=None,
                   eventual_encoding=DEFAULT_OUTPUT_ENCODING,
                   formatter="minimal"):
    """Renders the contents of this tag as a Unicode string.

    :param indent_level: Each line of the rendering will be
       indented this many spaces.

    :param eventual_encoding: The tag is destined to be
       encoded into this encoding. This method is _not_
       responsible for performing that encoding. This information
       is passed in so that it can be substituted in if the
       document contains a <META> tag that mentions the document's
       encoding.

    :param formatter: The output formatter responsible for converting
       entities to Unicode characters.
    """

BeautifulSoup 3

BeautifulSoup 3 doesn't have the above functions; instead it has renderContents:

def renderContents(self, encoding=DEFAULT_OUTPUT_ENCODING,
                   prettyPrint=False, indentLevel=0):
    """Renders the contents of this tag as a string in the given
    encoding. If encoding is None, returns a Unicode string.."""

This function was added back to BeautifulSoup 4 (in 4.0.4) for compatibility with BS3.

Seeress answered 3/9, 2013 at 22:4 Comment(2)
This is the correct answer. @peewhy's answer does not work for the reasons ChrisD outlined.Bellicose
Anyone know why this is undocumented? Seems like it would be a common use case.Brindisi
S
17

One option is something like this:

 innerhtml = "".join([str(x) for x in div_element.contents]) 
Syncarpous answered 13/11, 2011 at 16:39 Comment(2)
There are a few other problems with this. Firstly, it doesn't escape HTML entities (such as greater-than and less-than signs) within string elements. Secondly, it will write the content of comments but not the comment tags themselves.Seeress
Adding another reason not to use this to @Seeress comments: This will throw a UnicodeDecodeError on content that includes non-ASCII characters.Klingensmith
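
The first two caveats from these comments can be checked with a small sketch. The markup is made up for illustration, and it assumes Beautiful Soup 4 on Python 3 with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing an escaped "<" and an HTML comment.
soup = BeautifulSoup("<div>a &lt; b<!-- note --></div>", "html.parser")
div = soup.div

# Joining str() of each child: text is not re-escaped and the
# comment delimiters are dropped.
joined = "".join(str(x) for x in div.contents)

# decode_contents() renders the children back to markup.
rendered = div.decode_contents()

print(repr(joined))
print(repr(rendered))
```

For markup without comments or special characters the two results should match, which is why the join approach often appears to work.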
A
17

Given a BS4 soup element like <div id="outer"><div id="inner">foobar</div></div>, here are some various methods and attributes that can be used to retrieve its HTML and text in different ways along with an example of what they'll return.


InnerHTML:

inner_html = element.encode_contents()

b'<div id="inner">foobar</div>'

(On Python 3 this is a bytestring; use element.decode_contents() to get a str.)

OuterHTML:

outer_html = str(element)

'<div id="outer"><div id="inner">foobar</div></div>'

OuterHTML (prettified):

pretty_outer_html = element.prettify()

'''<div id="outer">
 <div id="inner">
  foobar
 </div>
</div>'''

Text only (using .text):

element_text = element.text

'foobar'

Text only (using .string):

element_string = element.string

'foobar'
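
Note that .text and .string only agree in simple cases like the one above. A minimal sketch of where they differ, using slightly different markup invented for this example (assuming Beautiful Soup 4 with html.parser):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the inner div now has two children (text and a <b> tag).
soup = BeautifulSoup(
    '<div id="outer"><div id="inner">foo<b>bar</b></div></div>', "html.parser"
)
element = soup.find("div", id="outer")

inner_html = element.decode_contents()
element_text = element.text      # all descendant text, concatenated
element_string = element.string  # None: no single string child to return

print(inner_html)
print(element_text)
print(element_string)
```

Here .string is None because the inner div contains more than one child, while .text still returns the concatenated text.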
Autotomize answered 18/11, 2017 at 10:21 Comment(0)
G
3

str(element) gives you the outer HTML; you can then strip the outer tag from that string.

Gangway answered 21/9, 2020 at 9:23 Comment(1)
How do you remove outer tag from the outer html string?Howlond
V
1

How about just unicode(x)? Seems to work for me.

Edit: This will give you the outer HTML and not the inner.

Vanwinkle answered 30/1, 2016 at 10:30 Comment(2)
This will return the div including the outer element, not just the contents.Littlest
You're right. Leaving this here for now in case this helps someone else.Vanwinkle
U
1

If I understand correctly, you mean that for an example like this:

<div class="test">
    text in body
    <p>Hello World!</p>
</div>

the output should look like this:

text in body
    <p>Hello World!</p>

So here is your answer:

''.join(map(str,tag.contents))
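
A self-contained version of this one-liner, run against the example markup above, might look like the following sketch. The variable names are illustrative, and it assumes Beautiful Soup 4 with html.parser:

```python
from bs4 import BeautifulSoup

html = """<div class="test">
    text in body
    <p>Hello World!</p>
</div>"""

# Locate the div, then join the string form of each direct child.
tag = BeautifulSoup(html, "html.parser").find("div", class_="test")
inner = "".join(map(str, tag.contents))

print(inner)
```

For this input the result should be the same as tag.decode_contents(), since the text contains no characters that need escaping and there are no comments.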
Uneducated answered 23/3, 2021 at 7:32 Comment(0)
H
1

The easiest way is to use the children property.

inner_html = soup.find('body').children

It returns an iterator over the tag's direct children (not a list), so you can print the full markup with a simple for loop.

for html in inner_html:
    print(html)
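
A runnable sketch of the same idea, with made-up markup and assuming Beautiful Soup 4 with html.parser (which does not add a body tag on its own, so the sample includes one):

```python
from bs4 import BeautifulSoup

# Hypothetical document, used only for illustration.
soup = BeautifulSoup(
    "<html><body><h1>Title</h1><p>Text</p></body></html>", "html.parser"
)
body = soup.find("body")

children = body.children
print(isinstance(children, list))  # False: it is an iterator

# Collect the markup of each direct child and join it into one string.
parts = [str(child) for child in children]
inner_html = "".join(parts)
print(inner_html)
```

Because the iterator is consumed by the loop, ask for body.children again if you need a second pass.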
Hamlin answered 21/11, 2021 at 20:37 Comment(1)
To get the content back as a single string: "".join(map(str,soup.find('body').children)).strip()Holliman
F
-4

For just the text, use Beautiful Soup 4's get_text()

If you only want the human-readable text inside a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
'\nI linked to example.com\n'
soup.i.get_text()
'example.com' 

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
'\nI linked to |example.com|\n' 

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
'I linked to|example.com' 

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# ['I linked to', 'example.com'] 

As of Beautiful Soup version 4.9.0, when lxml or html.parser are in use, the contents of <script>, <style>, and <template> tags are not considered to be ‘text’, since those tags are not part of the human-visible content of the page.

Refer here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#get-text

Faveolate answered 20/6, 2018 at 17:28 Comment(2)
...was the question a different question at some point?Nanananak
@Nanananak It's been a while and tbh I forgot.Faveolate
