BeautifulSoup - how should I obtain the body contents

Asked 30/1, 2014 at 9:44 Answered 11/2, 2020 at 18:44

Solved python django beautifulsoup html5lib

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. However, BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better and obvious way to do it.

Ibnsina answered 30/1, 2014 at 9:44 Comment(1)

Despite all the answers, I still find the .hidden = True approach the cleanest one. Another hack, if a string result will suffice, would be to truncate the body tags: str(soup.body)[6:-7] or soup.body.prettify()[6:-7] – Redcap 5/10, 2020 at 7:3

Do you mean getting everything in between the body tags?

In this case you can use:

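import urllib2  # Python 2; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup

page = urllib2.urlopen('some_site').read()  # 'some_site' is a placeholder URL
soup = BeautifulSoup(page)
body = soup.find('body')
# Direct child tags of <body> only; recursive=False avoids listing nested tags again
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)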
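
If a single string is needed rather than a list of tags, the following is a minimal sketch of how the pieces might be joined. It assumes Python 3 and the built-in "html.parser" backend, and the sample markup and variable names are made up for illustration. Note that findChildren() / find_all() return tags only, so bare text sitting directly inside body shows up in .contents but not in the list of child tags.

```python
from bs4 import BeautifulSoup

# Sample document (an assumption for illustration, not from the question)
html = "<html><body>intro <p>One</p><p>Two <b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Direct child tags only: the nested <b> is not listed separately
direct_tags = soup.body.find_all(recursive=False)

# Default (recursive) search: the nested <b> is listed in addition to its parent <p>
all_tags = soup.body.find_all()

# .contents holds every direct child, including bare text nodes
joined = "".join(str(node) for node in soup.body.contents)

# decode_contents() renders everything inside the tag as one string
inner = soup.body.decode_contents()

print([t.name for t in direct_tags])
print([t.name for t in all_tags])
print(joined)
print(inner)
```

For this input, joining .contents and calling decode_contents() give the same string; joining the result of a recursive findChildren() would instead repeat the nested b element.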

Cystic answered 30/1, 2014 at 10:2 Comment(4)

Thanks! When I have two paragraphs, should I use something like ''.join(['%s' % x for x in soup.body.findChildren()]), or is there a better way? – Ibnsina 30/1, 2014 at 10:12

I had some issues using findChildren where some things appeared redundantly: elements nested within multiple layers were added once for each containing layer. To get the contents of the body as they are in the original, without any redundancy or weirdness, I used pagefilling = ''.join(['%s' % x for x in soup.body.contents]) – Spousal 27/7, 2016 at 17:22

body.findChildren(recursive=False) helps you avoid getting nested elements twice. – Ambiguous 8/9, 2018 at 0:16

What about simply: body_content = "".join(str(item) for item in soup.body.contents)? – Ryswick 22/4 at 13:30

I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>

To be more efficient and reusable you could put those undesirable elements in a list and loop through them...

>>> def get_body_contents(html):
...  soup = BeautifulSoup(html, "html5lib")
...  for name in ['head','html','body']:
...    tag = soup.find(name)  # None if the tag is absent
...    if tag is not None:
...      tag.unwrap()
...  return soup
>>>
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>
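
As a possible variant of the helper above, here is a rough sketch that leaves the parsed tree untouched and returns a string instead of a soup object. The function name body_inner_html and its parser argument are hypothetical, and it defaults to the built-in "html.parser" backend, which (unlike html5lib) does not wrap a bare fragment in html and body tags, hence the fallback branch.

```python
from bs4 import BeautifulSoup

def body_inner_html(html, parser="html.parser"):
    """Return the markup inside <body> as a string (hypothetical helper)."""
    soup = BeautifulSoup(html, parser)
    body = soup.body  # None when the parser did not create a <body>
    if body is None:
        # No <body> was added around the fragment, so return it whole
        return str(soup)
    # Render only what sits between <body> and </body>
    return body.decode_contents()

fragment_result = body_inner_html("<p>Hello World</p>")
document_result = body_inner_html(
    "<html><head><title>t</title></head><body><p>Hello World</p></body></html>"
)
print(fragment_result)
print(document_result)
```

Because nothing is unwrapped, anything inside head (a title, for example) is simply left out of the result rather than being merged into the output.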

Printmaking answered 11/2, 2020 at 18:44 Comment(0)
