BeautifulSoup - how should I obtain the body contents

2

22

I'm parsing HTML with BeautifulSoup. At the end, I would like to obtain the body contents, but without the body tags. But BeautifulSoup adds html, head, and body tags. In this Google Groups discussion, one possible solution is proposed:

>>> from bs4 import BeautifulSoup as Soup
>>> soup = Soup('<p>Some paragraph</p>')
>>> soup.body.hidden = True
>>> soup.body.prettify()
u' <p>\n  Some paragraph\n </p>'

This solution is a hack. There should be a better, more obvious way to do it.

Ibnsina answered 30/1, 2014 at 9:44 Comment(1)
Despite all the answers, I still find the .hidden = True approach the cleanest one. Another hack, if a string result will suffice, would be to truncate the body tags: str(soup.body)[6:-7] or soup.body.prettify()[6:-7] Redcap
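The slicing hack in the comment above assumes the opening tag is exactly `<body>` (6 characters) and the closing tag exactly `</body>` (7 characters). A minimal sketch of where that assumption breaks, using made-up sample markup and a hypothetical helper named `slice_body`:

```python
from bs4 import BeautifulSoup

# Sample documents invented for illustration; only the <body> attributes differ.
plain = BeautifulSoup("<html><body><p>Hi</p></body></html>", "html.parser")
styled = BeautifulSoup('<html><body class="x"><p>Hi</p></body></html>', "html.parser")


def slice_body(soup):
    """Hypothetical helper: strip a fixed number of characters from each end."""
    return str(soup.body)[6:-7]


plain_result = slice_body(plain)    # works: the opening tag is exactly "<body>"
styled_result = slice_body(styled)  # leaks part of the opening tag's attributes

print(repr(plain_result))
print(repr(styled_result))
```

So the slice is only safe when you know the body tag carries no attributes.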
39

Do you mean getting everything in between the body tags?

In this case you can use:

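# Python 3: urllib2 was merged into urllib.request
from urllib.request import urlopen

from bs4 import BeautifulSoup

# 'some_site' is a placeholder; replace it with a full URL
page = urlopen('some_site').read()
soup = BeautifulSoup(page)
body = soup.find('body')
# Direct child tags of <body> only; recursive=False skips nested descendants
the_contents_of_body_without_body_tags = body.findChildren(recursive=False)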
Cystic answered 30/1, 2014 at 10:2 Comment(4)
Thanks! When I have two paragraphs, should I use something like ''.join(['%s' % x for x in soup.body.findChildren()]), or is there a better way? Ibnsina
I had some issues using findChildren where some things appeared redundantly: they are nested within multiple layers and were added once for each containing layer. To get the contents of the body as they are in the original, without any redundancy or weirdness, I used pagefilling = ''.join(['%s' % x for x in soup.body.contents]) Spousal
body.findChildren(recursive=False) helps you avoid getting nested elements twice. Ambiguous
What about simply: body_content = "".join(str(item) for item in soup.body.contents)? Ryswick
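To make the difference between the variants discussed in these comments concrete, here is a small sketch on invented sample markup. It uses `find_all(recursive=False)` (the current name for `findChildren`) and also `Tag.decode_contents()`, which is not mentioned in the thread but renders the inside of a tag without the tag itself. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup invented for illustration: note the bare text directly inside <body>.
html = "<html><body>intro <p>One <b>bold</b></p><p>Two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Direct child *tags* only: the bare text "intro " is not a tag, so it is skipped.
tags_only = "".join(str(tag) for tag in body.find_all(recursive=False))

# Every direct child node (tags and text), serialized in document order.
joined = "".join(str(node) for node in body.contents)

# The inner markup of <body> as a single string.
inner = body.decode_contents()

print(tags_only)
print(joined)
print(inner)
```

If the body may contain text outside any tag, joining `.contents` or calling `decode_contents()` keeps it, while the tag-only search drops it.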
5

I've found the easiest way to get just the contents of the body is to unwrap() your contents from inside the body tags.

>>> from bs4 import BeautifulSoup
>>> html = "<p>Hello World</p>"
>>> soup = BeautifulSoup(html, "html5lib")
>>> print(soup)
<html><head></head><body><p>Hello World</p></body></html>
>>>
>>> soup.html.unwrap()
<html></html>
>>>
>>> print(soup)
<head></head><body><p>Hello World</p></body>
>>>
>>> soup.head.unwrap()
<head></head>
>>>
>>> print(soup)
<body><p>Hello World</p></body>
>>>
>>> soup.body.unwrap()
<body></body>
>>>
>>> print(soup)
<p>Hello World</p>

To be more efficient and reusable you could put those undesirable elements in a list and loop through them...

>>> def get_body_contents(html):
...  soup = BeautifulSoup(html, "html5lib")
...  for name in ['head','html','body']:
...    tag = soup.find(name)  # None when absent; hasattr() would not detect that
...    if tag is not None:
...      tag.unwrap()
...  return soup
>>>
>>> html = "<p>Hello World</p>"
>>> print(get_body_contents(html))
<p>Hello World</p>
Printmaking answered 11/2, 2020 at 18:44 Comment(0)
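One caveat with unwrapping `head`: `unwrap()` keeps a tag's children, so anything inside `<head>` (a title, for example) ends up in the result. A possible variation, sketched here with made-up sample markup and hypothetical function names, is to `decompose()` the head first and unwrap only `html` and `body`. The sketch uses the built-in `html.parser` rather than `html5lib`, so the input must already contain these tags.

```python
from bs4 import BeautifulSoup

# Sample document invented for illustration.
html = "<html><head><title>T</title></head><body><p>Hello</p></body></html>"


def unwrap_all(markup):
    """Unwrap head, html and body: the children of <head> survive."""
    soup = BeautifulSoup(markup, "html.parser")
    for name in ("head", "html", "body"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return str(soup)


def body_only(markup):
    """Hypothetical variant: drop <head> entirely, then unwrap html and body."""
    soup = BeautifulSoup(markup, "html.parser")
    if soup.head is not None:
        soup.head.decompose()  # removes the tag together with its children
    for name in ("html", "body"):
        tag = soup.find(name)
        if tag is not None:
            tag.unwrap()
    return str(soup)


print(unwrap_all(html))
print(body_only(html))
```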
