BeautifulSoup not extracting all html (automatically deleting much of a page's html)

I am trying to use BeautifulSoup to extract the contents from a website (http://brooklynexposed.com/events/). As an example of the problem I can run the following code:

import urllib
import bs4 as BeautifulSoup

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()

# No parser is specified here, so bs4 silently picks whichever parser
# it considers best among those installed on the machine.
soup = BeautifulSoup.BeautifulSoup(html)
print soup.prettify().encode('utf-8')

The output seems to cut off the html as follows:

       <li class="event">
        9:00pm - 11:00pm
        <br/>
        <a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
         Comedy Sh
        </a>
       </li>
      </ul>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>

It is cutting off the listing named Comedy Show, along with all of the HTML that follows it, up to the final closing tags; the majority of the HTML is being silently dropped. I have noticed similar behaviour on numerous websites: if the page is too long, BeautifulSoup fails to parse the entire page and just cuts out text. Does anyone have a solution for this? If BeautifulSoup cannot handle such pages, does anyone know of another library with a function similar to prettify()?

Rosecan asked 15/7, 2013 at 17:25 Comment(6)
For me, the entire content is there. Starting with your code to create the soup object, >>> len(unicode(soup)) returns 107578. Which version of BS are you using? I am using 4.2.0. - Lepsy
Your code works for me just fine. If the network transfer was interrupted at that exact point (so you only loaded up to Comedy Sh), then the HTML parser will 'close' all still-open tags, and you would see exactly what you got. - Thrombocyte
Interesting, I was using 4.2.1 with Python 2.7. However, when I use 3.2 it seems to work. It couldn't have been a timeout issue, because if I printed the original HTML to a file, all of the text appeared. Any other ideas on a solution for 2.7? Otherwise it's time to start porting my code. - Rosecan
Different HTML parser used? BeautifulSoup will use the 'best' parser available, so if lxml is installed, that will be used, for example. Different parsers handle broken HTML differently. You may want to run the .diagnose() method to see what BeautifulSoup tells you about the document; see the sketch just after these comments. If you cannot figure out what it tells you, paste the output into your question. - Thrombocyte
Not sure if you figured it out or not, but it worked fine for me with Beautiful Soup 4.1.1 and Python 2.7. I upgraded to 4.3.1 and it still worked. - Heron
I guess this may help you: #13761664 - Popular
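
Following up on the parser point in the comments, here is a minimal sketch, assuming Python 2 to match the question, and assuming lxml and html5lib may or may not be installed. It compares how much of the page each parser keeps and then runs bs4's bundled diagnose() helper:

import urllib
import bs4
from bs4.diagnose import diagnose  # diagnostic helper that ships with bs4

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()

# Compare how much of the document each parser preserves; a parser
# that reports a much shorter length is choking on the broken markup.
for parser in ('html.parser', 'lxml', 'html5lib'):
    try:
        soup = bs4.BeautifulSoup(html, parser)
        print parser, len(unicode(soup))
    except bs4.FeatureNotFound:
        print parser, 'not installed'

# Print bs4's own report on the installed parsers and this document.
diagnose(html)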

I had trouble with bs4 cutting off HTML on some machines but not on others; it was not reproducible.

I switched to this:

soup = bs4.BeautifulSoup(html, 'html5lib')

... and it works now.
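
For completeness, a minimal end-to-end sketch of this workaround; note that html5lib is a separate package, so pip install html5lib is assumed:

import urllib
import bs4

url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()

# html5lib builds the tree the way a browser would, which tends to
# survive malformed markup that makes other parsers stop early.
soup = bs4.BeautifulSoup(html, 'html5lib')
print soup.prettify().encode('utf-8')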

Comprehend answered 8/4, 2016 at 13:8 Comment(0)

It's working fine for me, but I get an error when I call soup.prettify().encode('utf-8'):

>>> from BeautifulSoup import BeautifulSoup as bs
>>> 
>>> import urllib
>>> url = 'http://brooklynexposed.com/events/'
>>> html = urllib.urlopen(url).read()
>>> 
>>> 
>>> soup = bs(html)
>>> soup.prettify().encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8788: ordinal not in range(128)
>>>
>>> soup.prettify()
'<!doctype html>\n<!--[if lt IE 7 ]> <html class="no-js ie6" lang="en"> <![endif]-->\n<!--[if IE 7 ]>
...
</body>\n</html>\n'

I guess this may help you: BeautifulSoup, where are you putting my HTML?
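
A likely explanation for the traceback above, sketched on the assumption that this answer is running BeautifulSoup 3 (note the from BeautifulSoup import ... line): BS3's prettify() already returns a UTF-8 encoded str, so calling .encode('utf-8') on it makes Python 2 first decode the bytes with the ascii codec, which fails at the first non-ASCII byte. bs4's prettify() returns unicode, so encoding it is safe:

import urllib
import bs4
from BeautifulSoup import BeautifulSoup as bs3  # BeautifulSoup 3

html = urllib.urlopen('http://brooklynexposed.com/events/').read()

# BS3: prettify() is already a UTF-8 str; .encode('utf-8') triggers an
# implicit ascii decode and the UnicodeDecodeError shown above.
print bs3(html).prettify()

# bs4: prettify() returns unicode, so an explicit encode is safe.
print bs4.BeautifulSoup(html, 'html5lib').prettify().encode('utf-8')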

Popular answered 28/10, 2013 at 18:6 Comment(0)
