I am trying to use BeautifulSoup to extract the contents from a website (http://brooklynexposed.com/events/). As an example of the problem I can run the following code:
import urllib
import bs4 as BeautifulSoup
url = 'http://brooklynexposed.com/events/'
html = urllib.urlopen(url).read()
soup = BeautifulSoup.BeautifulSoup(html)
print soup.prettify().encode('utf-8')
The output seems to cut off the HTML as follows:
<li class="event">
9:00pm - 11:00pm
<br/>
<a href="http://brooklynexposed.com/events/entry/5432/2013-07-16">
Comedy Sh
</a>
</li>
</ul>
</div>
</div>
</div>
</div>
</body>
</html>
It is cutting off the listing named Comedy Show, along with all of the HTML that follows, up until the final closing tags. The majority of the HTML is simply removed. I have noticed similar behaviour on numerous websites: if the page is too long, BeautifulSoup seems to fail to parse the entire page and just cuts out text. Does anyone have a solution for this? If BeautifulSoup is not capable of handling such pages, does anyone know of other libraries with a function similar to prettify()?
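One way to check whether the parser is dropping content, or whether the input it receives is already truncated, is to feed a cut-off snippet to the standard library's html.parser and log what it actually sees. This is a minimal sketch; the snippet and tag names are hypothetical, modeled on the output above:

```python
from html.parser import HTMLParser

# Hypothetical truncated snippet modeled on the output in the question:
# the <a> text stops mid-word and no closing tags follow.
truncated = '<ul><li class="event"><a href="/x">Comedy Sh'

class TagLogger(HTMLParser):
    """Record every start tag and every run of text the parser sees."""
    def __init__(self):
        super().__init__()
        self.opened = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.opened.append(tag)

    def handle_data(self, data):
        self.text.append(data)

logger = TagLogger()
logger.feed(truncated)
logger.close()
print(logger.opened)         # ['ul', 'li', 'a']
print("".join(logger.text))  # Comedy Sh
```

If the text already ends at "Comedy Sh" here, the download itself was truncated: a lenient parser then simply closes the still-open tags, which is exactly the symptom in the prettify() output above.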
Calling len(unicode(soup)) on the soup object returns 107578.
Which version of BS are you using? I am using 4.2.0. – Lepsy
If the HTML really ends at "Comedy Sh", then the HTML parser will 'close' all still-open tags and you see exactly what you got. – Thrombocyte
Note that BeautifulSoup can use different parsers; if lxml is installed, that'll be used, for example. Different parsers handle broken HTML differently. You may want to run the .diagnose() method to see what BeautifulSoup tells you about that. If you cannot figure out what that tells you, paste the output here in your question. – Thrombocyte
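Following the comments above, one way to remove the parser ambiguity is to name the parser explicitly instead of letting BeautifulSoup pick whichever is installed. A minimal sketch, assuming bs4 (BeautifulSoup 4) is installed; the truncated snippet is hypothetical:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that ends mid-element, like the page in the question.
truncated = '<ul><li class="event"><a href="/x">Comedy Sh'

# Name the parser explicitly; try 'lxml' or 'html5lib' if installed,
# since each one repairs broken HTML differently.
soup = BeautifulSoup(truncated, 'html.parser')

# The still-open <ul>, <li> and <a> are closed for you in the output.
print(soup.prettify())

# To see what bs4 itself reports about the input and available parsers:
# from bs4.diagnose import diagnose
# diagnose(truncated)
```

If every parser produces the same cut at the same character, the problem is in the HTML you downloaded, not in BeautifulSoup.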