BeautifulSoup, where are you putting my HTML?

I'm using BS4 with Python 2.7. Here's the start of my code (thanks, root):

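from bs4 import BeautifulSoup
import urllib2

# Fetch the raw page source (Python 2.7, hence urllib2)
f = urllib2.urlopen('http://yify-torrents.com/browse-movie')
html = f.read()
# No parser is named, so Beautiful Soup picks a default from whatever is installed
soup = BeautifulSoup(html)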

When I print html, its contents are the same as the page source viewed in Chrome. When I print soup, however, it cuts out the entire body and leaves me with only this (the contents of the head tag):

<!DOCTYPE html>

<html>
<head>
<title>Browse Movie - YIFY Torrents</title>
<meta charset="utf-8">
<meta content="IE=9" http-equiv="X-UA-Compatible"/>
<meta content="YIFY-Torrents.com - The official YIFY Torrents website. Here you will be able to browse and download all YIFY rip movies in excellent DVD, 720p, 1080p and 3D quality, all at the smallest file size." name="description"/>
<meta content="torrents, yify, movies, movie, download, 720p, 1080p, 3D, browse movies, yify-torrents" name="keywords"/>
<link href="http://static.yify-torrents.com/yify.ico" rel="shortcut icon"/>
<link href="http://yify-torrents.com/rss" rel="alternate" title="YIFY-Torrents RSS feed" type="application/rss+xml"/>
<link href="http://static.yify-torrents.com/assets/css/styles.css?1353330463" rel="stylesheet" type="text/css"/>
<link href="http://static.yify-torrents.com/assets/css/colorbox.css?1327223987" rel="stylesheet" type="text/css"/>
<script src="http://static.yify-torrents.com/assets/js/jquery-1.6.1.min.js?1327224013" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.validate.min.js?1327224011" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/jquery.colorbox-min.js?1327224010" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/form.js?1349683447" type="text/javascript"></script>
<script src="http://static.yify-torrents.com/assets/js/common.js?1353399801" type="text/javascript"></script>
<script>
        var webRoot = 'http://yify-torrents.com/';
        var IsLoggedIn = 0  </script>
<!--[if !IE]><!--><style type="text/css">#content input.field:focus, #content textarea:focus{border: 1px solid #47bc15 !important;}</style></meta></head></html> 

Where am I going wrong?!

Callan answered 7/12, 2012 at 10:24 Comment(6)
And what exactly is it that you are missing? Note that we want this question to remain useful even if http://yify-torrents.com/browse-movie changes. - Agostino
@Martijn Pieters The page is dynamically generated, so it is constantly changing. What I'm missing is the entire body of the page, any idea as to why that could be? - Callan
Please edit your question to clarify what you are expecting. If the page is being altered by JavaScript running in the browser, BeautifulSoup will not include those changes. - Agostino
@MartijnPieters Your first comment prompted me to edit it :P I do see that my expectations were initially unclear! haha. "goodness" isn't synonymous with "body contents". The page isn't being dynamically updated. I'd say it's generated by PHP, but I can't be too sure. urllib2 is feeding the raw code to BeautifulSoup correctly as far as I can tell. Could the page be breaking BeautifulSoup? I don't understand it enough as a function to know what it is for sure, the code goes beyond me. - Callan
It is technically possible that the generated HTML is broken and that BeautifulSoup cannot reconstruct something sensible from it. - Agostino
It would appear that there are many errors in the page. I have run it through a validator here - Arguello

I had the same problem, and telling Beautiful Soup to use the html5lib parser solved it:

soup = BeautifulSoup(html, 'html5lib')

You need to install html5lib:

pip install html5lib

or

easy_install html5lib

You can read more about different parsers (pros and cons) for Beautiful Soup here:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
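
To check which parser is dropping the body, something like the sketch below may help. It is not from the original answer: the names PREFERRED_PARSERS, installed_parsers and body_survives are made up for illustration, the leniency ordering is my reading of the parser comparison in the docs, and it assumes Python 3 (importlib.util.find_spec is not available on 2.7).

```python
import importlib.util

# Beautiful Soup parser names, ordered from most to least lenient.
# This ordering is an assumption based on the parser comparison in the docs.
PREFERRED_PARSERS = ("html5lib", "lxml", "html.parser")


def installed_parsers():
    """Return the names from PREFERRED_PARSERS that are usable on this machine."""
    available = []
    for name in PREFERRED_PARSERS:
        # "html.parser" ships with Python; the other two are separate packages.
        if name == "html.parser" or importlib.util.find_spec(name) is not None:
            available.append(name)
    return available


def body_survives(html, parser):
    """Parse html with the named parser and report whether <body> kept any content."""
    # Imported lazily so installed_parsers() works even without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, parser)
    return soup.body is not None and bool(soup.body.contents)
```

Calling body_survives(html, name) for each name returned by installed_parsers() should show which parsers keep the body of the page and which discard it, so you can pass the one that works to BeautifulSoup explicitly.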

Aludel answered 23/3, 2013 at 15:2 Comment(2)
You saved my life ... I had the page loading perfectly, but when I passed it to the BeautifulSoup() function, a lot of the HTML was removed ... I tried your suggestion and it is fine now, thanks a lot - Masonite
@Umair: Awesome, I think I found the answer after having the same problem you had. Glad I could help. - Aludel
