Why does urllib.urlopen().read() not correspond to the page source?

Asked 17/9, 2012 at 20:48 Answered 26/1, 2014 at 5:7

I'm trying to fetch the following webpage:

import urllib
urllib.urlopen("http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1").read()

The result does not correspond to what I see when inspecting the source code of the webpage using Google Chrome for example.

Could you tell me why this happens and how I could improve my code to overcome the problem?

Thank you for your help.

Vouvray answered 17/9, 2012 at 20:48 Comment(2)

Hello, urllib.urlopen.read() gives me for example in the body:

<body>\n<div id="contenu"><script language="JavaScript" type="text/javascript">Album1.EcritElement(0);</script></div>\n</html>

which is far too little compared with what is on the page. – Vouvray 17/9, 2012 at 20:59

See Srikar's answer. The page is generated dynamically using javascript. The key is in "Album1.EcritElement(0)". – Immethodical 17/9, 2012 at 21:12

What you get from urlopen is the raw page as sent by the server: no JavaScript is executed and no CSS is applied. What you see in Chrome (or any other browser) is the final page, after the browser has run the JavaScript (which may alter the HTML), applied the CSS and so on. None of that happens in urlopen.

Hence the difference. Hope this is clear.
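To make this concrete, here is a minimal sketch (written for Python 3, standard library only) that takes the body quoted in the question's comment and separates the text a reader would see from the code sitting inside script tags. The class and function names are made up for illustration.

```python
from html.parser import HTMLParser


class ScriptSplitter(HTMLParser):
    """Separate visible text from code found inside <script> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_script = False
        self.visible_text = []
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data.strip())
        elif data.strip():
            self.visible_text.append(data.strip())


def split_scripts(html):
    """Return (visible_text, scripts) for a piece of raw HTML."""
    parser = ScriptSplitter()
    parser.feed(html)
    parser.close()
    return parser.visible_text, parser.scripts


# The raw body reported in the question's comment.
RAW_BODY = ('<body>\n<div id="contenu"><script language="JavaScript" '
            'type="text/javascript">Album1.EcritElement(0);</script></div>\n</html>')

visible_text, scripts = split_scripts(RAW_BODY)
print(visible_text)  # text present in the raw HTML itself
print(scripts)       # code that a browser would run to build the page
```

The raw response carries no visible text at all, only the script call that a browser would execute to generate it.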

Spinule answered 17/9, 2012 at 20:51 Comment(4)

Does Chrome's source view change when the DOM is manipulated? The Firefox one doesn't. – Kinny 17/9, 2012 at 20:56

@delnan the OP doesn't explicitly say he is using View Source (which doesn't change) rather than Inspect Element (which does). – Bortz 17/9, 2012 at 20:58

@Srikar Thanks. What should I use instead of urlopen to get the final, rendered webpage then? – Vouvray 17/9, 2012 at 21:5

Oh well, that's a big task. Browser rendering engines have been maturing for more than a decade to deal with broken HTML, wrong syntax etc. You would certainly need a JavaScript engine, V8 or another one. That aside, I would like to know what exactly you are doing that warrants this? – Spinule 18/9, 2012 at 4:24

You can use Selenium for Python to solve your issue. Here is some example code:

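from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
browser = webdriver.Firefox()
browser.get(url)
sleep(10)  # crude wait so the page's JavaScript has time to run
body = browser.find_element(By.TAG_NAME, 'body')  # the rendered <body> element
html = browser.page_source  # you can also get all the rendered HTML as a string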

Then do the rest of your work as you see fit. Here is one more example using the browser instance:

from selenium.webdriver.common.by import By

def login(user='ssdf', password="cisin123"):
    # The element id, XPaths and CSS class below belong to this example's login form
    content = browser.find_element(By.ID, 'content')
    content.find_element(By.XPATH, './/tbody/tr[2]//input[contains(@class,"textbox")]').send_keys(user)
    content.find_element(By.XPATH, './/tbody/tr[3]//input[contains(@class,"textbox")]').send_keys(password)
    content.find_element(By.CSS_SELECTOR, ".button").click()
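A fixed sleep(10) either waits too long or not long enough. As a rough sketch (Python 3), the wait can be replaced by polling until some condition holds. The helper name wait_until and the stand-in condition below are invented for illustration; Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a page whose content appears on the third check.
# With Selenium, the condition could instead be something like
#   lambda: browser.find_elements(By.CSS_SELECTOR, "#contenu *")
# (assuming the generated markup ends up inside the "contenu" div).
checks = []


def content_loaded():
    checks.append(1)
    return len(checks) >= 3


wait_until(content_loaded, timeout=5.0, poll_interval=0.1)
print("ready after %d checks" % len(checks))
```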

Horsehide answered 22/1, 2014 at 7:12 Comment(1)

although the other comment answers the basic question "why?", only this answer tells you how to solve the actual problem. – Bing 25/1, 2014 at 23:54

You can use Selenium with Firefox to solve the issue, but it may not be suitable in many cases, as the browser pops up every time you run the code. Another idea is to use a headless browser like PhantomJS.

Another approach is to use the mechanize library. Install mechanize via pip:

pip install mechanize

Then you can use the following code:

import mechanize

mb = mechanize.Browser()
# Send a browser-like User-Agent header
mb.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
mb.set_handle_robots(False)  # do not fetch or obey robots.txt
url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
response = mb.open(url).read()
print(response)

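mechanize also handles forms, cookies and redirects; you can read about these in its documentation. It does not execute JavaScript, however, so it only helps when the server itself returns the content you need.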

Northumbrian answered 26/1, 2014 at 5:7 Comment(0)

Also, some websites have a so-called browser switch, which can serve different source to different browsers (e.g. a light version for mobile browsers).

Have a look at http://www.diveintopython.net/http_web_services/user_agent.html on how to change the User-Agent to something like "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1" (which is actually my User-Agent).
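As a minimal sketch of that idea with the standard library only (Python 3 shown; in Python 2 the same Request class lives in urllib2), the header can be set when the request is built. The helper names build_request and fetch are made up for illustration, and fetch is not called here because it needs network access.

```python
from urllib.request import Request, urlopen

USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
              "(KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1")


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies itself with the given User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the raw page body (bytes) using the custom User-Agent."""
    response = urlopen(build_request(url))
    try:
        return response.read()
    finally:
        response.close()


request = build_request("http://www.gallimard-jeunesse.fr/")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

This only changes what the server sends back; any JavaScript in the page still goes unexecuted.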

Flagstone answered 17/9, 2012 at 21:1 Comment(0)

It sounds like you want a library that can act like a browser and run the javascript for you, then give you the resulting source code. Windmill should be able to do this for you. (http://www.getwindmill.com/)

There is a good article on how to use it for what you want here:
http://www.packtpub.com/article/web-scraping-with-python

Fretwell answered 23/1, 2014 at 2:1 Comment(0)
