Why can urlopen download a Google search page but not a Google Scholar search page?
Asked Answered
A

1

1

I'm using Python 3.2.3's urllib.request module to download Google search results, but I'm getting an odd error in that urlopen works with links to Google search results, but not Google Scholar. In this example, I'm searching for "JOHN SMITH". This code successfully prints HTML:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)

but this code, doing the same for Google Scholar, raises a URLError exception:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    req_scholar.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)

Traceback:

Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>

I obtained these links by searching in Chrome and copying the link from there. One commenter reported a 403 error, which I sometimes get as well. I presume this is because Google doesn't support scraping of Scholar. However, changing the User Agent string doesn't fix this or the original problem, since I get URLErrors most of the time.  

Adin answered 14/7, 2012 at 13:42 Comment(2)
I get a 403 (Forbidden), which probably means Google does not want you to scrape information from Scholar searches. The terms of service might disallow this (I did not check them).Diversification
@SvenMarnach I updated the question, since I tried changing the User Agent string. Sometimes I get a URLError, and sometimes I get a 403 error. Most of the time I get the former, however.Adin
Y
4

This PHP script seems to indicate you'll need to set some cookies before Google gives you results:

/*

 Need a cookie file (scholar_cookie.txt) like this:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com     TRUE    /       FALSE   2147483647      GSP     ID=353e8f974d766dcd:CF=2
.google.com     TRUE    /       FALSE   1317124758      PREF    ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk   TRUE    /       FALSE   2147483647      GSP     ID=f3f18b3b5a7c2647:CF=2
.google.co.uk   TRUE    /       FALSE   1317125123      PREF    ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

*/

This is corroborated by Python recipe for Google Scholar comment, which includes a warning that Google detects scripts and will disable you if you use it too prolifically.

Yardage answered 14/7, 2012 at 15:36 Comment(2)
The Python recipe is great, albeit a bit outdated since the HTML layout of the results page has since changed. With a few tweaks, though, it's working exactly as needed.Adin
I hate to link this question to a new question, but do you have any idea why the cookie you describe there doesn't validate as a proper cookie? I've been trying to figure that out, since it's holding up my entire application.Adin

© 2022 - 2024 — McMap. All rights reserved.