Changing user agent on urllib2.urlopen
C

9

102

How can I download a webpage with a user agent other than the default one on urllib2.urlopen?


urllib2.urlopen is not available in Python 3.x; the 3.x equivalent is urllib.request.urlopen. See Changing User Agent in Python 3 for urllib.request.urlopen to set the user agent in 3.x with the standard library HTTP facilities.

Caper answered 29/4, 2009 at 12:32 Comment(0)
B
61

Setting the User-Agent from everyone's favorite Dive Into Python.

The short story: You can use Request.add_header to do this.

You can also pass the headers as a dictionary when creating the Request itself, as the docs note:

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib2's default user agent string is "Python-urllib/2.6" (on Python 2.6).
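
A minimal sketch of both approaches, written for Python 3, where urllib2.Request became urllib.request.Request. The URL and agent string are placeholders, and nothing is sent over the network; the request objects are only inspected. One detail worth knowing: CPython stores header names via str.capitalize(), so the key reads back as 'User-agent'.

```python
import urllib.request

URL = "http://www.example.com/"   # placeholder URL
AGENT = "Mozilla/5.0"             # placeholder agent string

# Option 1: add the header after creating the Request.
req_a = urllib.request.Request(URL)
req_a.add_header("User-Agent", AGENT)

# Option 2: pass a headers dictionary to the constructor.
req_b = urllib.request.Request(URL, headers={"User-Agent": AGENT})

# Both requests end up carrying the same single header.
print(req_a.header_items())
print(req_b.header_items())
```

Either request can then be passed to urllib.request.urlopen() in place of a plain URL string.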

Breast answered 29/4, 2009 at 12:34 Comment(0)
E
119

I answered a similar question a couple weeks ago.

There is example code in that question, but basically you can do something like this. (Note the capitalization of User-Agent, per RFC 2616, section 14.43.)

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
response = opener.open('http://www.stackoverflow.com')
Etom answered 29/4, 2009 at 12:56 Comment(3)
That method works for other headers, but not User-Agent -- at least not in my 2.6.2 installation. User-Agent is ignored for some reason.Pygidium
I believe User-agent should in fact be User-Agent (the A is capitalised). Seems to work for me when done so.Kenspeckle
Header names are case-insensitive.Principality
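
Since the comments above disagree on whether the header is actually sent, one way to check is to point the opener at a throwaway local server that echoes back the User-Agent it received. The following is a sketch in Python 3 (urllib.request instead of urllib2); the EchoUserAgent handler and the fetch_user_agent_seen_by_server helper are hypothetical names made up for this illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoUserAgent(BaseHTTPRequestHandler):
    """Hypothetical test handler: replies with the User-Agent it received."""

    def do_GET(self):
        body = self.headers.get("User-Agent", "").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def fetch_user_agent_seen_by_server(user_agent):
    """Start a local server, send one request, return the echoed header."""
    server = HTTPServer(("127.0.0.1", 0), EchoUserAgent)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # An empty ProxyHandler bypasses any system proxy settings.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        opener.addheaders = [("User-Agent", user_agent)]
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        with opener.open(url, timeout=5) as response:
            return response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


print(fetch_user_agent_seen_by_server("Mozilla/5.0"))
```

If the printed value matches the string you set, the header reached the server; a remote site rejecting the request would then have some other cause.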
P
105
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://www.example.com', None, headers)
html = urllib2.urlopen(req).read()

Or, a bit shorter:

req = urllib2.Request('http://www.example.com', headers={ 'User-Agent': 'Mozilla/5.0' })
html = urllib2.urlopen(req).read()
Panegyrize answered 4/3, 2011 at 15:58 Comment(2)
With named parameters you can do this in two lines. Remove the first line and replace the second with this: req = urllib2.Request('http://www.example.com', headers={'User-Agent': 'Mozilla/5.0'}). I prefer this form for making just a single request.Tantalizing
Or even shorter, in one line: html = urlopen(Request('http://www.example.com', headers={'User-Agent': 'Mozilla/5.0'})).read()Osmo
H
15

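For Python 3, the old urllib and urllib2 are split across several modules (urllib.request, urllib.parse, urllib.error); Request and urlopen live in urllib.request: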

import urllib.request
req = urllib.request.Request(url="http://localhost/", headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20100101 Firefox/12.0'})
handler = urllib.request.urlopen(req)
Haematoxylon answered 6/5, 2012 at 7:19 Comment(3)
This helped wonderfully. I don't understand why i need request.Request and then repeat urllib.request.urlopen where the old version would just do urllib.urlopen(req) fine but either way, this works and I know how to use it in python 3 now.Diphenylamine
I am still getting Error 404 :(Upend
I've removed the confusing data=b'None' parameter from the answer. It transformed the example request to POST with invalid data. Probably the reason of the failure in your case, @MaksimOsmo
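
The effect of the removed data argument described in the last comment can be seen without sending anything. This is a small sketch using the same placeholder host as the answer; get_method() reports which HTTP method urllib would use for the request.

```python
import urllib.request

URL = "http://localhost/"  # placeholder host, as in the answer above
HEADERS = {"User-Agent": "Mozilla/5.0"}

# Without a body the request is a GET.
plain = urllib.request.Request(URL, headers=HEADERS)

# Supplying any data, even the bytes b'None', turns it into a POST.
with_body = urllib.request.Request(URL, data=b"None", headers=HEADERS)

print(plain.get_method())      # GET
print(with_body.get_method())  # POST
```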
P
9

All these should work in theory, but (with Python 2.7.2 on Windows at least) any time you send a custom User-agent header, urllib2 doesn't send that header. If you don't try to send a User-agent header, it sends the default Python/urllib2 User-Agent.

None of these methods seem to work for adding User-agent but they work for other headers:

opener = urllib2.build_opener(proxy)
opener.addheaders = {'User-agent':'Custom user agent'}
urllib2.install_opener(opener)

request = urllib2.Request(url, headers={'User-agent':'Custom user agent'})

request.headers['User-agent'] = 'Custom user agent'

request.add_header('User-agent', 'Custom user agent')
Pitfall answered 24/1, 2012 at 21:32 Comment(2)
opener.addheaders should probably be [('User-agent', 'Custom user agent')]. Otherwise all these methods should work (I've tested on Python 2.7.3 (Linux)). In your case it might break because you use the proxy argument wrong.Buckshee
For me the build_opener call returns with a default User-Agent being already defined in the headers. So appending will just create another User-Agent header, which as 2nd will be ignored. That's why @jcoon's sol is working.Comfrey
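
The point in the last comment about the default header can be illustrated by looking at an opener's addheaders list. This is a sketch in Python 3 (urllib.request in place of urllib2); user_agent_entries is a hypothetical helper written only for this example.

```python
import urllib.request


def user_agent_entries(opener):
    """Return the values of every User-Agent entry in opener.addheaders."""
    return [value for name, value in opener.addheaders
            if name.lower() == "user-agent"]


# A fresh opener already carries a default User-agent entry.
default = urllib.request.build_opener()
print(user_agent_entries(default))

# Appending leaves the default in place, so there are now two entries.
appended = urllib.request.build_opener()
appended.addheaders.append(("User-Agent", "Custom user agent"))
print(user_agent_entries(appended))

# Assigning a new list replaces the default entirely.
replaced = urllib.request.build_opener()
replaced.addheaders = [("User-Agent", "Custom user agent")]
print(user_agent_entries(replaced))
```

Replacing the list, as in the opener-based answer, avoids having two competing entries.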
U
6

For urllib you can use:

from urllib import FancyURLopener

class MyOpener(FancyURLopener, object):
    version = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11'

myopener = MyOpener()
myopener.retrieve('https://www.google.com/search?q=test', 'useragent.html')
Unfortunate answered 20/3, 2015 at 8:1 Comment(0)
E
5

Another solution in urllib2 and Python 2.7:

req = urllib2.Request('http://www.example.com/')
req.add_unredirected_header('User-Agent', 'Custom User-Agent')
urllib2.urlopen(req)
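
For comparison, here is a sketch of the same call in Python 3, inspected without sending anything. It assumes CPython's implementation, where such headers are kept in a separate unredirected_hdrs dictionary rather than in headers; the library documents them as headers that are not added to a redirected request.

```python
import urllib.request

req = urllib.request.Request("http://www.example.com/")
req.add_unredirected_header("User-Agent", "Custom User-Agent")

# The header is present on the request...
print(req.has_header("User-agent"))  # True

# ...but stored apart from the regular headers (CPython detail).
print(req.headers)
print(req.unredirected_hdrs)
```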
Emersonemery answered 14/1, 2013 at 18:54 Comment(1)
I get a 404 error for a page that exists when the URL is entered through my browser.Hemielytron
H
2

There are two properties of urllib.URLopener(), namely:
addheaders = [('User-Agent', 'Python-urllib/1.17'), ('Accept', '*/*')] and
version = 'Python-urllib/1.17'.
To fool the website, you need to change both of these values to an accepted User-Agent, e.g.
Chrome browser: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.149 Safari/537.36'
Googlebot: 'Googlebot/2.1'
like this:

import urllib
page_extractor=urllib.URLopener()  
page_extractor.addheaders = [('User-Agent', 'Googlebot/2.1'), ('Accept', '*/*')]  
page_extractor.version = 'Googlebot/2.1'
page_extractor.retrieve(<url>, <file_path>)

Changing just one property does not work because the website flags it as a suspicious request.

Histrionism answered 7/8, 2017 at 5:19 Comment(0)
H
1

Try this :

import requests

html_source_code = requests.get("http://www.example.com/",
                   headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.107 Safari/537.36',
                            'Upgrade-Insecure-Requests': '1',
                            'x-runtime': '148ms'}, 
                   allow_redirects=True).content
Holcman answered 29/7, 2015 at 7:30 Comment(1)
The question explicitly discusses urllib2 and not other modules.Cupel
