Changing User Agent in Python 3 for urllib.request.urlopen

Asked 15/6, 2014 at 5:18 Answered 19/4, 2017 at 16:43

Solved python python-3.x urllib user-agent

I want to open a url using urllib.request.urlopen('someurl'):

import urllib.request

with urllib.request.urlopen('someurl') as url:
    b = url.read()

I keep getting the following error:

urllib.error.HTTPError: HTTP Error 403: Forbidden

I understand the error is caused by the site refusing access from Python in order to stop bots from wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I have found for changing the user agent use urllib2, and I am using Python 3, so none of them work.

How can I fix this problem in Python 3?

Montgomery answered 15/6, 2014 at 5:18 Comment(1)

a 403 error may not be due to your user-agent. – Cudlip 15/6, 2014 at 5:30

116

From the Python docs:

import urllib.request
req = urllib.request.Request(
    url, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
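To check that the header is actually attached before sending anything, the request object can be inspected offline. This is only a small sketch: `make_request` and the example.com URL are made-up names for illustration, not part of the docs snippet above.

```python
import urllib.request


def make_request(url, user_agent):
    # Hypothetical helper: build a Request carrying a custom User-Agent.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


req = make_request('http://example.com/', 'Mozilla/5.0')

# Request stores header names via str.capitalize(), so the key becomes
# 'User-agent'; looking up 'User-Agent' with get_header() returns None.
print(req.get_header('User-agent'))
print(req.header_items())
```

Nothing is sent over the network here; the request only goes out once it is passed to urllib.request.urlopen(req).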

Yurev answered 15/6, 2014 at 5:21 Comment(3)

import urllib.request ImportError: No module named request – Oeildeboeuf 10/5, 2018 at 19:45

@Oeildeboeuf stop using Python 2, this is Python 3 – Unless 13/7, 2019 at 15:12

I got this error >web_byte = req.read() AttributeError: 'bytes' object has no attribute 'read' – Magree 18/9, 2020 at 12:18

from urllib.request import urlopen, Request

urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))

Industrialist answered 9/4, 2015 at 19:3 Comment(2)

This is important. I had to import urllib.request not simply urllib. Everything else in the accepted answer works with this modification. – Dunseath 24/1, 2016 at 3:21

Yeah, you do, but the accepted answer doesn't so I wanted to draw attention to your answer because it addresses a flaw in the accepted one. – Dunseath 26/1, 2016 at 7:8

I just answered a similar question here: https://mcmap.net/q/242315/-urlretrieve-and-user-agent-python

If you want not only to open the URL but also to download the resource (say, a PDF file), you can use code like the following:

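    from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

    # Point this at a local proxy to inspect the traffic, e.g.:
    # proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
    proxy = ProxyHandler({})
    opener = build_opener(proxy)
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
    # install_opener makes this opener the default for urlopen and urlretrieve
    install_opener(opener)

    result = urlretrieve(url=file_url, filename=file_name)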

I added the proxy handler so that I could monitor the traffic in Charles.
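If changing the process-wide default with install_opener is not wanted, a variation is to keep the opener local and call its open() method directly. This is a rough sketch under that assumption: `make_opener`, `download`, and the `USER_AGENT` value are illustrative names, not something from the code above.

```python
import shutil
from urllib.request import build_opener

USER_AGENT = 'Mozilla/5.0'  # placeholder value; substitute your own string


def make_opener(user_agent=USER_AGENT):
    # Build an opener whose default headers carry the custom User-Agent.
    opener = build_opener()
    opener.addheaders = [('User-Agent', user_agent)]
    return opener


def download(opener, url, filename):
    # Stream the response body to a local file without install_opener().
    with opener.open(url) as response, open(filename, 'wb') as out:
        shutil.copyfileobj(response, out)
    return filename
```

Usage would look like download(make_opener(), file_url, file_name). Other code in the same process that calls urlopen keeps the default user agent.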

Trickish answered 19/4, 2017 at 16:43 Comment(0)

The host site rejection is coming from the OWASP ModSecurity Core Rules for Apache mod-security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.

Unfortunately, if you use Python's urllib.robotparser module,

https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser

it uses the default Python user agent, and there's no parameter to change that. If robotparser's attempt to read robots.txt is refused with a 401 or 403 (not just URL not found), it then treats all URLs from that site as disallowed.
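One possible workaround, offered as a sketch and not as part of the original answer, is to fetch robots.txt yourself with a custom User-Agent and hand the text to RobotFileParser.parse() instead of calling read(). The function names `load_robots` and `parser_from_text` are made up for this example.

```python
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser


def parser_from_text(text):
    # Feed already-downloaded robots.txt content to the parser.
    rp = RobotFileParser()
    rp.parse(text.splitlines())
    return rp


def load_robots(robots_url, user_agent):
    # Download robots.txt with our own User-Agent, bypassing rp.read(),
    # which would send the default Python user agent.
    req = Request(robots_url, headers={'User-Agent': user_agent})
    with urlopen(req) as resp:
        text = resp.read().decode('utf-8', errors='replace')
    return parser_from_text(text)
```

After that, rp.can_fetch(user_agent, url) can be used as usual. Note that this sketch does not reproduce read()'s special handling of 401/403/404 responses; an HTTPError from urlopen simply propagates to the caller.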

Surah answered 19/5, 2016 at 23:7 Comment(0)
