How can I use Python's Requests to fake a browser visit, a.k.a. generate a User-Agent? [duplicate]
C

9

192

I want to get the content from this website.

If I use a browser like Firefox or Chrome, I get the real page I want, but if I use the Python Requests package (or the wget command), I get a totally different HTML page.

I suspect the developer of the website has put some blocking in place for this.

How do I fake a browser visit using Python's Requests or the wget command?

Cephalochordate answered 26/12, 2014 at 3:29 Comment(1)
Who chooses these names? There is a well-known example of typosquatting involving Requests ("Request" - without the "s"). See also What is the story behind RussianIdiot on PyPI?.Lemmon
L
406

Provide a User-Agent header:

import requests

url = 'http://www.ichangtou.com/#company:data_000008.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get(url, headers=headers)
print(response.content)
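To check what Requests would actually send, the request can be prepared without being sent. The sketch below assumes a session-level header is acceptable for your use case; the helper name build_prepared_request is made up for illustration, and no network traffic is involved.

```python
import requests

BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
)


def build_prepared_request(url, user_agent=BROWSER_UA):
    """Prepare (but do not send) a GET request with a custom User-Agent.

    Hypothetical helper, not part of Requests itself.
    """
    session = requests.Session()
    # Session-level headers apply to every request made through this session.
    session.headers.update({"User-Agent": user_agent})
    return session.prepare_request(requests.Request("GET", url))


# What Requests announces when no header is supplied.
default_ua = requests.utils.default_user_agent()
prepared = build_prepared_request("http://www.ichangtou.com/")

print("Default:", default_ua)
print("Sent:   ", prepared.headers["User-Agent"])
```

Without the override, Requests identifies itself with a string starting with python-requests/, which is easy for a server to filter on.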

FYI, here is a list of User-Agent strings for different browsers:


As a side note, there is a pretty useful third-party package called fake-useragent that provides a nice abstraction layer over user agents:

fake-useragent

Up to date simple useragent faker with real world database

Demo:

>>> from fake_useragent import UserAgent
>>> ua = UserAgent()
>>> ua.chrome
u'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1667.0 Safari/537.36'
>>> ua.random
u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36'
Lebel answered 26/12, 2014 at 3:31 Comment(9)
Thanks for your answer. I tried with the headers in my request but still could not get the real content of the page. The returned HTML contains the string 'Your web browser must have JavaScript enabled in order for this application to display correctly.' Should I add JavaScript support to Requests? If so, how would I do that?Cephalochordate
@user1726366: You can't simply add JavaScript support - you need a JavaScript interpreter for that. The simplest approach is to use the JavaScript interpreter of a real Web browser, but you can automate that from Python using Selenium.Intransigent
@alecxe, @sputnick: I captured the packets with Wireshark to compare Python Requests against the browser. It seems the page is not static and I have to wait for rendering to complete, so Selenium sounds like the right tool for me. Thank you for your kind help. :)Cephalochordate
It turns out some search engines filter certain user agents. Does anyone know why? Could anyone provide a list of acceptable user agents?Korten
This is the top User-Agent attacking us nowadays, I wonder why ><Mackle
The link to List of all Browsers seems to be dead now.Armalda
Is this legal? What if I develop a mobile app that uses this in the backend and some website gets high enough traffic to cause problems?Larynx
user-agents.top is also hosting a list of user agents. There are lists of user agents that passed captcha testArroyo
The Python library mentioned here, fake_useragent, has been outdated for a long time and does not work on my system. Instantiation fails under all circumstances.Exocentric
C
57

I used fake-useragent.

Install:

pip install fake-useragent

How to use:

from fake_useragent import UserAgent
import requests
   

ua = UserAgent()
print(ua.chrome)
header = {'User-Agent':str(ua.chrome)}
print(header)
url = "https://www.hybrid-analysis.com/recent-submissions?filter=file&sort=^timestamp"
htmlContent = requests.get(url, headers=header)
print(htmlContent)

Output:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17
{'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
<Response [200]>

Additional features are available as well. Randomly return only Edge or Chrome user agent strings:

from fake_useragent import UserAgent
ua = UserAgent(browsers=['edge', 'chrome'])
ua.random

Randomly return only Linux OS user agent strings:

from fake_useragent import UserAgent
ua = UserAgent(os='linux')
ua.random

Or randomly return user agent strings with a usage percentage of at least 1.3%:

from fake_useragent import UserAgent
ua = UserAgent(min_percentage=1.3)
ua.random
Centrobaric answered 12/4, 2017 at 7:48 Comment(2)
Let us continue this discussion in chat.Centrobaric
The 404 should now be resolved. I'm the maintainer of this Python package. It now also has the ability to filter on OS and usage percentage.Mothy
U
11

Try this, which sends a Mozilla-style fake user agent (it is also a good starter script for web scraping with cookies):

#!/usr/bin/env python2
# -*- coding: utf8 -*-
# vim:ts=4:sw=4


import cookielib, urllib2, sys

def doIt(uri):
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    # Set the header on the opener before the request is sent
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    page = opener.open(uri)
    print page.read()

for i in sys.argv[1:]:
    doIt(i)

Usage:

python script.py "http://www.ichangtou.com/#company:data_000008.html"
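The script above targets Python 2. A rough Python 3 equivalent using only the standard library might look like the sketch below; the function names make_opener and fetch are chosen here for illustration. The key point is that addheaders belongs on the opener and must be set before open() is called.

```python
import http.cookiejar
import urllib.request


def make_opener(user_agent="Mozilla/5.0"):
    """Return (opener, cookie_jar); the opener sends a custom User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # Replaces urllib's default "Python-urllib/x.y" User-Agent header.
    opener.addheaders = [("User-agent", user_agent)]
    return opener, jar


def fetch(uri):
    """Download a page; cookies set by the server end up in the jar."""
    opener, _jar = make_opener()
    with opener.open(uri) as page:
        return page.read()


opener, jar = make_opener()
print(opener.addheaders)
```

Calling fetch("http://www.ichangtou.com/") would perform the actual download.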
Uticas answered 26/12, 2014 at 3:34 Comment(0)
S
7

The root of the problem is that the asker needs a JavaScript interpreter to get what they are after. However, I have often been able to get all the information I wanted from a website as JSON, before it was processed by JavaScript. This saved me a ton of time that would otherwise have gone into parsing HTML and hoping every page used the same format.

So when you get a response from a website using Requests, really look at the html/text part because you might find the JavaScript's JSON in the footer ready to be parsed.
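As a minimal sketch of that idea, assume the page assigns its data to a JavaScript variable inside an inline script tag. The variable name window.__DATA__ and the sample HTML are invented for illustration; a real site will use its own names, so inspect the response text first.

```python
import json
import re

# Invented sample standing in for response.text from Requests.
SAMPLE_HTML = """
<html><body>
<div id="app"></div>
<script>window.__DATA__ = {"company": "000008", "prices": [1.5, 2.0]};</script>
</body></html>
"""


def extract_embedded_json(html, variable="window.__DATA__"):
    """Return the JSON value assigned to `variable`, or None if absent."""
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ";" and closing tags).
    data, _end = json.JSONDecoder().raw_decode(html, match.end())
    return data


data = extract_embedded_json(SAMPLE_HTML)
print(data["prices"])
```

Using raw_decode avoids having to write a regular expression that matches the end of the JSON object, which is fragile with nested braces.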

Sarsenet answered 19/11, 2017 at 2:27 Comment(0)
C
6

I use pyuser_agent. This package provides user agent strings.

import pyuser_agent
import requests

ua = pyuser_agent.UA()

headers = {
      "User-Agent" : ua.random
}
print(headers)

uri = "https://github.com/THAVASIGTI/"
res = requests.request("GET",uri,headers=headers)
print(res)

Console output

{'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/533+ (KHTML, like Gecko)'}
<Response [200]>
Chally answered 11/12, 2021 at 18:32 Comment(2)
I think you should disclose that you are the author of the software you are promoting here.Wentletrap
What do you mean by "This package uses get user agent" (seems incomprehensible)? Can you elaborate?Lemmon
D
5

Answer

You need to create a header with a properly formatted user agent string. The server uses it to identify the client.

You can check your own user agent here.

Example

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0

Third party Package user_agent 0.1.9

I found this module very simple to use, in one line of code it randomly generates a User agent string.

from user_agent import generate_user_agent, generate_navigator, generate_navigator_js
from pprint import pprint

print(generate_user_agent())
# 'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.3; Win64; x64)'

print(generate_user_agent(os=('mac', 'linux')))
# 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:36.0) Gecko/20100101 Firefox/36.0'

pprint(generate_navigator())

# {'app_code_name': 'Mozilla',
#  'app_name': 'Netscape',
#  'appversion': '5.0',
#  'name': 'firefox',
#  'os': 'linux',
#  'oscpu': 'Linux i686 on x86_64',
#  'platform': 'Linux i686 on x86_64',
#  'user_agent': 'Mozilla/5.0 (X11; Ubuntu; Linux i686 on x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
#  'version': '41.0'}

pprint(generate_navigator_js())

# {'appCodeName': 'Mozilla',
#  'appName': 'Netscape',
#  'appVersion': '38.0',
#  'platform': 'MacIntel',
#  'userAgent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:38.0) Gecko/20100101 Firefox/38.0'}
Dardar answered 7/12, 2020 at 4:36 Comment(0)
I
2

Setting a user agent is fine, but the asker wants to fetch a JavaScript-rendered site. Selenium works, but it is annoying to set up and maintain, so I find the easiest way to fetch a JavaScript-rendered page is the requests_html module, which extends the well-known Requests module. To install, use pip:

pip install requests-html

And to fetch a JavaScript rendered page, use:

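from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
r.html.render()  # run the page's JavaScript in headless Chromium
print(r.html.html)  # the HTML after rendering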

It uses pyppeteer to render JavaScript, and it downloads Chromium the first time render() runs, but you don't have to worry about what is happening under the hood. You will get the end result.

Intramuscular answered 21/6, 2022 at 3:11 Comment(1)
No, it is with "s" at the end (Requests). That is how typosquatting comes about. See e.g. Typosquatting in package repositories.Lemmon
T
1

I had a similar issue, but I was unable to use the UserAgent class inside the fake_useragent module. I was running the code inside a Docker container.

import requests
import ujson
import random

response = requests.get('https://fake-useragent.herokuapp.com/browsers/0.1.11')
agents_dictionary = ujson.loads(response.text)
random_browser_number = str(random.randint(0, len(agents_dictionary['randomize']) - 1))  # randint includes both ends
random_browser = agents_dictionary['randomize'][random_browser_number]
user_agents_list = agents_dictionary['browsers'][random_browser]
user_agent = user_agents_list[random.randint(0, len(user_agents_list)-1)]

I targeted the endpoint used in the module. This solution still gave me a random user agent. However, there is the possibility that the data structure at the endpoint could change.
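If the endpoint is unreachable or its structure changes, one fallback is to pick from a small hand-maintained list. The sketch below uses only the standard library; the list FALLBACK_USER_AGENTS and the helper pick_user_agent are illustrative names, and the strings are simply ones quoted elsewhere on this page, so they should be refreshed for real use.

```python
import random

# Hand-maintained fallback list (strings copied from this page).
FALLBACK_USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) "
    "Gecko/20100101 Firefox/47.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36",
]


def pick_user_agent(rng=random):
    """Return one user agent string chosen uniformly at random."""
    # random.choice avoids the index arithmetic that randint requires.
    return rng.choice(FALLBACK_USER_AGENTS)


headers = {"User-Agent": pick_user_agent()}
print(headers)
```

The resulting headers dictionary can be passed to requests.get(url, headers=headers) as in the other answers.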

Torgerson answered 4/12, 2020 at 13:36 Comment(0)
L
1

This is how I have been picking a random user agent from a list of nearly 1000 fake user agents:

from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.CHROME.value]  # matches the Chrome output shown below
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value, OperatingSystem.MAC.value]

user_agent_rotator = UserAgent(software_names=software_names, operating_systems=operating_systems, limit=1000)

# Get list of user agents.
user_agents = user_agent_rotator.get_user_agents()

user_agent_random = user_agent_rotator.get_random_user_agent()

Example

print(user_agent_random)

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36

For more details visit this link

Lobeline answered 29/12, 2020 at 7:6 Comment(0)
