Getting the latest Chrome user agent for Scrapy, in Python or otherwise

Recently I started using Scrapy regularly to analyze sites that require an up-to-date browser user agent before they show their content. This may seem like an old problem, yet to date it remains open. Why?

There is no simple API or package to generate or download the latest user agent strings for any OS or platform.

A number of packages try to resolve this:

  1. shadow-useragent - but it relies on a volunteer-run server, which is inactive for some reason as of now.
  2. latest-user-agents - but it also relies on a hosted JSON file, it lists old user agents, and it has no documentation.

Lastly, there is the website www.whatismybrowser.com, which is very helpful but can't be easily automated...

Any clue how to resolve this?

Daubery answered 21/6, 2021 at 10:22

An old question, but I was also looking for this feature.

Your second option, latest-user-agents, works: it fetches the JSON file from a source that the same author updates daily.

The README.md contains old user agents because the repository itself is not updated, so they are only an example. I tried it today, and as far as I could see it returns all the latest user agents.
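
Whichever source supplies the string, the Scrapy side is small: Scrapy reads its default User-Agent header from the `USER_AGENT` setting. The following is only a minimal sketch. The helper name `scrapy_settings_for` and the constant `EXAMPLE_UA` are made up for illustration, and the string is a placeholder rather than a freshly fetched value.

```python
def scrapy_settings_for(user_agent):
    """Return a settings dict that makes Scrapy send `user_agent` (hypothetical helper)."""
    if not user_agent:
        raise ValueError("user_agent must be a non-empty string")
    # USER_AGENT is the Scrapy setting used for the default User-Agent header.
    return {"USER_AGENT": user_agent}


# Placeholder value; in practice this would come from the package or the feed.
EXAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36"
)

settings = scrapy_settings_for(EXAMPLE_UA)

# In a spider, the dict could be assigned to the class attribute, for example:
#   class MySpider(scrapy.Spider):
#       name = "example"
#       custom_settings = settings
```

Setting `USER_AGENT = "..."` directly in the project's settings.py should have the same effect for the whole project.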

Overstreet answered 6/3, 2022 at 11:55

There is @jnrbsn's feed at https://jnrbsn.github.io/user-agents/user-agents.json (also recommended by @Matteus).

  • This project is a cron-based, GitHub-hosted scraper that gets the latest user agent strings from whatismybrowser.com and dumps them into a JSON file

  • It includes permutations of user agent strings for Firefox, Chrome, Edge, and Safari on Linux, macOS, and Windows

  • 🔔 Note that using this feed potentially exposes you to legal risk, as whatismybrowser.com has a similar commercial offering via an API subscription: https://explore.whatismybrowser.com/useragents/explore/


Personally, I needed something like this for one of my non-profit experimental projects, and I decided to use @jnrbsn's feed. I needed to keep the user-agent string in my Python script up to date with the latest Chrome on Windows version.

Here is the short (and naive) Python function I wrote for that:

import requests

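def get_latest_user_agent(operating_system='windows', browser='chrome'):
    # The feed is a JSON list of user agent strings.
    url = 'https://jnrbsn.github.io/user-agents/user-agents.json'
    r = requests.get(url, timeout=10)  # time out rather than hang forever
    r.raise_for_status()
    user_agents = r.json()

    # Return the first string that mentions both the OS and the browser.
    # This is plain substring matching, so 'mac' matches 'Macintosh'.
    for user_agent in user_agents:
        ua = user_agent.lower()
        if operating_system.lower() in ua and browser.lower() in ua:
            return user_agent

    return None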


print(get_latest_user_agent(operating_system='windows', browser='chrome'))
print(get_latest_user_agent(operating_system='linux', browser='chrome'))
print(get_latest_user_agent(operating_system='mac', browser='chrome'))

outputs:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36
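
One caveat with the substring check above: Edge strings also contain "Chrome/", and Chrome strings also contain "Safari/", so asking for 'safari' can return a Chrome string, depending on the order of the feed. A stricter variant might look like the sketch below. The names `classify_browser`, `pick_user_agent` and `SAMPLE` are made up for illustration, and the sample strings are hard-coded so the example runs offline; their version numbers are only illustrative.

```python
import re

# Checked in order, most specific first: an Edge string also contains
# "Chrome/" and "Safari/", and a Chrome string also contains "Safari/".
_BROWSER_RULES = [
    ("edge", re.compile(r"\bEdg/")),
    ("firefox", re.compile(r"\bFirefox/")),
    ("chrome", re.compile(r"\bChrome/")),
    ("safari", re.compile(r"\bSafari/")),
]

# Platform tokens as they appear in desktop user agent strings.
_OS_TOKENS = {"windows": "Windows NT", "mac": "Macintosh", "linux": "Linux"}


def classify_browser(user_agent):
    """Return 'edge', 'firefox', 'chrome', 'safari', or None."""
    for name, pattern in _BROWSER_RULES:
        if pattern.search(user_agent):
            return name
    return None


def pick_user_agent(user_agents, operating_system="windows", browser="chrome"):
    """Return the first string matching both the OS and the exact browser."""
    os_token = _OS_TOKENS.get(operating_system.lower())
    if os_token is None:
        return None
    for user_agent in user_agents:
        if os_token in user_agent and classify_browser(user_agent) == browser.lower():
            return user_agent
    return None


# Illustrative sample, not fetched from the feed.
SAMPLE = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.6 Safari/605.1.15",
]

print(pick_user_agent(SAMPLE, "windows", "chrome"))
```

With this sample, the Windows Chrome request skips the Edge entry and returns the second string. In real use, the list passed in would be the parsed JSON from the feed.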
Rincon answered 31/8 at 9:48