Google Search Web Scraping with Python [closed]

I've been learning a lot of python lately to work on some projects at work.

Currently I need to do some web scraping with google search results. I found several sites that demonstrated how to use ajax google api to search, however after attempting to use it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.

Luht answered 27/7, 2016 at 17:20 Comment(4)
You can search with Google without an API, but you're likely to get banned if Google suspects you're a bot. Read the ToS; you'll likely have to pay to use their API in any significant way. – Spiculate
I researched how to do it without an API; I have to change my header/user-agent info, but even when I do that I still can't get results. If that worked, I'd just put a sleep timer between each request so as not to be viewed as a bot. – Luht
I have written a Google search bot and it works great, but since using a bot directly violates Google's ToS, I'm not going to post it. Whatever you're trying to do, maybe go through the official APIs. – Spiculate
It is against Google's Webmaster Guidelines and terms of service to submit programmatic search queries. Running this code against Google is likely to cause it to show a captcha for searches from your IP address. – Israel

You can always scrape Google results directly. To do this, request the URL https://google.com/search?q=<Query>, which returns the top 10 search results.

Then you can use, for example, lxml to parse the page. Depending on what you use, you can query the resulting node tree either via a CSS selector (.r a) or via an XPath selector (//h3[@class="r"]/a).

In some cases the resulting URL will redirect through Google. It usually contains a query parameter q, which holds the actual target URL.

Example code using lxml and requests:

from urllib.parse import urlparse, parse_qs

from lxml.html import fromstring
from requests import get

# Use a browser-like User-Agent header, otherwise Google may block the request.
headers = {"User-Agent": "Mozilla/5.0"}
raw = get("https://www.google.com/search?q=StackOverflow", headers=headers).text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    # Google sometimes wraps results in a /url? redirect; the real
    # target URL is in the q query parameter.
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)["q"][0]
    print(url)

A note on Google banning your IP: in my experience, Google only bans you if you start spamming it with search requests. It will respond with a 503 if it thinks you are a bot.
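If you do scrape directly, a small retry-with-backoff helper for that 503 case can help. This is a minimal sketch; the retry count and delay values are arbitrary choices, not documented Google behavior:

```python
import time

import requests


def fetch_with_backoff(url, headers=None, max_retries=3, base_delay=5.0):
    """GET a URL, backing off exponentially when the server answers 503."""
    response = None
    for attempt in range(max_retries):
        response = requests.get(url, headers=headers)
        if response.status_code != 503:
            return response
        # The server thinks we are a bot; wait longer after each
        # refusal before retrying.
        time.sleep(base_delay * (2 ** attempt))
    return response
```

Even with backoff, keep the request rate low; backing off only helps with transient refusals, not with a hard ban.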

Motion answered 27/7, 2016 at 18:46 Comment(4)
Thanks, I was able to get something working similar to this. – Luht
As of today, this is not working for me. When I view the source and DOM structure of the Google search results page, it looks as if the results are being loaded and rendered in JavaScript, which would prevent this sort of naive scraping. Is this working for anyone else? – Shah
@Lane Rettig Works fine. – Noyes
Not working for me; page.cssselect(".r a") is an empty array. – Amiss

Here is another service that can be used for scraping SERPs: https://zenserp.com. It does not require a client and is cheaper.

Here is a python code sample:

import requests

headers = {
    'apikey': '',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)

# Print the raw JSON response
print(response.text)
Orlando answered 14/3, 2019 at 18:8 Comment(1)
I have been using the API for two months, since it was the only one offering a free plan to start with. It's working well and I have not had problems so far! – Excuse

You have two options: building it yourself or using a SERP API.

A SERP API will return the Google search results as a formatted JSON response.

I would recommend a SERP API as it is easier to use, and you don't have to worry about getting blocked by Google.

1. SERP API

I have had good experience with the ScraperBox SERP API.

You can use the following code to call the API. Make sure to replace YOUR_API_TOKEN with your ScraperBox API token.

import urllib.parse
import urllib.request
import ssl
import json
# Note: this disables TLS certificate verification; only do this if you
# understand the security implications.
ssl._create_default_https_context = ssl._create_unverified_context

# URL-encode the query string
q = urllib.parse.quote_plus("Where can I get the best coffee")

# Create the query URL.
query = "https://api.scraperbox.com/google"
query += "?token=%s" % "YOUR_API_TOKEN"
query += "&q=%s" % q
query += "&proxy_location=gb"

# Call the API.
request = urllib.request.Request(query)

raw_response = urllib.request.urlopen(request).read()
raw_json = raw_response.decode("utf-8")
response = json.loads(raw_json)

# Print the first result title
print(response["organic_results"][0]["title"])
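As a side note, the query string above is assembled by hand with % formatting; urllib.parse.urlencode builds the same URL while handling the escaping of every parameter value for you:

```python
from urllib.parse import urlencode

# Build the same query URL with urlencode, which escapes every
# parameter value (spaces become '+', etc.).
params = {
    "token": "YOUR_API_TOKEN",
    "q": "Where can I get the best coffee",
    "proxy_location": "gb",
}
query = "https://api.scraperbox.com/google?" + urlencode(params)
print(query)
```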

2. Build your own Python scraper

I recently wrote an in-depth blog post on how to scrape search results with Python.

Here is a quick summary.

First you should get the HTML contents of the Google search result page.

import urllib.request

url = 'https://google.com/search?q=Where+can+I+get+the+best+coffee'

# Perform the request
request = urllib.request.Request(url)

# Set a normal User Agent header, otherwise Google will block the request.
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36')
raw_response = urllib.request.urlopen(request).read()

# Read the response as a utf-8 string
html = raw_response.decode("utf-8")

Then you can use BeautifulSoup to extract the search results. For example, the following code will get all titles.

from bs4 import BeautifulSoup

# The code to get the html contents here.

soup = BeautifulSoup(html, 'html.parser')

# Find all the search result divs
divs = soup.select("#search div.g")
for div in divs:
    # Search for a h3 tag
    results = div.select("h3")

    # Check if we have found a result
    if results:

        # Print the title
        h3 = results[0]
        print(h3.get_text())

You can extend this code to also extract the search result URLs and descriptions.
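For example, extracting the link target alongside the title could look like the sketch below. It runs against a hand-made HTML fragment, since Google's real markup differs and changes often; treat the selectors as a starting point:

```python
from bs4 import BeautifulSoup

# A hand-made stand-in for the fetched results page.
html = """
<div id="search">
  <div class="g">
    <a href="https://example.com"><h3>Example title</h3></a>
  </div>
</div>
"""

results = []
for div in BeautifulSoup(html, "html.parser").select("#search div.g"):
    link = div.select_one("a")
    title = div.select_one("h3")
    # Skip divs that are not real results (no link or title).
    if link and title:
        results.append((title.get_text(), link.get("href")))

print(results)
```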

Ecklund answered 4/1, 2021 at 14:15 Comment(1)
#1 doesn't work for the basic example from their own page. Guess Google got to them too. – Priedieu

Current answers will work, but Google will ban you for scraping.

My current solution uses the requests_ip_rotator library, which routes requests through AWS API Gateway so they come from different IPs:

import os

import requests
from requests_ip_rotator import ApiGateway

keywords = ['test']


def parse(keyword, session):
    url = f"https://www.google.com/search?q={keyword}"
    response = session.get(url)
    print(response)


if __name__ == '__main__':
    # Read the AWS credentials from the environment instead of hardcoding them.
    AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID', '')
    AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY', '')

    gateway = ApiGateway("https://www.google.com", access_key_id=AWS_ACCESS_KEY_ID,
                         access_key_secret=AWS_SECRET_ACCESS_KEY)
    gateway.start()

    # Route all requests to google.com through the gateway.
    session = requests.Session()
    session.mount("https://www.google.com", gateway)

    for keyword in keywords:
        parse(keyword, session)
    gateway.shutdown()

You can create the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in the AWS console.

This solution allows you to send up to 1 million requests (the Amazon free-tier limit).

Demimondaine answered 26/12, 2022 at 15:14 Comment(1)
Nice! Seems to work great. – Marginalia

You can also use a third-party service like Serp API (I wrote and run this tool), a paid Google search engine results API. It solves the issue of being blocked, and you don't have to rent proxies or do the result parsing yourself.

It's easy to integrate with Python:

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()
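get_dictionary() returns a plain dict, so standard dict access applies. The organic_results / title / link field names below are my assumption about the JSON layout (check the SerpApi docs); the sketch runs against a hand-made sample rather than a live response:

```python
# A hand-made sample standing in for get_dictionary()'s output;
# the organic_results/title/link field names are assumptions.
dictionary_results = {
    "organic_results": [
        {"title": "Best Coffee in Austin", "link": "https://example.com"},
    ]
}

# Pull out just the result titles.
titles = [r["title"] for r in dictionary_results.get("organic_results", [])]
print(titles)
```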

GitHub: https://github.com/serpapi/google-search-results-python

Tartaglia answered 24/4, 2018 at 0:44 Comment(2)
You need to pay for this API key. – Dailey
@TejasKrishnaReddy there's a non-commercial free plan with 100 searches per month. – Librettist

You can also use Serpdog's (https://serpdog.io) Google Search API to scrape Google search results in Python:

import requests
payload = {'api_key': 'APIKEY', 'q':'coffee' , 'gl':'us'}
resp = requests.get('https://api.serpdog.io/search', params=payload)
print(resp.text)

Docs: https://docs.serpdog.io

Disclaimer: I am the founder of serpdog.io

Revolutionize answered 25/5, 2023 at 19:1 Comment(0)

Another service that can be used for scraping Google Search or other SERP data is SearchApi. You may want to test it out, as it offers 100 free credits upon registration. It provides a rich JSON data set and also returns the raw HTML of the request for free, so you can compare the HTML with the parsed results.

Documentation for Google Search API: https://www.searchapi.io/docs/google

Python execution example:

import requests

payload = {'api_key': 'key', 'engine': 'google', 'q':'pizza'}
response = requests.get('https://www.searchapi.io/api/v1/search', params=payload)

print(response.text)

Disclaimer: I work for SearchApi

Pohl answered 12/7, 2023 at 17:3 Comment(0)

I can think of at least three ways to do this:

  • Custom Search JSON API by Google
  • Create your own DIY scraper solution
  • Using SerpApi (recommended)

1. Using Custom Search JSON API by Google

You can use the "Google Custom Search JSON API". First, you must set up a Custom Search Engine (CSE) and get an API key from the Google Cloud Console. Once you have both, you can make HTTP requests to the API using Python's requests library or the Google API client library for Python. By passing your search query and API key as parameters, you'll receive search results in JSON format, which you can then process as needed.

Remember, the API isn't free and has usage limits, so monitor your queries to avoid unexpected costs.
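A Custom Search request is an HTTP GET against https://www.googleapis.com/customsearch/v1 with your API key, your CSE id (cx), and the query. The helper below only builds that URL (no network call), with placeholder credentials:

```python
from urllib.parse import urlencode


def build_cse_url(api_key, cse_id, query):
    """Build a Google Custom Search JSON API request URL."""
    params = {"key": api_key, "cx": cse_id, "q": query}
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)


url = build_cse_url("YOUR_API_KEY", "YOUR_CSE_ID", "coffee")
print(url)
```

You would then fetch this URL with requests and read the items list from the JSON response.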

2. Create your own DIY solution

If you're looking for a DIY solution to get Google search results in Python without relying on Google's official API, you can use web scraping tools like BeautifulSoup and requests. Here's a simple approach:

2.1. Use the requests library to fetch the HTML content of a Google search results page.

2.2. Parse the HTML using BeautifulSoup to extract data from the search results.

You might face issues like IP bans or other scraping problems. Also, Google's structure might change, causing your scraper to break. The point is that building your own Google scraper will come with many challenges.
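Steps 2.1 and 2.2 can be sketched roughly as follows; the User-Agent string and the h3 selector are assumptions that will need updating whenever Google's markup changes:

```python
import requests
from bs4 import BeautifulSoup


def fetch_results_page(query, user_agent="Mozilla/5.0"):
    """Step 2.1: fetch the HTML of a Google search results page."""
    response = requests.get(
        "https://www.google.com/search",
        params={"q": query},
        # Google tends to block requests' default User-Agent.
        headers={"User-Agent": user_agent},
    )
    return response.text


def parse_titles(html):
    """Step 2.2: pull result titles out of the HTML with BeautifulSoup."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text() for h3 in soup.select("h3")]
```

Keeping fetching and parsing in separate functions makes the parser easy to test on saved HTML when the live page is unavailable or your IP gets blocked.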

3. Using SerpApi to make it all easy

SerpApi provides a more structured and reliable way to obtain Google search results without directly scraping Google. It essentially serves as a middleman, handling the complexities of scraping and providing structured JSON results, so you can save time and energy collecting data from Google without building your own scraper or using other web scraping tools.

Here is a tutorial on how to scrape Google SERPs with Python for the third option.

Hope it helps!

Telly answered 17/10, 2023 at 22:57 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.