Searching in Google with Python
I want to search for a text in Google using a Python script and return the name, description and URL for each result. I'm currently using this code:

from google import search

ip = raw_input("What would you like to search for? ")

for url in search(ip, stop=20):
    print(url)

This returns only the URLs. How can I also return the name and description for each URL?

Barnie answered 28/7, 2016 at 11:45 Comment(2)
Which google search API did you use?Fleer
It is against Google's Webmaster Guidelines and terms of service to submit programmatic search queries. Running this code against Google is likely to cause Google to show captcha for searches from your IP address.Calchas
15

Not exactly what I was looking for, but I found a solution that works for now (I might edit this if I'm able to improve it). I combined searching Google as I did before (returning only the URL) with the Beautiful Soup package for parsing the HTML pages:

from googlesearch import search
import urllib.request
from bs4 import BeautifulSoup

def google_scrape(url):
    # fetch the page and return the text of its <title> tag
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    return soup.title.text

i = 1
query = 'search this'
for url in search(query, stop=10):
    title = google_scrape(url)
    print(str(i) + ". " + title)
    print(url)
    print(" ")
    i += 1

This gives me a list of page titles and links.
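If you also need the description, the same idea extends to the page's meta description tag. A minimal sketch (the helper name is mine, and many pages simply omit the tag, hence the guard):

import urllib.request
from bs4 import BeautifulSoup

def google_scrape_description(url):
    # return the content of the page's <meta name="description"> tag, if present
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html.parser")
    tag = soup.find("meta", attrs={"name": "description"})
    return tag.get("content", "") if tag else ""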

And another solution along the same lines:

from googlesearch import search
import requests

def everything_between(text, start, end):
    # helper not shown in the original post: returns the substring
    # between the first occurrences of start and end
    return text.split(start, 1)[1].split(end, 1)[0]

query = 'search this'
for url in search(query, stop=10):
    r = requests.get(url)
    title = everything_between(r.text, '<title>', '</title>')
    print(title)
Barnie answered 28/7, 2016 at 14:23 Comment(5)
ImportError: cannot import name 'search'Nightie
@pyd Maybe I'm too late to answer :D Try from googlesearch import search, i.e. use 'googlesearch' instead of 'google' ;)Kettledrum
This worked like a charm but now captcha blocks future searches. Any workarounds for this?Saskatchewan
Neither of these two solutions works anymore... the code must be updated for 2022, because Google has changed many things.Drench
Hartator's answer is a good option if you don't want to have to deal with captchas or continual changes on Google's side https://mcmap.net/q/442456/-searching-in-google-with-pythonMotionless
24

I assume you are using this library by Mario Vilas, because of the stop=20 argument, which appears in his code. It seems this library cannot return anything but URLs, making it badly underdeveloped. As such, what you want to do is not possible with the library you are currently using.

I would suggest you instead use abenassi/Google-Search-API. Then you can simply do:

from google import google
num_page = 3
search_results = google.search("This is my query", num_page)
for result in search_results:
    print(result.description)
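If the library installs correctly, each result object should also expose the page name and URL; the attribute names below are taken from the project's README, so treat them as an assumption to verify against your installed version:

for result in search_results:
    # attribute names per the abenassi/Google-Search-API README (verify locally)
    print(result.name)         # page title
    print(result.link)         # URL
    print(result.description)  # snippet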
Fleer answered 28/7, 2016 at 11:58 Comment(8)
I'm getting: Traceback (most recent call last): File "Z:/test/test_google.py", line 57, in <module> from google import google ImportError: cannot import name googleBarnie
@Barnie You will have to download the library first. Use the instructions in the link.Fleer
This worked really well. I initially had issues because I hadn't noticed the python-2.7 tag and was trying to install the library in Python 3. After installing in Python 2, it did exactly what I needed.Brogdon
Hi, I am getting this error: 'str' object has no attribute 'description' when I try to call print(result.description). Anything I can do about that? I ran the exact same code...Substandard
This works in python 3 also for anybody who is reading comments.Samale
It doesn't work for me in python3, I don't know what you're talking about. There's an issue open to support python3, which does not appear to be complete.Chenab
This code does not handle the corner cases: if the name or any other property is None, we have to handle it on our side, which shouldn't be necessary in a robust library. For some results it doesn't return the name/link even though the page has them. The documentation says a few fields are not read properly due to encoding issues. So I can't use this!Kettledrum
It seems Google will block you for this solutionHearttoheart
10

Most of the approaches here I tried, but they either didn't work for me or gave errors like 'search module not found' despite the packages being imported. I did get Selenium WebDriver working, and it works great with Firefox, Chrome or PhantomJS, but I felt it was a bit slow in execution time, since it drives a browser first and then returns the search results.
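For reference, the Selenium route looks roughly like this. A sketch assuming Firefox and the selenium package; the h3 selector for result titles is an assumption that may break as Google updates its markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

# drives a real browser, which is why this is slower than calling an API
driver = webdriver.Firefox()
driver.get("https://www.google.com/search?q=this+is+my+query")
for element in driver.find_elements(By.CSS_SELECTOR, "h3"):
    print(element.text)  # result titles
driver.quit()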

So I turned to the Google API instead, and it works amazingly quickly and returns results accurately.

Before I share the code, here are a few quick steps to follow:

  1. Register with Google APIs to get a Google API key (free tier)
  2. Search for Google Custom Search and set up your free account to get a custom search engine ID
  3. Add the google-api-python-client package to your Python project (pip install google-api-python-client)

That's it; all you have to do now is run this code:

from googleapiclient.discovery import build

my_api_key = "YOUR API KEY HERE"
my_cse_id = "YOUR CUSTOM SEARCH ENGINE ID HERE"

def google_search(search_term, api_key, cse_id, **kwargs):
    service = build("customsearch", "v1", developerKey=api_key)
    res = service.cse().list(q=search_term, cx=cse_id, **kwargs).execute()
    # 'items' is absent when the query returns no results
    return res.get('items', [])

results = google_search("YOUR SEARCH QUERY HERE", my_api_key, my_cse_id, num=10)

for result in results:
    print(result["link"])
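Each returned item also carries the page title and snippet, which map onto the name and description asked for in the question (standard fields of the Custom Search JSON API response):

for result in results:
    # 'title', 'snippet' and 'link' are standard Custom Search JSON API fields
    print(result["title"])    # name
    print(result["snippet"])  # description
    print(result["link"])     # URL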
Doxology answered 6/3, 2018 at 1:50 Comment(4)
would you be able to provide the link to the google api python client documentation?Smolensk
This is a good solution, but for internal use only. As an enterprise solution it is costly :)Kettledrum
I very much want to use your solution. But it seems that when setting up custom search id, it's specific to a particular site, e.g. "www.myownsite.com". And it doesn't apply to all the results from google.Hearttoheart
Complementing yangliu2 comment: it appears that this option can't be used to search all Google results, but it is not limited to a single domain, nor to domains you own. You can include a list of websites to be searched (owned by you or not), at least according to docs here and hereMinister
7

You can also use a third-party service like SerpApi, which provides Google search engine results. It solves the problems of having to rent proxies and parse the HTML results, and the JSON output is particularly rich.

It's easy to integrate with Python:

from serpapi import GoogleSearch

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearch(params)
dictionary_results = query.get_dict()
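The returned dictionary contains an organic_results list whose entries carry the title, link and snippet fields (field names per SerpApi's documented JSON layout), which is exactly the name/URL/description triple the question asks for:

for result in dictionary_results.get("organic_results", []):
    # field names per SerpApi's JSON output
    print(result.get("title"))    # name
    print(result.get("link"))     # URL
    print(result.get("snippet"))  # description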

GitHub: https://github.com/serpapi/google-search-results-python

Berenice answered 23/4, 2018 at 20:1 Comment(3)
Unfortunately they only have paid versions, I guess. You need a credit card for the trial version too.Barcellona
you can use brightdata.com as well, they appear to be cheaper than SerpApi at the moment and you can test for free.Prescriptible
SerpApi has a free plan offering 100 searches per month, you don't need to enter credit card info to sign up for it. Also, there is a new Python library, which may be even more straightforward to use: serpapi-python.readthedocs.io/en/latestMotionless
2

Usually, you cannot use the Google search function from Python by importing the google package in Python 3, but you can use it in Python 2.

Even using requests.get(url+query), the scraping won't work, because Google prevents scraping by redirecting to a captcha page.

Possible ways:

  • You can write the code in Python 2
  • If you want to write it in Python 3, make two files and retrieve the search results from a Python 2 script (see the sketch after this list)
  • If that proves difficult, the best way is to use Google Colab or a Jupyter Notebook with a Python 3 runtime; you won't get any error there
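A minimal sketch of the two-file approach; the file names and the use of subprocess are my own illustration, not part of the original answer:

# search_py2.py, run under Python 2: prints one URL per line
from google import search
import sys

for url in search(sys.argv[1], stop=10):
    print(url)

# caller_py3.py, run under Python 3: invokes the Python 2 script and reads its output
# (assumes a 'python2' interpreter is available on PATH)
import subprocess

output = subprocess.check_output(["python2", "search_py2.py", "my query"])
print(output.decode().splitlines())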
Durant answered 7/11, 2019 at 18:8 Comment(0)
1

You can use the Google Search Origin package, which integrates most of the parameters available on Google (including dorks and filters). It is based on the official Google documentation. Moreover, using it creates an object, so the search is easily modifiable. For more information, look at the project here: https://pypi.org/project/google-search-origin/

Here is an example of how to use it:

import google_search_origin


if __name__ == '__main__':
    # Initialise the class
    google_search = google_search_origin.GoogleSearchOrigin(search='sun')

    # Request the assembled url
    google_search.request_url()

    # Display the links parsed from the response
    print(google_search.get_all_links())

    # Modify the search parameter
    google_search.parameter_search('dog')

    # Reassemble the url
    google_search.assemble_url()

    # Request the assembled url
    google_search.request_url()

    # Display the raw response text
    print(google_search.get_response_text())
Halbert answered 16/10, 2021 at 11:5 Comment(0)
0

As an alternative, if for some reason using an API is not an option, you can get by with the BeautifulSoup web-scraping library.

If necessary, you can extract data from all result pages using an infinite while loop.

The while loop will go through every page, no matter how many there are, until a certain condition is fulfilled. In our case that condition is the presence of the "next page" button on the page (the .d6cvqb a[id=pnnext] CSS selector):

# stop the loop on the absence of the next page
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break

When you make a request, the site may decide that you are a bot. To prevent this, send headers that contain a user-agent with the request; the site will then assume that you are a regular user and display the information.

Here is the full code:

from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
query = input("What would you like to search for? ")
params = {
    "q": query,          # search query
    "hl": "en",          # interface language
    "gl": "uk",          # country of the search, UK -> United Kingdom
    "start": 0,          # offset of the first result, incremented by 10 per page
    # "num": 100         # maximum number of results to return per request
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10          # page limit if you don't need to fetch everything
page_num = 0

data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            # some results have no snippet
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # stop the loop once the page limit is reached
    if page_num == page_limit:
        break
    # stop the loop in the absence of a next page
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "title": "Web Scraping with Python - Pluralsight",
    "snippet": "There are times in which you need data but there is no API (application programming interface) to be found. Web scraping is the process of extracting data ...",
    "links": "https://www.pluralsight.com/paths/web-scraping-with-python"
  },
  {
    "title": "Chapter 8 Web Scraping | Machine learning in python",
    "snippet": "Web scraping means extacting data from the “web”. However, web is not just an anonymous internet “out there” but a conglomerat of servers and sites, ...",
    "links": "http://faculty.washington.edu/otoomet/machinelearning-py/web-scraping.html"
  },
  {
    "title": "Web scraping 101",
    "snippet": "This vignette introduces you to the basics of web scraping with rvest. You'll first learn the basics of HTML and how to use CSS selectors to refer to ...",
    "links": "https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html"
  },
  other results ...
]

There's a "13 ways to scrape any public data from any website" blog post if you want to know more about website scraping.

Edp answered 1/3, 2023 at 16:35 Comment(0)
