Google Search from a Python App

I'm trying to run a Google search query from a Python app. Is there a Python interface out there that would let me do this? If there isn't, does anyone know which Google API would enable me to do this? Thanks.

Brittnee answered 1/11, 2009 at 16:21

There's a simple example here (peculiarly missing some quotes;-). Most of what you'll see on the web is Python interfaces to the old, discontinued SOAP API -- the example I'm pointing to uses the newer and supported AJAX API, and that's definitely the one you want!-)

Edit: here's a more complete Python 2.6 example with all the needed quotes &c;-)...:

#!/usr/bin/python
import json
import urllib

def showsome(searchfor):
  query = urllib.urlencode({'q': searchfor})
  url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
  search_response = urllib.urlopen(url)
  search_results = search_response.read()
  results = json.loads(search_results)
  data = results['responseData']
  print 'Total results: %s' % data['cursor']['estimatedResultCount']
  hits = data['results']
  print 'Top %d hits:' % len(hits)
  for h in hits: print ' ', h['url']
  print 'For more results, see %s' % data['cursor']['moreResultsUrl']

showsome('ermanno olmi')
Morganite answered 1/11, 2009 at 16:30
Tried this on my local Linux machine, and then Google thought I was a bot: every search from my browser got a CAPTCHA! I shouldn't have tried this at work; just a heads-up for anyone using this. Add a User-Agent and Referer header to make it look more like a genuine request! -- Flanders
Unfortunately the Google Web Search API on which this relies was deprecated in November 2010. The Custom Search API is supposed to replace it, but requires you to configure a list of URLs to search across -- not the entire web. -- Bremser
As of today (2014.06.10), this is working ... on my IPython/Python 2.7.6. -- Littlefield
As of March 2016, this doesn't work. Google responds with the following: {"responseData": null, "responseDetails": "The Google Web Search API is no longer available. Please migrate to the Google Custom Search API (developers.google.com/custom-search)", "responseStatus": 403} -- Maximilien
As mentioned above, this is a deprecated API that no longer works. Also, Google now serves everything over HTTPS, so the plain http:// URL is obsolete on its own. The same goes for John La Rooy's answer below. -- Adkinson
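The migration path the comments point to is the Custom Search JSON API, which requires an API key and a search engine ID (cx) from your own Google account. A minimal Python 3 sketch, assuming the documented v1 endpoint and response fields -- the key and cx values below are placeholders, not working credentials:

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = 'https://www.googleapis.com/customsearch/v1'

def build_search_url(api_key, cx, searchfor):
    # key and cx come from your own Google Cloud / Programmable Search setup
    params = urllib.parse.urlencode({'key': api_key, 'cx': cx, 'q': searchfor})
    return '%s?%s' % (API_ENDPOINT, params)

def showsome(api_key, cx, searchfor):
    url = build_search_url(api_key, cx, searchfor)
    with urllib.request.urlopen(url) as response:
        results = json.loads(response.read().decode('utf8'))
    print('Total results: %s' % results['searchInformation']['totalResults'])
    for item in results.get('items', []):
        print(' ', item['link'])

# showsome('YOUR_API_KEY', 'YOUR_CX_ID', 'ermanno olmi')
```

Unlike the old AJAX API, every request must be authenticated, and free usage is quota-limited.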

Here is Alex's answer ported to Python 3

#!/usr/bin/python3
import json
import urllib.request, urllib.parse

def showsome(searchfor):
  query = urllib.parse.urlencode({'q': searchfor})
  url = 'http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s' % query
  search_response = urllib.request.urlopen(url)
  search_results = search_response.read().decode("utf8")
  results = json.loads(search_results)
  data = results['responseData']
  print('Total results: %s' % data['cursor']['estimatedResultCount'])
  hits = data['results']
  print('Top %d hits:' % len(hits))
  for h in hits: print(' ', h['url'])
  print('For more results, see %s' % data['cursor']['moreResultsUrl'])

showsome('ermanno olmi')
Deserved answered 1/11, 2009 at 19:9
What would be the advantage of using Python 3 over Alex's answer? -- Birkenhead
@Phill, not sure what you mean by "advantage". If your project uses Python 2, you use Alex's answer; if the project uses Python 3, you can use this answer. Unfortunately it's not really practical to write this function so that the same code works with both versions of Python. -- Deserved
I guess my question is: why use Python 3 over Python 2? What are the benefits? I'm new to Python, coming from a PHP background. Are things more formalised? -- Birkenhead
@Phill, Python 3 is a cleaner, more consistent design than Python 2, but it is not fully backwards compatible. Typically the changes required to port code are quite small, as you can see here; however, a number of third-party libraries and frameworks still don't support Python 3, so many people are still using Python 2. -- Deserved
Is there a way to get more than 4 hits? -- Overflow
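For what it's worth, the two ports differ mainly in where the urllib functions live; a conventional try/except import shim (not from either answer, just the usual pattern) lets one script find them under both interpreters:

```python
try:
    # Python 3 locations
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:
    # Python 2 locations
    from urllib import urlencode, urlopen

# single-argument print(x) parses the same in both versions; the
# multi-argument print calls in showsome() would still need reworking
query = urlencode({'q': 'ermanno olmi'})
```

The .decode("utf8") call and the multi-argument prints would still need version-specific handling, so the shim only gets you part of the way.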

Here's my approach to this: http://breakingcode.wordpress.com/2010/06/29/google-search-python/

A couple of code examples:

    # Get the first 20 hits for: "Breaking Code" WordPress blog
    from google import search
    for url in search('"Breaking Code" WordPress blog', stop=20):
        print(url)

    # Get the first 20 hits for "Mariposa botnet" in Google Spain
    from google import search
    for url in search('Mariposa botnet', tld='es', lang='es', stop=20):
        print(url)

Note that this code does NOT use the Google API, and is still working to date (January 2012).

Carmichael answered 10/1, 2012 at 10:57
Hi Mario, I have tried your script and it's fabulous. I am facing just one issue: even when I use .COM as the TLD, I get the results that come up on .CO.IN. Can you please help? -- Devonna
Note this can break at any time, since it is not using an official API but scraping the Google results page -- e.g. if Google changes the way the results are returned. -- Clientage
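If you scrape anyway, the earlier advice about sending browser-like headers can be sketched as follows in Python 3. The header values are illustrative only, and none of this prevents Google from blocking the request or changing the page layout:

```python
import urllib.parse
import urllib.request

def build_request(query):
    # Google tends to answer bare scripted requests with HTTP 403, so
    # send browser-like headers (values are illustrative, not magic)
    url = 'https://www.google.com/search?' + urllib.parse.urlencode({'q': query})
    return urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)',
        'Referer': 'https://www.google.com/',
    })

def fetch_results_page(query):
    with urllib.request.urlopen(build_request(query)) as resp:
        return resp.read().decode('utf-8', errors='replace')
```

Rate-limit your requests as well; as noted above, too many in a row can get your IP CAPTCHA'd.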

I am new to Python and was investigating how to do this. None of the provided examples worked properly for me: some are blocked by Google if you make many (even a few) requests, and some are outdated. Parsing the Google search HTML (adding a header to the request) will work until Google changes the HTML structure again. You can use the same logic to search in any other search engine, by looking into its HTML (view-source).

import urllib2

def getgoogleurl(search, siteurl=False):
    if siteurl == False:
        return 'http://www.google.com/search?q=' + urllib2.quote(search)
    else:
        return 'http://www.google.com/search?q=site:' + urllib2.quote(siteurl) + '%20' + urllib2.quote(search)

def getgooglelinks(search, siteurl=False):
    # Google returns 403 without a User-Agent header
    headers = {'User-agent': 'Mozilla/11.0'}
    req = urllib2.Request(getgoogleurl(search, siteurl), None, headers)
    site = urllib2.urlopen(req)
    data = site.read()
    site.close()

    # no BeautifulSoup because the Google HTML is generated with JavaScript
    start = data.find('<div id="res">')
    end = data.find('<div id="foot">')
    if data[start:end] == '':
        # error, no links to find
        return False
    else:
        links = []
        data = data[start:end]
        start = 0
        end = 0
        while start > -1 and end > -1:
            # get only results from the provided site
            if siteurl == False:
                start = data.find('<a href="/url?q=')
            else:
                start = data.find('<a href="/url?q=' + str(siteurl))
            data = data[start + len('<a href="/url?q='):]
            end = data.find('&amp;sa=U&amp;ei=')
            if start > -1 and end > -1:
                link = urllib2.unquote(data[0:end])
                data = data[end:len(data)]
                if link.find('http') == 0:
                    links.append(link)
        return links

Usage:

links = getgooglelinks('python','http://www.stackoverflow.com/')
for link in links:
       print link

(Edit 1: Adding a parameter to narrow the google search to a specific site)

(Edit 2: When I added this answer I was coding a Python script to search subtitles. I recently uploaded it to Github: Subseek)

Nonesuch answered 7/2, 2013 at 5:23
I'm interested in why none of the examples worked for you, especially the bit about BeautifulSoup not working because the HTML is generated by JavaScript... I've tried mine just now and it's working: breakingcode.wordpress.com/2010/06/29/google-search-python -- Comptometer
In my case I wasn't able to use BeautifulSoup. I tested it, and it seems that Google was generating the HTML response with JavaScript blocks, so I didn't find a way to get the links with the BS class. I only found the links in the response using the "find" function. -- Nonesuch
Maybe the URL to Google is pointing to the newer API that uses JavaScript instead of the legacy API that used bare HTML. I believe adding "&btnG=Google+Search" to your queries causes it to use the HTML API, or at least that's the only difference I see. -- Comptometer
@Comptometer Thanks for the tip. I will try it using that parameter. Maybe it's faster that way? -- Nonesuch

© 2022 - 2024 — McMap. All rights reserved.