Google search with Python requests library

(I've tried looking but all of the other answers seem to be using urllib2)

I've just started trying to use requests, but I'm still not very clear on how to send or request something additional from the page. For example, I'll have

import requests

r = requests.get('http://google.com')

but I have no idea how to then, for example, perform a Google search using the search bar presented. I've read the quickstart guide, but I'm not very familiar with HTML POST requests and the like, so it hasn't been very helpful.

Is there a clean and elegant way to do what I am asking?

Levana answered 25/3, 2014 at 1:25 Comment(3)
You can use the Google API without a client library. I'm using Google Drive in Python 3 with the urllib.request module. – Tully
Well, I didn't mean it just in the context of Google; there are other sites/databases that I'd also like to be able to search. Also, I thought the standard nowadays was the requests module, because urllib/urllib2 had become clunky/outdated? – Levana
Some methods (GET) pass their parameters in the URL; others (POST) pass them in the request body (data). Both accept headers (pairs of keyword and value). – Tully
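A minimal sketch of that distinction, using httpbin.org as a stand-in endpoint:

import requests

# GET: parameters travel in the URL as name=value pairs
r = requests.get('https://httpbin.org/get', params={'q': 'red sox'})
print(r.url)  # https://httpbin.org/get?q=red+sox

# POST: parameters travel in the request body
r = requests.post('https://httpbin.org/post', data={'q': 'red sox'})

# Both accept headers as a dict of name/value pairs
r = requests.get('https://httpbin.org/get', headers={'Accept': 'application/json'})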

Request Overview

The Google search request is a standard HTTP GET command. It includes a collection of parameters relevant to your queries. These parameters are included in the request URL as name=value pairs separated by ampersand (&) characters. Parameters include data like the search query and a unique CSE ID (cx) that identifies the CSE that is making the HTTP request. The WebSearch or Image Search service returns XML results in response to your HTTP requests.

First, you must get your CSE ID (the cx parameter) from the Control Panel of your Custom Search Engine.

Then see the official Google Developers site for Custom Search.

There are many examples like this:

http://www.google.com/search?
  start=0
  &num=10
  &q=red+sox
  &cr=countryCA
  &lr=lang_fr
  &client=google-csbe
  &output=xml_no_dtd
  &cx=00255077836266642015:u-scht7a-8i

The list of parameters you can use is explained there.
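Since the question is about requests specifically, the same GET request can be sketched with a params dictionary (the cx value below is the placeholder from the example above; substitute your own CSE ID):

import requests

# Parameters from the example URL above; requests
# assembles them into the name=value query string.
params = {
    'start': 0,
    'num': 10,
    'q': 'red sox',
    'cr': 'countryCA',
    'lr': 'lang_fr',
    'client': 'google-csbe',
    'output': 'xml_no_dtd',
    'cx': '00255077836266642015:u-scht7a-8i',  # placeholder CSE ID
}

r = requests.get('http://www.google.com/search', params=params)
print(r.text)  # XML results from the WebSearch service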

Tully answered 25/3, 2014 at 8:54 Comment(0)
import requests 
from bs4 import BeautifulSoup

headers_Get = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }


def google(q):
    s = requests.Session()
    q = '+'.join(q.split())
    url = 'https://www.google.com/search?q=' + q + '&ie=utf-8&oe=utf-8'
    r = s.get(url, headers=headers_Get)

    soup = BeautifulSoup(r.text, "html.parser")
    output = []
    for searchWrapper in soup.find_all('h3', {'class':'r'}): #this line may change in future based on google's web page structure
        url = searchWrapper.find('a')["href"] 
        text = searchWrapper.find('a').text.strip()
        result = {'text': text, 'url': url}
        output.append(result)

    return output

This will return a list of Google results in {'text': text, 'url': url} format. The top result's URL would be google('search query')[0]['url'].
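For example, a quick usage sketch (the query string here is just an illustration, and the h3.r selector above may already have changed):

results = google('python requests tutorial')

for result in results:
    print(result['text'], result['url'], sep='\n')

print(results[0]['url'])  # top result URL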

Troudeloup answered 23/9, 2017 at 0:31 Comment(1)
FYI, it's against Google's TOS to automate scripted searching; you should use Google's Custom Search API instead (developers.google.com/custom-search/docs/tutorial/creatingcse). Much cleaner, and no need to work with BeautifulSoup. – Itol

input:

import requests

def googleSearch(query):
    with requests.Session() as c:
        url = 'https://www.google.co.in'
        params = {'q': query}
        urllink = c.get(url, params=params)
        print(urllink.url)

googleSearch('Linkin Park')

output:

https://www.google.co.in/?q=Linkin+Park
Scythe answered 23/11, 2016 at 9:13 Comment(1)
That's great, but instead of "google.co.in" I use "google.com/search", which leads you directly to the search results! – Windbreak

The readable way to send a request with many query parameters would be to pass URL parameters as a dictionary:

params = {
  'q': 'minecraft', # search query
  'gl': 'us',       # country where to search from   
  'hl': 'en',       # language 
}

requests.get('URL', params=params)

But in order to get the actual response (output/text/data) that you see in the browser, you need to send additional headers, most importantly a user-agent string, so the request looks like a visit from a "real" browser rather than from a bot or script announcing itself as a different client.

The reason your request might be blocked is that the default requests user-agent is python-requests, and websites recognize that. Check what your user-agent is.
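For instance, requests exposes its default user-agent string, which is a quick way to see what a bare request announces:

import requests

# Prints something like 'python-requests/2.28.1'
print(requests.utils.default_user_agent())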

You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.

Pass user-agent:

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

requests.get('URL', headers=headers)

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
}

params = {
  'q': 'minecraft',
  'gl': 'us',
  'hl': 'en',
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  title = result.select_one('.DKV0Md').text
  link = result.select_one('.yuRUbf a')['href']
  print(title, link, sep='\n')

Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to create it from scratch and maintain it.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
  "engine": "google",
  "q": "tesla",
  "hl": "en",
  "gl": "us",
  "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['title'])
  print(result['link'])

Disclaimer: I work for SerpApi.

Sclerotomy answered 21/10, 2021 at 9:10 Comment(0)

Using bs4, this code gets all the h3 elements and prints their text:

# Import the beautifulsoup
# and requests libraries.
import requests
import bs4

# Our customized search keyword. Passing it via
# params lets requests URL-encode it; concatenating
# it raw into the URL would leave the '+' characters
# to be read as spaces by Google.
text = "c++ linear search program"
url = 'https://google.com/search'

# Fetch the URL data using requests.get(),
# store it in a variable, request_result.
request_result = requests.get(url, params={'q': text})

# Create soup from the fetched request and print
# the text of every h3 element.
soup = bs4.BeautifulSoup(request_result.text, "html.parser")
for heading in soup.find_all("h3"):
    print(heading.get_text())
Yaekoyael answered 9/8, 2021 at 4:37 Comment(0)

You can use webbrowser; I think it doesn't get easier than that:

import webbrowser

query = input('Enter your query: ')
webbrowser.open(f'https://google.com/search?q={query}')
Malevolent answered 10/10, 2022 at 15:56 Comment(0)
