As an alternative, if using the API is not an option for some reason, you can get by with the BeautifulSoup web scraping library.
If necessary, you can extract data from all pages using an infinite while loop. The while loop will keep going through pages, no matter how many there are, until a certain condition is fulfilled. In our case, that condition is the presence of the next-page button on the page (the .d6cvqb a[id=pnnext] CSS selector):
# stop the loop on the absence of the next page
if soup.select_one(".d6cvqb a[id=pnnext]"):
    params["start"] += 10
else:
    break
When you make a request, the site may decide that you are a bot and block you. To prevent this, send headers that contain a user-agent in the request; the site will then assume you are a regular user and return the page.
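As a minimal sketch of this idea, passing a user-agent with requests looks like the snippet below (the user-agent string is just an example of a real desktop browser; any reasonably up-to-date one will do):

import requests

# send a desktop browser user-agent so the request looks like it comes
# from a regular user rather than a script
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

response = requests.get("https://www.google.com/search", params={"q": "web scraping"}, headers=headers, timeout=30)
print(response.status_code)  # 200 if the request was not blocked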
Check the full code in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
query = input("What would you like to search for? ")
params = {
    "q": query,     # search query
    "hl": "en",     # language
    "gl": "uk",     # country of the search, UK -> United Kingdom
    "start": 0,     # page offset, 0 by default
    # "num": 100    # parameter defines the maximum number of results to return
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
}

page_limit = 10  # page limit if you don't need to fetch everything
page_num = 0
data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".tF2Cxc"):
        title = result.select_one(".DKV0Md").text
        try:
            snippet = result.select_one(".lEBKkf span").text
        except AttributeError:
            snippet = None
        links = result.select_one(".yuRUbf a")["href"]

        data.append({
            "title": title,
            "snippet": snippet,
            "links": links
        })

    # stop loop due to page limit condition
    if page_num == page_limit:
        break

    # stop the loop on the absence of the next page
    if soup.select_one(".d6cvqb a[id=pnnext]"):
        params["start"] += 10
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Web Scraping with Python - Pluralsight",
"snippet": "There are times in which you need data but there is no API (application programming interface) to be found. Web scraping is the process of extracting data ...",
"links": "https://www.pluralsight.com/paths/web-scraping-with-python"
},
{
"title": "Chapter 8 Web Scraping | Machine learning in python",
"snippet": "Web scraping means extacting data from the “web”. However, web is not just an anonymous internet “out there” but a conglomerat of servers and sites, ...",
"links": "http://faculty.washington.edu/otoomet/machinelearning-py/web-scraping.html"
},
{
"title": "Web scraping 101",
"snippet": "This vignette introduces you to the basics of web scraping with rvest. You'll first learn the basics of HTML and how to use CSS selectors to refer to ...",
"links": "https://cran.r-project.org/web/packages/rvest/vignettes/rvest.html"
},
other results ...
]
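If you want to keep the results instead of just printing them, a minimal sketch is to dump the collected list into a JSON file (the file name results.json is an arbitrary choice for this example):

import json

# assuming `data` is the list of dicts collected in the loop above,
# write it to a file so the results survive after the script exits
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2, ensure_ascii=False)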
If you want to know more about website scraping, have a look at the 13 ways to scrape any public data from any website blog post.