python: Google Search Scraper with BeautifulSoup
Asked Answered
S

3

6

Goal: Pass a search string to Google and scrape the URL, title, and the short description that gets published along with each result.

I have the following code, and at the moment it only gives the first 10 results, which is Google's default limit for one page. I am not sure how to handle pagination during web scraping. Also, when I compare the actual page results with what prints out, there is a discrepancy. I am also not sure of the best way to parse span elements.

So far I have the span as follows, and I want to remove the <em> element and concatenate the rest of the strings. What would be the best way to do that?

<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>

Code:

from BeautifulSoup import BeautifulSoup
import urllib, urllib2

def google_scrape(query):
    address = "http://www.google.com/search?q=%s&num=100&hl=en&start=0" % (urllib.quote_plus(query))
    request = urllib2.Request(address, None, {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11'})
    urlfile = urllib2.urlopen(request)
    page = urlfile.read()
    soup = BeautifulSoup(page)

    linkdictionary = {}

    for li in soup.findAll('li', attrs={'class':'g'}):
        sLink = li.find('a')
        print sLink['href']
        sSpan = li.find('span', attrs={'class':'st'})
        print sSpan

    return linkdictionary

if __name__ == '__main__':
    links = google_scrape('beautifulsoup')

My output looks like this:

http://www.crummy.com/software/BeautifulSoup/
<span class="st"><em>Beautiful Soup</em>: a library designed for screen-scraping HTML and XML.<br /></span>
http://pypi.python.org/pypi/BeautifulSoup/3.2.1
<span class="st"><span class="f">Feb 16, 2012 &ndash; </span>HTML/XML parser for quick-turnaround applications like screen-scraping.<br /></span>
http://www.beautifulsouptheatercollective.org/
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
http://lxml.de/elementsoup.html
<span class="st"><em>BeautifulSoup</em> is a Python package that parses broken HTML, just like lxml supports it based on the parser of libxml2. <em>BeautifulSoup</em> uses a different parsing <b>...</b><br /></span>
https://launchpad.net/beautifulsoup/
<span class="st">The discussion group is at: http://groups.google.com/group/<em>beautifulsoup</em> &middot; Home page <b>...</b> <em>Beautiful Soup</em> 4.0 series is  the current focus of development <b>...</b><br /></span>
http://www.poetry-online.org/carroll_beautiful_soup.htm
<span class="st"><em>Beautiful Soup BEAUTIFUL Soup</em>, so rich and green, Waiting in a hot tureen! Who for such dainties would not stoop? Soup of the evening, <em>beautiful Soup</em>!<br /></span>
http://www.youtube.com/watch?v=hDG73IAO5M8
<span class="st"><span class="f">Jul 6, 2009 &ndash; </span>taken from the motion picture &quot;Alice in wonderland&quot; (1999) http://www.imdb.com/<wbr>title/tt0164993/<br /></wbr></span>
http://www.soupsong.com/
<span class="st">A witty and substantive research effort on the history of soup and food in all cultures, with over 400 pages of recipes, quotations, stories, traditions, literary <b>...</b><br /></span>
http://www.facebook.com/beautifulsouptc
<span class="st">To connect with The <em>Beautiful Soup</em> Theater Collective, sign up for Facebook <b>...</b> We&#39;re thrilled to announce the cast of <em>Beautiful Soup&#39;s</em> upcoming production of <b>...</b><br /></span>
http://blog.dispatched.ch/webscraping-with-python-and-beautifulsoup/
<span class="st"><span class="f">Mar 15, 2009 &ndash; </span>Recently my life has been a hype; partly due to my upcoming Python addiction. There&#39;s simply no way around it; so I should better confess it in <b>...</b><br /></span>

Google search page results have the following structure:

<li class="g">
<div class="vsc" sig="bl_" bved="0CAkQkQo" pved="0CAgQkgowBQ">
<h3 class="r">
<div class="vspib" aria-label="Result details" role="button" tabindex="0">
<div class="s">
<div class="f kv">
<div id="poS5" class="esc slp" style="display:none">
<div class="f slp">3 answers&nbsp;-&nbsp;Jan 16, 2009</div>
<span class="st">
I read this without finding the solution:
<b>...</b>
The "normal" way is to: Go to the
<em>Beautiful Soup</em>
web site,
<b>...</b>
Brian beat me too it, but since I already have
<b>...</b>
<br>
</span>
</div>
<div>
</div>
<h3 id="tbpr_6" class="tbpr" style="display:none">
</li>

Each search result is listed under an <li> element.

Skepful answered 16/7, 2012 at 22:39 Comment(0)
A
2

This list comprehension will strip the tag.

>>> sSpan
<span class="st">The <em>Beautiful Soup</em> Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
>>> [em.replaceWithChildren() for em in sSpan.findAll('em')]
[None]
>>> sSpan
<span class="st">The Beautiful Soup Theater Collective was founded in the summer of 2010 by its Artistic Director, Steven Carl McCasland. A continuation of a student group he <b>...</b><br /></span>
Antibody answered 17/7, 2012 at 5:14 Comment(6)
Any idea how I can scrape more than 10 records from the results?Skepful
Iterate through the 'start' parameter in the URL: num=10&hl=en&start=0, num=10&hl=en&start=10, num=10&hl=en&start=20Antibody
Hi Chris, the above solution didn't work, so I edited it. But I see you have removed it. I will add my solution instead. Thanks for looking into it.Skepful
NH, if this didn't work for you I'd be happy to see the case that failed. While you can use regular expressions to strip tags in a simple case like this, it is a very bad practice to get into (see link below). Regex approaches rapidly become unworkable with real-world complexity. If you are already using a powerful package like BeautifulSoup to build your DOM, you might as well keep things simple and manipulate the DOM with the same tool too. Note: your original question only asked for stripping the <em> tags. If you just want the text content you can use sSpan.text.Antibody
[#1732848Antibody
@Null-Hypothesis - You can get more than 10 results by changing the value of num. Try num=50Lao
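The pagination suggestion from these comments — stepping the start parameter one page at a time — can be sketched as follows. This only builds the page URLs (fetching and parsing them works as in the question); the function name and defaults are illustrative, not from the original post:

```python
import urllib.parse

def result_page_urls(query, pages=3, per_page=10):
    """Build Google result-page URLs for `query`, advancing the
    `start` parameter by `per_page` for each successive page."""
    urls = []
    for page in range(pages):
        params = urllib.parse.urlencode({
            "q": query,
            "num": per_page,
            "hl": "en",
            "start": page * per_page,  # 0, 10, 20, ...
        })
        urls.append("http://www.google.com/search?" + params)
    return urls

for url in result_page_urls("beautifulsoup"):
    print(url)
```

Each returned URL can then be fetched and parsed with the same BeautifulSoup loop from the question.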
S
0

I constructed a simple HTML-stripping regular expression and then called the replace function on the cleaned-up string to remove the dots:

import re

p = re.compile(r'<.*?>')
print p.sub('',str(sSpan)).replace('.','')

Before

<span class="st">The <em>Beautiful Soup</em> is a collection of all the pretty places you would rather be. All posts are credited via a click through link. For further inspiration of pretty things, <b>...</b><br /></span>

After

The Beautiful Soup is a collection of all the pretty places you would rather be All posts are credited via a click through link For further inspiration of pretty things, 
Skepful answered 17/7, 2012 at 17:59 Comment(0)
M
0

To get the text from the span tag you can use the .text/get_text() methods that BeautifulSoup provides. bs4 does all the heavy lifting, so you don't need to worry about how to get rid of the <em> tag.
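For instance, applying get_text() to the span snippet from the question (a minimal sketch using the html.parser backend):

```python
from bs4 import BeautifulSoup

# The <span> snippet from the question
html = ('<span class="st">The <em>Beautiful Soup</em> Theater Collective '
        'was founded in the summer of 2010 by its Artistic Director, '
        'Steven Carl McCasland. A continuation of a student group he '
        '<b>...</b><br/></span>')

span = BeautifulSoup(html, 'html.parser').find('span', class_='st')
# get_text() concatenates the text of all descendants, dropping the tags
print(span.get_text())
```

The <em>, <b>, and <br/> tags are dropped and only their text content remains.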

Code and full example (Google won't show more than ~400 results):

from bs4 import BeautifulSoup
import requests, lxml, urllib.parse


def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text

    soup = BeautifulSoup(response, 'lxml')

    print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
    print(f'Current URL: {url}')
    print()

    for container in soup.findAll('div', class_='tF2Cxc'):
        head_text = container.find('h3', class_='LC20lb DKV0Md').text
        head_sum = container.find('div', class_='IsZvec').text
        head_link = container.a['href']
        print(head_text)
        print(head_sum)
        print(head_link)
        print()

    return soup.select_one('a#pnnext')


def scrape():
    next_page_node = print_extracted_data_from_url(
        'https://www.google.com/search?hl=en-US&q=coca cola')

    while next_page_node is not None:
        next_page_url = urllib.parse.urljoin('https://www.google.com',
                                             next_page_node['href'])

        next_page_node = print_extracted_data_from_url(next_page_url)

scrape()

Output:

Results via beautifulsoup

Current page: 1
Current URL: https://www.google.com/search?hl=en-US&q=coca cola

The Coca-Cola Company: Refresh the World. Make a Difference
We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.‎Contact Us · ‎Careers · ‎Coca-Cola · ‎Coca-Cola System
https://www.coca-colacompany.com/home

Coca-Cola
2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
https://www.coca-cola.com/

Together Tastes Better | Coca-Cola®
Coca-Cola is pairing up with celebrity chefs, talented athletes and more surprise guests all summer long to bring you and your loved ones together over the love ...
https://us.coca-cola.com/

Alternatively, you can achieve this using the Google Search Engine Results API from SerpApi. It's a paid API with a free plan. Check out the Playground to test.

Code to integrate:

import os
from serpapi import GoogleSearch

def scrape():
  
  params = {
    "engine": "google",
    "q": "coca cola",
    "api_key": os.getenv("API_KEY"),
  }

  search = GoogleSearch(params)
  results = search.get_dict()

  print(f"Current page: {results['serpapi_pagination']['current']}")

  for result in results["organic_results"]:
      print(f"Title: {result['title']}\nLink: {result['link']}\n")

  while 'next' in results['serpapi_pagination']:
      search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
      results = search.get_dict()

      print(f"Current page: {results['serpapi_pagination']['current']}")

      for result in results["organic_results"]:
          print(f"Title: {result['title']}\nLink: {result['link']}\n")

Output:

Results from SerpApi

Current page: 1
Title: The Coca-Cola Company: Refresh the World. Make a Difference
Link: https://www.coca-colacompany.com/home

Title: Coca-Cola
Link: https://www.coca-cola.com/

Title: Together Tastes Better | Coca-Cola®
Link: https://us.coca-cola.com/

Title: Coca-Cola - Wikipedia
Link: https://en.wikipedia.org/wiki/Coca-Cola

Title: Coca-Cola - Home | Facebook
Link: https://www.facebook.com/Coca-Cola/

Title: The Coca-Cola Company | LinkedIn
Link: https://www.linkedin.com/company/the-coca-cola-company

Title: Coca-Cola UNITED: Home
Link: https://cocacolaunited.com/

Title: World of Coca-Cola: Atlanta Museum & Tourist Attraction
Link: https://www.worldofcoca-cola.com/

Current page: 2
Title: Coca-Cola (@CocaCola) | Twitter
Link: https://twitter.com/cocacola?lang=en

Disclaimer: I work for SerpApi.

Morality answered 13/4, 2021 at 8:52 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.