How do I parse Google search results with Nokogiri?

I need help pulling URLs from Google search results and was told to use Nokogiri. I installed it and read over the Nokogiri docs, but have no idea where to start -- it's all Greek to me.

I know what I am looking for is the URL of each result, each existing between a <cite> tag. So far all I was able to figure out how to do is pull the search results but I just don't know how to go about pulling specific data from the file. Here is the teeny-tiny bit of code I do have:

serp = Nokogiri::HTML(URI.open("http://www.google.com/search?num=100&q=stackoverflow"))
Exert answered 16/5, 2011 at 11:33 Comment(1)
Investigate Nokogiri's CSS accessors. They're very powerful and can help get you rolling quickly. From there you'll need to dig into XPath, as that is how we often go after nodes, whether they are in HTML or XML. XPath is a lot more powerful than CSS, but that power comes with added complexity. Also, as a usability tip, at finds the first occurrence of something and returns it as a Node, while search finds all occurrences and returns a NodeSet. A NodeSet is like an array of Nodes, so you can iterate over it. - Dight

enjoy :)

require 'open-uri'
require 'nokogiri'

# URI.open is used because a bare `open` no longer accepts URLs on Ruby 3.0+.
page = URI.open("http://www.google.com/search?num=100&q=stackoverflow")
html = Nokogiri::HTML page

html.search("cite").each do |cite|
  puts cite.inner_text
end

Also look at the Nokogiri tutorials.

Reno answered 16/5, 2011 at 12:12 Comment(4)
Not to revive an old post, but do you know if there is a modern way to control the number of results for the Google results? The num query string no longer works. - Duda
@DaveLong It works for me, but I think there's a hard limit of 100 results. - Reno
This doesn't seem to work any more; Google doesn't like wild parsing. - Cowhide
While this works, you will be limited to only 100 queries per day, as I discovered recently after attempting the above method. A more scalable approach is to use Google's Custom Search API. I wrote a full example and answer in this related question: https://mcmap.net/q/1255789/-what-is-the-correct-way-to-get-google-search-results - Greasewood
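As a rough sketch of the Custom Search approach mentioned in the last comment: the snippet below only builds the request URL and parses a hand-written sample body, so it makes no network call. The helper names, the placeholder key and engine ID, and the sample JSON are assumptions for illustration; check the official API reference for the current parameters and limits before relying on it.

```ruby
require 'json'
require 'uri'

# Endpoint of Google's Custom Search JSON API.
CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

# Builds the request URI. `api_key` and `engine_id` (sent as "cx") are
# placeholders you would obtain from your own Google account.
def custom_search_uri(query, api_key:, engine_id:, num: 10)
  uri = URI(CSE_ENDPOINT)
  uri.query = URI.encode_www_form(key: api_key, cx: engine_id, q: query, num: num)
  uri
end

# Pulls the result URLs out of a response body. Assumes the response carries
# an "items" array whose entries have a "link" field; returns [] otherwise.
def result_links(json_body)
  JSON.parse(json_body).fetch("items", []).map { |item| item["link"] }
end

# Hand-written sample body, not a real API response.
SAMPLE_BODY = '{"items":[{"title":"Stack Overflow","link":"https://stackoverflow.com/"}]}'

puts custom_search_uri("stackoverflow", api_key: "YOUR_KEY", engine_id: "YOUR_CX")
puts result_links(SAMPLE_BODY)
```

Because the API returns JSON, there is no HTML to parse and no dependency on Google's page markup, which is the main appeal over scraping.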

Make sure you send a User-Agent header with the request; otherwise Google will eventually block the requests and you will get empty output. You can look up your own browser's string by searching for "what is my user-agent".

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
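If you would rather not add a gem just for the HTTP call, the same header can be attached with the standard library. This is a minimal sketch: build_search_request is a hypothetical helper name, and the request is only built here, not sent, so the snippet runs without network access.

```ruby
require 'net/http'
require 'uri'

# Same browser-like string as in the headers hash above.
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " \
             "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"

# Hypothetical helper: builds (but does not send) a GET request for a
# Google search with the given User-Agent header attached.
def build_search_request(query, user_agent, num: 100)
  uri = URI("https://www.google.com/search")
  uri.query = URI.encode_www_form(q: query, num: num)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent
  request
end

request = build_search_request("stackoverflow", BROWSER_UA)
puts request["User-Agent"]
puts request.path

# To actually send it (this performs a network call):
# response = Net::HTTP.start("www.google.com", 443, use_ssl: true) do |http|
#   http.request(request)
# end
```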

Code and example in the online IDE:

require 'nokogiri'
require 'httparty'
require 'json'

headers = {
  "User-Agent" => "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "stackoverflow",
  num: "100"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)


data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text
  # puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"

  {
    title: title,
    link: link,
    displayed_link: displayed_link,
    snippet: snippet,
  }.compact
end

puts JSON.pretty_generate(data)

# Output (truncated):
=begin
[
  {
    "title": "Stack for Stack Overflow - Apps on Google Play",
    "link": "https://play.google.com/store/apps/details?id=me.tylerbwong.stack&hl=en_US&gl=US",
    "displayed_link": "https://play.google.com › store › apps › details",
    "snippet": "Stack is powered by Stack Overflow and other Stack Exchange sites. Search and filter through questions to find the exact answer you're looking for!"
  }
...
]
=end

Alternatively, you can use the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that there's no need to figure out how to scrape specific parts of the page. All that needs to be done is to iterate over structured JSON.

require 'google_search_results' 
require 'json'

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "stackoverflow",
  hl: "en",
  num: "100"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]

  {
    title: title,
    link: link,
    displayed_link: displayed_link,
    snippet: snippet
  }.compact
end

puts JSON.pretty_generate(data)


# Output (truncated):
=begin
[
  {
    "title": "Stack Overflow - Home | Facebook",
    "link": "https://www.facebook.com/officialstackoverflow/",
    "displayed_link": "https://www.facebook.com › Pages › Interest",
    "snippet": "Stack Overflow. 519455 likes · 587 talking about this. We are the world's programmer community."
  }
...
]
=end

Disclaimer: I work for SerpApi.

Klepac answered 10/8, 2021 at 17:31 Comment(0)
