How do I parse and scrape the meta tags of a URL with Nokogiri?

Asked 22/7, 2013 at 7:26 Answered 16/1, 2018 at 8:13

I am using Nokogiri to pull the <h1> and <title> tags, but I am having trouble getting these:

<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">

I have this code:

url = 'https://en.wikipedia.org/wiki/Emma_Watson' 
page = Nokogiri::HTML(open(url))

puts page.css('title')[0].text puts page.css('h1')[0].text
puts page.css('description')
puts META DESCRIPTION
puts META KEYWORDS

I looked in the docs and didn't find anything. Would I use regex to do this?

Thanks.

Steak answered 22/7, 2013 at 7:26 Comment(3)

give the full html.. your need is unclear.. – Pneumatometer 22/7, 2013 at 7:40

Just to clarify: Nokogiri doesn't crawl anything. It only does parsing. Your code, in conjunction with gems like OpenURI and Nokogiri, does the crawling. – Skees 22/7, 2013 at 19:13

Use github.com/BorisBresciani/rails_parse_head – Melliemelliferous 18/12, 2019 at 10:22

Here's how I'd go about it:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<meta name="description" content="I design and develop websites and applications.">
<meta name="keywords" content="web designer,web developer">
EOT

contents = %w[description keywords].map { |name|
  doc.at("meta[name='#{name}']")['content']
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

Or:

contents = doc.search("meta[name='description'], meta[name='keywords']").map { |n| 
  n['content'] 
}
contents # => ["I design and develop websites and applications.", "web designer,web developer"]

Skees answered 22/7, 2013 at 19:5 Comment(2)

FYI: The only works when you have the meta tags in the document. If you run this code with one of them removed it'll error: undefined method '[]' for nil:NilClass. – Heavyduty 19/3, 2022 at 16:1

Then something changed in one of many revisions since 2013, because the examples are the exact output from Ruby when the answer was created. – Skees 20/3, 2022 at 22:39

That would be:

page.at('meta[name="keywords"]')['content']

Cordie answered 22/7, 2013 at 8:10 Comment(0)

Another solution: You can use XPath or CSS.

puts page.xpath('/html/head/meta[@name="description"]/@content').to_s
puts page.xpath('/html/head/meta[@name="keywords"]/@content').to_s

Loehr answered 16/1, 2018 at 8:13 Comment(0)

Recommended topics

Hot tags