Get background image with Nokogiri from DOM?

Asked 29/1, 2015 at 16:44 Answered 29/1, 2015 at 18:5

I'm scraping a site and I can't get the images, because they are loaded with background-image CSS.

Is there a way to get these attributes with Nokogiri, without having to use Phantom.js or Sentinel? The background-image is set in an inline style, so I should be able to read it from the DOM.

I have to get images from an array of URLs:

<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;">&nbsp;</div>

I'm using Nokogiri via Mechanize, but don't know how to write this correctly:

image = agent.get(doc.parser.at('.zoomLens')["background-image"]).save("okaimages/f_deco-#{counter}.jpg")

Apollo answered 29/1, 2015 at 16:44 Comment(0)

I'd use something like:

require 'nokogiri'

doc = Nokogiri::HTML('<div class="zoomLens" style="background-image: url(http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7); background-position: -14.7368421052632px -977.894736842105px; background-repeat: no-repeat;">&nbsp;</div>')

doc.search('.zoomLens').map { |n| n['style'][/url\((.+)\)/, 1] }
# => ["http://resources1.okadirect.com/assets/en/new/catalogue/1200x1200/EHD005MET-L_01.jpg?version=7"]

The trick is finding an appropriate pattern to grab the contents of the parentheses. n['style'][/url\((.+)\)/, 1] uses String#[], which can take a regular expression with a capture group plus the index of the group to return. See https://www.regex101.com/r/mV6rY6/1 for a breakdown of what it's doing.
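To make the String#[] behaviour concrete, here is a small sketch. The example.com style string is a placeholder, and the tighter BG_URL pattern and the background_url helper are hypothetical variations of mine, not part of the answer above. They may be useful if a style ever quotes the URL or contains more than one url(...), where the greedy (.+) would capture too much.

```ruby
# Placeholder style string, shaped like the one in the question.
style = 'background-image: url(http://example.com/a.jpg?version=7); background-repeat: no-repeat;'

# String#[] with a regexp and a group index returns that capture,
# or nil when the pattern does not match.
style[/url\((.+)\)/, 1]          # => "http://example.com/a.jpg?version=7"
'color: red;'[/url\((.+)\)/, 1]  # => nil

# Hypothetical tighter pattern: optional quotes, and the capture stops at
# the first quote, whitespace, or closing parenthesis.
BG_URL = /url\(\s*['"]?([^'")\s]+)['"]?\s*\)/

# Hypothetical helper: returns the first background URL, or nil.
# to_s guards against elements that have no style attribute at all.
def background_url(style)
  style.to_s[BG_URL, 1]
end

background_url('background: url("a.png"), url(b.png)')  # => "a.png"
```

With the original greedy pattern, the last call would have returned everything between the first opening and the last closing parenthesis.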

At that point you'd be sitting on an array of image URLs. You can easily iterate over the list and use OpenURI or any number of other HTTP clients to retrieve the images.
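A rough sketch of that last step, using OpenURI from the standard library. The helper names local_name and download_images are made up for illustration, the okaimages directory and f_deco prefix are borrowed from the question, and the fallback to .jpg is an assumption about the images being served.

```ruby
require 'open-uri'
require 'uri'
require 'fileutils'

# Hypothetical helper: build a local file name from the URL's path,
# ignoring the query string (e.g. "?version=7").
def local_name(url, index, prefix: 'f_deco')
  ext = File.extname(URI.parse(url).path)
  ext = '.jpg' if ext.empty? # assumption: extensionless URLs are JPEGs
  "#{prefix}-#{index}#{ext}"
end

# Hypothetical helper: download each URL into dir and return the paths
# that were written. URLs that fail are reported and skipped.
def download_images(urls, dir: 'okaimages')
  FileUtils.mkdir_p(dir)
  urls.each_with_index.map do |url, i|
    path = File.join(dir, local_name(url, i))
    begin
      URI.open(url) { |remote| File.binwrite(path, remote.read) }
      path
    rescue OpenURI::HTTPError, SocketError => e
      warn "skipping #{url}: #{e.message}"
      nil
    end
  end.compact
end
```

Passing the array produced by the map call above, as in download_images(urls), would then write f_deco-0.jpg, f_deco-1.jpg and so on. For a few thousand pages it is probably worth pausing between requests.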

Ona answered 29/1, 2015 at 18:5 Comment(7)

The thing is, I need this to work with dynamic routes that I have in an array (I have about 3000), so isn't this code going to change for each one? – Apollo 29/1, 2015 at 18:15

Then you really need to write a question that reflects that, right? We can only answer based on what you tell us, and that wasn't part of your input or mentioned as a spec. Telling us a little, then changing and asking a different question, etc., isn't good. Put it all in at first. – Ona 29/1, 2015 at 18:16

Sorry about that, will specify it. – Apollo 29/1, 2015 at 18:18

If they are all specified using the url(...) form, then it'll work as long as the tag uses the zoomLens class. If that isn't true, then you have to modify the code to fit. – Ona 29/1, 2015 at 18:19

Yes, the attributes are the same in every URL + the zoomLens class. Will try it. – Apollo 29/1, 2015 at 18:22

@theTinMan, thank you for the answer. But how can I modify the URLs? – Comminute 25/3, 2020 at 7:7

This is a really old question, and your question in a comment is not the SO way. You should ask a separate, new question about that specific problem. To get you started, you should investigate Ruby's URI class and the Addressable gem. Research, read the documentation, experiment, then ask when you've run into a brick wall that you couldn't overcome after days of hard work. – Ona 25/3, 2020 at 18:7
