Scraping a dynamic ecommerce page with infinite scroll
I'm using rvest in R to do some scraping. I know some HTML and CSS.

I want to get the prices of every product at a URL:

http://www.linio.com.co/tecnologia/celulares-telefonia-gps/

New items load as you go down the page (as you scroll).

What I've done so far:

Linio_Celulares <- read_html("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

Linio_Celulares %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

And I get what I need, but only for the first 25 elements (those loaded by default).

 [1] "$ 1.999.900" "$ 1.999.900" "$ 1.999.900" "$ 2.299.900" "$ 2.279.900"
 [6] "$ 2.279.900" "$ 1.159.900" "$ 1.749.900" "$ 1.879.900" "$ 189.900"  
[11] "$ 2.299.900" "$ 2.499.900" "$ 2.499.900" "$ 2.799.000" "$ 529.900"  
[16] "$ 2.699.900" "$ 2.149.900" "$ 189.900"   "$ 2.549.900" "$ 1.395.900"
[21] "$ 249.900"   "$ 41.900"    "$ 319.900"   "$ 149.900" 

Question: how can I get all the elements of this dynamic section?

I guess I could scroll the page until all elements are loaded and then use read_html(URL). But this seems like a lot of work (I'm planning to do this on different sections). There should be a programmatic workaround.

Marianomaribel answered 25/4, 2015 at 4:46 Comment(6)
You would need to use XPath (in R or outside of R) -- have a look at the XML package.Mora
Can't it be done with rvest? I've seen that rvest imports XML, and I've read some stuff about XML, but in the URL in my example I don't see these meta tags from XML. Could you help me out?Marianomaribel
Here, I think maybe this will help you do it in rvest: #27812759Mora
@Hack-R I've seen your example, but what I have is a little different. In my example there isn't a "Next" button or "Page 2". However, I see a <div id="page-number">Página 4</div> (this changes from 2 to X) that activates as I scroll. It would be nice if you have any other tip.Marianomaribel
@OmarGonzales You may have to look into RSelenium to achieve this - see this related post.Logography
I have followed many links, but people are ultimately redirected to Selenium. How on earth is it not possible in rvest or any other R package to activate an infinite-scroll page and scrape the final, fully scrolled page? Could we invoke @hadley to help here.Shama
As @nrussell suggested, you can use RSelenium to programmatically scroll down the page before getting the source code.

You could for example do:

library(RSelenium)
library(rvest)

# start RSelenium
checkForServer()
startServer()
remDr <- remoteDriver()
remDr$open()

# navigate to your page
remDr$navigate("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

# scroll down 5 times, waiting for the page to load each time
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}

# get the page html
page_source <- remDr$getPageSource()

# parse it
read_html(page_source[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()
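(Later note: the checkForServer() and startServer() helpers were removed from RSelenium. A roughly equivalent session with the current rsDriver() entry point might look like the sketch below; it assumes a compatible browser driver is installed on the machine, and scrolls to the bottom of the document rather than by a fixed pixel count.)

```r
library(RSelenium)
library(rvest)

# rsDriver() starts a Selenium server and opens a browser in one call
# (assumes a compatible browser/driver is installed locally)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client

remDr$navigate("http://www.linio.com.co/tecnologia/celulares-telefonia-gps/")

# scroll to the bottom repeatedly so the lazy-loaded items appear
for (i in 1:5) {
  remDr$executeScript("window.scrollTo(0, document.body.scrollHeight);")
  Sys.sleep(3)
}

# parse the fully scrolled page
read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(".product-itm-price-new") %>%
  html_text()

# clean up
remDr$close()
driver$server$stop()
```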
Bumpkin answered 30/4, 2015 at 10:24 Comment(5)
I've been learning some JavaScript, but I don't get the for loop you have used. Could you point me to a document on this, please?Marianomaribel
This is an R for loop rather than a JavaScript one; some info hereBumpkin
Thanks, but I was talking about scroll(0, i*10000). 1. I've heard that the scroll command is used in JavaScript (like click, hover, etc.). 2. Why i*10000? Is it: for every loop, scroll 10,000 pixels?Marianomaribel
I tried running the same code as above, but it gives me character(0). Why is that?Negligee
This is now outdated; RSelenium appears to use Docker instead.Inclinatory
library(rvest)
url<-"https://www.linio.com.co/c/celulares-y-tablets?page=1"
page<-html_session(url)

html_nodes(page,css=".price-secondary") %>% html_text()

Loop through the pages (https://www.linio.com.co/c/celulares-y-tablets?page=2, then page=3, and so on) and it will be easy for you to scrape the data.
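A minimal sketch of that pagination loop, assuming the .price-secondary selector from above and an arbitrary cap of 5 pages (the real page count would need to be detected from the site's pagination controls):

```r
library(rvest)

# Hypothetical page cap; the actual number of pages would need to be
# detected from the site's pagination controls.
n_pages <- 5

all_prices <- character(0)
for (i in seq_len(n_pages)) {
  url <- paste0("https://www.linio.com.co/c/celulares-y-tablets?page=", i)
  page <- read_html(url)
  prices <- page %>% html_nodes(".price-secondary") %>% html_text()
  all_prices <- c(all_prices, prices)
  Sys.sleep(1)  # be polite between requests
}
```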

EDIT dated 07/05/2019

The website's elements changed, hence the new code:

library(rvest)
url<-"https://www.linio.com.co/c/celulares-y-tablets?page=1"
page<-html_session(url)

html_nodes(page,css=".price-main") %>% html_text()
Oratory answered 19/12, 2018 at 22:3 Comment(2)
Linio changed its URL structure; now, as you say, it is easy to scrape their products. It wasn't in 2015.Marianomaribel
Yeah, they have changed the CSS element alone. It still works with this code, @OmarGonzales: library(rvest); url<-"https://www.linio.com.co/c/celulares-y-tablets?page=1"; page<-html_session(url); html_nodes(page, css=".price-main") %>% html_text()Oratory

© 2022 - 2024 — McMap. All rights reserved.