How to scrape all content from an infinite scroll website with Scrapy?

I'm using Scrapy.

The website I'm scraping has infinite scroll.

The site has loads of posts, but I only scraped 13.

How do I scrape the rest of the posts?

Here's my code:

import scrapy


class exampleSpider(scrapy.Spider):
    name = "example"
    # from_date = datetime.date.today() - datetime.timedelta(6 * 365 / 12)
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/somethinghere/"
    ]

    def parse(self, response):
        for href in response.xpath("//*[@id='page-wrap']/div/div/div/section[2]/div/div/div/div[3]/ul/li/div/h1/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # scrape contents code here
        pass
Weitzman answered 13/5, 2016 at 10:43 Comment(0)

Check the website's code.

If the infinite scroll automatically triggers a JS action, you could proceed as follows, using the Alioth proposal: spynner.

Following the spynner docs, you will find that it can trigger jQuery events.

Look up the library code to see which kinds of events you can fire.

Try to generate a scroll-to-bottom event, or a CSS property change on any of the divs inside the scrollable content of the website. Following the spynner docs, something like:

import spynner

# debug_stream / run_debug: debugging helpers assumed to be defined elsewhere (not shown)
browser = spynner.Browser(debug_level=spynner.DEBUG, debug_stream=debug_stream)
# load your website here, as spynner allows
browser.load_jquery(True)
ret = run_debug(browser.runjs,
                'window.scrollTo(0, document.body.scrollHeight); console.log("scrolling...");')
# continue parsing ret

It's not very likely that an infinite scroll is triggered by an anchor link, but it might be triggered by a jQuery action that is not necessarily attached to a link. In that case, use code like the following:

br.load('http://pypi.python.org/pypi')

anchors = br.webframe.findAllElements('#menu ul.level-two a')
# choose the anchor containing the word "Browse"
anchor = [a for a in anchors if 'Browse' in a.toPlainText()][0]
br.wk_click_element_link(anchor, timeout=10)
output = br.show()
# save output to a file (e.g. output.html), or plug these actions into your
# scrapy method and parse the output var as you do with the response body

Then run Scrapy on the output.html file or, if you implemented it that way, on the local in-memory variable you chose to store the modified HTML after the JS action.
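
If you go the in-memory route, a minimal sketch (assuming `output` holds the HTML string produced above) is to wrap it in an `HtmlResponse` so the usual Scrapy selectors work on it; the URL and XPath here are only placeholders:

from scrapy.http import HtmlResponse

# `output` is the HTML returned by spynner in the snippet above
response = HtmlResponse(
    url="http://www.example.com/somethinghere/",  # placeholder URL
    body=output,
    encoding="utf-8",
)

# The usual selectors now work exactly as they do inside a spider callback.
for href in response.xpath("//h1/a/@href").extract():
    print(response.urljoin(href))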

As another solution, the website you are trying to parse might have an alternate rendering for visitors whose browsers do not have JS enabled.

Try rendering the website with JavaScript disabled in the browser; that way the site may expose an anchor link at the end of the content section.

There are also successful implementations of crawler JS navigation that combine Scrapy with Selenium, as detailed in this SO answer.
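
A rough sketch of that Scrapy + Selenium combination, in case it helps; the URL, XPaths and the Firefox driver are placeholders and assumptions, not taken from the question:

import time

import scrapy
from scrapy.http import HtmlResponse
from selenium import webdriver


class ScrollSpider(scrapy.Spider):
    name = "scroll"
    start_urls = ["http://www.example.com/somethinghere/"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # scroll a few times so the lazy-loaded posts end up in the DOM
        for _ in range(10):
            self.driver.execute_script(
                "window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
        rendered = HtmlResponse(url=response.url,
                                body=self.driver.page_source,
                                encoding="utf-8")
        for href in rendered.xpath("//h1/a/@href").extract():
            yield scrapy.Request(rendered.urljoin(href),
                                 callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        pass  # scrape the post contents here

    def closed(self, reason):
        self.driver.quit()  # called by Scrapy when the spider finishes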

Cestoid answered 14/4, 2017 at 22:0 Comment(2)
Thank you for the perfect answer. ♥Conoscenti
I used the option to disable javascript, and was able to see the page still render. DevTools showed me that there was an alternate render, which I then used in my scraper - worked like a charm. Thank you, upvoting this answer.Entremets

I use Selenium rather than Scrapy, but you should be able to do the equivalent. What I do is run some JavaScript after loading the page, namely:

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

And I just keep doing that until it won't scroll any longer. It's not pretty and shouldn't be used in production, but it's effective for specific jobs.
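
In case it helps, a minimal version of that loop (plain Selenium; the URL and the 2-second pause are just placeholders):

import time

from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.example.com/somethinghere/")  # placeholder URL

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the AJAX call time to append new posts
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # the page stopped growing, nothing more to load
    last_height = new_height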

Oarsman answered 12/4, 2017 at 11:44 Comment(1)
Will using Scrapy also run the JavaScript?Conoscenti

I think what you are looking for is pagination logic alongside your normal logic.

In most cases, infinite scrolling == paging. On such a page, when you scroll down to about 3/4 of the page (or to its end), the page fires an AJAX call, downloads the next page's content and loads the response into the current page.

I would recommend using the network monitor tool in Firefox and watching for any such request while you scroll down.

-- Clue: you will be using scrapy.FormRequest or scrapy.FormRequest.from_response when implementing this solution.
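
In case it's useful, here is a minimal sketch of that approach; the AJAX endpoint and the "page" parameter are made up for illustration and would have to be taken from the network monitor for the real site:

import scrapy


class PostsSpider(scrapy.Spider):
    name = "posts"

    def start_requests(self):
        # replay the same AJAX call the page fires when you scroll down;
        # the endpoint and the "page" parameter are hypothetical
        yield scrapy.FormRequest("http://www.example.com/load_posts",
                                 formdata={"page": "1"},
                                 callback=self.parse_posts,
                                 meta={"page": 1})

    def parse_posts(self, response):
        hrefs = response.xpath("//li/div/h1/a/@href").extract()
        for href in hrefs:
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_post)
        if hrefs:  # keep paging until the server returns nothing
            page = response.meta["page"] + 1
            yield scrapy.FormRequest("http://www.example.com/load_posts",
                                     formdata={"page": str(page)},
                                     callback=self.parse_posts,
                                     meta={"page": page})

    def parse_post(self, response):
        pass  # scrape the post contents here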

Highlander answered 14/5, 2016 at 14:21 Comment(0)

I think you are looking for something like DEPTH_LIMIT:

http://doc.scrapy.org/en/latest/topics/settings.html#depth-limit

http://bgrva.github.io/blog/2014/03/04/scrapy-after-tutorials-part-1/
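
For reference, the setting goes in the project's settings.py; a hypothetical example (the default of 0 means no limit):

# settings.py
DEPTH_LIMIT = 3  # follow links at most three levels deep from the start URLs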

Iseabal answered 13/5, 2016 at 11:5 Comment(3)
I tried putting a depth limit in my settings but it still doesn't work. It gets stuck collecting all the links like "www.example.com/blog/2016/05/13", but it doesn't follow the link and scrape inside.Weitzman
Sorry, I could not understand where it is stuck. You could check out some examples online, like github.com/scrapy/dirbot/blob/master/dirbot/spiders/dmoz.pyIseabal
Depth limit controls how far the crawler follows links. Let's say the page you're on is the first level; if you click a link on it, that's level 1, and so on... That is what DEPTH_LIMIT is for, not infinite scroll.Gondi

Obviously, that target site loads its content dynamically. Hence there are two appropriate solutions:

  1. Reverse-engineer the jQuery/AJAX interaction in detail and try to simulate the data exchange with the server manually (see the sketch after this list).

  2. Use another tool for this particular job. For example, spynner seems to me a good choice to look into.
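
A minimal sketch of option 1, assuming the scroll fires a GET request to a JSON endpoint; the endpoint, the offset parameter and the JSON layout are assumptions you would confirm in the browser's network tab:

import json

import scrapy


class ApiSpider(scrapy.Spider):
    name = "api"
    start_urls = ["http://www.example.com/api/posts?offset=0"]  # hypothetical endpoint

    def parse(self, response):
        data = json.loads(response.text)
        posts = data.get("posts", [])
        for post in posts:
            yield {"title": post.get("title"), "url": post.get("url")}
        if posts:  # stop when the server returns an empty batch
            offset = int(response.url.rsplit("=", 1)[-1]) + len(posts)
            yield scrapy.Request(
                "http://www.example.com/api/posts?offset=%d" % offset,
                callback=self.parse)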

Ulrich answered 12/4, 2017 at 10:53 Comment(0)

In some cases, you can find in the source code the element used to trigger the "next" pagination, even with infinite scroll. You just have to click that element and it will load the rest of the posts. With Scrapy/Selenium:

# self.driver is a Selenium WebDriver; requires `import time`
next_button = self.driver.find_element_by_xpath('//a[@class="nextResults"]')
next_button.click()
time.sleep(2)  # wait for the next batch of posts to load
Aquacade answered 28/6, 2018 at 11:47 Comment(0)
