How does scrapy-splash handle infinite scrolling?

I want to reverse engineer the content that is loaded when scrolling down a webpage. The problem is the URL https://www.crowdfunder.com/user/following_page/80159?user_id=80159&limit=0&per_page=20&screwrand=933: the screwrand parameter doesn't seem to follow any pattern, so reconstructing the URLs doesn't work. I'm considering rendering the page automatically with Splash instead. How can I make Splash scroll the way a browser does? Thanks a lot! Here is the code for the two requests:

request1 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following/{}'.format(user_id),
    self.parse_follow_relationship,
    args={'wait': 2},
    meta={'user_id': user_id, 'action': 'following'},
    endpoint='http://192.168.99.100:8050/render.html')

yield request1

request2 = scrapy_splash.SplashRequest(
    'https://www.crowdfunder.com/user/following_user/80159?user_id=80159&limit=0&per_page=20&screwrand=76',
    self.parse_tmp,
    meta={'user_id':user_id, 'action':'following'},
    endpoint='http://192.168.99.100:8050/render.html')

yield request2

ajax request shown in browser console

Katabatic answered 30/10, 2016 at 2:56 Comment(0)

To scroll a page you can write a custom rendering script (see http://splash.readthedocs.io/en/stable/scripting-tutorial.html), something like this:

function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end        
    return splash:html()
end

To run this script, use the 'execute' endpoint instead of the render.html endpoint:

script = """<Lua script> """
scrapy_splash.SplashRequest(url, self.parse,
                            endpoint='execute', 
                            args={'wait':2, 'lua_source': script}, ...)
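
As for where the Lua script goes in a Python file, here is a minimal sketch, assuming the script is simply kept as a triple-quoted string in the spider module. `SCROLL_SCRIPT`, `splash_args` and `scroll_request` are hypothetical names used for illustration; `SplashRequest`, `endpoint`, `args` and `lua_source` are the ones from the snippet above.

```python
# The Lua script is plain text as far as Python is concerned.
SCROLL_SCRIPT = """
function main(splash)
    local num_scrolls = 10
    local scroll_delay = 1.0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
"""


def splash_args(wait=2):
    """Arguments sent to Splash; 'wait' is read in Lua as splash.args.wait."""
    return {"wait": wait, "lua_source": SCROLL_SCRIPT}


def scroll_request(url, callback):
    """Build a request that runs SCROLL_SCRIPT (hypothetical helper)."""
    # Imported here so this module loads even without scrapy-splash installed.
    from scrapy_splash import SplashRequest

    return SplashRequest(url, callback, endpoint="execute", args=splash_args())
```

Inside a spider callback this would be used as `yield scroll_request(url, self.parse)`.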
Cursory answered 1/11, 2016 at 18:36 Comment(2)
Can you please explain where to write this script? I am confused about how to write this JavaScript function in a Python file. - Eaglet
If this script reaches the end of the page and some JavaScript then appends new content, will the script keep scrolling until no more content is added? - Virulent

Thanks Mikhail, I tried your scroll script and it worked. However, I noticed that it scrolls too far in a single step, so some JavaScript has no time to render and its content is skipped. I made a small change as follows:

function main(splash)
        local num_scrolls = 10
        local scroll_delay = 1

        local scroll_to = splash:jsfunc("window.scrollTo")
        local get_body_height = splash:jsfunc(
            "function() {return document.body.scrollHeight;}"
        )
        assert(splash:go(splash.args.url))
        splash:wait(splash.args.wait)

        for _ = 1, num_scrolls do
            local height = get_body_height()
            for i = 1, 10 do
                scroll_to(0, height * i/10)
                splash:wait(scroll_delay/10)
            end
        end        
        return splash:html()
end
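
To make the arithmetic of the inner loop concrete, here is a small Python sketch of the offsets it visits. `scroll_offsets` is a hypothetical helper, and the page height of 5000 pixels is an assumed example value, not something measured on the site.

```python
def scroll_offsets(height, steps=10):
    """Return the y-offsets visited when scrolling `height` pixels in equal steps."""
    # Mirrors `height * i/10` for i = 1..10 in the Lua loop.
    return [height * i / steps for i in range(1, steps + 1)]


offsets = scroll_offsets(5000)
print(offsets[:3], offsets[-1])  # [500.0, 1000.0, 1500.0] 5000.0
```

With `scroll_delay = 1`, each of the ten steps waits 0.1 seconds, so one pass over the page still takes about one second in total.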
Decipher answered 30/10, 2018 at 2:26 Comment(0)

I do not think that hard-coding the number of scrolls is a good idea for infinite-scroll pages, so I modified the code above like this:

function main(splash, args)
    local current_scroll = 0

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(3)

    local height = get_body_height()

    -- Keep scrolling to the bottom until the page height stops growing.
    while current_scroll < height do
        scroll_to(0, get_body_height())
        splash:wait(5)
        current_scroll = height
        height = get_body_height()
    end
    splash:set_viewport_full()
    return splash:html()
end
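
The stop condition of the loop above can be checked without a browser. The following Python sketch mirrors it against a fake page. `scroll_until_stable` and `FakePage` are hypothetical names, and the `max_scrolls` cap is an addition of this sketch (a guard for feeds that never stop growing), not part of the Lua script.

```python
def scroll_until_stable(get_height, scroll_to, wait, max_scrolls=50):
    """Scroll to the bottom until the height stops growing or the cap is hit."""
    scrolled_to = 0
    height = get_height()
    scrolls = 0
    while scrolled_to < height and scrolls < max_scrolls:
        scroll_to(0, height)
        wait()
        scrolled_to = height
        height = get_height()  # larger only if new content was appended
        scrolls += 1
    return scrolls


class FakePage:
    """Stand-in for a page whose height grows through a fixed list of values."""

    def __init__(self, heights):
        self._heights = heights
        self._index = 0

    def height(self):
        return self._heights[self._index]

    def scroll_to(self, x, y):
        # Reaching the bottom loads the next batch, if any remains.
        at_bottom = y >= self._heights[self._index]
        if at_bottom and self._index < len(self._heights) - 1:
            self._index += 1


page = FakePage([1000, 2000, 3000, 4000])
count = scroll_until_stable(page.height, page.scroll_to, wait=lambda: None)
print(count)  # 4: three scrolls load new content, the fourth finds none
```

The last scroll is the one that detects that nothing new was added, so the loop always performs one scroll more than the number of batches loaded.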
Mcardle answered 23/3, 2022 at 9:3 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Jasperjaspers
