It depends on how much JavaScript the page runs.
Splash needs time to render JavaScript, and unless it is told to wait it returns the page as soon as the initial load finishes, without waiting for scripts that are still running. So sometimes even Splash does not return the fully rendered content.
- You can explicitly set a wait so that rendering has time to finish.
- It is also good practice to always set some wait.
For example:
import scrapy

# inside a spider method such as start_requests(); wait is a number of seconds
yield scrapy.Request(url, callback=self.parse, meta={'splash': {'args': {'wait': 25}, 'endpoint': 'render.html'}})
or
from scrapy_splash import SplashRequest

# inside a spider method such as start_requests()
yield SplashRequest(url, self.parse, endpoint='render.html',
                    args={'wait': 5, 'html': 1})
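To see what the wait argument does outside Scrapy, it can help to call Splash's HTTP API directly. The following is a minimal sketch using only the standard library. It assumes a Splash instance is listening on http://localhost:8050 (the usual default, but check your setup) and that the rendered page is UTF-8. The helper names build_render_url and fetch_rendered_html are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash runs locally on its usual default port.
SPLASH_URL = "http://localhost:8050"


def build_render_url(page_url, wait=5.0, timeout=30.0, splash_url=SPLASH_URL):
    """Build a render.html URL that asks Splash to wait `wait` seconds after load."""
    # Sanity check: the render must finish within `timeout`, so a longer wait cannot succeed.
    if not 0 <= wait < timeout:
        raise ValueError("wait must be non-negative and smaller than timeout")
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=5.0, timeout=30.0):
    """Fetch the rendered HTML from Splash (needs a running Splash instance)."""
    render_url = build_render_url(page_url, wait=wait, timeout=timeout)
    # Give the HTTP client slightly longer than Splash itself is allowed.
    with urlopen(render_url, timeout=timeout + 5) as response:
        # Assumption: the response body is UTF-8 encoded.
        return response.read().decode("utf-8")
```

If the content is still missing with a given wait, trying the same URL in a browser with a larger wait value is a quick way to tell whether the page simply needs more time.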
Between Scrapy and Selenium
Selenium is only used to automate web browser interaction, while Scrapy is a whole web crawling framework: it downloads HTML, processes the data and saves it.
For scraping I would recommend Scrapy. If the problem is JavaScript:
- Scrapy already has its own official project for JavaScript rendering, called scrapy-splash.
- You can also create a new Selenium webdriver instance inside the Scrapy spider, do the work, extract the data, and then close the driver once all the work is done.
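The Selenium option above comes down to "create the driver, use it, always close it". Here is a minimal sketch of that lifecycle. The helper names managed_driver and fetch_page_source are hypothetical. The driver factory is passed in so the pattern works with any Selenium driver class, relying only on the standard get(), page_source and quit() members of a WebDriver.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(driver_factory):
    """Create a browser driver and guarantee quit() is called afterwards."""
    driver = driver_factory()
    try:
        yield driver
    finally:
        # Always release the browser, even if loading or parsing fails.
        driver.quit()


def fetch_page_source(driver_factory, url):
    """Load `url` in a fresh browser and return the DOM after JavaScript has run."""
    with managed_driver(driver_factory) as driver:
        driver.get(url)
        return driver.page_source
```

With Selenium installed, a spider callback could call fetch_page_source(webdriver.Firefox, response.url) and wrap the result in scrapy.Selector(text=html) to keep using the usual selectors. Starting a browser per page is slow and the calls block Scrapy while they run, so for larger crawls it is probably better to create one driver when the spider starts and quit it in the spider's closed() method.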