How can I send dynamic website content to Scrapy using the HTML generated by a Selenium browser?
I am working on a stock-related project where I need to scrape all the data on a daily basis for the last 5 years, i.e. from 2016 to date. I thought of using Selenium in particular because I can drive the browser to request the data by date. So I automated the button clicks with Selenium, and now I want the same data that is displayed by the Selenium browser to be fed to Scrapy. This is the website I am working on right now. I have written the following code inside the Scrapy spider:

import scrapy
from scrapy import Request
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.firefox import GeckoDriverManager


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., up to '01/10/2022'

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))    
            for data in range(z, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.url = driver.page_source
                yield Request(url=self.url, callback=self.parse)

               
    def parse(self, response, **kwargs):
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)

Update: the error after applying the answer is:

2022-01-14 14:11:36 [twisted] CRITICAL: 
Traceback (most recent call last):
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/twisted/internet/defer.py", line 1661, in _inlineCallbacks
    result = current_context.run(gen.send, result)
  File "/home/navaraj/PycharmProjects/first_scrapy/env/lib/python3.8/site-packages/scrapy/crawler.py", line 88, in crawl
    start_requests = iter(self.spider.start_requests())
TypeError: 'NoneType' object is not iterable

I want to send the current browser's HTML content to the Scrapy parser, but I have been getting this unusual error for the past 2 days. Any help or suggestions would be much appreciated.

Boling answered 10/1, 2022 at 10:37

Comments (3):

Wagon: Do you mean something like this?
Gilthead: What is the "unusual error"?
Boling: @DMalan I am not able to feed the web content of the current browser provided by Selenium. Scrapy catches the initial page by default.
The two solutions are not very different. Solution 2 fits your question better, but choose whichever you prefer.

Solution 1: create an HtmlResponse with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., up to '01/10/2022'

        for date in floorsheet_dates:
            driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                    print("ok"*200)

        # start_requests must return an iterable of requests; an empty list satisfies that
        return []
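
If you also want to persist the rows rather than just print them, one option (not part of the original answer) is to pass the HtmlResponse to a small helper that appends each row to a CSV file. A minimal sketch; the helper name and file path are illustrative assumptions:

import csv

def save_rows(response, path='floorsheet.csv'):
    # Hypothetical helper: append each table row's cell texts to a CSV file.
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        for value in response.xpath('//tbody/tr'):
            writer.writerow(value.css('td::text').extract())

You would call save_rows(response) in place of the print loop inside start_requests.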

Solution 2: with a super simple downloader middleware:

(There might be a delay before the parse method runs, so be patient.)

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        # Short-circuit the download: answer every request with the page
        # currently rendered in the spider's Selenium driver.
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }
    driver = webdriver.Chrome()

    def start_requests(self):

        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())


        floorsheet_dates = ['01/03/2016', '01/04/2016']  # ..., up to '01/10/2022'

        for date in floorsheet_dates:
            self.driver.get(
                "https://merolagani.com/Floorsheet.aspx")

            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                # The middleware intercepts this request and answers it with the
                # driver's current page, so the URL itself is never downloaded.
                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok"*200)

Notice that I've used Chrome, so change it back to Firefox as in your original code.
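
Also note that neither snippet quits the browser when the crawl ends. A minimal cleanup sketch, assuming you keep the driver on the spider as in Solution 2, is to use Scrapy's closed() hook:

    def closed(self, reason):
        # Called by Scrapy when the spider finishes; shut the Selenium browser down.
        self.driver.quit()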

Wagon answered 13/1, 2022 at 9:49

Comments (19):

Boling: Thanks for the answer, I will definitely try this solution and reply to you.
Boling: What is my middleware path if my project name is first_scrapy?
Wagon: If it's inside the spider (like in the answer) and the file's name is spider.py (in the answer the filename is yetanotherspider.py), then it will be first_scrapy.spiders.spider.SeleniumMiddleware. But it's best if you put the middleware class inside middlewares.py, and then it will be first_scrapy.middlewares.SeleniumMiddleware. I only put it in the spider so you could see it better.
Boling: I am not getting any data printed :(
Wagon: It worked for me; the parse method printed the data. Do you get any errors or something?
Boling: Without the middleware it seems to work, but with the middleware the parse method's print('test ok') is not printed.
Boling: Just want to know, does the use of middleware make scraping faster?
Wagon: In this case no, it's just so you can use scrapy.Request with the page from Selenium.
Boling: Thank you, you massively supported me. I hope I can fix that middleware part too.
Boling: Hello, I am getting an error when the bot goes to the last page. Basically I was trying to save all the data in an array and convert it to JSON by appending, but I am getting "NoneType object is not iterable".
Wagon: Is it working for every other page? (If you can add the updated code it would be great.)
Boling: Yeah, it works for all pages, but when it goes to the last page it raises the NoneType error. All my code is in this gist; just look at line number 88, I think it is throwing the error from there (you can ignore my CSV-related part): https://gist.github.com/nawarazpokhrel/5626eb9998dba7951bad5e2a739036e8
Boling: Any update regarding the issue?
Wagon: You never update final_floor_sheet, so it stays empty.
Boling: I removed that, still the same error :(
Boling: I have updated the error traceback.
Wagon: start_requests must return an iterable with the first requests to crawl for this spider. Make a dummy request at the end of the function and create a parse method with pass, and it should be OK. (If you use the middleware, then you don't need to do this.)
Boling: Can you please update the same in the answer? It will be helpful to others too.
Wagon: @nava I checked, and it's enough to return an empty list. I've updated the first solution.
