Scrapy 403 response because of Cloudflare (clutch.co)
I'm trying to scrape some info about different agencies from clutch.co. When I look up the URLs in my browser everything is fine, but with Scrapy I get a 403 response. From everything I've read on related issues, I suppose it's coming from Cloudflare. Is there any way I can bypass these security measures? Here's my Scrapy code:

import scrapy
from datetime import datetime


class ClutchSpider(scrapy.Spider):
    name = "clutch"
    allowed_domains = ["clutch.co"]
    start_urls = ["http://clutch.co/"]

    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'CONCURRENT_REQUESTS': 5,
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 5,
        'ROBOTSTXT_OBEY': False,
        # FEED_URI (not FEED_URL) is the setting Scrapy reads for the output path
        'FEED_URI': f'output/output{datetime.timestamp(datetime.now())}.json',
        'FEED_FORMAT': 'json',
    }


    def __init__(self, *args, **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.input_urls = ['https://clutch.co/directory/mobile-application-developers']
        self.headers = {
                        'accept': '*/*', 
                        'accept-encoding': 'gzip, deflate, br', 
                        'accept-language': 'en-US,en;q=0.9,fa;q=0.8', 
                        # 'cookie': 'shortlist_prompts=true; FPID=FPID2.2.iqvavTK2dqTJ7yLsgWqoL8fYmkFoX3pzUlG6mTVjfi0%3D.1673247154; CookieConsent={stamp:%27zejzt8TIN2JRypvuDr+oPX/PjYUsuVCNii4qWhJvCxxtOxEXcb5hMg==%27%2Cnecessary:true%2Cpreferences:true%2Cstatistics:true%2Cmarketing:true%2Cmethod:%27explicit%27%2Cver:1%2Cutc:1673247163647%2Cregion:%27nl%27}; _gcl_au=1.1.1124048711.1676796982; _gid=GA1.2.316079371.1676796983; ab.storage.deviceId.c7739970-c490-4772-aa67-2b5c1403137e=%7B%22g%22%3A%22d2822ae5-4bac-73ae-cfc0-86adeaeb1add%22%2C%22c%22%3A1676797005041%2C%22l%22%3A1676797005041%7D; ln_or=eyIyMTU0NjAyIjoiZCJ9; hubspotutk=f019384cf677064ee212b1891e67181c; FPLC=o62q7Cwf0JP12iF73tjxOelgvID3ocGZrxnLxzHlB%2F9In25%2BL7oYAwvSxOTnaZWDYH7G2iMkQ03VUW%2BJgWsv7i7StDXSdFnQr6Dpj6VC%2F2Ya4ZptNbWzzRcJUv00JA%3D%3D; __hssrc=1; shortlist_prompts=true; __hstc=238368351.f019384cf677064ee212b1891e67181c.1676798584729.1676873409297.1676879456609.3; __cf_bm=Pn4xsZ2pgyFdB0bdi9t0xTpqxVzY9t5vhySYN6uRpAQ-1676881063-0-AT8uJ+ux6Tmu0WU+bsJovJ1CubUhs+C0JBulUr1i2aQLY28rn7T23PVuGWffSrCaNjeeYPzSDN42NJ46j10jKEPjPO3mS4P8uMx9dDmA7wTqz5NCdil5W5uGQJs2pMbcjbQSfNTjQLh5umYER6hhhLx8qrRFHDnTTJ1vkORfc0eSqBe0rjqaHeR4HFINZOp1UQ==; _ga=GA1.2.298895719.1676796981; _gat_gtag_UA_2814589_5=1; __hssc=238368351.3.1676879456609; _ga_D0WFGX8X3V=GS1.1.1676879405.3.1.1676881074.46.0.0', 
                        'referer': 'https://google.com', 
                        'sec-ch-ua': '"Chromium";v="110", "Not A(Brand";v="24", "Microsoft Edge";v="110"', 
                        'sec-ch-ua-mobile': '?0', 
                        'sec-ch-ua-platform': '"Windows"', 
                        'sec-fetch-dest': 'empty', 
                        'sec-fetch-mode': 'cors', 
                        'sec-fetch-site': 'same-origin', 
                        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36 Edg/110.0.1587.50'
                    }

    def start_requests(self):
        for url in self.input_urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        agencies = response.xpath(".//div[@class='company col-md-12 prompt-target sponsor']/a/@href").extract()
        for agency in agencies:
            # yield (not return) so every agency link is followed, not just the first
            yield response.follow(agency, callback=self.parse_agency, headers=self.headers)


PS: I'd rather not use tools such as Selenium, since they make everything too slow. But if there is no other way around this issue, how can I make use of Selenium? (Though it also gave me a 403.)

Dunne answered 20/2, 2023 at 10:27 Comment(1)
Good day dear Fateme, many thanks for the hints. I am running into the same issues with Cloudflare when I try to obtain some data from clutch.co. I am trying to find a solution; perhaps your approach will work for me too, I'll try it out. Perhaps you have some more ideas for me: #76390087 – Icken
Use the cloudscraper project, which is designed to bypass Cloudflare protection:

import cloudscraper

# returns a CloudScraper instance
scraper = cloudscraper.create_scraper()

# CloudScraper inherits from requests.Session
# Or: scraper = cloudscraper.CloudScraper()  

page = scraper.get("http://somesite.com")

# 200
print(page.status_code)

Installation:

Simply run pip install cloudscraper. The PyPI package is at https://pypi.python.org/pypi/cloudscraper/

Seizing answered 21/2, 2023 at 8:37 Comment(3)
Good day dear Jurakin, many thanks for the hints. I am running into the same issues with Cloudflare when I try to obtain some data from clutch.co. I am trying to find a solution; perhaps your approach will work for me too, I'll try it out. – Icken
I need a minimalist approach that works even on Google Colab - can you help me here? – Icken
Thank you for this hint! I like it, and yes, it's a good choice to have a look at cloudscraper first. I think it's always good, first of all, to take a look at your soup to see if all the expected ingredients are there. – Icken

© 2022 - 2024 — McMap. All rights reserved.