Unable to scrape Myntra API data using Scrapy framework: 307 redirect error
Below is the spider code:

import scrapy
class MyntraSpider(scrapy.Spider):

    custom_settings = {
        'HTTPCACHE_ENABLED': False,
        'dont_redirect': True,
        #'handle_httpstatus_list' : [302,307],
        #'CRAWLERA_ENABLED': False,
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    }


    name = "heytest"
    allowed_domains = ["www.myntra.com"]
    start_urls = ["https://www.myntra.com/web/v2/search/data/duke"]
    def parse(self, response):
        self.logger.debug('Parsed jabong.com')

"Parsed jabong.com" is not getting logged. Actually, callback method(parse) is not getting called. Kindly revert.

Please find the error logs from Scrapinghub:

See also Postman screenshot

Monopolist answered 16/12, 2017 at 6:32 Comment(7)
did you check what status 307 means?Evanston
is there documentation for this API?Evanston
hi, I found this API on the Myntra site. I don't have documentation for it, but when I run it in Postman it works with a 200 OK code.Monopolist
show a screenshot from PostmanEvanston
Screenshot: link. I think there is no issue with the spider; I think we need to change the header config. What do you think?Monopolist
you could add the link to the question so other people will see it.Evanston
the request data in Postman would be more interesting, i.e. headers, parameters, etc.Evanston
I ran this code (only a few times) and had no problem getting the data.

It looks similar to your code, so I don't know why you have a problem.

Maybe they blocked you for some reason.

#!/usr/bin/env python3

import scrapy
import json

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['www.myntra.com']

    start_urls = ['https://www.myntra.com/web/v2/search/data/duke']

    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)

        #print(response.body)

        data = json.loads(response.body)

        print('data.keys():', data.keys())

        print('meta:', data['meta'])

        print("data['data']:", data['data'].keys())

        # download files
        #for href in response.css('a::attr(href)').extract():
        #   url = response.urljoin(href)
        #   yield {'file_urls': [url]}

        # download images and convert to JPG
        #for src in response.css('img::attr(src)').extract():
        #   url = response.urljoin(src)
        #   yield {'image_urls': [url]}

# --- it runs without project and saves in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    #'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',

    # save in CSV or JSON
    'FEED_FORMAT': 'csv',     # or 'json'
    'FEED_URI': 'output.csv', # or 'output.json'

    # download files to `FILES_STORE/full`
    # it needs `yield {'file_urls': [url]}` in `parse()`
    #'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    #'FILES_STORE': '/path/to/valid/dir',

    # download images and convert to JPG
    # it needs `yield {'image_urls': [url]}` in `parse()`
    #'ITEM_PIPELINES': {'scrapy.pipelines.images.ImagesPipeline': 1},
    #'IMAGES_STORE': '/path/to/valid/dir',

    #'HTTPCACHE_ENABLED': False,
    #'dont_redirect': True,
    #'handle_httpstatus_list' : [302,307],
    #'CRAWLERA_ENABLED': False,
})
c.crawl(MySpider)
c.start()
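The parse() above assumes the endpoint returns JSON with top-level meta and data keys. A minimal stdlib sketch of that decoding step, using a fabricated sample payload (the real API's fields will differ):

```python
import json

# fabricated sample mimicking the shape parse() expects from response.body
body = b'{"meta": {"code": 200}, "data": {"searchData": []}}'

data = json.loads(body)
print('data.keys():', list(data.keys()))
print('meta:', data['meta'])
print("data['data']:", list(data['data'].keys()))
```

json.loads accepts bytes directly in Python 3.6+, which is why response.body can be passed without decoding first.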
Evanston answered 16/12, 2017 at 17:13 Comment(2)
hi, did this spider's requests call the parse method? I don't think they have blocked me; we are using Crawlera.Monopolist
it uses parse() as the default callback for all requests. Crawlera may catch standard blocks like 403 or 503, but 30x means redirection, which mostly isn't blocking. Maybe Crawlera's admins can say something more.Evanston
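As the comment notes, 30x codes are redirects rather than blocks. Python's http stdlib names 307 explicitly; it is a temporary redirect that tells the client to repeat the same method (and body) at the new location, unlike 302, which clients historically downgrade to GET:

```python
from http import HTTPStatus

status = HTTPStatus(307)
print(status.name)    # TEMPORARY_REDIRECT
print(status.phrase)  # Temporary Redirect

# the target URL of the redirect is carried in the response's
# Location header, which a spider can inspect when it handles 307 itself
```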

© 2022 - 2024 — McMap. All rights reserved.