Crawling LinkedIn while authenticated with Scrapy

So I've read through Crawling with an authenticated session in Scrapy and I'm getting hung up. I'm 99% sure my parse code is correct; I just don't believe the login is redirecting and succeeding.

I'm also having an issue with check_login_response(): I'm not sure what page it is checking, though checking for "Sign Out" would make sense.




====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': '[email protected]', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..

            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****

        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items



The issue was resolved by adding return in front of self.initialized().
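For anyone else hitting this, here is the fix spelled out (this assumes the pre-1.0 InitSpider API used above, where initialized() hands back the requests that start the normal crawl of start_urls):

def check_login_response(self, response):
    if "Sign Out" in response.body:
        # Without "return", this callback hands Scrapy None and the spider
        # closes right after logging in; returning initialized() lets Scrapy
        # schedule the start_urls requests.
        return self.initialized()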

Thanks Again! -Mark

Minesweeper answered 8/6, 2012 at 18:16 Comment(9)
What happens when you run the above code? – Domenic
'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats:{} – Minesweeper
This sort of information should be put in your original question rather than in comments. – Domenic
@Domenic I will update my post above now; let's see if we can figure out what's going on. – Minesweeper
Does SgmlLinkExtractor apply to login_page (or the page after it loads) or to start_urls? – Minesweeper
The rules define how links should be extracted from crawled pages, i.e. from the pages defined in start_urls and all other pages reached while crawling from them. – Domenic
@Domenic Okay, that makes more sense. Can you help with this then? I want to crawl all the results of the pages in the search, but I still cannot figure out how to get it to go to the search page and crawl it. Is it because the rules are blocking it? – Minesweeper
@ACorn I've interchanged many things and I cannot get it to work. Any ideas? – Minesweeper
@Minesweeper where did you get that linkedpy library? – Heathheathberry
class LinkedPySpider(BaseSpider):

should be:

class LinkedPySpider(InitSpider):

Also, you shouldn't override the parse function, as I mentioned in my answer here: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy
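For example, a minimal sketch of the rename (parse_item is just the conventional callback name, nothing special; the body is whatever extraction logic you already have):

def parse_item(self, response):
    # Same extraction logic as the question's parse(); only the name changes,
    # because CrawlSpider uses parse() internally to drive its rules, so your
    # own callback must not shadow it.
    self.log("parse_item reached: %s" % response.url)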

If you don't understand how to define the rules for extracting links, just have a proper read through the documentation:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
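As a minimal sketch of what a rules definition looks like (note that rules are processed by CrawlSpider; the domain, allow pattern, and parse_item callback below are placeholders, not LinkedIn-specific):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/results']

    # Extract links matching the allow pattern from each crawled page,
    # send every matched page to parse_item, and keep following links
    # found on those pages as well.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/results\?page=\d+'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("Crawled %s" % response.url)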

Domenic answered 8/6, 2012 at 18:22 Comment(4)
That did help. I see a log of success, but I am not sure def parse(self, response): is actually running. I tried putting a self.log() in there and nothing was logged. – Minesweeper
It seems parse() should be parse_item(). – Minesweeper
There is a GOOD chance the problem has to do with the above and allow=r'-\w+.html$', as I do not know what this is. – Minesweeper
(Updated based on these changes) – Minesweeper
