Multi-threaded web scraper using urlretrieve on a cookie-enabled site

I am trying to write my first Python script, and with lots of Googling, I think that I am just about done. However, I will need some help getting myself across the finish line.

I need to write a script that logs onto a cookie-enabled site, scrape a bunch of links, and then spawn a few processes to download the files. I have the program running in single-threaded, so I know that the code works. But, when I tried to create a pool of download workers, I ran into a wall.

#manager.py
import Fetch # the module where the worker functions live
import multiprocessing

def FetchReports(links,Username,Password,VendorID):
    # SiteBase and DataPath are module-level settings defined elsewhere in my script
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,links)
    pool.close()
    pool.join()


#worker.py
import mechanize
import atexit

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    Login(User,Password)   # Login and Logout are my own helpers, omitted here

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    atexit.register(Logout)

def DownloadJob(link):
    # filename and data are derived from the link elsewhere in the real code
    mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)
    return True

As written, this fails because the cookies never make it to the workers for urlretrieve to use. No problem: I was able to use mechanize's LWPCookieJar class to save the cookies in the manager and pass them to the workers.

#worker.py
import mechanize
import atexit

from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()

    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)  # note I pass the opener to Login so it can catch the cookies.

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    # save this worker's cookies to its own file, keeping discarded/expired entries
    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

    atexit.register(Logout)

def DownloadJob(link):
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))

    # filename is derived from the link elsewhere in the real code
    outfile = open(DataPath+'\\'+filename, "wb")
    outfile.write(opener.open(mechanize.urljoin(SiteBase, link)).read())
    outfile.close()

But THAT fails because the opener (I think) wants to move the binary file back to the manager for processing, and I get an "unable to pickle object" error message referring to the web page it is trying to read into the file.

The obvious solution is to read the cookies in from the cookie jar and manually add them to the headers when making the urlretrieve request, but I am trying to avoid that, which is why I am fishing for suggestions.
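
For the record, the manual approach I am trying to avoid would look roughly like this (a sketch only, meant to sit in worker.py next to the code above; LocalFileName is a made-up stand-in for however the real code turns a link into a file name):

def DownloadJobManual(link):
    # re-load this process's saved cookies, exactly as DownloadJob does above
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt',
              ignore_discard=True, ignore_expires=True)

    # build the request by hand and stamp the saved cookie headers onto it ourselves
    request = mechanize.Request(mechanize.urljoin(SiteBase, link))
    cj.add_cookie_header(request)

    # plain urlopen is enough now; no cookie-aware opener is involved
    response = mechanize.urlopen(request)
    with open(DataPath+'\\'+LocalFileName(link), "wb") as f:
        f.write(response.read())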

Flowing answered 24/5, 2011 at 13:44 Comment(0)

After working on this for most of the day, it turns out that mechanize was not the problem; it looks more like a coding error on my part. After extensive tweaking and cursing, I have gotten the code to work properly.

For future Googlers like myself, I am providing the updated code below:

#manager.py [only lightly changed from the original]
def FetchReports(links,Username,Password,VendorID):
    import Fetch
    import multiprocessing

    # SiteBase and DataPath are module-level settings; _SplitLinksArray is my own
    # helper that chunks the link list before it is handed to the pool.
    pool = multiprocessing.Pool(processes=4, initializer=Fetch._ProcessStart, initargs=(SiteBase,DataPath,Username,Password,VendorID,))
    pool.map(Fetch.DownloadJob,_SplitLinksArray(links))
    pool.close()
    pool.join()


#worker.py
import mechanize
from multiprocessing import current_process

def _ProcessStart(_SiteBase,_DataPath,User,Password,VendorID):
    global cookies
    cookies = mechanize.LWPCookieJar()
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cookies))

    Login(User,Password,opener)  # Login is my own helper, omitted here

    global SiteBase
    SiteBase = _SiteBase

    global DataPath
    DataPath = _DataPath

    # each worker process saves its cookies to its own file
    cookies.save(DataPath+'\\'+current_process().name+'cookies.txt',True,True)

def DownloadJob(link):
    # reload this process's cookies and build a cookie-aware opener
    cj = mechanize.LWPCookieJar()
    cj.revert(filename=DataPath+'\\'+current_process().name+'cookies.txt', ignore_discard=True, ignore_expires=True)
    opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
    mechanize.install_opener(opener)  # make it the default opener so urlretrieve below sends the cookies

    # filename and data are derived from the link elsewhere in the real code
    mechanize.urlretrieve(url=mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)

Because I am just downloading links from a list, the non-threadsafe nature of mechanize doesn't seem to be a problem [full disclosure: I have run this process exactly three times, so a problem may still appear under further testing]. The multiprocessing module and its worker pool do all the heavy lifting. Maintaining cookies in files was important for me because the web server I am downloading from gives each worker its own session ID, but other people implementing this code may not need that. I did notice that the workers seem to "forget" variables between the initializer call and the run call, so the cookie jar may not make the jump.
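
For anyone skimming, the essential pattern, with mechanize and the site-specific parts stripped out, is just a Pool initializer that sets per-process globals, with each worker keying its own scratch file on its process name. A toy sketch (all names made up for illustration):

import multiprocessing
import os

def _process_start(data_path):
    # runs once in each worker process, before any jobs are handed to it
    global DataPath, SessionTag
    DataPath = data_path
    SessionTag = multiprocessing.current_process().name  # unique per worker

def download_job(link):
    # each worker appends to its own file, mimicking one cookie/session file per process
    with open(os.path.join(DataPath, SessionTag + '_log.txt'), 'a') as f:
        f.write('would fetch %s\n' % link)
    return link

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=4, initializer=_process_start, initargs=('.',))
    pool.map(download_job, ['a.xls', 'b.xls', 'c.xls', 'd.xls'])
    pool.close()
    pool.join()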

Flowing answered 25/5, 2011 at 13:59 Comment(1)
One more bug in my code that I will post as a future question: none of my worker processes exit properly. The atexit function is there, but it does not fire unless I change it into a decorator, and then it loses all the session variables I used to log into the site in the first place. For now, it's okay to leave the eight sessions hanging, but I will have to revisit the procedure in the future. – Flowing

Creating a multi-threaded web scraper the right way is hard. I'm sure you could handle it, but why not use something that has already been done?

I really suggest you check out Scrapy: http://scrapy.org/

It is a very flexible open source web scraper framework that will handle most of the stuff you would need here as well. With Scrapy, running concurrent spiders is a configuration issue, not a programming issue (http://doc.scrapy.org/topics/settings.html#concurrent-requests-per-spider). You will also get support for cookies, proxies, HTTP Authentication and much more.
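
To give a feel for that, concurrency and cookie behaviour are just a few lines in the project's settings.py. A sketch (the setting names below follow the current settings reference rather than the older CONCURRENT_REQUESTS_PER_SPIDER name used in the link above, so check the docs for the version you install):

# settings.py -- sketch of the knobs relevant here
BOT_NAME = 'report_fetcher'          # hypothetical project name

CONCURRENT_REQUESTS = 8              # total requests in flight
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # per-domain cap, roughly the "per spider" idea above
COOKIES_ENABLED = True               # cookie/session handling is built in
DOWNLOAD_DELAY = 0.5                 # optional: be polite to the server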

For me, it took around 4 hours to rewrite my scraper in Scrapy. So ask yourself: do you really want to solve the threading issues yourself, or would you rather stand on the shoulders of others and focus on the web scraping, not the threading?

PS. Are you using mechanize now? Please note this from the mechanize FAQ (http://wwwsearch.sourceforge.net/mechanize/faq.html):

"Is it threadsafe?

No. As far as I know, you can use mechanize in threaded code, but it provides no synchronisation: you have to provide that yourself."

If you really want to keep using mechanize, start by reading up on how to provide that synchronisation yourself (e.g. http://effbot.org/zone/thread-synchronization.htm, http://effbot.org/pyfaq/what-kinds-of-global-value-mutation-are-thread-safe.htm). A rough sketch of one way to do it follows.
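
For illustration only, a minimal way to provide that synchronisation is to share one mechanize.Browser and guard every use of it with a lock (this is a sketch, not your code; per-thread Browser instances would avoid the lock entirely):

import threading
import mechanize

browser = mechanize.Browser()        # one shared, non-threadsafe Browser
browser_lock = threading.Lock()      # serialise all access to it

def fetch(url):
    # only one thread may touch the shared Browser at a time
    with browser_lock:
        response = browser.open(url)
        return response.read()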

Rese answered 25/5, 2011 at 8:15 Comment(5)
From the OP's question, it sounds like he's doing it for education, so Scrapy won't suit his needs. – Fucus
Oh right, I didn't pick up on that. I'm still leaving my answer as-is in case someone else finds this through Google. – Rese
Scrapy looks like a great resource, and something I will certainly check out as our needs grow. However, my scraping code is already functional (and single-threaded), and is not complex enough to justify starting over with another solution. The downloads, on the other hand, are far more important, with 400+ Excel spreadsheets that have to be downloaded weekly. – Flowing
I've been where you are now :) I used other scraping mechanisms a lot before Scrapy because I just didn't find the time to invest in learning it. Once I did, there was no way back: Scrapy is simply excellent for web scraping with Python. Honestly, I really recommend you try it out! – Rese
I certainly will check it out. Thanks very much for the recommendation! – Flowing

To enable the cookie session in the first code example, add the following to the DownloadJob function:

cj = mechanize.LWPCookieJar()
opener = mechanize.build_opener(mechanize.HTTPCookieProcessor(cj))
mechanize.install_opener(opener)  # urlretrieve then goes through this cookie-aware opener

And then you can retrieve the URL as you already do:

mechanize.urlretrieve(mechanize.urljoin(SiteBase, link),filename=DataPath+'\\'+filename,data=data)
Triboluminescence answered 1/11, 2013 at 2:44 Comment(0)
