Using urllib3 or requests and Celery
We have a script that downloads documents from various sources periodically. I'm going to move this over to Celery, and while doing so I'd like to take advantage of connection pooling, but I'm not sure how to go about it.

My current thought is to do something like this using Requests:

import celery
import requests

# One module-level session, shared by every task that runs in this worker process
s = requests.Session()

@celery.task(max_retries=2)
def get_doc(url):
    doc = s.get(url)
    # do stuff with doc

But I'm concerned that the connections will stay open indefinitely.

I really only need the connections to stay open so long as I'm processing new documents.
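Within a single task or batch loop, one way to bound the lifetime of the pooled connections is to use the session as a context manager: requests closes the session's adapters, and the sockets they pool, when the `with` block exits. A minimal sketch, where `process_batch` is a hypothetical helper and not part of the original question:

```python
import requests

def process_batch(urls):
    # Hypothetical helper: pooled connections are reused across all the
    # requests in this loop, then closed when the `with` block exits.
    with requests.Session() as s:
        return [s.get(url).status_code for url in urls]
```

This only bounds connection lifetime within one call, though; it doesn't share a pool across separate Celery tasks.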

So is something like this possible:

import celery
import requests


def get_all_docs():
    docs = Doc.objects.filter(some_filter=True)
    s = requests.Session()
    for doc in docs:
        t = get_doc.delay(doc.url, s)

@celery.task(max_retries=2)
def get_doc(url, s):
    doc = s.get(url)
    # do stuff with doc

However, in this case, I'm not certain that the connection sessions will persist across instances, or if Requests will create new connections once the pickling / unpickling is complete.
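The pickling concern can be checked directly: a requests `Session` is picklable, but unpickling it rebuilds its connection adapters, so the worker gets a brand-new object with empty pools rather than the sender's live sockets. A small sketch illustrating this:

```python
import pickle

import requests

s = requests.Session()
restored = pickle.loads(pickle.dumps(s))

# The unpickled object is a distinct Session; requests reconstructs the
# HTTPAdapters (and their pools) on unpickling, so any open connections
# the original session held do not travel with the task argument.
assert restored is not s
assert isinstance(restored, requests.Session)
```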

Lastly, I could try the experimental support for task decorators on a class method, so something like this:

import celery
import requests


class GetDoc(object):
    def __init__(self):
        self.s = requests.Session()

    @celery.task(max_retries=2)
    def get_doc(self, url):
        doc = self.s.get(url)
        # do stuff with doc

The last one seems like the best approach, and I'm going to test it; however, I was wondering if anyone here has already done something similar, or if one of you reading this has a better approach than the methods above.
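A related pattern (not from the question, just a sketch of one common idiom) is to create the session lazily in a module-level cache, so each worker process builds exactly one pooled session the first time any task needs it, and every later task in that process reuses it:

```python
import requests

_session = None  # one Session per worker process


def get_session():
    # Lazily create the session on first use in this process; later
    # calls (i.e. later tasks run by the same worker) reuse the same
    # pooled session instead of opening fresh connections.
    global _session
    if _session is None:
        _session = requests.Session()
    return _session
```

Inside the task body you would then call `get_session().get(url)` instead of holding the session on the class instance.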

Newbold answered 7/9, 2012 at 17:0 Comments (3)
I suspect you're right. I'm not an expert in the inner workings of Celery, but from what I understand, each job is run by a separate worker, and you have no guarantee that worker A, having made one request to google.com, will also make the next request to google.com. I imagine that sharing resources across tasks is inherently against how Celery works, unless there is a specific Celery design feature to support it. – Three
I'm thinking about this exact same thing. Did you ever come up with a solution? – Cyme
I would love to know the answer to this as well. – Stink
