I now have a large number of documents to process and am using Python RQ to parallelize the task.
I would like to set up a pipeline of work so that different operations are performed on each document in sequence. For example: A -> B -> C means pass the document to function A; after A is done, proceed to B, and finally C.
However, Python RQ does not seem to support pipelines very nicely.
Here is a simple but somewhat dirty way of doing it: each function along the pipeline enqueues its successor, in a nested fashion.
For example, for a pipeline A -> B -> C.
At the top level, some code is written like this:

q.enqueue(A, the_doc)

where q is the Queue instance. In function A, there is code like:

q.enqueue(B, the_doc)

And in B, there is something like this:

q.enqueue(C, the_doc)
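For instance, a minimal sketch of this nested approach (the stage bodies and the Redis connection details here are placeholder assumptions):

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())  # assumes a Redis server on localhost

def C(doc):
    pass  # last stage of the pipeline, nothing left to enqueue

def B(doc):
    # ... do B's work on doc ...
    q.enqueue(C, doc)  # hand off to the next stage

def A(doc):
    # ... do A's work on doc ...
    q.enqueue(B, doc)

q.enqueue(A, the_doc)  # top level: kick off the pipeline for one document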
Is there a more elegant way than this? For example, some code in ONE function:

q.enqueue(A, the_doc)
q.enqueue(B, the_doc, after=A)
q.enqueue(C, the_doc, after=B)
The depends_on parameter is the closest to my requirement; however, running something like:

A_job = q.enqueue(A, the_doc)
q.enqueue(B, depends_on=A_job)

won't work, because q.enqueue(B, depends_on=A_job) is executed immediately after A_job = q.enqueue(A, the_doc). By the time B is enqueued, the result from A might not be ready, as it takes time to process.
PS:
If Python RQ is not well suited to this, what other tool in Python can I use to achieve the same purpose:
- round-robin parallelization
- pipeline processing support
Comment: With q.enqueue(B, depends_on=A_job), job B will be processed after A is finished. Isn't what matters when it is processed, rather than when it is enqueued? – Talapoin
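For reference, a minimal sketch of the full chain the comment describes, assuming A, B, and C are task functions a worker can import (for example, the ones in the sketch above). Per RQ's documented behavior, B and C are enqueued immediately but stay deferred until the job they depend on has finished:

from redis import Redis
from rq import Queue

q = Queue(connection=Redis())

A_job = q.enqueue(A, the_doc)
B_job = q.enqueue(B, the_doc, depends_on=A_job)  # runs only after A_job finishes
C_job = q.enqueue(C, the_doc, depends_on=B_job)  # runs only after B_job finishes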