Dataflow computing in Python

I have n processes (typically n < 10, but it should scale) running on different machines and communicating over AMQP using RabbitMQ. The processes are typically long-running and may be implemented in any language (though most are Java/Python).

Each process requires a number of inputs (numbers/strings) and produces a number of outputs (also just numbers or strings). A process is executed asynchronously: a message is sent to its input queue, and a callback fires when the result arrives on its output queue.
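To make the request/callback pattern concrete, here is a minimal sketch. It is not RabbitMQ code: standard-library `queue.Queue` objects and threads stand in for the AMQP queues and the remote processes, and the names `run_process`, `listen`, and `demo` are made up for illustration.

```python
import queue
import threading

def run_process(func, inbox, outbox):
    """Hypothetical worker: consume messages from inbox, publish results to outbox."""
    while True:
        message = inbox.get()
        if message is None:       # sentinel chosen for this sketch: stop the worker
            outbox.put(None)      # pass the shutdown signal on to the listener
            return
        outbox.put(func(**message))

def listen(outbox, callback):
    """Invoke callback for every result that arrives on the output queue."""
    while True:
        result = outbox.get()
        if result is None:
            return
        callback(result)

def demo():
    inbox, outbox = queue.Queue(), queue.Queue()
    received = []
    worker = threading.Thread(
        target=run_process,
        args=(lambda a, b: {"sum": a + b}, inbox, outbox),
    )
    listener = threading.Thread(target=listen, args=(outbox, received.append))
    worker.start()
    listener.start()
    inbox.put({"a": 1, "b": 2})   # returns immediately; the result arrives via callback
    inbox.put(None)               # shut everything down
    worker.join()
    listener.join()
    return received
```

In a real deployment the two queues would be RabbitMQ queues and the worker would live on another machine; only the shape of the interaction is shown here.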

Ideally the user specifies some inputs and desired outputs and the system should:

  • detect which processes are needed and generate the dependency graph
  • topologically sort the graph and execute it; node transitions will need to be event-driven

A node should fire if its input is ready, allowing parallelism per branch. I can assume no cycles for now, but eventually there will be cycles (e.g., two processes may need to iterate until the output no longer changes).
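The two steps and the firing rule can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which hands out every node whose predecessors are done. This is only an illustration for the acyclic case: the `PROCESSES` catalogue, `build_graph`, and `run` are hypothetical names, and the processes are plain local functions rather than remote workers.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical process catalogue: name -> (input names, output names, function)
PROCESSES = {
    "add": (("x", "y"), ("s",), lambda x, y: {"s": x + y}),
    "double": (("s",), ("d",), lambda s: {"d": 2 * s}),
    "negate": (("s",), ("n",), lambda s: {"n": -s}),
    "unused": (("x",), ("u",), lambda x: {"u": x}),
}

def build_graph(processes, available, wanted):
    """Backward-chain from the wanted outputs; return {process: upstream processes}."""
    producer = {out: name for name, (_, outs, _) in processes.items() for out in outs}
    graph = {}
    pending = [value for value in wanted if value not in available]
    while pending:
        value = pending.pop()
        if value not in producer:
            raise KeyError(f"no process produces {value!r}")
        name = producer[value]
        if name in graph:
            continue                          # already scheduled
        graph[name] = set()
        for needed in processes[name][0]:
            if needed in available:
                continue                      # supplied by the user
            if needed not in producer:
                raise KeyError(f"no process produces {needed!r}")
            graph[name].add(producer[needed])
            pending.append(needed)
    return graph

def run(processes, inputs, wanted):
    """Fire each node as soon as all of its inputs are available."""
    values = dict(inputs)
    sorter = TopologicalSorter(build_graph(processes, values, wanted))
    sorter.prepare()                          # raises graphlib.CycleError on cycles
    order = []
    while sorter.is_active():
        for name in sorter.get_ready():       # every node whose inputs are ready
            ins, _, func = processes[name]
            values.update(func(**{key: values[key] for key in ins}))
            order.append(name)
            sorter.done(name)                 # unblocks downstream nodes
    return {key: values[key] for key in wanted}, order
```

With the catalogue above, `run(PROCESSES, {"x": 1, "y": 2}, ["d", "n"])` runs `add` first, then `double` and `negate` (which are independent and could run in parallel), and never touches `unused`. In an event-driven version, the function call would become a message publish and `sorter.done(name)` would move into the output-queue callback.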

This should be a known problem from (data)flow programming (discussed here before), and I want to avoid reinventing the wheel. I would prefer a Python solution, and a search leads to Trellis and Pypes. Trellis is no longer developed but seems to support cycles, while Pypes does not. I am also not sure how actively Pypes is developed.

Further searches reveal a whole list of event-based programming frameworks, none of which I am particularly knowledgeable about. There are of course workflow environments like Taverna and KNIME, but those seem like overkill.

Does anybody have any experience tackling this type of problem or with the libraries mentioned?

Edit: Other libraries I found are:

Nardi answered 28/3, 2011 at 16:19 Comment(3)
What did you select in the end? - Paxton
I ended up just rolling my own thin layer on top of RabbitMQ. - Nardi
Yeah... maybe in the near future Dataflow/Beam will be a good solution for Python. https://mcmap.net/q/182612/-what-is-apache-beam-closed - Paxton

python.org has a Wiki page on "Flow Based Programming" -- http://wiki.python.org/moin/FlowBasedProgramming

Erma answered 18/10, 2012 at 0:46 Comment(0)

The bottom line is that if you can reinvent the wheel in a small number of lines of code (a few hundred) which you completely understand and can document, then do it.

This is an area where the abstractions used are not that hard to implement given some basic foundation tools. RabbitMQ is such a tool. Node.js is another. There are lots of libraries around that implement useful ways to manage dataflows, workflows, finite state machines, etc., but they have a lot of overlap and they tend to be incomplete. Probably the original developer just built enough to get over their initial problem, and since this type of programming was not that popular, there was not the critical mass to keep development going.

There is a lot to be said for ranking all the possible solutions by popularity, picking the most popular one, and putting your effort into making it work (while sharing your work, of course).

Tallbot answered 26/4, 2011 at 7:10 Comment(0)
