ZeroMQ/ZMQ Push/Pull pattern usefulness

In experimenting with the ZeroMQ Push/Pull (what they call Pipeline) socket type, I'm having difficulty understanding the utility of this pattern. It's billed as a "load-balancer".

Given a single server sending tasks to a number of workers, Push/Pull will evenly hand out the tasks between all the clients. 3 clients and 30 tasks, each client gets 10 tasks: client1 gets tasks 1, 4, 7,... client2, 2, 5,... and so on. Fair enough. Literally.

However, in practice there is often a non-homogeneous mix of task complexity and client compute resources (or availability), and then this pattern breaks badly. All the tasks seem to be scheduled in advance, and the server has no knowledge of the clients' progress or whether they are even available. If client1 goes down, its remaining tasks are not sent to the other clients but remain queued for client1. If client1 stays down, those tasks are never handled. Conversely, if a client is faster at processing its tasks, it doesn't get further tasks and remains idle, because the rest remain scheduled for the other clients.

Using REQ/REP is one possible solution; tasks are then only given to an available resource.

So am I missing something? How is Push/Pull meant to be used effectively? Is there a way to handle the asymmetry of clients, tasks, etc., with this socket type?

Thanks!

Here's a simple Python example:

# server

import zmq
import time

context = zmq.Context()
socket = context.socket(zmq.PUSH)
#socket = context.socket(zmq.REP)   # uncomment for Req/Rep

socket.bind("tcp://127.0.0.1:5555")

i = 0
time.sleep(1)   # naive wait for clients to arrive

while True:
  #msg = socket.recv()    # uncomment for Req/Rep
  socket.send(chr(i))
  i += 1 
  if i == 100:
    break

time.sleep(10)   # naive wait for tasks to drain

.

# client

import zmq
import time
import sys

context = zmq.Context()

socket = context.socket(zmq.PULL)
#socket = context.socket(zmq.REQ)    # uncomment for Req/Rep

socket.connect("tcp://127.0.0.1:5555")

delay = float(sys.argv[1])

while True:
  #socket.send('')     # uncomment for Req/Rep
  message = socket.recv()
  print "recv:", ord(message)
  time.sleep(delay)

Fire up 3 clients with a delay parameter on the command line (e.g., 1, 1, and 0.1) and then the server, and see how all the tasks are distributed evenly. Then kill one of the clients to see that its remaining tasks aren't handled.

Uncomment the lines indicated to switch the sockets to Req/Rep and watch it behave as a more effective load-balancer.

Volitive answered 20/9, 2012 at 0:13 Comment(0)

It's not a load balancer; that was a faulty explanation that stayed in the 0MQ docs for a while. To do load-balancing, you have to get some information back from the workers about their availability. PUSH, like DEALER, is a round-robin distributor. It's useful for its raw speed and simplicity. You don't need any kind of chatter; just pump tasks down the pipeline and they're sprayed out to all available workers as rapidly as the network can handle them.

The pattern is useful when you are doing very high volumes of small tasks and workers come and go infrequently. The pattern is not good for larger tasks that take time to complete, because then you want a single queue that sends new tasks only to available workers. It also suffers from an anti-pattern: if the sender pushes a large batch of tasks before the workers have connected, the first worker to connect will grab 1,000 or so messages while the others are still busy connecting.
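
One way to soften that prefetch behaviour is to cap the high-water marks on both ends, so that no single pipe can buffer the whole batch and the PUSH side blocks instead of spraying everything at the first connected worker. A rough sketch with pyzmq (the port here is made up; on libzmq 2.x the option is a single zmq.HWM, on 3.x and later it is split into SNDHWM/RCVHWM):

# sketch only: cap how many tasks can queue up per connection
import zmq

context = zmq.Context()

sender = context.socket(zmq.PUSH)
sender.setsockopt(zmq.SNDHWM, 10)     # libzmq 3.x+; on 2.x use zmq.HWM instead
sender.bind("tcp://127.0.0.1:5557")   # hypothetical port, not the one from the question

worker = context.socket(zmq.PULL)
worker.setsockopt(zmq.RCVHWM, 10)     # keep each worker's inbound queue small too
worker.connect("tcp://127.0.0.1:5557")

With both ends capped, a worker can only pre-buffer on the order of SNDHWM + RCVHWM messages, so a late-joining worker still gets a share of the remaining work.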

You can build your own higher-level routing in several ways. Look at the LRU pattern in the Guide: there the workers explicitly tell the broker they are 'ready'. You can also do credit-based flow control, which is what I'd do in any real load-balancing situation; it's a generalization of the LRU pattern. See http://hintjens.com/blog:15
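
As a rough illustration of that 'ready' idea (a stripped-down sketch, not the Guide's code; the port and the READY convention are made up here): the broker hands a task only to a worker that has just told it that it is idle, so a fast worker asks more often and a dead worker simply stops asking.

# broker (sketch)
import zmq

context = zmq.Context()
backend = context.socket(zmq.ROUTER)
backend.bind("tcp://127.0.0.1:5558")   # hypothetical port

tasks = [str(n).encode() for n in range(100)]

while tasks:
  # each message from a REQ worker arrives as [identity, empty frame, body];
  # receiving one means that worker is idle right now
  identity, empty, body = backend.recv_multipart()
  backend.send_multipart([identity, b'', tasks.pop(0)])
# (real code would also collect the results and tell the workers to stop)

# worker (sketch)
import time
import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://127.0.0.1:5558")

socket.send(b'READY')              # first "I'm idle" announcement
while True:
  task = socket.recv()
  time.sleep(0.1)                  # pretend to work
  socket.send(b'done ' + task)     # the reply doubles as the next "I'm idle"

Because a task only goes out in response to a message from a worker, the distribution adapts to each worker's speed and availability, which is exactly what plain PUSH/PULL can't do.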

Uninterrupted answered 21/9, 2012 at 9:44 Comment(4)
When a worker does fail, is there a mechanism to both detect this and recover the queued tasks that have been assigned but not sent? Something like a timeout with task redistribution. – Volitive

If you want to detect failed workers you have to add this yourself. It's relatively easy: collect all results, and if one is missing, restart the whole batch. Failure is rare enough that this simple, brutal approach handles it well. – Uninterrupted

The link at the end is broken. – Sunny

What if we use the ZMQ_HWM option and set it to a small number on all downstream pullers? Wouldn't that force the pusher not to simply send all requests to the first puller that connects? – Lisa
