Python threading queue is very slow
I acquire samples (integers) at a very high rate (several kilosamples per second) in a thread and put() them into a threading.Queue. The main thread get()s the samples one by one into a list of length 4096, packs them with msgpack and finally sends them via ZeroMQ to a client. The client displays the chunks on the screen (print or plot). In short, the original idea is: fill the queue with single samples, but empty it in large chunks.

Everything works exactly as expected, but the latter part, i.e. accessing the queue, is very slow: the queue keeps growing and the output always lags behind by several seconds to tens of seconds.

My question is: what can I do to make queue access faster? Is there a better approach?

Aurita answered 20/9, 2016 at 15:49 Comment(5)
Are you sure your bottleneck is the queue operations and not the client operation? – Launch
collections.deque is much faster than threading.Queue and its appends and pops are also thread-safe, but it does not have all the features. Maybe multiprocessing.dummy (which actually uses threads) is worth a look for you, too. – Twopiece
You could build complete lists of 4096 samples in the sampling thread and then put those lists into the Queue as single items - this would require far fewer of the comparatively slow Queue-method calls. – Twopiece
@Launch yes, I could check that: I temporarily sent data directly from the sampling thread to the client. Though not totally fast, it was much faster than with the queue, but of course at the cost of lost samples. – Aurita
@Twopiece: building lists in the sampling thread greatly improved the speed! It is nearly perfect now, thanks so much for the hint! I also changed the chunk size; the speed seems to depend on it as well, and the optimum appears to be 1024, not more, not less. I am going to check your other suggestions on deque and multiprocessing to see whether it can get any better. – Aurita
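The batching idea from the comments above can be sketched as follows. This is a minimal illustration, not the asker's actual code: `sampler` and the `range(...)` sample source are hypothetical stand-ins for the real acquisition loop, and the chunk size of 1024 is the value the asker reported as optimal.

```python
import threading
import queue

CHUNK = 1024  # chunk size the asker found to work best

def sampler(q, source, chunk=CHUNK):
    """Collect samples into a local list and put whole chunks on the queue.

    One q.put() per chunk replaces thousands of per-sample put() calls,
    which is where the original design lost its time.
    """
    buf = []
    for sample in source:
        buf.append(sample)
        if len(buf) >= chunk:
            q.put(buf)   # hand over the whole chunk in one queue operation
            buf = []
    if buf:
        q.put(buf)       # flush the final, partial chunk

# demo with a fake sample source standing in for the acquisition hardware
q = queue.Queue()
t = threading.Thread(target=sampler, args=(q, range(5000)))
t.start()
t.join()

chunks = []
while not q.empty():
    chunks.append(q.get())
```

The consumer then receives 1024-sample lists ready to be msgpack-ed and sent over ZeroMQ, without touching the queue once per sample.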

Q: "Is there a better approach?"

A: Well, my ultimate performance candidate would be this:

  • the sampler operates two or more separate, statically preallocated "circular" buffers: while one is being filled, the other is free to be sent, and vice versa
  • once the sampler reaches the end of the first buffer, it starts filling the second one and sends the first, then swaps again
  • ZeroMQ's zero-copy, non-blocking .send( zmq.NOBLOCK ) over an inproc:// transport class only maps memory pointers, without moving data in RAM. We can reduce complexity even further by sending the filled-up buffer directly from here to the client, without any mediating party (if none is otherwise needed), provided the storage is preallocated and static:
    with a numpy.array( ( bufferSize, nBuffersInRoundRobinCYCLE ), dtype = np.int32 ), we can send an already packed block of { int32 | int64 }-s or other dtype-mapped data via its .data buffer, cycling round-robin through the nBuffersInRoundRobinCYCLE separate in-place storage buffers. Filling them one after another while previously filled ones are efficiently .send( zmq.NOBLOCK )-sent "in the background" (behind the back of the GIL) provides sufficient latency masking.
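The round-robin buffering described above can be sketched like this. It is a simplified model, not a drop-in implementation: `send()` is a stand-in for the real `socket.send(buffers[i], zmq.NOBLOCK, copy=False)` call, and `acquire_stream` stands in for the hardware acquisition loop.

```python
import numpy as np

BUF_SIZE = 4096
N_BUFFERS = 2  # double buffering; more buffers mask more send latency

# statically preallocated round-robin buffers, as the answer suggests
buffers = np.zeros((N_BUFFERS, BUF_SIZE), dtype=np.int32)

sent = []  # demo stand-in for a ZeroMQ socket

def send(buf):
    # with pyzmq this would be: sock.send(buf, zmq.NOBLOCK, copy=False)
    # the demo copies the bytes only so we can inspect them afterwards
    sent.append(bytes(buf.data))

def acquire_stream(samples):
    """Fill the buffers in round-robin order; hand off each one as it fills."""
    i = fill = 0
    for s in samples:
        buffers[i, fill] = s
        fill += 1
        if fill == BUF_SIZE:
            send(buffers[i])          # ship the full buffer
            i = (i + 1) % N_BUFFERS   # switch to the next free buffer
            fill = 0

acquire_stream(range(2 * BUF_SIZE))
```

Note that with a true zero-copy `copy=False` send, a buffer must not be refilled while its transfer is still in flight; that is exactly why cycling through several preallocated buffers, rather than two, gives extra headroom.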

Tweaking the Python interpreter is left for the next level of detail of bleeding-edge performance candidates: disabling garbage collection entirely with gc.disable(), raising the interpreter's GIL thread-switch interval via sys.setswitchinterval() (the CPython 3 default is 5 ms) to something reasonably above, since frequent thread switching is no longer needed, and moving acquired samples in lump multiples of CPU words up to the CPU cache-line length (aligned, so as to reduce the fast-cache-to-slow-RAM cache-coherency memory-I/O updates).
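The interpreter tweaks mentioned above amount to just a few lines. The switch-interval value of 0.05 s is an arbitrary illustration, not a recommendation; measure before committing to it:

```python
import gc
import sys

# Disable cyclic garbage collection so GC pauses cannot interrupt
# the acquisition loop (re-enable it once acquisition is done).
gc.disable()

# Raise the GIL switch interval (CPython 3 default: 0.005 s) so the
# interpreter preempts threads less often; only sensible once the
# design no longer relies on frequent thread switching.
old = sys.getswitchinterval()
sys.setswitchinterval(0.05)

# ... acquisition work would run here ...

# restore the interpreter defaults afterwards
sys.setswitchinterval(old)
gc.enable()
```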

Flooded answered 10/1, 2022 at 11:24 Comment(0)
