Python threading queue is very slow
I acquire samples (integers) at a very high rate (several kilosamples per second) in a thread and put() them into a threading.Queue. The main thread get()s the samples one by one into a list of length 4096, packs them with msgpack and finally sends them via ZeroMQ to a client. The client displays the chunks on the screen (print or plot). In short, the original idea is: fill the queue with single samples, but empty it in large chunks.

Everything works exactly as expected, but the latter part, i.e. accessing the queue, is very slow: the queue keeps growing and the output always lags behind by several seconds to tens of seconds.

My question is: what can I do to make queue access faster? Is there a better approach?

Aurita answered 20/9, 2016 at 15:49 Comment(5)
Are you sure your bottleneck is the queue operations and not the client operation? – Launch
collections.deque is much faster than threading.Queue and its appends and pops are also thread-safe, but it does not have all the features. Maybe multiprocessing.dummy (which actually uses threads) is worth a look for you, too. – Twopiece
You could build complete lists of 4096 samples in the sampling thread and then put those lists into the Queue as single items - this would require far fewer of the comparatively slow Queue-method calls. – Twopiece
@Launch yes, I could check that: I temporarily sent data directly from the sampling thread to the client. Though not totally fast, it was much faster than with the queue, but of course at the cost of lost samples. – Aurita
@Twopiece: building lists in the sampling thread greatly improved the speed! It is nearly perfect now, thanks so much for the hint! I also changed the chunk size; the speed seems to depend on it as well, and the optimum appears to be 1024, not more, not less. I am going to check your other suggestions on deque and multiprocessing to see whether it can get any better. – Aurita
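The batching idea from the comments above can be sketched as follows. This is a minimal illustration, not the asker's actual code: `sampler` and the `range(...)` sample source are hypothetical stand-ins for the real acquisition loop, and the chunk size of 1024 is the value the asker reported as optimal.

```python
import threading
import queue

CHUNK = 1024  # chunk size the asker found to work best

def sampler(q, source, chunk=CHUNK):
    """Collect samples into a local list and put whole chunks on the queue.

    One q.put() per chunk replaces thousands of per-sample put() calls,
    which is where the original design lost its time.
    """
    buf = []
    for sample in source:
        buf.append(sample)
        if len(buf) >= chunk:
            q.put(buf)   # hand over the whole chunk in one queue operation
            buf = []
    if buf:
        q.put(buf)       # flush the final, partial chunk

# demo with a fake sample source standing in for the acquisition hardware
q = queue.Queue()
t = threading.Thread(target=sampler, args=(q, range(5000)))
t.start()
t.join()

chunks = []
while not q.empty():
    chunks.append(q.get())
```

The consumer then receives 1024-sample lists ready to be msgpack-ed and sent over ZeroMQ, without touching the queue once per sample.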

Q: "Is there a better approach?"

A: Well, my ultimate performance candidate would be this:

  • the sampler operates two or more separate, statically preallocated "circular" buffers: while one is being filled, the other is free to be sent, and vice versa
  • once the sampler reaches the end of the first buffer, it starts filling the second one and sends the first, then swaps again
  • ZeroMQ's zero-copy, non-blocking .send( zmq.NOBLOCK ) over an inproc:// transport class only maps memory pointers, without moving data in RAM. We can reduce complexity even further by sending the filled-up buffer directly from here to the client, without any mediating party (if none is otherwise needed), provided the storage is preallocated and static:
    with a numpy.array( ( bufferSize, nBuffersInRoundRobinCYCLE ), dtype = np.int32 ), we can send an already packed block of { int32 | int64 }-s or other dtype-mapped data via its .data buffer, cycling round-robin through the nBuffersInRoundRobinCYCLE separate in-place storage buffers. Filling them one after another while previously filled ones are efficiently .send( zmq.NOBLOCK )-sent "in the background" (behind the back of the GIL) provides sufficient latency masking.
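The round-robin buffering described above can be sketched like this. It is a simplified model, not a drop-in implementation: `send()` is a stand-in for the real `socket.send(buffers[i], zmq.NOBLOCK, copy=False)` call, and `acquire_stream` stands in for the hardware acquisition loop.

```python
import numpy as np

BUF_SIZE = 4096
N_BUFFERS = 2  # double buffering; more buffers mask more send latency

# statically preallocated round-robin buffers, as the answer suggests
buffers = np.zeros((N_BUFFERS, BUF_SIZE), dtype=np.int32)

sent = []  # demo stand-in for a ZeroMQ socket

def send(buf):
    # with pyzmq this would be: sock.send(buf, zmq.NOBLOCK, copy=False)
    # the demo copies the bytes only so we can inspect them afterwards
    sent.append(bytes(buf.data))

def acquire_stream(samples):
    """Fill the buffers in round-robin order; hand off each one as it fills."""
    i = fill = 0
    for s in samples:
        buffers[i, fill] = s
        fill += 1
        if fill == BUF_SIZE:
            send(buffers[i])          # ship the full buffer
            i = (i + 1) % N_BUFFERS   # switch to the next free buffer
            fill = 0

acquire_stream(range(2 * BUF_SIZE))
```

Note that with a true zero-copy `copy=False` send, a buffer must not be refilled while its transfer is still in flight; that is exactly why cycling through several preallocated buffers, rather than two, gives extra headroom.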

Tweaking the Python interpreter is left for the next level of detail of bleeding-edge performance candidates: disabling garbage collection entirely with gc.disable(), raising the interpreter's GIL thread-switch interval via sys.setswitchinterval() (the CPython 3 default is 5 ms) to something reasonably above, since frequent thread switching is no longer needed, and moving acquired samples in lump multiples of CPU words up to the CPU cache-line length (aligned, so as to reduce the fast-cache-to-slow-RAM cache-coherency memory-I/O updates).
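The interpreter tweaks mentioned above amount to just a few lines. The switch-interval value of 0.05 s is an arbitrary illustration, not a recommendation; measure before committing to it:

```python
import gc
import sys

# Disable cyclic garbage collection so GC pauses cannot interrupt
# the acquisition loop (re-enable it once acquisition is done).
gc.disable()

# Raise the GIL switch interval (CPython 3 default: 0.005 s) so the
# interpreter preempts threads less often; only sensible once the
# design no longer relies on frequent thread switching.
old = sys.getswitchinterval()
sys.setswitchinterval(0.05)

# ... acquisition work would run here ...

# restore the interpreter defaults afterwards
sys.setswitchinterval(old)
gc.enable()
```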

Flooded answered 10/1, 2022 at 11:24 Comment(0)
