Poor performance / lockup with STM

I'm writing a program where a large number of agents listen for events and react to them. Since Control.Concurrent.Chan.dupChan is deprecated, I decided to use TChans as advertised.

The performance of TChan is much worse than I expected. I have the following program that illustrates the issue:

{-# LANGUAGE BangPatterns #-}

module Main where

import Control.Concurrent.STM
import Control.Concurrent
import System.Random(randomRIO)
import Control.Monad(forever, when)

allCoords :: [(Int,Int)]
allCoords = [(x,y) | x <- [0..99], y <- [0..99]]

randomCoords :: IO (Int,Int)
randomCoords = do
  x <- randomRIO (0,99)
  y <- randomRIO (0,99)
  return (x,y)

main = do
  chan <- newTChanIO :: IO (TChan ((Int,Int),Int))

  let watcher p = do
         chan' <- atomically $ dupTChan chan
         forkIO $ forever $ do
                    r@(p',_counter) <- atomically $ readTChan chan'
                    when (p == p') (print r)
         return ()

  mapM_ watcher allCoords

  let go !cnt = do
       xy <- randomCoords
       atomically $ writeTChan chan (xy,cnt)
       go (cnt+1)

  go 1

When compiled (-O) and run, the program will first output something like this:

./tchantest
((0,25),341)
((0,33),523)
((0,33),654)
((0,35),196)
((0,48),181)
((0,48),446)
((1,15),676)
((1,50),260)
((1,78),561)
((2,30),622)
((2,38),383)
((2,41),365)
((2,50),596)
((2,57),194)
((3,19),259)
((3,27),344)
((3,33),65)
((3,37),124)
((3,49),109)
((3,72),91)
((3,87),637)
((3,96),14)
((4,0),34)
((4,17),390)
((4,73),381)
((4,74),217)
((4,78),150)
((5,7),476)
((5,27),207)
((5,47),197)
((5,49),543)
((5,53),641)
((5,58),175)
((5,70),497)
((5,88),421)
((5,89),617)
((6,0),15)
((6,4),322)
((6,16),661)
((6,18),405)
((6,30),526)
((6,50),183)
((6,61),528)
((7,0),74)
((7,28),479)
((7,66),418)
((7,72),318)
((7,79),101)
((7,84),462)
((7,98),669)
((8,5),126)
((8,64),113)
((8,77),154)
((8,83),265)
((9,4),253)
((9,26),220)
((9,41),255)
((9,63),51)
((9,64),229)
((9,73),621)
((9,76),384)
((9,92),569)
...

Then, at some point, it stops writing anything while still consuming 100% CPU.

((20,56),186)
((20,58),558)
((20,68),277)
((20,76),102)
((21,5),396)
((21,7),84)

With -threaded the lockup happens even faster, after only a handful of lines. It also consumes however many cores are made available through the RTS's -N flag.

Additionally, the performance seems rather poor: only about 100 events per second are processed.

Is this a bug in STM, or am I misunderstanding something about the semantics of STM?

Cypsela answered 22/6, 2011 at 12:36 Comment(3)
One thing you misunderstand is that Chan will wake up a single reader, while STM's TChan will wake all readers for each individual write. Aside from that, Neil Brown has a good suggestion for you in his answer. – Icecold
It's not the semantics of STM you misunderstand, it's the implementation. It's implemented with optimistic locking. This makes it appropriate in cases where you have many independent mutable cells and many transactions that want to update usually non-overlapping subsets of them. It also makes it very inappropriate in the case where every transaction touches the same mutable cell, like the TChan in this case. – Pigmy
Even where every transaction touches the same mutable cell, you can do pretty well as long as reads sufficiently dominate writes. – Ainslie

The program is going to perform quite badly. You're spawning off 10,000 threads all of which will queue up waiting for a single TVar to be written to. So once they're all going, you may well get this happening:

  1. Each of the 10,000 threads tries to read from the channel, finds it empty, and adds itself to the wait queue for the underlying TVar. So you'll have 10,000 queue-up events, and 10,000 processes in the wait queue for the TVar.
  2. Something is written to the channel. This will unqueue each of the 10,000 threads and put it back on the run-queue (this may be O(N) or O(1), depending on how the RTS is written).
  3. Each of the 10,000 threads must then process the item to see if it's interested in it, which most won't be.

So each item causes O(10,000) work. If you see 100 events per second, that means each thread requires about 1 microsecond to wake up, read a couple of TVars, write to one, and queue up again. That doesn't seem so unreasonable. I don't understand why the program would grind to a complete halt, though.

In general, I would scrap this design and replace it as follows:

Have a single thread reading the event channel, which maintains a map from coordinate to interested-receiver-channel. The single thread can then pick out the receiver(s) from the map in O(log N) time (much better than O(N), and with a much smaller constant factor involved), and send the event to just the interested receiver. So you perform just one or two communications to the interested party, rather than 10,000 communications to everyone. A list-based form of the idea is written in CHP in section 5.4 of this paper: http://chplib.files.wordpress.com/2011/05/chp.pdf
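For concreteness, a rough STM sketch of that design might look like the following (the structure and names are mine, not code from the answer; it keeps the question's allCoords, randomCoords and writer loop, but gives each watcher a private TChan registered in a Map, with one dispatcher thread doing an O(log n) lookup per event):

{-# LANGUAGE BangPatterns #-}

module Main where

import           Control.Concurrent (forkIO)
import           Control.Concurrent.STM
import           Control.Monad (forever, forM)
import qualified Data.Map.Strict as Map
import           System.Random (randomRIO)

type Coord = (Int, Int)
type Event = (Coord, Int)

allCoords :: [Coord]
allCoords = [(x, y) | x <- [0..99], y <- [0..99]]

randomCoords :: IO Coord
randomCoords = do
  x <- randomRIO (0, 99)
  y <- randomRIO (0, 99)
  return (x, y)

main :: IO ()
main = do
  eventChan <- newTChanIO :: IO (TChan Event)

  -- One private channel per coordinate, each drained by its own watcher thread.
  -- The watchers no longer filter; they only ever see events meant for them.
  pairs <- forM allCoords $ \p -> do
    ch <- newTChanIO
    _  <- forkIO $ forever $ atomically (readTChan ch) >>= print
    return (p, ch)
  let receivers = Map.fromList pairs

  -- The single dispatcher: one O(log n) Map lookup per event instead of
  -- waking all 10,000 reader threads on every write.
  _ <- forkIO $ forever $ do
    ev@(p, _) <- atomically $ readTChan eventChan
    case Map.lookup p receivers of
      Just ch -> atomically $ writeTChan ch ev
      Nothing -> return ()

  -- Same writer loop as in the question.
  let go !cnt = do
        xy <- randomCoords
        atomically $ writeTChan eventChan (xy, cnt)
        go (cnt + 1)
  go 1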

Chargeable answered 22/6, 2011 at 13:22 Comment(0)

This is a great test case! I think you've actually created a rare instance of genuine livelock/starvation. We can test this by compiling with -eventlog and running with -vst, or by compiling with -debug and running with -Ds. We see that even as the program "hangs", the runtime is still working like crazy, jumping between blocked threads.

The high-level reason is that you have one (fast) writer and many (fast) readers. The readers and the writer all need to access the same TVar representing the end of the queue. Nondeterministically, one thread succeeds and all the others fail when this happens. Now, as the number of threads in contention grows to 100*100, the probability of any given reader making progress rapidly goes towards zero. In the meantime, the writer in fact takes longer in its access to that TVar than the readers do, which makes things worse for it.

In this instance, putting a tiny throttle between each invocation of go for the writer (say, threadDelay 100) is enough to fix the problem. It gives the readers enough time to all block between successive writes, and so eliminates the livelock. However, I do think that it would be an interesting problem to improve the behavior of the runtime scheduler to deal with situations like this.
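Concretely, the throttle is a two-line change to the writer loop from the question (a sketch, not necessarily the exact change tested; threadDelay takes microseconds and comes from Control.Concurrent, which the program already imports):

  let go !cnt = do
       xy <- randomCoords
       atomically $ writeTChan chan (xy,cnt)
       threadDelay 100  -- ~100 µs pause: lets the 10,000 readers block again before the next write
       go (cnt+1)

  go 1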

Ainslie answered 22/6, 2011 at 14:43 Comment(2)
The runtime scheduler can't deal with this in the general case of wanting STM actions to ever execute concurrently. It's just the nature of optimistic locking: it falls over under contention. It's theoretically possible to provide alternate STM implementations with other methods of controlling concurrency; pessimistic locking would make this case not fail so hard. But there's been no real progress in that direction. – Pigmy
@Pigmy -- there's the possibility of using an alternate scheduler that has more sophisticated policies, such as exponential backoff. You could still have problems with malicious code, but at least obviously pathological cases like this one could be avoided. – Ainslie

Adding to what Neil said, your code also has a space leak (noticeable with smaller n):

[heap profile: space leak]

After fixing the obvious tuple build-up issue by making the tuples strict, I was left with the following profile:

[heap profile: strict tuples]

What's happening here, I think, is that the main thread is writing data to the shared TChan faster than the worker threads can read it (TChan, like Chan, is unbounded). So the worker threads spend most of their time re-executing their respective STM transactions, while the main thread is busy stuffing even more data into the channel; this explains why your program hangs.
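For what it's worth, here is one way the "strict tuples" fix could look in the question's writer loop (a sketch of my reading of the fix, not necessarily the exact change the answerer made; it relies on the BangPatterns extension the program already enables):

  let go !cnt = do
       (!x, !y) <- randomCoords          -- force the coordinates as they arrive
       let !msg = ((x, y), cnt)          -- build the message from already-evaluated parts
       atomically $ writeTChan chan msg  -- the channel now holds evaluated pairs, not thunks
       go (cnt+1)

  go 1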

Yirinec answered 22/6, 2011 at 14:23 Comment(3)
One solution to this problem is using a BOUNDED TChan. I developed one a while back and released it as the package bounded-tchan, but these days you might want to use Wren's awesome and more complete package stm-chans (specifically, Control.Concurrent.STM.TBChan). – Icecold
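A rough sketch of what a bounded channel buys you, using stm-chans (the function names are from my recollection of the TBChan API, so treat them as assumptions and check the package docs; also note that TBChan has no dupTChan-style broadcast, so it fits the single-reader dispatcher design from Neil's answer rather than the original 10,000-reader layout):

import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBChan (TBChan, newTBChanIO, readTBChan, writeTBChan)

main :: IO ()
main = do
  -- hold at most 1000 unread events; writeTBChan retries (blocks) when the
  -- channel is full, so a fast writer gets back-pressure instead of an
  -- ever-growing queue
  chan <- newTBChanIO 1000 :: IO (TBChan ((Int, Int), Int))
  atomically $ writeTBChan chan ((0, 0), 1)
  ev <- atomically $ readTBChan chan
  print ev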
Well, I didn't read your answer carefully enough, so I didn't see you were talking about evaluation of the tuple and not an ever-growing STM TChan. I'll leave my previous comment just because bounded chans are useful, but I realize it is off the mark. – Icecold
@Thomas I was talking about both. Or maybe you read my answer before I edited it. – Yirinec
