Experiences with Clojure STM for large datasets?
I need to decide whether to use STM in a Clojure system I am involved with, in which several GB of data would be stored in a single STM ref.

I would like to hear from anyone with experience using Clojure STM with large indexed datasets.

Constituent answered 30/12, 2010 at 13:1
I've been using Clojure for some fairly large-scale data processing tasks (definitely gigabytes of data, typically lots of largish Java arrays stored inside various Clojure constructs/STM refs).

As long as everything fits in available memory, you shouldn't have a problem with extremely large amounts of data in a single ref. The ref itself applies only a small fixed amount of STM overhead that is independent of the size of whatever is contained within it.
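To make that concrete, here is a minimal sketch (the names `dataset` and `add-record!` are hypothetical, not from the original answer) of a single ref holding an arbitrarily large map. The transaction only swaps the ref's root pointer, so STM overhead does not grow with the size of the contained value:

```clojure
;; Hypothetical sketch: one ref holding the whole dataset.
;; STM cost is per-transaction, not per-byte of the contained value.
(def dataset (ref {}))  ; could hold gigabytes of data

(defn add-record!
  "Store record under key k. The transaction swaps a single root
   reference; the size of the map does not affect STM overhead."
  [k record]
  (dosync
    (alter dataset assoc k record)))

(add-record! :a {:payload (vec (range 1000))})
(count @dataset)  ; => 1
```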

A nice extra bonus comes from the structural sharing that is built into Clojure's standard data structures (maps, vectors etc.) - you can take a complete copy of a 10GB data structure, change one element anywhere in the structure, and be guaranteed that both data structures will together only require a fraction more than 10GB. This is very helpful, particularly if you consider that due to STM/concurrency you will potentially have several different versions of the data being created at any one time.
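Structural sharing is easy to observe directly. In this sketch (variable names are illustrative), `assoc`-ing a new key onto a map reuses the existing values by reference rather than copying them:

```clojure
;; Structural sharing: big2 is a "copy" of big with one extra key,
;; but the large vectors inside it are shared, not duplicated.
(def big  {:a (vec (range 100000))
           :b (vec (range 100000))})
(def big2 (assoc big :c 42))

;; identical? tests reference equality: the vector under :a is the
;; same object in both maps, so no 100k-element copy was made.
(identical? (:a big) (:a big2))  ; => true
```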

Janes answered 7/1, 2011 at 19:51
Nice answer. What is the read/write access pattern of your application, and the transaction retry rate? Also, do you use one ref, or multiple refs? — Constituent
I have lots of readers but not much write contention — typically only one writer. I haven't benchmarked the transaction retry rate, but I suspect it is pretty low. I use one ref per logical identity, e.g. "the list of all processing results so far", which gets appended to when various tasks complete. — Janes
Performance won't be any worse, or any better, than STM involving a single ref with a small dataset. Performance is limited more by the rate of updates to a dataset than by the dataset's actual size.

If you have one writer to the dataset and many readers, performance will still be quite good. However, if you have many writers, performance will suffer.
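The reader/writer asymmetry follows from how refs work: readers use plain `deref` and never block or retry, while each writer runs a transaction that can be forced to retry by concurrent writes. A small sketch (names here are illustrative, not from the answer):

```clojure
;; Hypothetical sketch of the one-writer/many-readers pattern.
(def results (ref []))

(defn record-result!
  "The lone writer: appends inside a transaction. With a single
   writer there is nothing to conflict with, so retries are rare."
  [r]
  (dosync (alter results conj r)))

(defn snapshot
  "Readers deref outside any transaction: a consistent point-in-time
   view that never blocks the writer and is never retried."
  []
  @results)

(record-result! {:task 1 :status :done})
(count (snapshot))  ; => 1
```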

Perhaps more information would help us help you more.

Snip answered 6/1, 2011 at 22:29
I am expecting varied usage patterns, and I just wanted to hear general experiences to get a feel for how refs perform in different situations. But your information was useful, thanks. — Constituent