When should I choose Vector in Scala?
Asked Answered
I

6

217

It seems that Vector was late to the Scala collections party, and all the influential blog posts had already left.

In Java ArrayList is the default collection - I might use LinkedList but only when I've thought through an algorithm and care enough to optimise. In Scala should I be using Vector as my default Seq, or trying to work out when List is actually more appropriate?

Ite answered 3/8, 2011 at 14:45 Comment(6)
I guess what I mean here is that in Java I would create write List<String> l = new ArrayList<String>() Scala blogs would have you believe that everybody uses List in order to get persistent collection goodness - but is Vector general-purpose enough that we should be using it in List's place?Ite
@Debilski: I am wondering what you mean by that. I get a List when I type Seq() at REPL.Amritsar
Hmm, well, it says so in the docs. Maybe this is only true for IndexedSeq.Calumnious
The comment regarding the default concrete type of Seq is over three years old. As of Scala 2.11.4 (and earlier), the default concrete type of Seq is List.Pfeiffer
For random access, vector is better. For head, tail access, list is better. For bulk operation, such as map, filter, vector is preferred since vector is organized with 32 elements as a chunk whereas list organized the elements with pointers to each other there is no guarantee these elements are close to each other.Independency
Great answers in this page. I would love to hear some more thoughts about how Set and Map compare to other Scala containers, do you feel they usually require good reason to be used? For instance, even if you have a proper unordered set of unique elements, but you are updating more often than you are testing the elements, should you expect Vector to have a significantly better preference?Diphyodont
N
303

As a general rule, default to using Vector. It’s faster than List for almost everything and more memory-efficient for larger-than-trivial sized sequences. See this documentation of the relative performance of Vector compared to the other collections. There are some downsides to going with Vector. Specifically:

  • Updates at the head are slower than List (though not by as much as you might think)

Another downside before Scala 2.10 was that pattern matching support was better for List, but this was rectified in 2.10 with generalized +: and :+ extractors.

There is also a more abstract, algebraic way of approaching this question: what sort of sequence do you conceptually have? Also, what are you conceptually doing with it? If I see a function that returns an Option[A], I know that function has some holes in its domain (and is thus partial). We can apply this same logic to collections.

If I have a sequence of type List[A], I am effectively asserting two things. First, my algorithm (and data) is entirely stack-structured. Second, I am asserting that the only things I’m going to do with this collection are full, O(n) traversals. These two really go hand-in-hand. Conversely, if I have something of type Vector[A], the only thing I am asserting is that my data has a well defined order and a finite length. Thus, the assertions are weaker with Vector, and this leads to its greater flexibility.

Naivete answered 3/8, 2011 at 22:35 Comment(8)
2.10 has been out for a while now, is the List pattern matching still better than Vector?Lingonberry
The list pattern matching is not better anymore. In fact, it's quite the contrary. For example, to get head and tail one can do case head +: tail or case tail :+ head. To match against empty, you can do case Seq() and so forth. Everything you need is there in the API, which is more versatile than List'sSimonton
List is implemented with a singly-linked list. Vector is implemented something like Java's ArrayList.Uppish
@JosiahYoder It is implemented nothing like ArrayList. ArrayList wraps an array which it dynamically resizes. Vector is a trie, where the keys are the indexes of values.Beseem
I apologize. I was going on a web-source that was vague about the details. Should I correct my earlier statement? Or is that bad form?Uppish
@JosiahYoder I wouldn't worry about it. It's not a big deal, it only really matters if someone is curious about how it works behind the scenes. For most, I think its the performance asymptotics listed in the answer that will be the most crucial.Beseem
It doesn't seem true that Vector has weaker assertions than List. Vector provides O(1) random access. Vector still allows full O(n) traversal just like List. Can you clarify?Corselet
They're referring to conceptual assertions that the programmer makes when they choose a data structure. Because of List's performance characteristics, when you use it, you are implicitly asserting that all the computation happens at the head of the List, whereas with Vector, you are not making this assertion.Carse
M
102

Well, a List can be incredibly fast if the algorithm can be implemented solely with ::, head and tail. I had an object lesson of that very recently, when I beat Java's split by generating a List instead of an Array, and couldn't beat that with anything else.

However, List has a fundamental problem: it doesn't work with parallel algorithms. I cannot split a List into multiple segments, or concatenate it back, in an efficient manner.

There are other kinds of collections that can handle parallelism much better -- and Vector is one of them. Vector also has great locality -- which List doesn't -- which can be a real plus for some algorithms.

So, all things considered, Vector is the best choice unless you have specific considerations that make one of the other collections preferable -- for example, you might choose Stream if you want lazy evaluation and caching (Iterator is faster but doesn't cache), or List if the algorithm is naturally implemented with the operations I mentioned.

By the way, it is preferable to use Seq or IndexedSeq unless you want a specific piece of API (such as List's ::), or even GenSeq or GenIndexedSeq if your algorithm can be run in parallel.

Match answered 3/8, 2011 at 23:6 Comment(13)
Thanks for the answer. What do you mean by "has great locality"?Heriot
@ngocdaothanh It means that data is grouped closely together in memory, improving the chance that data will be in the cache when you need it.Match
What do you mean by 'incredibly fast if the algorithm.... '? Lists are created using pointers Vectors are contiguous spaces (Updated in multiples of 2 when need comes, and all actions are amortized O(1) ) Contiguous spaces => low cache misses. Pointers => cache misses. Am I missing something, or do you have any substantial meaning when you say 'incredibly fast'?Kaleena
@user247077 Yes, Lists can beat Vectors in performance given the particulars I mentioned. And not all actions of vectors are amortized O(1). In fact, on immutable data structures (which is the case), alternate insert/deletions at either end will not amortize at all. In that case, the cache is useless because you are always copying the vector.Match
@DanielC.Sobral Immutable, sure Lists will beat vectors. Mutable- I would be really surprised if you can give me a program where lists beat vectors.Kaleena
@user247077 Perhaps you are not aware that Vector is an immutable data structure in Scala?Match
@DanielC.Sobral Clearly I'm a fool. Apologies for wasting your time. Although makes me wonder, is this vector not implementEd in the way I talked above? i.e. space allocated is 2x of current required, and insertions occur at the end of this contiguous space and space doubled everytime there is need.. ?Kaleena
@user247077 It's way more complicated than that, including some internally mutable stuff to make append cheaper, but when you use it as a stack, which is immutable list optimal scenario, you still end up having the same memory characteristics of a linked list, but with a much bigger memory allocation profile.Match
@user247077 Ah, yes, Vector beats List in most scenarios. Just not all.Match
@DanielC.Sobral Ok, hnece proved that I AM a fool. Indeed immutability means that the initial pointer doesn't change. Otherwise looks like it is implemented as I originally thought. In which case, I'd be very curious to know if list can bit vectors for any non-trivial operations. Vectors will have cache locality, and hence read following reads would be fast. If list is created in one-go, they would be cache local too, but I don't see how they can be vectors.. Dan, thanks again for answering questions patiently. Any chance you have an example where list beats vectors..?Kaleena
@user247077 Ah, but there's the key: read followed by read. Algorithms where the most common operation is pop/compute/push do not fit that. For instance, tree search. One could go mutable, but immutable gives you options such as undo, save&continue, parallelism, and multiple solutions.Match
"Vector also has great locality", where can I read more about the locality in this context?Classy
@Classy Start with this article, and go on from there. Learn the concept, then learn how to apply it, and finally how Vector and List differ in this respect.Match
E
47

Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList. List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures. There is no reasonable default choice as their might be for mutable data structures, but it rather depends on what your algorithm is doing. List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32. Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear. List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.

This might look like as if Vector was a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for vector due to its more complicated structure. I made a few measurements, so iteration is about twice as fast for list, prepend is about 100 times faster on lists, decomposition in head/tail is about 10 times faster on lists and generation from a traversable is about 2 times faster for vectors. (This is probably, because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one). Of course all operations that take linear time on lists but effectively constant time on vectors (as random access or append) will be prohibitively slow on large lists.

So which data structure should we use? Basically, there are four common cases:

  • We only need to transform sequences by operations like map, filter, fold etc: basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize.
  • We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow.
  • We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more.
  • We have an performance critical algorithm that is basically imperative and does a lot of random access on a list, something like in place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
Evoy answered 31/1, 2016 at 14:59 Comment(0)
A
25

For immutable collections, if you want a sequence, your main decision is whether to use an IndexedSeq or a LinearSeq, which give different guarantees for performance. An IndexedSeq provides fast random-access of elements and a fast length operation. A LinearSeq provides fast access only to the first element via head, but also has a fast tail operation. (Taken from the Seq documentation.)

For an IndexedSeq you would normally choose a Vector. Ranges and WrappedStrings are also IndexedSeqs.

For a LinearSeq you would normally choose a List or its lazy equivalent Stream. Other examples are Queues and Stacks.

So in Java terms, ArrayList used similarly to Scala's Vector, and LinkedList similarly to Scala's List. But in Scala I would tend to use List more often than Vector, because Scala has much better support for functions that include traversal of the sequence, like mapping, folding, iterating etc. You will tend to use these functions to manipulate the list as a whole, rather than randomly accessing individual elements.

Amelia answered 3/8, 2011 at 17:41 Comment(4)
But if Vector's iteration is faster than List's, and I can map fold etc as well, then apart from some specialised cases (essentially all those FP algorithms that are specialised to List) it seems that List is essentially legacy.Ite
@Duncan where have you heard that Vector's iteration is faster? For a start, you need to keep track of and update the current index, which you don't need to with a linked list. I wouldn't call the list functions "specialised cases" - they are the bread and butter of functional programming. Not using them would be like trying to program Java without for- or while-loops.Amelia
I'm pretty sure Vector's iteration is faster, but someone needs to benchmark it to be sure.Naivete
I think (?) elements in Vector physically exist together on RAM in groups of 32, which more fully fit in the CPU cache... so there's less cache missConservation
C
2

In situations which involve a lot random access and random mutation, a Vector (or – as the docs say – a Seq) seems to be a good compromise. This is also what the performance characteristics suggest.

Also, the Vector class seems to play nicely in distributed environments without much data duplication because there is no need to do a copy-on-write for the complete object. (See: http://akka.io/docs/akka/1.1.3/scala/stm.html#persistent-datastructures)

Calumnious answered 3/8, 2011 at 15:51 Comment(3)
So much to learn... What does Vector being the default Seq mean? If I write Seq(1, 2, 3) I get List[Int] not Vector[Int].Ite
If you have random access, use an IndexedSeq. Which is also Vector, but that's another matter.Match
@DuncanMcGregor: Vector is the default IndexedSeq which implements Seq. Seq(1, 2, 3) is a LinearSeq which is implemented using List.Alphosis
O
2

If you're programming immutably and need random access, Seq is the way to go (unless you want a Set, which you often actually do). Otherwise List works well, except it's operations can't be parallelized.

If you don't need immutable data structures, stick with ArrayBuffer since it's the Scala equivalent to ArrayList.

Orang answered 3/8, 2011 at 16:3 Comment(3)
I'm sticking to the realm of immutable, persistent collections. My point is, that even if I don't need random access, has Vector effectively replaced List?Ite
Depends a bit on the use case. Vectors are more balanced. The iteration is faster than list and random access is much faster. Updates are slower since it's not just a list prepend, unless it's a bulk update from a fold which can be done with a builder. That said, I think Vector is the best default choice since it is so versatile.Orang
That I think gets to the heart of my question - Vectors are so good that we may as well use them where examples usually show List.Ite

© 2022 - 2024 — McMap. All rights reserved.