This has been discussed on the Clojure Google group; see for example the thread "map semantics" from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
1. some way of iterating over their input;
2. applying a function to each element of the input;
3. producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but would make traversals terrifyingly slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
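To make the distinction concrete, here is a small REPL sketch. The names v and ss are purely illustrative, and the concrete seq classes behind these results are implementation details that may differ between Clojure versions.

```clojure
;; Illustrative values only; v and ss are hypothetical names.
(def v  [1 2 3 4 5])
(def ss (sorted-set 3 1 2))

;; A seq over a vector is a chunked seq, not a vector.
(chunked-seq? (seq v))   ; true
(vector? (seq v))        ; false
(vector? (next v))       ; false, no "view vector" is allocated

;; A seq over a sorted set walks the elements in order
;; and is not itself a set.
(seq ss)                 ; (1 2 3)
(set? (seq ss))          ; false
```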
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing "real vectors" for filtering operations would absolutely destroy performance.)
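A quick sketch of what the pipeline actually returns, using inc and odd? as stand-ins for f and p? (both choices are assumptions made for illustration). The core functions mapv and filterv are the eager counterparts; they build a fresh vector from the traversal rather than removing elements from the input.

```clojure
(def some-vector [1 2 3 4 5 6])

;; The seq-based pipeline yields a lazy seq, not a vector.
(def lazy-result (->> some-vector (map inc) (filter odd?)))
(vector? lazy-result)   ; false
lazy-result             ; (3 5 7)

;; mapv / filterv produce vectors eagerly, each by building a new
;; vector, not by deleting elements from the existing one.
(def eager-result (->> some-vector (mapv inc) (filterv odd?)))
(vector? eager-result)  ; true
eager-result            ; [3 5 7]
```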
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
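The early-stopping behaviour can be observed with a call counter. This is only a sketch: counting-square and calls are hypothetical helpers, even? stands in for p?, and the exact number of calls depends on whether the underlying seq is chunked, so the comment only claims it stays well below the size of the set.

```clojure
;; Hypothetical helper that counts how often the mapped function runs.
(def calls (atom 0))
(defn counting-square [x]
  (swap! calls inc)
  (* x x))

(def some-sorted-set (into (sorted-set) (range 1000)))

(def result
  (doall (->> some-sorted-set
              (filter even?)
              (map counting-square)
              (take 3))))

result   ; (0 4 16)
@calls   ; far fewer than 1000: the pipeline stops once take is satisfied
```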
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
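Both issues are easy to reproduce. In the sketch below the mapped functions are made up for illustration: the first sends several inputs to the same output, the second returns a mix of numbers and strings that the default comparator cannot order.

```clojure
;; "Gluing together": four inputs collapse to two outputs.
(def glued (into #{} (map #(quot % 2)) #{0 1 2 3}))
glued            ; #{0 1}
(count glued)    ; 2

;; A function whose outputs the input set's comparator cannot compare.
(def s (sorted-set 1 2 3))
(def outcome
  (try
    (into (empty s) (map #(if (even? %) % (str %))) s)
    (catch ClassCastException _
      :comparator-failed)))
outcome          ; :comparator-failed
```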
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us map and into to produce a set from a set, while leaving us the freedom to not go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
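A minimal sketch of this composition, with inc standing in for an arbitrary function and input-set as an illustrative name:

```clojure
(def input-set #{1 2 3})

;; map, then into: a set in, a set out, stated explicitly.
(def as-set (into (empty input-set) (map inc input-set)))
as-set   ; #{2 3 4}

;; Skipping the into stage when only a reduced value is needed.
(def total (reduce + (map inc input-set)))
total    ; 9

;; The same pattern for maps: map over entries, pour into {}.
(def inc-vals (into {} (map (fn [[k v]] [k (inc v)]) {:a 1 :b 2})))
inc-vals ; {:a 2, :b 3}
```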
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that clojure.core.reducers/map and friends produce only "descriptor" objects that record what computations need to be performed if and when the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results and we may in some situations wish to preserve them) or a scalar value ((reduce + 0)).
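With concrete values the three endings look like this. The input set and the functions are assumptions chosen for illustration; the vector result is sorted in the comment because iteration order over a hash set is not something to rely on.

```clojure
(require '[clojure.core.reducers :as r])

(def some-set #{1 2 3 4})

;; End the pipeline in a set.
(def as-set (->> some-set (r/map inc) (r/filter even?) (into #{})))
as-set      ; #{2 4}

;; End it in a vector, preserving duplicate results.
(def as-vector (into [] (r/map #(quot % 2) some-set)))
(sort as-vector)   ; (0 1 1 2)

;; End it in a scalar.
(def total (reduce + 0 (r/map inc some-set)))
total       ; 14
```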
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve producing intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, in their choice to deal with seqs, depending on the scenario, may avoid performance penalties without upsides, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want a return value that matches the input type, we're simply forced to be explicit about it and that, in itself, may be viewed as a good thing.
filter would not need to operate on seq. Kinda proves my point. – Susie