(sorry for the long post, to skip directly to the question(s), see bottom)
(UPDATE: if you are revisiting, please see sections marked "update" ;)
I set out to understand better what is going on under the hood with F# sequences. A task I needed to optimize involved converting strings into a sequence of Unicode codepoints, and I was wondering if I could replace the mutable loop we were using with an immutable one without sacrificing too much performance.
One of the challenges is that the returned sequence does not have the same length as the input sequence, because surrogate pairs combine into a single integer codepoint. The original code looked like this:
    let stocp0 (value: string) : seq<int> =
        let pos = ref 0
        seq {
            while !pos < value.Length do
                let c = System.Char.ConvertToUtf32(value, !pos)
                pos := !pos + (if c >= 0x10000 then 2 else 1)
                yield c
        }
Attempt 1: for-do
I figured that the easiest thing to do was to turn it into a for-do loop (not a for-in-do loop, they have too much extra overhead):
    let inline stocp1 (value: string) =
        seq {
            for i = 0 to value.Length - 1 do
                if not (Char.IsLowSurrogate(value.[i])) then
                    yield Char.ConvertToUtf32(value, i)
        }
This performed 3.2 times slower than the mutable counterpart above, a higher factor than I had imagined.
Attempt 2: Seq.mapi
Since a string already is a sequence (well, there's an IEnumerable<char> wrapper), I thought I'd utilize the existing sequence functions from the Seq module, hoping that would bring better performance:
    let inline stocp2 (value: string) =
        value
        |> Seq.mapi (fun i c ->
            if Char.IsLowSurrogate(c) then 0
            else Char.ConvertToUtf32(value, i))
        |> Seq.filter ((<>) 0)
It performed 3.5 times slower.
Strangely, if I replace value with value.AsEnumerable(), it performs significantly faster than stocp1: factor 3.0.
After several more tests it became clear to me that each |> creates a new IEnumerable<T> layer, with all the chained calls that entails (this can also be observed in the F# source code of the Seq module). But the size of the overhead surprised me. Since none of the above came even remotely close to the performance of the original, I decided to try to prevent the extra sequence overhead from kicking in and to create a Seq.mapiAndFilter function that does both actions at once.
Attempt 3: Seq.mapiAndFilter
Since it is such a delicate loop and I only need to filter on the current character and yield based on the current position, perhaps I could remove the extra step involved with Seq.mapi, which seemed expensive. To do that, I needed to mimic the behavior of the existing Seq.xxx functions, and my first attempt was a while-yield loop. This is closest to the original mutable approach, but it adds one layer of IEnumerable<T> overhead.
I wrote the following function, which takes a predicate and a mapping function: if the predicate returns true for the current element, the mapping is applied to the position of that element.
    let inline mapiAndFilter f g (e: seq<_>) : seq<_> =
        let position = ref -1
        let e = e.GetEnumerator()
        seq {
            while e.MoveNext() do
                position := !position + 1
                if f e.Current then yield g !position
        }
    // and the application of it:
    let inline stocp3 (value: string) =
        value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32(value, i))
The result was much better than the previous attempts: it clocked in at 1.5 times the running time of the mutable solution. Still disappointingly slow, though; it seems to imply that the enumerator overhead amounts to about 50% in tight loops.
Attempt 4: improved Seq.mapiAndFilter
To find out what was going on under the hood, I then decided to write the enumerable type explicitly, which should give me the opportunity to find out whether any boilerplate checks added by the F# libraries had something to do with the poor performance characteristics.
Without the safeguards the F# Seq functions use internally (to raise an error on illegal usage of Current etc.), I came up with this:
    let mapiAndFilter f g (e: seq<'a>) : seq<'b> =
        let i = ref -1
        let e = e.GetEnumerator()
        let inline getEnum () =
            { new IEnumerator<'b> with
                member x.Current = g !i
              interface System.Collections.IEnumerator with
                member x.Current = box <| g !i
                member x.MoveNext() =
                    let rec next () =
                        i := !i + 1
                        e.MoveNext() && (f e.Current || next ())
                    next ()
                member x.Reset() = raise (System.NotSupportedException "Reset is not supported")
              interface System.IDisposable with
                member x.Dispose() = e.Dispose() }
        { new System.Collections.Generic.IEnumerable<'b> with
            member x.GetEnumerator() = getEnum ()
          interface System.Collections.IEnumerable with
            member x.GetEnumerator() = getEnum () :> System.Collections.IEnumerator }
    // apply the same way as before:
    let inline stocp4 (value: string) =
        value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32(value, i))
This became our current winner! It clocked in at only 1.1 times slower than the original mutable function. Of course, it uses mutable state, but so do all the Seq.xxx functions internally anyway.
Performance comparison overview
General note on all the attempts above: I also tested with ToCharArray(), which improves performance on small-to-medium input but becomes detrimental on large input strings, especially when not all items need to be enumerated. I left out many other approaches because their performance was so much worse (Seq.choose over Seq.filter is much slower, Seq.collect is very slow, etc.).
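For reference, the Seq.choose variant alluded to above might look something like the following sketch. This is my reconstruction of that shape, not the exact code that was benchmarked, and stocpChoose is a hypothetical name:

```fsharp
open System

// Sketch of a Seq.choose-based variant (assumed shape, not the benchmarked
// code): pair each char with its index, then map and filter in one pass.
let stocpChoose (value: string) : seq<int> =
    value
    |> Seq.mapi (fun i c -> i, c)
    |> Seq.choose (fun (i, c) ->
        if Char.IsLowSurrogate c then None
        else Some(Char.ConvertToUtf32(value, i)))
```

The per-element tuple and option allocations are one plausible reason such a variant measures slower than the filter-based versions.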
I used the following for the performance comparison (apparently Seq.length is the fastest way to force iteration; Seq.last and Seq.iter are much slower):
let input = "ab\U0001ABCDcde\U0001ABCEfghi\U0001ABCF"
let run f = for i in 1 .. 1000000 do f input |> Seq.length |> ignore;;
run stocp1
// etc
Results:
Function    CPU      Factor
stocp0      0.296    1.0
stocp1      0.951    3.2
stocp2      1.029    3.5
stocp2'     0.873    3.0
stocp3      0.436    1.5
stocp4      0.327    1.1
stocp5      0.405    1.3    (latkin's answer, adjusted with Array.toSeq)
stocp2' is the version that uses AsEnumerable() on the string prior to passing it to the Seq.xxx functions; all later functions already use this.
I also tested with longer and with very large (50 MB) strings, which is our typical use case, and while the timings are less steady on subsequent runs, the effective factors are about the same as above.
Update: I added latkin's answer as stocp5, but had to adjust it by appending an Array.toSeq. Without it, it clocks in at 0.234, which is faster than the original while loop. Unfortunately, I require a sequence (we must use lazy loading and cannot hold whole strings in memory).
(Update) Performance comparison incl element access
The above comparison only tests iteration, which helps find the issues caused by the stacked iterators. However, the timings are a little different if you add element access to the equation. I enforced it by adding a Seq.map id:
let runmap f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.length |> ignore;;
Results:
Function    CPU      Factor
stocp0      0.795    1.0
stocp1      1.528    1.9
stocp2      1.731    2.2
stocp2'     1.812    2.3
stocp3      0.936    1.2
stocp4      0.842    1.1
stocp5      0.873    1.1    (credit: latkin, see his answer and notes above)
(Update) Performance comparison incl limited element access
Since our typical use cases do not require full iteration, I added a test that iterates only up to the second surrogate pair, at position 6, with much larger input (3,932,160 characters).
let runmapnth f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.nth 6 |> ignore;;
Results:
Function    CPU      Factor
stocp0      0.624    1.0
stocp1      1.029    1.6
stocp2      1.263    2.0
stocp2'     1.107    1.8
stocp3      0.717    1.1
stocp4      0.624    1.0
stocp5      ---      ---    OOM
The OutOfMemoryException with latkin's answer surprised me a bit; it means that the created arrays were not cleaned up while being used in a tight loop like the one above. My machine allocated 8 GB several times within a few seconds, with drops (GC?) in between, but in the end it still failed. Strange.
The other performance characteristics are as expected based on the earlier observations.
Conclusion, questions
With the last exercise above, I found out something I hadn't expected: the F# compiler only calls the non-generic IEnumerator.Current and never calls IEnumerator<T>.Current. This may explain in part why the performance degradation with chained sequence filters is so noticeable when the elements are value types: the boxing moves them to the heap and back again, which is terrible.
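To double-check which Current overload actually gets invoked, one can wrap a sequence in a diagnostic enumerator that logs each access; a minimal sketch, where probe is my own hypothetical helper structured like mapiAndFilter above:

```fsharp
open System.Collections.Generic

// Hypothetical diagnostic wrapper: logs which Current overload the consumer
// calls, making it visible whether the boxing (non-generic) path is taken.
let probe (source: seq<'a>) : seq<'a> =
    let getEnum () =
        let e = source.GetEnumerator()
        { new IEnumerator<'a> with
            member x.Current =
                printfn "generic Current"       // no boxing for value types
                e.Current
          interface System.Collections.IEnumerator with
            member x.Current =
                printfn "non-generic Current"   // boxes value types
                box e.Current
            member x.MoveNext() = e.MoveNext()
            member x.Reset() = ()
          interface System.IDisposable with
            member x.Dispose() = e.Dispose() }
    { new IEnumerable<'a> with
        member x.GetEnumerator() = getEnum ()
      interface System.Collections.IEnumerable with
        member x.GetEnumerator() = getEnum () :> System.Collections.IEnumerator }
```

Running e.g. `probe [1; 2; 3] |> Seq.sum` under FSI versus in a compiled binary shows which accessor is chosen per element.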
Why is the compiler not using the generic interface?
How come the for loop is so slow? What happens internally there? Isn't it supposed to be turned into a tail call, which is then compiled into a fast loop?
Is there a more natural way of writing a filter like I did (mapi, then filter) that does not have the detrimental performance drawbacks I described?
Why is there such a big difference between piping the string directly (slow) and string.AsEnumerable() (fast)?
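For the last question, the two forms being compared differ only in how the char sequence is obtained; a minimal illustration, where viaCast and viaAsEnumerable are hypothetical names for the two call shapes:

```fsharp
open System.Linq

// The two ways of treating a string as seq<char> that were benchmarked:
let viaCast (s: string) = s :> seq<char>            // direct interface cast
let viaAsEnumerable (s: string) = s.AsEnumerable()  // Enumerable.AsEnumerable,
                                                    // which returns its argument
                                                    // typed as IEnumerable<char>
```

Both enumerate the same characters; only the measured performance differed.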
I have many more questions, but the SO format generally wants you to ask a single, simple question, which I clearly didn't. Sorry to have been so elaborate; I hope it doesn't put too many people off from offering some insightful observations.
UPDATE: as pointed out in the comments, the boxing seems to appear only when run from F# Interactive (FSI). If you take stocp4 and change the calling code by adding a redundant Seq.filter ((<>) 0) (or something similar), it will instead call the unboxed accessor. Why? No idea.
Comments:

"… IEnumerator<T>.Current, no boxing. Am I missing something? How did you determine that the obj-typed Current is being called?" – Underwaist

"I put a printfn statement in both the IEnumerator<T>.Current and the IEnumerator.Current. Since both interfaces are needed (MoveNext is only available on IEnumerator), I figured this was caused by the way the F# compiler unwinds the syntax. But if you say the generated code doesn't have that, it means the IL or JIT compiler may be causing it, in which case this will be reproducible in C# as well." – Parasympathetic