Performance of sequences with while vs. for-do comprehensions, compared to direct `IEnumerable<T>` implementation
(sorry for the long post, to skip directly to the question(s), see bottom)
(UPDATE: if you are revisiting, please see sections marked "update" ;)

I set myself to understand better what is going on under the hood with F# sequences. A task I needed to optimize involved converting strings into a sequence of Unicode codepoints, and I was wondering whether I could replace the mutable loop we were using with an immutable one without sacrificing too much performance.

One of the challenges is that the returned sequence does not have the same length as the input sequence, because a surrogate pair yields a single integer. This is what the original code looked like:

let stocp0 (value: string) : seq<int> =
    let pos = ref 0
    seq {
        while !pos < value.Length do
            let c = System.Char.ConvertToUtf32(value, !pos)
            pos := !pos + (if c >= 0x10000 then 2 else 1)
            yield c
    }

Attempt 1: for-do

I figured that the easiest thing to do was to turn it into a for-do loop (not a for-in-do loop, they have too much extra overhead):

let inline stocp1(value: string) =
    seq {
        for i = 0 to value.Length - 1 do
            if(not <| Char.IsLowSurrogate(value.[i])) 
            then yield Char.ConvertToUtf32(value, i)
    }

This performed 3.2 times slower than the mutable counterpart above. A higher factor than I had imagined.

Attempt 2: Seq.mapi

Since a string is already a sequence (ok, there's an IEnumerable<char> wrapper), I thought I'd utilize that with the existing sequence functions from the Seq module, hoping that would perhaps bring better performance:

let inline stocp2(value: string) =
    value
        |> Seq.mapi (fun i c -> 
            if(Char.IsLowSurrogate(c)) then 0
            else Char.ConvertToUtf32(value, i))
        |> Seq.filter ((<>) 0)

It performed 3.5 times slower.

Strangely, if I replace value with value.AsEnumerable(), it performs significantly faster, even faster than stocp1: factor 3.0 (listed as stocp2' in the results below).

After several more tests it became clear to me that each |> creates a new IEnumerable<T> layer, with all the chained operations involved (this can also be observed in the FSharp source code of the Seq module). But the size of the overhead surprised me. Since none of the above came even remotely close to the performance of the original, I decided to try to prevent the extra sequence overhead from kicking in and to create a Seq.mapiAndFilter function that does both actions at once.
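
To illustrate the layering, here is a hand-rolled sketch of what a single Seq.filter stage adds: one more enumerator object wrapping the one below it. This is a simplified illustration, not the actual FSharp.Core implementation (which adds argument checks and specialized enumerator classes), but the per-element indirection is the same idea.

```fsharp
// Sketch (assumption: simplified vs. FSharp.Core): each pipeline stage
// wraps the previous sequence's enumerator in a fresh one.
let myFilter (pred: 'a -> bool) (source: seq<'a>) : seq<'a> =
    seq {
        use e = source.GetEnumerator()   // enumerator of the layer below
        while e.MoveNext() do            // every element still flows
            if pred e.Current then       // through this extra layer,
                yield e.Current          // one more virtual call per step
    }
```

Chain two or three such stages and every element pays two or three extra MoveNext/Current indirections, which is exactly the overhead that dominates when the real work per element is tiny.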

Attempt 3: Seq.mapiAndFilter

Since it is such a delicate loop and I only need to filter on the current character and return based on the current position, perhaps I could remove the extra step involved with Seq.mapi, which seems expensive.

To do that, I needed to mimic the behavior of the existing Seq.xxx functions and my first attempt was to do that with a while-yield loop. This would be closest to the original mutable approach but adds one layer of IEnumerable<T> overhead.

I wrote the following function, which takes a predicate and a mapping function: if the predicate returns true for the current element, it applies the mapping function to the position of that element.

let inline mapiAndFilter f g (e: seq<_>) : seq<_> =
    let position = ref -1
    let e = e.GetEnumerator()
    seq {
        while e.MoveNext() do
            position := !position + 1
            if f e.Current then yield g !position
    }


// and the application of it:
let inline stocp3(value: string) =
    value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))

The result was much better than the previous attempts: it clocked in at 1.5 times the running time of the mutable solution. Still disappointingly slow, though; it seems to imply that the overhead the enumerators add in tight loops is about 50%.

Attempt 4: improved Seq.mapiAndFilter

To find out what was going on under the hood, I then decided to write the enumerable type explicitly, which should give me the opportunity to find out whether any boilerplate checks added in the FSharp libraries had something to do with the low performance characteristics.

Without the safe-guards FSharp Seq functions use internally (to raise an error on illegal usage of Current etc), I came up with this:

let mapiAndFilter f g (e : seq<'a>) : seq<'b> =
    let i = ref -1
    let e = e.GetEnumerator()
    let inline getEnum() = {
            new IEnumerator<'b> with 
                member x.Current = g !i
            interface System.Collections.IEnumerator with 
                member x.Current = box <| g !i
                member x.MoveNext() = 
                    let rec next() = 
                        i := !i + 1
                        e.MoveNext() && (f e.Current || next())
                    next()
                member x.Reset() = raise (System.NotSupportedException "Reset is not supported")
            interface System.IDisposable with 
                member x.Dispose() = e.Dispose()  
        }
    {
    new System.Collections.Generic.IEnumerable<'b> with
        member x.GetEnumerator() = getEnum()
    interface System.Collections.IEnumerable with
        member x.GetEnumerator() = getEnum() :> System.Collections.IEnumerator
    }

// apply the same way as before:
let inline stocp4(value: string) =
    value.AsEnumerable()
        |> mapiAndFilter (not << Char.IsLowSurrogate) (fun i -> Char.ConvertToUtf32 (value, i))

This became our current winner! It appeared to clock in at only 1.1 times slower than the original mutable function. Of course, it uses mutable state, but so do all the Seq.xxx functions internally anyway.

Performance comparison overview

General note on all the attempts above: I have also tested with ToCharArray(), which improves performance on small-to-medium input but becomes detrimental on large input strings, especially when not all items need to be enumerated. Many other approaches I left out because their performance was so much worse (Seq.choose instead of Seq.filter is much slower, Seq.collect is very slow, etc.).
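
For reference, the ToCharArray() variant I tested looks roughly like this (a sketch of the approach; the measured code may have differed in details):

```fsharp
// ToCharArray() copies the string into a char[] up front. Indexing the
// array afterwards skips the string's IEnumerable wrapper (faster on
// small/medium input), but the O(n) copy is wasted on huge strings,
// especially when the consumer bails out early.
let stocp1Array (value: string) : seq<int> =
    let chars = value.ToCharArray()
    seq {
        for i = 0 to chars.Length - 1 do
            if not (System.Char.IsLowSurrogate chars.[i]) then
                yield System.Char.ConvertToUtf32(value, i)
    }
```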

I used the following for the performance comparison (apparently, Seq.length is the fastest way to force iteration; Seq.last and Seq.iter are much slower):

let input = "ab\U0001ABCDcde\U0001ABCEfghi\U0001ABCF"
let run f = for i in 1 .. 1000000 do f input |> Seq.length |> ignore;;
run stocp1
// etc

Results:

Function  CPU     Factor
stocp0    0.296   1.0
stocp1    0.951   3.2
stocp2    1.029   3.5
stocp2'   0.873   3.0
stocp3    0.436   1.5
stocp4    0.327   1.1
stocp5    0.405   1.3 (latkin's answer, adj. with Array.toSeq)

stocp2' is the version that uses AsEnumerable() on the string prior to passing it to the Seq.xxx functions. All other functions already use this.

I also tested with longer and with very large (50MB) strings, which is our typical use-case, and while the timings are less steady on subsequent runs, the effective factors are about the same as above.

Update: I added latkin's answer as stocp5, but had to adjust it by appending an Array.toSeq. Without it, it clocks in at 0.234, which is faster than the original while-loop. Unfortunately, I require a sequence (we must use lazy loading and cannot hold whole strings in memory).

(Update) Performance comparison incl element access

The above comparison only tests iteration, which helps find the issues caused by the stacked iterators. However, the timings are a little different if you add element access to the equation. I enforced it by adding a Seq.map id:

let runmap f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.length |> ignore;;

Results:

Function  CPU     Factor
stocp0    0.795   1.0
stocp1    1.528   1.9
stocp2    1.731   2.2
stocp2'   1.812   2.3
stocp3    0.936   1.2
stocp4    0.842   1.1
stocp5    0.873   1.1  (credit: latkin, see his answer and notes above)

(Update) Performance comparison incl limited element access

Since our typical use cases do not require full iteration, I've added a test that just iterates until the 2nd surrogate pair, at position 6, with larger input (3,932,160 characters).

let runmapnth f = for i in 1 .. 1000000 do f input |> Seq.map id |> Seq.nth 6 |> ignore;;

Results:

Function  CPU     Factor
stocp0    0.624   1.0
stocp1    1.029   1.6
stocp2    1.263   2.0
stocp2'   1.107   1.8
stocp3    0.717   1.1
stocp4    0.624   1.0
stocp5    ---     --- OOM

The OutOfMemoryException with latkin's answer surprised me a bit; it means that the created arrays were not cleaned up when used in a tight loop as above. My machine allocated 8GB a few times within a few seconds, with drops (GC?) in between, but in the end it still fails. Strange:

[screenshot: 8GB allocation spikes]

The other performance characteristics are as can be expected based on earlier observations.

Conclusion, questions

With the last exercise above, I found out something I hadn't expected: the F# compiler only calls the non-generic IEnumerator.Current and never the generic IEnumerator<T>.Current. This may partly explain why the performance degradation with chained sequence filters is so noticeable when the element is a value type: the boxing moves it onto the heap and back again, which is terrible.

  • Why is the compiler not using the generic interface?

  • How come the for loop is so slow; what happens internally here? Isn't it supposed to turn into a tail call, which is then compiled into a fast loop?

  • Is there a more natural or different way of writing a filter like I did (mapi, then filter) that does not have the detrimental performance drawbacks I described?

  • Why is there such a big difference between piping the string directly (slow) and string.AsEnumerable() (fast)?

I have many more questions, but the SO format generally wants you to ask a single, simple question, which I clearly didn't. Sorry to have been so elaborate; I hope it doesn't put too many people off from coming up with some insightful observations.

UPDATE: as pointed out in the comments, the boxing seems to only appear when run from FSharp Interactive (FSI). If you take stocp4 and change the calling code by adding a redundant Seq.filter ((<>) 0) (or something similar), it will instead call the unboxed accessor. Why? No idea.

Parasympathetic answered 13/5, 2015 at 2:1 Comment(7)
Wow, I'm impressed by the effort you have put into this question! – Bushy
I wonder what kind of results you'd get with Nessos Streams instead of sequences... – Marquisette
I think your asking so much about internal details of the F# compiler might be better off as a GitHub issue: github.com/fsharp/fsharp. Asking there should get the F# compiler guys on board, and it might even end up as a bug being fixed in the next F# version (I have seen that happen a few times). – Pontoon
For the record, I can add that I am able to reproduce these results. I looked briefly at the generated code and can't really tell what's wrong there. One thing I did notice, though, is that the generated code appears to correctly call the strongly typed version of IEnumerator<T>.Current, no boxing. Am I missing something? How did you determine that the obj-typed Current is being called? – Underwaist
@FyodorSoikin: simply by adding a printfn statement in both IEnumerator<T>.Current and IEnumerator.Current. Since both interfaces are needed (MoveNext is only available on IEnumerator), I figured this was caused by the way the F# compiler unwinds the syntax. But if you say the generated code doesn't show that, the IL or JIT compiler may be causing it, in which case this should be reproducible in C# as well. – Parasympathetic
I updated the question to address some of your questions and latkin's answer; I also added other performance timings for a fairer comparison. Also note: I cannot (fully) rely on arrays, as they will blow up with large inputs. – Parasympathetic
Not sure if it was an oversight, but stocp4's mapiAndFilter doesn't have an "inline" (stocp3's does), which increases its performance in my tests (and I would remove the inline on stocp4), especially for very long strings, in which case it was the clear winner. – Pascual
Ok I'll take a shot. All code and benchmark results can be found here.

Lazy v Eager

Seqs are slow. Comprehensions are slow. They are a convenient abstraction that involves a lot of compiler-generated goop and allocations, and should generally just be avoided altogether if perf is important. All of the impls in question are handily beaten by the simple non-lazy solution below.

// ~50% faster for given test case
// still ~20% faster even for length 1.5M string
let eager1 (value: string) =
    let result = ResizeArray(value.Length)
    for i in 0 .. value.Length - 1 do
        if not (Char.IsLowSurrogate(value.[i]))
        then result.Add(Char.ConvertToUtf32(value, i))
    result.ToArray()

Generic v Non

Your generic code is being invoked in the benchmark function.

Add a logging statement to both .Current impls, and pipe your output sequence to |> Seq.iter (printfn "%d"), and you will see it's the generic one that's invoked.

Were you testing in FSI? For whatever reason, FSI's "print a few elements of this sequence to the console" code does wind up in the non-generic path, but that doesn't affect the executing code. Maybe that's what you were seeing?

Loops in seq { }

Loops inside of seq { } and other computation expressions are not regular loops. (In fact, pretty much nothing "normal looking" inside of computation expressions is actually normal, which is kind of the point. :)) As indicated in the computation expression docs, a for loop ends up being code-generated as an iteration over another enumerable. while loops are a bit simpler.

This more or less explains why your "attempt 1" is so much slower - the for loop results in allocation and iteration of yet another seq inside of your seq.
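
To make that concrete, here is roughly the shape of what the builder does with a for loop (a sketch of the builder's For member, not the exact compiler output):

```fsharp
// seq { for x in source do ... } is handled by the builder's For
// member, which conceptually looks like this: it must allocate and
// drive an enumerator for `source` inside the outer seq -- a whole
// extra seq layer compared to a plain while loop over an index.
let forDesugared (source: seq<'a>) (body: 'a -> seq<'b>) : seq<'b> =
    seq {
        use e = source.GetEnumerator()
        while e.MoveNext() do
            yield! body e.Current
    }
```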

Piping through Seq APIs

Yes, this will create new seqs at each step. If the "real work" involved is very tiny, as in this example, then the overhead starts to dominate.

Getting faster

Your subsequent optimizations each remove layers of abstraction, and so although I don't have precise explanations, it seems reasonable that they get a bit faster.

.AsEnumerable()

That's pretty wacky; I can repro the significant speedup that you see. Very strange, given that the AsEnumerable extension method does nothing but return its input directly!

The structure of the generated code in these cases is very different. Maybe this is a pathological case in the optimizer somehow. Interesting find.

Variations

I found that results vary quite significantly when you enable/disable optimizations and when you target x64 vs x86. Take that for what it's worth.


Update after the OP's changed benchmarks and requirements

Array.toSeq

It is unnecessary to use Array.toSeq here, and it will predictably drag down the performance of my suggested solution. Array.toSeq and Seq.ofArray are there more for safety (making sure the resulting seq can't be converted back to an array by the consumer and mutated) than for type conversion.

Better choices:

  • Simply cast the array to seq<_> when returning it
  • Update your other APIs to accept the flexible type #seq<'t>, then even a plain array is ok
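
To illustrate the two options (sumCodepoints is a hypothetical consumer, not from the post):

```fsharp
// Option 1: upcast the array to seq<_> when returning it -- no copy,
// just a static type change.
let asSeq : seq<int> = [| 97; 98; 99 |] :> seq<int>

// Option 2: make the consuming API take the flexible type #seq<'t>
// ("any subtype of seq<'t>"), so plain arrays and lists are accepted
// directly, without any Array.toSeq conversion.
let sumCodepoints (cps: #seq<int>) = Seq.sum cps

let fromArray = sumCodepoints [| 1; 2; 3 |]   // array, no conversion
let fromList  = sumCodepoints [ 1; 2; 3 ]     // a list works too
```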

Updated Requirements

Given the newly revealed constraints:

  • Processing strings so large that even 1 or 2 copies will cause OOM
  • Frequent early bail-out when processing

then it's clear a lazy approach is going to be more appropriate, at least in some cases.

Yet even with those requirements, in my testing with your new benchmarks, non-lazy solutions still perform very well in all cases except OOM or huge input with early bailout.

See my gist linked above for results. It includes alternative non-lazy implementations:

let eager2 (value: string) =
    let result = ResizeArray(value.Length)
    for i in 0 .. value.Length - 1 do
        if not (Char.IsLowSurrogate(value.[i]))
        then result.Add(Char.ConvertToUtf32(value, i))
    // cast result so that return type isn't array
    (result.ToArray()) :> seq<_>

let eager3 (value: string) =
    let result = ResizeArray(value.Length)
    for i in 0 .. value.Length - 1 do
        if not (Char.IsLowSurrogate(value.[i]))
        then result.Add(Char.ConvertToUtf32(value, i))
    // ToArray() causes another copy to be generated.
    // Avoiding that is a win in large-input scenarios, but at a cost
    // of otherwise slower processing
    (result) :> seq<_>

Improving lazy solution

Here is a further optimization of the lazy approach, directly integrating all logic, avoiding use of the string enumerator, and avoiding recursion.

This guy actually seems to beat the non-lazy solutions in most cases!

let lazy5 (value : string) =         
    let inline getEnum() = 
        let i = ref -1
        { new IEnumerator<int> with
              member __.Current = Char.ConvertToUtf32(value, !i)
          interface System.Collections.IEnumerator with
              member __.Current =  box (Char.ConvertToUtf32(value, !i))
              member __.MoveNext() = 
                      incr i
                      if !i >= value.Length then false else
                      if not (Char.IsLowSurrogate(value.[!i])) then true else
                      incr i
                      !i < value.Length                  
              member __.Reset() = failwith "reset"
          interface IDisposable with
              member __.Dispose() = () }
    { new IEnumerable<int> with
          member __.GetEnumerator() = getEnum()
      interface IEnumerable with
          member __.GetEnumerator() = getEnum() :> IEnumerator }

Summary

The first while-based seq solution looks great and performs well given the constraints. I've tried to give some context on why the proposed alternatives might be slower; hopefully that's helpful. I managed to squeeze out a bit more perf by integrating everything into an explicit IEnumerable directly.

Depending on constraints and input, a non-lazy solution might be a good choice. I've suggested a few options here. As always, you will need to test in your real environment.

Messeigneurs answered 13/5, 2015 at 8:34 Comment(3)
Your solution indeed outperforms my attempts; however, there are two drawbacks: (1) it doesn't return a seq<int> (required for our subsequent processing) and (2) it allocates arrays the size of the string, which will blow up on large inputs. I have updated my question with timings for your code, with an appended Array.toSeq, and some other observations (an unexpected OOM, to name one). I was, however, surprised to see the improvement of >20% over my original while-loop with the dynamic array (without the toSeq). – Parasympathetic
Also, I found that I could only get the 20% improvement on my original example test case when I changed your code to use for i = 0 to value.Length - 1. It then clocks in at 0.234; your example clocks in at 0.390. I didn't manage to get the 50% increase, though. About your other conclusions: you are right, I used FSI. About for: I still find the big performance lag quite surprising. And yes, I have seen many variations as well; all my timings above are with inline and all optimizations. – Parasympathetic
@Parasympathetic I've updated my answer based on the new requirements. Eager solutions still perform very well, though I did manage to optimize the lazy approach even more. Full code and results here. – Messeigneurs
I might be quite late, but I have an answer to why the compiler uses the unboxed version when you throw in an unneeded Seq.filter ((<>) 0).

Basically, the compiler/interpreter looks at the code: Seq.filter uses <>, which is marked inline, as are all built-in operators in F#. Its type is (<>) : 'a -> 'a -> bool when 'a : equality, and the constant 0 forces 'a to be int, i.e. a value type. Without the redundant Seq.filter, the code would just go through the non-generic enumerator; because the sequence is passed to Seq.filter, the type checker resolves the element type at compile time instead of at runtime. The compiler may even see that it is safe to replace the indirect calls through the interface with direct calls, or to inline the function body.

This video explains some of the compiler's treatment of interfaces (in C#): https://www.youtube.com/watch?v=UybGH0xL5ns

Natator answered 25/12, 2022 at 20:52 Comment(2)
Not quite, but thanks for your answer. (<>) 0 is a lambda function, short for fun x -> x <> 0. A lambda is created on the heap and captures the parameter as a field (not always, but generally speaking). The issue explained here is the effect of chaining calls, where the compiler could optimise but doesn't; see latkin's answer. Since F# 6 the lambda will likely be inlined (through InlineIfLambda), but that wouldn't really alter the perf characteristics in these examples. The boxing here happens only in FSI, probably because of the printing or some other misunderstanding. – Parasympathetic
Btw, a lot of my original research has been overtaken by recent improvements in F#, especially wrt sequences and arrays treated as sequences. Still, the fastest approach remains the fastest, but the margins are slimmer. – Parasympathetic
