Why is function composition in F# so much slower, by 60%, than piping?

Admittedly, I am unsure whether I am correctly comparing apples with apples or apples with pears here. But I am particularly surprised by the magnitude of the difference, where only a slight difference, if any, would be expected.

Piping can often be expressed as function composition and vice versa, and I would assume the compiler knows that too, so I tried a little experiment:

// simplified example of some SB helpers:
let inline bcreate() = new StringBuilder(64)
let inline bget (sb: StringBuilder) = sb.ToString()
let inline appendf fmt (sb: StringBuilder) = Printf.kbprintf (fun () -> sb) sb fmt
let inline appends (s: string) (sb: StringBuilder) = sb.Append s
let inline appendi (i: int) (sb: StringBuilder) = sb.Append i
let inline appendb (b: bool) (sb: StringBuilder) = sb.Append b

// test function for composition, putting some garbage data in SB
let compose a =            
    (appends "START" 
    >> appendb  true
    >> appendi 10
    >> appendi a
    >> appends "0x"
    >> appendi 65535
    >> appendi 10
    >> appends "test"
    >> appends "END") (bcreate())

// test function for piping, putting the same garbage data in SB
let pipe a =
    bcreate()
    |> appends "START" 
    |> appendb  true
    |> appendi 10
    |> appendi a
    |> appends "0x"
    |> appendi 65535
    |> appendi 10
    |> appends "test"
    |> appends "END"

Testing this in FSI (64 bit enabled, --optimize flag on) gives:

> for i in 1 .. 500000 do compose 123 |> ignore;;
Real: 00:00:00.390, CPU: 00:00:00.390, GC gen0: 62, gen1: 1, gen2: 0
val it : unit = ()
> for i in 1 .. 500000 do pipe 123 |> ignore;;
Real: 00:00:00.249, CPU: 00:00:00.249, GC gen0: 27, gen1: 0, gen2: 0
val it : unit = ()

A small difference would be understandable, but this is a factor of 1.6 (60%) performance degradation.

I would actually expect the bulk of the work to happen in the StringBuilder, but apparently the overhead of composition has quite a bit of influence.

I realize that in most practical situations this difference will be negligible, but if you are writing large formatted text files (like log files) as in this case, it has an impact.

I am using the latest version of F#.

Doering answered 6/9, 2016 at 12:54 Comment(0)

I tried out your example with FSI and found no appreciable difference:

> #time
for i in 1 .. 500000 do compose 123 |> ignore

--> Timing now on

Real: 00:00:00.229, CPU: 00:00:00.234, GC gen0: 32, gen1: 32, gen2: 0
val it : unit = ()
> #time;;

--> Timing now off

> #time
for i in 1 .. 500000 do pipe 123 |> ignore;;

--> Timing now on

Real: 00:00:00.214, CPU: 00:00:00.218, GC gen0: 30, gen1: 30, gen2: 0
val it : unit = ()

Measuring it with BenchmarkDotNet (the first table is just a single compose/pipe run, the second table does it 500000 times), I found something similar:

  Method | Platform |       Jit |      Median |     StdDev |    Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |----------- |--------- |------ |------ |------------------- |
 compose |      X64 |    RyuJit | 319.7963 ns |  5.0299 ns | 2,848.50 |     - |     - |             182.54 |
    pipe |      X64 |    RyuJit | 308.5887 ns | 11.3793 ns | 2,453.82 |     - |     - |             155.88 |
 compose |      X86 | LegacyJit | 428.0141 ns |  3.6112 ns | 1,970.00 |     - |     - |             126.85 |
    pipe |      X86 | LegacyJit | 416.3469 ns |  8.0869 ns | 1,886.00 |     - |     - |             121.86 |

  Method | Platform |       Jit |      Median |    StdDev |    Gen 0 | Gen 1 | Gen 2 | Bytes Allocated/Op |
-------- |--------- |---------- |------------ |---------- |--------- |------ |------ |------------------- |
 compose |      X64 |    RyuJit | 160.8059 ms | 4.6699 ms | 3,514.75 |     - |     - |      56,224,980.75 |
    pipe |      X64 |    RyuJit | 163.1026 ms | 4.9829 ms | 3,120.00 |     - |     - |      50,025,686.21 |
 compose |      X86 | LegacyJit | 215.8562 ms | 4.2769 ms | 2,292.00 |     - |     - |      36,820,936.68 |
    pipe |      X86 | LegacyJit | 209.9219 ms | 2.5605 ms | 2,220.00 |     - |     - |      35,554,575.32 |
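
For reference, a harness along these lines can produce tables like the ones above (just a sketch of how such a BenchmarkDotNet run might be set up, not necessarily the exact code used; ComposeVsPipe is an illustrative name, and the compose/pipe functions from the question are assumed to be in scope):

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

[<MemoryDiagnoser>]                                  // reports GC generation counts and allocated bytes
type ComposeVsPipe() =
    [<Benchmark>]
    member this.Compose() = compose 123 |> ignore    // one compose run per invocation
    [<Benchmark>]
    member this.Pipe() = pipe 123 |> ignore          // one pipe run per invocation

// To run: BenchmarkRunner.Run<ComposeVsPipe>() |> ignore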

It may be that the differences you are measuring are related to GC. Try to force a GC collection before/after your timings.
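
For example, forcing a full, blocking collection before each timed loop keeps garbage left over from a previous run from being charged to the next one. A minimal sketch (timeWithCleanHeap is an illustrative name):

open System
open System.Diagnostics

// Collect everything that is collectable, then time the action over many iterations.
let timeWithCleanHeap label iterations (f: unit -> unit) =
    GC.Collect()
    GC.WaitForPendingFinalizers()
    GC.Collect()
    let sw = Stopwatch.StartNew()
    for _ in 1 .. iterations do f ()
    sw.Stop()
    printfn "%s: %O" label sw.Elapsed

// timeWithCleanHeap "compose" 500000 (fun () -> compose 123 |> ignore)
// timeWithCleanHeap "pipe"    500000 (fun () -> pipe 123 |> ignore)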

That said, looking at the source code for the pipe operator:

let inline (|>) x f = f x

and comparing against the composition operator:

let inline (>>) f g x = g(f x)

seems to make it clear that the composition operator creates lambda functions, which should result in more allocations; this also shows up in the BenchmarkDotNet allocation numbers. That might be the cause of the performance difference you are seeing.
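
To illustrate the difference (a sketch only; viaPipe and viaCompose are illustrative names, and the append helpers from the question are assumed to be in scope):

open System.Text

// Piping: x |> f is rewritten to f x, so the chain compiles to direct nested calls
// on the StringBuilder, equivalent to  appendi 10 (appends "START" sb).
let viaPipe (sb: StringBuilder) =
    sb |> appends "START" |> appendi 10

// Composition: f >> g is rewritten to fun x -> g (f x), i.e. a function value.
// Unless the optimizer reduces it away, each >> means an extra FSharpFunc allocation
// before the StringBuilder is finally applied.
let viaCompose (sb: StringBuilder) =
    (appends "START" >> appendi 10) sb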

Arriviste answered 6/9, 2016 at 13:37 Comment(4)
Thanks, very interesting comparisons. Maybe you are using the server GC and I have the normal, single-threaded GC? I don't know how to configure that for FSI. I should compare the compiled versions. I appreciate seeing that, at least on your system, the differences are negligible, as they ought to be. – Doering
I'm not using any special flags for FSI other than --optimize, which is the one you mentioned. I'm running fsianycpu.exe as well, in case that matters. – Arriviste
@Arriviste I disagree about the lambdas. Yes, they are created in unoptimized code. But with optimizations on, I only see two lambdas, not nine. Everything else does get inlined. I guess the bottom line is that the compiler has a harder time figuring out the inlining in the case of composition than in the case of piping. – Distribution
I found out where the difference came from: I had accidentally switched script debugging on earlier today. @FyodorSoikin is right, the lambdas actually are optimized away. There is still a difference in performance, but now it is much smaller and less pronounced, as can be expected. – Doering

Without any deep knowledge about F# internals, what I can tell from the generated IL is that compose will yield lambdas (and lots of them if optimizations are turned off), whereas in pipe all the calls to append* will be inlined.

Generated IL for pipe function:

Main.pipe:
IL_0000:  nop         
IL_0001:  ldc.i4.s    40 
IL_0003:  newobj      System.Text.StringBuilder..ctor
IL_0008:  ldstr       "START"
IL_000D:  callvirt    System.Text.StringBuilder.Append
IL_0012:  ldc.i4.1    
IL_0013:  callvirt    System.Text.StringBuilder.Append
IL_0018:  ldc.i4.s    0A 
IL_001A:  callvirt    System.Text.StringBuilder.Append
IL_001F:  ldarg.0     
IL_0020:  callvirt    System.Text.StringBuilder.Append
IL_0025:  ldstr       "0x"
IL_002A:  callvirt    System.Text.StringBuilder.Append
IL_002F:  ldc.i4      FF FF 00 00 
IL_0034:  callvirt    System.Text.StringBuilder.Append
IL_0039:  ldc.i4.s    0A 
IL_003B:  callvirt    System.Text.StringBuilder.Append
IL_0040:  ldstr       "test"
IL_0045:  callvirt    System.Text.StringBuilder.Append
IL_004A:  ldstr       "END"
IL_004F:  callvirt    System.Text.StringBuilder.Append
IL_0054:  ret

Generated IL for compose function:

Main.compose:
IL_0000:  nop         
IL_0001:  ldarg.0     
IL_0002:  newobj      <lambda1>..ctor
IL_0007:  stloc.1     
IL_0008:  ldloc.1     
IL_0009:  newobj      <lambda2>..ctor
IL_000E:  stloc.0     
IL_000F:  ldc.i4.s    40 
IL_0011:  newobj      System.Text.StringBuilder..ctor
IL_0016:  stloc.2     
IL_0017:  ldloc.0     
IL_0018:  ldloc.2     
IL_0019:  callvirt    Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_001E:  ldstr       "END"
IL_0023:  callvirt    System.Text.StringBuilder.Append
IL_0028:  ret

[email protected]:
IL_0000:  nop         
IL_0001:  ldarg.0     
IL_0002:  ldfld       <lambda1>.a
IL_0007:  ldarg.1     
IL_0008:  call        Main.f@1
IL_000D:  ldc.i4.s    0A 
IL_000F:  callvirt    System.Text.StringBuilder.Append
IL_0014:  ret         

[email protected]:
IL_0000:  ldarg.0     
IL_0001:  call        Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006:  ldarg.0     
IL_0007:  ldarg.1     
IL_0008:  stfld       <lambda1>.a
IL_000D:  ret         

[email protected]:
IL_0000:  nop         
IL_0001:  ldarg.0     
IL_0002:  ldfld       <lambda2>.f
IL_0007:  ldarg.1     
IL_0008:  callvirt    Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>.Invoke
IL_000D:  ldstr       "test"
IL_0012:  callvirt    System.Text.StringBuilder.Append
IL_0017:  ret         

[email protected]:
IL_0000:  ldarg.0     
IL_0001:  call        Microsoft.FSharp.Core.FSharpFunc<System.Text.StringBuilder,System.Text.StringBuilder>..ctor
IL_0006:  ldarg.0     
IL_0007:  ldarg.1     
IL_0008:  stfld       <lambda2>.f
IL_000D:  ret
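
Read back into F#, the compose IL corresponds roughly to the following (a sketch only; f1, lambda1 and lambda2 are illustrative names: f1 stands in for the static helper called as Main.f@1, whose IL is not shown above but presumably covers the inlined prefix of the chain up to appendi 65535, and lambda1/lambda2 stand in for the two closure classes in the dump):

open System.Text

// Stands in for the static helper called as Main.f@1 above (its IL is not shown;
// presumably it covers the inlined prefix of the chain up to appendi 65535).
let f1 (a: int) (sb: StringBuilder) =
    sb.Append("START").Append(true).Append(10).Append(a).Append("0x").Append(65535)

// Roughly what Main.compose plus the two closure classes amount to:
let compose' (a: int) =
    let lambda1 = fun (sb: StringBuilder) -> (f1 a sb).Append 10         // <lambda1>.Invoke
    let lambda2 = fun (sb: StringBuilder) -> (lambda1 sb).Append "test"  // <lambda2>.Invoke
    (lambda2 (new StringBuilder(64))).Append "END"                       // body of Main.compose
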
Adenectomy answered 6/9, 2016 at 13:30 Comment(4)
This is intriguing. I have seen the generation of "lots of lambdas" before when using composition. But this difference is quite a bit bigger than I had expected. More IL does not necessarily mean less performance, though. I'm still curious why this performs the way it does. My guess is that the JIT compiler cannot effectively optimize the closures away in the compose scenario. – Doering
The JITter has limited time, memory and knowledge. In my experience we can't rely on it to do holistic optimizations. It can eliminate unused variables, inline methods (unless virtual) and unroll loops, but that seems about it to me. The F# compiler has much more information available to it and should in principle be able to write more efficient IL. – Mutual
The "lots of lambdas" only happen without optimizations. See my comment on Ringil's answer. – Distribution
@FyodorSoikin, yes. That's exactly what the answer says. – Adenectomy
