.NET 4.6 RC x64 is twice as slow as x86 (release version)

.NET 4.6 RC x64 is twice as slow as x86 (release version):

Consider this piece of code:

using System;
using System.Diagnostics;

class SpectralNorm
{
    public static void Main(String[] args)
    {
        int n = 5500;
        if (args.Length > 0) n = Int32.Parse(args[0]);

        var spec = new SpectralNorm();
        var watch = Stopwatch.StartNew();
        var res = spec.Approximate(n);

        Console.WriteLine("{0:f9} -- {1}", res, watch.Elapsed.TotalMilliseconds);
    }

    double Approximate(int n)
    {
        // create unit vector
        double[] u = new double[n];
        for (int i = 0; i < n; i++) u[i] = 1;

        // 20 steps of the power method
        double[] v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 0;

        for (int i = 0; i < 10; i++)
        {
            MultiplyAtAv(n, u, v);
            MultiplyAtAv(n, v, u);
        }

        // B=AtA         A multiplied by A transposed
        // v.Bv /(v.v)   eigenvalue of v 
        double vBv = 0, vv = 0;
        for (int i = 0; i < n; i++)
        {
            vBv += u[i] * v[i];
            vv += v[i] * v[i];
        }

        return Math.Sqrt(vBv / vv);
    }


    /* return element i,j of infinite matrix A */
    double A(int i, int j)
    {
        return 1.0 / ((i + j) * (i + j + 1) / 2 + i + 1);
    }

    /* multiply vector v by matrix A */
    void MultiplyAv(int n, double[] v, double[] Av)
    {
        for (int i = 0; i < n; i++)
        {
            Av[i] = 0;
            for (int j = 0; j < n; j++) Av[i] += A(i, j) * v[j];
        }
    }

    /* multiply vector v by matrix A transposed */
    void MultiplyAtv(int n, double[] v, double[] Atv)
    {
        for (int i = 0; i < n; i++)
        {
            Atv[i] = 0;
            for (int j = 0; j < n; j++) Atv[i] += A(j, i) * v[j];
        }
    }

    /* multiply vector v by matrix A and then by matrix A transposed */
    void MultiplyAtAv(int n, double[] v, double[] AtAv)
    {
        double[] u = new double[n];
        MultiplyAv(n, v, u);
        MultiplyAtv(n, u, AtAv);
    }
}

On my machine, the x86 release build takes 4.5 seconds to complete, while the x64 build takes 9.5 seconds. Is there a specific flag or setting needed for x64?

UPDATE

It turns out that RyuJIT plays a role in this issue. If useLegacyJit is enabled in app.config, the result changes and x64 becomes the faster of the two.

<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <startup>
    <supportedRuntime version="v4.0" sku=".NETFramework,Version=v4.6"/>
  </startup>
  <runtime>
    <useLegacyJit enabled="1" />
  </runtime>
</configuration>
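To confirm which JIT actually compiled the code after toggling useLegacyJit, one option is to list the JIT modules loaded into the process. The sketch below is an illustration, not part of the original question: JitProbe and its methods are hypothetical names, and the expectation that RyuJIT shows up as clrjit.dll and the legacy x64 JIT as compatjit.dll on .NET 4.6 is an assumption to verify on your own machine.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper, not from the original post.
static class JitProbe
{
    // Returns the names of loaded native modules whose file name contains "jit".
    public static List<string> LoadedJitModules()
    {
        var names = new List<string>();
        foreach (ProcessModule module in Process.GetCurrentProcess().Modules)
        {
            string name = module.ModuleName ?? "";
            if (name.IndexOf("jit", StringComparison.OrdinalIgnoreCase) >= 0)
                names.Add(name);
        }
        return names;
    }

    // Prints the process bitness and every JIT-looking module that is loaded.
    public static void Report()
    {
        Console.WriteLine("64-bit process: {0}", Environment.Is64BitProcess);
        foreach (string name in LoadedJitModules())
            Console.WriteLine("JIT module: {0}", name);
    }
}
```

Calling JitProbe.Report() at the start of Main, once with and once without the useLegacyJit setting, should show whether the setting took effect.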

UPDATE

The issue has now been reported to the CLR team: coreclr, issue 993.

Circlet answered 12/5, 2015 at 15:28 Comment(11)
I'm not familiar with spectral norms, and that's a fair amount of code to consider. Could you give us a summary of what that's doing - hundreds or thousands of matrix operations of large floating-point double matrices with square-roots and divisions in there somewhere too? Can you profile this in both, can you look at the generated assembler for any obvious pessimisations? – Change
Are you running a release build, and not running it in a debugger? – Operant
It's worth running it a few times in a for loop and discounting the first few iterations since the JIT compiler needs to work its magic the first time. – Obligatory
.NET 4.6 has a brand-new x64 jitter (project RyuJIT), you are not going to get comparable results with previous .NET versions. Best way to report serious perf degradations is by using connect.microsoft.com, hurry while 4.6 is still in beta. – Obovate
In fact 4.6 is rc not beta. Here is the report on microsoft connect: "connect.microsoft.com/VisualStudio/feedback/details/1294384". – Circlet
Can you confirm runtimes in .net 4.5 say on both architectures to confirm this is indeed a 4.6 issue? – Hintz
I'm not sure your benchmarking code is very good. You only run the method one time and don't account for any JITting that may be going on. As @WaiHaLee said, run your method many times in a loop and ignore the time for the first few iterations. Then take an average of the rest of them. – Feigin
@Circlet - So you have been asked several times, and you have not answered... is the slowness in the jitting or in the jitted code? I.e., is it just taking a long time to jit the code and then executing it fast, or is it actually executing the code slowly? You check this by running the test multiple times and throwing out the first result. – Bienne
I looked at both JITs extensively and the new JIT produces worse code in most areas. Expect less performance from .NET 4.6. I reported a ton of stuff in January and the latest RC has nothing at all changed. It is really bad. When you say a.x + a.x that loads x twice from memory. Even basic optimizations are missing right now. – Rundle
@ErikFunkenbusch jitting does not take 5sec. This can't be it. The benchmark is good because all one-time costs disappear at this duration. – Rundle
Better Connect link: connect.microsoft.com/VisualStudio/feedback/details/1294384 – Synge
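Several of the comments above suggest running the measurement repeatedly and discarding the first iterations so that one-time JIT cost is excluded. A minimal sketch of that idea follows; Bench and MeanMilliseconds are hypothetical names introduced here for illustration, not something from the question.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not from the original post.
static class Bench
{
    // Runs `work` (warmup + measured) times and returns the mean wall-clock
    // time in milliseconds over the measured runs only.
    public static double MeanMilliseconds(Action work, int warmup, int measured)
    {
        if (work == null) throw new ArgumentNullException("work");
        if (measured <= 0) throw new ArgumentOutOfRangeException("measured");

        // Warm-up runs: timings are discarded because they include JIT compilation.
        for (int i = 0; i < warmup; i++) work();

        double total = 0;
        for (int i = 0; i < measured; i++)
        {
            var watch = Stopwatch.StartNew();
            work();
            watch.Stop();
            total += watch.Elapsed.TotalMilliseconds;
        }
        return total / measured;
    }
}
```

Inside SpectralNorm.Main this could be used as Bench.MeanMilliseconds(() => spec.Approximate(n), 1, 5). If the mean stays close to the single-run figure, the slowdown is in the jitted code rather than in the jitting itself.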

The reason for the perf regression is explained on GitHub. Briefly, it seems to reproduce only on Intel and not on Amd64 machines. The inner loop operation

Av[i] += A(i, j) * v[j];

results in

IN002a: 000093 lea      eax, [rax+r10+1]
IN002b: 000098 cvtsi2sd xmm1, rax
IN002c: 00009C movsd    xmm2, qword ptr [@RWD00]
IN002d: 0000A4 divsd    xmm2, xmm1
IN002e: 0000A8 movsxd   eax, edi
IN002f: 0000AB movaps   xmm1, xmm2
IN0030: 0000AE mulsd    xmm1, qword ptr [r8+8*rax+16]
IN0031: 0000B5 addsd    xmm0, xmm1
IN0032: 0000B9 movsd    qword ptr [rbx], xmm0

cvtsi2sd performs a partial write: it updates only the lower 8 bytes of the destination xmm register and leaves the upper bytes unmodified. In the repro case xmm1 is partially written, and xmm1 is used again further down the code. This creates a false dependency between cvtsi2sd and the other instructions that use xmm1, which limits instruction-level parallelism. Indeed, modifying the codegen of the int-to-float cast to emit "xorps xmm1, xmm1" before cvtsi2sd fixes the perf regression.

Workaround: the perf regression can also be avoided by reversing the order of the operands in the multiply operation in the MultiplyAv/MultiplyAtv methods:

void MultiplyAv(int n, double[] v, double[] Av)
{
    for (int i = 0; i < n; i++)
    {
        Av[i] = 0;
        for (int j = 0; j < n; j++)
            Av[i] += v[j] * A(i, j);  // order of operands reversed
    }
}
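To check the workaround on a given machine, the two operand orders can be timed side by side. The following is a sketch under assumptions: OperandOrder, MultiplyAvOriginal, MultiplyAvReversed and Compare are hypothetical names, A is copied from the question, and whether any timing difference appears depends on the JIT and CPU in use.

```csharp
using System;
using System.Diagnostics;

// Hypothetical comparison harness, not from the original post.
static class OperandOrder
{
    // Element (i, j) of the infinite matrix A, as in the question.
    static double A(int i, int j)
    {
        return 1.0 / ((i + j) * (i + j + 1) / 2 + i + 1);
    }

    // Operand order used in the question: A(i, j) * v[j].
    public static void MultiplyAvOriginal(int n, double[] v, double[] Av)
    {
        for (int i = 0; i < n; i++)
        {
            Av[i] = 0;
            for (int j = 0; j < n; j++) Av[i] += A(i, j) * v[j];
        }
    }

    // Operand order from the workaround: v[j] * A(i, j).
    public static void MultiplyAvReversed(int n, double[] v, double[] Av)
    {
        for (int i = 0; i < n; i++)
        {
            Av[i] = 0;
            for (int j = 0; j < n; j++) Av[i] += v[j] * A(i, j);
        }
    }

    // Times one run of each variant after a warm-up call and prints both.
    public static void Compare(int n)
    {
        var v = new double[n];
        for (int i = 0; i < n; i++) v[i] = 1;
        var a = new double[n];
        var b = new double[n];

        // Warm-up so that JIT compilation is not included in the timings.
        MultiplyAvOriginal(n, v, a);
        MultiplyAvReversed(n, v, b);

        var w1 = Stopwatch.StartNew();
        MultiplyAvOriginal(n, v, a);
        w1.Stop();

        var w2 = Stopwatch.StartNew();
        MultiplyAvReversed(n, v, b);
        w2.Stop();

        Console.WriteLine("A(i, j) * v[j]: {0} ms", w1.Elapsed.TotalMilliseconds);
        Console.WriteLine("v[j] * A(i, j): {0} ms", w2.Elapsed.TotalMilliseconds);
    }
}
```

Calling OperandOrder.Compare(5500) in an x64 release build, outside the debugger, matches the problem size used in the question. Both variants compute the same values, so only the timing should differ.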
Disappear answered 15/5, 2015 at 19:59 Comment(0)
