Why is a C++ program compiled for the x64 platform slower than one compiled for x86?

I wrote a program and compiled it for both the x64 and x86 platforms in Visual Studio 2010 on an Intel Core i5-2500. The x64 version takes about 19 seconds to run, while the x86 version takes about 17 seconds. What could be the reason for this behavior?

#include "timer.h"

#include <vector>
#include <iostream>
#include <algorithm>
#include <string>
#include <sstream>

/********************DECLARATIONS************************************************/
class Vector
{
public:
    Vector():x(0),y(0),z(0){}

    Vector(double x, double y, double z)
        : x(x)
        , y(y)
        , z(z)
    {
    }

    double x;
    double y;
    double z;
};


double Dot(const Vector& a, const Vector& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}


class Vector2
{
public:
    typedef double value_type;

    Vector2():x(0),y(0){}

    Vector2(double x, double y)
        : x(x)
        , y(y)
    {
    }

    double x;
    double y;
};

/******************************TESTS***************************************************/

void Test(const std::vector<Vector>& m, std::vector<Vector2>& m2)
{
    Vector axisX(0.3f, 0.001f, 0.25f);
    Vector axisY(0.043f, 0.021f, 0.45f);

    std::vector<Vector2>::iterator i2 = m2.begin();

    std::for_each(m.begin(), m.end(),
        [&](const Vector& v)
    {
        Vector2 r(0,0);
        r.x = Dot(axisX, v);
        r.y = Dot(axisY, v);

        (*i2) = r;
        ++i2;
    });
}


int main()
{
    cpptask::Timer timer;

    int len2 = 300;
    size_t len = 5000000;
    std::vector<Vector> m;
    m.reserve(len);
    for (size_t i = 0; i < len; ++i)
    {
        m.push_back(Vector(i * 0.2345, i * 2.67, i * 0.98));
    }

    /***********************************************************************************/
    {
        std::vector<Vector2> m2(m.size());
        double time = 0;
        for (int i = 0; i < len2; ++i)
        {
            timer.Start();
            Test(m, m2);
            time += timer.End();
        }
        std::cout << "Dot product double - " << time / len2 << std::endl;
    }
    /***********************************************************************************/


    return 0;
}
Ramakrishna answered 14/2, 2012 at 20:28 Comment(2)
Interesting. I'm able to reproduce this on a Core i7 920.Scepter
It's worth noting that you could use XMM intrinsics and save a lot more time.Marxismleninism
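
To make the intrinsics suggestion concrete, here is a minimal sketch (mine, not from the thread) assuming the Vector/Vector2 classes and the axis values from the question's Test() (written as exact doubles rather than the original float literals). SSE2 holds two doubles per register, so both dot products can be computed at once:

#include <emmintrin.h> // SSE2 intrinsics

void TestSse2(const std::vector<Vector>& m, std::vector<Vector2>& m2)
{
    // Pack the two axes column-wise: low lane = axisX component,
    // high lane = axisY component (_mm_set_pd takes high, then low).
    const __m128d cX = _mm_set_pd(0.043, 0.3);
    const __m128d cY = _mm_set_pd(0.021, 0.001);
    const __m128d cZ = _mm_set_pd(0.45,  0.25);

    for (size_t i = 0; i < m.size(); ++i)
    {
        // r = cX*v.x + cY*v.y + cZ*v.z yields Dot(axisX, v) in the low
        // lane and Dot(axisY, v) in the high lane, in one pass.
        __m128d r = _mm_mul_pd(cX, _mm_set1_pd(m[i].x));
        r = _mm_add_pd(r, _mm_mul_pd(cY, _mm_set1_pd(m[i].y)));
        r = _mm_add_pd(r, _mm_mul_pd(cZ, _mm_set1_pd(m[i].z)));
        _mm_storeu_pd(&m2[i].x, r); // stores m2[i].x and m2[i].y together
    }
}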

Short Answer: It's a compiler hiccup; the x64 optimizer fails to eliminate a copy.


Long Answer:

The x86 version is very slow if SSE2 is disabled, but I'm able to reproduce the results with SSE2 enabled on x86.

If you dive into the assembly of that inner-most loop, you'll see that the x64 version has two extra memory copies at the end.

x86:

$LL71@main:
movsd   xmm2, QWORD PTR [eax-8]
movsd   xmm0, QWORD PTR [eax-16]
movsd   xmm3, QWORD PTR [eax]
movapd  xmm1, xmm0
mulsd   xmm0, QWORD PTR __real@3fa60418a0000000
movapd  xmm7, xmm2
mulsd   xmm2, QWORD PTR __real@3f95810620000000
mulsd   xmm7, xmm5
mulsd   xmm1, xmm4
addsd   xmm1, xmm7
movapd  xmm7, xmm3
mulsd   xmm3, QWORD PTR __real@3fdcccccc0000000
mulsd   xmm7, xmm6
add eax, 24                 ; 00000018H
addsd   xmm1, xmm7
addsd   xmm0, xmm2
movq    QWORD PTR [ecx], xmm1
addsd   xmm0, xmm3
movq    QWORD PTR [ecx+8], xmm0
lea edx, DWORD PTR [eax-16]
add ecx, 16                 ; 00000010H
cmp edx, esi
jne SHORT $LL71@main

x64:

$LL175@main:
movsdx  xmm3, QWORD PTR [rdx-8]
movsdx  xmm5, QWORD PTR [rdx-16]
movsdx  xmm4, QWORD PTR [rdx]
movapd  xmm2, xmm3
mulsd   xmm2, xmm6
movapd  xmm0, xmm5
mulsd   xmm0, xmm7
addsd   xmm2, xmm0
movapd  xmm1, xmm4
mulsd   xmm1, xmm8
addsd   xmm2, xmm1
movsdx  QWORD PTR r$109492[rsp], xmm2
mulsd   xmm5, xmm9
mulsd   xmm3, xmm10
addsd   xmm5, xmm3
mulsd   xmm4, xmm11
addsd   xmm5, xmm4
movsdx  QWORD PTR r$109492[rsp+8], xmm5
mov rcx, QWORD PTR r$109492[rsp]
mov QWORD PTR [rax], rcx
mov rcx, QWORD PTR r$109492[rsp+8]
mov QWORD PTR [rax+8], rcx
add rax, 16
add rdx, 24
lea rcx, QWORD PTR [rdx-16]
cmp rcx, rbx
jne SHORT $LL175@main

The x64 version has a lot more (unexplained) moves at the end of the loop. It looks like some sort of memory-to-memory data-copy.

EDIT:

It turns out that the x64 optimizer isn't able to optimize out the following copy:

(*i2) = r;

This is why the inner loop has two extra memory copies. If you change the loop to this:

std::for_each(m.begin(), m.end(),
    [&](const Vector& v)
{
    i2->x = Dot(axisX, v);
    i2->y = Dot(axisY, v);
    ++i2;
});

This eliminates the copies. Now the x64 version is just as fast as the x86 version:

x86: 0.0249423
x64: 0.0249348
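
For what it's worth, an equivalent formulation with std::transform (my sketch, not part of the original fix) avoids the named temporary in the same way and should sidestep the copy as well:

std::transform(m.begin(), m.end(), m2.begin(),
    [&](const Vector& v)
{
    return Vector2(Dot(axisX, v), Dot(axisY, v));
});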

Lesson Learned: Compilers aren't perfect.

Scepter answered 14/2, 2012 at 20:45 Comment(8)
I don't think it does... but is 'double' 64-bit when compiled for a 64-bit arch versus 32-bit when compiled for a 32-bit arch? I believe long changes size, but I'm not sure whether double does. I would check, but Visual Studio is only letting me compile 32-bit today. double should be 64 bits (8 bytes) on both.Fissi
Nah, double is standard IEEE double precision on x86. The assembly here is pretty clear: it's all scalar double-precision SSE.Scepter
This looks like the same issue: dreamincode.net/forums/topic/127989-compiler-made-optimizations. To fix it they used the /O2 compiler optimization, which resulted in the 64-bit version being faster than the 32-bit version. Can you try that and see if it helps?Fissi
I used /O2 optimization for my tests (this flag is enabled by default for release builds in MSVC).Ramakrishna
Both of my tests were also done with /O2. But I think I see the problem now. The x64 compiler is not able to completely optimize the function call to Dot(). The return values of both calls to Dot() are being copied through memory rather than through a register... strange. x86 doesn't have this problem.Scepter
Could this also explain why a Windows application compiled for 32-bit runs a little more slowly on a 64-bit machine?Ijssel
@JimFell In this particular example, the 32-bit version uses the FPU by default. The FPU is generally slower than SSE. All x64 machines have SSE, so it's enabled by default on x64 (see the build-flag note after these comments).Scepter
x86 calling conventions are different, in particular in how registers are used to pass arguments. A good linker/optimizer would make this a non-issue, but evidently the standard x86 calling conventions make it easier to see this optimization.Fairing
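
A side note on the FPU-vs-SSE comment above: with the 32-bit MSVC compiler you can opt into SSE2 code generation explicitly. A command-line sketch (the equivalent VS2010 project setting is, as far as I know, "Enable Enhanced Instruction Set" under C/C++ > Code Generation):

cl /O2 /arch:SSE2 main.cpp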

This doesn't answer your question, but I think it's worth mentioning:

You shouldn't write vector classes yourself. For fixed-length vectors, prefer boost::array or OpenCV's cv::Vec2d and cv::Vec3d, which have a built-in dot product and other fast operations such as +, -, etc. (a generic cv::Vec<type, length> is also offered).
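
A sketch of what that looks like (assuming OpenCV is available; the header name varies across OpenCV versions):

#include <opencv2/core/core.hpp>

cv::Vec3d axisX(0.3, 0.001, 0.25);
cv::Vec3d v(1.0, 2.0, 3.0);
double d = axisX.dot(v); // built-in dot product, no hand-written class needed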

Mclean answered 30/10, 2012 at 12:7 Comment(0)

64-bit code is normally a little slower than 32-bit code (for code that doesn't specifically take advantage of 64-bit features). One particular problem is that pointers are bigger, which reduces how much useful data fits in the cache.
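
A minimal sketch of what actually changes size between the two targets:

#include <iostream>

int main()
{
    // With MSVC this prints "4 8" when built for x86 and "8 8" for x64:
    // pointers grow, doubles do not.
    std::cout << sizeof(void*) << ' ' << sizeof(double) << '\n';
}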

Case answered 14/2, 2012 at 20:39 Comment(4)
That might be true, but where are the pointers here? I see lots of memory accesses to the large vector and floating-point ops. I don't see a lot of memory bandwidth being consumed passing pointers around.Everyplace
But then why does this article msdn.microsoft.com/en-us/library/windows/desktop/… say that the x64 architecture has improved floating-point operations?Ramakrishna
The x86_64 ISA includes SSE and SSE2, which can't be said of x86. Therefore, binaries generated with only lowest-common-denominator instructions, without any hand-crafted ASM and without CPUID detection and separate instruction blocks for each SSE level (which is probably what Microsoft refers to), are slower, as they have to do without SSE.Lelandleler
I wouldn't make any blanket statements about the speed advantage or disadvantage of x64 vs. x86. Sure, for memory-limited code that may make a difference, but on the other hand x64 has some occasionally useful additional instructions and, more generally useful, more registers. Both are unimportant for FP code, though.Wimberly
