Why is VC++ unable to optimize an integer wrapper?

In C++, I'm trying to write a wrapper around a 64-bit integer. My expectation is that, if it is written correctly and all methods are inlined, such a wrapper should perform as well as the underlying type. The answer to this question on SO seems to agree with my expectation.

I wrote this code to test my expectation :

class B
{
private:
   uint64_t _v;

public:
   inline B() {};
   inline B(uint64_t v) : _v(v) {};

   inline B& operator=(B rhs) { _v = rhs._v; return *this; };
   inline B& operator+=(B rhs) { _v += rhs._v; return *this; };
   inline operator uint64_t() const { return _v; };
};

int main(int argc, char* argv[])
{
   typedef uint64_t T;
   //typedef B T;
   const unsigned int x = 100000000;

   Utils::CTimer timer;
   timer.start();

   T sum = 0;
   for (unsigned int i = 0; i < 100; ++i)
   {
      for (uint64_t f = 0; f < x; ++f)
      {
         sum += f;
      }
   }

   float time = timer.GetSeconds();

   cout << sum << endl
        << time << " seconds" << endl;

   return 0;
}

When I run this with typedef B T; instead of typedef uint64_t T; the reported times are consistently 10% slower when compiled with VC++. With g++, the performance is the same whether or not I use the wrapper.

Since g++ does it, I guess there is no technical reason why VC++ cannot optimize this correctly. Is there something I could do to make it optimize this?

I have already tried playing with the optimization flags, with no success.
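For anyone who wants to reproduce the comparison without Utils::CTimer, here is a minimal self-contained sketch. It assumes std::chrono::steady_clock as a stand-in for the timer, and the helper names sum_loop and timed_sum are hypothetical, not part of the original code. The wrapper is the one from the question, reduced to what the loop needs.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Same wrapper as in the question, reduced to the members the loop uses.
class B
{
private:
   uint64_t _v;

public:
   B() : _v(0) {}
   B(uint64_t v) : _v(v) {}

   B& operator+=(B rhs) { _v += rhs._v; return *this; }
   operator uint64_t() const { return _v; }
};

// Hypothetical helper: the question's nested loop, for T = uint64_t or T = B.
template <typename T>
uint64_t sum_loop(unsigned int outer, uint64_t inner)
{
   T sum = 0;
   for (unsigned int i = 0; i < outer; ++i)
   {
      for (uint64_t f = 0; f < inner; ++f)
      {
         sum += f;
      }
   }
   return sum;
}

// Hypothetical helper: returns the sum and the elapsed wall-clock seconds.
template <typename T>
std::pair<uint64_t, double> timed_sum(unsigned int outer, uint64_t inner)
{
   auto t0 = std::chrono::steady_clock::now();
   uint64_t result = sum_loop<T>(outer, inner);
   auto t1 = std::chrono::steady_clock::now();
   return std::make_pair(result, std::chrono::duration<double>(t1 - t0).count());
}
```

Calling timed_sum<uint64_t>(100, 100000000) and timed_sum<B>(100, 100000000) mirrors the bounds in the question. Both should return the same sum, so any difference is in the second member of the pair.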

Calendre answered 4/2, 2015 at 13:1 Comment(8)
Did you run the code from Visual Studio or from a Windows console? - Towns
I won't be surprised if g++ folded the entire loop. - Betelgeuse
Dive into the generated assembly! - Earleanearleen
I think I tested both, but I'll need to test again to make sure. Could it make a difference? - Feckless
How did you compile? I presume Release, but what optimization flags did you use? Also, is the g++ code faster, or has VC++ already optimized the code? - Encratia
VC++ can be "hideously" [ ;) ] effective during optimization, e.g. using SIMD (vector) operations when it can. Summing integers can be vectorized/parallelized by the compiler. Summing wrappers can't. - Encratia
I don't have the exact times with me, but the g++ versions were faster than VC++ with or without the wrapper. - Feckless
As @Betelgeuse answered, g++ optimized the loop away entirely. Both benchmarks fail, in the sense that they don't measure the effect of wrapping. On the other hand, they do show that wrapping has side effects, i.e. it prevents parallelization. - Encratia
B
4

For the record, this is what g++ and clang++'s generated assembly at -O2 translates to (in both wrapper and non-wrapper cases), modulo the timing part:

sum = 499999995000000000;
cout << sum << endl;

In other words, it optimized the loop out entirely. Regardless of how hard you try to vectorize the loop, it's rather hard to beat not looping at all :)
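As a sanity check on that constant, the arithmetic can be written out directly. This is only a sketch of the closed form, not a claim about how either compiler derives it, and the names inner_sum and folded_sum are made up for illustration.

```cpp
#include <cstdint>

// Sum of 0 + 1 + ... + (n - 1): what one pass of the inner loop adds to sum.
constexpr uint64_t inner_sum(uint64_t n)
{
   return n * (n - 1) / 2;
}

// Total after `passes` repetitions of the inner loop.
constexpr uint64_t folded_sum(unsigned int passes, uint64_t n)
{
   return passes * inner_sum(n);
}

// 100 passes over f = 0 .. 99999999 give the value quoted in the answer.
static_assert(folded_sum(100, 100000000) == 499999995000000000ULL,
              "closed form matches the folded constant");
```

The intermediate product 100000000 * 99999999 is below 2^64, so nothing overflows for the bounds used in the question.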

Betelgeuse answered 4/2, 2015 at 13:12 Comment(0)
M
3

Using /O2 (maximize speed), both alternatives generate exactly the same assembly using Visual Studio 2012. This is your code, minus the timing and output:

00FB1000  push        ebp  
00FB1001  mov         ebp,esp  
00FB1003  and         esp,0FFFFFFF8h  
00FB1006  sub         esp,8  
00FB1009  mov         edx,64h  
00FB100E  mov         edi,edi  
00FB1010  xorps       xmm0,xmm0  
00FB1013  movlpd      qword ptr [esp],xmm0  
00FB1018  mov         ecx,dword ptr [esp+4]  
00FB101C  mov         eax,dword ptr [esp]  
00FB101F  nop  
00FB1020  add         eax,1  
00FB1023  adc         ecx,0  
00FB1026  jne         main+2Fh (0FB102Fh)  
00FB1028  cmp         eax,5F5E100h  
00FB102D  jb          main+20h (0FB1020h)  
00FB102F  dec         edx  
00FB1030  jne         main+10h (0FB1010h)  
00FB1032  xor         eax,eax

I'd presume that the measured times fluctuate or are not always correct.
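One possible reading of the inner loop in that listing, offered as an interpretation rather than a definitive analysis: on a 32-bit target the 64-bit counter f is held in two registers, add eax,1 / adc ecx,0 increments it with carry, and 5F5E100h is 100000000. The xorps/movlpd pair appears only to zero the 8-byte counter before each pass. The listing shows no separate accumulation for sum, which would be consistent with the output having been removed. The sketch below models this in C++; Counter64, increment and count_to are hypothetical names.

```cpp
#include <cstdint>

// Models the register pair from the listing: lo ~ eax, hi ~ ecx.
struct Counter64
{
   uint32_t lo;
   uint32_t hi;
};

// add eax,1 / adc ecx,0: 64-bit increment built from two 32-bit halves.
inline void increment(Counter64& c)
{
   c.lo += 1;
   uint32_t carry = (c.lo == 0) ? 1u : 0u; // low half wrapped around
   c.hi += carry;
}

inline uint64_t value(const Counter64& c)
{
   return (static_cast<uint64_t>(c.hi) << 32) | c.lo;
}

// One pass of the inner loop as the listing lays it out (increment first,
// test afterwards). Returns the number of iterations executed.
inline uint64_t count_to(uint32_t limit)
{
   Counter64 f = { 0, 0 };          // xorps / movlpd, then the two loads
   uint64_t iterations = 0;
   do
   {
      increment(f);
      ++iterations;
      if (f.hi != 0) break;         // jne main+2Fh: high half is non-zero
   } while (f.lo < limit);          // cmp eax,limit / jb main+20h
   return iterations;
}

static_assert(0x5F5E100 == 100000000, "inner loop bound from the listing");
static_assert(0x64 == 100, "outer loop count loaded into edx");
```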

Marileemarilin answered 4/2, 2015 at 13:8 Comment(8)
xmm0! MMX registers! It did vectorize the operation! - Encratia
@PanagiotisKanavos Indeed, a rare sight I'd say. - Marileemarilin
Not rare actually, VC is only surpassed by Intel's own compilers in parallelizing code. - Encratia
I spent quite some time optimizing a tight loop by hand where Visual Studio would use lots of mulss, but never mulps, although it was perfectly possible. Made me lose a bit of confidence ;) - Marileemarilin
What version? Each successive version has a lot of improvements, and there were two major new versions in the last couple of years. Moreover, newer agreements with Intel mean that the latest versions contain larger parts of Intel's vectorization technology, parallel libraries, etc. - Encratia
VS2012, as it was in an ongoing project. I'm gonna check the code in VS2013 and the VS2015 Preview :) - Marileemarilin
Also VS2015 - yet another major version coming out :) - Encratia
@Betelgeuse On a closer look the MMX register seems to be used to store the inner for loop's counting variable uint64_t f, right? - Marileemarilin
