Why is VC++ unable to optimize an integer wrapper?

Asked 4/2, 2015 at 13:1 Answered 4/2, 2015 at 13:12

In C++, I'm trying to write a wrapper around a 64-bit integer. My expectation is that, if it is written correctly and all methods are inlined, such a wrapper should perform as well as the underlying type. An answer to this question on SO seems to agree with my expectation.

I wrote this code to test my expectation:

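#include <cstdint>
#include <iostream>

using namespace std;

// Utils::CTimer is the asker's own timer class; its header is not shown.

class B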
{
private:
   uint64_t _v;

public:
   inline B() {}
   inline B(uint64_t v) : _v(v) {}

   inline B& operator=(B rhs) { _v = rhs._v; return *this; }
   inline B& operator+=(B rhs) { _v += rhs._v; return *this; }
   inline operator uint64_t() const { return _v; }
};

int main(int argc, char* argv[])
{
   typedef uint64_t T;
   //typedef B T;
   const unsigned int x = 100000000;

   Utils::CTimer timer;
   timer.start();

   T sum = 0;
   for (unsigned int i = 0; i < 100; ++i)
   {
      for (uint64_t f = 0; f < x; ++f)
      {
         sum += f;
      }
   }

   float time = timer.GetSeconds();

   cout << sum << endl
        << time << " seconds" << endl;

   return 0;
}

When I run this with typedef B T; instead of typedef uint64_t T; the reported times are consistently 10% slower when compiled with VC++. With g++, performance is the same whether or not I use the wrapper.

Since g++ manages it, I guess there is no technical reason why VC++ cannot optimise this correctly. Is there something I could do to make it do so?

I already tried playing with the optimisation flags, with no success.

Calendre answered 4/2, 2015 at 13:1 Comment(8)

Did you run the code from Visual Studio or from a Windows console? – Towns 4/2, 2015 at 13:5

I wouldn't be surprised if g++ folded the entire loop. – Betelgeuse 4/2, 2015 at 13:6

Dive into the generated assembly! – Earleanearleen 4/2, 2015 at 13:6

I think I tested both, but I'll need to test again to make sure. Could it make a difference? – Feckless 4/2, 2015 at 13:6

How did you compile? I presume Release, but what optimization flags did you use? Also, is the g++ code faster, or has VC++ already optimized the code? – Encratia 4/2, 2015 at 13:6

VC++ can be "hideously" [ ;) ] effective during optimization, e.g. using SIMD (vector) operations when it can. Summing integers can be vectorized/parallelized by the compiler. Summing wrappers can't. – Encratia 4/2, 2015 at 13:7

I don't have the exact times with me but g++ versions were faster than VC++ with or without wrapper. – Feckless 4/2, 2015 at 13:14

As @Betelgeuse answered, g++ optimized the loop away entirely. Both benchmarks fail, in the sense that they don't measure the effect of wrapping. On the other hand, they do show that wrapping has side effects, i.e. it prevents parallelization. – Encratia 4/2, 2015 at 13:15

For the record, this is what g++ and clang++'s generated assembly at -O2 translates to (in both wrapper and non-wrapper cases), modulo the timing part:

sum = 499999995000000000;
cout << sum << endl;

In other words, it optimized the loop out entirely. Regardless of how hard you try to vectorize the loop, it's rather hard to beat not looping at all :)
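
For illustration, the constant above is just the closed form of the nested loops: the inner loop sums 0 through x - 1, which is x(x - 1)/2, and the outer loop repeats that 100 times. A minimal sketch, where folded_sum, looped_sum and check_small_case are hypothetical names, not anything taken from the generated code:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the value the nested loops compute, as a closed form.
// Inner loop: 0 + 1 + ... + (x - 1) = x * (x - 1) / 2; the outer loop repeats it.
constexpr std::uint64_t folded_sum(std::uint64_t repeats, std::uint64_t x)
{
    return repeats * (x * (x - 1) / 2);
}

// The same computation written as the loops from the question.
inline std::uint64_t looped_sum(std::uint64_t repeats, std::uint64_t x)
{
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < repeats; ++i)
        for (std::uint64_t f = 0; f < x; ++f)
            sum += f;
    return sum;
}

// With the question's parameters the closed form gives the constant above.
static_assert(folded_sum(100, 100000000) == 499999995000000000ULL,
              "closed form matches the folded constant");

// Cross-check the two versions on a small input.
inline void check_small_case()
{
    assert(looped_sum(3, 10) == folded_sum(3, 10));
}
```

Since everything in the loop is known at compile time, the compiler is free to do this arithmetic itself, with or without the wrapper.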

Betelgeuse answered 4/2, 2015 at 13:12 Comment(0)

Using /O2 (maximize speed), both alternatives generate exactly the same assembly using Visual Studio 2012. This is your code, minus the timing and output:

00FB1000  push        ebp  
00FB1001  mov         ebp,esp  
00FB1003  and         esp,0FFFFFFF8h  
00FB1006  sub         esp,8  
00FB1009  mov         edx,64h  
00FB100E  mov         edi,edi  
00FB1010  xorps       xmm0,xmm0  
00FB1013  movlpd      qword ptr [esp],xmm0  
00FB1018  mov         ecx,dword ptr [esp+4]  
00FB101C  mov         eax,dword ptr [esp]  
00FB101F  nop  
00FB1020  add         eax,1  
00FB1023  adc         ecx,0  
00FB1026  jne         main+2Fh (0FB102Fh)  
00FB1028  cmp         eax,5F5E100h  
00FB102D  jb          main+20h (0FB1020h)  
00FB102F  dec         edx  
00FB1030  jne         main+10h (0FB1010h)  
00FB1032  xor         eax,eax

I'd presume that the measured times fluctuate or are not always correct.
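
If the timings are to be compared at all, one option is to read the loop bound at run time, so the result cannot be folded into a constant, and to keep the best of several trials to damp fluctuation. A rough sketch, assuming std::chrono in place of the asker's Utils::CTimer; timed_sum is a hypothetical helper, and B is the question's wrapper cut down to what the loop needs. An optimizer may still replace the inner loop with a closed form, so checking the generated assembly remains necessary:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Wrapper from the question, reduced to what the loop needs.
class B
{
    std::uint64_t _v;

public:
    B(std::uint64_t v) : _v(v) {}
    B& operator+=(B rhs) { _v += rhs._v; return *this; }
    operator std::uint64_t() const { return _v; }
};

// Hypothetical helper: sums 0 .. bound-1 using type T, `trials` times,
// and reports the fastest trial in `best_seconds`.
// If trials is 0, best_seconds is left at the largest double and 0 is returned.
template <typename T>
std::uint64_t timed_sum(std::uint64_t bound, unsigned trials, double& best_seconds)
{
    volatile std::uint64_t hidden = bound;  // must be read at run time
    std::uint64_t result = 0;
    best_seconds = std::numeric_limits<double>::max();

    for (unsigned t = 0; t < trials; ++t)
    {
        const std::uint64_t x = hidden;     // bound is unknown to the optimizer

        const auto start = std::chrono::steady_clock::now();
        T sum = 0;
        for (std::uint64_t f = 0; f < x; ++f)
            sum += f;
        const auto stop = std::chrono::steady_clock::now();

        const double elapsed = std::chrono::duration<double>(stop - start).count();
        best_seconds = std::min(best_seconds, elapsed);
        result = static_cast<std::uint64_t>(sum);
    }
    return result;
}
```

Calling timed_sum<std::uint64_t> and timed_sum<B> with the same bound and comparing the two best_seconds values gives a somewhat steadier comparison than a single run of each.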

Marileemarilin answered 4/2, 2015 at 13:8 Comment(8)

xmm0 ! MMX registers! It did vectorize the operation! – Encratia 4/2, 2015 at 13:9

@PanagiotisKanavos Indeed, a rare sight I'd say. – Marileemarilin 4/2, 2015 at 13:12

Not rare actually, VC is only surpassed by Intel's own compilers in parallelizing code. – Encratia 4/2, 2015 at 13:13

I spent quite some time optimizing a tight loop by hand where Visual Studio would use lots of mulss, but never mulps, although it was perfectly possible. Made me lose a bit of confidence ;) – Marileemarilin 4/2, 2015 at 13:15

What version? Each successive version has a lot of improvements, and there were two major new versions in the last couple of years. Moreover, newer agreements with Intel mean that the latest versions contain larger parts of Intel's vectorization technology, parallel libraries etc – Encratia 4/2, 2015 at 13:17

VS2012, as it was in an ongoing project. I'm gonna check the code in VS2013 and the VS2015 Preview :) – Marileemarilin 4/2, 2015 at 13:19

Also VS2015 - yet another major version coming out :) – Encratia 4/2, 2015 at 13:19

@Betelgeuse On a closer look the MMX register seems to be used to store the inner for loop's counting variable uint64_t f, right? – Marileemarilin 4/2, 2015 at 13:48
