Using this pointer causes strange deoptimization in hot loop
Asked Answered
A

3

140

I recently came across a strange deoptimization (or rather missed optimization opportunity).

Consider this function for efficient unpacking of arrays of 3-bit integers to 8-bit integers. It unpacks 16 ints in each loop iteration:

void unpack3bit(uint8_t* target, char* source, int size) {
   while(size > 0){
      uint64_t t = *reinterpret_cast<uint64_t*>(source);
      target[0] = t & 0x7;
      target[1] = (t >> 3) & 0x7;
      target[2] = (t >> 6) & 0x7;
      target[3] = (t >> 9) & 0x7;
      target[4] = (t >> 12) & 0x7;
      target[5] = (t >> 15) & 0x7;
      target[6] = (t >> 18) & 0x7;
      target[7] = (t >> 21) & 0x7;
      target[8] = (t >> 24) & 0x7;
      target[9] = (t >> 27) & 0x7;
      target[10] = (t >> 30) & 0x7;
      target[11] = (t >> 33) & 0x7;
      target[12] = (t >> 36) & 0x7;
      target[13] = (t >> 39) & 0x7;
      target[14] = (t >> 42) & 0x7;
      target[15] = (t >> 45) & 0x7;
      source+=6;
      size-=6;
      target+=16;
   }
}

Here is the generated assembly for parts of the code:

 ...
 367:   48 89 c1                mov    rcx,rax
 36a:   48 c1 e9 09             shr    rcx,0x9
 36e:   83 e1 07                and    ecx,0x7
 371:   48 89 4f 18             mov    QWORD PTR [rdi+0x18],rcx
 375:   48 89 c1                mov    rcx,rax
 378:   48 c1 e9 0c             shr    rcx,0xc
 37c:   83 e1 07                and    ecx,0x7
 37f:   48 89 4f 20             mov    QWORD PTR [rdi+0x20],rcx
 383:   48 89 c1                mov    rcx,rax
 386:   48 c1 e9 0f             shr    rcx,0xf
 38a:   83 e1 07                and    ecx,0x7
 38d:   48 89 4f 28             mov    QWORD PTR [rdi+0x28],rcx
 391:   48 89 c1                mov    rcx,rax
 394:   48 c1 e9 12             shr    rcx,0x12
 398:   83 e1 07                and    ecx,0x7
 39b:   48 89 4f 30             mov    QWORD PTR [rdi+0x30],rcx
 ...

It looks quite efficent. Simply a shift right followed by an and, and then a store to the target buffer. But now, look what happens when I change the function to a method in a struct:

struct T{
   uint8_t* target;
   char* source;
   void unpack3bit( int size);
};

void T::unpack3bit(int size) {
        while(size > 0){
           uint64_t t = *reinterpret_cast<uint64_t*>(source);
           target[0] = t & 0x7;
           target[1] = (t >> 3) & 0x7;
           target[2] = (t >> 6) & 0x7;
           target[3] = (t >> 9) & 0x7;
           target[4] = (t >> 12) & 0x7;
           target[5] = (t >> 15) & 0x7;
           target[6] = (t >> 18) & 0x7;
           target[7] = (t >> 21) & 0x7;
           target[8] = (t >> 24) & 0x7;
           target[9] = (t >> 27) & 0x7;
           target[10] = (t >> 30) & 0x7;
           target[11] = (t >> 33) & 0x7;
           target[12] = (t >> 36) & 0x7;
           target[13] = (t >> 39) & 0x7;
           target[14] = (t >> 42) & 0x7;
           target[15] = (t >> 45) & 0x7;
           source+=6;
           size-=6;
           target+=16;
        }
}

I thought the generated assembly should be quite the same, but it isn't. Here is a part of it:

...
 2b3:   48 c1 e9 15             shr    rcx,0x15
 2b7:   83 e1 07                and    ecx,0x7
 2ba:   88 4a 07                mov    BYTE PTR [rdx+0x7],cl
 2bd:   48 89 c1                mov    rcx,rax
 2c0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2c3:   48 c1 e9 18             shr    rcx,0x18
 2c7:   83 e1 07                and    ecx,0x7
 2ca:   88 4a 08                mov    BYTE PTR [rdx+0x8],cl
 2cd:   48 89 c1                mov    rcx,rax
 2d0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2d3:   48 c1 e9 1b             shr    rcx,0x1b
 2d7:   83 e1 07                and    ecx,0x7
 2da:   88 4a 09                mov    BYTE PTR [rdx+0x9],cl
 2dd:   48 89 c1                mov    rcx,rax
 2e0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 2e3:   48 c1 e9 1e             shr    rcx,0x1e
 2e7:   83 e1 07                and    ecx,0x7
 2ea:   88 4a 0a                mov    BYTE PTR [rdx+0xa],cl
 2ed:   48 89 c1                mov    rcx,rax
 2f0:   48 8b 17                mov    rdx,QWORD PTR [rdi] // Load, BAD!
 ...

As you see, we introduced an additional redundant load from memory before each shift (mov rdx,QWORD PTR [rdi]). It seems like the target pointer (which is now a member instead of a local variable) has to be always reloaded before storing into it. This slows down the code considerably (around 15% in my measurements).

First I thought maybe the C++ memory model enforces that a member pointer may not be stored in a register but has to be reloaded, but this seemed like an awkward choice, as it would make a lot of viable optimizations impossible. So I was very surprised that the compiler did not store target in a register here.

I tried caching the member pointer myself into a local variable:

void T::unpack3bit(int size) {
    while(size > 0){
       uint64_t t = *reinterpret_cast<uint64_t*>(source);
       uint8_t* target = this->target; // << ptr cached in local variable
       target[0] = t & 0x7;
       target[1] = (t >> 3) & 0x7;
       target[2] = (t >> 6) & 0x7;
       target[3] = (t >> 9) & 0x7;
       target[4] = (t >> 12) & 0x7;
       target[5] = (t >> 15) & 0x7;
       target[6] = (t >> 18) & 0x7;
       target[7] = (t >> 21) & 0x7;
       target[8] = (t >> 24) & 0x7;
       target[9] = (t >> 27) & 0x7;
       target[10] = (t >> 30) & 0x7;
       target[11] = (t >> 33) & 0x7;
       target[12] = (t >> 36) & 0x7;
       target[13] = (t >> 39) & 0x7;
       target[14] = (t >> 42) & 0x7;
       target[15] = (t >> 45) & 0x7;
       source+=6;
       size-=6;
       this->target+=16;
    }
}

This code also yields the "good" assembler without additional stores. So my guess is: The compiler is not allowed to hoist the load of a member pointer of a struct, so such a "hot pointer" should always be stored in a local variable.

  • So, why is the compiler unable to optimize out these loads?
  • Is it the C++ memory model that forbids this? Or is it simply a shortcoming of my compiler?
  • Is my guess correct or what is the exact reason why the optimization can't be performed?

The compiler in use was g++ 4.8.2-19ubuntu1 with -O3 optimization. I also tried clang++ 3.4-1ubuntu3 with similar results: Clang is even able to vectorize the method with the local target pointer. However, using the this->target pointer yields the same result: An extra load of the pointer before each store.

I checked the assembler of some similar methods and the result is the same: It seems that a member of this always has to be reloaded before a store, even if such a load could simply be hoisted outside the loop. I will have to rewrite a lot of code to get rid of these additional stores, mainly by caching the pointer myself into a local variable that is declared above the hot code. But I always thought fiddling with such details as caching a pointer in a local variable would surely qualify for premature optimization in these days where compilers have gotten so clever. But it seems I am wrong here. Caching a member pointer in a hot loop seems to be a necessary manual optimization technique.

Abdu answered 10/10, 2014 at 8:38 Comment(16)
Not sure why this got a down-vote - it's an interesting question. FWIW I've seen similar optimisation problems with non-pointer member variables where the solution has been similar, i.e. cache the member variable in a local variable for the lifetime of the method. I'm guessing it's something to do with aliasing rules ?Carp
Looks like the compiler don't optimize because he cannot ensure that the member is not accessed through some "external" code. So if the member can be modified outside, then it should be reloaded each time is accessed. Seems to be considered like a kind of volatile...Abscess
No not using this-> is just syntactic sugar. The problem is related to the nature of the variables (local vs member) and the things that the compiler deduces from this fact.Abscess
Anything to do with pointer aliases ?Acuate
@Jean-BaptisteYunès: Then it would be a memory model issue. I don't know the C++11 memory model in detail, but I am quite sure that it does not mandate this (unless the variable is volatile or atomic which is not the case here).Abdu
@YvesDaoust: Aliasing could be an issue, but I don't think so. Even IF source and target were aliased, the optimization would still be fine, since we do not repeatedly read from source but always cache the input before into variable t.Abdu
Same for me, don't know the memory model in detail. But seems much more a semantic issue than a memory model here. A member can be accessed outside which is not the case for a local variable. A local variable has automatic storage duration which is probably not the case for your member.Abscess
I think your missing the some arch specifier for gcc such as -msse4.2 (or what applies to your machine) for it to issue vector operations.Work
Does this code even work? As far as I can tell it only fills the least significant byte of t. And extending a potentially signed type like char to an unsigned type is bad style as well. I'd either accept an pointer to unsigned chars as input or make a two step cast, first to unsigned char, then to the larger type. Since the algorithm clearly assumes 8 bit bytes, I'd consider using UInt8_t* source.Bamberger
@Bamberger The source is not uint8_t, it is a 3bit int stuffed into bytes. Only the target is a real uint8_t. I use char here as "simply data", not as a specific int type. But, you are right. I missed a reinterpret cast here. I added it now. However, the issue is still there, so the missing cast was definitely not the problem.Abdu
@Abdu You added the cast in completely the wrong place, you need to cast source, not target. And your input is a sequence of 8 bit chars, since else your assumption that 6 chars correspond to 48 bits doesn't hold.Bamberger
@CodesInChaos: You read too fast :D. I fixed that mistake 20 seconds after my first edit. Yes, it is a sequence of 8 bit chars, but that is what char is. I used char instead of uint8_t only to state my intention that this is only data (stored in 8 bit numbers) instead of real 8-bit numbers.Abdu
New version looks still broken to me. You shouldn't dereference this->target since you want the pointer, not its target. Also your code triggers UB if size < 8.Bamberger
@CodesInChaos: You are right, * is another C&P error. The source buffer is guaranteed to have enough trailing memory allocated so that no UB is triggered. Now, the code should finally be correct. The issue remains :).Abdu
I added an answer, I found it a bit frustrating that the existing answer explain the issue without actually connecting the dots. So someone who does not really get it won't benefit from their answers.Catoptrics
As a more semantic matter, "premature optimization" applies only to optimization that is premature, i.e., before profiling has found it to be an issue. In this case, you diligently profiled and decompiled and found the source of an issue and formulated and profiled a solution. It is absolutely not "premature" to apply that solution.Nafis
A
116

Pointer aliasing seems to be the problem, ironically between this and this->target. The compiler is taking into account the rather obscene possibility that you initialized:

this->target = &this

In that case, writing to this->target[0] would alter the contents of this (and thus, this->target).

The memory aliasing problem is not restricted to the above. In principle, any use of this->target[XX] given an (in)appropriate value of XX might point to this.

I am better versed in C, where this can be remedied by declaring pointer variables with the __restrict__ keyword.

Aubreir answered 10/10, 2014 at 9:29 Comment(4)
I can confirm this! Changing target from uint8_t to uint16_t (so that strict aliasing rules kick in) changed it. With uint16_t, the load is always optimized out.Abdu
Relevant: #16138737Franzoni
Changing the contents of this is not what you mean (it is not a variable); you mean changing the contents of *this.Hunchbacked
@Abdu mind elaborating how strict alias kicks in and fixes the issue?Weirdo
C
36

Strict aliasing rules allows char* to alias any other pointer. So this->target may alias with this, and in your code method, the first part of the code,

target[0] = t & 0x7;
target[1] = (t >> 3) & 0x7;
target[2] = (t >> 6) & 0x7;

is in fact

this->target[0] = t & 0x7;
this->target[1] = (t >> 3) & 0x7;
this->target[2] = (t >> 6) & 0x7;

as this may be modified when you modify this->target content.

Once this->target is cached into a local variable, the alias is no longer possible with the local variable.

Cognition answered 10/10, 2014 at 10:2 Comment(2)
So, can we say as a general rule: Whenever you have a char* or void* in your struct, be sure to cache it in a local variable before writing to it?Abdu
In fact it is when you use a char*, not necessary as a member.Cognition
C
26

The issue here is strict aliasing which says that we are allowed to alias through a char* and so that prevents compiler optimization in your case. We are not allowed to alias through a pointer of a different type which would be undefined behavior, normally on SO we see this problem which is users attempting to alias through incompatible pointer types.

It would seem reasonable to implement uint8_t as a unsigned char and if we look at cstdint on Coliru it includes stdint.h which typedefs uint8_t as follows:

typedef unsigned char       uint8_t;

if you used another non-char type then the compiler should be able to optimize.

This is covered in the draft C++ standard section 3.10 Lvalues and rvalues which says:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined

and includes the following bullet:

  • a char or unsigned char type.

Note, I posted a comment on possible work arounds in a question that asks When is uint8_t ≠ unsigned char? and the recommendation was:

The trivial workaround, however, is to use the restrict keyword, or to copy the pointer to a local variable whose address is never taken so that the compiler does not need to worry about whether the uint8_t objects can alias it.

Since C++ does not support the restrict keyword you have to rely on compiler extension, for example gcc uses __restrict__ so this is not totally portable but the other suggestion should be.

Catoptrics answered 10/10, 2014 at 12:15 Comment(1)
This is an example of a place where the Standard is worse for optimizers than would be a rule would allow a compiler to assume that between two accesses to an object of type T, or such an access and the start or end of a loop/function wherein it occurs, all accesses to the storage will use the same object unless an intervening operation uses that object (or a pointer/reference to it) to derive a pointer or reference to some other object. Such a rule would eliminate the need for the "character-type exception" which can kill the performance of code that works with sequences of bytes.Oconnell

© 2022 - 2024 — McMap. All rights reserved.