Why does changing `const ull` to `const ull&` in function parameter result in performance gain?
Asked Answered
D

2

28

So with the following code, changing the type of the parameter x from const ull to const ull& (with typedef unsigned long long ull) results in a roughly 25% speedup when compiled with gcc 4.7.2 and flags -O3 -std=c++11 -g, and I can't figure out why this would make such a big difference.

static void inline single_mult(const std::vector<ull>::iterator& data,
                  const std::vector<ull>::const_iterator& rbegin,
                  const std::vector<ull>::const_iterator& rend,
                  const ull x) {
        ull tmp=0, carry=0, i=0;
        for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
                tmp = x*(*rhs_it) + data[i] + carry;
                if (tmp >= imax) {
                        carry = tmp >> numbits;
                        tmp &= imax - 1;
                } else { 
                        carry = 0;
                }
                data[i++] = tmp;
        }
        data[i] += carry;
}

It is called in the following function (for doing schoolbook long multiplication)

static void naive(std::vector<ull>::iterator data, 
              std::vector<ull>::const_iterator cbegin,
              std::vector<ull>::const_iterator cend  ,
              std::vector<ull>::const_iterator rbegin,
              std::vector<ull>::const_iterator rend) {

    for (auto data_it  = cbegin;
          data_it != cend; ++data_it) {
        if (*data_it != 0) {
            single_mult(data, rbegin, rend, *data_it);
        }
        ++data;
    }
}

The timing is done by calling clock() around a loop to measure how long it takes. Not the most accurate/precise way, but I figured a consistent 25% difference meant the difference was statistically significant.

Full working code block:

#include <vector>
#include <limits>
#include <ctime>
#include <iostream>

typedef unsigned long long ull;
typedef unsigned int uint;
const ull numbits = (ull) std::numeric_limits<uint>::digits;        
const ull imax = 1LL << numbits;     

static void inline single_mult(const std::vector<ull>::iterator& data,
              const std::vector<ull>::const_iterator& rbegin,
              const std::vector<ull>::const_iterator& rend,
              const ull x) {
    ull tmp=0, carry=0, i=0;
    for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {
            tmp = x*(*rhs_it) + data[i] + carry;
            if (tmp >= imax) {
                    carry = tmp >> numbits;
                    tmp &= imax - 1;
            } else {
                    carry = 0;
            }
            data[i++] = tmp;
    }
    data[i] += carry;
}

static void naive(std::vector<ull>::iterator data,
              std::vector<ull>::const_iterator cbegin,
              std::vector<ull>::const_iterator cend  ,
              std::vector<ull>::const_iterator rbegin,
              std::vector<ull>::const_iterator rend) {

    for (auto data_it  = cbegin; data_it != cend; ++data_it) {
            if (*data_it != 0) {
                    single_mult(data, rbegin, rend, *data_it);
            }
    ++data;
    }
}


int main() {
    int N = 10000;
    std::vector<ull> vec(N);
    for (int i=0; i<N; ++i) {
        vec[i] = i;
    }

    auto s1 = clock();
    int num = 10;
    std::vector<ull> result(2*N);
    for (int i=0; i<num; ++i) {
    //Need to zero out the vector or the result will be different.
        std::fill(result.begin(), result.end(), 0);
        naive(result.begin(), vec.cbegin(), vec.cend(), vec.cbegin(), vec.cend());
    }
    s1 = (clock() - s1);
    std::cout << "Took " << float(s1)/CLOCKS_PER_SEC << "seconds total." << std::endl;
}

And runtimes (I named the file that passes the value value.cpp and reference reference.cpp)

$ g++ -O3 -std=c++11 -g -o reference reference.cpp
$ g++ -O3 -std=c++11 -g -o value value.cpp
$ ./reference                                                                                                                                           
Took 1.05seconds total.                                                                        
$ ./value                                                                 
Took 1.83seconds total.            
Dann answered 11/2, 2013 at 3:53 Comment(18)
1) did you check the assembly? 2) is your word size 64 bits?Interpol
I assume that ull stands for unsigned long long, right?Damage
1.)There are differences in the assembly, but I don't understand at all what they would mean. Furthermore, when compiled with -O3 the function gets inlined, so I can't use GDB to view the disassembly of just the function. 2.) yes 3.) yes ull is a typedef for unsigned long longDann
The inlining says something. Can you show how this function is called? I have a hunch I know the cause.Mcdougal
I edited the question with calling code.Dann
Yeah, that just disproved my hypothesis. :(Mcdougal
What metric are you looking at in gprof? self-seconds?Cattima
Are you compiling for x86 or x64?Mcdougal
1)I'm not timing it in gprof, I just have calls to clock() around a loop to measure how long it takes. Not the most accurate/precise way, but I figured a consistent 25% difference meant the difference was statistically significant. 2)x64.Dann
I'm looking at the assembly and I can also see that it's different. The branching behavior is drastically different between the two versions. But I can't explain why.Mcdougal
@0xE6: Oops, I got -g and -p mixed up, sorry.Cattima
I love a good single_mult scotch.Cloudberry
It would help if you'd provide the code in a form that can be copy-pasted and actually compiles and works (ie. a single codeblock including correct imports, all required macros and constants, a main() that works when run, etc.). Lowering the effort required for people trying to help will very likely yield better answers.Nepean
naive has 3 } but only 2 {. Typo?Worsham
1.)Ok, I added a full working code block. 2.) Yes that was a typo, fixed it.Dann
The const-ref version seems to be using a register to pass the value, while the other one uses the stack. What is strange is that the compiler is not able to figure out: "hey, I could really optimize this by using a register".Py
cant see any ull& in your codeMemorandum
There are no ull& in the code i posted. In the full code block I posted, changing, in line 14, from const ull x to const ull& x is what results in the speedup.Dann
Y
31

I was able to reproduce your observation of the speedup, it was even more noticeable for me (1.75x faster). The problem seems to be when you pass x by value it enables the compiler to perform optimizations that it otherwise doesn't do, and these optimizations backfire, they apparently introduce an unintended stall in the processor. The compiler generates a couple of conditional moves, back to back, rather than compare and branch. The compare and branch runs much faster than the back-to-back conditional moves.

I was able to avoid this by simplifying the code that the compiler had trouble with, namely this

if (tmp >= imax) {
    carry = tmp >> numbits;
    tmp &= imax - 1;
} else {
    carry = 0;
}

can be reduced to

carry = tmp >> numbits;
tmp &= imax - 1;

Here's the version of gcc I'm using

g++ --version
g++ (GCC) 4.6.3 20120306 (Red Hat 4.6.3-2)

These are the commands I used, perf record will profile your code and perf report will annotate the source and disassembly with the results of the profile

 g++ -std=gnu++0x -O3 -g single_mult.cpp -o single_mult
 perf record ./single_mult
 perf report

Once in perf report press enter on main and select Annotate main, you'll see a disassembly of your program along with source code and the percent of the time the profiler found your program running at each instruction in the function... actually these numbers must be taken as a hint only, often you will see instructions with big counts when in fact it was the previous instruction that stalled, or missed in the cache, or there was a mis-predicted branch, etc. So when you see a big count look back to see what may have caused it. The reason is the profile is statistical, it interrupts your program at a constant rate and looks to see where the instruction pointer is, and often the interrupt happens while the processor is stalled due to a cache miss or a mis-predicted branch or some internal data dependency.

I increased the number of iterations to allow more time for the profiler

int N = 20000;

When x is passed by value it Took 11.58seconds total and this is what I see, notice the cmovbe instructions

       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
       :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                 
 11.10 :          400b40:       4c 89 c9                mov    %r9,%rcx                                                                                                                          
  0.00 :          400b43:       48 89 fa                mov    %rdi,%rdx                                                                                                                         
  0.01 :          400b46:       48 03 14 c6             add    (%rsi,%rax,8),%rdx                                                                                                                
 11.65 :          400b4a:       48 0f af 0c c3          imul   (%rbx,%rax,8),%rcx                                                                                                                
  0.99 :          400b4f:       48 01 ca                add    %rcx,%rdx                                                                                                                         
       :                    if (tmp >= imax) {                                                                                                                                                   
       :                            carry = tmp >> numbits;                                                                                                                                      
  2.25 :          400b52:       48 89 d7                mov    %rdx,%rdi                                                                                                                         
       :                            tmp &= imax - 1;                                                                                                                                            
 10.99 :          400b55:       48 89 d1                mov    %rdx,%rcx                                                                                                                         
       :                      const ull x) {                                                                                                                                                     
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
       :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                 
       :                    if (tmp >= imax) {                                                                                                                                                    
       :                            carry = tmp >> numbits;                                                                                                                                      
  0.69 :          400b58:       48 c1 ef 20             shr    $0x20,%rdi                                                                                                                        
       :                            tmp &= imax - 1;                                                                                                                                              
  9.54 :          400b5c:       83 e1 ff                and    $0xffffffff,%ecx                                                                                                                   
  9.05 :          400b5f:       4c 39 c2                cmp    %r8,%rdx                                                                                                                          
 10.78 :          400b62:       49 0f 46 fb             cmovbe %r11,%rdi                                                                                                                          
       :                    } else {                                                                                                                                                              
       :                            carry = 0;                                                                                                                                                    
       :                    }                                                                                                                                                                     
       :                    data[i++] = tmp;                                                                                                                                                     
 20.73 :          400b66:       48 83 c0 01             add    $0x1,%rax                                                                                                                          
  0.02 :          400b6a:       4c 39 c2                cmp    %r8,%rdx                                                                                                                           
  0.17 :          400b6d:       48 0f 46 ca             cmovbe %rdx,%rcx                                                                                                                         
       :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                            
       :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                    
       :                      const std::vector<ull>::const_iterator& rend,                                                                                                                      
       :                      const ull x) {                                                                                                                                                     
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
 11.47 :          400b71:       4c 39 d0                cmp    %r10,%rax                                                                                                                         
       :                            carry = tmp >> numbits;                                                                                                                                      
       :                            tmp &= imax - 1;                                                                                                                                             
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
       :                    }                                                                                                                                                                    
       :                    data[i++] = tmp;                                                                                                                                                     
  0.01 :          400b74:       48 89 4c c6 f8          mov    %rcx,-0x8(%rsi,%rax,8)                                                                                                            
       :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                           
       :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                    
       :                      const std::vector<ull>::const_iterator& rend,                                                                                                                      
       :                      const ull x) {                                                                                                                                                     
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
  0.53 :          400b79:       75 c5                   jne    400b40 <main+0x250>                                                                                                               
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
       :                    }                                                                                                                                                                    
       :                    data[i++] = tmp;                                                                                                                                                     
       :            }                                                                                                                                                                             
       :            data[i] += carry;                                                                                                                                                            
  0.00 :          400b7b:       4a 01 3c d6             add    %rdi,(%rsi,%r10,8)                                                                                                                
  0.01 :          400b7f:       48 83 c5 08             add    $0x8,%rbp                                                                                                                         

When x is passed by reference it Took 6.59seconds total, this is what I see

       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
 20.90 :          400b30:       48 8b 17                mov    (%rdi),%rdx                                                                                                                       
       :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                 
  1.38 :          400b33:       49 0f af 14 c1          imul   (%r9,%rax,8),%rdx                                                                                                                   
  4.82 :          400b38:       48 03 0c c6             add    (%rsi,%rax,8),%rcx                                                                                                                
 22.41 :          400b3c:       48 01 ca                add    %rcx,%rdx                                                                                                                         
       :                    if (tmp >= imax) {                                                                                                                                                   
       :                            carry = tmp >> numbits;                                                                                                                                      
       :                            tmp &= imax - 1;                                                                                                                                               
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
  2.95 :          400b3f:       31 c9                   xor    %ecx,%ecx                                                                                                                         
       :                      const std::vector<ull>::const_iterator& rend,                                                                                                                      
       :                      const ull &x) {                                                                                                                                                    
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
       :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                 
       :                    if (tmp >= imax) {                                                                                                                                                   
  0.23 :          400b41:       4c 39 d2                cmp    %r10,%rdx                                                                                                                         
  0.00 :          400b44:       76 0a                   jbe    400b50 <main+0x260>                                                                                                               
       :                            carry = tmp >> numbits;                                                                                                                                      
  2.27 :          400b46:       48 89 d1                mov    %rdx,%rcx                                                                                                                         
       :                            tmp &= imax - 1;                                                                                                                                             
  1.29 :          400b49:       83 e2 ff                and    $0xffffffff,%edx                                                                                                                  
       :                      const ull &x) {                                                                                                                                                    
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
       :                    tmp = x*(*rhs_it) + data[i] + carry;                                                                                                                                 
       :                    if (tmp >= imax) {                                                                                                                                                   
       :                            carry = tmp >> numbits;                                                                                                                                      
  0.26 :          400b4c:       48 c1 e9 20             shr    $0x20,%rcx                                                                                                                        
       :                            tmp &= imax - 1;                                                                                                                                             
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
       :                    }                                                                                                                                                                    
       :                    data[i++] = tmp;                                                                                                                                                     
 19.67 :          400b50:       48 83 c0 01             add    $0x1,%rax                                                                                                                         
       :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                           
       :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                    
       :                      const std::vector<ull>::const_iterator& rend,                                                                                                                      
       :                      const ull &x) {                                                                                                                                                    
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
  0.53 :          400b54:       4c 39 c0                cmp    %r8,%rax                                                                                                                          
       :                            carry = tmp >> numbits;                                                                                                                                      
       :                            tmp &= imax - 1;                                                                                                                                             
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
       :                    }                                                                                                                                                                      
       :                    data[i++] = tmp;                                                                                                                                                     
  0.39 :          400b57:       48 89 54 c6 f8          mov    %rdx,-0x8(%rsi,%rax,8)                                                                                                            
       :        static void inline single_mult(const std::vector<ull>::iterator& data,                                                                                                           
       :                      const std::vector<ull>::const_iterator& rbegin,                                                                                                                    
       :                      const std::vector<ull>::const_iterator& rend,                                                                                                                      
       :                      const ull &x) {                                                                                                                                                    
       :            ull tmp=0, carry=0, i=0;                                                                                                                                                     
       :            for (auto rhs_it = rbegin; rhs_it != rend; ++rhs_it) {                                                                                                                       
 22.91 :          400b5c:       75 d2                   jne    400b30 <main+0x240>                                                                                                               
       :                    } else {                                                                                                                                                             
       :                            carry = 0;                                                                                                                                                   
       :                    }                                                                                                                                                                    
       :                    data[i++] = tmp;                                                                                                                                                     
       :            }                                                                                                                                                                            
       :            data[i] += carry;                                                                                                                                                            
  0.00 :          400b5e:       4a 01 0c c6             add    %rcx,(%rsi,%r8,8)                                                                                                                 
  0.00 :          400b62:       48 83 c7 08             add    $0x8,%rdi                                                                                                                         
Yorgo answered 11/2, 2013 at 20:6 Comment(12)
Thanks! That seems to be a reasonable explanation.Dann
Thank you! I'm glad you liked it.Yorgo
This is likely some optimizer bug, you should report it to GCC developer with the testcase and analisys: gcc.gnu.org/bugzilla/ but read this first gcc.gnu.org/bugs thanksSara
Note: I made the test with GCC 4.7.2 (Fedora 17 x86_64) with -march=native on an Intel Xeon X5650 (detected by GCC as a Core i7) and I got the same results.Sara
Thank you, @ydroneaud - I'll file the bug report and note that the behavior was confirmed in GCC 4.7.2 and provide a link to this page.Yorgo
@Yorgo Thank you. I'm waiting to see the explanation of GCC developers. Feel free to post the bug number here. Regards.Sara
The status of the discussion in that gcc bug report is that it is difficult for gcc to generate integer scalar conditional moves for x86 that run faster than compare and branch. The suggested workaround is to use -fno-tree-loop-if-convert gcc option to kill the optimization that is causing trouble in this case.Yorgo
I can also verify this: in my experience x86 cmov instructions are rarely more efficient than classic compare & branch. I think the instructions were more helpful on Intel Pentium 4, which had severe branch performance issues. But no chip since -- either by Intel or AMD -- benefits from something like cmovbe unless the branch in question is extremely likely to be mis-predicted (in my tests, the prediction rate needs to be less than 85% for a cmov performance win, which is considered abysmal by modern predictor standards).Vintager
@Yorgo thank you for posting the bug on bugzilla and keep us informed of the analysis by GCC developers.Sara
@Yorgo Could you please look at Why does gcc generate 15-20% faster code if I optimize for SIZE instead of speed? Based on your answer here, I believe you would be able to answer my question as well. Many thanks!Muntin
@Muntin I took a quick look and tried to reproduce the timings on the machines I have available without success, I'll look further to see if I find one where the timings are different.Yorgo
@Yorgo Thanks, I greatly appreciate your efforts. I hope you can reproduce the phenomenon on some machine.Muntin
H
3

Without seeing the assembler code, it's hard to be certain, but in general: passing an unsigned long long by value may require the compiler to generate extra code to copy the value onto the stack.

Passing it by reference allows the compiler to simply pass the pointer, which it may already have in a convenient register.

Heir answered 11/2, 2013 at 14:12 Comment(2)
This is what I initially thought too, but the OP says the function is inlined. Furthermore, there's an inner-loop that (which I presume) takes up much more time than parameter passing overhead.Mcdougal
If it would help, I can give the assembly of the main function from the code block I posted, as gdb cannot find either naive or single_mult. It's some 200-odd lines for each of the cases, and there seem to many significant differences, so I'm unsure of a good way to do that.Dann

© 2022 - 2024 — McMap. All rights reserved.