Is the ranged based for loop beneficial to performance?
Asked Answered
S

6

49

Reading various questions here on Stack Overflow about C++ iterators and performance**, I started wondering if for(auto& elem : container) gets "expanded" by the compiler into the best possible version? (Kind of like auto, which the compiler infers into the right type right away and is therefore never slower and sometimes faster).

** For example, does it matter if you write

for(iterator it = container.begin(), eit = container.end(); it != eit; ++it)

or

for(iterator it = container.begin(); it != container.end(); ++it)

for non-invalidating containers?

Stumpy answered 30/5, 2012 at 18:2 Comment(2)
How does using auto instead of typedefs help performance?Memberg
@JonathanWakely Sorry, mentioning typedefs was an error on my part. I can't find a proper reference, but what I ment is what Stephen T. Lavavej mentaions at 49:30+ in channel9.msdn.com/Events/GoingNative/GoingNative-2012/STL11-Magic-SecretsStumpy
A
42

The Standard is your friend, see [stmt.ranged]/1

For a range-based for statement of the form

for ( for-range-declaration : expression ) statement

let range-init be equivalent to the expression surrounded by parentheses

( expression )

and for a range-based for statement of the form

for ( for-range-declaration : braced-init-list ) statement

let range-init be equivalent to the braced-init-list. In each case, a range-based for statement is equivalent to

{
  auto && __range = range-init;
  for ( auto __begin = begin-expr,
             __end = end-expr;
        __begin != __end;
        ++__begin )
  {
    for-range-declaration = *__begin;
    statement
  }
}

So yes, the Standard guarantees that the best possible form is achieved.

And for a number of containers, such as vector, it is undefined behavior to modify (insert/erase) them during this iteration.

Assertion answered 30/5, 2012 at 18:40 Comment(3)
@Evgeni: modifying a container while looping over it is problematic in general (in many other languages also). There are usually ways to do it correctly, but they are often more difficult than a trivial range-based loop. The same goes here: you have to iterate manually and update the iterator with for example the result from eraseLiverpudlian
How do you conclude that it is guaranteed to be best possible performance? I'm not clear how that conclusion is reached.Samhita
@thc: I am not saying best possible performance, but best possible form. Specifically, note that (1) range-init is only evaluated once, (2) begin-expr and end-expr are only evaluated once and (3) pre-increment is used. Algorithmitically speaking, this is the form of least complexity. Whether it optimizes to best possible performance is likely, but not guaranteed. Optimizers are fickle.Assertion
I
28

Out of curiosity I decided to look at the assembly code for both approaches:

int foo1(const std::vector<int>& v) {
    int res = 0;
    for (auto x : v)
        res += x;
    return res;
}

int foo2(const std::vector<int>& v) {
    int res = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
      res += *it;
    return res;
}

And the assembly code (with -O3 and gcc 4.6) is exactly the same for both approaches (code for foo2 is omitted, since it is exactly the same):

080486d4 <foo1(std::vector<int, std::allocator<int> > const&)>:
80486d4:       8b 44 24 04             mov    0x4(%esp),%eax
80486d8:       8b 10                   mov    (%eax),%edx
80486da:       8b 48 04                mov    0x4(%eax),%ecx
80486dd:       b8 00 00 00 00          mov    $0x0,%eax
80486e2:       39 ca                   cmp    %ecx,%edx
80486e4:       74 09                   je     80486ef <foo1(std::vector<int, std::allocator<int> > const&)+0x1b>
80486e6:       03 02                   add    (%edx),%eax
80486e8:       83 c2 04                add    $0x4,%edx
80486eb:       39 d1                   cmp    %edx,%ecx
80486ed:       75 f7                   jne    80486e6 <foo1(std::vector<int, std::allocator<int> > const&)+0x12>
80486ef:       f3 c3                   repz ret 

So, yes, both approaches are the same.

UPDATE: The same observation holds for other containers (or element types) such as vector<string> and map<string, string>. In those cases, it is especially important to use a reference in the ranged-based loop. Otherwise a temporary is created and lots of extra code appears (in the previous examples it was not needed since the vector contained just int values).

For the case of map<string, string> the C++ code snippet used is:

int foo1(const std::map<std::string, std::string>& v) {
    int res = 0;
    for (const auto& x : v) {
        res += (x.first.size() + x.second.size());
    }
    return res;
}

int foo2(const std::map<std::string, std::string>& v) {
    int res = 0;
    for (auto it = v.begin(), end = v.end(); it != end; ++it) {
        res += (it->first.size() + it->second.size());
    }
    return res;
}

And the assembly code (for both cases) is:

8048d70:       56                      push   %esi
8048d71:       53                      push   %ebx
8048d72:       31 db                   xor    %ebx,%ebx
8048d74:       83 ec 14                sub    $0x14,%esp
8048d77:       8b 74 24 20             mov    0x20(%esp),%esi
8048d7b:       8b 46 0c                mov    0xc(%esi),%eax
8048d7e:       83 c6 04                add    $0x4,%esi
8048d81:       39 f0                   cmp    %esi,%eax
8048d83:       74 1b                   je     8048da0 
8048d85:       8d 76 00                lea    0x0(%esi),%esi
8048d88:       8b 50 10                mov    0x10(%eax),%edx
8048d8b:       03 5a f4                add    -0xc(%edx),%ebx
8048d8e:       8b 50 14                mov    0x14(%eax),%edx
8048d91:       03 5a f4                add    -0xc(%edx),%ebx
8048d94:       89 04 24                mov    %eax,(%esp)
8048d97:       e8 f4 fb ff ff          call   8048990 <std::_Rb_tree_increment(std::_Rb_tree_node_base const*)@plt>
8048d9c:       39 c6                   cmp    %eax,%esi
8048d9e:       75 e8                   jne    8048d88 
8048da0:       83 c4 14                add    $0x14,%esp
8048da3:       89 d8                   mov    %ebx,%eax
8048da5:       5b                      pop    %ebx
8048da6:       5e                      pop    %esi
8048da7:       c3                      ret    
Inglebert answered 30/5, 2012 at 18:14 Comment(6)
The compiler optimized v.end() in this case, it may not always be able to do so (for other containers).Sunward
Like @Sunward says, this is not proof.Holo
the OP also included the option to cache v.end() and the assembly code would still look the same. If you are more comfortable with that...Inglebert
Nice, thank you! Am I right with the assumpttion that they (teh codez) are the same because std::vector invalidates the upon insertion (reallocation, actually)? And that for other containers (e.g. std::list) the end() gets cached?Stumpy
@Evgeni: no you are wrong. It depends on the complexity of end, the transparency of the loop body (if you pass a reference to the container to an opaque functions all bets are off) and whether the optimizer is good enough to deduce it or not. Despite what most books do, it is unconditionally better to cache the end() call for for-each equivalents. It also documents that end() won't bulge for future readers.Assertion
note that using std::for_each also results in the same compiled code.Imperishable
S
27

Range-for is as fast as possible since it caches the end iterator[citation provided], uses pre-increment and only dereferences the iterator once.

so if you tend to write:

for(iterator i = cont.begin(); i != cont.end(); i++) { /**/ }

Then, yes, range-for may be slightly faster, since it's also easier to write there's no reason not to use it (when appropriate).

N.B. I said it's as fast as possible, it isn't however faster than possible. You can achieve the exact same performance if you write your manual loops carefully.

Sunward answered 30/5, 2012 at 18:8 Comment(0)
N
5

No. It is same as the old for loop with iterators. After all, the range-based for works with iterators internally. The compiler just produces equivalent code for both.

Naif answered 30/5, 2012 at 18:5 Comment(1)
I knew it uses iterators, but what liked to know if the compiler produces the best possible version of the loop with iterators...Stumpy
S
5

It's possibly faster, in rare cases. Since you can't name the iterator, an optimizer can more easily prove that your loop cannot modify the iterator. This affects e.g. loop unrolling optimizations.

Smithy answered 30/5, 2012 at 22:13 Comment(0)
P
1

Hardware accelerators unroll loops, thus do not want to recompute non-trivial S::end() (such as for(auto x = s->begin(); s->end() != x; ++x) {}) for each iteration of loops.

std::for_each (or for(auto &x : s) {}) should solve altera-id-dependent-backward-branch ( What is a loop with an ID-dependent backward branch? ) plus other such issues.

Note: for(auto x : s) {} can trigger problems (such as performance-for-range-copy or stack-use-after-scope) unless you mark iterators as references (for(auto &x : s) {})

Pyemia answered 18/6 at 14:44 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.