Why does random extra code improve performance?

struct Node {
    Node *N[SIZE];
    int value;
};

struct Trie {
    Node root;

    Node* findNode(Key *key) {
        Node *C = &root;
        int u; // int, not char: next() returns -1 at the end of the key
        while (1) {
            u = key->next();
            if (u < 0) return C;
         // if (C->N[0] == C->N[0]); // this line will speed up execution significantly
            C = C->N[u];
            if (C == 0) return 0;
        }
    }
    void addNode(Key *key, int value){...};
};

In this implementation of a prefix tree (aka trie) I found out that 90% of findNode()'s execution time is taken by a single operation, C = C->N[u];

In my attempt to speed up this code, I randomly added the line that is commented out in the snippet above, and the code became 30% faster! Why is that?
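For anyone trying to reproduce the measurement: a sampling profiler can attribute the time to the individual load. On Linux, for example (the binary name trie is only an illustration):

perf record ./trie    # sample where the CPU cycles go
perf report           # drill down into findNode() to see the hot instruction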

UPDATE

Here is the complete program.

#include "stdio.h"
#include "sys/time.h"

long time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff; // truncate the seconds so tv_sec * 1000 fits even in a 32-bit long; we only take differences
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

// Iterates over a byte buffer one 4-bit nibble at a time, low nibble first.
struct BitScanner {
    void *p;
    int count, pos;
    BitScanner (void *p, int count) {
        this->p = p;
        this->count = count;
        pos = 0;
    }
    // Return the next 4-bit nibble of the key, or -1 once the key is exhausted.
    int next() {
        int bpos = pos >> 1;            // byte index: two nibbles per byte
        if (bpos >= count) return -1;
        unsigned char b = ((unsigned char*)p)[bpos];
        if (pos++ & 1) return b >> 4;   // odd positions: high nibble
        return b & 0xf;                 // even positions: low nibble
    }
};

struct Node {
    Node *N[16];   // one child per 4-bit nibble
    int64_t value; // -1 means no value stored at this node
    Node() : N(), value(-1) { }
};

struct Trie16 {
    Node root;

    bool add(void *key, int count, int64_t value) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next();
            if (u < 0) {
                if (C->value == -1) {
                    C->value = value;
                    return true; // value added
                }
                C->value = value;
                return false; // value replaced
            }
            Node *Q = C->N[u];
            if (Q) {
                C = Q;
            } else {
                C = C->N[u] = new Node;
            }
        }
    }

    Node* findNode(void *key, int count) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next(); // int, not char: on platforms where char is unsigned, the -1 would be lost
            if (u < 0) return C;
//          if (C->N[0] == C->N[1]); // the "random" line that speeds up execution
            C = C->N[u];
            if (C == 0) return 0;
        }
    }
};

int main() {
    long T = time1000();
    Trie16 trie;
    int64_t STEPS = 100000, STEP = 500000000, key;
    key = 0;
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        trie.add(&key, 8, key + 222);
    }
    printf("insert time:%i\n",time1000() - T); T = time1000();
    int err = 0;
    key = 0;
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        Node *N = trie.findNode(&key, 8);
        if (N==0 || N->value != key+222) err++;
    }
    printf("find time:%i\n",time1000() - T); T = time1000();
    printf("errors:%i\n", err);
}
Myocarditis answered 16/7, 2015 at 8:50 Comment(22)
What compile flags did you use? Also, did you make multiple tests, or just one? – Sella
Memory access speed is a common bottleneck these days, where everything else is fast. Beware of those ->; they can be very expensive. – Grating
@Sella, I did multiple tests, hundreds in fact; this amused me and captured my attention for hours. I used both clang and gcc with both -O0 and -O3. – Myocarditis
Did you try to look at the assembly? – Refraction
@Myocarditis Where did you get the idea to add code in your attempt to gain speed? – Swanky
You should post an MCVE. – Carousel
@Swanky, I was debugging and wanted to count the number of pointer accesses, and I found that the counter added for debugging improved performance significantly. – Myocarditis
@Myocarditis That's weird indeed! – Swanky
I think the improvement is not related to dummy++ but to if (C->N[0] == C->N[1]). I guess this check causes the cache to be used, so the data for C = C->N[u] is read from the cache immediately. – Refraction
With small programs this is common behavior, because the instruction cache ends up with different content. My suggestion is to compile and save both binaries, run them one after the other and then in reverse order, and compare the times. – Burgage
What's the value/type of SIZE? – Furnishings
Maybe the CPU starts prefetching N when it sees it being accessed multiple times? – Harte
I created a complete example; should I add it to the post? – Myocarditis
@Myocarditis, yes, that will make it possible to reproduce the effect. – Trucker
@Burgage, tried that, the effect persists. – Myocarditis
I just found out that the dummy++ part is redundant; if (C->N[0] == C->N[1]); alone is enough to cause the effect. – Myocarditis
Could you try replacing the "random code" with __builtin_prefetch on gcc? gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – Ewing
@Ewing, I replaced the "random code" with __builtin_prefetch(C->N[u]);, and it has the same effect. In fact it is even 5% faster. – Myocarditis
@Myocarditis: Problem solved, then? – Ewing
@Ewing, right, but I am not sure whether I should accept Harte's answer. – Myocarditis
The key do-nothing line is different between the original post and the update. In the original post you wrote if (C->N[0] == C->N[0]);, which tests a tautology and then does nothing; I'd expect the compiler to optimize that out (assuming no side effects from operator overloads). On the other hand, if (C->N[0] == C->N[1]); is not a tautology and could affect caching, branch prediction, speculative execution, etc. – Team
@Team, but both have the same effect. – Myocarditis

This is largely a guess, but from what I have read about the CPU data prefetcher, it only prefetches when it sees multiple accesses to the same region of memory and those accesses match its trigger patterns, for example when they look like a scan. In your case, if there is only a single access to C->N the prefetcher is not interested; however, if there are multiple accesses and it can predict that the later ones reach further into the same piece of memory, that can make it prefetch more than one cache line.

If that is what is happening, then C->N[u] does not have to wait for the data to arrive from RAM and is therefore faster.
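As Ewing suggested in the comments, the prefetch can also be requested explicitly instead of relying on a dummy read. Below is a minimal sketch of findNode() using GCC/Clang's __builtin_prefetch (a variant for illustration, not the original benchmark):

    Node* findNode(void *key, int count) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next();
            if (u < 0) return C;
            // Start loading the child node's cache line now, so the next
            // iteration's C->N[u] does not have to wait for RAM.
            __builtin_prefetch(C->N[u]);
            C = C->N[u];
            if (C == 0) return 0;
        }
    }

The OP confirms in the comments that this variant has the same effect and is in fact about 5% faster than the dummy comparison.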

Harte answered 16/7, 2015 at 9:27 Comment(1)
Correct! As I commented, the magic is done by if (C->N[0] == C->N[1]), not by dummy++. – Refraction

It looks like what you are doing is preventing processor stalls by delaying the execution of code until the data is available locally.

Doing it this way is very error prone and unlikely to keep working consistently. The better way is to get the compiler to do this for you. By default most compilers generate code for a generic processor family, but if you look at the available flags you can usually find one for specifying your exact processor, so the compiler can generate more specific code (such as prefetches and stall-avoiding code).

See GCC: how is march different from mtune?; the second answer goes into some detail: https://mcmap.net/q/56955/-how-is-march-different-from-mtune
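For example, with gcc or clang and the source saved as trie.cpp (file and binary names are illustrative):

g++ -O3 -march=native trie.cpp -o trie   # generate code for the exact CPU of the build machine
g++ -O3 -mtune=native trie.cpp -o trie   # keep generic code, but schedule it for this CPU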

Vacillate answered 16/7, 2015 at 14:39 Comment(1)
FTR: on gcc and clang, look for -march=native. – Ewing

A write operation is more costly than a read. If you look at C = C->N[u];, you see that the CPU executes a write to the variable C on every iteration. But when you perform if (C->N[0] == C->N[1]) dummy++;, the write to dummy is executed only when C->N[0] == C->N[1]. So you save many of the CPU's write instructions by using the if condition.

Parka answered 16/7, 2015 at 9:26 Comment(1)
If you are talking about CPU instructions, dummy++ and C = C->N[u]; amount to the same thing. – Parka
