Why does random extra code improve performance?

struct Node {
    Node *N[SIZE];
    int value;
};

struct Trie {
    Node root;

    Node* findNode(Key *key) {
        Node *C = &root;
        int u; // int, not char: next() returns -1 at the end of the key
        while (1) {
            u = key->next();
            if (u < 0) return C;
         // if (C->N[0] == C->N[0]); // this line will speed up execution significantly
            C = C->N[u];
            if (C == 0) return 0;
        }
    }
    void addNode(Key *key, int value){...};
};

In this implementation of a prefix tree (aka trie) I found out that 90% of findNode()'s execution time is taken by a single operation, C = C->N[u];

In my attempt to speed up this code, I randomly added the line that is commented out in the snippet above, and the code became 30% faster! Why is that?
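For anyone trying to reproduce the measurement: a sampling profiler can attribute the time to the individual load. On Linux, for example (the binary name trie is only an illustration):

perf record ./trie    # sample where the CPU cycles go
perf report           # drill down into findNode() to see the hot instruction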

UPDATE

Here is the complete program.

#include "stdio.h"
#include "sys/time.h"

long time1000() {
    timeval val;
    gettimeofday(&val, 0);
    val.tv_sec &= 0xffff; // truncate the seconds so tv_sec * 1000 fits even in a 32-bit long; we only take differences
    return val.tv_sec * 1000 + val.tv_usec / 1000;
}

// Iterates over a byte buffer one 4-bit nibble at a time, low nibble first.
struct BitScanner {
    void *p;
    int count, pos;
    BitScanner (void *p, int count) {
        this->p = p;
        this->count = count;
        pos = 0;
    }
    // Return the next 4-bit nibble of the key, or -1 once the key is exhausted.
    int next() {
        int bpos = pos >> 1;            // byte index: two nibbles per byte
        if (bpos >= count) return -1;
        unsigned char b = ((unsigned char*)p)[bpos];
        if (pos++ & 1) return b >> 4;   // odd positions: high nibble
        return b & 0xf;                 // even positions: low nibble
    }
};

struct Node {
    Node *N[16];   // one child per 4-bit nibble
    int64_t value; // -1 means no value stored at this node
    Node() : N(), value(-1) { }
};

struct Trie16 {
    Node root;

    bool add(void *key, int count, int64_t value) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next();
            if (u < 0) {
                if (C->value == -1) {
                    C->value = value;
                    return true; // value added
                }
                C->value = value;
                return false; // value replaced
            }
            Node *Q = C->N[u];
            if (Q) {
                C = Q;
            } else {
                C = C->N[u] = new Node;
            }
        }
    }

    Node* findNode(void *key, int count) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next(); // int, not char: on platforms where char is unsigned, the -1 would be lost
            if (u < 0) return C;
//          if (C->N[0] == C->N[1]); // the "random" line that speeds up execution
            C = C->N[u];
            if (C == 0) return 0;
        }
    }
};

int main() {
    long T = time1000();
    Trie16 trie;
    int64_t STEPS = 100000, STEP = 500000000, key;
    key = 0;
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        trie.add(&key, 8, key + 222);
    }
    printf("insert time:%i\n",time1000() - T); T = time1000();
    int err = 0;
    key = 0;
    for (int i = 0; i < STEPS; i++) {
        key += STEP;
        Node *N = trie.findNode(&key, 8);
        if (N==0 || N->value != key+222) err++;
    }
    printf("find time:%i\n",time1000() - T); T = time1000();
    printf("errors:%i\n", err);
}
Myocarditis answered 16/7, 2015 at 8:50 Comment(22)
What compile flags did you use? Also, did you make multiple tests, or just one? – Sella
Memory access speed is a common bottleneck these days, where everything else is fast. Beware of those ->; they can be very expensive. – Grating
@Sella, I did multiple tests, hundreds in fact; this amused me and captured my attention for hours. I used both clang and gcc with both -O0 and -O3. – Myocarditis
Did you try to look at the assembly? – Refraction
@Myocarditis Where did you get the idea to add code in your attempt to gain speed? – Swanky
You should post an MCVE. – Carousel
@Swanky, I was debugging and wanted to count the number of pointer accesses, and I found that the counter added for debugging improved performance significantly. – Myocarditis
@Myocarditis That's weird indeed! – Swanky
I think the improvement is not related to dummy++ but to if (C->N[0] == C->N[1]). I guess this check causes the cache to be used, so the data for C = C->N[u] is read from the cache immediately. – Refraction
With small programs this is common behavior, because the instruction cache ends up with different content. My suggestion is to compile and save both binaries, run them one after the other and then in reverse order, and compare the times. – Burgage
What's the value/type of SIZE? – Furnishings
Maybe the CPU starts prefetching N when it sees it being accessed multiple times? – Harte
I created a complete example; should I add it to the post? – Myocarditis
@Myocarditis, yes, that will make it possible to reproduce the effect. – Trucker
@Burgage, tried that, the effect persists. – Myocarditis
I just found out that the dummy++ part is redundant; if (C->N[0] == C->N[1]); alone is enough to cause the effect. – Myocarditis
Could you try replacing the "random code" with __builtin_prefetch on gcc? gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html – Ewing
@Ewing, I replaced the "random code" with __builtin_prefetch(C->N[u]);, and it has the same effect. In fact it is even 5% faster. – Myocarditis
@Myocarditis: Problem solved, then? – Ewing
@Ewing, right, but I am not sure whether I should accept Harte's answer. – Myocarditis
The key do-nothing line is different between the original post and the update. In the original post you wrote if (C->N[0] == C->N[0]);, which tests a tautology and then does nothing; I'd expect the compiler to optimize that out (assuming no side effects from operator overloads). On the other hand, if (C->N[0] == C->N[1]); is not a tautology and could affect caching, branch prediction, speculative execution, etc. – Team
@Team, but both have the same effect. – Myocarditis

This is largely a guess, but from what I have read about the CPU data prefetcher, it only prefetches when it sees multiple accesses to the same region of memory and those accesses match its trigger patterns, for example when they look like a scan. In your case, if there is only a single access to C->N the prefetcher is not interested; however, if there are multiple accesses and it can predict that the later ones reach further into the same piece of memory, that can make it prefetch more than one cache line.

If that is what is happening, then C->N[u] does not have to wait for the data to arrive from RAM and is therefore faster.
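As Ewing suggested in the comments, the prefetch can also be requested explicitly instead of relying on a dummy read. Below is a minimal sketch of findNode() using GCC/Clang's __builtin_prefetch (a variant for illustration, not the original benchmark):

    Node* findNode(void *key, int count) {
        Node *C = &root;
        BitScanner B(key, count);
        while (true) {
            int u = B.next();
            if (u < 0) return C;
            // Start loading the child node's cache line now, so the next
            // iteration's C->N[u] does not have to wait for RAM.
            __builtin_prefetch(C->N[u]);
            C = C->N[u];
            if (C == 0) return 0;
        }
    }

The OP confirms in the comments that this variant has the same effect and is in fact about 5% faster than the dummy comparison.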

Harte answered 16/7, 2015 at 9:27 Comment(1)
Correct! As I commented, the magic is done by if (C->N[0] == C->N[1]), not by dummy++. – Refraction

It looks like what you are doing is preventing processor stalls by delaying the execution of code until the data is available locally.

Doing it this way is very error prone and unlikely to keep working consistently. The better way is to get the compiler to do this for you. By default most compilers generate code for a generic processor family, but if you look at the available flags you can usually find one for specifying your exact processor, so the compiler can generate more specific code (such as prefetches and stall-avoiding code).

See GCC: how is march different from mtune?; the second answer goes into some detail: https://mcmap.net/q/56955/-how-is-march-different-from-mtune
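For example, with gcc or clang and the source saved as trie.cpp (file and binary names are illustrative):

g++ -O3 -march=native trie.cpp -o trie   # generate code for the exact CPU of the build machine
g++ -O3 -mtune=native trie.cpp -o trie   # keep generic code, but schedule it for this CPU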

Vacillate answered 16/7, 2015 at 14:39 Comment(1)
FTR: on gcc and clang, look for -march=native. – Ewing

A write operation is more costly than a read. If you look at C = C->N[u];, you see that the CPU executes a write to the variable C on every iteration. But when you perform if (C->N[0] == C->N[1]) dummy++;, the write to dummy is executed only when C->N[0] == C->N[1]. So you save many of the CPU's write instructions by using the if condition.

Parka answered 16/7, 2015 at 9:26 Comment(1)
If you are talking about CPU instructions, dummy++ and C = C->N[u]; amount to the same thing. – Parka
