Optimize lookup tables to simple ALU

Question

Say you have a simple function that returns a value based on a lookup table, for example:

See edit about assumptions.

uint32_t
lookup0(uint32_t r) {
    static const uint32_t tbl[] = { 0, 1, 2, 3 };
    if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }

    /* Can replace with: `return r`.  */
    return tbl[r];
}


uint32_t
lookup1(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    if(r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }

    /* Can replace with: `return r / 2`.  */
    return tbl[r];
}

Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?

Motivation

The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks that you will need to map cpu_id -> numa_node. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth I can get, I am hoping for a generic approach that will cover most layouts.

Looking at what modern compilers currently do:

Neither Clang nor GCC is able to do this at the moment.

Clang is able to get lookup0 if you rewrite it as a switch/case statement.
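
For reference, the case0/case1 functions referenced in the assembly below look roughly like this (a reconstruction under the same out-of-range-is-unreachable assumption; the exact source isn't preserved here):

uint32_t
case0(uint32_t r) {
    switch(r) {
    case 0:  return 0;
    case 1:  return 1;
    case 2:  return 2;
    case 3:  return 3;
    default: __builtin_unreachable();
    }
}

uint32_t
case1(uint32_t r) {
    switch(r) {
    case 0:  return 0;
    case 1:  return 0;
    case 2:  return 1;
    case 3:  return 1;
    default: __builtin_unreachable();
    }
}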

lookup0(unsigned int):                            # @lookup0(unsigned int)
        movl    %edi, %eax
        movl    lookup0(unsigned int)::tbl(,%rax,4), %eax
        retq

...

case0(unsigned int):                              # @case0(unsigned int)
        movl    %edi, %eax
        retq

but can't get lookup1.

lookup1(unsigned int):                            # @lookup1(unsigned int)
        movl    %edi, %eax
        movl    .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
        retq

...

case1(unsigned int):                              # @case1(unsigned int)
        movl    %edi, %eax
        movl    .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
        retq

GCC can't get either.

lookup0(unsigned int):
        movl    %edi, %edi
        movl    lookup0(unsigned int)::tbl(,%rdi,4), %eax
        ret
lookup1(unsigned int):
        movl    %edi, %edi
        movl    lookup1(unsigned int)::tbl(,%rdi,4), %eax
        ret
case0(unsigned int):
        leal    -1(%rdi), %eax
        cmpl    $2, %eax
        movl    $0, %eax
        cmovbe  %edi, %eax
        ret
case1(unsigned int):
        subl    $2, %edi
        xorl    %eax, %eax
        cmpl    $1, %edi
        setbe   %al
        ret

I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
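
For illustration, a brute-force matcher can be as simple as testing a fixed menu of candidate expressions against every LUT entry. This is only a sketch (the match_lut helper is hypothetical, not the optimizer mentioned in Edit2) covering constant, identity, and divide-by-stride patterns:

#include <stddef.h>
#include <stdint.h>

/* Returns a description of an ALU replacement for tbl[0..n-1],
   or NULL if none of the candidate patterns match.  */
static const char *
match_lut(const uint32_t *tbl, size_t n) {
    size_t i;

    /* Constant: tbl[r] == tbl[0] for all r.  */
    for(i = 1; i < n && tbl[i] == tbl[0]; ++i);
    if(i == n) return "return C";

    /* Identity: tbl[r] == r (covers lookup0).  */
    for(i = 0; i < n && tbl[i] == i; ++i);
    if(i == n) return "return r";

    /* Divide by stride: tbl[r] == r / k (covers lookup1 with k == 2).  */
    for(size_t k = 2; k <= n; ++k) {
        for(i = 0; i < n && tbl[i] == i / k; ++i);
        if(i == n) return "return r / k";
    }

    return NULL;
}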

Edit:

The only true assumption is:

  1. All inputs have an index in the LUT.
  2. All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
  3. (Edit4) One more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <high_bound>] but nothing outside of that range.

In my case for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and there would be some counter-examples.
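
To make the topology case concrete, consider a hypothetical 8-CPU machine with 4 CPUs per NUMA node; under the density assumption above, the cpu_id -> numa_node LUT is exactly equivalent to a shift:

uint32_t
node_of(uint32_t cpu) {
    /* Hypothetical topology: CPUs 0-3 on node 0, CPUs 4-7 on node 1.  */
    static const uint32_t tbl[] = { 0, 0, 0, 0, 1, 1, 1, 1 };
    if(cpu >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }

    /* Can replace with: `return cpu >> 2`.  */
    return tbl[cpu];
}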

Edit2:

I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.

Edit3:

There seems to be some confusion about the question/initial example (I should have been clearer).

The lookup0/lookup1 examples are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and to different values.

The use case I have in mind is CPU topology, so ~256 - 1024 is where I would expect the upper bound in size, but for a generic LUT it could obviously get much larger.

Boito answered 18/5, 2022 at 16:45 Comment(14)
It is not clear to me what you want. You provided an answer to your own question which is specific to a given table. To my understanding, you want a generic solution. Your Godbolt example shows that GCC produces conditional moves, which are fast. They are not as fast as a simple return or a shift, but far faster than a LUT when the cache is cold, and than a jump table. If you expect the compiler to produce optimal code based on ALU-only instructions, then there is no solution, as you showed they are not able to do that yet (unless it is OK for you to write a Clang plugin).Bitthia
Compilers are getting better over time at micro-optimizations, but humans can still do much better than compilers in many cases (especially specific ones) if they spend enough time optimizing a small piece of code. The reason is that there are too many tricks that can be applied to a problem, and compilers often do not have all the specifics of the target problem. Additionally, there are many cases like this where adding a compiler optimization is not worth the effort, since compiler developers must also consider the maintainability and the performance of the optimizers.Bitthia
I don't think that you need this __builtin_unreachable stuff. Accessing an array out of bounds is already UB. The compiler should take advantage of that.Cymry
@JérômeRichard indeed, but the point is to do it generically based on system configs. The question isn't "what is the best way to optimize this LUT into ALU". It's "what's the best mechanism for generically optimizing a LUT into some ALU".Boito
@Cymry either way, I prefer to be explicit about things, and I'm not really interested in targeting anything outside of gcc/clang.Boito
Is there any assumption for the lookup table? Like that the values are non-decreasing? Or that the length is always 4?Cymry
I wonder if return ((r == 0) * tbl[0]) + ((r == 1) * tbl[1]) + ((r == 2) * tbl[2]) + ((r == 3) * tbl[3]) would simplify well enough?Basion
@Cymry I added an edit with my assumptions.Boito
@chux-ReinstateMonica hmm. That does surprisingly well for case1 (especially on clang) but surprisingly poorly for case0.Boito
There is absolutely no chance, with basic C code using GCC/Clang, to generate (fast) code that behaves like a LUT with 1024 values containing integers > 256 and makes no memory accesses, without making assumptions about the values/pattern of the LUT itself. The only way I can think of is to generate a huge assembly listing full of lea instructions so as to build big numbers in registers, which will result in hugely inefficient code. If your LUT takes 1 KiB and can contain arbitrary random values, then Shannon's source coding theorem prevents any efficient code from being generated.Bitthia
In the end, it seems like an XY problem to me. Stored in a compact way, the LUT can take 10 bits per item with up to 1024 items (ie. 1.2 KiB of RAM). You can store it on each NUMA node so as to avoid remote NUMA accesses. Since the values will be fetched by many cores, the table should stay in cache (see LFU caching), and only the first core will actually fetch it from memory. It will not even be loaded from RAM but from other NUMA nodes thanks to cache coherence. The total memory will be like <200KiB and accesses should be relatively fast due to local accesses, caching and parallel cache line fetchesBitthia
@JérômeRichard indeed. Ending up with a LUT isn't the worst case, but there are many topologies that can be optimized to more efficient ALU. If the generated ALU is worse than the LUT, I'm happy to stick with the LUT. But if the LUT is worse than the ALU, I would prefer the ALU.Boito
I advise you to detect some specific common patterns manually (and use a fallback naive LUT function otherwise); physical IDs are often far from random (node IDs may be, though). There are 3-4 very frequent patterns on HPC machines (eg. typically 0 1 2 3 ... or 0 2 4 8... or 0 8 1 9 2 10...). This is because modern architectures are typically homogeneous NUMA nodes of SMPs, and the cores in the same SMP often have the same predictable patterns. The number of NUMA nodes is generally very small (having >64 NUMA nodes would be rather crazy). Having a smaller LUT of <=128 bytes per node is cheap.Bitthia
Note that having information about all cores accessible to all others is known to cause scalability issues. It is often not easy to do better, though. The usual solution is to work at the granularity of the NUMA node: each core can only quickly get the information of its own node and requests the info from other nodes otherwise. This means accesses to remote node info (or more generally global operations) are slower, but this is the deal with NUMA systems and users should be aware of that.Bitthia

The best "generic" solution I am aware of is the following:

int compute(int r)
{
    static const int T[] = {0,0,1,1};
    const int lut_size = sizeof(T) / sizeof(T[0]);

    int result = 0;

    for(int i=0 ; i<lut_size ; ++i)
        result += (r == i) * T[i];

    return result;
}

At -O3, GCC and Clang unroll the loop, propagate constants, and generate intermediate code similar to the following:

int compute(int r)
{
    return (r == 0) * 0 + (r == 1) * 0 + (r == 2) * 1 + (r == 3) * 1;
}

GCC/Clang optimizers know that the multiplication can be replaced with conditional moves (developers often use this trick to guide compilers into generating branchless assembly).

The resulting assembly is the following for Clang:

compute:
        xor     ecx, ecx
        cmp     edi, 2
        sete    cl
        xor     eax, eax
        cmp     edi, 3
        sete    al
        add     eax, ecx
        ret

The same applies to GCC. There are no branches or memory accesses (at least as long as the values are small). Multiplications by small values are also replaced with fast lea instructions.

A more complete test is available on Godbolt.

Note that this method should work for bigger tables, but if the table is too big the loop will not be automatically unrolled. You can tell the compiler to use more aggressive unrolling via compilation flags or pragmas (see the sketch below). That being said, a LUT will likely be faster if the table is big, since loading and executing a huge code footprint is slow in this pathological case.
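
For example, the unrolling can be requested per-loop with a pragma rather than global flags (a sketch; #pragma GCC unroll is available in GCC 8+ and is also accepted by Clang):

int compute(int r)
{
    static const int T[] = {0,0,1,1};
    const int lut_size = sizeof(T) / sizeof(T[0]);

    int result = 0;

    /* Ask the compiler to fully unroll even for larger tables.  */
    #pragma GCC unroll 1024
    for(int i=0 ; i<lut_size ; ++i)
        result += (r == i) * T[i];

    return result;
}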

Spermogonium answered 18/5, 2022 at 22:6 Comment(6)
This won't scale well if we have something like an 80 core machine. Also this isn't really generic; it's assuming size == 4. What's the method of going from a LUT to compute?Boito
The initial question states a LUT of size 4, not 80... Besides, as mentioned at the end of the answer, this is a pathological case: if the mapping is not trivial, then the code will be big anyway. The compiler will have to load values from memory, since they cannot be generated unless the pattern is trivial. Loading constants from memory is similar to using a LUT. Loading huge code is not much better...Bitthia
I previously hardcoded 4 but it was not needed. I modified the code so the size of the table can be bigger.Bitthia
The question was about a generic LUT. The example was just to show that current compiler infrastructure doesn't do a good job of optimizing LUTs.Boito
@Noah: If generic means that there is absolutely no structure or pattern to a LUT with >= 256 entries, what is desired is not a realistic goal, regardless of how capable the compiler infrastructure is. If LUT entries could be described as a (linear or otherwise) combination of a few (low single digits) basic functions, there might be a fighting chance.Compromise
@Compromise that's fair. At the end of the day, if the ALU is more expensive I would just fall back on the LUT. I ended up writing a naive script for this that maxes out pretty early.Boito

You could pack the array into a long integer and use bit shifts and masking to extract the result.

For example, the table {2,0,3,1} could be handled with:

uint32_t lookup0(uint32_t r) {
    static const uint32_t tbl = (2u <<  0) | (0u <<  8) |
                                (3u << 16) | (1u << 24);
    return (tbl >> (8 * r)) & 0xff;
}

It produces relatively nice assembly:

lookup0:                                # @lookup0
        lea     ecx, [8*rdi]
        mov     eax, 16973826
        shr     eax, cl
        movzx   eax, al
        ret

Not perfect but branchless and with no indirection.

This method is quite generic and it could support vectorization by "looking up" multiple inputs at the same time.
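
For example, a sketch of that vectorization assuming AVX2 (needed for the per-lane variable shift _mm_srlv_epi32; compile with -mavx2), doing four lookups at once against the same packed table:

#include <immintrin.h>

/* Look up four indices at once: shift the packed table right by
   8*r bits in each 32-bit lane, then mask out the low byte.  */
__m128i lookup0_x4(__m128i r) {
    const __m128i tbl  = _mm_set1_epi32((2 <<  0) | (0 <<  8) |
                                        (3 << 16) | (1 << 24));
    const __m128i mask = _mm_set1_epi32(0xff);
    __m128i shifts = _mm_slli_epi32(r, 3);   /* 8 * r in each lane */
    return _mm_and_si128(_mm_srlv_epi32(tbl, shifts), mask);
}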

There are a few tricks for handling larger arrays, like using longer integers (i.e. uint64_t or the __uint128_t extension). Another approach is to split the bits of each value in the array, e.g. into a high and a low byte, look them up separately, and combine the results with bitwise operations (sketched below).
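
A sketch of that byte-splitting idea, with a hypothetical table of four 16-bit values: pack the low and high bytes into separate 32-bit words, look both up, and recombine:

uint32_t lookup_wide(uint32_t r) {
    /* Table {0x0102, 0x0304, 0x0506, 0x0708} split into byte planes.  */
    static const uint32_t lo = (0x02u <<  0) | (0x04u <<  8) |
                               (0x06u << 16) | (0x08u << 24);
    static const uint32_t hi = (0x01u <<  0) | (0x03u <<  8) |
                               (0x05u << 16) | (0x07u << 24);
    const uint32_t s = 8 * r;
    return (((hi >> s) & 0xff) << 8) | ((lo >> s) & 0xff);
}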

Cymry answered 18/5, 2022 at 21:34 Comment(1)
This is good for small tables, upvoted for future readers looking for that. It might not work for the size of table the querent has in mind, where an actual LUT in memory is probably best as a fallback if there isn't a pattern that allows a simple arithmetic function. (Unless I-cache budget is high enough to allow pretty ridiculous stuff with lots of immediates still being better than a cache miss.)Resplendent
