Question
Say you have a simple function that returns a value based on a lookup table, for example (see the edit about assumptions):
#include <stdint.h>

uint32_t
lookup0(uint32_t r) {
    static const uint32_t tbl[] = { 0, 1, 2, 3 };
    /* Tell the compiler that r is always a valid index. */
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r`. */
    return tbl[r];
}

uint32_t
lookup1(uint32_t r) {
    static const uint32_t tbl[] = { 0, 0, 1, 1 };
    /* Tell the compiler that r is always a valid index. */
    if (r >= (sizeof(tbl) / sizeof(tbl[0]))) {
        __builtin_unreachable();
    }
    /* Can replace with: `return r / 2`. */
    return tbl[r];
}
Is there any super-optimization infrastructure or algorithm that can go from the lookup table to the optimized ALU implementation?
Motivation
The motivation is that I'm building some locks for NUMA machines and want to be able to configure my code generically. It's pretty common in NUMA locks to need a cpu_id -> numa_node mapping. I can obviously set up the lookup table during configuration, but since I'm fighting for every drop of memory bandwidth, I am hoping to generically reach a solution that will cover most layouts.
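To make the goal concrete, here is a minimal sketch of two hypothetical layouts for a machine with 2 nodes and 8 CPUs. The constants `NUM_NODES` and `CPUS_PER_NODE`, the table contents, and the function names are all assumptions for illustration, not taken from a real topology. In both cases the table could in principle be replaced by a single ALU operation:

```c
#include <stdint.h>

/* Hypothetical machine: 2 NUMA nodes, 4 CPUs per node. */
enum { NUM_NODES = 2, CPUS_PER_NODE = 4 };

/* Blocked layout: CPUs 0-3 on node 0, CPUs 4-7 on node 1. */
static const uint32_t blocked_tbl[] = { 0, 0, 0, 0, 1, 1, 1, 1 };

static inline uint32_t
blocked_node(uint32_t cpu_id) {
    return cpu_id / CPUS_PER_NODE;
}

/* Interleaved layout: even CPUs on node 0, odd CPUs on node 1. */
static const uint32_t interleaved_tbl[] = { 0, 1, 0, 1, 0, 1, 0, 1 };

static inline uint32_t
interleaved_node(uint32_t cpu_id) {
    return cpu_id % NUM_NODES;
}
```

Real machines may of course use layouts that fit neither form, which is why a generic approach is wanted.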
Looking at how modern compilers do:
Neither clang nor gcc is able to do this at the moment.
Clang is able to get lookup0 if you rewrite it as a switch/case statement.
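For reference, the switch/case rewrites are not shown in the question. A plausible reconstruction, assuming the names `case0` and `case1` from the assembly listings and an unreachable `default` branch, would be:

```c
#include <stdint.h>

/* Assumed switch/case form of lookup0 (reconstruction, not the original source). */
uint32_t
case0(uint32_t r) {
    switch (r) {
    case 0: return 0;
    case 1: return 1;
    case 2: return 2;
    case 3: return 3;
    default: __builtin_unreachable(); /* r is assumed to be a valid index */
    }
}

/* Assumed switch/case form of lookup1 (reconstruction, not the original source). */
uint32_t
case1(uint32_t r) {
    switch (r) {
    case 0: return 0;
    case 1: return 0;
    case 2: return 1;
    case 3: return 1;
    default: __builtin_unreachable(); /* r is assumed to be a valid index */
    }
}
```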
lookup0(unsigned int):                  # @lookup0(unsigned int)
        movl    %edi, %eax
        movl    lookup0(unsigned int)::tbl(,%rax,4), %eax
        retq
...
case0(unsigned int):                    # @case0(unsigned int)
        movl    %edi, %eax
        retq
but can't get lookup1.
lookup1(unsigned int):                  # @lookup1(unsigned int)
        movl    %edi, %eax
        movl    .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
        retq
...
case1(unsigned int):                    # @case1(unsigned int)
        movl    %edi, %eax
        movl    .Lswitch.table.case1(unsigned int)(,%rax,4), %eax
        retq
GCC can't get either.
lookup0(unsigned int):
        movl    %edi, %edi
        movl    lookup0(unsigned int)::tbl(,%rdi,4), %eax
        ret
lookup1(unsigned int):
        movl    %edi, %edi
        movl    lookup1(unsigned int)::tbl(,%rdi,4), %eax
        ret
case0(unsigned int):
        leal    -1(%rdi), %eax
        cmpl    $2, %eax
        movl    $0, %eax
        cmovbe  %edi, %eax
        ret
case1(unsigned int):
        subl    $2, %edi
        xorl    %eax, %eax
        cmpl    $1, %edi
        setbe   %al
        ret
I imagine I can cover a fair amount of the necessary cases with some custom brute-force approach, but was hoping this was a solved problem.
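As a rough illustration of what such a brute-force approach might look like, here is a minimal sketch that searches one candidate family only: expressions of the form `(r >> shift) & mask`. The helper name `find_shift_mask` is hypothetical, and this is not the optimizer mentioned in the edits below. For a given shift, the only mask worth trying is the OR of all table values, since any valid mask must contain every bit that appears in the table, and any extra bit that leaks through for some index would make every larger mask fail too.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: try to express tbl[i] as (i >> shift) & mask
 * for every index i in [0, n). Returns true and fills in *shift and
 * *mask on success, false if no expression of this form matches.
 */
static bool
find_shift_mask(const uint32_t *tbl, size_t n, unsigned *shift, uint32_t *mask) {
    /* The mask must contain every bit that occurs in any table value. */
    uint32_t m = 0;
    for (size_t i = 0; i < n; ++i) {
        m |= tbl[i];
    }

    /* Try every shift amount and verify the whole table against it. */
    for (unsigned s = 0; s < 32; ++s) {
        bool ok = true;
        for (size_t i = 0; i < n && ok; ++i) {
            ok = ((((uint32_t)i) >> s) & m) == tbl[i];
        }
        if (ok) {
            *shift = s;
            *mask = m;
            return true;
        }
    }
    return false;
}
```

For the two tables in the question this finds `shift = 0, mask = 3` for lookup0 and `shift = 1, mask = 1` for lookup1, which matches `r` and `r / 2` on the valid input range. Other families (division, modulo, multiply-and-shift) would need their own search loops, and the search cost grows quickly with the number of free parameters.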
Edit:
The only true assumptions are:
- All inputs have an index in the LUT.
- All values are positive (I think that makes things easier), which will be true for just about any sys-config that's online.
- (Edit4) I would add one more assumption: the LUT is dense. That is, it covers a range [<low_bound>, <bound_bound>] but nothing outside of that range.
In my case, for CPU topology, I would generally expect sizeof(LUT) >= <max_value_in_lut>, but that is specific to the one example I gave and would have some counter-examples.
Edit2:
I wrote a pretty simple optimizer that does a reasonable job for the CPU topologies I've tested here. But obviously it could be a lot better.
Edit3:
There seems to be some confusion about the question/initial example (I should have been clearer).
The examples lookup0/lookup1 are arbitrary. I am hoping to find a solution that can scale beyond 4 indexes and with different values.
The use case I have in mind is CPU topology so ~256 - 1024 is where I would expect the upper bound in size but for a generic LUT it could obviously get much larger.
__builtin_unreachable stuff. Accessing an array out of bounds is already UB. The compiler should take advantage of that. – Cymry
return ((r == 0) * tbl[0]) + ((r == 1) * tbl[1]) + ((r == 2) * tbl[2]) + ((r == 3) * tbl[3]) would be simple enough? – Basion