Rust compiler not optimising lzcnt? (and similar functions)
What was done:

This follows from experimenting on Compiler Explorer to ascertain how the compiler (rustc) handles log2(), leading_zeros() and similar functions. I came across this result, which seems both exceedingly bizarre and concerning:

Compiler Explorer link

Code:

pub fn lzcnt0(val: u64) -> u64 {
    val.leading_zeros() as u64
}

pub unsafe fn lzcnt1(val: u64) -> u64 {
    core::arch::x86_64::_lzcnt_u64(val)
}

pub unsafe fn lzcnt2(val: u64) -> u64 {
    asm_lzcnt(val)
}

#[inline]
pub unsafe fn asm_lzcnt(val: u64) -> u64 {
    let lzcnt: u64;
    // Intel syntax puts the destination first: lzcnt dst, src.
    core::arch::asm!("lzcnt {dst}, {src}", src = in(reg) val, dst = lateout(reg) lzcnt, options(nomem, nostack));
    lzcnt
}

Output:

example::lzcnt0:
        test    rdi, rdi
        je      .LBB0_2
        bsr     rax, rdi
        xor     rax, 63
        ret
.LBB0_2:
        mov     eax, 64
        ret

example::lzcnt1:
        jmp     core::core_arch::x86_64::abm::_lzcnt_u64

core::core_arch::x86_64::abm::_lzcnt_u64:
        lzcnt   rax, rdi
        ret

example::lzcnt2:
        lzcnt   rax, rdi
        ret

The compiler options are chosen to emulate cargo's 'release' configuration as closely as possible (with opt-level=3 for good measure), and I have otherwise tried my best to get the compiler to optimise the functions. The specific target shouldn't matter as long as it is x86-64; I've tried x86_64-{pc-windows-{msvc,gnu},unknown-linux-gnu}.

What was expected:

All of these outputs should be identical to lzcnt2. According to the instruction performance tables, lzcnt is evidently a fast instruction across the board and should be used, and having an unnecessary branch in such a low-level function is dismal. What's weirder, the function _lzcnt_u64() calls leading_zeros() under the hood, which the compiler is happy to magic away there (there are no checks or asserts either), but it won't do the same for leading_zeros() itself. What's more, the compiler won't inline the lzcnt instruction even in that case? (The implementation marks the function as #[inline] too.) Sure, a jmp isn't as bad, but it's entirely unnecessary and should be avoided.

What it could be:

  • Compiler bug?
  • Purposeful choice I don't understand?
  • I don't understand how to use Compiler Explorer properly?
  • Other?

I'm seeing similar results in functions like log2 and (I presume) others that rely on the ctlz rust compiler intrinsic in their implementation.

If you understand compilers sufficiently, any clarification would be greatly appreciated. I don't fancy writing loads of utility functions for little reason, but I'll do so if there's no better alternative.

P.S. If your answer is along the lines of that the performance gain is negligible in most situations, and/or that I shouldn't care due to code quality or similar reasoning: I understand the sentiment, but that's not the point of this question. I'm writing for bare-metal, hot code in a personal project.

Hooge answered 25/12, 2021 at 16:24

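Older x86-64 CPUs don't support lzcnt, so rustc/LLVM won't emit it by default. (They would execute it as bsr, but the behavior is not identical: bsr returns the index of the highest set bit rather than the number of leading zeros, and its result for a zero input is not defined.)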
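
To make the difference concrete, here is a minimal sketch of what the fallback code for lzcnt0 in the question is doing. The function names bsr_emulated and leading_zeros_via_bsr are hypothetical, and the loop is only a portable stand-in for the bsr instruction, not what the compiler emits.

```rust
/// Portable stand-in for `bsr`: index of the highest set bit.
/// Returns None for zero, where the real instruction gives no usable result.
pub fn bsr_emulated(val: u64) -> Option<u32> {
    if val == 0 {
        return None;
    }
    let mut idx: u32 = 63;
    // val is non-zero, so some bit is set and the loop stops before idx underflows.
    while ((val >> idx) & 1) == 0 {
        idx -= 1;
    }
    Some(idx)
}

/// Mirrors the shape of the generated fallback: a zero check, then `bsr` and `xor 63`.
pub fn leading_zeros_via_bsr(val: u64) -> u32 {
    match bsr_emulated(val) {
        // For idx in 0..=63, idx ^ 63 equals 63 - idx.
        Some(idx) => idx ^ 63,
        // The special case that forces the branch when lzcnt is unavailable.
        None => 64,
    }
}
```

The zero case is the only reason for the branch: lzcnt is defined to return 64 for a zero input, while the bsr path needs the explicit check.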

Use -C target-feature=+lzcnt to enable it.
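
If enabling the feature for the whole crate is not an option, it can also be enabled per function with the target_feature attribute. The following is a sketch only: lzcnt_fast and lzcnt_dispatch are hypothetical names, and the runtime check assumes std is available, which it is not in the bare-metal setting the question mentions (there, the compiler flag is the simpler route). The intrinsic _lzcnt_u64 carries the same attribute, which is likely why lzcnt1 in the question ends in a jmp instead of being inlined into a caller that lacks the feature.

```rust
/// Compiled with lzcnt enabled regardless of the crate-wide flags, so
/// leading_zeros() here is free to lower to the lzcnt instruction.
/// Unsafe because the caller must guarantee the CPU supports lzcnt.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "lzcnt")]
unsafe fn lzcnt_fast(val: u64) -> u64 {
    val.leading_zeros() as u64
}

/// Picks the lzcnt build at runtime when the CPU reports support,
/// and otherwise falls back to the portable version.
pub fn lzcnt_dispatch(val: u64) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("lzcnt") {
            // Sound: support for lzcnt was checked on the line above.
            return unsafe { lzcnt_fast(val) };
        }
    }
    val.leading_zeros() as u64
}
```

For hot code the check should be hoisted out of the inner loop, since a per-call dispatch can cost more than the branch it is meant to avoid.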

More generally, you may wish to use -C target-cpu=XXX to enable all the features of a specific CPU model. Use rustc --print target-cpus for a list.

In particular, -C target-cpu=native will generate code for the CPU that rustc itself is running on, which is appropriate if you will run the code on the same machine where you compile it.
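
To confirm from inside the program that the flags took effect, the compile-time feature set can be queried with cfg!. This is a small sketch with hypothetical function names; it reports what the build was compiled for, not what the running CPU supports.

```rust
/// True when this build was compiled with lzcnt enabled, for example through
/// -C target-feature=+lzcnt or a -C target-cpu value that includes it.
pub const fn lzcnt_enabled_at_compile_time() -> bool {
    cfg!(target_feature = "lzcnt")
}

/// Human-readable summary, handy for a startup log line.
pub fn leading_zeros_codegen_hint() -> &'static str {
    if lzcnt_enabled_at_compile_time() {
        "lzcnt enabled at compile time: leading_zeros() may lower to a single lzcnt"
    } else {
        "lzcnt not enabled at compile time: expect the fallback sequence on x86-64"
    }
}
```

Built with default flags for x86-64 this reports the feature as disabled, matching the output shown in the question.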

Alyciaalyda answered 25/12, 2021 at 16:33 Comment(1)
Thank you very much, I'll give those target-features a look - just what I needed. I'll accept your answer as soon as SO will let me. - Hooge
