For a hacky good-enough version, we know rdi
has a valid address. It's very likely that edi
is not a small integer, thus 2 byte mov ecx, edi
. But this isn't safe is RDI could be pointing just past a 4GiB boundary so it's hard to prove it's safe. Unless you're using an ILP32 ABI like x32 so all pointers are below the 4GiB mark.
So you might need to copy the full RDI with push rdi / pop rcx, 1 byte each. But that adds extra latency for short-string startup. It should be safe if you have no strings with length higher than their starting address. (But that is plausible for static storage in .data, .bss, or .rodata if you have any huge arrays; for example Linux non-PIE executables are loaded at around 0x401000
= 1<<22.)
This is great if you just want rdi pointing to the terminating 0
byte, instead of actually needing a count. Or if you have the start pointer in another register so you can do sub edi, edx
or something and get the length that way instead of processing the rcx
result. (If you know the result fits in 32 bits, you don't need sub rdi, rdx
because you know the upper bits of that would be zero anyway. And high input bits don't affect low output bits for add/sub; carry propagates left to right.)
For strings known to be under 255 bytes, you can use mov cl, -1
(2 bytes). That makes rcx
at least 0xFF, and higher depending on what high garbage was left in it. (This has a partial-reg stall on Nehalem and earlier when RCX is read, otherwise just a dependency on the old RCX). Anyway, then mov al, -2
/ sub al, cl
to get the length as an 8-bit integer. This may or may not be useful.
Depending on the caller, rcx
might have already been holding a pointer value, in which case you could leave it untouched if you can use pointer subtraction.
Out of the options you proposed
lea ecx,[rax-1]
is very good because you just xor-zeroed eax
, and it's a cheap 1 uop instruction with 1 cycle latency and can run on multiple execution ports on all mainstream CPUs.
When you already have another register with a known constant value, especially one that's xor-zeroed, 3-byte lea
almost always the most efficient 3-byte way to create a constant, if it works. (See Set all bits in CPU register to 1 efficiently).
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.
Yes, repne scasb
is very compact. Its startup overhead is maybe something like 15 cycles on a typical Intel CPU, and according to Agner Fog, it issues >=6n uops with a throughput of >= 2n cycles, where n
is the count (i.e. 2 cycles per byte it compares for long compares, where the startup overhead is hidden), so it dwarfs the cost of lea
.
Something with a false dependency on ecx
could delay its startup, so you definitely want lea
.
repne scasb
is probably fast enough for whatever you're doing, but it's slower than pcmpeqb
/ pmovmsbk
/ cmp
. For short fixed-length strings, integer cmp
/ jne
is very good when the length is 4 or 8 bytes (including the terminating 0), assuming you can safely over-read your strings, i.e. you don't have to worry about ""
at the end of a page. This method has overhead that scales with string length, though. For string length=7, for example, you could do 4, 2, and 1 operand sizes, or you could do two dword compares overlapping by 1 byte. like cmp dword [rdi], first_4_bytes / jne
; cmp dword [rdi+3], last_4_bytes / jne
.
More details about LEA
On a Sandybridge-family CPU, the lea
could be dispatched to an execution unit in the same cycle as it and the xor
-zero were issued into the out-of-order CPU core. xor
-zeroing is handled in the issue/rename stage, so the uop enters the ROB in an "already-executed" state. It's impossible for the instruction to ever have to wait for RAX. (Unless an interrupt happens between the xor and the lea
, but even then I think there'd be a serializing instruction after restoring RAX and before the lea
could execute, so it couldn't be stuck waiting.)
Simple lea
can run on port0 or port1 on SnB, or port1 / port5 on Skylake (2 per clock throughput, but sometimes different ports on different SnB-family CPUs). It's 1 cycle latency, so it's hard to do much better.
It's unlikely you'd see any speedup from using mov ecx, -1
(5 bytes) which can run on any ALU port.
On an AMD Ryzen, lea r32, [m]
in 64-bit mode is treated as a "slow" LEA that can only run on 2 ports, and has 2c latency instead of 1. Worse, Ryzen doesn't eliminate xor-zeroing.
The microbenchmark test you did only measures throughput for the versions with no false dependencies, not latency. This is often a useful measure, and you did happen to get the right answer that lea
is the best choice.
Whether pure throughput accurately reflects anything about your real use case is another matter. You might actually depend on the latency, not throughput, if the string compare is on the critical path as part of a long or loop-carried data dependency chain not broken by a jcc
to give you branch-prediction + speculative execution. (But branchless code is often larger, so this is unlikely).
stc
/ sbb ecx,ecx
is interesting, but only AMD CPUs treat sbb
as dependency-breaking (only depending on CF, not the integer register). On Intel Haswell and earlier, sbb
is a 2 uop instruction (because it has 3 inputs: 2 GP integer + flags). It has 2c latency, which is why it performs so badly. (The latency is a loop-carried dep chain.)
Shortening other parts of the sequence
Depending what you're doing, you may be able to use strlen+2
just as well, but offsetting another constant or something. dec ecx
is only 1 byte in 32-bit code, but x86-64 doesn't have the short-form inc/dec
instructions. So not / dec isn't as cool in 64-bit code.
After repne scas
, you have ecx = -len - 2
(if you started with ecx = -1
), and not
gives you -x-1
(i.e. +len + 2 - 1
).
; eax = 0
; ecx = -1
repne scasb ; ecx = -len - 2
sub eax, ecx ; eax = +len + 2