Minimal opcode size x86-64 strlen implementation
Asked Answered
C

2

7

I'm investigating a minimal opcode size x86-64 strlen implementation for my code golfing / binary executable that is not supposed to exceed some size (think of demoscene for simplicity).
General idea comes from here, size optimization ideas from here and here.

Input string address is in rdi , max length should be not bigger than Int32

xor   eax,eax ; 2 bytes
or    ecx,-1  ; 3 bytes
repne scasb   ; 2 bytes
not   ecx     ; 2 bytes
dec   ecx     ; 2 bytes

Final result is in ecx in 11 bytes total.

The question is about setting ecx to -1

Option 1 already stated

or ecx,-1 ; 3 bytes

Option 2

lea ecx,[rax-1] ; 3 bytes 

Option 3

stc         ; 1 byte
sbb ecx,ecx ; 2 bytes

Option 4 , probably the slowest one

push -1 ; 2 bytes
pop rcx ; 1 byte

I understand that:
Option 1 has dependency on previous ecx value
Option 2 has dependency on previous rax value
Option 3 I'm not sure if it has dependency on previous ecx value?
Option 4 is the slowest one?

Is there a clear winner here?
The criteria is to keep the opcodes size as small as possible and choose the best one performance wise.
I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.

Canoewood answered 15/4, 2018 at 10:10 Comment(2)
Could someone kindly explain reason for downvoting this question?Canoewood
@BoPersson added the why. I'm after saving every byte possible to pull off as much functionality as possible in a given executable limit (4096 bytes). Does this affect the answer in anyway? I thought the criteria are well defined already.Canoewood
O
3

For a hacky good-enough version, we know rdi has a valid address. It's very likely that edi is not a small integer, thus 2 byte mov ecx, edi. But this isn't safe is RDI could be pointing just past a 4GiB boundary so it's hard to prove it's safe. Unless you're using an ILP32 ABI like x32 so all pointers are below the 4GiB mark.

So you might need to copy the full RDI with push rdi / pop rcx, 1 byte each. But that adds extra latency for short-string startup. It should be safe if you have no strings with length higher than their starting address. (But that is plausible for static storage in .data, .bss, or .rodata if you have any huge arrays; for example Linux non-PIE executables are loaded at around 0x401000 = 1<<22.)

This is great if you just want rdi pointing to the terminating 0 byte, instead of actually needing a count. Or if you have the start pointer in another register so you can do sub edi, edx or something and get the length that way instead of processing the rcx result. (If you know the result fits in 32 bits, you don't need sub rdi, rdx because you know the upper bits of that would be zero anyway. And high input bits don't affect low output bits for add/sub; carry propagates left to right.)

For strings known to be under 255 bytes, you can use mov cl, -1 (2 bytes). That makes rcx at least 0xFF, and higher depending on what high garbage was left in it. (This has a partial-reg stall on Nehalem and earlier when RCX is read, otherwise just a dependency on the old RCX). Anyway, then mov al, -2 / sub al, cl to get the length as an 8-bit integer. This may or may not be useful.

Depending on the caller, rcx might have already been holding a pointer value, in which case you could leave it untouched if you can use pointer subtraction.


Out of the options you proposed

lea ecx,[rax-1] is very good because you just xor-zeroed eax, and it's a cheap 1 uop instruction with 1 cycle latency and can run on multiple execution ports on all mainstream CPUs.

When you already have another register with a known constant value, especially one that's xor-zeroed, 3-byte lea almost always the most efficient 3-byte way to create a constant, if it works. (See Set all bits in CPU register to 1 efficiently).


I'm fully aware there are implementations using modern cpu instructions, but this legacy approach seems the smallest one.

Yes, repne scasb is very compact. Its startup overhead is maybe something like 15 cycles on a typical Intel CPU, and according to Agner Fog, it issues >=6n uops with a throughput of >= 2n cycles, where n is the count (i.e. 2 cycles per byte it compares for long compares, where the startup overhead is hidden), so it dwarfs the cost of lea.

Something with a false dependency on ecx could delay its startup, so you definitely want lea.

repne scasb is probably fast enough for whatever you're doing, but it's slower than pcmpeqb / pmovmsbk / cmp. For short fixed-length strings, integer cmp / jne is very good when the length is 4 or 8 bytes (including the terminating 0), assuming you can safely over-read your strings, i.e. you don't have to worry about "" at the end of a page. This method has overhead that scales with string length, though. For string length=7, for example, you could do 4, 2, and 1 operand sizes, or you could do two dword compares overlapping by 1 byte. like cmp dword [rdi], first_4_bytes / jne; cmp dword [rdi+3], last_4_bytes / jne.


More details about LEA

On a Sandybridge-family CPU, the lea could be dispatched to an execution unit in the same cycle as it and the xor-zero were issued into the out-of-order CPU core. xor-zeroing is handled in the issue/rename stage, so the uop enters the ROB in an "already-executed" state. It's impossible for the instruction to ever have to wait for RAX. (Unless an interrupt happens between the xor and the lea, but even then I think there'd be a serializing instruction after restoring RAX and before the lea could execute, so it couldn't be stuck waiting.)

Simple lea can run on port0 or port1 on SnB, or port1 / port5 on Skylake (2 per clock throughput, but sometimes different ports on different SnB-family CPUs). It's 1 cycle latency, so it's hard to do much better.

It's unlikely you'd see any speedup from using mov ecx, -1 (5 bytes) which can run on any ALU port.

On an AMD Ryzen, lea r32, [m] in 64-bit mode is treated as a "slow" LEA that can only run on 2 ports, and has 2c latency instead of 1. Worse, Ryzen doesn't eliminate xor-zeroing.


The microbenchmark test you did only measures throughput for the versions with no false dependencies, not latency. This is often a useful measure, and you did happen to get the right answer that lea is the best choice.

Whether pure throughput accurately reflects anything about your real use case is another matter. You might actually depend on the latency, not throughput, if the string compare is on the critical path as part of a long or loop-carried data dependency chain not broken by a jcc to give you branch-prediction + speculative execution. (But branchless code is often larger, so this is unlikely).

stc / sbb ecx,ecx is interesting, but only AMD CPUs treat sbb as dependency-breaking (only depending on CF, not the integer register). On Intel Haswell and earlier, sbb is a 2 uop instruction (because it has 3 inputs: 2 GP integer + flags). It has 2c latency, which is why it performs so badly. (The latency is a loop-carried dep chain.)


Shortening other parts of the sequence

Depending what you're doing, you may be able to use strlen+2 just as well, but offsetting another constant or something. dec ecx is only 1 byte in 32-bit code, but x86-64 doesn't have the short-form inc/dec instructions. So not / dec isn't as cool in 64-bit code.

After repne scas, you have ecx = -len - 2 (if you started with ecx = -1), and not gives you -x-1 (i.e. +len + 2 - 1).

 ; eax = 0
 ; ecx = -1
repne scasb      ; ecx = -len - 2
sub   eax, ecx   ; eax = +len + 2
Orestes answered 20/4, 2018 at 15:53 Comment(1)
Fantastic answer. I can use sub eax, ecx followed by lea edx,[rax-2] to save 1 byte as I need length as the 3rd argument in a System V callCanoewood
C
2

I did some benchmarking on Intel Core i7 4850HQ Haswell 2,3 GHz, release build no debugger attached. In each loop I measure 1000 sequences of asm instructions and repeat it 10 milion times to average result.

I've made macros for repeating asm instructions 100 times.

#define lea100 asm{xor   eax,eax};asm { lea ecx,[rax-1] }; // <== Copy pasted 100times
#define or100 asm{xor   eax,eax};asm { or ecx,-1 }; // <== Copy pasted 100times
#define sbb100 asm{xor   eax,eax};asm { stc };asm{sbb ecx,ecx}; // <== Copy pasted 100times
#define stack100 asm ("xor %eax,%eax;.byte 0x6A; .byte 0xFF ;pop %rcx;"); // <== Copy pasted 100times

Testing C code with inline asm for MacOS

#include <stdio.h>
#include <CoreServices/CoreServices.h>
#include <mach/mach.h>
#include <mach/mach_time.h>
int main(int argc, const char * argv[]) {
    uint64_t        start;
    uint64_t        end;
    uint64_t        elapsed;
    Nanoseconds     elapsedNano;

    uint64_t sum = 0;
    for (int i = 0; i < 10000000 ; i++) {

// this will become
// call       imp___stubs__mach_absolute_time  
// mov        r14, rax
    start = mach_absolute_time();

//10x lea100 for example for total 1000 

// call       imp___stubs__mach_absolute_time
// sub        rax, r14
    end = mach_absolute_time();

    elapsed = end - start;
    elapsedNano = AbsoluteToNanoseconds( *(AbsoluteTime *) &elapsed );
    uint64_t nano = * (uint64_t *) &elapsedNano;
        sum += nano;
    }
    printf("%f\n",sum/10000000.0);
    return 0;
}

Results

xor eax,eax
lea ecx,[rax-1]

205-216 ns

xor eax,eax
or ecx,-1

321-355 ns

xor eax,eax
push -1 
pop rcx 

322-359 ns

xor eax,eax
stc     
sbb ecx,ecx

612-692 ns

Canoewood answered 19/4, 2018 at 17:52 Comment(3)
You're testing throughput, not latency. A few of them (or and sbb) bottleneck on a false dependency through ECX. The lea version should be running ~4 instructions per clock on Sandybridge-family, while the or ecx, -1 version should be half that, bottlenecked at 1 or per clock because of dependency chains. So you have significant measurement overhead. (Or you're testing on a first-gen i7, a Nehalem which is a different microarchitecture family. "i7" is not a useful description).Orestes
@PeterCordes I was waiting for an insight like this. I've added the i7 type, does it help? Do you have any suggestions how to reduce measurement overhead? Or maybe totally different approach?Canoewood
You don't need to time more accurately, just use lea. It's better in every way than any of the alternatives. (lea with 32-bit operand-size in 64-bit mode does have an extra cycle of latency on AMD, according to agner.org/optimize, but still good throughput and only 2 cycles of latency. It's definitely your best bet, and repne scasb is pretty slow by comparison so avoiding the false deps is more than fine. So really it's possible that there could be cases where or ecx,-1 has a minor advantage, but it's unlikely especially on Intel.) Working on an answer...Orestes

© 2022 - 2024 — McMap. All rights reserved.