(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)
It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.
Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.
special way of encoding syscalls
I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.
(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8 / pop / syscall is the same size as call rel32.)
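As a rough sketch of that inline-asm alternative, assuming x86-64 Linux and GNU C extended asm: the name raw_syscall3 is made up for illustration and is not something from the repo. The constraints put the call number in RAX and the args in RDI, RSI and RDX, and tell the compiler that the syscall instruction itself destroys RCX and R11.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_* call numbers */
#include <unistd.h>       /* syscall(), used only by the fallback */

#if defined(__x86_64__) && defined(__linux__)
/* Hypothetical helper: a 3-argument syscall the compiler can inline.
 * On failure the kernel returns -errno in RAX; no errno translation here. */
static inline long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                    /* result comes back in RAX */
                     : "a"(nr),                     /* call number in RAX */
                       "D"(a1), "S"(a2), "d"(a3)    /* RDI, RSI, RDX */
                     : "rcx", "r11", "memory");     /* syscall overwrites RCX and R11 */
    return ret;
}
#else
/* Assumption: on other targets just defer to libc so the sketch still builds. */
static inline long raw_syscall3(long nr, long a1, long a2, long a3)
{
    return syscall(nr, a1, a2, a3);
}
#endif
```

Something like raw_syscall3(SYS_write, 1, (long)"hi\n", 3) then compiles to register setup plus a 2-byte syscall at the call site, with no call/ret and without the compiler having to assume every call-clobbered register is destroyed.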
Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.
So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).
static long callnum = 0;   // r9 = 0 before a call to any of these
...
socket:
    callnum += 38;
close:
    callnum++;             // can use inc instead of add 1
open:                      // missed optimization in their asm
    callnum++;
write:
    callnum++;
read:
    tmp = callnum;
    callnum = 0;
    retval = syscall(tmp, args);
Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type
register long callnum asm("r9");   // GCC extension
long open(args...) {
    callnum++;
    return write(args...);
}
long write(args...) {
    callnum++;
    return read(args...);          // tailcall
}
long read(args...){
    tmp = callnum;
    callnum = 0;                   // reset callnum for next call
    return syscall(tmp, args...);
}
args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.
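The switch analogy can be written as C that actually compiles, if you just want to check the arithmetic. This is only a sketch: the enum and the function name callnum_for are invented here, and a local variable stands in for r9. Each case falls through into the next, so entering higher up adds more to the call number before reaching the common handler.

```c
#include <assert.h>

/* Hypothetical names for the asm entry points (the labels in start.S). */
enum entry { ENTRY_READ, ENTRY_WRITE, ENTRY_OPEN, ENTRY_CLOSE, ENTRY_SOCKET };

/* Returns the call number that reaches the common handler
 * when execution enters the chain at entry point `e`. */
long callnum_for(enum entry e)
{
    long callnum = 0;                 /* plays the role of r9: zero on entry */
    switch (e) {
    case ENTRY_SOCKET: callnum += 38; /* fall through */
    case ENTRY_CLOSE:  callnum++;     /* fall through */
    case ENTRY_OPEN:   callnum++;     /* fall through */
    case ENTRY_WRITE:  callnum++;     /* fall through */
    case ENTRY_READ:   break;         /* common handler: callnum is final */
    }
    return callnum;
}

#if defined(__x86_64__) && defined(__linux__)
#include <sys/syscall.h>
/* The increments only work because of how x86-64 Linux numbers these calls. */
static_assert(SYS_read == 0 && SYS_write == 1 && SYS_open == 2 &&
              SYS_close == 3 && SYS_socket == 41,
              "x86-64 Linux call numbers");
#endif
```

Entering at ENTRY_SOCKET gives 38 + 3 = 41, entering at ENTRY_WRITE gives 1, and ENTRY_READ leaves the number at 0.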
It's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (Does GCC optimize assembly source file?)
They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (REX + opcode + modrm). (Code golf.SE tips for x86 machine code)
I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)
mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.
But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.
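The byte counts for the register tweaks above can be checked against the actual machine code. The table below is only an illustration (the struct, macro and function names are invented here); the bytes are one standard encoding of each instruction in 64-bit mode.

```c
#include <stddef.h>

static const unsigned char xor_rbp_rbp[]  = {0x48, 0x31, 0xED};       /* REX.W + opcode + modrm */
static const unsigned char xor_ebp_ebp[]  = {0x31, 0xED};             /* opcode + modrm, no REX */
static const unsigned char mov_rax_r9[]   = {0x4C, 0x89, 0xC8};       /* REX + opcode + modrm */
static const unsigned char xchg_rax_r9[]  = {0x49, 0x91};             /* REX + short-form xchg */
static const unsigned char add_r9d_1[]    = {0x41, 0x83, 0xC1, 0x01}; /* REX + opcode + modrm + imm8 */
static const unsigned char inc_r9d[]      = {0x41, 0xFF, 0xC1};       /* REX + opcode + modrm */
static const unsigned char mov_rsi_rsp[]  = {0x48, 0x89, 0xE6};       /* REX.W + opcode + modrm */
static const unsigned char push_pop_rsi[] = {0x54, 0x5E};             /* push rsp ; pop rsi */

struct insn { const char *text; const unsigned char *bytes; size_t len; };
struct rewrite { struct insn before, after; };

#define INSN(text, arr) { text, arr, sizeof(arr) }

/* Each row: what their asm uses, then the shorter equivalent. */
static const struct rewrite rewrites[] = {
    { INSN("xor rbp, rbp", xor_rbp_rbp), INSN("xor ebp, ebp", xor_ebp_ebp) },
    { INSN("mov rax, r9",  mov_rax_r9),  INSN("xchg rax, r9", xchg_rax_r9) },
    { INSN("add r9d, 1",   add_r9d_1),   INSN("inc r9d",      inc_r9d) },
    { INSN("mov rsi, rsp", mov_rsi_rsp), INSN("push rsp / pop rsi", push_pop_rsi) },
};

/* Sum of (before - after) lengths over all rows of the table. */
size_t total_bytes_saved(void)
{
    size_t saved = 0;
    for (size_t i = 0; i < sizeof rewrites / sizeof rewrites[0]; i++)
        saved += rewrites[i].before.len - rewrites[i].after.len;
    return saved;
}
```

Each row saves exactly one byte, so total_bytes_saved() comes to 4 for one use of each substitution. Note that xchg rax, r9 is only a drop-in replacement for the mov because r9 doesn't need to keep its old value there.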
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
No, that's fully stand-alone, incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)
(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, so another sign of it being compiler output, not hand-written.)
I don't know how they arranged to get the compiler not to touch r9. It seems to be just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.
As for the ELF headers, see the comment at the top of the file, which links A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux. You'd assemble this with nasm -fbin and the output is a complete ELF binary, ready to run. Not a .o that you need to link + strip, so you get to account for every single byte in the file.
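To get a feel for what accounting for every byte means, this sketch adds up the fixed header overhead of a flat ELF file before any code. It assumes a 64-bit ELF with ordinary, non-overlapping headers and no section headers, and uses the struct definitions from glibc's <elf.h>; the function name is made up for illustration and I haven't checked it against the layout httpd.asm actually uses.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

static_assert(sizeof(Elf64_Ehdr) == 64, "ELF64 file header is 64 bytes");
static_assert(sizeof(Elf64_Phdr) == 56, "ELF64 program header is 56 bytes");

/* Header bytes in a flat ELF64 file with `phnum` program headers
 * and no section headers. */
size_t elf64_header_overhead(unsigned phnum)
{
    return sizeof(Elf64_Ehdr) + (size_t)phnum * sizeof(Elf64_Phdr);
}
```

With a single loadable segment, elf64_header_overhead(1) is 120 bytes before the first instruction.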
clang -Oz, I got the .c + .S version down to 992 bytes. See the top of my answer. – Coacher