How does this C program without libc work?
Asked Answered
W

2

27

I came across a minimal HTTP server that is written without libc: https://github.com/Francesco149/nolibc-httpd

I can see that basic string handling functions are defined, leading to the write syscall:

#define fprint(fd, s) write(fd, s, strlen(s))
#define fprintn(fd, s, n) write(fd, s, n)
#define fprintl(fd, s) fprintn(fd, s, sizeof(s) - 1)
#define fprintln(fd, s) fprintl(fd, s "\n")
#define print(s) fprint(1, s)
#define printn(s, n) fprintn(1, s, n)
#define printl(s) fprintl(1, s)
#define println(s) fprintln(1, s)

And the basic syscalls are declared in the C file:

size_t read(int fd, void *buf, size_t nbyte);
ssize_t write(int fd, const void *buf, size_t nbyte);
int open(const char *path, int flags);
int close(int fd);
int socket(int domain, int type, int protocol);
int accept(int socket, sockaddr_in_t *restrict address,
           socklen_t *restrict address_len);
int shutdown(int socket, int how);
int bind(int socket, const sockaddr_in_t *address, socklen_t address_len);
int listen(int socket, int backlog);
int setsockopt(int socket, int level, int option_name, const void *option_value,
               socklen_t option_len);
int fork();
void exit(int status);

So I guess the magic happens in start.S, which contains _start and a special way of encoding syscalls by creating global labels which fall through and accumulating values in r9 to save bytes:

.intel_syntax noprefix

/* functions: rdi, rsi, rdx, rcx, r8, r9 */
/*  syscalls: rdi, rsi, rdx, r10, r8, r9 */
/*                           ^^^         */
/* stack grows from a high address to a low address */

#define c(x, n) \
.global x; \
x:; \
  add r9,n

c(exit, 3)       /* 60 */
c(fork, 3)       /* 57 */
c(setsockopt, 4) /* 54 */
c(listen, 1)     /* 50 */
c(bind, 1)       /* 49 */
c(shutdown, 5)   /* 48 */
c(accept, 2)     /* 43 */
c(socket, 38)    /* 41 */
c(close, 1)      /* 03 */
c(open, 1)       /* 02 */
c(write, 1)      /* 01 */
.global read     /* 00 */
read:
  mov r10,rcx
  mov rax,r9
  xor r9,r9
  syscall
  ret

.global _start
_start:
  xor rbp,rbp
  xor r9,r9
  pop rdi     /* argc */
  mov rsi,rsp /* argv */
  call main
  call exit

Is this understanding correct? GCC use the symbols defined in start.S for the syscalls, then the program starts in _start and calls main from the C file?

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

Wiburg answered 29/3, 2021 at 8:37 Comment(1)
BTW, with clang -Oz, I got the .c + .S version down to 992 bytes. See the top of my answer.Coacher
C
13

(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)


It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.

Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.


special way of encoding syscalls

I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.

(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8/pop/syscall is the same size as call rel32.)


Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.

So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).

    static long callnum = 0;     // r9 = 0  before a call to any of these

    ...
    socket:
       callnum += 38;
    close:
       callnum++;         // can use inc instead of add 1
    open:                 // missed optimization in their asm
       callnum++;
    write:
       callnum++;
    read:
       tmp=callnum;
       callnum=0;
       retval = syscall(tmp, args);

Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type

register long callnum asm("r9");     // GCC extension

long open(args...) {
   callnum++;
   return write(args...);
}
long write(args...) {
   callnum++;
   return read(args...); // tailcall
}
long read(args...){
       tmp=callnum;
       callnum=0;            // reset callnum for next call
       return syscall(tmp, args...);
}

args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.


That's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (Does GCC optimize assembly source file?)

They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (REX + opcode + modrm). (Code golf.SE tips for x86 machine code)

I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)

mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.

But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.


Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

No, that's fully stand-alone incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)

(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, so another sign of it being compiler output, not hand-written.)

I don't know how they arranged to get the compiler not to touch r9. It seems just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.

As far as the ELF headers, see the comment at the top of the file, which links A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux - you'd assemble this with nasm -fbin and the output is a complete ELF binary, ready to run. Not a .o that you need to link + strip, so you get to account for every single byte in the file.

Coacher answered 29/3, 2021 at 17:54 Comment(6)
In the httpd.asm file that mediocrevegetable1 found, the syscalls are below _start, and it could just fall through into _exit, called ?_017 there, but there's a call ?_017 instruction just before the ?_017 label. It seems likely they're just relying on GCC not using R9, and hopefully verifying that during their hand-tweaking.Storage
In general would the code be smaller if they used 32 bit abi? I think when I wrote my code golf answers a while ago that's what I used always. Maybe that should be an answer on the x86 golfing tips page.Wiburg
@qwr: maybe, especially with compiler-generated code that isn't careful to avoid REX prefixes wherever possible.Coacher
Aren't there like ~20+ bytes to be gained by just dropping the function prototypes? If you declared them all without spec you could pass an extra argument... then increment mechanism can alias, since it just counts the args.Bardo
@l.k: I don't follow what you're saying at all. Without prototypes, every function call would need to zero EAX because it's passing 0 args in XMM regs. And everything else would be the same, except for implicit int return values. (But they all return signed int anyway, the way ssize_t is defined). If you want each call-site to explicitly zero r8 or r9, that would be more extra code size compared to what's happening now.Coacher
...I'd been wondering what as up with that caller-cleared eax you sometimes see. That makes a lot of sense, thanks. Kinda surprised I'd never read that, but TBH I tend to tune out around floating point.Bardo
I
6

You're pretty much correct about what's going on. Very interesting, I've never seen something like this before. But basically as you said, every time it calls the label, as you said, r9 keeps adding up until it reaches read, whose syscall number is 0. This is why the order is pretty clever. Assuming r9 is 0 before read is called (the read label itself zeroes r9 before calling the correct syscall), no adding is needed because r9 already has the correct syscall number that is needed. write's syscall number is 1, so it only needs to be added by 1 from 0, which is shown in the macro call. open's syscall number is 2, so first it is added by 1 at the open label, then again by 1 at the write label, and then the correct syscall number is put into rax at the read label. And so on. Parameter registers like rdi, rsi, rdx, etc. are also not touched so it basically acts like a normal function call.

Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?

I'm assuming you're talking about this file. Not sure exactly what's going on here, but it looks like an ELF file is manually being created, probably to reduce size further.

Inez answered 29/3, 2021 at 9:22 Comment(9)
Doesn't look like hand-written-from-scratch assembly. More like a hand-tuned disassembly. First, the labels are just numbers instead of readable names. Then, some strange mnemonic choices like jnz instead of jne after cmp.Colorant
@Colorant true, now that I think of it. I wonder what could have produced that assembly then, it looks like NASM as opposed to what seems like GAS in start.S, so I wonder what could have produced that. Anyway, removed the last line of the post.Inez
?_033: labels look like the style of Agner Fog's objconv (which does support NASM syntax for output).Colorant
@Colorant ah ok, that makes sense.Inez
also couldnt the program have just incremented rax instead of r9? not sure why they used r9Wiburg
@Wiburg I suspect that is because rax wouldn't necessarily be 0 from before. In fact, you can even see this in the code itself: right after main is called, exit is called. What if main didn't return 0? They could zero out rax after calling main but that would waste a few bytes too. I don't know, that's just my theory.Inez
Also, rax always gives the return value of a function, so perhaps using rax would have brought some more issues related to that too (In fact, now that I think of it, this would probably make it very hard to use rax. The return value of a function may not be 0 in any function, and if it weren't, there would be a slight issue then if rax were incremented directly.Inez
I say it because I think inc eax was a 1 byte instruction, although I don't really remember. zeroing out eax only takes 1 byteWiburg
@Wiburg true, but there are probably gonna be a lot of syscalls that will not return 0. For example you would probably always have xor eax, eax after a call to write, socket, etc. So it may only cost 1 byte to 0 out eax, but the amount of zeroing that would have to be done would probably be a lot.Inez

© 2022 - 2024 — McMap. All rights reserved.