Until recent Linux kernel versions (sometime before 5.4), you could simply compile with gcc -z execstack. That would make all pages executable, including read-only data (.rodata) and read-write data (.data), where char code[] = "..." goes.
Now -z execstack only applies to the actual stack, so it currently works only for non-const local arrays, i.e. move char code[] = ... into main. Modern systems make as few pages executable as possible, as hardening against exploits.
See Linux default behavior against `.data` section for the kernel change, and Unexpected exec permission from mmap when assembly files included in the project for the old behaviour: enabling Linux's READ_IMPLIES_EXEC personality flag for that program's process. (In Linux 5.4, that Q&A shows you'd only get READ_IMPLIES_EXEC for a missing PT_GNU_STACK, like a really old binary; modern GCC -z execstack would set PT_GNU_STACK = RWX metadata in the executable, which Linux 5.4 would handle as making only the stack itself executable. At some point before that, PT_GNU_STACK = RWX did result in READ_IMPLIES_EXEC.)
The other option is to make system calls at runtime to copy into an executable page, or change permissions on the page it's in. That's still more complicated than using a local array to get GCC to copy code into executable stack memory.
(I don't know if there's an easy way to enable READ_IMPLIES_EXEC under modern kernels. Having no GNU-stack attribute at all in an ELF binary does that for 32-bit code, but not 64-bit.)
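If you just want to find out whether a running process currently has READ_IMPLIES_EXEC set, a minimal sketch using personality(2) could look like this. It only queries the flag and does not enable it; the function name read_implies_exec_enabled is made up for illustration.

```c
#include <sys/personality.h>

// Hypothetical helper: returns 1 if READ_IMPLIES_EXEC is set for this
// process, 0 if it is not, and -1 if the query itself failed.
int read_implies_exec_enabled(void)
{
    // 0xffffffff is the "query only" value from the personality(2) man page:
    // it returns the current persona without changing it.
    int persona = personality(0xffffffff);
    if (persona == -1)
        return -1;
    return (persona & READ_IMPLIES_EXEC) != 0;
}
```

On a kernel and binary where -z execstack only affects the stack, I'd expect this to return 0.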
Yet another option is __attribute__((section(".text"))) const char code[] = ...;
Working example: https://godbolt.org/z/draGeh.
If you need the array to be writeable, e.g. for shellcode that inserts some zeros into strings, you could maybe link with ld -N. But it's probably best to use -z execstack and a local array.
Two problems in the question:
- no exec permission on the page, because you used an array that will go in the noexec read+write .data section.
- your machine code doesn't end with a ret instruction, so even if it did run, execution would fall into whatever was next in memory instead of returning.
And BTW, the REX prefix is redundant. "\x31\xc0" xor eax,eax has exactly the same effect as xor rax,rax.
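For reference, here are the two encodings side by side as C arrays, each followed by a ret so it could be called. The array names are just illustrative; the only difference is the one-byte REX.W prefix (0x48).

```c
// xor eax, eax ; ret  (3 bytes). Writing EAX zero-extends into RAX.
const unsigned char xor_eax_ret[] = { 0x31, 0xc0, 0xc3 };

// xor rax, rax ; ret  (4 bytes). Same result, one extra REX.W prefix byte.
const unsigned char xor_rax_ret[] = { 0x48, 0x31, 0xc0, 0xc3 };
```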
You need the page containing the machine code to have execute permission. x86-64 page tables have a bit for execute permission that is separate from read permission, unlike legacy 386 page tables.
The easiest way to get static arrays into read+exec memory used to be compiling with gcc -z execstack. (That used to make the stack and other sections executable; now it affects only the stack.)
typedef int (*intfunc_int)(int);
int main(void)
{
unsigned char execbuf[] = { // compile with -zexecstack
0x8d, 0x47, 0x01, // lea 0x1(%rdi),%eax
0xc3 // ret
};
// a string initializer like char execbuf[] = "\xc3"; also works
// Tell GCC we're about to run this data as code. x86 has coherent I-cache,
// but this also stops optimization from removing the initialization as dead stores.
__builtin___clear_cache (execbuf, execbuf+sizeof(execbuf)); // end pointer is exclusive
// Without this, the store disappears
intfunc_int fptr = (intfunc_int) execbuf; // cast to function pointer.
int res = fptr(2); // deref the function pointer
return res; // returns 3 on x86-64 System V (non-Windows), where the first arg is in EDI
}
Compiles to simple asm (Godbolt, also showing that it's broken without the __builtin___clear_cache: it will skip the store and just jump to uninitialized stack space). This runs correctly with -z execstack and will segfault without it.
# GCC -O3 for x86-64
main:
sub rsp, 24 # GCC reserves 16 bytes more stack space than it needed
mov edi, 2 # function arg
mov DWORD PTR [rsp+12], -1023326323 # store 4 bytes of machine code
lea rax, [rsp+12] # pointer into a register
call rax # call through the function pointer
add rsp, 24
ret
Older GNU ld linker used to make .rodata read+exec
Until recently (2018 or 2019), the standard toolchain (binutils ld) would put section .rodata into the same ELF segment as .text, so they'd both have read+exec permission. Thus using const char code[] = "..."; was sufficient for executing manually-specified bytes as data, without execstack.
But on my Arch Linux system with GNU ld (GNU Binutils) 2.31.1, that's no longer the case. readelf -a shows that the .rodata section went into an ELF segment with .eh_frame_hdr and .eh_frame, and it only has Read permission. .text goes in a segment with Read + Exec, and .data goes in a segment with Read + Write (along with the .got and .got.plt). (What's the difference of section and segment in ELF file format)
I assume this change is to make ROP and Spectre attacks harder by not having read-only data in executable pages where sequences of useful bytes could be used as "gadgets" that end with the bytes for a ret or jmp reg instruction.
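The same split can be observed at runtime instead of with readelf. As a rough sketch (Linux-only, assuming /proc is mounted; the helper name mapping_perms is hypothetical), you can look up which mapping in /proc/self/maps contains a given address and read its permission string:

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper: copy the permission string (e.g. "r-xp") of the
// mapping containing addr into perms.
// Returns 0 on success, -1 if no mapping was found or /proc is unavailable.
int mapping_perms(const void *addr, char perms[5])
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    char line[1024];
    int found = -1;
    uintptr_t target = (uintptr_t)addr;
    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        char p[5];
        // Each line starts with "start-end perms", e.g. "00400000-00401000 r--p"
        if (sscanf(line, "%lx-%lx %4s", &start, &end, p) != 3)
            continue;
        if (target >= start && target < end) {
            memcpy(perms, p, sizeof p);
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Calling it on the address of a static const array, a non-const global, and a local variable should show whether each one lives in a page with the x bit set for your particular toolchain and kernel.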
// See above for char code[] = {...} inside main with -z execstack, for current Linux
// This is broken on recent Linux, used to work without execstack.
#include <stdio.h>
// can be non-const if you use gcc -z execstack. static is also optional
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3"; // xor eax,eax ; ret
// the compiler will append a 0 byte to terminate the C string,
// but that's fine. It's after the ret.
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// run code
int c = sum (2, 3);
return ret0();
}
On older Linux systems: gcc -O3 shellcode.c && ./a.out (works because of const on global/static arrays).
On Linux before 5.5 (or so): gcc -O3 -z execstack shellcode.c && ./a.out (works because of -zexecstack regardless of where your machine code is stored). Fun fact: gcc allows -zexecstack with no space, but clang only accepts clang -z execstack.
These also work on Windows, where read-only data goes in .rdata instead of .rodata.
The compiler-generated main looks like this (from objdump -drwC -Mintel). You can run it inside gdb and set breakpoints on code and ret0_code. (I actually used gcc -no-pie -O3 -zexecstack shellcode.c, hence the addresses near 401000.)
0000000000401020 <main>:
401020: 48 83 ec 08 sub rsp,0x8 # stack aligned by 16 before a call
401024: be 03 00 00 00 mov esi,0x3
401029: bf 02 00 00 00 mov edi,0x2 # 2 args
40102e: e8 d5 0f 00 00 call 402008 <code> # note the target address in the next page; that's where .rodata goes
401033: 48 83 c4 08 add rsp,0x8
401037: e9 c8 0f 00 00 jmp 402004 <ret0_code> # optimized tailcall
Or use system calls to modify page permissions
Instead of compiling with gcc -zexecstack, you can use mmap(PROT_EXEC) to allocate new executable pages, or mprotect(PROT_EXEC) to change existing pages to executable (including pages holding static data). You also typically want at least PROT_READ and sometimes PROT_WRITE, of course.
Using mprotect on a static array means you're still executing the code from a known location, maybe making it easier to set a breakpoint on it.
On Windows you can use VirtualAlloc or VirtualProtect.
Telling the compiler that data is executed as code
Normally compilers like GCC assume that data and code are separate. This is like type-based strict aliasing, but even using char* doesn't make it well-defined to store into a buffer and then call that buffer as a function pointer.
In GNU C, you also need to use __builtin___clear_cache(buf, buf + len) after writing machine code bytes to a buffer, because the optimizer doesn't treat dereferencing a function pointer as reading bytes from that address. Dead-store elimination can remove the stores of machine code bytes into a buffer, if the compiler proves that the store isn't read as data by anything. https://codegolf.stackexchange.com/questions/160100/the-repetitive-byte-counter/160236#160236 and https://godbolt.org/g/pGXn3B has an example where gcc really does do this optimization, because gcc "knows about" malloc. Another example is the first code block in this answer, where we use a local array in executable stack space.
(And on non-x86 architectures where I-cache isn't coherent with D-cache, it actually will do any necessary cache syncing. On x86 it's purely a compile-time optimization blocker and doesn't expand to any instructions itself, because a jump or call is sufficient on paper for JIT or self-modifying code, and in practice real x86 CPUs appear to be at least that strict.)
Re: the weird name with three underscores: It's the usual __builtin_name pattern, but name is __clear_cache.
My edit on @AntoineMathys's answer added this.
In practice GCC/clang don't "know about" mmap(MAP_ANONYMOUS) the way they know about malloc. So in practice the optimizer will assume that the memcpy into the buffer might be read as data by the non-inline function call through the function pointer, even without __builtin___clear_cache(). (Unless you declared the function type as __attribute__((const)).)
On x86, where I-cache is coherent with data caches, having the stores happen in asm before the call is sufficient for correctness. On other ISAs, __builtin___clear_cache() will actually emit special instructions as well as ensuring the right compile-time ordering.
It's good practice to include it when copying code into a buffer because it doesn't cost performance, and stops hypothetical future compilers from breaking your code (e.g. if they do understand that mmap(MAP_ANONYMOUS) gives newly-allocated anonymous memory that nothing else has a pointer to, just like malloc).
With current GCC, I was able to provoke GCC into really doing an optimization we don't want by using __attribute__((const)) to tell the optimizer sum() is a pure function (that only reads its args, not global memory). GCC then knows sum() can't read the result of the memcpy as data.
With another memcpy into the same buffer after the call, GCC does dead-store elimination, keeping just the 2nd store after the call. This results in no store before the first call, so it executes the 00 00 add [rax], al bytes, segfaulting.
// demo of a problem on x86 when not using __builtin___clear_cache
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
int main ()
{
char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi]
0xC3 // ret
};
__attribute__((const)) int (*sum) (int, int) = NULL;
// copy code to executable buffer
sum = mmap (0,sizeof(code),PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANON,-1,0);
memcpy (sum, code, sizeof(code));
//__builtin___clear_cache(sum, sum + sizeof(code));
int c = sum (2, 3);
//printf ("%d + %d = %d\n", a, b, c);
memcpy(sum, (char[]){0x31, 0xc0, 0xc3, 0}, 4); // xor-zero eax, ret, padding for a dword store
//__builtin___clear_cache(sum, sum + 4);
return sum(2,3);
}
Compiled on the Godbolt compiler explorer with GCC9.2 -O3
main:
push rbx
xor r9d, r9d
mov r8d, -1
mov ecx, 34
mov edx, 7
mov esi, 4
xor edi, edi
sub rsp, 16
call mmap
mov esi, 3
mov edi, 2
mov rbx, rax
call rax # call before store
mov DWORD PTR [rbx], 12828721 # 0xC3C031 = xor-zero eax, ret
add rsp, 16
pop rbx
ret # no 2nd call, CSEd away because const and same args
Passing different args would have gotten another call reg, but even with __builtin___clear_cache the two sum(2,3) calls can CSE. __attribute__((const)) doesn't respect changes to the machine code of a function. Don't do it. It's safe if you're going to JIT the function once and then call many times, though.
Uncommenting the first __clear_cache results in:
mov DWORD PTR [rax], -1019804531 # lea; ret
call rax
mov DWORD PTR [rbx], 12828721 # xor-zero; ret
... still CSE and use the RAX return value
The first store is there because of __clear_cache and the sum(2,3) call. (Removing the first sum(2,3) call does let dead-store elimination happen across the __clear_cache.)
The second store is there because the side-effect on the buffer returned by mmap is assumed to be important, and that's the final value main leaves.
Godbolt's ./a.out option to run the program still seems to always fail (exit status of 255); maybe it sandboxes JITing? It works on my desktop with __clear_cache and crashes without.
mprotect on a page holding existing C variables
You can also give a single existing page read+write+exec permission. This is an alternative to compiling with -z execstack.
You don't need __clear_cache on a page holding read-only C variables because there's no store to optimize away. You would still need it for initializing a local buffer (on the stack). Otherwise GCC will optimize away the initializer for this private buffer that a non-inline function call definitely doesn't have a pointer to (escape analysis). It doesn't consider the possibility that the buffer might hold the machine code for the function unless you tell it that via __builtin___clear_cache.
#include <stdio.h>
#include <sys/mman.h>
#include <stdint.h>
// can be non-const if you want, we're using mprotect
static const char code[] = {
0x8D, 0x04, 0x37, // lea eax,[rdi+rsi] // retval = a+b;
0xC3 // ret
};
static const char ret0_code[] = "\x31\xc0\xc3";
int main () {
// void* cast is easier to type than a cast to function pointer,
// and in C can be assigned to any other pointer type. (not C++)
int (*sum) (int, int) = (void*)code;
int (*ret0)(void) = (void*)ret0_code;
// hard-coding x86's 4k page size for simplicity.
// also assume that `code` doesn't span a page boundary and that ret0_code is in the same page.
uintptr_t page = (uintptr_t)code & -4096ULL; // round down: clear the low 12 bits
mprotect((void*)page, 4096, PROT_READ|PROT_EXEC|PROT_WRITE); // +write in case the page holds any writeable C vars that would crash later code.
// run code
int c = sum (2, 3);
return ret0();
}
I used PROT_READ|PROT_EXEC|PROT_WRITE in this example so it works regardless of where your variable is. If it was a local on the stack and you left out PROT_WRITE, call would fail after making the stack read-only when it tried to push a return address.
Also, PROT_WRITE lets you test shellcode that self-modifies, e.g. to edit zeros into its own machine code, or other bytes it was avoiding.
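The example hard-codes a 4k page and assumes the code doesn't cross a page boundary. A slightly more general sketch, with the hypothetical helper names page_floor and make_range_rwx, asks sysconf for the page size and covers every page the byte range touches. Note that rounding down means clearing the low bits, i.e. masking with ~(page_size - 1), which for 4096 is the same as -4096.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

// Round addr down to a multiple of page_size (which must be a power of 2).
uintptr_t page_floor(uintptr_t addr, uintptr_t page_size)
{
    return addr & ~(page_size - 1);
}

// Hypothetical helper: give read+write+exec permission to every page that
// overlaps [p, p+len). Returns 0 on success, -1 on failure, like mprotect.
int make_range_rwx(const void *p, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    if (ps <= 0 || len == 0)
        return -1;
    uintptr_t start = page_floor((uintptr_t)p, (uintptr_t)ps);
    uintptr_t end = (uintptr_t)p + len;   // one past the last byte
    // mprotect needs a page-aligned start; it applies to every page that
    // contains any part of [start, start + length).
    return mprotect((void *)start, end - start,
                    PROT_READ | PROT_WRITE | PROT_EXEC);
}
```

In the example above, the call would become make_range_rwx(code, sizeof(code)), plus a second call for ret0_code if you don't want to rely on it sharing a page.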
$ gcc -O3 shellcode.c # without -z execstack
$ ./a.out
$ echo $?
0
$ strace ./a.out
...
mprotect(0x55605aa3f000, 4096, PROT_READ|PROT_WRITE|PROT_EXEC) = 0
exit_group(0) = ?
+++ exited with 0 +++
If I comment out the mprotect, it does segfault with recent versions of GNU Binutils ld, which no longer put read-only constant data into the same ELF segment as the .text section.
If I did something like ret0_code[2] = 0xc3;, I would need __builtin___clear_cache(ret0_code+2, ret0_code+3) after that to make sure the store wasn't optimized away, but if I don't modify the static arrays then it's not needed after mprotect. It is needed after mmap+memcpy or manual stores, because we want to execute bytes that have been written in C (with memcpy).
ret because the indirect function call you do should be a call, which pushes the return address onto the stack. At least, this is my best educated guess; I have never seen anything like this. – Supersensible
xorq %rax, %rax. Use `xor eax, eax` instead – Dewitt