Overload symbols of running process (LD_PRELOAD attachment)

I'm working on a heap profiler for Linux, called heaptrack. Currently, I rely on LD_PRELOAD to overload various (de-)allocation functions, and that works extremely well.

Now I would like to extend the tool to allow runtime attaching to an existing process that was started without LD_PRELOADing my tool. I can dlopen my library via GDB just fine, but that won't overwrite malloc etc. I think this is because, at that point, the linker has already resolved the position-dependent code of the running process. Is that correct?

So what do I do instead to overload malloc and friends?

I am not proficient with assembler code. From what I've read so far, I guess I'll somehow have to patch malloc and the other functions, such that they first call back to my trace function and then continue with their actual implementation? Is that correct? How do I do that?

I hope there are existing tools out there, or that I can leverage GDB/ptrace for that.

Timbrel answered 25/11, 2014 at 21:58 Comment(13)
I just stumbled upon ltrace, which is supposed to support runtime attachment, but the malloc filter won't work then. So I have the feeling that a simple ptrace approach won't work?Timbrel
I'm not sure what you mean by "the malloc filter won't work". ltrace -e 'malloc+free' -p xxxxx seems to work just fine here (ltrace 0.7.3 running on linux 3.13.0 / x86_64).Edessa
@xbug: Odd, this is exactly what I tried and it does not work for me. I use the same ltrace version, but Linux 3.17.4-1-ARCH, i.e. from ArchLinux. If I runtime-attach ltrace to any application, it stays silent. If I otoh start the application with ltrace, it works. Any idea what might be going on?Timbrel
@xbug: I just built ltrace from sources, and with that version, runtime attachment seems to work. It seems to be extremely slow though, which makes it essentially useless for me.Timbrel
@milianw: I do believe I've described a ptrace-based solution here; are you aware of it? The latter example in that answer replaces an address with a write syscall, in your case you'd replace the initial parts of the target functions with jumps to the interposed functions. The technique is not simple (the hard part is finding the addresses in the target binary to overwrite), and it's very architecture-specific, but after the interposing, there is no extra overhead or speed penalty at all.Apiarian
@NominalAnimal: Nope, I wasn't aware of that. Very interesting. I'll see if I eventually figure out how to code this up to call my own function in malloc, and getting access to both, input arguments and return value...Timbrel
I've taken some interest in heaptrack and its statistics-gathering process. What information, exactly, must it record? Is it enough to record the arguments, return value and immediate caller of malloc/free? Or must the entire backtrace be examined? It is my view that your tool will be fastest (and its strategy different) if it amasses on the fly the minimum amount of data required to reconstitute the events of interest. Currently I envision patching malloc's first instruction as a jmp to an injected page of code, accompanied by a large buffer for call records.Aggressive
@IwillnotexistIdonotexist: I also see patching malloc(), memalign(), posix_memalign(), free() et al. as the way to go. Using ptrace to attach to the target process, and anonymously mapping writable pages, then copying position-independent executable code to that page, is not hard at all. The attaching process can use elf tools and /proc/PID/maps to locate the target addresses. This should work for even static binaries (no libdl). Difficult part is to disassemble/duplicate the asm op(s) under the jump instruction -- unless it is a jump instruction itself, of course.Apiarian
@NominalAnimal Precisely; However, I was more concerned with the correctness of the injection since, strictly speaking, it is possible for the attach to occur while a thread is in the prologue of these functions. Worse, the compiled form of these fn's may include a branch backwards to somewhere within this prologue. The former can be solved by having ptrace() single-step all threads until they leave the prologues of all functions being injected (this should take no time at all), and then the process is patched. For the second, some primitive binary analysis and relocation will be required.Aggressive
@IwillnotexistIdonotexist: I've explored ptracing multithreaded processes in this answer, including single-stepping individual threads; it seems robust and straightforward. On x86-64, the prologue (replaced part) is 5 to 13 bytes -- 5 bytes if replacement code is within a 32-bit offset to %rip, 13 bytes if an arbitrary 64-bit pushq %rax; movabs $constant, %rax ; jmp *%rax sequence is needed. Instruction analysis (those 5-13 bytes) is nasty. I'd prefer to mmap complete replacement functions instead. Would that be an acceptable option?Apiarian
@NominalAnimal But what if, hypothetically, the code generated by the compiler branches backwards into the replaced part? Then you'd have a process jump from some branch in malloc to where it expected certain instructions, but there it will find either the unexpected, or no instruction at all, and crash. More broadly, how can I be sure that the program will never attempt to execute anything in the replaced prologue ever again? To solve this problem in full generality would require solving the Halting Problem, and is indeed very nasty in its full generality.Aggressive
@NominalAnimal But: I figure that the first instruction of the prologue is likely to be >=2 bytes. A 2-byte rel8 JMP (stage 1) at prologue could trampoline you to a place with, say, 5 bytes free between two functions or within one. You'd then use a 5-byte rel32 JMP (stage 2) to jump to your true injected code, or to your 13-byte sequence (stage 3) that jumps to anywhere in the 64-bit address space. As for mmap-ing complete replacements, I must be sure that a thread has exited the replaced function and will not come back into it other than through the entry point.Aggressive
@IwillnotexistIdonotexist: Exactly! If the code uses functions from a known C library version, then we can tell the function address ranges (by compiling test binaries against the same library versions); and glibc et al. have public linkage only to the functions themselves, not within them. For robustness, one could single-step each thread until it is out of C library code altogether. However, this would lead to requiring helper code to be compiled against each c library version used... on the other hand, no instruction analysis!Apiarian
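The two x86-64 jump encodings discussed in the comments above (a 5-byte rel32 jmp, and the 13-byte pushq %rax; movabs; jmp *%rax sequence) can be sketched roughly as follows. The function name `encode_jump` is made up for illustration. The sketch assumes a little-endian x86-64 host, assumes the jump target restores %rax after the 13-byte form, and only builds the bytes; writing them into a live process and relocating the overwritten instructions is a separate problem.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encode a jump located at address `from` that lands
 * on address `to`. Writes at most 13 bytes to `out` and returns the number
 * of bytes used. Assumes a little-endian x86-64 host. */
size_t encode_jump(uint8_t out[13], uint64_t from, uint64_t to) {
    /* rel32 is relative to the end of the 5-byte instruction. */
    int64_t rel = (int64_t)(to - (from + 5));

    if (rel >= INT32_MIN && rel <= INT32_MAX) {
        int32_t rel32 = (int32_t)rel;
        out[0] = 0xE9;                      /* jmp rel32 */
        memcpy(out + 1, &rel32, sizeof rel32);
        return 5;
    }

    out[0] = 0x50;                          /* pushq %rax */
    out[1] = 0x48;                          /* movabs $to, %rax */
    out[2] = 0xB8;
    memcpy(out + 3, &to, sizeof to);
    out[11] = 0xFF;                         /* jmp *%rax */
    out[12] = 0xE0;
    return 13;
}
```

The short form is only usable when the injected code was mapped within 2 GiB of the patched function, which is why the far form exists at all.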

Just for the lulz, another solution without ptracing your own process or touching a single line of assembly or playing around with /proc. You only have to load the library in the context of the process and let the magic happen.

The solution I propose is to use the constructor feature (brought from C++ to C by gcc) to run some code when a library is loaded. This library then just patches the GOT (Global Offset Table) entry for malloc. The GOT stores the real addresses of the library functions so that name resolution happens only once. To patch the GOT you have to play around with the ELF structures (see man 5 elf). Linux is kind enough to give you the aux vector (see man 3 getauxval), which tells you where to find the program headers of the current program in memory. However, a better interface is provided by dl_iterate_phdr, which is used below.
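As a side note on the aux vector route mentioned above, here is a minimal sketch of how the main program's headers could be walked with getauxval alone. The helper name `count_main_phdrs` is invented for this illustration; AT_PHDR, AT_PHNUM and AT_PHENT are the standard aux vector keys. It only covers the main executable, which is one reason dl_iterate_phdr is the more convenient interface.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/auxv.h>

/* Hypothetical helper: count the main program's headers of a given p_type
 * (e.g. PT_LOAD or PT_DYNAMIC) by walking the table that the kernel
 * advertises through the aux vector. */
size_t count_main_phdrs(uint32_t type) {
    const char *table = (const char *)getauxval(AT_PHDR);
    unsigned long phnum = getauxval(AT_PHNUM);
    unsigned long phentsize = getauxval(AT_PHENT);
    size_t count = 0;

    if (table == NULL)
        return 0;

    for (unsigned long i = 0; i < phnum; i++) {
        /* Step by the advertised entry size, not by sizeof(ElfW(Phdr)). */
        const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)(table + i * phentsize);
        if (phdr->p_type == type)
            count++;
    }
    return count;
}
```

A dynamically linked executable would normally report exactly one PT_DYNAMIC header here, and that header is the starting point for locating the relocation, symbol and string tables.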

Here is example code for a library that patches the GOT when its init function is called, although the same could probably be achieved with a gdb script.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dlfcn.h>
#include <sys/auxv.h>
#include <elf.h>
#include <link.h>
#include <sys/mman.h>


struct strtab {
    char *tab;
    ElfW(Xword) size;
};


struct jmpreltab {
    ElfW(Rela) *tab;
    ElfW(Xword) size;
};


struct symtab {
    ElfW(Sym) *tab;
    ElfW(Xword) entsz;
};



/* Backup of the real malloc function */
static void *(*realmalloc)(size_t) = NULL;


/* My local versions of the malloc functions */
static void *mymalloc(size_t size);


/*************/
/* ELF stuff */
/*************/
static const ElfW(Phdr) *get_phdr_dynamic(const ElfW(Phdr) *phdr,
        uint16_t phnum, uint16_t phentsize) {
    int i;

    for (i = 0; i < phnum; i++) {
        if (phdr->p_type == PT_DYNAMIC)
            return phdr;
        phdr = (ElfW(Phdr) *)((char *)phdr + phentsize);
    }

    return NULL;
}



static const ElfW(Dyn) *get_dynentry(ElfW(Addr) base, const ElfW(Phdr) *pdyn,
        uint32_t type) {
    ElfW(Dyn) *dyn;

    for (dyn = (ElfW(Dyn) *)(base + pdyn->p_vaddr); dyn->d_tag; dyn++) {
        if (dyn->d_tag == type)
            return dyn;
    }

    return NULL;
}



static struct jmpreltab get_jmprel(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct jmpreltab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_JMPREL);
    table.tab = (dyn == NULL) ? NULL : (ElfW(Rela) *)dyn->d_un.d_ptr;

    dyn = get_dynentry(base, pdyn, DT_PLTRELSZ);
    table.size = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static struct symtab get_symtab(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct symtab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_SYMTAB);
    table.tab = (dyn == NULL) ? NULL : (ElfW(Sym) *)dyn->d_un.d_ptr;
    dyn = get_dynentry(base, pdyn, DT_SYMENT);
    table.entsz = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static struct strtab get_strtab(ElfW(Addr) base, const ElfW(Phdr) *pdyn) {
    struct strtab table;
    const ElfW(Dyn) *dyn;

    dyn = get_dynentry(base, pdyn, DT_STRTAB);
    table.tab = (dyn == NULL) ? NULL : (char *)dyn->d_un.d_ptr;
    dyn = get_dynentry(base, pdyn, DT_STRSZ);
    table.size = (dyn == NULL) ? 0 : dyn->d_un.d_val;
    return table;
}



static void *get_got_entry(ElfW(Addr) base, struct jmpreltab jmprel,
        struct symtab symtab, struct strtab strtab, const char *symname) {

    ElfW(Rela) *rela;
    ElfW(Rela) *relaend;

    relaend = (ElfW(Rela) *)((char *)jmprel.tab + jmprel.size);
    for (rela = jmprel.tab; rela < relaend; rela++) {
        uint32_t relsymidx;
        char *relsymname;
        relsymidx = ELF64_R_SYM(rela->r_info);
        relsymname = strtab.tab + symtab.tab[relsymidx].st_name;

        if (strcmp(symname, relsymname) == 0)
            return (void *)(base + rela->r_offset);
    }

    return NULL;
}



static void patch_got(ElfW(Addr) base, const ElfW(Phdr) *phdr, uint16_t phnum,
        uint16_t phentsize) {

    const ElfW(Phdr) *dphdr;
    struct jmpreltab jmprel;
    struct symtab symtab;
    struct strtab strtab;
    void *(**mallocgot)(size_t);

    /* Objects without a PT_DYNAMIC segment have nothing to patch. */
    dphdr = get_phdr_dynamic(phdr, phnum, phentsize);
    if (dphdr == NULL)
        return;

    jmprel = get_jmprel(base, dphdr);
    symtab = get_symtab(base, dphdr);
    strtab = get_strtab(base, dphdr);
    mallocgot = get_got_entry(base, jmprel, symtab, strtab, "malloc");

    /* Replace the pointer with our version. */
    if (mallocgot != NULL) {
        /* Quick & dirty hack for some programs that need it: make the page
         * holding the GOT entry writable (assumes 4 KiB pages). */
        void *page = (void *)((intptr_t)mallocgot & ~(0x1000 - 1));
        if (mprotect(page, 0x1000, PROT_READ | PROT_WRITE) != 0) {
            perror("mprotect");
            return;
        }
        *mallocgot = mymalloc;
    }
}



static int callback(struct dl_phdr_info *info, size_t size, void *data) {
    uint16_t phentsize;

    /* Unused parameters. */
    (void)data;
    (void)size;

    printf("Patching GOT entry of \"%s\"\n", info->dlpi_name);
    phentsize = getauxval(AT_PHENT);
    patch_got(info->dlpi_addr, info->dlpi_phdr, info->dlpi_phnum, phentsize);

    return 0;
}



/*****************/
/* Init function */
/*****************/
__attribute__((constructor)) static void init(void) {
    realmalloc = malloc;
    dl_iterate_phdr(callback, NULL);
}



/*********************************************/
/* Here come the malloc function and sisters */
/*********************************************/
static void *mymalloc(size_t size) {
    printf("hello from my malloc\n");
    return realmalloc(size);
}

And an example program that just loads the library between two malloc calls.

#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>



void loadmymalloc(void) {
    if (dlopen("./mymalloc.so", RTLD_LAZY) == NULL)
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
}



int main(void) {
    void *ptr;

    ptr = malloc(42);
    printf("malloc returned: %p\n", ptr);

    loadmymalloc();

    ptr = malloc(42);
    printf("malloc returned: %p\n", ptr);

    return EXIT_SUCCESS;
}

The call to mprotect is usually unnecessary. However, I found that gvim (which is compiled as a shared object) needs it. If you also want to catch references to malloc taken as pointers (which could later be used to call the real function and bypass yours), you can apply the very same process to the relocation table pointed to by the DT_RELA dynamic entry.
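The library code above hardcodes a 4 KiB page when calling mprotect. A slightly more careful variant might query the page size at run time and report failure, roughly as sketched below. The names `make_slot_writable` and `overwrite_slot` are invented for this illustration, and the sketch assumes the slot is a naturally aligned pointer, so it never straddles two pages.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: make the page containing `slot` readable and
 * writable. Returns 0 on success, -1 on failure (errno set by the
 * failing call). */
int make_slot_writable(void *slot) {
    long pagesize = sysconf(_SC_PAGESIZE);
    if (pagesize <= 0)
        return -1;

    /* Round the slot address down to the start of its page. */
    uintptr_t start = (uintptr_t)slot & ~((uintptr_t)pagesize - 1);
    return mprotect((void *)start, (size_t)pagesize, PROT_READ | PROT_WRITE);
}

/* Hypothetical helper: unprotect a pointer slot (such as a GOT entry),
 * then store the replacement value in it. */
int overwrite_slot(void **slot, void *value) {
    if (make_slot_writable(slot) != 0)
        return -1;
    *slot = value;
    return 0;
}
```

Note that this leaves the page writable afterwards, just like the original hack does; restoring the previous protection would require knowing what it was, for example from /proc/self/maps.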

If the constructor feature is not available to you, all you have to do is resolve the init symbol from the newly loaded library and call it.
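A minimal sketch of that manual route, assuming the entry point is exported from the library: in the code above init is declared static, so dlsym would not find it as written, and it would have to be made non-static (ideally under a less generic name). The helper name `load_and_init` is invented for this illustration, and older glibc versions need -ldl at link time.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper: load the shared object at `path` and call its
 * exported `void symbol(void)` entry point.
 * Returns 0 on success, -1 on failure. */
int load_and_init(const char *path, const char *symbol) {
    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    /* Assign through a void ** to convert dlsym's result into a
     * function pointer without a direct object-to-function cast. */
    void (*init_fn)(void);
    *(void **)(&init_fn) = dlsym(handle, symbol);
    if (init_fn == NULL) {
        fprintf(stderr, "dlsym: %s\n", dlerror());
        dlclose(handle);
        return -1;
    }

    init_fn();
    return 0;
}
```

When injecting via GDB, the same two steps would be issued as calls from the debugger rather than from C code in the target.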

Note that you may also want to replace dlopen so that libraries loaded after yours get patched as well. This can happen if you load your library quite early or if the application has dynamically loaded plugins.

Partin answered 4/12, 2014 at 21:44 Comment(11)
This does look promising. And indeed, it works for the simple test you added. But in more complicated scenarios, e.g. when I attach to a bigger application via gdb and then call (void) dlopen("/tmp/libinject.so", 0x0001) there, I see that the lib gets initialized, but fails to find the malloc address. When I try it with kwrite e.g., the symbols it finds are __libc_start_main, __gmon_start__, kdemain.Timbrel
BTW, if I have time, I definitely plan to look more into your code. This looks extremely promising. Can I still award you bounty points if this works out in the end? If so, I'd be more than willing to.Timbrel
Is there a reason you use getauxval instead of dl_iterate_phdr from link.h?Timbrel
I've tried to iterate over all dynamic sections with dl_iterate_phdr, in the hope to make this overloading work even in apps that load in shared libraries, but can't get it to work... My code lives here (note: C++11 syntax) paste.kde.org/ptobkcije <-- it crashes when trying to overwrite the found malloc address, even though I check for readable and writable dynamic sections via p_flags... Any idea what I'm doing wrong? Do I need to call mprotect somewhere?Timbrel
Whoa! I got it: for shared libraries, I need to take the dlpi_addr offset into account! Both when casting the ElfW(Dyn) from p_vaddr and when writing the symbol at rela->r_offset, I need to add the dlpi_addr offset, and magically it works. So many thanks Celelibi, without your help, I would never have found a way to write this up! How can I show you my gratitude? I've now accepted your answer, but the original bounty already timed out. Can I give you another bounty? Or anything else? Many thanks, really! Here's a link to my latest code: bit.ly/1Axmk4YTimbrel
The main reason I didn't use dl_iterate_phdr is that I didn't know about this function. :) I modified the code in my answer to use it and to apply the patch to all the loaded shared objects. You usually don't need to use mprotect, but some programs (like vim, compiled as a shared object itself) may require it.Partin
I read your C++ code, and I don't know why you need to skip ld-linux-x86-64.so and your own library. It works for me. I also don't know why you search for a PT_DYNAMIC segment with read and write permission. Normally, all that matters are the PT_LOAD segments. And I would suggest making all your functions static.Partin
Without the mprotect, I get the crashes on ld-linux etc. pp. That helps, thanks! I could also get rid of the flag checks then. Note that all my functions are static, as they are in an anonymous namespace. I'll reward you with the additional bounty tomorrow - thanks again Celelibi!Timbrel
@Timbrel I extended my answer to also mention overloading dlopen if needed, because the program may load libraries afterwards.Partin
Yep, I also think I'll need to add some update code for dlopen. Many thanks again Celelibi. I rewarded you with a bounty, as you might have seen. Have fun with the imaginary reputation ;-) I hope you'll help more people the way you did help me. Really awesome!Timbrel
Just in case getauxval() isn't in your glibc (it was added in glibc 2.16), here's an alternative that gets the AT_PHENT that should work on older platforms:

struct PVTM_AUXV {
    unsigned long type;
    unsigned long val;
};

unsigned long int pvtmGetProgHeaderEnt() {
    struct PVTM_AUXV auxv;
    int fd = open("/proc/self/auxv", O_RDONLY);
    if (fd != -1) {
        do {
            if (read(fd, &auxv, sizeof(auxv)) == sizeof(auxv)) {
                if (auxv.type == 4 /* AT_PHENT */) {
                    close(fd);
                    return auxv.val;
                }
            } else {
                close(fd);
                return 0;
            }
        } while (1);
    }
    return 0;
}
Compulsive

This cannot be done without tinkering with assembler a bit. Basically, you will have to do what gdb and ltrace do: find the virtual addresses of malloc and friends in the process image and put breakpoints at their entry points. This process usually involves temporarily rewriting the executable code, as you need to replace normal instructions with "trap" ones (such as int 3 on x86).
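To make the breakpoint idea concrete, here is a rough sketch of how a tracer might plant and remove an int 3 with ptrace. It is x86-specific (0xCC is the one-byte int 3 opcode, and the byte at the target address is the low byte of the peeked word because x86 is little-endian). The function names are made up for illustration, and the sketch assumes the target has already been attached with PTRACE_ATTACH and is stopped. Finding the address of malloc in the target is not shown.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>

/* Replace the lowest-addressed byte of a peeked word with int 3 (0xCC). */
long with_int3(long word) {
    return (word & ~0xffL) | 0xccL;
}

/* Hypothetical helper: plant a breakpoint at `addr` in the stopped tracee
 * `pid`, saving the original word in *saved. Returns 0 on success. */
int set_breakpoint(pid_t pid, void *addr, long *saved) {
    errno = 0;
    long word = ptrace(PTRACE_PEEKTEXT, pid, addr, NULL);
    if (word == -1 && errno != 0)   /* -1 can also be valid data */
        return -1;

    *saved = word;
    if (ptrace(PTRACE_POKETEXT, pid, addr, (void *)with_int3(word)) == -1)
        return -1;
    return 0;
}

/* Hypothetical helper: restore the word saved by set_breakpoint. */
int clear_breakpoint(pid_t pid, void *addr, long saved) {
    if (ptrace(PTRACE_POKETEXT, pid, addr, (void *)saved) == -1)
        return -1;
    return 0;
}
```

Every hit on such a breakpoint stops the tracee and wakes the tracer, which is a large part of why this approach is much slower than patching a pointer once and detaching.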

If you want to avoid doing this yourself, there is a linkable wrapper around gdb (libgdb), or you can build ltrace as a library (libltrace). As ltrace is much smaller, and its library variant is available out of the box, it will probably let you do what you want with less effort.

For example, here's the best part of the "main.c" file from the ltrace package:

int
main(int argc, char *argv[]) {
    ltrace_init(argc, argv);

 /*
    ltrace_add_callback(callback_call, EVENT_SYSCALL);
    ltrace_add_callback(callback_ret, EVENT_SYSRET);
    ltrace_add_callback(endcallback, EVENT_EXIT);

    But you would probably need EVENT_LIBCALL and EVENT_LIBRET
 */

    ltrace_main();
    return 0;
}

http://anonscm.debian.org/cgit/collab-maint/ltrace.git/tree/?id=0.7.3

Lenorelenox answered 28/11, 2014 at 12:19 Comment(13)
Thanks for the hints. LTrace seems to have an extremely high overhead though. So high that it becomes impractical for me to use it. I may need to wait for the perf subsystem to support native "scripts" which I could then use to hook up to a custom userspace breakpoint...Timbrel
You will end up in the same place. Execution tracing of any kind is rather slow and even hardware breakpoints can slow things down very considerably. In my opinion, the only reasonably fast approach will be to scan all the modules loaded for the process and then, using their disk images as references, redo the dynamic linking process for symbols of interest (so instead of linking to malloc, the process image would now link to an accounting stub that forwards to malloc). This is not difficult per se, but the effort to get it right may be considerable.Lenorelenox
So ltrace, or similarly GDB, cannot just do the rewrite for me once and then "detach"? I mean, after malloc/free were rewritten in libc, I'd expect to have no further overhead besides the additional jump and what I add in my own tool. Why is that not the case?Timbrel
The issue of "hot" dll injection is mostly of interest to people developing exploits, so this stuff is not very visible publicly. Here's an example of "hot" symbol injector: github.com/ice799/injectso64Lenorelenox
Correct me if I'm wrong, but doesn't injectso64 "just" inject a shared library? That can be (trivially) accomplished using a small GDB script as well, by calling dlopen manually. Or does injectso64 also rewrite functions? That's what I'm really interested about. Maybe LTTng is what I'm looking for?Timbrel
Hm, no LTTng seems to be about predefined trace points. I think the holy grail would be a userspace API to get access to UProbes...Timbrel
"injectso" will rewrite the relocation entry in the running process so it will call your stuff instead of the shared library it was calling before. That's how exploits operate; gdb is not going to do anything like that.Lenorelenox
I fail to see how it is doing that - can you shed some light on that? The examples all seem to rely on the _init() function being called when the shared library is loaded. So I think what I need to understand is how inject_code() in inject.c can be rewritten to overload malloc or similar to call a custom function of mine instead.Timbrel
@Timbrel inject_code() does just enough to load your .so into the target process address space. Once you're in there, you're basically done since you can have the target do anything on your behalf. "injectso" has its shortcomings, but that's probably the closest you can get to having "an alternative to LD_PRELOAD [...] attach to a running process". Provided you also inject ld.so into the target if it's not already there, you may even be able to use the same .so file both with LD_PRELOAD and with "injectso".Edessa
(correction: "provided that you inject libdl.so ...")Edessa
@xbug: See my original question: getting into the process is easy (GDB attach and call dlopen on your .so, done). What I need to do though is overwrite the symbols of e.g. malloc to call my function and then delegate to the original function. With LD_PRELOAD the linker handles that for me. Now I'm looking for a way to do this manually when attaching to a process. If injectso can also do this, somehow, I'd be very interested. My knowledge of assembly is apparently not enough to understand what it is doing.Timbrel
@millianw You're right, the injectso guys clearly omitted the bit of the code which will walk the PLT table and replace arbitrary symbols at will. They probably assumed that a worthy hacker will be able to do it on his own, like this guy here: shadowwhowalks.blogspot.com.au/2013/01/…Lenorelenox
Haha, yeah, apparently I need to learn quite a bit more before I can implement this. Many thanks so far oakad, I've upvoted you already. Should no one give me a "better" answer in the next three days, you'll get the bounty and I'll accept your answer as well. Thanks.Timbrel
