How do I merge two binary executables?

This question follows on from another question I asked before. In short, this is one of my attempts at merging two fully linked executables into a single fully linked executable. The difference is that the previous question dealt with merging an object file into a fully linked executable, which is even harder because it means I need to deal with relocations manually.

What I have are the following files:

example-target.c:

#include <stdlib.h>
#include <stdio.h>

int main(void)
{
    puts("1234");
    return EXIT_SUCCESS;
}

example-embed.c:

#include <stdlib.h>
#include <stdio.h>

/*
 * Fake main. Never used, just there so we can perform a full link.
 */
int main(void)
{
    return EXIT_SUCCESS;
}

void func1(void)
{
    puts("asdf");
}

My goal is to merge these two executables to produce a final executable which is the same as example-target, but additionally has another main and func1.

From the point of view of the BFD library, each binary is composed (amongst other things) of a set of sections. One of the first problems I faced was that these sections had conflicting load addresses (such that if I was to merge them, the sections would overlap).
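The clash described above comes down to an interval-overlap test on section address ranges. The following is a minimal sketch of that test, not the code I actually used: `struct sec_range`, `ranges_overlap` and `first_clash` are hypothetical names, and with BFD the address and size would come from each section's VMA and size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one section's address range, [addr, addr + size). */
struct sec_range {
    const char *name;
    uint64_t    addr;
    uint64_t    size;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool ranges_overlap(const struct sec_range *a, const struct sec_range *b)
{
    if (a->size == 0 || b->size == 0)
        return false;   /* empty sections occupy no addresses */
    return a->addr < b->addr + b->size && b->addr < a->addr + a->size;
}

/* Returns the index of the first section in `target` that collides with
 * `embed`, or -1 if there is no collision. */
static long first_clash(const struct sec_range *target, size_t n,
                        const struct sec_range *embed)
{
    assert(target != NULL && embed != NULL);
    for (size_t i = 0; i < n; i++)
        if (ranges_overlap(&target[i], embed))
            return (long)i;
    return -1;
}
```

Running `first_clash` for every section of example-embed against the section list of example-target would report which sections need to be moved.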

What I did to solve this was to analyse example-target programmatically to get a list of the load address and sizes of each of its sections. I then did the same for example-embed and used this information to dynamically generate a linker command for example-embed.c which ensures that all of its sections are linked at addresses that do not overlap with any of the sections in example-target. Hence example-embed is actually fully linked twice in this process: once to determine how many sections and what sizes they are, and once again to link with a guarantee that there are no section clashes with example-target.

On my system, the linker command produced is:

-Wl,--section-start=.new.interp=0x1004238,--section-start=.new.note.ABI-tag=0x1004254,
--section-start=.new.note.gnu.build-id=0x1004274,--section-start=.new.gnu.hash=0x1004298,
--section-start=.new.dynsym=0x10042B8,--section-start=.new.dynstr=0x1004318,
--section-start=.new.gnu.version=0x1004356,--section-start=.new.gnu.version_r=0x1004360,
--section-start=.new.rela.dyn=0x1004380,--section-start=.new.rela.plt=0x1004398,
--section-start=.new.init=0x10043C8,--section-start=.new.plt=0x10043E0,
--section-start=.new.text=0x1004410,--section-start=.new.fini=0x10045E8,
--section-start=.new.rodata=0x10045F8,--section-start=.new.eh_frame_hdr=0x1004604,
--section-start=.new.eh_frame=0x1004638,--section-start=.new.ctors=0x1204E28,
--section-start=.new.dtors=0x1204E38,--section-start=.new.jcr=0x1204E48,
--section-start=.new.dynamic=0x1204E50,--section-start=.new.got=0x1204FE0,
--section-start=.new.got.plt=0x1204FE8,--section-start=.new.data=0x1205010,
--section-start=.new.bss=0x1205020,--section-start=.new.comment=0xC04000

(Note that I prefixed section names with .new using objcopy --prefix-sections=.new example-embedobj to avoid section name clashes.)
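For illustration, here is a rough sketch of how such a linker option string could be generated. It is a simplification and not my actual generator: it places every embed section contiguously above a caller-supplied base address (for example the page-aligned end of the highest target section), which is not necessarily the layout shown above. `struct sec_info`, `align_up`, `highest_end` and `layout_embed` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one section: name, address, size and alignment
 * expressed as a power of two (as BFD's alignment_power is). */
struct sec_info {
    const char *name;
    uint64_t    addr;
    uint64_t    size;
    unsigned    align_pow;
};

/* Round v up to the next multiple of 2^pow. */
static uint64_t align_up(uint64_t v, unsigned pow)
{
    uint64_t mask = ((uint64_t)1 << pow) - 1;
    return (v + mask) & ~mask;
}

/* Highest end address (addr + size) over all target sections. */
static uint64_t highest_end(const struct sec_info *t, size_t n)
{
    uint64_t end = 0;
    for (size_t i = 0; i < n; i++)
        if (t[i].addr + t[i].size > end)
            end = t[i].addr + t[i].size;
    return end;
}

/* Assign each embed section an address at or above `base`, in order, and
 * write the matching -Wl,--section-start=... string into `out`.
 * Returns the string length, or -1 if `out` is too small. */
static int layout_embed(struct sec_info *embed, size_t n, uint64_t base,
                        char *out, size_t cap)
{
    size_t   used;
    uint64_t cursor = base;
    int      w;

    assert(embed != NULL && out != NULL && cap > 0);
    w = snprintf(out, cap, "-Wl");
    if (w < 0 || (size_t)w >= cap)
        return -1;
    used = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        cursor = align_up(cursor, embed[i].align_pow);
        embed[i].addr = cursor;
        w = snprintf(out + used, cap - used, ",--section-start=%s=0x%llX",
                     embed[i].name, (unsigned long long)cursor);
        if (w < 0 || (size_t)w >= cap - used)
            return -1;
        used += (size_t)w;
        cursor += embed[i].size;
    }
    return (int)used;
}
```

The sizes fed into `layout_embed` would come from the first link of example-embed, and the resulting string would be passed to the compiler driver for the second link.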

I then wrote some code to generate a new executable (borrowing code from both objcopy and the book Security Warrior). The new executable should have:

  • All the sections of example-target and all the sections of example-embed
  • A symbol table which contains all the symbols from example-target and all the symbols of example-embed

The code I wrote is:

#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <bfd.h>
#include <libiberty.h>

struct COPYSECTION_DATA {
    bfd *      obfd;
    asymbol ** syms;
    int        symsize;
    int        symcount;
};

void copy_section(bfd * ibfd, asection * section, PTR data)
{
    struct COPYSECTION_DATA * csd  = data;
    bfd *             obfd = csd->obfd;
    asection *        s;
    long              size, count, sz_reloc;

    if((bfd_get_section_flags(ibfd, section) & SEC_GROUP) != 0) {
        return;
    }

    /* get output section from input section struct */
    s        = section->output_section;
    /* get sizes for copy */
    size     = bfd_get_section_size(section);
    sz_reloc = bfd_get_reloc_upper_bound(ibfd, section);

    if(!sz_reloc) {
        /* no relocations */
        bfd_set_reloc(obfd, s, NULL, 0);
    } else if(sz_reloc > 0) {
        arelent ** buf;

        /* build relocations */
        buf   = xmalloc(sz_reloc);
        count = bfd_canonicalize_reloc(ibfd, section, buf, csd->syms);
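        /* set relocations for the output section; BFD keeps this pointer
         * until obfd is written, so buf is only freed when it is unused */
        if(count > 0) {
            bfd_set_reloc(obfd, s, buf, count);
        } else {
            bfd_set_reloc(obfd, s, NULL, 0);
            free(buf);
        }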
    }

    /* get input section contents, set output section contents */
    if(section->flags & SEC_HAS_CONTENTS) {
        bfd_byte * memhunk = NULL;
        bfd_get_full_section_contents(ibfd, section, &memhunk);
        bfd_set_section_contents(obfd, s, memhunk, 0, size);
        free(memhunk);
    }
}

void define_section(bfd * ibfd, asection * section, PTR data)
{
    bfd *      obfd = data;
    asection * s    = bfd_make_section_anyway_with_flags(obfd,
            section->name, bfd_get_section_flags(ibfd, section));
    /* set size to same as ibfd section */
    bfd_set_section_size(obfd, s, bfd_section_size(ibfd, section));

    /* set vma */
    bfd_set_section_vma(obfd, s, bfd_section_vma(ibfd, section));
    /* set load address */
    s->lma = section->lma;
    /* set alignment -- the power 2 will be raised to */
    bfd_set_section_alignment(obfd, s,
            bfd_section_alignment(ibfd, section));
    s->alignment_power = section->alignment_power;
    /* link the output section to the input section */
    section->output_section = s;
    section->output_offset  = 0;

    /* copy merge entity size */
    s->entsize = section->entsize;

    /* copy private BFD data from ibfd section to obfd section */
    bfd_copy_private_section_data(ibfd, section, obfd, s);
}

void merge_symtable(bfd * ibfd, bfd * embedbfd, bfd * obfd,
        struct COPYSECTION_DATA * csd)
{
    /* set obfd */
    csd->obfd     = obfd;

    /* get required size for both symbol tables and allocate memory */
    csd->symsize  = bfd_get_symtab_upper_bound(ibfd) /********+
            bfd_get_symtab_upper_bound(embedbfd) */;
    csd->syms     = xmalloc(csd->symsize);

    csd->symcount =  bfd_canonicalize_symtab (ibfd, csd->syms);
    /******** csd->symcount += bfd_canonicalize_symtab (embedbfd,
            csd->syms + csd->symcount); */

    /* copy merged symbol table to obfd */
    bfd_set_symtab(obfd, csd->syms, csd->symcount);
}

bool merge_object(bfd * ibfd, bfd * embedbfd, bfd * obfd)
{
    struct COPYSECTION_DATA csd = {0};

    if(!ibfd || !embedbfd || !obfd) {
        return false;
    }

    /* set output parameters to ibfd settings */
    bfd_set_format(obfd, bfd_get_format(ibfd));
    bfd_set_arch_mach(obfd, bfd_get_arch(ibfd), bfd_get_mach(ibfd));
    bfd_set_file_flags(obfd, bfd_get_file_flags(ibfd) &
            bfd_applicable_file_flags(obfd));

    /* set the entry point of obfd */
    bfd_set_start_address(obfd, bfd_get_start_address(ibfd));

    /* define sections for output file */
    bfd_map_over_sections(ibfd, define_section, obfd);
    /******** bfd_map_over_sections(embedbfd, define_section, obfd); */

    /* merge private data into obfd */
    bfd_merge_private_bfd_data(ibfd, obfd);
    /******** bfd_merge_private_bfd_data(embedbfd, obfd); */

    merge_symtable(ibfd, embedbfd, obfd, &csd);

    bfd_map_over_sections(ibfd, copy_section, &csd);
    /******** bfd_map_over_sections(embedbfd, copy_section, &csd); */

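    /* csd.syms must stay allocated: bfd_set_symtab() only records the
     * pointer, and the table is written when obfd is closed */
    return true;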
}

int main(int argc, char **argv)
{
    bfd * ibfd;
    bfd * embedbfd;
    bfd * obfd;

    if(argc != 4) {
        /* perror() would append an unrelated errno message here */
        fprintf(stderr, "Usage: %s infile embedfile outfile\n", argv[0]);
        xexit(-1);
    }

    bfd_init();
    ibfd     = bfd_openr(argv[1], NULL);
    embedbfd = bfd_openr(argv[2], NULL);

    if(ibfd == NULL || embedbfd == NULL) {
        /* bfd_perror() reports the BFD error state rather than errno */
        bfd_perror("Error opening input files");
        xexit(-1);
    }

    if(!bfd_check_format(ibfd, bfd_object) ||
            !bfd_check_format(embedbfd, bfd_object)) {
        bfd_perror("File format error");
        xexit(-1);
    }

    obfd = bfd_openw(argv[3], NULL);
    bfd_set_format(obfd, bfd_object);

    if(!(merge_object(ibfd, embedbfd, obfd))) {
        perror("Error merging input/obj");
        xexit(-1);
    }

    bfd_close(ibfd);
    bfd_close(embedbfd);
    bfd_close(obfd);
    return EXIT_SUCCESS;
}

To summarise, this code takes two input files (ibfd and embedbfd) and generates an output file (obfd). It does the following:

  • Copies format/arch/mach/file flags and start address from ibfd to obfd.
  • Defines the sections of both ibfd and embedbfd in obfd. Population of the sections happens separately because BFD mandates that all sections are created before any of them is populated.
  • Merges the private data of both input BFDs into the output BFD. Since BFD is a common abstraction over many file formats, it cannot necessarily capture everything the underlying file format requires.
  • Creates a combined symbol table from the symbol tables of ibfd and embedbfd and sets it as the symbol table of obfd. This symbol table is saved so that it can later be used to build relocation information.
  • Copies the sections from ibfd and embedbfd to obfd. As well as copying the section contents, this step also builds and sets the relocation table.

In the code above, some lines are commented out with /******** */. These lines deal with the merging of example-embed. With them commented out, obfd is simply built as a copy of ibfd. I have tested this and it works fine. However, once I uncomment these lines, the problems start.

With the uncommented version which does the full merge, it still generates an output file. This output file can be inspected with objdump and found to have all the sections, code and symbol tables of both inputs. However, objdump complains with:

BFD: BFD (GNU Binutils for Ubuntu) 2.21.53.20110810 assertion fail ../../bfd/elf.c:1708
BFD: BFD (GNU Binutils for Ubuntu) 2.21.53.20110810 assertion fail ../../bfd/elf.c:1708

On my system, line 1708 of elf.c is:

BFD_ASSERT (elf_dynsymtab (abfd) == 0);

elf_dynsymtab is a macro in elf-bfd.h for:

#define elf_dynsymtab(bfd)  (elf_tdata(bfd) -> dynsymtab_section)

I'm not familiar with the ELF layer, but I believe this is a problem with reading the dynamic symbol table (or perhaps it is saying the table is not present). For the time being, I am trying to avoid reaching down directly into the ELF layer unless necessary. Is anyone able to tell me what I'm doing wrong, either in my code or conceptually?

If it is helpful, I can also post the code for the linker command generation or compiled versions of the example binaries.


I realise that this is a very large question and for this reason, I would like to properly reward anyone who is able to help me with it. If I am able to solve this with the help of someone, I am happy to award a 500+ bonus.

Shick answered 15/3, 2012 at 14:36 Comment(8)
Why are you trying to do this? What is the motivation? Do you have the source code of the two binaries? Seems rather foolish IMHO. - Zacharyzacherie
@EdHeal See the linked question at the top to his other question, which has some rationale. - Rangel
@EdHeal: I am making a static executable editor, which can take a target, inject user defined routines into it (the role of the example-embed) and then statically detour the code of the new binary to link up the original code to the injected code (I have already written a disassembler/CFG analysis engine and I can also edit arbitrary instructions so this injection is the final piece of the puzzle). For the usecases I need to care about, it can be assumed we have access to the source code of the user defined routines but not the target. - Shick
Does your final executable have a .dynsymtab section? (readelf -WS exename) - Rangel
This is going to require thinking and sleeping on. - Zacharyzacherie
@DanFego: The final executable has a .dynsym section and .new.dynsym section. readelf -WS results in 'readelf: Error: File contains multiple dynamic symbol tables'. So perhaps I have to manually merge those sections. - Shick
@MikeKwan Yeah, that would be one step in the right direction, though you'll definitely be getting into ELF-y territory there. - Rangel
@DanFego: Yep, I guess I should check for SHT_REL or SHT_RELA when iterating through sections then reach down into the ELF layer in those cases and perform the merge manually. - Shick
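
The readelf error quoted in the comments suggests that the merged file carries more than one section header of type SHT_DYNSYM (here .dynsym and .new.dynsym). A quick way to confirm that without BFD is to walk the section header table directly. The sketch below assumes a little-endian ELF64 image already loaded into memory; `read_le` and `count_sections_of_type` are hypothetical helper names, the field offsets are those of the ELF64 header and section header layouts, and extended section numbering (e_shnum == 0) is not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHT_DYNSYM_VALUE 11u   /* numeric value of SHT_DYNSYM in the ELF spec */

/* Read an nbytes-wide little-endian integer starting at p. */
static uint64_t read_le(const unsigned char *p, unsigned nbytes)
{
    uint64_t v = 0;
    while (nbytes-- > 0)
        v = (v << 8) | p[nbytes];
    return v;
}

/* Count section headers whose sh_type equals `wanted`.
 * Returns -1 if the image is not a well-formed little-endian ELF64 file. */
static long count_sections_of_type(const unsigned char *img, size_t len,
                                   uint32_t wanted)
{
    uint64_t shoff, shentsize, shnum;
    long     count = 0;

    assert(img != NULL);
    if (len < 64 || memcmp(img, "\177ELF", 4) != 0)
        return -1;
    if (img[4] != 2 || img[5] != 1)          /* ELFCLASS64, ELFDATA2LSB */
        return -1;

    shoff     = read_le(img + 40, 8);        /* e_shoff */
    shentsize = read_le(img + 58, 2);        /* e_shentsize */
    shnum     = read_le(img + 60, 2);        /* e_shnum */
    if (shentsize < 64 || shoff > len || shnum * shentsize > len - shoff)
        return -1;

    for (uint64_t i = 0; i < shnum; i++) {
        const unsigned char *sh = img + shoff + i * shentsize;
        if (read_le(sh + 4, 4) == wanted)    /* sh_type */
            count++;
    }
    return count;
}
```

A result greater than 1 for `SHT_DYNSYM_VALUE` would match the condition readelf reports, and would point at the dynamic symbol tables as something that has to be merged into one rather than copied side by side.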

Why do all of this manually? Given that you have all symbol information (which you must if you want to edit the binary in a sane way), wouldn't it be easier to SPLIT the executable into separate object files (say, one object file per function), do your editing, and relink it?

Straightforward answered 15/3, 2012 at 16:3 Comment(6)
How can an executable be split into an object file without source code? I can assume symbol information is available for the embed object, but not for the target (although if I can get it working with this assumption first, that would be fine). - Shick
ELF executables can retain both relocation information and symbol table. When both pieces are present, it is relatively simple to split the executable into object files as the symbol table also says whether a symbol is data or code. Also, why are you trying to merge executables? It would be easier to inject an object file. - Straightforward
I was trying to merge executables because I thought it would be easier since I no longer have to deal with relocation. I wasn't able to find a way to inject an object file when I tried that instead. How would I go about splitting the executable into object files? Thanks for your attention btw. - Shick
You can't meaningfully adjust jump instructions and data references without relocation information. And you can't rely solely on BFD -- you'll eventually have to learn details of ELF. Re splitting, in a nutshell: for each function symbol (address+length), output the function + code that is recursively reachable from the relocations that point into the function's code. You'll also have to copy data relocations. The complete data segment can go in its own file. - Straightforward
I'm not familiar with the object file format, but I'll read up on this. I don't quite understand what you mean by 'code that is recursively reachable from the relocations'. Are you suggesting generating a CFG with functions with symbols as the roots? Has what you are suggesting been done before? Or is there a name for the process so I can read up more on it? - Shick
Maybe you'll find this useful: elfsh.asgardlabs.org Otherwise, I think I know how I would do it, but I can't describe it. Also, none of this would work (including CFG building) if the program constructs function addresses at runtime. - Straightforward
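
One possible reading of "recursively reachable from the relocations" is a graph traversal: treat each function symbol as a node, and each relocation whose site lies inside one function and whose target lies inside another as an edge. The sketch below illustrates only that reading; it is not the answerer's actual procedure. `struct func_sym`, `struct reloc`, `func_containing` and `mark_reachable` are hypothetical names, and relocations that point at data (outside any function) are simply skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical inputs: function symbols (address + length) and relocations
 * reduced to "reference located at `site` points at address `target`". */
struct func_sym { const char *name; uint64_t addr; uint64_t size; };
struct reloc    { uint64_t site; uint64_t target; };

/* Index of the function whose range contains addr, or -1 if none does. */
static long func_containing(const struct func_sym *f, size_t nf, uint64_t addr)
{
    for (size_t i = 0; i < nf; i++)
        if (addr >= f[i].addr && addr - f[i].addr < f[i].size)
            return (long)i;
    return -1;
}

/* Mark every function reachable from `root` by following relocations.
 * `marked` must hold nf zeroed entries; `stack` is nf entries of scratch.
 * Each function is pushed at most once, so the stack cannot overflow.
 * Returns the number of functions marked, including the root. */
static size_t mark_reachable(const struct func_sym *f, size_t nf,
                             const struct reloc *r, size_t nr,
                             size_t root, bool *marked, size_t *stack)
{
    size_t top = 0, count = 0;

    assert(root < nf && marked != NULL && stack != NULL);
    marked[root] = true;
    stack[top++] = root;

    while (top > 0) {
        size_t cur = stack[--top];
        count++;
        for (size_t i = 0; i < nr; i++) {
            long to;
            if (func_containing(f, nf, r[i].site) != (long)cur)
                continue;                 /* relocation is not in this function */
            to = func_containing(f, nf, r[i].target);
            if (to >= 0 && !marked[to]) { /* unseen function: visit it later */
                marked[to] = true;
                stack[top++] = (size_t)to;
            }
        }
    }
    return count;
}
```

The set marked for a given root would be the code emitted into that root's object file. As noted in the last comment, this cannot see function addresses that the program constructs at runtime.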
