For an image-based system like most Lisps the answer to this is simple (but sometimes fiddly).
The compiler / assembler takes source code, and its end result, like that of any compiler or assembler, is one or more arrays of octets representing the resulting object code, perhaps some data associated with it, together with information on the names the code refers to and defines, relocation information, and so on.
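To make that concrete, here is a schematic, in C, of the kinds of records such a result has to carry. The struct and field names are invented for illustration; no real object format (ELF included) looks exactly like this:

#include <stddef.h>

/* One blob of compiled octets (code, or data associated with it). */
struct section { const char *name; unsigned char *octets; size_t length; };

/* A name the code defines (defined = 1) or merely refers to (defined = 0). */
struct symbol  { const char *name; size_t section; size_t offset; int defined; };

/* A place in some section that must be patched with a symbol's final address. */
struct reloc   { size_t section; size_t offset; const char *symbol; int kind; };

A file format like ELF is, in essence, an agreed serialisation of records like these, so that separate tools can pass them to each other.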
In a traditional system, those arrays are then laboriously written into a file (long ago, when machines had almost no memory, it was probably necessary to write them to files as they were created), and then some program is invoked which glues several of those files together into a single file, patching up references and so on. That resulting file is then loaded into the memory of the machine, yet more patching up is done, and finally the machine is told to run it. And the program instantly crashes and the whole process needs to be done again. (I have skated over many details here.)
And there then needs to be some kind of protocol – in the form of one or more standard file formats – which allows all these different tools to drag the data to and from memory as many times as they need to. ELF is one such standard: there have been dozens of others.
In an image-based system none of that bureaucracy is needed: the compiler / assembler produces an array of octets of some kind as before, as well as some representations of data. All of this simply lives in memory, and most of the patching up of that array is probably done as it is created. That array is now executable code, so in principle all that needs to happen is that the machine is told 'start running this'. In practice, on a modern machine, a little more needs to be done: the memory the code lives in needs to be marked as executable, and probably a little dance needs to happen because memory marked as executable can't be written to, and so on.
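Here is a minimal sketch of that dance in C, assuming an x86-64 POSIX system. The six octets are a hand-assembled function that just returns 42; none of this is what LispWorks actually does internally, but it is the underlying mechanism:

#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on some systems */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* mov eax, 42 ; ret -- x86-64 machine code for a function returning 42 */
    static const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    /* Allocate a page that is writable but not yet executable. */
    void *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    memcpy(mem, code, sizeof code);

    /* The little dance: flip the page from writable to executable. */
    if (mprotect(mem, 4096, PROT_READ | PROT_EXEC) != 0) return 1;

    /* Casting data pointers to function pointers is not strictly ISO C,
       but it is the standard POSIX idiom for exactly this. */
    int (*fn)(void) = (int (*)(void))mem;
    printf("%d\n", fn());   /* prints 42 */

    munmap(mem, 4096);
    return 0;
}

No file is ever written: the octets go straight from the compiler's output into a page of memory, and the machine jumps to them.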
You can see this at work:
> (defun foo (x y)
    (+ x y))
foo
> (compile 'foo)
foo
nil
nil
> (describe (function foo))
#<Function foo 80200109F4> is a function
code #<code foo (76) 80200109C0>
constants (0 #<Function foo 80200109F4> foo #(#(1 0) 0) (x y))
So the foo function (the thing the compiler produced) has two components: its code, which is an object wrapping the array of octets that the machine will execute, and its constants. In fact, in the implementation I'm using (LispWorks) there are a couple of functions to ask things about the function's code:
> (system:function-code-length #'foo)
76
It's 76 octets long, and if you (disassemble 'foo) you will see that this is indeed the length of the code:
> (disassemble 'foo)
[...]
75: 90 nop
You can find its address in memory:
> (system:function-code-address #'foo)
550292752884
And you can see that this address can change when the GC relocates it:
> (clean-down)
51183616
> (system:function-code-address #'foo)
559151419204
(clean-down in LW does a fairly big GC: it 'frees memory and reduces the size of the image, if possible'.)
In summary: what an incremental, image-based compiler / assembler does is the same as what a file-based compiler / assembler does ... except without writing the data into a file, copying it into another file, and then reading that final file back into memory, and without the conspiracy of file formats needed to do all that. It simply relies on the fact that the compiled code is already in memory, and runs it there.
A comment makes the same point at the machine level: allocate a block of memory with execute permission (e.g. with mmap or VirtualAlloc), then put the generated machine code at this location, and jmp or call with the address of the generated code. So to answer partially, you don't have to reload an ELF executable every time a new function (a sequence of machine instructions) is generated. – Norbertonorbie