How are functions encoded/stored in memory?

Asked 15/8, 2014 at 23:58 Answered 9/6, 2022 at 7:33

Solved function memory encoding cpu-architecture machine-code

I understand how things like numbers and letters are encoded in binary, and thus can be stored as 0's and 1's.

But how are functions stored in memory? I don't see how they could be stored as 0's and 1's, and I don't see how something could be stored in memory as anything besides 0's and 1's.

Propositus answered 15/8, 2014 at 23:58 Comment(2)

That's a very broad question, and the answer depends a lot on the language used (and thus whether it's compiled, interpreted, converted into bytecode, tokens...). – Staw 16/8, 2014 at 0:5

I'll give you a hint: In JPG files, all files starts and ends with a particular byte sequence (FF D8 and FF D9). Some sequences are encoded to particularly infer their type within the data. – Chloris 16/8, 2014 at 0:57

They are in fact stored into memory as 0's and 1's

Here is a real world example:

int func(int a, int b) {
    return (a + b);
}

Here is an example of 32-bit x86 machine instructions that a compiler might generate for the function (in a text representation known as assembly code):

func:
        push    ebp
        mov     ebp, esp
        mov     edx, [ebp+8]
        mov     eax, [ebp+12]
        add     eax, edx
        pop     ebp
        ret

Going into how each of these instructions work is beyond the scope of this question, but each one of these symbols (such as add, pop, mov, etc) and their parameters are encoded into 1's and 0's. This table shows many of the Intel instructions and a summary of how they are encoded. See also the x86 tag wiki for links to docs/guides/manuals.

So how does one go about converting code from text assembly into machine-readable bytes (aka machine code)? Take for example, the instruction add eax, edx. This page shows how the add instruction is encoded. eax and edx are something called registers, spots in the processor used to hold information for processing. Variables in computer programming will often map to registers at some point. Because we are adding registers and the registers are 32-bit, we select the opcode 000000001 (see also Intel's official instruction-set reference manual entry for ADD, which lists all the forms available).

The next step is for specifying the operands. This section of the same previous page shows how this is done with the example "add ecx, eax" which is very similar to our own. The first two bits have to be '11' to show we are adding registers. The next 3 bits specifies the first register, in our case we pick edx rather than the eax in their example, which leaves us with '100'. The next 3 bits specifies our eax, so we have a final result of

00000001 11100000

Which is 01 D0 in hexadecimal. A similar process can be applied to converting any instruction into binary. The tool used to do this automatically is called an assembler.

So, running the above assembly code through an assembler produces the following output:

66 55 66 89 E5 66 67 8B 55 O8 66 67 8B 45 0C 66 01 D0 66 5D C3

Note the 01 D0 near the end of the string, this is our "add" instruction. Converting machine-code bytes back into text assembly-language mnemonics is called disassembling:

 address | machine code  |  disassembly
   0:      55              push   ebp
   1:      89 e5           mov    ebp, esp
   3:      8b 55 08        mov    edx, [ebp+0x8]
   6:      8b 45 0c        mov    eax, [ebp+0xc]
   9:      01 d0           add    eax, edx
   b:      5d              pop    ebp
   c:      c3              ret

Addresses start at zero because this is only a .o, not a linked binary. So they're just relative to the start of the file's .text section.

You can see this for any function you like on the Godbolt Compiler Explorer (or on your own machine on any binary, freshly-compiled or not, using a disassembler).

You may notice there is no mention of the name "func" in the final output. This is because in machine code, a function is referenced by its location in RAM, not its name. The compiler-output object file may have a func entry in its symbol table referring to this block of machine code, but the symbol table is read by software, not something the CPU hardware can decode and run directly. The bit-patterns of the machine code are seen and decoded directly by transistors in the CPU.

Sometimes it is hard for us to understand how computers encode instructions like this at a low level because as programmers or power users, we have tools to avoid ever dealing with them directly. We rely on compilers, assemblers, and interpreters to do the work for us. Nonetheless, anything a computer ever does must eventually be specified in machine code.

Breadstuff answered 16/8, 2014 at 1:31 Comment(7)

Ahh ok, how enlightening! I don't understand the details, but I could see how a function is just a series of instructions and how these instructions could be encoded as 0's and 1's. Well, sort of. Could you expand a bit on how the instructions are encoded as 0's and 1's? – Propositus 16/8, 2014 at 1:36

@AdamZerner As Dougvj indicated the operation is encoded into some bits (the opcode) and operands (numeric constants and register names) are encoded into other bits. For the opcode this might be viewed as similar to the index into a virtual function table; for registers, an array index is roughly analogous. Actual ISAs are more complex since opcodes can include implicit operands (e.g., most RISCs use an implicitly specified link register for the jump-and-link instruction, x86 POP implicitly uses sp as a register operand) and the number and placement of bits can vary among instructions. – Untutored 16/8, 2014 at 14:22

@AdamZerner Sure thing, I greatly expanded this section. Careful it's a little hairy but I think you'll get it. – Breadstuff 17/8, 2014 at 4:2

@Breadstuff Cool, thanks! So I'm learning about data structures right now and. I learned one of the things to consider is the overhead that goes into creating a data structure. Up until now I just thought about the storage device (ie Node vs. array) creating overhead, but now it seems like the functions would create a lot of overhead. It seems that you'd need a lot of bytes to store a long function. Is this true? – Propositus 17/8, 2014 at 14:41

@AdamZerner Actually, you don't need that much relative to, say, images or videos. For example, all the code in the windows kernel compiles to about 25 megabytes worth of instructions, so when we are talking about the size of video games or something, the data involving textures and images dwarf code. – Breadstuff 21/8, 2014 at 0:25

Might be useful to include a code block that shows the hexdump of the machine code on each line along with the disassembled instruction mnemonic + operands. e.g. from objdump -drwC -Mintel. e.g. like this codegolf answer in x86 machine code. The Godbolt compiler explorer has a binary mode that shows this for gcc/clang/icc output. – Oestrone 30/11, 2016 at 9:28

I added some stuff that I think is an improvement. It's your answer though, so please review and see if you like my changes, and roll back or edit if not. – Oestrone 30/11, 2016 at 9:58

Functions are made of instructions, such as bytecode or machine code. Instructions are numbers, which can be encoded in binary.

A good introduction to this is Charles Petzold's book Code.

Bract answered 16/8, 2014 at 0:12 Comment(0)

-1

I will explain how functions are stored in the easiest way possible. You will be surprised by the amazing simplicity of all this at the end of this contribution. This is the most fundamental explanation and any type of computer will work in somehow the same way.

The only part of a computer that can perform any operations on data ie addition, subtraction, multiplication and division. Every data manipulation(any sort of Math or just any formula) in human existence is made up of these operations.

Now let us look at the basic structure of an instruction in binary. If we are working on a 32 bit machine, an instruction will take the form of:

1 001 32-bit address 32-bit adress

1(if this bit is one then the instruction is diverted to logic unit for calculation and if zero, we are basically moving the data between the two memory addresses following this) 001(these 3 bits determine if we are adding(001), subtracting(010), multiplying(011), or dividing(100) in this instruction cycle) (32-bit memory address of first memory location) (32-bit memory adress of second memory adress)

A function is basically a string of instructions of how to manipulate data in defined memory locations.

Let us take a random function that adds a number, then multiplies. It’s string of instructions will be:

(let me use MA to mean memory adress)

1 001 MAone MAtwo (adds value in MAone to value in MAtwo and stores resultant in MAone )

1 011 MAtwo MAthree (multiplies value in MAtwo with value in MAthree and stores resultant in MAthree )

Returns value in MAthree

So the ony difference in how functions are stored is that they are stored with 1 in the left most bit so that the CPU knows its a function that needs logical operations and diverts it to the ALU

Ingenious answered 9/6, 2022 at 7:33 Comment(6)

Every data manipulation(any sort of Math or just any formula) in human existence is made up of these operations. - You're talking about add/sub/mul/div here? Bitwise operations like AND / OR / XOR are significantly different, and rather painful to emulate in terms of arithmetic operations. Popcount, count-leading-zero, and shuffle or left-pack (like x86 pext / pdep) are also separate primitive operations, not cheap to emulate. – Oestrone 9/6, 2022 at 9:12

Sub-and-branch is enough for a computer to be Turing-complete, meaning it can compute any other operation, but I wouldn't say that means add, multiply, and divide are "made up of" subtraction. But that's the only sense in which bitwise AND / OR / XOR are "made up of" the arithmetic operations you mentioned. Or if you meant all operations an ALU can do (including bitwise), then the phrasing is not clear. – Oestrone 9/6, 2022 at 9:12

If we are working on a 32 bit machine - very few 32-bit machines have two separate 32-bit addresses as part of each machine instruction. Probably no real-world designs. This is just a hypothetical example you made up, with 68-bit fixed-width instructions, very much unlike RISCs or x86. RISC machines would have 4-bit or 5-bit register numbers, not 32-bit memory addresses, because that's part of the point of registers. x86 with variable-length machine code can have instructions from 1 byte to 15 bytes, but only including up to 1 absolute address. Some like VAX may allow 2, but it's rare. – Oestrone 9/6, 2022 at 9:19

Anyway, an oversimplified hypothetical example is ok, but only if you say so, not imply that all 32-bit machines are like this, or that the details are anywhere near typical. Most ISAs have more opcode bits, and far fewer address bits as part of each instruction (whether that's register numbers or memory addresses). Many 32-bit machines use 32-bit instructions, so even a single absolute memory address can't be part of a single instructions. – Oestrone 9/6, 2022 at 9:20

I think a low-level explanation / example that ignores the existence of registers is unhelpful. In a lot of ways, it's very important to know that they exist separate from memory. (e.g. for understanding compile-time memory reordering, and how optimizations can matter for shared memory not using volatile or atomic.) – Oestrone 9/6, 2022 at 9:23

I was actually trying to simplify it to the most abstract and intuitive level, with no details. This is how I first understood the basic workings behind a computer and I am currently slowly diving into the details – Ingenious 27/7, 2022 at 12:36

Recommended topics

Hot tags