How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?
There are 2 cases:
a) the assembler is generating a flat binary or executable file itself, and no linker is involved
b) the assembler is generating an object file to be sent to a linker later
Note that you can have a mixture. For example, some assemblers (e.g. NASM) have keywords to create a temporary section (e.g. absolute), and structures are supported by internally using a temporary section (a field in a structure is an offset into a temporary section that begins at address zero).
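To make the "field is an offset into a temporary section that begins at address zero" idea concrete, here is a minimal sketch. The function name layout_struct and the field list are made up for illustration; this is not how NASM itself implements structures, just the same idea in a few lines.

```python
def layout_struct(fields):
    """Assign each field an offset in a temporary section that starts at 0.

    fields: list of (name, size_in_bytes) pairs (hypothetical input format).
    Returns (offsets_by_name, total_size).
    """
    offsets = {}
    here = 0                      # current position in the temporary section
    for name, size in fields:
        offsets[name] = here      # the field "symbol" is just an offset
        here += size
    return offsets, here


point_offsets, point_size = layout_struct([("x", 4), ("y", 4), ("flags", 1)])
```

The offsets are plain constants, so no linker is needed to resolve them even when the rest of the file goes into an object file.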
For both cases, the assembler converts the source code into some kind of internal representation (e.g. maybe an "instruction data, operand 1 data, operand 2 data, ..." thing) where the internal representation for instructions like "jmp foo" and "mov eax,bar/5+33" can't be simplified too much and needs to include some reference to a symbol in the symbol table.
For the symbol table itself, each entry has a symbol name (e.g. "foo"), which section it is in, the lowest possible offset within the section and the highest possible offset within the section. When the lowest possible offset and highest possible offset match, and the section has a known address, the assembler can replace references to that symbol in the internal representation with an actual value.
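A symbol table entry like the one described above could be sketched as follows. The class name Symbol, the resolve helper and the section names are assumptions for illustration only, not taken from any particular assembler.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Symbol:
    name: str
    section: str
    min_offset: int   # lowest possible offset within the section
    max_offset: int   # highest possible offset within the section


def resolve(sym: Symbol, section_addresses: dict) -> Optional[int]:
    """Return the symbol's final address, or None if it is not known yet."""
    base = section_addresses.get(sym.section)
    if base is None or sym.min_offset != sym.max_offset:
        return None               # section address or exact offset still unknown
    return base + sym.min_offset


sections = {".text": 0x1000}                  # .data has no known address yet
foo = Symbol("foo", ".text", 0x20, 0x20)      # offset pinned down
bar = Symbol("bar", ".text", 0x30, 0x33)      # an earlier jump's size is undecided
baz = Symbol("baz", ".data", 4, 4)            # offset known, section address not
```

Only foo can be replaced with an actual value here; bar and baz have to stay as symbol references in the internal representation for now.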
Note that there are cases where you can't know how large an instruction will be until later (e.g. for 80x86, "jmp foo" could be a 2 byte instruction if the target address is close, but may need to be a 3 byte instruction or 5 byte instruction if the target address isn't close, and you can't decide until you know something about the value that "foo" will have); and when you can't know how large an instruction will be, you can't know the offset of any symbols that occur later in the same section. This is why you end up wanting symbols to have both a lowest possible offset and a highest possible offset: even when you don't know the actual offset of a symbol, you can still know whether the offset will be small enough or too large, and can still determine how big an instruction will be (and get a better idea of the values of later symbols in that section).
More specifically; while assembling you want to do multiple passes, where each pass tries to convert the intermediate representations of each instruction into more specific/complete versions and tries to improve the lowest possible offset and highest possible offset values for symbols (so that you have more/better information that the next pass can use).
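Here is a toy sketch of those passes, under some loud assumptions: a single section, 32-bit code where "jmp" is either 2 bytes (rel8) or 5 bytes (rel32), and a made-up item format of ("label", name), ("jmp", target) and ("bytes", n). Real assemblers may use a different strategy; this only shows how the lowest/highest offsets tighten from pass to pass.

```python
SHORT, NEAR = 2, 5   # sizes of "jmp rel8" and "jmp rel32" in 32-bit code


def layout(items):
    """Refine label offset ranges and jump sizes until nothing changes."""
    INF = 10**9
    labels = {arg: (0, INF) for kind, arg in items if kind == "label"}
    sizes = {i: (SHORT, NEAR) for i, (kind, _) in enumerate(items) if kind == "jmp"}
    while True:
        changed = False
        lo = hi = 0                                # lowest/highest current offset
        for i, (kind, arg) in enumerate(items):
            if kind == "label":
                if labels[arg] != (lo, hi):
                    labels[arg] = (lo, hi)
                    changed = True
            elif kind == "bytes":
                lo += arg
                hi += arg
            else:                                  # a jump
                if sizes[i][0] != sizes[i][1]:     # size still undecided
                    t_lo, t_hi = labels[arg]
                    d_lo = t_lo - (hi + SHORT)     # smallest possible rel8 value
                    d_hi = t_hi - (lo + SHORT)     # largest possible rel8 value
                    if -128 <= d_lo and d_hi <= 127:
                        sizes[i] = (SHORT, SHORT)  # always fits in rel8
                        changed = True
                    elif d_hi < -128 or d_lo > 127:
                        sizes[i] = (NEAR, NEAR)    # never fits in rel8
                        changed = True
                lo += sizes[i][0]
                hi += sizes[i][1]
        if not changed:
            undecided = [i for i, s in sizes.items() if s[0] != s[1]]
            if not undecided:
                return labels, {i: s[0] for i, s in sizes.items()}
            for i in undecided:                    # stuck: the long form is always valid
                sizes[i] = (NEAR, NEAR)


program = [("jmp", "far"), ("label", "top"), ("bytes", 10),
           ("jmp", "top"), ("bytes", 200), ("label", "far")]
labels, jump_sizes = layout(program)
```

In the first pass the backward "jmp top" can already be fixed at 2 bytes even though the exact offset of "top" is only known to lie between 2 and 5; the forward "jmp far" is settled in the second pass once "far" has a range.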
When you have finished doing the "multiple passes" and the assembler is generating a flat binary with no linker involved, everything will be known (including the address of each section and the offset of every symbol within its section), all instructions will have been converted into actual bytes, and you can generate the final file.
When you have finished doing the "multiple passes" and the assembler is generating an object file; some things will not be known (the address of sections) and some things will be known (the offset of all symbols within sections, the size of all instructions); and the object file format will provide a way for you to provide details of things you don't/can't know (e.g. a list of things that need fixing, and information the linker can use to fix them) that you can provide from what's left of the intermediate representation of instructions and the symbol table.
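As a rough sketch of the "list of things that need fixing" idea: the assembler emits the part it knows (the symbol's offset within its section) plus a record telling the linker where to add the part it doesn't know (the section's address). The Reloc record and both function names are hypothetical, and real object file formats (ELF, COFF, OMF, ...) each have their own relocation layouts and types.

```python
import struct
from dataclasses import dataclass


@dataclass
class Reloc:
    where: int      # offset of the 4-byte field inside the emitting section
    section: str    # section whose final address must be added to that field


def emit_mov_eax_sym(code: bytearray, relocs: list, sym_section: str, sym_offset: int):
    """Assembler side: emit "mov eax, imm32" with the known offset as placeholder."""
    code.append(0xB8)                                # opcode for "mov eax, imm32"
    relocs.append(Reloc(len(code), sym_section))     # the field starts right here
    code += struct.pack("<I", sym_offset)


def link(code: bytearray, relocs: list, section_addresses: dict) -> bytes:
    """Linker side: add each section's final address to the recorded fields."""
    out = bytearray(code)
    for r in relocs:
        (addend,) = struct.unpack_from("<I", out, r.where)
        value = (addend + section_addresses[r.section]) & 0xFFFFFFFF
        struct.pack_into("<I", out, r.where, value)
    return bytes(out)


text, relocs = bytearray(), []
emit_mov_eax_sym(text, relocs, ".data", 0x10)        # "mov eax, bar", bar at .data+0x10
final = link(text, relocs, {".data": 0x00402000})
```

Storing the offset in the field itself and letting the linker add to it is only one possible convention; some formats keep the addend in the relocation record instead.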
Note that there can be cases that are too complex for an object file format to support (e.g. probably the "mov eax,bar/5+33" from earlier), where an instruction that can be assembled without any problem (if the assembler is generating a flat binary) has to be treated as an error (if the assembler is generating an object file). You will discover these cases (and generate appropriate error messages) when trying to create the object file.
Note that this all fits into a nice "3 phases" arrangement, where the "front-end" converts the "plain text" input into the intermediate representation, the "middle-end" (the multiple passes) refines the intermediate representation as much as possible, and the "back-end" generates a file. Only the back-end needs to care what the target file format is.
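For the "too complex for the object file format" case mentioned earlier, an object file back-end might run a check along these lines. This sketch assumes a format that can only express "symbol + constant"; the nested-tuple expression form and the names are invented for illustration, and actual formats differ in what they can express.

```python
class NotRelocatable(Exception):
    pass


def to_symbol_plus_const(expr):
    """Reduce expr to (symbol_or_None, constant), or raise NotRelocatable.

    expr is an int, a symbol name (str), or a tuple (op, left, right).
    """
    if isinstance(expr, int):
        return None, expr
    if isinstance(expr, str):
        return expr, 0
    op, left, right = expr
    (ls, lc), (rs, rc) = to_symbol_plus_const(left), to_symbol_plus_const(right)
    if op == "+" and not (ls and rs):          # at most one side has a symbol
        return ls or rs, lc + rc
    if op == "-" and rs is None:               # only a constant may be subtracted
        return ls, lc - rc
    if ls is None and rs is None:              # pure constant arithmetic
        if op == "*":
            return None, lc * rc
        if op == "/":
            return None, lc // rc
    raise NotRelocatable(f"cannot express {expr!r} as symbol + constant")


ok = to_symbol_plus_const(("+", "bar", 33))            # fine: bar + 33
try:
    to_symbol_plus_const(("+", ("/", "bar", 5), 33))   # bar/5+33
    rejected = False
except NotRelocatable:
    rejected = True
```

A flat binary back-end never needs this check, because by then "bar" is a plain number and the whole expression can be evaluated.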
@data segment address. In addition the linker emits a relocation entry in the metadata of the EXE file pointing out that this particular immediate constant within the code needs to be patched up by the DOS loader with the actual address of the data segment at run-time. – Declass
MOV AX,01234h looks the same as MOV AX,@data to the CPU: an opcode and an immediate constant. The base segment where DOS loads a program isn't known at compile-time. Instead the assembler pretends the base segment is zero while including a relocation table listing all the places that make absolute segment references. During load, DOS walks the list, adding the base segment to each. Forget about the funky x86 segmentation and imagine you're writing a multitasking OS with a shared linear address space. How do you go about fixing up the addresses in the programs once loaded? – Declass
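The load-time fix-up described in the comment above can be sketched in a few lines. This is a simplification: the relocation list here is a plain list of byte offsets into the loaded image, whereas a real DOS EXE relocation table stores segment:offset pairs, and the function name is made up.

```python
import struct


def apply_relocations(image: bytearray, reloc_offsets, load_segment: int):
    """Add the load segment to every 16-bit segment value listed for patching."""
    for off in reloc_offsets:
        (seg,) = struct.unpack_from("<H", image, off)
        struct.pack_into("<H", image, off, (seg + load_segment) & 0xFFFF)


# Assembled as if the program were loaded at segment 0:
#   B8 02 00   mov ax, 0002h   (data segment, relative to a base segment of zero)
#   8E D8      mov ds, ax
image = bytearray(b"\xB8\x02\x00\x8E\xD8")
apply_relocations(image, [1], 0x1234)     # the immediate starts at byte 1
```

After the patch the immediate is 1236h, i.e. the data segment as it actually sits in memory when the program is loaded at segment 1234h.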