How does assembler compute segment and offset for symbol addresses?

I have learned about compilers and assembly language, so I'd like to write my own assembler as an exercise. But I have some questions:

How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?

Take an easy assembly program as an example:

    .model small
    .stack 1024
    .data
          msg db 128 dup('A')
    .code
    start:
        mov ax,@data
        mov ds,ax
        mov dx, offset msg
                           ; DS:DX points at msg
        mov ah,4ch
        int 21h            ; exit program without using msg
    end

So how does the assembler calculate the segment address for the @data segment?

And how does it know what to put into the immediate for mov dx, offset msg?

Brittle answered 20/4, 2015 at 13:58 Comment(8)
The first variable has offset 0 in the data segment. If that variable is 128 bytes long, the second variable starts at offset 128 (because the first one takes bytes 0 to 127 = 128 bytes). If the second variable starts at offset 128 of the data segment and is declared with DW (2 bytes), it takes bytes 128 and 129, and so on. Variable names are nothing but friendly names for offsets. - Age
So you mean "mov ax,@data" will actually be recognized by the assembler as "mov ax,0"? - Brittle
No. An offset is one thing and a segment is another. An offset is an address inside a segment. @data stands for the data segment's address, which depends on where the operating system loads the program. Most programs have three segments: stack, data and code. Each has a different segment address, but they all have offsets starting at 0 to address their contents. If you load 0 as the data segment, DS will point at the interrupt vector table and your program will probably crash. Segments are assigned by the operating system. - Age
OK, so what does the assembler do when it sees "mov ax,@data"? Does it replace it with some other instruction, or does it calculate @data at assembly time? - Brittle
@user152531: The segment isn't known until run-time, so the assembler/linker uses a dummy constant in place of the unknowable @data segment address. In addition, the linker emits a relocation entry in the metadata of the EXE file, pointing out that this particular immediate constant within the code needs to be patched by the DOS loader with the actual address of the data segment at load time. - Declass
@Declass Can you explain this immediate constant in a bit more detail? - Brittle
@user152531: MOV AX,01234h looks the same as MOV AX,@data to the CPU: an opcode and an immediate constant. The base segment where DOS loads a program isn't known at compile time. Instead, the assembler pretends the base segment is zero and includes a relocation table listing all the places that make absolute segment references. During load, DOS walks the list, adding the base segment to each. Forget about the funky x86 segmentation and imagine you're writing a multitasking OS with a shared linear address space. How would you go about fixing up the addresses in the programs once loaded? - Declass
Suggestion: writing an assembler that understands segmentation is a potentially significant extra complication on top of just writing an assembler at all as an exercise. Segmentation is basically a dead technology, made obsolete by CPUs with registers wide enough to hold a full address for a useful amount of memory (32 or 64 bits). x86 machine code is complex enough on its own. (Although the complexity of segmentation is mostly separate from the machine encoding.) - Recalcitrate

The assembler doesn't know where @data and msg will end up in memory, so it generates metadata called relocations (or "fixups") in the object (.OBJ) file that allow the linker and operating system to fill in the correct values.

Let's take a look at what happens with a slightly different example program:

    .model small
    .stack 1024
    .data
        msg db 'Hello, World!','$'
    .code
    start:
        mov ax,SEG msg
        mov ds,ax
        mov dx,OFFSET msg
        mov ah,09h
        int 21h              ; write string in DS:DX to stdout
        mov ah,4ch
        int 21h              ; exit(AL)
    end start

When assembling this file, the assembler has no way of knowing where the linker will put anything defined by this example program. It may appear obvious to you, but the assembler can't assume it is seeing a complete program. The assembler doesn't know whether you'll link it with other object files or libraries, which could cause the linker to put msg somewhere other than the start of the data segment.

So when this example program gets assembled into an object file, the assembler generates two relocation records. If you use MASM to assemble the file, you can see this in the listing file generated with the /Fl switch:

 ; listing of the .obj assembler output, before linking
 0000               start:
 0000  B8 ---- R            mov ax,SEG msg
 0003  8E D8                mov ds,ax
 0005  BA 0000 R            mov dx,OFFSET msg
 0008  B4 09                mov ah,09h

The R next to the operands in the machine code column of the listing indicates that they have relocations that refer to them. When the linker creates the MS-DOS format executable from the object file, it will be able to supply the correct offset of msg from the start of the data segment. That value is a link-time constant, so only the .obj, not the .exe, needs a relocation for it.
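To make the listing above concrete, here is a minimal Python sketch of what an assembler could emit for those four instructions: the code bytes with zeroed 16-bit slots, plus a list of fixup records saying which slots still need a value. The `Fixup` record and the "SEG"/"OFFSET" tags are hypothetical names for illustration, not the layout of any real object file format.

```python
from dataclasses import dataclass

@dataclass
class Fixup:
    offset: int   # position of the 16-bit slot inside the code section
    kind: str     # "SEG" or "OFFSET" (hypothetical tags for this sketch)
    symbol: str   # symbol the slot refers to

def assemble_example():
    """Encode the first four instructions of the listing, leaving zeroed slots."""
    code = bytearray()
    fixups = []

    code += b"\xB8"                            # mov ax, imm16
    fixups.append(Fixup(len(code), "SEG", "msg"))
    code += (0).to_bytes(2, "little")          # placeholder for SEG msg

    code += b"\x8E\xD8"                        # mov ds, ax

    code += b"\xBA"                            # mov dx, imm16
    fixups.append(Fixup(len(code), "OFFSET", "msg"))
    code += (0).to_bytes(2, "little")          # placeholder for OFFSET msg

    code += b"\xB4\x09"                        # mov ah, 09h
    return bytes(code), fixups

code, fixups = assemble_example()
print(code.hex(" "))
for f in fixups:
    print(f)
```

The two records correspond to the two lines marked R in the listing; everything else is ordinary instruction encoding.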

However the linker won't be able to supply the location of the segment of msg (the data segment) because the linker doesn't know where MS-DOS will load the executable into memory. (Unlike under a modern mainstream OS where every process has its own virtual address space, real mode has only one address space that programs have to share with device drivers and TSRs, and the OS itself.)

So the linker will put a relocation in the generated executable that tells MS-DOS to adjust the immediate operand based on where it gets loaded.
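The load-time adjustment can be sketched in a few lines of Python. This is a simplified model that assumes each relocation entry is a (segment, offset) pair relative to the start of the loaded image and that the loader adds the load segment to the 16-bit word found there; the function name and argument layout are made up for the example.

```python
def apply_segment_relocations(image, relocations, load_segment):
    """Return a copy of image with every relocated word adjusted.

    relocations: iterable of (segment, offset) pairs, relative to the image start.
    load_segment: segment at which the image was placed in memory.
    """
    patched = bytearray(image)
    for rel_segment, rel_offset in relocations:
        pos = rel_segment * 16 + rel_offset            # linear position in the image
        value = int.from_bytes(patched[pos:pos + 2], "little")
        value = (value + load_segment) & 0xFFFF        # segments wrap at 16 bits
        patched[pos:pos + 2] = value.to_bytes(2, "little")
    return bytes(patched)

# "mov ax, 0001h / mov ds, ax", where the linker placed the data segment
# one paragraph after the start of the image (an assumed layout).
image = bytes.fromhex("b801008ed8")
print(apply_segment_relocations(image, [(0, 1)], 0x1234).hex(" "))
```

With a load segment of 1234h the immediate becomes 1235h, which is what the program then loads into DS.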


Note that you might want to simplify your assembler-writing exercise by writing one that only works with complete programs and generates only .COM executables. That way you don't have to worry about relocations. Your assembler will decide where everything gets placed within the single segment allowed by the .COM format. Because .COM files don't support segment relocations, instructions like mov ax,@data or mov ax,SEG msg can't be used. Instead, CS=DS=ES=SS on program startup, with a value chosen by the OS's program loader. (And that value isn't known at assemble time.)
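As a rough illustration of the .COM approach, the Python sketch below hand-assembles a tiny program and computes OFFSET msg itself. It assumes the usual .COM convention that the image is loaded at offset 100h of its segment; the function name and the fixed instruction sequence are choices made for this example only.

```python
COM_ORIGIN = 0x100  # a .COM image starts at offset 100h within its segment

def assemble_com(msg: bytes) -> bytes:
    """Emit: mov dx,OFFSET msg / mov ah,09h / int 21h / mov ah,4Ch / int 21h / msg."""
    code_size = 3 + 2 + 2 + 2 + 2          # sizes of the five instructions
    msg_addr = COM_ORIGIN + code_size      # msg is placed right after the code

    out = bytearray()
    out += b"\xBA" + msg_addr.to_bytes(2, "little")   # mov dx, imm16
    out += b"\xB4\x09"                                 # mov ah, 09h
    out += b"\xCD\x21"                                 # int 21h
    out += b"\xB4\x4C"                                 # mov ah, 4Ch
    out += b"\xCD\x21"                                 # int 21h
    assert len(out) == code_size
    out += msg
    return bytes(out)

print(assemble_com(b"Hello, World!$").hex(" "))
```

No relocation records are needed here: the only address in the program is an offset, and the assembler knows every offset because it laid out the whole segment itself.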

Womack answered 20/4, 2015 at 21:39 Comment(0)

How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?

There are 2 cases:

a) the assembler is generating a flat binary or executable file itself, and no linker is involved

b) the assembler is generating an object file to be sent to a linker later

Note that you can have a mixture. For example, in some assemblers (e.g. NASM) there are keywords to create a temporary section (e.g. absolute), and structures are supported by internally using a temporary section (a field in a structure is an offset into a temporary section that begins at address zero).

In both cases, the assembler converts the source code into some kind of internal representation (e.g. maybe an "instruction data, operand 1 data, operand 2 data, ..." thing), where the internal representation for instructions like "jmp foo" and "mov eax,bar/5+33" can't be simplified much and needs to include a reference to a symbol in the symbol table.

For the symbol table itself, each entry has a symbol name (e.g. "foo"), which section it is in, the lowest possible offset within the section and the highest possible offset within the section. When the lowest possible offset and highest possible offset match, and the section has a known address, the assembler can replace references to that symbol in the internal representation with an actual value.
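A symbol table entry along those lines might look like the following Python sketch. The class and field names are invented for illustration; the only rule it encodes is the one stated above, that a symbol has a usable value once its two bounds agree and its section has an address.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Section:
    name: str
    address: Optional[int] = None   # None while the section's address is unknown

@dataclass
class Symbol:
    name: str
    section: Section
    min_offset: int                 # lowest possible offset within the section
    max_offset: int                 # highest possible offset within the section

    def resolved_value(self) -> Optional[int]:
        """Return the final value, or None if it cannot be known yet."""
        if self.min_offset != self.max_offset:
            return None             # offset still depends on undecided sizes
        if self.section.address is None:
            return None             # needs a fixup for the linker instead
        return self.section.address + self.min_offset

data = Section("_DATA")
msg = Symbol("msg", data, 0, 0)
print(msg.resolved_value())         # section address unknown
data.address = 0x100
print(msg.resolved_value())
```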

Note that there are cases where you can't know how large an instruction will be until later (e.g. for 80x86, "jmp foo" could be a 2-byte instruction if the target address is close, but may need to be a 3-byte or 5-byte instruction if the target address isn't close, and you can't decide until you know something about the value that "foo" will have); and when you can't know how large an instruction will be, you can't know the offset of any symbols that occur later in the same section. This is why you end up wanting symbols to have both a lowest possible offset and a highest possible offset: even when you don't know the actual offset of a symbol, you can still know that the offset will be small enough or too large, and can still determine how big an instruction will be (and get a better idea of the values of later symbols in that section).

More specifically; while assembling you want to do multiple passes, where each pass tries to convert the intermediate representations of each instruction into more specific/complete versions and tries to improve the lowest possible offset and highest possible offset values for symbols (so that you have more/better information that the next pass can use).
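Here is a small Python sketch of the multi-pass idea. Note that it uses a simpler variant than the lowest/highest bounds described above: every jump starts as the 2-byte short form and is grown to a 3-byte near form when its displacement does not fit in a signed byte, repeating until nothing changes. The item format and the function name are assumptions made for this example.

```python
def relax(items):
    """items: list of ("label", name), ("bytes", n) or ("jmp", target_label).

    Returns (sizes, labels) once no jump needs to grow any further.
    """
    sizes = []
    for item in items:
        if item[0] == "jmp":
            sizes.append(2)              # optimistic: short jump, rel8
        elif item[0] == "bytes":
            sizes.append(item[1])
        else:
            sizes.append(0)              # labels take no space

    while True:
        # Pass part 1: lay everything out with the current size guesses.
        offsets, labels, pos = [], {}, 0
        for item, size in zip(items, sizes):
            offsets.append(pos)
            if item[0] == "label":
                labels[item[1]] = pos
            pos += size

        # Pass part 2: grow any short jump whose target is out of range.
        changed = False
        for i, item in enumerate(items):
            if item[0] == "jmp" and sizes[i] == 2:
                disp = labels[item[1]] - (offsets[i] + 2)
                if not -128 <= disp <= 127:
                    sizes[i] = 3         # near jump, rel16
                    changed = True
        if not changed:
            return sizes, labels

program = [("jmp", "far"), ("bytes", 200), ("label", "far"), ("jmp", "far")]
print(relax(program))
```

Because jumps only ever grow, the loop has to terminate; the price is that it can occasionally settle on a longer encoding than strictly necessary.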

When you have finished the multiple passes and the assembler is generating a flat binary with no linker involved, everything will be known (the addresses of sections and the offsets of all symbols within them) and all instructions will have been converted into actual bytes, so you can generate the final file.

When you have finished doing the "multiple passes" and the assembler is generating an object file; some things will not be known (the address of sections) and some things will be known (the offset of all symbols within sections, the size of all instructions); and the object file format will provide a way for you to provide details of things you don't/can't know (e.g. a list of things that need fixing, and information the linker can use to fix them) that you can provide from what's left of the intermediate representation of instructions and the symbol table.
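To show what "information the linker can use to fix them" might amount to, the Python sketch below plays the linker's part for 16-bit offset fixups. The record shape (slot position, symbol name, addend) and the dictionaries are hypothetical; real object formats encode the same idea in their own ways.

```python
def link_offsets(code, fixups, symbols, section_bases):
    """Patch 16-bit offset slots once the linker has placed every section.

    fixups:        list of (slot, symbol, addend) left behind by the assembler
    symbols:       symbol name -> (section name, offset within that section)
    section_bases: section name -> offset of that section's contribution
                   within the final segment, as decided by the linker
    """
    out = bytearray(code)
    for slot, symbol, addend in fixups:
        section, offset = symbols[symbol]
        value = (section_bases[section] + offset + addend) & 0xFFFF
        out[slot:slot + 2] = value.to_bytes(2, "little")
    return bytes(out)

code = bytes.fromhex("ba0000")                  # mov dx, OFFSET msg (slot zeroed)
fixups = [(1, "msg", 0)]
symbols = {"msg": ("_DATA", 0)}

# Another object file's data was placed first, so this module's data
# starts 20h bytes into the segment (an assumed layout).
print(link_offsets(code, fixups, symbols, {"_DATA": 0x20}).hex(" "))
```

The assembler knew msg was at offset 0 of its own data, but only the linker knows that data ended up 20h bytes into the combined segment.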

Note that there can be cases that are too complex for an object file format to support (e.g. probably the "mov eax,bar/5+33" from earlier), where an instruction that can be assembled without any problem (if the assembler is generating a flat binary) has to be treated as an error (if the assembler is generating an object file). You will discover these cases (and generate appropriate error messages) when trying to create the object file.
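One way to detect such cases is to try to reduce each operand expression to the form "symbol + constant", on the assumption that this is all the target object format can describe. The Python sketch below does that for a tiny tuple-based expression tree; the tree encoding and the exception name are invented for this example.

```python
class NotRelocatable(Exception):
    """The expression cannot be written as symbol + constant."""

def reduce_expr(expr):
    """Reduce an expression tree to (symbol or None, constant addend).

    expr is ("num", n), ("sym", name), or (op, left, right)
    with op in {"add", "sub", "div"}.
    """
    kind = expr[0]
    if kind == "num":
        return (None, expr[1])
    if kind == "sym":
        return (expr[1], 0)

    left_sym, left_val = reduce_expr(expr[1])
    right_sym, right_val = reduce_expr(expr[2])

    if kind == "add":
        if left_sym is not None and right_sym is not None:
            raise NotRelocatable("sum of two symbols")
        sym = left_sym if left_sym is not None else right_sym
        return (sym, left_val + right_val)
    if kind == "sub":
        if right_sym is not None:
            raise NotRelocatable("subtracting a symbol")
        return (left_sym, left_val - right_val)
    if kind == "div":
        if left_sym is not None or right_sym is not None:
            raise NotRelocatable("dividing a symbol's address")
        return (None, left_val // right_val)
    raise ValueError(f"unknown operator {kind!r}")

print(reduce_expr(("add", ("sym", "bar"), ("num", 33))))
try:
    reduce_expr(("add", ("div", ("sym", "bar"), ("num", 5)), ("num", 33)))
except NotRelocatable as err:
    print("error:", err)
```

When generating a flat binary the same tree can simply be evaluated with the symbol's final value, which is why bar/5+33 is only a problem on the object file path.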

Note that this all fits into a nice "3 phases" arrangement, where the "front-end" converts the "plain text" input into the intermediate representation, the "middle-end" (the multiple passes) refines the intermediate representation as much as possible, and the "back-end" generates a file. Only the back-end needs to care what the target file format is.

Leucocytosis answered 19/10, 2019 at 4:57 Comment(2)
The compact-encoding phases need to know if symbol offsets are assemble-time constants (and thus can maybe use [reg+disp8] addressing modes, or add reg, imm8 immediates) or whether it needs to leave a 16 or 32-bit slot with a relocation. But I guess that happens fairly naturally, and you're excluding that from the middle end needing to know about the file format. - Recalcitrate
"a field in a structure is an offset into a temporary section that begins at address zero" -- In NASM I actually added support for structures with non-zero starting offset: sourceforge.net/p/nasm/feature-requests/160 - Edmund
