In some respects this is a massively broad question that may not survive for that reason. The information is all out there on the internet, keep looking, it is not complicated, and not worthy of a paper or video.
So you have a rough idea that a compiler takes a program written in one language and converts it to another language be that assembly language or machine code or whatever.
Then there are file formats and there are many different ones that we all use the term "binary" for but again, different formats. Ideally they contain, using some form of encoding, the machine code and data or information about the data.
Going to use ARM for now, fixed length instructions easy to disassemble and read, etc.
#define ONE 1
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+ONE);
}
and gnu gcc/binutils because it is very well know, widely used, you can use it to make programs on your wintel machine. I run Linux so you will see elf not exe, but that is just a file format for what you are asking.
arm-none-eabi-gcc -O2 -c so.c -save-temps -o so.o
This toolchain (chain of tools that are linked for example compiler -> assembler -> linker) is Unix style and modular. You are going to have an assembler for a target so not sure why you would want to re-invent that, and it is so much easier to debug a compiler by looking at the assembly output than trying to go straight to machine code. But there are folks that like to climb the mountain just because it is there rather than go around and some tools go straight for machine code just because its there.
This specific compiler has this save temps feature, gcc itself is a front end program that preps for the real compiler then if asked for (if you don't say not to) will call the assembler and linker.
cat so.i
# 1 "so.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "so.c"
unsigned int x;
unsigned int y = 5;
const unsigned int z = 7;
unsigned int fun ( unsigned int a )
{
return(a+1);
}
So at this point defines and includes are taken care of and its one big file to be sent to the compiler.
The compiler does its thing and turns it onto assembly language
cat so.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global fun
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type fun, %function
fun:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 0, uses_anonymous_args = 0
@ link register save eliminated.
add r0, r0, #1
bx lr
.size fun, .-fun
.global z
.global y
.comm x,4,4
.section .rodata
.align 2
.type z, %object
.size z, 4
z:
.word 7
.data
.align 2
.type y, %object
.size y, 4
y:
.word 5
.ident "GCC: (GNU) 9.3.0"
which then gets put into an object file, in this case, binutils, linux default, etc
file so.o
so.o: ELF 32-bit LSB relocatable, ARM, EABI5 version 1 (SYSV), not stripped
It is using an elf file format which is easy to find info on, easy to write programs to parse, etc.
I can disassemble this, note that because I am using the disassembler it tries to disassemble everything even if it isn't machine code, sticking to 32 bit arm stuff It can grind through that and when there are real instructions they are shown (aligned and not variable length as used here, so you can disassemble linearly which you cannot with a variable length instruction set and have a hope of success (like x86) you need to disassemble in execution order and then you often miss some due to the nature of the program)
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <fun>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
Disassembly of section .data:
00000000 <y>:
0: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00000000 <z>:
0: 00000007 andeq r0, r0, r7
Disassembly of section .comment:
00000000 <.comment>:
0: 43434700 movtmi r4, #14080 ; 0x3700
4: 4728203a ; <UNDEFINED> instruction: 0x4728203a
8: 2029554e eorcs r5, r9, lr, asr #10
c: 2e332e39 mrccs 14, 1, r2, cr3, cr9, {1}
10: Address 0x0000000000000010 is out of bounds.
Disassembly of section .ARM.attributes:
00000000 <.ARM.attributes>:
0: 00002941 andeq r2, r0, r1, asr #18
4: 61656100 cmnvs r5, r0, lsl #2
8: 01006962 tsteq r0, r2, ror #18
c: 0000001f andeq r0, r0, pc, lsl r0
10: 00543405 subseq r3, r4, r5, lsl #8
14: 01080206 tsteq r8, r6, lsl #4
18: 04120109 ldreq r0, [r2], #-265 ; 0xfffffef7
1c: 01150114 tsteq r5, r4, lsl r1
20: 01180317 tsteq r8, r7, lsl r3
24: 011a0119 tsteq r10, r9, lsl r1
28: Address 0x0000000000000028 is out of bounds.
and yes the tool put extra stuff in there, but note primarily that I created. some code, some initialized read/write data, some initialized read/write data and some initialized read only data. The toolchain authors can use whatever names they want, they don't even have to use the term section. But from decades of history and communication and terminology .text is generally used for code (as in read only machine code AND related data), .bss for zeroed read/write data although I have seen other names, .data for initialized read/write data and this generation of this tool .rodata for read only initialized data (technically that could land in .text)
And note that they all have an address of zero. they are not linked yet.
Now this is ugly but to avoid adding any more code and if the tool lets me do it, let's link it to make a completely unusable binary (no bootstrap, etc, etc):
arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000001000
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <fun>:
1000: e2800001 add r0, r0, #1
1004: e12fff1e bx lr
Disassembly of section .data:
00002000 <y>:
2000: 00000005 andeq r0, r0, r5
Disassembly of section .rodata:
00001008 <z>:
1008: 00000007 andeq r0, r0, r7
Disassembly of section .bss:
00002004 <x>:
2004: 00000000 andeq r0, r0, r0
And now it is linked. The read only items .text and .rodata landed in the .text address space in the order found in the file. The read/write items landed in the .data address space in the order found in the file.
Yes, where was .bss in the object? It is in there, it has no actual data as in bytes that are part of the object, instead it has a name and size and that it is .bss. And for whatever reason the tool does show it from the linked binary.
So back on the term binary. The so.elf binary has the bytes that go in memory that make up the program, but also file format infrastructure plus a symbol table to make the disassembly and debugging easier plus other stuff. Elf is a flexible file format gnu can use it and you get one result some other tool or version of a tool can use it and have a different file. And obviously two compilers can generate different machine code from the same source program not just due to optimizations, the job is to make a functional program in the target language and functional is the opinion of the compiler/tool author.
What about a memory image type file:
arm-none-eabi-objcopy so.elf so.bin -O binary
hexdump -C so.bin
00000000 01 00 80 e2 1e ff 2f e1 07 00 00 00 00 00 00 00 |....../.........|
00000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001000 05 00 00 00 |....|
00001004
Now how the objcopy tool works is that it starts with the first defined loadable or whatever term you want to use byte and ends with the last one and uses (zero) padding to make the file size match so that the memory image matches from an address perspective. The asterisk means essentially 0 padding. Because we started at 0x1000 with .text and 0x2000 for .data but the first byte of this file (offset 0) is the beginning of .text and 0x1000 byte later which is offset 0x1000 in the file but we know it goes to 0x2000 in memory is the read/write stuff. Also note that the bss zeros are not in the output. The bootstrap is expected to zero those.
There is no information like where in memory this data from this file goes, etc. And if you think a bit about it what if I have one byte at a section I define goes to 0x00000000 and one byte at a section I define goes to 0x80000000 and output this file, yes that is a 0x80000001 byte file even though there are only two useful bytes of relevant information. A 2GB file to hold two bytes. This is why you don't want to output this file format until you have sorted out your linker script and tools.
Same data and two other equally old school formats with a little history of intel vs motorola
arm-none-eabi-objcopy so.elf so.hex -O ihex
cat so.hex
:08100000010080E21EFF2FE158
:0410080007000000DD
:0420000005000000D7
:0400000300001000E9
:00000001FF
arm-none-eabi-objcopy so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S10B1000010080E21EFF2FE154
S107100807000000D9
S107200005000000D3
S9031000EC
now these contain the relevant bytes, plus addresses, but not much other information, takes more than two bytes for every byte of data, but compared to a huge file with padding, a worthy trade-off. Both of these formats can be found in use today, not as much as the old days but still there.
And countless other binary file formats and a tool like objdump has a decent list of formats it can generate as well as other linkers and/or tools out there.
What is relevant about all of this is that there is a binary file format of some form that contains the bytes we need to run the program.
What format and what addresses you might ask...That is part of the operating system or the system design. In the case of Windows there are specific file formats and variations perhaps of those formats that are supported by the windows operating system, the specific version you are using. Windows has determined what the address space looks like. Operating systems like this take advantage of the MMU both for virtualizing addresses and protection. Having a virtual address space means every program can live in the same space. All programs can have an address that is zero based for example....
test.c
int main ( void )
{
return 1;
}
hello.c
int main ( void )
{
return 2;
}
gcc test.c -o test
objdump -D test
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
4003e5: 5e pop %rsi
...
gcc hello.c -o hello
objdump -D hello
Disassembly of section .text:
00000000004003e0 <_start>:
4003e0: 31 ed xor %ebp,%ebp
4003e2: 49 89 d1 mov %rdx,%r9
same address, how is that possible won't they sit on top of each other? no virtual machine. And note this is built for a specific Linux on a specific day, etc. The toolchain has a default linker script (notice I didn't specify how to link) for this platform when the compiler was built for this target/platform.
arm-none-eabi-gcc -O2 test.c -c -o test.o
arm-none-eabi-ld test.o -o test.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000008000
arm-none-eabi-objdump -D test.elf
test.elf: file format elf32-littlearm
Disassembly of section .text:
00008000 <main>:
8000: e3a00001 mov r0, #1
8004: e12fff1e bx lr
same source code, same compiler, built for a different target and system different address.
So for Windows there are definitely going to be rules for the supported binary formats and rules for address spaces that can be used, how to define those spaces in the file.
Then it is a simple matter of the operating systems launcher to read the binary file and put the loadable items into memory at those addresses (in the virtual space that the os has created for this specific program) It is very possible that a feature of the loader is to zero bss for you since the information is there. The low level programmer needs to know that to possibly deal with zeroing .bss or not.
If not you will see and may need to create a solution, unfortunately this is where you get deeper into tool specific items. While C may be somewhat standardized there are tool specific things that are not or at least are standardized by the tool/authors but no reason to assume those cross over to other tools.
.globl _start
_start:
ldr sp,sp_init
bl fun
b .
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
Everything about assembly language is tool specific, the mnemonics for sanity reasons no doubt will resemble the ip/processor vendors documentation which uses syntax that the tool they paid to have developed uses. But beyond that assembly language is wholly defined by the tool not the target, x86 because of its age and other things is really bad about that and this is not the Intel vs AT&T thing, just in general. Gnu assembler is well known for I would assume perhaps intentionally not making compatible languages with other assembly languages. The above is gnu assembler for arm.
Using the fun() function above, C says it should be main() but the tool doesn't care I am already typing enough here.
add a simple ram based linker script
MEMORY
{
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > ram
.rodata : { *(.rodata*) } > ram
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
build it all
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -O2 -c so.c -o so.o
arm-none-eabi-ld -T sram.ld start.o so.o -o so.elf
examine
arm-none-eabi-nm so.elf
0000102c B __bss_end__
00001028 B __bss_start__
00001018 T fun
00001014 t sp_init
00001000 T _start
00001028 B x
00001024 D y
00001020 R z
arm-none-eabi-objdump -D so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00001000 <_start>:
1000: e59fd00c ldr sp, [pc, #12] ; 1014 <sp_init>
1004: eb000003 bl 1018 <fun>
1008: eafffffe b 1008 <_start+0x8>
100c: 00001028 andeq r1, r0, r8, lsr #32
1010: 0000102c andeq r1, r0, r12, lsr #32
00001014 <sp_init>:
1014: 00008000 andeq r8, r0, r0
00001018 <fun>:
1018: e2800001 add r0, r0, #1
101c: e12fff1e bx lr
Disassembly of section .rodata:
00001020 <z>:
1020: 00000007 andeq r0, r0, r7
Disassembly of section .data:
00001024 <y>:
1024: 00000005 andeq r0, r0, r5
Disassembly of section .bss:
00001028 <x>:
1028: 00000000 andeq r0, r0, r0
So now it is possible to add to the bootstrap a memory zeroing loop (do not use C/memset you don't create chicken and egg problems you write the bootstrap in asm) based on the start and end addresses.
Fortunately or unfortunately because the linker script is tool specific and assembly language is tool specific and they need to work together if you are letting the tools do the work for you (the sane way to do it, have fun figuring out where .bss is otherwise).
This can be done on an operating system but when you get into say microcontrollers where it all has to be on non-volatile storage (flash) well it is possible to have one that is downloaded from elsewhere (like your mouse firmware sometimes, sometimes keyboard, etc) into ram, assume flash, so how do you deal with .data??
MEMORY
{
rom : ORIGIN = 0x0000, LENGTH = 0x1000
ram : ORIGIN = 0x1000, LENGTH = 0x1000
}
SECTIONS
{
.text : { *(.text*) } > rom
.rodata : { *(.rodata*) } > rom
.data : {
*(.data*)
} > ram AT > rom
.bss : {
__bss_start__ = .;
*(.bss*)
} > ram
__bss_end__ = .;
}
With gnu ld this basically says that .data's home is in ram, but the output binary formats will put it in flash/rom
so.elf so.srec -O srec
cat so.srec
S00A0000736F2E7372656338
S11300000CD09FE5030000EBFEFFFFEA04100000A4
S11300100810000000800000010080E21EFF2FE1B4
S107002007000000D1 <- z variable at address 0020
S107002405000000CF <- y variable at 0024
S9030000FC
and you have to play with the linker script more to get the tool to tell you both the ram and flash starting addresses and ending addresses or length. then add code in the bootstrap (asm not C) to copy .data from flash to ram.
Also note here per another one of your many questions.
.word __bss_start__
.word __bss_end__
sp_init:
.word 0x8000
These items are technically data. but they live in .text first and foremost because they were defined in the code that was assumed to be .text (I didn't need to state that in the asm, but could have). you will see this in x86 as well, but for fixed length like arm, mips, risc-v, etc where you cant put any old immediate/constant/linked value you want in the instruction itself you put it nearby in a "pool" and do a pc relative read to get it. You will see this for linking externals too:
extern unsigned int x;
int main ( void )
{
return x;
}
arm-none-eabi-gcc -O2 -c test.c -o test.o
arm-none-eabi-objdump -D test.o
test.o: file format elf32-littlearm
Disassembly of section .text.startup:
00000000 <main>:
0: e59f3004 ldr r3, [pc, #4] ; c <main+0xc>
4: e5930000 ldr r0, [r3]
8: e12fff1e bx lr
c: 00000000 andeq r0, r0, r0 <--- the code gets the address of the
variable from here and then reads it from memory
once linked
Disassembly of section .text:
00008000 <main>:
8000: e59f3004 ldr r3, [pc, #4] ; 800c <main+0xc>
8004: e5930000 ldr r0, [r3]
8008: e12fff1e bx lr
800c: 00018010 andeq r8, r1, r0, lsl r0
Disassembly of section .data:
00018010 <x>:
18010: 00000005 andeq r0, r0, r5
for x86
gcc -c -O2 test.c -o test.o
dwelch-desktop so # objdump -D test.o
test.o: file format elf64-x86-64
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 8b 05 00 00 00 00 mov 0x0(%rip),%eax # 6 <main+0x6>
6: c3 retq
00000000004003e0 <main>:
4003e0: 8b 05 4a 0c 20 00 mov 0x200c4a(%rip),%eax # 601030 <x>
4003e6: c3 retq
If you squint is it really different? there is data nearby that the processor reads to load into a register and or use. either way, due to the nature of the instruction sets the linker modifies the instruction or nearby pool data or both.
last one:
arm-none-eabi-gcc -S test.c
cat test.s
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 6
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "test.c"
.text
.align 2
.global main
.arch armv4t
.syntax unified
.arm
.fpu softvfp
.type main, %function
main:
@ Function supports interworking.
@ args = 0, pretend = 0, frame = 0
@ frame_needed = 1, uses_anonymous_args = 0
@ link register save eliminated.
str fp, [sp, #-4]!
add fp, sp, #0
ldr r3, .L3
ldr r3, [r3]
mov r0, r3
add sp, fp, #0
@ sp needed
ldr fp, [sp], #4
bx lr
.L4:
.align 2
.L3:
.word x
.size main, .-main
.ident "GCC: (GNU) 9.3.0"
So can you see the assembly language, yes some tools will let you save the intermediate files and/or let you generate the assembly output of the file when compiling.
Can you have data in the code, yes there are times and reasons to have data values in the .text area not just target specific you will see this for various reasons and some toolchains put read only data there.
There are many file formats the ones used by modern operating systems have features not just for defining the bytes that make up the machine code and data values but also will include symbols and other debug information.
The file format and memory space for a program is operating system specific not language nor even target specific (Linux, Windows, MacOS on the same laptop are not expected to have the same rules despite the exact same target computer). A native toolchain for that platform has a default linker script and whatever other information required to build usable/working programs for that target. Including the supported file format.
The machine code and data items can be represented in different file formats in different ways, whether or not the operating system or loader of the target system can use that format depends on that target system.
Programs have bugs and nuances. File formats have versions and inconsistencies, you might find some elf file format reader only to find it doesn't work or prints out strange stuff when fed a perfectly good elf file that works on some system. Why are some flags being set? Perhaps those bytes got re-used or the flag to repurposed or the data structure changed or a tool is using it differently or in a non-standard way (think mov 20h,ax) and another tool that is not compatible can't understand or gets lucky and gets close enough.
Asking "why" questions at Stack Overflow is not very useful, the odds of finding the individual that wrote the thing are very low, better odds of asking the place you got the tool from and following that hoping the person is still alive and willing to be bothered. And 99.999(lots of 9s)% there is no global set of godly rules that the thing was written under/for. General it was some dude just felt like it that is why they did what they did, no real reason, laziness, a bug, intentionally trying to break someone else's tool. All the way up to a large committee of people with an opinion voted on it on a particular day in a particular room and that's why (and we know what we get when we design by committee or try to write specs that nobody conforms to).
I know you are on Windows and I don't have a Windows machine handy and am on Linux. But the gnu/binutils and clang/llvm tools are readily available and have a rich set of tools like readelf, nm, objdump, etc. That assist in examining things, a good tool is going to have that at least internally for the developers so they can debug the output of the tool to a certain quality level. gnu folks made tools and made them available for everyone, and while it takes time to sort through them and their features they are very powerful for the things you are trying to understand.
You are NOT going to find a good x86 disassembler, they are all crap simply because of the nature of the beast. It is a variable length instruction set, so by definition unless you are executing you cant sort it out correctly. You must disassemble in execution order from a known good entry point to have half a chance, and then for various reasons there are code paths you cannot see that way (think jump tables for example, or dlls or so files). The BEST solution is to have a very accurate/perfect emulator/simulator and run the code and perform all the actions/gyrations you need to do to get it to cover all the code paths, and have that tool record instructions from data and where each is located or each linear section without a branch.
The good side of this is that a lot of code is compiled today using tools that are not trying to hide anything. In the old days for various reasons you would see hand written asm that intentionally tried to prevent disassembly or due to other factors (hand editing a binary rom image for a video game the day before the trade show, go disassemble some of the classic roms).
mov r0,#0
cmp r0,#0
jz somewhere
.word 0x12345678
A disassembler isn't going to figure this out, some might add a case for that then
mov r0,#0
nop
nop
xor r0,#1
nop
nop
xor r0,#3
xor r0,#2
cmp r0,#0
jz somewhere
.word 0x12345678
and it thinks that data is an instruction, for variable length that is super hard for a disassembler to resolve a decent one will at least detect collisions where the non opcode part of the instruction is branched to and/or an opcode part of an instruction shows up later as additional bytes in some other instruction. The tool cant resolve it a human has to.
Even with arm and mips and having 32 and 16 bit instructions, risc-v with variable sized instructions, etc...
Very often gnu's disassembler will get tripped up with x86.