How are char arrays / strings stored in binary files?
S

3

42

When I compile this code using different compilers and inspect the output in a hex editor I am expecting to find the string "Nancy" somewhere.

#include <stdio.h>

int main()
{
    char temp[6] = "Nancy";
    printf("%s", temp);

    return 0;
}
  1. The output file for gcc -o main main.c looks like this:

    [hex editor screenshot of the gcc output; image not preserved]

  2. In the output for g++ -o main main.c, I can't seem to find "Nancy" anywhere.

  3. Compiling the same code in Visual Studio (MSVC 1929), I see the full string in a hex editor.

Why do I get some random bytes in the middle of the string in (1)?

Soddy answered 19/4, 2022 at 22:47 Comment(11)
It will be illuminating to look at the assembly codeGrivet
I did #2 on Debian and the result was similar to your #1.Enclitic
It is read from two separate variables: godbolt.org/z/4bKGrcf1G The compiler is free to choose its strategy; there is no guarantee that you can find the string in the compiled binary output as a single contiguous sequence.Mantis
Note that you will always get "Nancy" verbatim in object file if the char temp[6] is outside a function, so that it gets allocated statically instead of stack allocation. Similarly if you make it static char temp[6] inside a function, though that could be subject to compiler optimizations.Graveyard
I am expecting to find the string "Nancy" somewhere. That seemingly makes sense. Except it doesn't because programming languages are defined in an "as-if" fashion. The program should work in a certain way, but if you don't push it to act in a particular way, the compiler is free to do other things to optimize it. Here, you never directly accessed the contents of the string. printf is an intrinsic and the compiler optimizes it in a special fashion, it's not literally a call of a function called printf, even though such a function exists from the programmer's perspective.Meristic
@jpa: Similarly, const char *temp = "Nancy"; would also (typically, aggressive optimizations can do funny things) store the verbatim "Nancy" (including NUL delimiter) in the constant data section. I was about to add that it might do so for const char temp[6] = "Nancy"; too, even without static, but on considering it, I'm not sure that's allowed (I think it would be under as-if rules, but I'm not 100%).Fiedling
It's also possible that the output file has been compressed in some way.Unguarded
@LeeDanielCrocker: That's possible in theory, but mainstream C implementations for mainstream OSes make executables that just memory-map their text and data sections into memory. (read-only shared mapping, and read/write private mapping respectively). It might even be a good idea for freestanding C implementations for embedded systems where .data has to get copied from flash ROM by code included in the "executable" image, but I don't think it happens there either. (So you have to do it manually in cases where it'd help)Starveling
Which binary format is created? ELF with gcc?Forepart
The question should say "in executable files" or "in object files" since "binary files" includes lots more types of files. Did you consider using a disassembler to see the assembly code that's involved?Foeticide
Another episode of "Programmers discover programming language constructs are abstractions." :DLeak
P
37

There is no single rule about how a compiler stores data in the output files it produces.

Data can be stored in a “constant” section.

Data can be built into the “immediate” operands of instructions, in which data is encoded in various fields of the bits that encode an instruction.

Data can be computed from other data by instructions generated by the compiler.

I suspect the case where you see “Nanc” in one place and “y” in another is the compiler using a load instruction (may be written with “mov”) that loads the bytes forming “Nanc” as an immediate operand and another load instruction that loads the bytes forming “y” with a trailing null character, along with other instructions to store the loaded data on the stack and pass its address to printf.

You have not provided enough information to diagnose the g++ case: You did not name the compiler or its version number or provide any part of the generated output.

Pixilated answered 19/4, 2022 at 22:56 Comment(8)
Yes this is confirmed via godbolt godbolt.org/z/Ph9cnrKEh. Although I am not sure how healthy what MSVC is doing since I assume there "Nancy" is in some read only constant section making it potentially harmful to modify, but I might be mistakenFence
@Fence It's presumably being copied from there into the local array, it's not the array itself.Enclitic
@Fence The compiler could also detect that temp is never modified, so it can be treated as a constant.Enclitic
@Enclitic Yeah nevermind. Missed those few lines in the assembly 25-29Fence
@Fence That's the difference between char temp[] = "Nancy" (writable local array initialized with a copy of that read-only string literal) and char *temp = "Nancy" (pointer set to point at the read-only string literal, no copy made). temp[0] = 'D' is legal in the former case but not in the latter.Crary
@Eric: It would have been better for the question to name versions for GCC and G++, but they did name the compiler: it's g++. On Godbolt, all versions of g++ except for the oldest (4.1) materialize the local non-const array with immediates, when compiling for x86-64 at any optimization level. (And the question did show a complete command, so we know optimization level was the default -O0.) godbolt.org/z/s5TPfxn8n shows C vs. C++ mode (-xc vs. -xc++; same code-gen, unsurprisingly.) Seems always a dword and word store, with the dword holding the first 4 bytes.Starveling
With optimization enabled, recent GCC will mov-immediate to a register like mov eax, 'y' to set up for a word store, to work around LCP stalls on Intel before Sandybridge-family. godbolt.org/z/szqf7Wozs (SnB fixed LCP stalls for mov specifically, not for other opcodes, but GCC doesn't know that so it still does it with -march=skylake :/). But GCC / G++ isn't doing that at -O0 so IDK why the querent couldn't find Nanc in the g++ machine code, especially when they did for gcc. Presumably user-error, like searching for the full "Nancy".Starveling
Anyway, How to remove "noise" from GCC/clang assembly output? shows how to look at compiler asm output, much easier to see what's going on than hexdumps.Starveling
I
18

I reproduced it, using gcc 9.3.0 (Linux Mint 20.2), on x86-64 system (Intel

Result of hexdump -C:

[hexdump screenshot; image not preserved]

Note the byte sequence is the same.

So I use gcc -S -c:

    .file   "teststr.c"
    .text
    .section    .rodata
.LC0:
    .string "%s"
    .text
    .globl  main
    .type   main, @function
main:
.LFB0:
    .cfi_startproc
    endbr64
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movq    %fs:40, %rax
    movq    %rax, -8(%rbp)
    xorl    %eax, %eax
    movl    $1668178254, -14(%rbp) # NOTE THIS PART HERE
    movw    $121, -10(%rbp)        # AND HERE
    leaq    -14(%rbp), %rax
    movq    %rax, %rsi
    leaq    .LC0(%rip), %rdi
    movl    $0, %eax
    call    printf@PLT
    movl    $0, %eax
    movq    -8(%rbp), %rdx
    xorq    %fs:40, %rdx
    je  .L3
    call    __stack_chk_fail@PLT
.L3:
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0"
    .section    .note.GNU-stack,"",@progbits
    .section    .note.gnu.property,"a"
    .align 8
    .long    1f - 0f
    .long    4f - 1f
    .long    5
0:
    .string  "GNU"
1:
    .align 8
    .long    0xc0000002
    .long    3f - 2f
2:
    .long    0x3
3:
    .align 8
4:

The highlighted value 1668178254 is hex 636E614E or "cnaN" (which, due to the endian reversal as x86 is a little-endian system, becomes "Nanc") in ASCII encoding, and 121 is hex 79, or "y".

So, because the string is short, it uses two move instructions instead of a loop that copies from a byte-string section of the file, and the intervening "garbage" is (I believe) the following movw instruction. This is likely a way to optimize the initialization, versus looping byte-by-byte through memory, even though no optimization flag was "officially" given to the compiler. That's the thing: the compiler can do what it wants in this regard. Microsoft's compiler, then, seems to be more "pedantic" in how it compiles, because it apparently forgoes that optimization in favor of storing the string contiguously.

Illuminate answered 20/4, 2022 at 23:33 Comment(4)
"is (I believe) the following movw instruction" — no need for "belief", it's definite: 66 C7 45 F6 79 00 is exactly mov word [rbp-10], 0x0079.Ellissa
@Ellissa yeah, thanks for confirming; I'm just not excellent at parsing x86 opcodes. I later did confirm it with an "objdump".Illuminate
Yeah, this is what tools like objdump are for. godbolt.org/z/nEfv98MbE even has a "binary mode" where it compiles+assembles and shows you disassembly along with the machine code. (see also my comments on Eric's answer for why GCC does mov eax, 'y' to avoid LCP stalls with optimization enabled). I wouldn't waste my time looking in a raw hexdump of the whole binary and trying to remember numeric opcodes. Normally all you need to know for optimization is opcode and prefix lengths, although I do remember some common ones like B8..FStarveling
Usually the kinds of optimizations you get with optimization off just depend more on the internal structure of the compiler than on the compiler writers being "pedantic" or not. Both are perfectly valid ways to put chars on the stack, but I could guess that perhaps gcc creates a picture of what it wants the stack to look like, then makes it look that way, while MSVC creates a picture of the instructions that put the data on the stack, then optimizes them later (if enabled)Foeticide
C
6

Generally a compiled program is split into different types of "section". The assembler file will use directives to switch between them.

  • Code (".text")
  • Static read-only data (".section .rodata")
  • Initialised global or static variables (".data")
  • Uninitialised (or zero-initialized) global or static variables (".bss")

String literals in C can be used in two different ways.

  • As a pointer to constant data.
  • As an initialiser for an array.

If a string literal is used as a pointer then it is likely the compiler will place the string data in the read only data section.

If a string literal is used to initialise a global/static array then it is likely the compiler will place the array in the initialised data section (or the read-only data section if the array is declared as const).

However in your case the array you are initialising is an automatic local variable. So it can't be pre-initialised before program start. The compiler must include code to initialise it each time your function runs.

The compiler might choose to do that by storing the string in a read-only data location and then using a copy routine (either inlined or a call) to copy it to the local array. It may choose to simply generate instructions to set the elements of the array one by one. It may choose to generate instructions that set several array elements at the same time.

In your example it looks like MSVC has chosen to use a copy routine, so the string appears sequentially in the file. gcc on the other hand has chosen to use a 4 byte move instruction followed by a two byte move instruction, both with literals as inputs. So the literal is split up into two parts.

P.S. I've noticed some people posting https://godbolt.org/ links on other answers to this question. The Compiler Explorer is a useful tool, but be aware that it hides the section-switching directives from the assembler output by default.

Carmelitacarmelite answered 20/4, 2022 at 22:18 Comment(0)
