Why don't two binaries of programs with only comments changed exactly match in gcc?

I created two C programs:

  1. Program 1

    int main()
    {
    }
    
  2. Program 2

    int main()
    {
    //Some Harmless comments
    }
    

AFAIK, when compiling, the compiler (gcc) should ignore comments and redundant whitespace, and hence the output should be identical.

But when I checked the md5sums of the output binaries, they don't match. I also tried compiling with optimisation -O3 and -Ofast but they still didn't match.

What is happening here?

EDIT: the exact commands and their md5sums are (t1.c is program 1 and t2.c is program 2):

gcc ./t1.c -o aaa
gcc ./t2.c -o bbb
98c1a86e593fd0181383662e68bac22f  aaa
c10293cbe6031b13dc6244d01b4d2793  bbb

gcc ./t2.c -Ofast -o bbb
gcc ./t1.c -Ofast -o aaa
2f65a6d5bc9bf1351bdd6919a766fa10  aaa
c0bee139c47183ce62e10c3dbc13c614  bbb


gcc ./t1.c -O3 -o aaa
gcc ./t2.c -O3 -o bbb
564a39d982710b0070bb9349bfc0e2cd  aaa
ad89b15e73b26e32026fd0f1dc152cd2  bbb

And yes, md5sums match across multiple compilations with same flags.

BTW my system is gcc (GCC) 5.2.0 and Linux 4.2.0-1-MANJARO #1 SMP PREEMPT x86_64 GNU/Linux

Ellyellyn answered 4/9, 2015 at 14:48 Comment(14)
Please include your exact command line flags. For example, is debug information included in the binaries at all? If so, the line numbers changing would obviously affect it... - Or
Is the MD5 sum consistent across multiple builds of the same code? - Gondar
I can't reproduce this. I would've guessed that this is caused by the fact that GCC embeds a whole bunch of metadata into binaries when compiling them (including timestamps). If you could add the precise command line flags you used, that'll be useful. - Alda
@JonSkeet I have edited the question to add the information you asked for. - Ellyellyn
@Gondar yes, they do match. - Ellyellyn
@Alda edit made. Please read the edit. - Ellyellyn
Compare the contents, not the MD5. Which part of the ELF exactly changed? - Kazachok
Just for clarity, you can further isolate this by separating compilation and linking, i.e. compile to .o files, compare those, then link and compare those. Fwiw, clang doesn't exhibit any difference with the same code, options, etc., on my 3.6 toolchain (OSX 10.10.5). - Kamilahkamillah
Instead of just checking MD5sums and getting stuck, hexdump and diff to see exactly which bytes differ. - Jobey
Though the answer to the question "what is different between the two compiler outputs?" is interesting, I note that the question has an unwarranted assumption: that the two outputs should be the same and that we require some explanation of why they are different. All the compiler promises you is that when you give it a legal C program, the output is a legal executable that implements that program. That any two executions of the compiler produce the same binary is not a guarantee of the C standard. - Crisscross
Check this article for some information on why GCC toolchain outputs for the same inputs can be different and some techniques to help get deterministic output: blog.mindfab.net/2013/12/… - Finny
@EricLippert: Y'know, it is possible - and often very useful - for a compiler to give stronger guarantees than the absolute bare minimum mandated by the C standard ;-) - Sad
@psmears: Of course it is possible. My point is that a question could be thought of as having the form "my compiler fails to implement a feature it is neither required to implement nor documented as implementing; why does it not have this feature?" Questions about why a not-required feature was not implemented can be hard to answer. - Crisscross
@EricLippert: I know what you mean in general, but in this case I think the question does deserve an answer, both from a naïve point of view ("why on earth wouldn't the compiler give the same output when given exactly the same input?") and because there are compelling real-world motivations for having it do so, both practical (hard-to-track-down bugs where the behaviour depends on the exact bytes in the executable) and procedural (the implementation of ISO9000 or similar standards may require the ability to reproduce builds exactly). A compiler that can help with these is a good thing :) - Sad

It's because the file names are different (although the strings output is the same). If you try modifying the file itself (rather than having two files), you'll notice that the output binaries are no longer different. As both Jens and I said, it's because GCC dumps a whole load of metadata into the binaries it builds, including the exact source filename (and AFAICS so does clang).

Try this:

$ mkdir -p subdir
$ cp code.c code2.c
$ cp code.c subdir/code.c
$ gcc code.c -o a
$ gcc code2.c -o b
$ gcc subdir/code.c -o a2
$ diff a b
Binary files a and b differ
$ diff a2 b
Binary files a2 and b differ
$ diff -s a a2
Files a and a2 are identical

This explains why your md5sums don't change between builds but differ between the two source files. If you want, you can do what Jens suggested and compare the output of strings for each binary; you'll notice that the file names are embedded in the binary. If you want to "fix" this, you can strip the binaries and the metadata will be removed:

$ strip a a2 b
$ diff -s a b
Files a and b are identical
$ diff -s a2 b
Files a2 and b are identical
$ diff -s a a2
Files a and a2 are identical
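
To see where the name is stored, rather than only that the hashes differ, one option is to list the FILE entries of the ELF symbol table. This is a minimal sketch, assuming binutils' readelf and awk are installed; show_file_symbols is a hypothetical helper name, not something from the answer above.

```shell
# Print the names of all FILE-type symbols in an ELF binary.
# readelf -s columns: Num, Value, Size, Type, Bind, Vis, Ndx, Name.
# For an unstripped gcc build, the list typically includes the source
# file name that was compiled (plus entries from the C runtime objects).
show_file_symbols() {
    readelf -s --wide "$1" | awk '$4 == "FILE" && NF >= 8 { print $8 }'
}
```

Running show_file_symbols a and show_file_symbols b on the binaries built above should list code.c for one and code2.c for the other. After strip, the symbol table is gone, so the helper would be expected to print nothing.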
Alda answered 4/9, 2015 at 15:10 Comment(4)
EDIT: Updated to say that you can strip the binaries to "fix" the problem. - Alda
And this is why you should compare the assembly output, not MD5 checksums. - Granddaddy
I have asked a follow-up question here. - Sarette
Depending on the object file format, the compilation time is also stored in the object files. So with COFF files, for example, files a and a2 would not be identical. - Whitecap

The most common reasons are file names and time stamps added by the compiler (usually in the debug info part of the ELF sections).

Try running

 $ strings -a program > x
 ...recompile program...
 $ strings -a program > y
 $ diff x y

and you might see the reason. I once used this to find why the same source would cause different code when compiled in different directories. The finding was that the __FILE__ macro expanded to an absolute file name, different in both trees.
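
The __FILE__ effect is easy to reproduce. The sketch below is only an illustration, assuming gcc is on the PATH; write_demo_source, build_and_run, and the file names are hypothetical.

```shell
# Write a tiny C program that prints whatever __FILE__ expanded to.
write_demo_source() {
    cat > "$1" <<'EOF'
#include <stdio.h>
int main(void) { puts(__FILE__); return 0; }
EOF
}

# Compile the source at path $1 into ./file_macro_demo and run it.
# gcc expands __FILE__ to the path exactly as given on the command line.
build_and_run() {
    gcc "$1" -o ./file_macro_demo && ./file_macro_demo
}
```

After write_demo_source where.c, calling build_and_run where.c prints the relative name, while build_and_run "$PWD/where.c" prints the absolute path, so the two executables contain different strings. Newer GCC releases (8 and later) provide -fmacro-prefix-map=OLD=NEW to remap such paths.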

Lombardo answered 4/9, 2015 at 14:57 Comment(3)
According to gcc.gnu.org/ml/gcc-help/2007-05/msg00138.html (outdated, I know) they don't save timestamps and it might be a linker issue. Although, I do remember reading a story recently about how a security firm profiled the working habits of a hacking team using the GCC timestamp information in their binaries. - Alda
And not to mention that OP states that "md5sums match across multiple compilations with same flags" which indicates it probably isn't timestamps that are causing the issue. It's probably caused by the fact that they're different file names. - Alda
@Alda Different file names should be caught by the strings/diff approach as well. - Lombardo

Note: remember that the source file name goes into the unstripped binary, so two programs coming from differently named source files will have different hashes.

In similar situations, should the above not apply, you can try:

  • run strip against the binary to remove some fat. If the stripped binaries are the same, then the difference was in metadata that isn't essential to the program's operation.
  • generate intermediate assembly output to verify that the difference is not in the actual CPU instructions (or, if it is, to better pinpoint where the difference actually is).
  • use strings, or dump both programs to hex and run a diff on the two hex dumps. Once you have located the difference(s), you might try and see whether there's some rhyme or reason to them (PID, timestamps, source file timestamp...). For example, you might have a routine storing the timestamp at compile time for diagnostic purposes.
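
The byte-level comparison in the last bullet can be sketched with cmp instead of a full hex dump. This is a minimal sketch, assuming both files have the same length; differing_offsets is a hypothetical helper name.

```shell
# Print the 1-based decimal offset of every byte that differs between
# two files. cmp -l emits one line per differing byte: the offset,
# followed by the two byte values in octal; only the offset is kept.
differing_offsets() {
    cmp -l "$1" "$2" | awk '{ print $1 }'
}
```

If only a few offsets are reported, looking at the bytes around them with od or a hex viewer often reveals an embedded string such as a file name. If the files differ in length, cmp also reports an end-of-file message on stderr, and a diff of two hex dumps is the more robust choice.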
Babb answered 4/9, 2015 at 15:7 Comment(3)
My system is gcc (GCC) 5.2.0 and Linux 4.2.0-1-MANJARO #1 SMP PREEMPT x86_64 GNU/Linux - Ellyellyn
You should try actually making two separate files. I couldn't reproduce it with modifying a single file either. - Alda
Yes, file names are the culprit. I can get the same md5sums if I compile the programs with the same name. - Ellyellyn
