Why does a library compiled on two slightly different machines behave slightly differently?

6

6

Here's the setup:

My coworker has a Fedora x86_64 machine with a gcc 4.3.3 cross compiler (from buildroot). I have an Ubuntu 9.04 x86_64 machine with the same cross compiler.

My coworker built a library and a test app that work on a test machine. I compiled the same library and test app, and my build crashes on the same test machine.

As far as I can tell, gcc was built against the buildroot-compiled uClibc, so it is the same code and the same compiler. What kinds of host machine differences would impact cross-compiling?

Any insight appreciated.

Update: To clarify, the compilers are identical. The source code for the library and testapp is identical. The only difference is that the testapp and library have been compiled on different machines.

Fazeli answered 12/8, 2009 at 19:5 Comment(4)
It's hard to know without being able to inspect the machines or the code. Can you tell us anything about the nature of the crash? What type of crash is it? Is it taking place in your code or in a library that it depends on? - Spanos
If everything is identical, how different are the output files? - Kilkenny
How did it go, did you find the problem? - Brainy
Unfortunately, I did not find what was causing the crash. It was either something in the environment or a bug in the source. I will never know :( - Fazeli
7

If your code crashes (I assume you get a SIGSEGV), there is most likely a bug in it. The usual cause is some kind of undefined behaviour, like using a dangling pointer or writing past a buffer boundary.

The unfortunate thing about undefined behaviour is that it may appear to work on some machines. I think that is what you are experiencing here. Find the bug and you'll know what happened :-)
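
To make the "writing past a buffer boundary" case concrete, here is a minimal C++ sketch. The helper name `copy_bounded` and the buffer sizes are made up for illustration and are not taken from the asker's code. The unchecked variant appears only as a comment, because running it would itself be undefined behaviour: depending on what a given build happens to place after the buffer, it could seem to work in one binary and segfault in another.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies at most dst_size - 1 bytes of src into dst and
// always NUL-terminates. Returns true only if the whole string fit.
bool copy_bounded(char* dst, std::size_t dst_size, const char* src) {
    if (dst == NULL || src == NULL || dst_size == 0) {
        return false;
    }
    const std::size_t len = std::strlen(src);
    const std::size_t n = (len < dst_size - 1) ? len : dst_size - 1;
    assert(n < dst_size);  // the terminator below stays inside the buffer
    std::memcpy(dst, src, n);
    dst[n] = '\0';
    return len < dst_size;
}

// The unchecked equivalent would be something like:
//     char name[8];
//     std::strcpy(name, "a-much-longer-string");  // writes past name[7]
// Nothing guarantees a crash here, and nothing guarantees the absence of one.
```

A caller would pass `sizeof buf` as `dst_size` and treat a `false` return as "input truncated or rejected" instead of silently overrunning.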

Kehr answered 12/8, 2009 at 19:22 Comment(4)
Just to clarify, this is the same code, compiled on two different machines with the same compiler. It behaves differently, however. - Fazeli
That doesn't contradict what ebo is saying. Undefined behavior is undefined, and can be affected by absolutely anything at all, e.g. slightly different versions of dynamic libraries that the compiler uses on different OSes. I'd bet money that this bug gets traced back to undefined behavior. - Sabinesabino
That's exactly what undefined behaviour looks like. - Belligerent
Oh. Ah. Got it, I think. The fact that the original compile doesn't crash is the exception, not the rule. Understood. - Fazeli
3

In what way does it crash? Can you be more specific and provide output, return codes, etc.? Have you tried adding some useful printf() calls?

And, I think we need a few more details here:

  1. Does the testapp link to the library?

  2. Is the library static or dynamic?

  3. Is the library in the library search path, or have you added its directory to ld.so.conf?

  4. Are you following any installation procedures for the library and testapp?

  5. Are the two libraries and testapps bit-for-bit compatible? Do you expect them to be?

  6. Are you running as the same user as your coworker, with same environment and permissions?

Directed answered 12/8, 2009 at 19:18 Comment(4)
The app crashes with a segfault. I don't know if that's interesting at all; I'm more interested in understanding where this difference is coming from. 1. Yes, the app links to the library. 2. It's a shared library. 3. The library is found via LD_LIBRARY_PATH. 4. I compiled the library and testapp and left them in their original dirs. 5. I'm not sure what bit-for-bit compatible means. They are compiled for the same arch with the same compiler. 6. Yes, this is an embedded device, so I'm running as root. - Fazeli
Have you used ldd to make sure both testapps are linking against the exact same libraries on the target? I'm not sure what you mean on pt. 4. What are the original directories? Since this is a cross-compiled target, you must be compiling on one machine and deploying to the target (test machine). Are you deploying in the exact same manner as your coworker? - Directed
As for pt. 5, what does a binary diff between the two versions of each piece (library & testapp) return, i.e. are they exactly the same? - Directed
Just to be clear, I'm not suggesting you diff library with testapp, but testapp (yours) with testapp (coworker). Ditto for library. - Directed
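
The binary diff suggested in these comments can be done with a standard tool such as `cmp`. As a rough sketch of the same idea in C++, the function below reports the offset of the first byte at which two files differ. The name `first_difference`, the return codes, and the file names in the usage note are assumptions made for this illustration only.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: returns the offset of the first differing byte,
// -1 if the files are identical, or -2 if either file cannot be opened.
// If one file is a prefix of the other, the length of the shorter file
// is returned.
long first_difference(const std::string& path_a, const std::string& path_b) {
    std::ifstream a(path_a.c_str(), std::ios::binary);
    std::ifstream b(path_b.c_str(), std::ios::binary);
    if (!a || !b) {
        return -2;
    }
    const int eof = std::ifstream::traits_type::eof();
    long offset = 0;
    for (;;) {
        const int ca = a.get();  // byte value 0..255, or eof at end of file
        const int cb = b.get();
        if (ca != cb) {
            return offset;       // also covers "one file ended first"
        }
        if (ca == eof) {
            return -1;           // both ended together with no difference
        }
        ++offset;
    }
}
```

Calling it as `first_difference("mine/testapp", "coworker/testapp")`, and again for the two library builds, would show whether the outputs really are identical and, if not, where to start looking with a disassembler.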
3

Obviously, something isn't identical.

Try using objdump and its many options, especially -d, to determine what is different.

You didn't make a point of it, so I am going to guess binutils is the difference. That is the set of tools used in building binaries. It includes ld, as and objdump.

Cross-compilers need their own set of binutils for the target architecture. However, unlike GCC, I do not believe binutils does a double bootstrap build-and-verify step, so it is possible that some difference from the original x86_64 build environment made it into them.

I'd try building the binutils packages for ARM again, using the ARM cross compiler. See if that makes a difference.

It's something I have seen in regular x86 Gentoo stage1 installs too: after getting the bootstrap system and compilers installed and updated, a Gentoo user is well-recommended to rebuild system again using the updated tools.

Ganister answered 12/8, 2009 at 19:51 Comment(1)
Binutils are identical on both machines as they've been compiled by buildroot. Interesting note about recompiling the tools, thanks. - Fazeli
1

What arch is your target (the test machine)?

Are you using the distribution-provided compilers? They usually have quite a large set of patches applied to gcc; on Gentoo, for example, there are about 20 patches, and Fedora and Ubuntu won't be that different. Not all patches are 100% fine, though :-( So the compilers may in reality differ.

You could look for a "vanilla" version of gcc for your distribution; maybe that does the trick.

Roman answered 12/8, 2009 at 19:16 Comment(1)
The target arch is ARM. The code is compiled with an identical compiler that's provided by buildroot. - Fazeli
1

I knew someone who had a similar experience in college. Basically, in a lab of identical machines, his project worked on his development box but crashed horribly on the professor's box. These were two machines of the same arch, running the same version of the OS.

It boiled down to an uninitialized pointer somewhere.

He had code which looked like:

if(p == NULL) {
    p = f();
}

Since p was a member of a class which was allocated on the heap and was never initialized, its value was effectively random. Occasionally it was in fact NULL, which made things work OK. The problem was that on some machines the memory for p happened to be NULL on program startup, but on the prof's box it was not. The fix was of course to properly initialize p to NULL, and all was well.

You may be experiencing something like this, or some other type of undefined behavior, which is a fancy way of saying "it may or may not work as expected for any or no reason at all".
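
As a sketch of the fix described above, the class below gives the pointer a known value in its constructor so that the `p == NULL` test is well defined. The names `Resource`, `make_resource`, and `Holder` are hypothetical stand-ins, since the original class and `f()` are not shown.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical resource and factory, standing in for whatever f() returned.
struct Resource {
    int value;
};

Resource* make_resource() {
    static Resource shared = { 42 };
    return &shared;
}

class Holder {
public:
    // The fix: initialize every member. Without this initializer list, p_
    // would hold whatever bytes were already in the allocation, and the
    // NULL check in get() would pass or fail by luck.
    Holder() : p_(NULL), factory_calls_(0) {}

    Resource* get() {
        if (p_ == NULL) {  // well defined now that p_ starts as NULL
            p_ = make_resource();
            ++factory_calls_;
        }
        assert(p_ != NULL);
        return p_;
    }

    int factory_calls() const { return factory_calls_; }

private:
    Resource* p_;
    int factory_calls_;
};
```

With the initializer in place, a heap-allocated `Holder` calls the factory exactly once on first use, regardless of what the allocator handed back.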

Gabrielegabriell answered 12/8, 2009 at 20:19 Comment(0)
1

As a stab in the dark, I'd look for uninitialized variables. Make sure all local and global variables are assigned a value. Double check that constructors have initializers for ALL data members.

Recumbent answered 13/8, 2009 at 4:13 Comment(0)
