Floating point exception ( SIGFPE ) on 'int main(){ return(0); }'
I am trying to build a simple C program for two different Linux environments. On one device the program runs fine; on the other it generates a floating point exception. The program does nothing but return 0 from main, which leads me to believe there is some incompatibility with the start-up code, perhaps an ABI mismatch?

The program is compiled with gcc with the following build specs:

Using built-in specs.
Target: i386-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-libgcj-multifile --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --disable-plugin --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic --host=i386-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)

The program source is the following:

int main()
{
        return(0);
}

On the Celeron device this program generates the following under GDB:

[root@n00200C30AA2F jrn]# /jrn/gdb fail
GNU gdb Red Hat Linux (5.3post-0.20021129.18rh)
(gdb) run
Starting program: /jrn/fail

Program received signal SIGFPE, Arithmetic exception.
0x40001cce in ?? ()
(gdb) bt
#0  0x40001cce in ?? ()
#1  0x4000c6b0 in ?? ()
#2  0x40000cb5 in ?? ()

Below are the details that I can think to gather to help find out what is happening:

CELERON:  ( fails on this device )
2.6.8 #21 Mon Oct 1 11:41:47 PDT 2007 i686 i686 i386 GNU/Linux
============
[root@n00200C30AA2F proc]# cat cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 9
model name      : Intel(R) Celeron(R) M processor          600MHz
stepping        : 5
cpu MHz         : 599.925
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr mce cx8 sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 tm pbe
bogomips        : 1179.64

GNU C Library stable release version 2.3.2, by Roland McGrath et al.
Compiled by GNU CC version 3.2.2 20030222 (Red Hat Linux 3.2.2-5).
Compiled on a Linux 2.4.20 system on 2003-03-13.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        linuxthreads-0.10 by Xavier Leroy
        BIND-8.2.3-T5B
        libthread_db work sponsored by Alpha Processor Inc
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk

ATOM:  ( works fine on this device )
2.6.35 #25 SMP Mon Mar 12 09:02:45 PDT 2012 i686 i686 i386 GNU/Linux
==========
[root@n00E04B36ECE5 ~]# cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 28
model name      : Genuine Intel(R) CPU N270   @ 1.60GHz
stepping        : 2
cpu MHz         : 1599.874
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx constant_tsc up arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 xtpr pdcm movbe lahf_lm
bogomips        : 3199.74
clflush size    : 64
cache_alignment : 64
address sizes   : 32 bits physical, 32 bits virtual
power management:


GNU C Library stable release version 2.5, by Roland McGrath et al.
Compiled by GNU CC version 4.1.2 20080704 (Red Hat 4.1.2-44).
Compiled on a Linux 2.6.9 system on 2009-09-02.
Available extensions:
        The C stubs add-on version 2.1.2.
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        GNU libio by Per Bothner
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        RT using linux kernel aio
Thread-local storage support included.

What can I do to determine what is causing this problem? How about trying to statically link against a certain version of libc?

After failure occurs under GDB I execute:

(gdb) x/1i $eip
0x40001cce:     divl   0x164(%ecx)
(gdb) info reg
eax            0x6c994f 7117135
ecx            0x40012858       1073817688
edx            0x0      0
ebx            0x40012680       1073817216
esp            0xbffff740       0xbffff740
ebp            0xbffff898       0xbffff898
esi            0x8049580        134518144
edi            0x400125cc       1073817036
eip            0x40001cce       0x40001cce
eflags         0x10246  66118
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x0      0
(gdb) x/1wx 0x164+$ecx
0x400129bc:     0x00000000
(gdb) 

Based on the help I've received it appears that for some reason the libc startup code is dividing by 0.
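
To make that conclusion concrete, here is a minimal sketch of when a 32-bit `divl` traps, assuming the standard x86 semantics described in the comments (the 64-bit value in EDX:EAX is divided by a 32-bit operand). The helper name `divl_faults` is hypothetical and is not part of libc; it only models the two fault conditions.

```c
#include <stdint.h>

/* Hypothetical model of x86 `divl`: returns 1 if the instruction would
 * raise a divide error (delivered on Linux as SIGFPE), 0 otherwise. */
static int divl_faults(uint32_t edx, uint32_t eax, uint32_t divisor)
{
    if (divisor == 0)
        return 1;                        /* division by zero */

    /* The dividend is the 64-bit quantity EDX:EAX. */
    uint64_t dividend = ((uint64_t)edx << 32) | eax;

    /* The quotient must fit in 32 bits (it is written back to EAX). */
    return (dividend / divisor) > UINT32_MAX;
}
```

With the register values shown above (edx = 0, eax = 0x6c994f, memory operand = 0), only the first condition applies, which is consistent with a plain division by zero rather than a quotient overflow.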

The question now is, what is causing this obviously bad behavior? Something must be incompatible with something else?

Assembly output:

[jrn@localhost ~]$ more fail.s
        .file   "fail.c"
        .text
.globl main
        .type   main, @function
main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ecx
        movl    $0, %eax
        popl    %ecx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret
        .size   main, .-main
        .ident  "GCC: (GNU) 4.1.2 20080704 (Red Hat 4.1.2-52)"
        .section        .note.GNU-stack,"",@progbits
Lightning answered 24/9, 2012 at 18:2 Comment(24)
And even gcc 4.1.2 is really old. Current GCC version is 4.7!Crag
GCC has entered the revision hell Firefox brought up. You'll see soon that we'll face GCC 25.3 in days!Psychodiagnostics
It might be also libc or libstdc++ related. Both have non-trivial initialization code. I would install their -dbg packages and try to use gdb to debug the issue. Good luck, you'll need it.Crag
These are embedded devices that have been in the field for years. So the Linux environments are old. We are trying to upgrade to newer build tools etc. But one device running a Celeron is having issues. I just don't know what part is wrong. std c library? Kernel tweak? ABI? No idea...Lightning
You could consider using cross-development techniques, notably a recent GCC & binutils with the musl-libcCrag
@BasileStarynkevitch Thanks for the helpful pointers. I will see what I can come up with.Lightning
Can you do a x/1i $eip when you get the FPE under GDB?Astrionics
@Astrionics Thank you. Result: (gdb) x/1i $eip 0x40001cce: divl 0x164(%ecx)Lightning
@Chimera: That's an integer divide instruction, which can only fail if you try to divide anything by 0, or if the result doesn't fit in 32-bits (it divides the 64-bit quantity in the registers EDX:EAX by the 32-bit divisor operand). Can you print out the operands with the commands info reg and x/1wx 0x164+$ecx?Salify
@AdamRosenfield done .. question updated. Thank you.Lightning
Well something in your libc startup code is dividing by 0. To figure out what code that is and why, you'll need to figure out how to get debug symbols for your libc implementation. Good luck.Salify
can I go in with a binary file editor and just remove that instruction? :-)Lightning
@Lightning remove won't work, but you should be able to swap it for a nop. Who knows what side effects that will have though!Andante
@Chimera: No, that's a very bad idea. Most likely, it will just crash again very soon after that, but even if you can get it running, it could likely start causing random other failures in completely unrelated places.Salify
Yeah - stuff like that you have to get to the bottom of, else every time you get another bit of strange behaviour, you'll have this nasty, nagging doubt..Hamon
Please provide the assembly output from GCC, using the -S command-line option.Sovereign
@Sovereign ASM output added to question. Thank you.Lightning
Judging by your processor (Genuine Intel(R) CPU N270 @ 1.60GHz) and your version of GCC (4.1.2 20080704, Red Hat 4.1.2-52), I don't think even the Red Hat or Red Hat-based Linux devs would want to dive into this attic to find the solution. Does an updated version of the Linux OS work?Canescent
Maybe first get the basics right: int main(void) or int main(int c, char **a)Trencherman
@Trencherman Already tried that, makes no difference.Lightning
@Canescent The purpose of this whole mess is to be able to update our build environment to newer tools and still be compatible with our older devices out in the field. Having to go into the field to change the Compact Flashes for 10's of thousands of devices to upgrade them to a newer libc and kernel etc is pretty expensive. I think we may be stuck. Unless someone comes along with a solution other than "use matching libc versions" we will have to re-think our upgrade path.Lightning
@Lightning i would suggest a virtual machine but that might not even run on the older devices lolCanescent
Please also post the output of readelf -a fail. See my answer for my guess on what's happening.Puffin
@H2CO3 offtopic, but I'm pretty sure that it was Google Chrome that started the high frequency version updates rage. (I believe FF just updated to 'look' more up-to-date compared to major browsers - no matter about the official stance; of course that may coincide with other project lifecycle changes)Douche

This is going to sound like a really long shot...but can you try the following?

$ readelf -a fail

and look for a GNU_HASH dynamic tag? My guess is that the binary uses GNU_HASH, and your ld.so is too old to understand it. Support for the GNU hash section was added to glibc around 2006, and mainline distros began to be GNU-hash-only around 2007 or 2008. The glibc on your Celeron device is from 2003, which predates GNU hashing.

If the ld.so doesn't understand GNU hash, it will try to use the old ELF hash section instead, which is empty. In particular, I suspect your crash is occurring at this line in elf/do-lookup.h:

for (symidx = map->l_buckets[hash % map->l_nbuckets];

Since the linker presumably doesn't understand GNU hashes, l_nbuckets would be 0, resulting in the crash. Note that map is a large structure with around 100 structure elements, and l_nbuckets is around the 90th member of the structure in newer ld.so (0x164 = 4*89, so in older ld.so it is probably precisely this member).
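
To illustrate the failure mode, here is a minimal sketch of a SysV-style bucket lookup. The hash function is the classic System V ELF hash from the ELF specification; the names `sysv_elf_hash` and `pick_bucket` are hypothetical, and this is not glibc's actual code. The only point is that an empty table (zero buckets) makes the modulo a division by zero unless it is guarded.

```c
#include <stdint.h>

/* Classic System V ELF symbol hash, as given in the ELF specification. */
static uint32_t sysv_elf_hash(const char *name)
{
    uint32_t h = 0, g;
    while (*name) {
        h = (h << 4) + (unsigned char)*name++;
        g = h & 0xf0000000u;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* Hypothetical helper: returns the bucket index for `name`, or -1 when
 * the table has no buckets. An unguarded `hash % 0` compiles to a divide
 * instruction on x86 and traps, which the kernel reports as SIGFPE. */
static int64_t pick_bucket(const char *name, uint32_t nbuckets)
{
    if (nbuckets == 0)
        return -1;
    return (int64_t)(sysv_elf_hash(name) % nbuckets);
}
```

The old loader presumably has no such guard, because every binary it was written for carried a non-empty SysV hash table.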

To see if this is conclusively the problem, build with -Wl,--hash-style=sysv or -Wl,--hash-style=both and see if the crash goes away.
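
If reading the readelf output by eye is inconvenient, the same check can be sketched in C. This is only an illustration under assumptions: it walks a dynamic-section array and reports which hash tags are present, using the `DT_HASH` and `DT_GNU_HASH` constants and the `ElfW()` macro from glibc's `<link.h>` and `<elf.h>`. The function name `hash_tags` and the two flag names are hypothetical.

```c
#include <link.h>   /* ElfW(); also pulls in <elf.h> for the DT_* tags */

enum { HAS_SYSV_HASH = 1, HAS_GNU_HASH = 2 };

/* Hypothetical helper: scans a DT_NULL-terminated dynamic section and
 * returns a bitmask of the hash-table tags it contains. */
static int hash_tags(const ElfW(Dyn) *dyn)
{
    int found = 0;
    for (; dyn->d_tag != DT_NULL; ++dyn) {
        if (dyn->d_tag == DT_HASH)
            found |= HAS_SYSV_HASH;     /* old-style table, understood by old ld.so */
        else if (dyn->d_tag == DT_GNU_HASH)
            found |= HAS_GNU_HASH;      /* newer table, needs glibc from ~2006 on */
    }
    return found;
}
```

In a dynamically linked program, passing `_DYNAMIC` (declared in `<link.h>`) should inspect the running executable itself. A result of only `HAS_GNU_HASH` would match the theory above; a binary built with --hash-style=both should report both flags.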

Puffin answered 25/9, 2012 at 1:22 Comment(5)
Thanks... when I get into the office in the morning I will give it a try.Lightning
Wow. Just wow. Nice analysis. +1 regardless of whether it fixes the issue for the OPDouche
Yes - it's such a good try that it ought to be right, (even though the posters' name never seems to end). +1.Hamon
You, sir, have excellent psychic debugging skills. Raymond Chen would be proud.Salify
You rock! That was the problem. Compiling with -Wl,--hash-style=both generates an executable that works on both the older and newer environments.Lightning

Since it works on the Atom but not on the older Celeron, I would think the problem could be a compiler optimization generating code that the Celeron cannot execute. Try compiling with the flag -O0. Additionally, I would suggest adding -march=i686 to state the architecture explicitly. To help isolate the problem, I'd also suggest disabling linking to the C++ runtime and Java.

Did you build this test program once and run it on each device, or did you build a different executable for each device? If you are building one executable you may have differing versions of libc, libstdc++ on the two devices or on the devices vs your build machine.

Marriott answered 24/9, 2012 at 22:37 Comment(6)
The compile flags made no difference, same outcome. And yes, building on one machine and running the executable on the two different devices. The Atom device environment is running libc 2.5 ( which is the same as is on build machine ). However, the device that the executable fails on has libc 2.3.2. So perhaps there is some backwards compatibility issue with libc 2.3.2 and libc 2.5?Lightning
What about glibcxx or libstdc++? Also, are you statically linking any of these libraries? I would suggest trying to build against the lowest common denominator of libc 2.3.2 and whatever that device has for c++.Marriott
Not statically linking anything. [jrn@localhost ~]$ ldd fail linux-gate.so.1 => (0x0098f000) libc.so.6 => /lib/libc.so.6 (0x00bb0000) /lib/ld-linux.so.2 (0x00b91000)Lightning
Build against an older version of libc (ie, 2.3.2). This is most likely causing your problem.Marriott
Yep, that is my suspicion as well, however, we are trying to find a way to upgrade our build environment to newer libraries etc but still be able to create executables compatible with multiple devices that have different versions of libc. So we may be stuck and the ultimate way forward is to bite the bullet and upgrade the environment for the older legacy devices.Lightning
I will generally use an old Linux distribution for my build machine, makes it easier to support the software with a single binary across many versions of Linux. You can download the glibc 2.3.2 source (ftp.gnu.org/gnu/libc/glibc-2.3.2.tar.gz) and simply update your Makefiles. This might be helpful: tldp.org/HOWTO/Glibc2-HOWTO-6.htmlMarriott
