C++ program crashes when linked against two third-party shared libraries

I have two outsourced shared libraries for the Linux platform (no source, no documentation). Each library works fine when it is linked into a program on its own (g++ xx.cpp lib1.so, or g++ xx.cpp lib2.so).

However, when any C++ program is linked against both shared libraries at the same time (g++ xx.cpp lib1.so lib2.so), the program inevitably crashes with a "double free" error.

Even if the C++ program is an empty hello-world program that has nothing to do with these libraries, it still crashes.

#include <iostream>
using namespace std;
int main(){
     cout<<"haha, I crash again. Catch me if you can"<<endl;
     return 0;
}

Makefile:

g++ helloworld.cpp lib1.so lib2.so

I have some indication that lib1.so and lib2.so share a common global variable and that the variable is destroyed twice. I have tried gdb and valgrind, but could not extract useful information from the backtrace.

Is there any way to isolate these two shared libraries so that each works as if in its own sandbox?

EDITED (adding core dump and gdb backtrace):

I just linked the toy hello-world program above against the two libraries (platform: CentOS 7.0 64-bit with gcc 4.8.2):

g++ helloworld.cpp  lib1.so lib2.so -o check

Valgrind:

==29953== Invalid free() / delete / delete[] / realloc()
==29953==    at 0x4C29991: operator delete(void*) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==29953==    by 0x613E589: __cxa_finalize (in /usr/lib64/libc-2.17.so)
==29953==    by 0x549B725: ??? (in /home/fanbin/InventoryManagment/lib1.so)
==29953==    by 0x5551720: ??? (in /home/fanbin/InventoryManagment/lib1.so)
==29953==    by 0x613E218: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==29953==    by 0x613E264: exit (in /usr/lib64/libc-2.17.so)
==29953==    by 0x6126AFB: (below main) (in /usr/lib64/libc-2.17.so)
==29953==  Address 0x6afb780 is 0 bytes inside a block of size 624 free'd
==29953==    at 0x4C29991: operator delete(void*) (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==29953==    by 0x613E589: __cxa_finalize (in /usr/lib64/libc-2.17.so)
==29953==    by 0x4F07AC5: ??? (in /home/fanbin/InventoryManagment/lib2.so)
==29953==    by 0x5039900: ??? (in /home/fanbin/InventoryManagment/lib2.so)
==29953==    by 0x613E218: __run_exit_handlers (in /usr/lib64/libc-2.17.so)
==29953==    by 0x613E264: exit (in /usr/lib64/libc-2.17.so)
==29953==    by 0x6126AFB: (below main) (in /usr/lib64/libc-2.17.so)

gdb backtrace message:

(gdb) bt
#0  0x00007ffff677d989 in raise () from /lib64/libc.so.6
#1  0x00007ffff677f098 in abort () from /lib64/libc.so.6
#2  0x00007ffff67be197 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff67c556d in _int_free () from /lib64/libc.so.6
#4  0x00007ffff7414aa2 in __tcf_0 () from ./lib1.so
#5  0x00007ffff678158a in __cxa_finalize () from /lib64/libc.so.6
#6  0x00007ffff739f726 in __do_global_dtors_aux () from ./lib1.so
#7  0x0000000000600dc8 in __init_array_start ()
#8  0x00007fffffffe2c0 in ?? ()
#9  0x00007ffff7455721 in _fini () from ./lib1.so
#10 0x00007fffffffe2c0 in ?? ()
#11 0x00007ffff7debb98 in _dl_fini () from /lib64/ld-linux-x86-64.so.2
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

update

Thanks to @RaduChivu's help, I found a very similar scenario: "segmentation fault at __tcf_0 when program exits". It looks like there is indeed a global variable collision between the two libraries. Since I do not have the source for these two external shared libraries, is there any way to resolve the conflict other than using two separate processes?

Scotch answered 31/7, 2014 at 5:52 Comment(9)
What is the # char all around? - Biotic
Sounds to me like the two libraries are compiled with different versions of gcc (or different compilers in any case). There is nothing you can do in that case except ask the devs to use the same compiler, or make separate processes that talk through IPC. - Unsought
@Biotic "#" just represents the names of the shared libraries. I have edited the question to make it more readable. - Scotch
@RaduChivu It is possible. Forgive me, but I still want to give it a try. Could it also be that the two libraries share a global variable with the same name and both try to release it? These two libraries belong to the same package, which has gone through more than 5 versions; it would be unlikely that they used different compilers all 5 times. Besides, the Windows version works fine but all Linux versions crash. - Scotch
Only if it's a global variable that the linker can't find at compile time, so the system loads it from another .so (extern void* globalVar). Can you use gdb to get a stack trace and post it in your question? Maybe we can get more context from there. - Unsought
@RaduChivu Yep, I have added the valgrind output and the gdb backtrace. The libraries are lib1.so and lib2.so. - Scotch
Yeah, so the issue is, like you said, a name collision between two global variables. More on this at the following link: https://mcmap.net/q/1158703/-crash-in-__tcf_0 - Unsought
Is it possible you're building with a different version of the library than the one it's picking up on the target when running? This causes other libraries, e.g. libstdc++, to malfunction. The double free/corruption is a typical symptom of this. - Enrollment
I see this same issue when dlopen'ing a library twice (same lib, two different filenames) with RTLD_GLOBAL. The second dlclose causes a segfault, traced to _dl_close_worker -> __do_global_dtors_aux -> __cxa_finalize -> ... -> free. - Keane

I solved this problem after a day's search and am leaving a note here in case anyone else encounters it in the future.

Explanation

It turns out that @RaduChivu's guess and mine were correct: the two shared libraries appear to share a common global variable. Even when an empty program is linked against both libraries, the shared global is released twice as the program exits, which causes the double-free corruption.

The clue comes from this message in gdb backtrace:

#4  0x00007ffff7414aa2 in __tcf_0 () from ./lib1.so

As described in this thread:

What is function __tcf_0? (Seen when using gprof and g++),

__tcf_0 is a function generated by g++ to destroy a static object when exit() is called. This frame hints that the double free occurs when one shared library runs its exit-time cleanup after the other has already done so.
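
To make the mechanism concrete, here is a minimal sketch (not taken from either library) of a static object whose destructor runs during exit(). The names Tracer and static_destructor_runs_at_exit are hypothetical, and the sketch assumes a POSIX system; fork() and a pipe are used only so that the parent process can observe what the child does while exiting. With g++ on Linux, the destructor call is typically registered when the static object is constructed and invoked from the exit path, which is where a helper such as __tcf_0 shows up in a backtrace.

```cpp
#include <cstdio>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a library's static object.
struct Tracer {
    int fd;
    explicit Tracer(int f) : fd(f) {}
    // Writes one byte so the parent can tell that the destructor ran.
    ~Tracer() {
        ssize_t n = write(fd, "D", 1);
        (void)n;
    }
};

// Forks a child that constructs a function-local static Tracer and then
// calls std::exit(). Returns true if the child's static destructor ran.
bool static_destructor_runs_at_exit() {
    int fds[2];
    if (pipe(fds) != 0) return false;

    std::fflush(nullptr);  // do not duplicate buffered output in the child
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return false;
    }
    if (pid == 0) {
        close(fds[0]);
        static Tracer tracer(fds[1]);  // destructor is registered for exit time
        std::exit(0);                  // runs exit handlers and static destructors
    }

    close(fds[1]);
    char c = 0;
    ssize_t n = read(fds[0], &c, 1);  // a 'D' arrives only if ~Tracer() ran
    close(fds[0]);
    int status = 0;
    waitpid(pid, &status, 0);
    return n == 1 && c == 'D';
}
```

Calling static_destructor_runs_at_exit() from main should return true on a typical Linux/g++ setup. If two libraries each register such a destructor for what ends up being the same object, the second call frees memory that is already freed.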

Since these two libraries are designed to work together, the corruption is an unacceptable engineering failure. How can such an obvious, low-quality bug survive five version releases? Probably because the majority of the library's users work on Windows, where the package works fine. That assumption gives another hint about the origin of the mistake: the shared libraries work on Windows but crash on Linux, so some OS-dependent difference in behavior must be causing the bug. This thread provides some insight:

Global variable has multiple copies on Windows and a single on Linux when compiled in both exec and shared libaray.

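In short, "extern globals" from shared libraries get a single copy on Linux, but multiple copies on Windows. On Linux, same-named exported symbols in the two libraries resolve to one object, yet apparently each library still runs its own destructor for it at exit.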

Solution

(1) The natural workaround is to create two processes, each linked against one of the libraries.

(2) @DavidSchwartz suggests another workaround: call _exit(0) at the end of the program instead of the usual "return 0" or "exit(0)". It works. According to the thread

What is the difference between using _exit() & exit() in a conventional Linux fork-exec?

one must then flush files manually and take care of any atexit handlers. As for memory, the OS reclaims all process memory when the program exits anyway, so there is nothing to worry about.

(3) Another way is to load each library with dlopen("xx.so", RTLD_NOW | RTLD_LOCAL), which keeps its symbols out of the global scope, and then look up the function symbols you need manually with dlsym

(@JonathanWakely notes that RTLD_LOCAL has side effects; see the comment below).

In this particular case, the library authors did not even use extern "C" in their shared libraries, which makes the mangled names in the .so files quite unreadable. If anyone else runs into this, the following thread may help:

Getting undefined symbol error while dynamic loading of shared library

If your shared libraries are poorly supported, as in my case, a solution is still possible. I manually worked out all the required functions, used nm to find the corresponding symbol for each in the .so files, linked them one by one, and it worked.
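
The dlopen/dlsym approach from (3) can be sketched as follows. This is only an illustration under assumptions: the helper name call_from_private_library and the UnaryMathFn type are hypothetical, and because lib1.so and lib2.so are not available, the usage note below loads glibc's libm.so.6 and looks up cos as a stand-in. For a C++ library built without extern "C", the string passed to dlsym would be the mangled name exactly as nm -D prints it. On older glibc versions the program also has to be linked with -ldl.

```cpp
#include <dlfcn.h>

// Signature of the function we expect to find (for example: double cos(double)).
using UnaryMathFn = double (*)(double);

// Loads `library` with RTLD_LOCAL so its symbols stay out of the global
// scope, looks up `symbol`, and calls it with `x`.
// Returns false if the library or the symbol cannot be resolved.
bool call_from_private_library(const char* library, const char* symbol,
                               double x, double& result) {
    // dlopen requires RTLD_NOW or RTLD_LAZY; RTLD_LOCAL is OR'ed in.
    void* handle = dlopen(library, RTLD_NOW | RTLD_LOCAL);
    if (handle == nullptr) return false;

    dlerror();  // clear any stale error before the lookup
    void* address = dlsym(handle, symbol);
    if (dlerror() != nullptr || address == nullptr) {
        dlclose(handle);
        return false;
    }

    // Converting void* to a function pointer is the usual POSIX idiom here.
    UnaryMathFn fn = reinterpret_cast<UnaryMathFn>(address);
    result = fn(x);

    dlclose(handle);
    return true;
}
```

For example, call_from_private_library("libm.so.6", "cos", 0.0, r) should store 1.0 in r on a glibc system. With the real libraries, each one would get its own dlopen handle, and every required function would be fetched through dlsym by its mangled name.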

Scotch answered 1/8, 2014 at 8:21 Comment(2)
RTLD_LOCAL will solve the immediate problem, but be aware that it means you won't be able to use C++ exceptions or RTTI across the shared library interface (which might not be a problem in your case). Also, you seem to imply that not using extern "C" is a bad thing, but unless you want to call the library from non-C++ programs there is no reason to use that, and even if you do want to call it from non-C++ programs you only need to use extern "C" in the public API. Name mangling is not meant to make readable names, that's why it's called "mangling". - Riverhead
@JonathanWakely I have noted your RTLD_LOCAL and extern "C" suggestions and updated the answer to reflect them. Thanks. - Scotch

One possible solution would be to never call exit. To terminate your program, just call _exit. If there's anything specific you need to do that would normally be done by exit, just do it yourself before calling _exit.
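
A minimal sketch of this suggestion, assuming a POSIX system: the helper leave_without_static_destructors flushes the C stdio buffers by hand and then calls _exit, so neither atexit handlers nor static destructors run. All names here (Tracer, ChildReport, leave_without_static_destructors, run_child_with_underscore_exit) are hypothetical; the fork() and pipe exist only so the parent can check what the child wrote before it terminated.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a library's static object.
struct Tracer {
    int fd;
    explicit Tracer(int f) : fd(f) {}
    ~Tracer() {
        ssize_t n = write(fd, "D", 1);  // would show up if the destructor ran
        (void)n;
    }
};

// Does by hand the part of exit() that is still wanted (flushing stdio),
// then terminates without running atexit handlers or static destructors.
[[noreturn]] void leave_without_static_destructors(int status) {
    std::fflush(nullptr);  // flush every open C output stream
    _exit(status);
}

struct ChildReport {
    int exit_code;              // -1 if the child could not be run
    std::string pipe_contents;  // everything the child wrote to the pipe
};

ChildReport run_child_with_underscore_exit() {
    ChildReport report{-1, ""};
    int fds[2];
    if (pipe(fds) != 0) return report;

    std::fflush(nullptr);  // do not duplicate buffered output in the child
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return report;
    }
    if (pid == 0) {
        close(fds[0]);
        static Tracer tracer(fds[1]);      // destructor registered, never run
        FILE* out = fdopen(fds[1], "w");   // buffered stream on the pipe
        if (out != nullptr) std::fputs("data", out);  // still in the buffer
        leave_without_static_destructors(7);
    }

    close(fds[1]);
    char buf[16];
    ssize_t n = 0;
    while ((n = read(fds[0], buf, sizeof buf)) > 0) {
        report.pipe_contents.append(buf, static_cast<std::size_t>(n));
    }
    close(fds[0]);
    int status = 0;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        report.exit_code = WEXITSTATUS(status);
    }
    return report;
}
```

On a typical Linux system the report should contain exit code 7 and the pipe contents "data": the manually flushed output arrives, while the "D" from the static destructor does not. C++ streams such as std::cout would need their own flush() before the _exit call.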

Distance answered 7/8, 2014 at 6:6 Comment(1)
Checked, and it works. Assuming smart pointers are used and cleanup is done manually, this one looks good. Updated my answer to reflect your suggestion. Thanks. - Scotch
